make from hub import work #3
Merged
ayushtues pushed a commit to ayushtues/diffusers that referenced this pull request Jun 19, 2023
Fix attention weights loading
williamberman pushed a commit to williamberman/diffusers that referenced this pull request Sep 18, 2023
Fix code quality
yiyixuxu pushed a commit that referenced this pull request Jan 21, 2024
* fix bugs in repository consistency
yuyanpeng-google referenced this pull request in yuyanpeng-google/diffusers Oct 30, 2025
requirements.txt
sayakpaul pushed a commit that referenced this pull request Nov 25, 2025
small edits to the pipeline and conversion
yiyixuxu pushed a commit that referenced this pull request Jan 15, 2026
* initial commit
* initial commit
* remove remote text encoder
* initial commit
* initial commit
* initial commit
* revert
* img2img fix
* text encoder + tokenizer
* text encoder + tokenizer
* update readme
* guidance
* guidance
* guidance
* test
* test
* revert changes not needed for the non klein model
* Update examples/dreambooth/train_dreambooth_lora_flux2_klein.py
  Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* fix guidance
* fix validation
* fix validation
* fix validation
* fix path
* space

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
yiyixuxu added a commit that referenced this pull request Jan 15, 2026
* flux2-klein
* Apply suggestions from code review
  Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Klein tests (#2)
* tests
* up
* tests
* up
* support step-distilled
* Apply suggestions from code review
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
* doc string etc
* style
* more
* copies
* klein lora training scripts (#3)
* initial commit
* initial commit
* remove remote text encoder
* initial commit
* initial commit
* initial commit
* revert
* img2img fix
* text encoder + tokenizer
* text encoder + tokenizer
* update readme
* guidance
* guidance
* guidance
* test
* test
* revert changes not needed for the non klein model
* Update examples/dreambooth/train_dreambooth_lora_flux2_klein.py
  Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* fix guidance
* fix validation
* fix validation
* fix validation
* fix path
* space

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* style
* Update src/diffusers/pipelines/flux2/pipeline_flux2_klein.py
* Apply style fixes
* auto pipeline

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
Co-authored-by: Linoy Tsaban <57615435+linoytsaban@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
This was referenced May 14, 2026
Enderfga added a commit to Enderfga/diffusers that referenced this pull request May 21, 2026

Finding huggingface#1 (attention_kwargs plumbing): Both transformers now decorate forward() with @apply_lora_scale('attention_kwargs') (matches Wan); pipelines forward attention_kwargs to the transformer + encode_kv_cache, and the unused parameter is dropped from the inner _forward_train / _forward_cache / _forward_inference signatures. Pipeline docstrings updated to the standard wording.

Finding huggingface#2 (naming): Rename far_cfg -> layout_cfg in the bidi transformer (the bidi path is not FAR; the FAR transformer keeps far_cfg, which is accurate there).

Finding huggingface#3 (scheduler state machine): Add _step_index, _begin_index, step_index property, begin_index property, set_begin_index(), _init_step_index(). step() lazily initializes and advances the counter so downstream callbacks / composable schedulers can observe rollout progress. Sigma resolution remains a pure function of (timestep, r_timestep); calling step() twice with identical args still returns identical prev_sample (idempotent).

Finding huggingface#4 (redundant @torch.no_grad()): Drop the redundant decorators on bidi pipeline's encode_video and FAR pipeline's encode_kv_cache (callers are already in __call__'s no-grad scope).

Finding huggingface#5 (dead code): Remove the unreachable temb.ndim == 2 else branch from the bidi transformer's output-norm path (condition_embedder.forward always returns a 3D temb).

Finding huggingface#6 (private rename): forward_far_patchify[_inference] -> _forward_far_patchify[_inference] (only called internally by _forward_train / _forward_cache / _forward_inference).

Finding huggingface#7 (pipeline comment numbering): Bidi + FAR pipelines renumber steps so the # 4. slot is no longer skipped.

Finding huggingface#8 (mask-mod comment numbering): _build_causal_mask numbered comments now run 1) 2) 3) ... (was 1) 3) 4) ...).

Tests:
- New test_step_index_advances + test_set_begin_index_anchors_step_index in the scheduler test file exercise the new state machine.
- All existing pipeline / transformer / scheduler tests still pass (85 passed, 85 skipped on CPU).

Bit-exact: 8-step rollout vs the previous formulation, max abs diff = 0.0 (the new sigma-lookup is byte-identical to t/num_train_timesteps on this schedule).
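The step-index bookkeeping described in Finding huggingface#3 can be illustrated with a small pure-Python sketch. This is not the scheduler's code: the class name, the scalar samples, the `NUM_TRAIN_TIMESTEPS` value, and the plain Euler update `sample + (sigma_r - sigma_t) * model_output` are assumptions made for illustration. Only the attribute and method names, the `step` argument order, and the `t / num_train_timesteps` sigma lookup come from the commit messages in this thread.

```python
NUM_TRAIN_TIMESTEPS = 1000  # assumption: illustrative value only


class StepCounterSketch:
    """Hypothetical stand-in showing a lazily initialized step counter."""

    def __init__(self, timesteps):
        self.timesteps = list(timesteps)
        self._step_index = None
        self._begin_index = None

    @property
    def step_index(self):
        return self._step_index

    @property
    def begin_index(self):
        return self._begin_index

    def set_begin_index(self, begin_index=0):
        # Anchors where the counter starts, e.g. when resuming mid-schedule.
        self._begin_index = begin_index

    def _init_step_index(self, timestep):
        if self._begin_index is not None:
            self._step_index = self._begin_index
        else:
            self._step_index = self.timesteps.index(timestep)

    def step(self, model_output, timestep, sample, r_timestep):
        if self._step_index is None:
            self._init_step_index(timestep)
        # The result depends only on the arguments, never on the counter.
        sigma_t = timestep / NUM_TRAIN_TIMESTEPS
        sigma_r = r_timestep / NUM_TRAIN_TIMESTEPS
        prev_sample = sample + (sigma_r - sigma_t) * model_output  # assumed Euler update
        self._step_index += 1
        return prev_sample


scheduler = StepCounterSketch([1000.0, 500.0])
first = scheduler.step(1.0, 1000.0, 2.0, 500.0)
second = scheduler.step(1.0, 1000.0, 2.0, 500.0)
```

In this sketch, `first` and `second` are equal even though `scheduler.step_index` has advanced between the two calls, which is the separation between observable progress and sigma resolution that the finding describes.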
Enderfga added a commit to Enderfga/diffusers that referenced this pull request May 22, 2026
…anup

dg845 blocking suggestion (r3287274209):
- scheduling_flow_map_euler_discrete.py:185: use `working_sigmas.new_zeros(1)` instead of `torch.zeros(1, dtype=...)` so the appended terminal sigma inherits both device and dtype from working_sigmas. The current working_sigmas always starts on CPU so the device mismatch is latent, but new_zeros is the correct defensive pattern and matches how the published FAR test fixtures run on CUDA.

Claude bot final-review follow-ups:
- transformer_anyflow_far.py: drop three stale `# step 3: generate attention mask` comments left over from the original numbered-step structure (bot huggingface#6).
- pipeline_anyflow_far.py: annotate `encode_video` with `# Copied from diffusers.pipelines.anyflow.pipeline_anyflow.AnyFlowPipeline.encode_video` and align docstring + inline comment so `make fix-copies` keeps them in sync (bot huggingface#3).

Skipped (not real / judgment-call):
- bot huggingface#2 (private rename of `_forward_far_patchify*`): already done in 84605d5; bot was looking at a stale snapshot.
- bot huggingface#4 (check_inputs `# Copied from`): FAR's check_inputs has an extra `(num_frames - 1) % 4 == 0` constraint that doesn't map onto the bidi version, so a clean `# Copied from` link would require restructuring. Bot called it a consistency nit; leaving as-is.
- bot huggingface#5 (`encode_kv_cache` → `_encode_kv_cache`): bot itself flagged this as judgment-call territory; the helper is a coherent operation that advanced inference callers may want to invoke directly.
dg845
added a commit
that referenced
this pull request
May 22, 2026
…ausal) (#13745) * [Pipelines] AnyFlow: scaffold pipelines/anyflow + register all top-level imports This is the lazy-loader scaffolding only. Body files (pipeline_anyflow.py, pipeline_anyflow_causal.py, transformer_anyflow.py, scheduling_flow_map_euler_discrete.py) come in subsequent commits. * [Schedulers] AnyFlow: add FlowMapEulerDiscreteScheduler The flow-map scheduler advances samples from timestep t to caller-provided target r in a single Euler step, supporting any-step sampling on flow-map- distilled checkpoints. It is a general-purpose scheduler — not specific to the AnyFlow checkpoints. Tests: 12 standalone tests covering instantiation, set_timesteps endpoints, shift identity/monotonicity, step shape preservation, zero-interval identity, one-shot sampling, train weight schemes, scale_noise endpoints. Docs: api/schedulers/flow_map_euler_discrete.md * [Models] AnyFlow: add AnyFlowTransformer3DModel A 3D DiT extending the v0.35.1 Wan2.1 backbone with two config-toggled modules: * FAR causal blocks (init_far_model=True): block-sparse causal attention via flex_attention + compressed-frame patch embedding for frame-level autoregressive generation (Gu et al., 2025, arXiv:2503.19325). * Dual-timestep flow-map embedding (init_flowmap_model=True): adds a delta timestep embedder enabling flow-map sampling z_t -> z_r over arbitrary intervals (AnyFlow). With both flags off, the model reduces to stock Wan2.1. The class is intentionally self-contained rather than annotated with '# Copied from diffusers.models.transformers.transformer_wan' because upstream Wan has been refactored extensively since v0.35.1 (new WanAttention class, different processor architecture). Tests: 9 unit tests covering construction in 3 modes, bidi forward shape and determinism, return_dict variants, save/load round-trip with and without init_far_model, gradient checkpointing toggle. 
Docs: api/models/anyflow_transformer3d.md * [Pipelines] AnyFlow: add AnyFlowPipeline and AnyFlowCausalPipeline * AnyFlowPipeline (pipeline_anyflow.py, ~590 LOC): bidirectional T2V using flow-map sampling. Loads checkpoints from nvidia/AnyFlow-Wan2.1-T2V-{1.3B,14B}. * AnyFlowCausalPipeline (pipeline_anyflow_causal.py, ~700 LOC): FAR-based causal pipeline supporting T2V/I2V/TV2V via task_type kwarg. Loads checkpoints from nvidia/AnyFlow-FAR-Wan2.1-{1.3B,14B}-Diffusers. Both pipelines reuse stock WanLoraLoaderMixin, AutoencoderKLWan, UMT5EncoderModel, and AutoTokenizer from upstream. The transformer is the AnyFlowTransformer3DModel introduced in the previous commit. The scheduler is FlowMapEulerDiscreteScheduler. Tests: * tests/pipelines/anyflow/test_anyflow.py: PipelineTesterMixin fast tests + slow integration test against nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers. * tests/pipelines/anyflow/test_anyflow_causal.py: same structure for FAR variant. Reference slices for slow integration tests are deferred to Phase 7 (Final quality pass) where the user runs them on a real GPU. * [Docs] AnyFlow: add main pipeline documentation page Modeled on the Helios pipeline doc (PR #13208). Sections: paper link + abstract, supported checkpoints table, memory/speed optimization tabs, T2V/I2V/TV2V examples for both bidirectional and causal variants, autodoc trailers. * [Auto/Scripts] AnyFlow: register AutoPipelineForText2Video + add conversion script * Register AnyFlowPipeline in AUTO_TEXT2VIDEO_PIPELINES_MAPPING. * AnyFlowCausalPipeline is intentionally NOT registered for AutoPipeline because its task switch (t2v / i2v / tv2v) is too rich for a single auto-resolve key. * scripts/convert_anyflow_to_diffusers.py: convert .pt training checkpoints (with 'ema' state dict) into a diffusers save_pretrained layout. Supports all 4 released NVIDIA AnyFlow variants. Replaces the omegaconf-based config in the upstream repo with argparse to match other diffusers conversion scripts. 
* [Quality] AnyFlow: ruff-format + regenerated dummy stubs * ruff format pass on all 5 source files (long lines + trailing comma fixes) * check_dummies.py --fix_and_overwrite regenerated: - dummy_pt_objects.py: AnyFlowTransformer3DModel + FlowMapEulerDiscreteScheduler - dummy_torch_and_transformers_objects.py: AnyFlowPipeline + AnyFlowCausalPipeline Local fast tests: 21/21 passed - 12 scheduler tests (FlowMapEulerDiscreteScheduler) - 9 transformer tests (AnyFlowTransformer3DModel construction + bidi forward + save/load) The pipeline fast tests in tests/pipelines/anyflow/ require a local dev install that matches the diffusers main branch's transformers >= compatibility floor. The reference slices for slow integration tests (real GPU + 1.3B/14B checkpoints) are intentionally left as TODO stubs to be captured by the user on a real GPU machine before opening the PR. * [AnyFlow] address review feedback: bug fixes + DMD wording + EN/ZH tutorials Critical bug fixes (verified against precision-validation review): * pipeline_anyflow.py / pipeline_anyflow_causal.py: replace hardcoded transformer_dtype = torch.bfloat16 with self.transformer.dtype, so pipe.to("cpu") and PipelineTesterMixin save/load tests do not crash on a dtype mismatch in the patch_embedding conv3d. * transformer_anyflow.py: drop the duplicate `base = base = ...` assignment in _build_causal_mask (was a copy-paste typo carried over from FAR-Dev). * transformer_anyflow.py: drop unused `q_is_context` / `k_is_context` locals and the `# noqa: F841` markers that were silencing the dead-store warning. * transformer_anyflow.py: remove `CacheMixin` from the inheritance list — the pipeline manages KV cache directly, the mixin's interface is unused. * transformer_anyflow.py: guard the module-level `torch.compile(flex_attention)` with try/except so the file imports cleanly on CPU CI / no-Triton machines. 
* convert_anyflow_to_diffusers.py: replace ad-hoc print warnings with the stdlib logger (warning_once-style) and a module-level basicConfig. Documentation accuracy: * AnyFlowCausalPipeline class docstring + main pipeline doc + EN/ZH tutorial: drop the fictitious `task_type` / `image` / `video` arguments and document the real API: pass `context_sequence={"raw": tensor}` (or `{"latent": ...}`) to switch between T2V (None) / I2V (1-frame) / TV2V (4n+1-frame) modes. * Pipeline class docstrings + main doc: explicitly describe AnyFlow's two-stage LoRA distillation including DMD reverse-divergence supervision with Flow-Map backward simulation in stage 2 (was previously implicit). * training_rollout: add detailed docstring explaining its role as the 3-segment Flow-Map backward simulation entry point used during DMD training. * Long-form tutorial doc `using-diffusers/anyflow.md` (EN, 239 LOC) and Chinese mirror `docs/source/zh/using-diffusers/anyflow.md` (224 LOC) added and registered in both `_toctree.yml` files. Tests: * Skip `test_attention_slicing_forward_pass` in both pipeline test classes with a clear rationale (custom attention processor does not support slicing). * All 21 standalone tests still pass (12 scheduler + 9 transformer). Quality gates: * `ruff check` clean across all AnyFlow files. * `ruff format --check` reports 6 files already formatted. * `python utils/check_copies.py` reports no diff. Out of scope for this commit (deferred until reviewer feedback): * Splitting AnyFlowTransformer3DModel into bidi + causal subclasses * Unifying _forward_inference / _forward_cache return types * Migrating model tests from plain unittest to BaseModelTesterConfig + mixins * HF model card / config.json metadata updates on the nvidia/* repos (push to Hub manually before opening the PR) * [AnyFlow] rename Causal->FAR + explicit forward signature + dataclass output Round 2 of review feedback. 
Three groups of changes; transformer state-dict keys, module hierarchy, and tensor flow are unchanged so the H200 bit-exact validation remains valid. A. Pipeline rename (mechanical, no behavior change): * Class: AnyFlowCausalPipeline -> AnyFlowFARPipeline (Causal in diffusers usually means an attention mask; AnyFlow's variant is FAR autoregressive, so the FAR name is more specific and matches the paper). * File: pipeline_anyflow_causal.py -> pipeline_anyflow_far.py (git mv). * Test file: test_anyflow_causal.py -> test_anyflow_far.py (git mv). * All references updated in src/, tests/, docs/, scripts/, plus stale anyflowcausalpipeline anchor links in tutorial markdown. B. Pipeline test bug fixes (closes 19 fast-test failures reported by precision-validation reviewer): * pipeline_anyflow.py / pipeline_anyflow_far.py: __call__ now sets self._num_timesteps = num_inference_steps before the rollout, so the PipelineTesterMixin callback tests can read pipe.num_timesteps. * tests/pipelines/anyflow/test_anyflow_far.py: drop the fictitious task_type="t2v" kwarg that crashed every causal fast test (the FAR pipeline selects mode via context_sequence, not a task_type arg). C. Transformer architecture cleanups (review-driven, no tensor changes): * Replace forward(*args, **kwargs) dispatcher with an explicit signature listing every supported kwarg (hidden_states, timestep, r_timestep, encoder_hidden_states, encoder_hidden_states_image, chunk_partition, clean_hidden_states, clean_timestep, kv_cache, kv_cache_flag, is_causal, attention_kwargs, return_dict). Helps IDE / type-checker / torch.compile tracing. * Drop SimpleNamespace returns. Add AnyFlowFARTransformerOutput (BaseOutput dataclass with sample + kv_cache fields) for the two causal paths that need to also propagate kv_cache (_forward_inference and the newly return_dict-aware _forward_cache). _forward_train and _forward_bidirection now consistently return Transformer2DModelOutput. 
Pipeline call sites already use return_dict=False with positional unpacking, so the fix is transparent there. Out of scope (deferred until canonical-org HF metadata sync): * Splitting AnyFlowTransformer3DModel into a bidi class plus an AnyFlowFARTransformer3DModel subclass — touches register_to_config keys and would require updating model_index.json on every released checkpoint. * Promoting chunk_partition from register_to_config to a forward-time argument (same reason). * Renaming training_rollout to _denoise — would break callers in the FAR-Dev on-policy trainer that produced the released checkpoints. Local fast tests: 21/21 still pass (12 scheduler + 9 transformer). ruff check, ruff format, and check_copies.py are all clean. * [AnyFlow] wire callback_on_step_end through inference_range + add chunk_partition to FAR fast-test fixture Two root causes for the 19 remaining PipelineTesterMixin failures, identified by the H200 reviewer: 1. callback_on_step_end was accepted by __call__ but never invoked. Both pipelines pass it through to training_rollout (and FAR additionally through inference()), and inference_range now fires it after scheduler.step in the standard inference branch: if callback_on_step_end is not None: callback_kwargs = {k: locals()[k] for k in callback_on_step_end_tensor_inputs} callback_outputs = callback_on_step_end(self, i, t, callback_kwargs) latents = callback_outputs.pop("latents", latents) prompt_embeds = ... negative_prompt_embeds = ... `nonlocal prompt_embeds, negative_prompt_embeds` lets the callback rewrite the closure-captured embeddings, matching upstream WanPipeline semantics. The 3-segment grad_timestep training rollout does not invoke the callback; it is intentionally training-only. 2. tests/pipelines/anyflow/test_anyflow_far.py::get_dummy_components built the dummy transformer without a `chunk_partition`, leaving it None on the model config and crashing the pipeline at `sum(self.transformer.config.chunk_partition)`. 
Set `chunk_partition=[1, 1, 1]` in the fixture (3 chunks of 1 latent frame each, matching the test's num_frames=9 -> 3 latent frames). Local fast tests: 21/21 still pass. ruff check, ruff format, and check_copies.py are all clean. * [AnyFlow] Phase 2: split transformer + drop chunk_partition from config + rename helpers Major architectural refactor that aligns the integration with diffusers conventions ahead of the canonical-org Hub upload. State-dict keys, module hierarchy, and tensor flow are unchanged so the H200 bit-exact validation remains valid; only the on-disk transformer/config.json fields move. Changes: 1. **Sibling transformer classes** replace the flag-driven single class: * AnyFlowTransformer3DModel — bidirectional only. Drops compressed_patch_size / full_chunk_limit / init_far_model / init_flowmap_model / chunk_partition kwargs (always-on for AnyFlow distilled checkpoints). * AnyFlowFARTransformer3DModel — adds far_patch_embedding + the 3 FAR forward paths (train / cache-prefill / autoregressive inference). * AnyFlowTimeTextImageEmbedding (the legacy single-time embedder used only by the old setup_flowmap_model bootstrap) is removed; both classes now build AnyFlowDualTimestepTextImageEmbedding directly in __init__. * setup_flowmap_model / setup_far_model methods are removed; weight warm-start for far_patch_embedding (trilinear interpolation from patch_embedding) moves into AnyFlowFARTransformer3DModel.__init__. 2. **chunk_partition** is no longer a model config field. The FAR pipeline owns the schedule: * AnyFlowFARPipeline.default_chunk_partition = [1, 3, 3, 3, 3, 3, 3, 2] matches the released 81-frame NVIDIA checkpoints. * AnyFlowFARPipeline.__call__ / _denoise_rollout accept a chunk_partition argument that overrides the default for non-default num_frames. 3. **training_rollout -> _denoise_rollout** rename across both pipelines and all English / Chinese docs that referenced it. 
Signals the method is internal to the pipeline driver, not a public training API. 4. **Conversion script + tests + docs + registries**: * scripts/convert_anyflow_to_diffusers.py: VARIANTS dict picks the right transformer class per variant; init_far_model / init_flowmap_model / chunk_partition kwargs are removed from the from_pretrained call. * Transformer test file split into AnyFlowTransformer3DModelTest and AnyFlowFARTransformer3DModelTest classes. * Pipeline test fixtures use the right class and pass chunk_partition via get_dummy_inputs (3-frame schedule [1, 1, 1] for the 9-frame test). * New docs page docs/source/en/api/models/anyflow_far_transformer3d.md; anyflow_transformer3d.md rewritten for the bidi-only class. * AnyFlowFARTransformer3DModel registered in src/diffusers/__init__.py, src/diffusers/models/__init__.py, models/transformers/__init__.py and the dummy_pt_objects.py stubs. * docs/source/en/_toctree.yml: new entry for the FAR transformer page. 5. **Cleanups**: * Pipeline __call__ no longer passes is_causal=False to the bidi forward (the bidi class doesn't accept it). * Pipeline class docstrings drop stale references to init_*_model flags. Local tests: 22/22 pass (12 scheduler + 10 transformer covering both classes). ruff check / format / check_copies clean. Hub artifacts (model_index.json, transformer/config.json, scheduler config) need to be regenerated for the released checkpoints; the HF update guide will be delivered separately. 
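The link between `num_frames` and a chunk schedule can be checked with a little arithmetic. The sketch below is not the pipeline's code: it assumes a temporal compression factor of 4 (inferred from the 9-frame to 3-latent-frame fixture and the `(num_frames - 1) % 4 == 0` constraint mentioned in this thread) and assumes a valid schedule must sum to the latent frame count. The helper names are hypothetical.

```python
DEFAULT_CHUNK_PARTITION = [1, 3, 3, 3, 3, 3, 3, 2]  # value quoted in the commit message


def latent_frames(num_frames, temporal_factor=4):
    """Latent frame count for a clip of 4n + 1 pixel frames (assumed factor of 4)."""
    if (num_frames - 1) % temporal_factor != 0:
        raise ValueError(f"num_frames must be {temporal_factor}n + 1, got {num_frames}")
    return (num_frames - 1) // temporal_factor + 1


def check_chunk_partition(num_frames, chunk_partition):
    """Assumed rule: the chunk sizes must add up to the latent frame count."""
    expected = latent_frames(num_frames)
    total = sum(chunk_partition)
    if total != expected:
        raise ValueError(f"chunk_partition sums to {total}, expected {expected}")
    return total


def chunk_slices(chunk_partition):
    """(start, end) latent-frame ranges covered by each chunk, in order."""
    slices, start = [], 0
    for size in chunk_partition:
        slices.append((start, start + size))
        start += size
    return slices


frames_81 = check_chunk_partition(81, DEFAULT_CHUNK_PARTITION)
frames_9 = check_chunk_partition(9, [1, 1, 1])
```

Under these assumptions the default schedule covers the 21 latent frames of an 81-frame clip, and the `[1, 1, 1]` test schedule covers the 3 latent frames of a 9-frame clip.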
* [AnyFlow] Phase 3: convention compliance against .ai/AGENTS.md + .ai/models.md

  Hard violations (per official diffusers guidelines):
  * drop einops dependency: replace 25+ rearrange() calls with native permute/reshape/unflatten in transformer + both pipelines
  * device-gate torch.float64: apply_rotary_emb and AnyFlowRotaryPosEmbed now fall back to float32 / complex64 on MPS / NPU; freqs are lazily rebuilt per-device via _build_freqs (matches transformer_wan / transformer_flux pattern)
  * migrate attention to dispatch_attention_fn: replace direct F.scaled_dot_product_attention calls with dispatch_attention_fn (works with sage / flash / native backends); introduce AnyFlowAttention(AttentionModuleMixin) with _default_processor_cls / _available_processors; rename processors to AnyFlowAttnProcessor / AnyFlowCrossAttnProcessor and declare _attention_backend / _parallel_config class attrs
  * drop dead config fields: qk_norm and added_kv_proj_dim are pruned from both transformer __init__ signatures and AnyFlowTransformerBlock; AnyFlowAttention is hardcoded to rms-norm-across-heads (the only scheme the released checkpoints use) and has no add_k_proj path (T2V only)
  * add _repeated_blocks = ["AnyFlowTransformerBlock"] to both transformer classes for compile_repeated_blocks() support (matches Wan)
  * annotate prepare_latents with `# Copied from diffusers.pipelines.wan.pipeline_wan.WanPipeline.prepare_latents`; the pipeline-side rearrange to (B, T, C, H, W) layout is moved to the call site

  State-dict keys are preserved (legacy Attention had identical to_q / to_k / to_v / to_out / norm_q / norm_k naming), so existing AnyFlow checkpoints load bit-exactly into the new AnyFlowAttention class.

  The HF Hub config-update guide is updated correspondingly: transformer/config.json now drops qk_norm and added_kv_proj_dim alongside the previous init_far_model / init_flowmap_model / chunk_partition removals.

  22 fast CPU tests still pass; ruff format / ruff check / check_copies all clean.
* [AnyFlow] FAR fast-test compat: rope 0-dim guard + flex_attention CPU/head-dim fallbacks + KV-cache dtype + num_timesteps Phase 3 migrated bidi + cross-attention to dispatch_attention_fn but the FAR causal path still calls flex_attention directly, which has hard requirements (CPU compile, head_dim >= 16) that fail on PipelineTesterMixin's tiny dummy components. Real ckpts (head_dim=128, CUDA) never hit these branches; bit-exact numerical equivalence with FAR-Dev preserved on all 4 released ckpts (forward 0.00e+00, backward kernel-nondet only, ratio 1.000). Code fixes: 1. AnyFlowRotaryPosEmbed._forward_compressed_frame / _forward_full_frame now short-circuit to an empty tensor when num_frames / height / width is 0. PipelineTesterMixin's dummy VAE has scale_factor_spatial=8, so a 16x16 raw spatial input becomes a 2x2 latent which then floors to 0 against compressed_patch_size=(1, 4, 4); the original `freqs[:0].view(0, k, 1, -1)` reshape was ambiguous in that regime. 2. flex_attention dispatch: split the module-load `torch.compile(flex_attention, dynamic=True)` into `_flex_attention_eager` (always available) plus `_flex_attention_compiled`, with a tiny wrapper that picks compiled for CUDA tensors and eager for CPU. Avoids torch._inductor C++ codegen failures that broke fast tests after `pipe.to("cpu")`. CUDA performance unchanged (L10 benchmark: 0.0% delta on bidi 1.3B fwd, 0.0% delta on FAR causal 1.3B fwd). 3. AnyFlowAttnProcessor (FAR causal branch): when head_dim < 16 (flex_attention's hard minimum) zero-pad q/k/v's last dim to 16 and pass `scale=1/sqrt(original_head_dim)` to flex_attention. Padded value rows contribute 0, so trimming the output back is mathematically equivalent. Released ckpts use head_dim=128 so the branch is never taken in production. 4. pipeline_anyflow_far.encode_kv_cache: replace the hardcoded `latents.to(torch.bfloat16)` with `self.transformer.dtype`. 
The hardcoded bf16 crashed conv3d on dummy fp32 components ("Input type (BFloat16) and bias type (float) should be the same"); real bf16 ckpts are unaffected. 5. pipeline_anyflow_far._denoise_rollout sets `self._num_timesteps = (len(chunk_partition) - num_context_chunks) * num_inference_steps` before the chunk loop, so PipelineTesterMixin.test_callback_cfg's `pipe.num_timesteps`-based assertion matches the actual number of callback fires (chunks * NFE) instead of the previous hardcoded num_inference_steps. Tests: * test_callback_inputs cannot pass without changing FAR's chunk-wise output semantics — it zeroes latents on the final step and asserts the *entire* output buffer is zero, but only the active chunk's slice is overwritten in a chunk-wise rollout. Marked `@unittest.skip` with a detailed rationale; callback functionality itself is still covered by test_callback_cfg. * Full pytest run on tests/pipelines/anyflow/ + tests/models/transformers/test_models_transformer_anyflow.py + tests/schedulers/test_scheduler_flow_map_euler_discrete.py: 81 passed, 0 failed, 11 skipped. Quality gates: * `ruff check` and `ruff format --check` clean across all AnyFlow files. * `python utils/check_copies.py` clean. * `python utils/check_dummies.py` clean. * [AnyFlow] docs/code: paper-release tidy-up User-facing alignment with the official HF Hub model card and the day-of-announcement materials at https://huggingface.co/collections/nvidia/anyflow. * Fill in the arXiv identifier 2605.13724 (5 paper links + 2 BibTeX entries). * Rename TV2V → V2V across docs + pipeline_anyflow{,_far}.py so the diffusers copy uses the same Video-to-Video terminology as the official model card. * Add the [nvidia/anyflow](https://huggingface.co/collections/nvidia/anyflow) HF collection link to the three tutorial intros. * Drop the temporary "guyuchao/* staging" tip from the EN tutorial / API page / ZH tutorial — the nvidia/AnyFlow-*-Diffusers repos are now live. 
* Wire up NVlabs/AnyFlow (training code) and nvlabs.github.io/AnyFlow (project page) in place of the prior <github-org> / <project-page-url> placeholders. * Cite the authors (Yuchao Gu, Guian Fang et al.) and NUS ShowLab × NVIDIA affiliation in the main tutorial, API pipeline page, and both transformer model pages; BibTeX uses the standard `and others` to elide the full list until the next pass. Working tree, CI gates, and tests after the change: ruff format --check ✓ ruff check ✓ python utils/check_copies.py ✓ python utils/check_dummies.py ✓ pytest tests/models + tests/schedulers (22 fast) ✓ No production code logic changes — only docstring wording inside pipeline files (TV2V → V2V). * [AnyFlow] docs: drop in official BibTeX (full author list) Replace the placeholder ``@article{gu2026anyflow, author = {Gu, Yuchao and Fang, Guian and others}, ...}`` block in both the English and Chinese tutorials with the canonical ``@misc{gu2026anyflowanystepvideodiffusion, ...}`` form from arxiv.org/abs/2605.13724, which lists all seven authors: Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, Mike Zheng Shou. Docs-only. * [AnyFlow] align with diffusers conventions + drop training-only code Scheduler - FlowMapEulerDiscreteScheduler.step now returns a FlowMapEulerDiscreteSchedulerOutput dataclass (or tuple with return_dict=False) and uses the conventional positional order (model_output, timestep, sample, r_timestep). - Drop training-only helpers: adaptive_weighting, set_train_weight, get_train_weight, linear_timesteps_weights, and the weight_type config field. - Add scale_model_input no-op for API parity; raise ValueError on missing r_timestep. Transformer - Remove gate_track debug write inside AnyFlowDualTimestepTextImageEmbedding.forward_timestep. - Compile flex_attention lazily on first CUDA call instead of at import time. - Replace assert with ValueError in build_block_mask. - Resolve <arxiv-id> placeholders to 2605.13724. 
Pipelines (AnyFlowPipeline + AnyFlowFARPipeline) - Add EXAMPLE_DOC_STRING + @replace_example_docstring and full __call__ docstrings covering every argument. - Move use_mean_velocity from __init__ to __call__ so save/load round-trips. - Drop _denoise_rollout's grad_timestep branch (DMD on-policy training rollout), the inner inference_range closure, and the redundant negative-prompt concat. - Replace asserts with ValueError; wire show_progress to tqdm; rename inference -> _inference; remove dead current_timestep property. - Update scheduler.step call sites to the new signature. - Trim class docstrings to inference-only language. Pipeline output - Add Apache 2.0 license header; switch to relative import. Auto pipeline / conversion script - Register AnyFlowFARPipeline in AUTO_IMAGE2VIDEO_PIPELINES_MAPPING and AUTO_VIDEO2VIDEO_PIPELINES_MAPPING. - Document the weights_only=False requirement in the conversion script. Tests - Scheduler tests use the new step signature and verify the Output dataclass contract. - Drop the four obsolete training-weight tests; drop weight_type kwarg from pipeline test fixtures; remove internal milestone names from TODO comments. Docs - Resolve <arxiv-id> in the scheduler docs page. - Trim DMD / on-policy distillation language in EN/ZH tutorials and the pipelines page; the paper abstract quote is preserved verbatim. * [AnyFlow] split FAR causal transformer into transformer_anyflow_far.py Per @dg845's review on #13745: extract FAR causal modules into a dedicated sibling file so each transformer variant reads in isolation. Shared submodules are duplicated via `# Copied from` so `make fix-copies` keeps both in sync. - `transformer_anyflow.py`: bidi-only. `AnyFlowAttnProcessor` no longer carries the flex/KV-cache branch (was: dispatch in one branch, bare flex_attention in the other); `AnyFlowRotaryPosEmbed` drops the compressed-frame helpers and the `is_causal` arg; `AnyFlowDualTimestepTextImageEmbedding` drops its causal branch. 
`AnyFlowTransformerBlock` keeps a single class with a new `is_causal: bool = False` ctor flag that selects the self-attn processor; the forward path is identical in both modes, only the processor differs.
  - `transformer_anyflow_far.py`: new. Contains `AnyFlowFARTransformerOutput`, `AnyFlowCausalAttnProcessor` (routed through `dispatch_attention_fn(backend="flex")` with a clear ValueError when a non-flex backend is configured; the BlockMask is consumed only by the flex backend in `_native_flex_attention`), `AnyFlowDualTimestepTextImageEmbeddingCausal`, `AnyFlowCausalRotaryPosEmbed`, `AnyFlowFARTransformer3DModel`, and `# Copied from` clones of the shared `AnyFlowAttention`/`AnyFlowCrossAttnProcessor`/`AnyFlowImageEmbedding`/`AnyFlowTransformerBlock`/`AnyFlowAttnProcessor` modules.

  Verified bit-exact against the pre-refactor branch on H200 (float32):
  - bidi: L2 = 0.000e+00, max|Δ| = 0.000e+00
  - FAR : L2 = 4.772e-06, max|Δ| = 3.576e-07

  The FAR delta is fp32 accumulation noise from the dispatch path permuting (B,L,H,D) ↔ (B,H,L,D) around the same `flex_attention` kernel.

  Addresses review comments at transformer_anyflow.py:215, :261, :450, :622, :671, :958.

* [AnyFlow] pipeline cleanup: video_processor, encode_video, inline rollout, kwarg rename

  Per @dg845's review on #13745, applied to both bidi `AnyFlowPipeline` and causal `AnyFlowFARPipeline`:
  - Use `self.video_processor.preprocess_video(...)` instead of the manual `* 2 - 1` normalize.
  - Merge `vae_encode` + `encode_latents` + `_normalize_latents` into a single `encode_video` method, mirroring `WanImageToVideoPipeline.encode_image`'s flat structure.
  - Inline `_denoise_rollout` into `AnyFlowPipeline.__call__`. For the FAR pipeline, inline both `_denoise_rollout` and `_inference` as a nested loop (outer over chunks, inner over denoising steps), mirroring `WanAnimatePipeline.__call__`.
`encode_kv_cache` is intentionally kept as a method: it is one transformer call with a different `kv_cache_flag` mode (cache-write), and inlining it would interleave two distinct forward semantics in the same loop body and lose readability. - Rename `context_sequence` → `video` (pixel-space) + `video_latents` (pre-encoded), matching `WanVideoToVideoPipeline`. For the FAR pipeline, the old `{"raw"/"latent"}` dict form is replaced by the two kwargs. Mutually-exclusive validation raises `ValueError`. Addresses review comments at pipeline_anyflow.py:358, :372, :393, :473 and pipeline_anyflow_far.py:395, :489, :675. * [AnyFlow] scheduler: N-length timesteps + step defaults r_timestep Per @dg845's review on #13745: - `set_timesteps(N)` now produces `N` timesteps backed by an internal `sigmas[N+1]` linspace, matching `FlowMatchEulerDiscreteScheduler.set_timesteps`. The final sigma (== 0) is the implicit r-endpoint of the last step; the pipeline rollouts iterate `for i, t in enumerate(timesteps)` without the old `[:-1]` slicing. - `step(r_timestep=None)` now defaults to the next timestep on the schedule (resolved via fp-tolerant `argmin` over `sigmas[:-1]`), instead of raising. Any-step sampling is preserved when `r_timestep` is explicit. The raise stays only for the case where the caller passes a `timestep` value that isn't on the schedule and provides no `r_timestep`, since there's no sensible default in that case. - Build sigmas in float64 on CPU then move to the target device, with a float32 downcast for MPS / NPU (float64 isn't supported on those backends). Pipeline rollout loops updated to compute `r = sigmas[i + 1] * num_train_timesteps` for the model's `r_timestep` input and pass `r_timestep=None` to `scheduler.step` (which resolves it from the schedule internally). Addresses review comments at scheduling_flow_map_euler_discrete.py:107 and :148.
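The timestep layout described in the scheduler commit above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the scheduler's code: `make_schedule` and `next_r` are hypothetical names, the shift transform is omitted, and `num_train_timesteps=1000` is an assumed value.

```python
def make_schedule(num_steps, num_train_timesteps=1000):
    """Return (timesteps, sigmas): N timesteps backed by N + 1 sigmas.

    Assumption: an unshifted linear schedule from 1.0 down to 0.0.
    """
    sigmas = [1.0 - i / num_steps for i in range(num_steps + 1)]
    # The terminal sigma (0.0) is not a timestep; it is only the last r-endpoint.
    timesteps = [s * num_train_timesteps for s in sigmas[:-1]]
    return timesteps, sigmas


def next_r(timestep, timesteps, sigmas, num_train_timesteps=1000, tol=1e-6):
    """Resolve a default r_timestep: the next point on the schedule."""
    # Tolerant nearest-match lookup over the N on-schedule timesteps.
    idx = min(range(len(timesteps)), key=lambda i: abs(timesteps[i] - timestep))
    if abs(timesteps[idx] - timestep) > tol * num_train_timesteps:
        raise ValueError("timestep is off-schedule; pass r_timestep explicitly")
    return sigmas[idx + 1] * num_train_timesteps


timesteps, sigmas = make_schedule(4)
for i, t in enumerate(timesteps):  # no [:-1] slicing needed
    r = next_r(t, timesteps, sigmas)
    print(f"step {i}: t={t:.0f} -> r={r:.0f}")
```

Keeping the terminal sigma out of `timesteps` means the loop length equals the requested step count, while the final step still has a well-defined target of zero.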
* [AnyFlow] tests: regenerate via generate_model_tests.py; split bidi/FAR files Per @dg845's review on #13745: replaced the hand-rolled transformer tests with the standard mixin-based suite produced by `utils/generate_model_tests.py`, and split the FAR causal model tests into their own file to mirror the transformer file split. - `tests/models/transformers/test_models_transformer_anyflow.py`: regenerated bidi suite. Pulls in `ModelTesterMixin`, `MemoryTesterMixin`, `TrainingTesterMixin`, `AttentionTesterMixin`, `TorchCompileTesterMixin` via `BaseModelTesterConfig`, with `get_init_dict()` / `get_dummy_inputs()` filled in for the small bidi config used in CI. - `tests/models/transformers/test_models_transformer_anyflow_far.py`: new. Same mixin set (TorchCompile is intentionally skipped: FAR's `_build_causal_mask` uses `flex_attention.create_block_mask(_compile=False)` which conflicts with the standard compile tester's assumptions; the bidi file covers compile, FAR is bit-exact-validated end-to-end on H200 via the pipeline replay). Also carries an `AnyFlowCausalAttnProcessor` smoke test that exercises the backend gate (non-flex backends must raise) and asserts the `AnyFlowFARTransformerOutput` dataclass exposes the expected fields. Addresses review comments at test_models_transformer_anyflow.py:71 and :128. * [AnyFlow] docs: update for video / video_latents kwarg rename Following the pipeline kwarg refactor in e9d50b2, sweep the user-facing docs to reflect the new API: - `docs/source/en/api/pipelines/anyflow.md`: T2V / I2V / V2V code examples now use `video=` instead of `context_sequence={"raw": ...}`. The "Generation with AnyFlow (FAR Causal)" intro describes the new mutually-exclusive `video` / `video_latents` selector. - `docs/source/en/using-diffusers/anyflow.md`: the scenario selector table, the "Image-to-video and video-to-video" walkthrough, and the closing note about pre-encoded latents are all updated.
`vae_encode` references are replaced with `encode_video`. * [AnyFlow] tests: skip FAR training tests on CPU (flex backward); align scheduler tests with N-length timesteps - TestAnyFlowFARTransformer3DTraining: skip test_training / test_training_with_ema / test_gradient_checkpointing_equivalence on CPU. FAR causal self-attn uses torch.nn.attention.flex_attention whose backward kernel is GPU-only. - test_scheduler_flow_map_euler_discrete: assert timesteps is N-length (not N+1) and the sigma=0 r-endpoint lives in self.sigmas[-1]; test_step_one_shot_sampling now exercises r_timestep=None (resolved from sigmas) since N=1 has no timesteps[1]. * [AnyFlow] docs: complete forward() Args: sections for check_forward_call_docstrings main #13758 added utils/check_forward_call_docstrings.py which requires every signature arg to appear as its own `name (...):` entry under Args:. Expand the bidi and FAR transformer forward docstrings to list each parameter individually. * [AnyFlow] apply 5/21 review suggestions (A: 1-click) FAR transformer: - AnyFlowCausalAttnProcessor: default _attention_backend = 'flex' (was None); remove None from _SUPPORTED_BACKENDS. None previously fell through to SDPA which silently ignored the BlockMask; failing loudly is the right default. - dispatch_attention_fn call: read self._attention_backend instead of hardcoded 'flex', so '_native_flex' selection works. - _build_freqs / _forward_full_frame: add '# Copied from' to bidi RoPE. Pipelines: - bidi + FAR docstrings: video shape (B, C, T, H, W) -> (B, T, C, H, W) to match VideoProcessor.preprocess_video. - FAR EXAMPLE_DOC_STRING: single-frame I2V tensor wrap uses unsqueeze(1) for the T axis instead of unsqueeze(2). - FAR encode_video: drop duplicated @torch.no_grad() decorator. Tests: - test_anyflow / test_anyflow_far: lift the test_save_load_optional_components skip (the test actually passes). - FAR processor smoke test: assert default backend is 'flex' (was 'None'). 
* [AnyFlow] apply 5/21 review suggestions (B: refactors) Pipelines: - check_inputs accepts video / video_latents and raises early on: (a) mutual exclusion (was checked late in __call__); (b) FAR's (num_frames - 1) % 4 == 0 constraint. __call__ no longer carries duplicate validation. - FAR pipeline: drop the show_progress kwarg and replace the single tqdm with nested progress bars in the LLaDA-2 pattern — outer 'Chunks' (position=0) and per-chunk inner 'Inference Steps' (position=1, leave=False) — both picking up DiffusionPipeline._progress_bar_config (so set_progress_bar_config controls them, including disable=None). Scheduler: - step() resolves source and target sigmas by indexing self.sigmas via the new index_for_timestep(), instead of dividing the input timesteps by num_train_timesteps. This keeps the math correct for any future schedule whose timestep/sigma relationship is non-linear. For an off-schedule r_timestep the code falls back to r / num_train_timesteps, so explicit any-step sampling outside the schedule still works (and t off-schedule with r=None still raises a clear ValueError, as before). Numerical equivalence: for the shipped linspace+shift schedule the two formulations are bit-identical (verified: max abs diff = 0.0 over an N=8, shift=5 schedule). * [AnyFlow] apply Claude bot review (5/21): 8 findings beyond dg845's list Finding #1 — attention_kwargs plumbing: Both transformers now decorate forward() with @apply_lora_scale('attention_kwargs') (matches Wan); pipelines forward attention_kwargs to the transformer + encode_kv_cache, and the unused parameter is dropped from the inner _forward_train / _forward_cache / _forward_inference signatures. Pipeline docstrings updated to the standard wording. Finding #2 — naming: Rename far_cfg -> layout_cfg in the bidi transformer (the bidi path is not FAR; the FAR transformer keeps far_cfg, which is accurate there). 
Finding #3 — scheduler state machine: Add _step_index, _begin_index, step_index property, begin_index property, set_begin_index(), _init_step_index(). step() lazily initializes and advances the counter so downstream callbacks / composable schedulers can observe rollout progress. Sigma resolution remains a pure function of (timestep, r_timestep) — calling step() twice with identical args still returns identical prev_sample (idempotent). Finding #4 — redundant @torch.no_grad(): Drop the redundant decorators on bidi pipeline's encode_video and FAR pipeline's encode_kv_cache (callers are already in __call__'s no-grad scope). Finding #5 — dead code: Remove the unreachable temb.ndim == 2 else branch from the bidi transformer's output-norm path (condition_embedder.forward always returns a 3D temb). Finding #6 — private rename: forward_far_patchify[_inference] -> _forward_far_patchify[_inference] (only called internally by _forward_train / _forward_cache / _forward_inference). Finding #7 — pipeline comment numbering: Bidi + FAR pipelines renumber steps so the # 4. slot is no longer skipped. Finding #8 — mask-mod comment numbering: _build_causal_mask numbered comments now run 1) 2) 3) ... (was 1) 3) 4) ...). Tests: - New test_step_index_advances + test_set_begin_index_anchors_step_index in the scheduler test file exercise the new state machine. - All existing pipeline / transformer / scheduler tests still pass (85 passed, 85 skipped on CPU). Bit-exact: 8-step rollout vs the previous formulation, max abs diff = 0.0 (the new sigma-lookup is byte-identical to t/num_train_timesteps on this schedule). * [AnyFlow] scheduler: honour off-schedule any-step in _init_step_index; drop dead _resolve_next_timestep Audit caught two issues in the previous scheduler commit: 1. 
The new state machine raised in _init_step_index whenever the first timestep wasn't on the active schedule, contradicting the documented contract that step() falls back to t/num_train_timesteps for off-schedule any-step sampling. The fall-back numerics were intact but they were unreachable — the init check fired first. Fix: _init_step_index now initializes _step_index to 0 when the timestep is off-schedule (still a valid observable counter for callbacks). step()'s sigma resolution is untouched, so on-schedule rollouts stay bit-exact and off-schedule any-step sampling actually runs again. Regression test: test_step_off_schedule_anystep_supported. 2. _resolve_next_timestep had no remaining callers after the step() rewrite inlined the same lookup. Removed (private helper, no external API). * [AnyFlow] docs: align user guides with video shape + kwarg fixes - en api/pipelines/anyflow.md: video shape (B, C, T, H, W) -> (B, T, C, H, W); example tensor wrap uses unsqueeze(0).unsqueeze(1) and permute(0, 3, 1, 2) to match VideoProcessor.preprocess_video's 5D contract. - zh using-diffusers/anyflow.md: same shape fixes; also flip the I2V / V2V examples from the obsolete context_sequence={...} dict to the current video= / video_latents= kwargs; helper to_video_tensor returns (1, T, C, H, W); add a note about mutual exclusion. * [AnyFlow] tests: drop @slow integration test scaffolds for initial PR .ai/skills/model-integration/SKILL.md is explicit: 'No integration / slow tests in the initial PR — don't add anything gated on @slow / RUN_SLOW=1 yet.' Our two integration test classes were shape-only assertions with TODOs for a future numeric reference, so dropping them loses no actual coverage — the relevant rollouts are covered by H200 bit-exact replay outside the pytest suite. Can land a follow-up PR after merge with proper numeric reference slices once the maintainer is comfortable enabling slow tests. 
* Apply style fixes * [AnyFlow] apply 5/22 dg845 review: comment cleanups + custom sigmas/timesteps schedule dg845 third pass — 7 of 9 comments applied; the 8th (custom sigmas/timesteps support) matches FlowMatchEulerDiscreteScheduler conventions; the 9th (_build_causal_mask refactor) is explicitly marked non-blocking and deferred to a follow-up that also re-enables TorchCompileTesterMixin. Comment cleanups: - transformer_anyflow.py:704 temb output-norm comment: drop redundant 'no ndim==2 branch'. - pipeline_anyflow.py:550 denoise loop comment: '# 6. Denoising loop'. - pipeline_anyflow_far.py:684 denoise loop comment: '# 8. Denoising loop (outer over chunks, inner over timesteps).'. - pipeline_anyflow_far.py:702 drop trailing inline comment on `timesteps = scheduler.timesteps`. - scheduling_flow_map_euler_discrete.py: clearer wording on the off-schedule `r_timestep` error. Custom schedule support: - FlowMapEulerDiscreteScheduler.set_timesteps gains `sigmas` and `timesteps` kwargs mirroring FlowMatchEulerDiscreteScheduler. Default behaviour is unchanged (linspace + shift); the validation + length-N → length-N+1 terminal-0 append are shared with the default path so on-schedule rollouts stay bit-exact. - AnyFlowPipeline.__call__ and AnyFlowFARPipeline.__call__ accept `sigmas` and `timesteps` kwargs, override num_inference_steps from their length, and forward to set_timesteps (matches LTX2Pipeline pattern). - New scheduler tests: test_set_timesteps_custom_sigmas and test_set_timesteps_custom_timesteps cover both override paths. Dtype skip on save/load: - TestAnyFlowTransformer3D and TestAnyFlowFARTransformer3D now skip test_from_save_pretrained_dtype_inference (parametrized over fp16/bf16), mirroring WanTransformer3DModel's skip — the test's tolerance requirements are too high for meaningful signal under AnyFlow's flow-map mixed-precision sampling. 
* [AnyFlow] docs: apply hf-doc-builder line wrap (max_len 119) CI doc-builder style check flagged 3 files with docstring lines >119 chars. Ran 'doc-builder style src/diffusers docs/source --max_len 119' to autoformat; content unchanged, line wrapping only. * [AnyFlow] apply 5/22 follow-up review: new_zeros terminal sigma + cleanup dg845 blocking suggestion (r3287274209): - scheduling_flow_map_euler_discrete.py:185 — use `working_sigmas.new_zeros(1)` instead of `torch.zeros(1, dtype=...)` so the appended terminal sigma inherits both device and dtype from working_sigmas. The current working_sigmas always starts on CPU so the device mismatch is latent, but new_zeros is the correct defensive pattern and matches how the published FAR test fixtures run on CUDA. Claude bot final-review follow-ups: - transformer_anyflow_far.py: drop three stale `# step 3: generate attention mask` comments left over from the original numbered-step structure (bot #6). - pipeline_anyflow_far.py: annotate `encode_video` with `# Copied from diffusers.pipelines.anyflow.pipeline_anyflow.AnyFlowPipeline.encode_video` and align docstring + inline comment so `make fix-copies` keeps them in sync (bot #3). Skipped (not real / judgment-call): - bot #2 (private rename of `_forward_far_patchify*`) — already done in 84605d5; bot was looking at a stale snapshot. - bot #4 (check_inputs `# Copied from`) — FAR's check_inputs has an extra `(num_frames - 1) % 4 == 0` constraint that doesn't map onto the bidi version, so a clean `# Copied from` link would require restructuring. Bot called it a consistency nit; leaving as-is. - bot #5 (`encode_kv_cache` → `_encode_kv_cache`) — bot itself flagged this as judgment-call territory; the helper is a coherent operation that advanced inference callers may want to invoke directly. --------- Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
yiyixuxu
added a commit
that referenced
this pull request
May 28, 2026
* Adding Cosmos 3
* removed dead code
* Change custom TimeEmbedding layer to Diffusers time embedding
* removed dependency on Hugging Face transformers
* refactor 1
* Fixed Attention Pattern
* Removed from Pretrain overrides
* Removing normalization from the audio Tokenizer
* fixed diffusers checkpoint
* fixed video save uint conversion
* added forward hook for cpu offload case
* removed dead params for sound tokenizer
* renaming audio encoder for readability
* ruff format
* Fix checkpoint conversion script for sound tokenizer
* Audio Decoder trim and removing some dead code
* removing dead sequence packing code
* refactor pipeline to diffusers style formatting
* removing use of cosmos3 audio encoder
* Revert "removing use of cosmos3 audio encoder"
This reverts commit 1b8b99a95e5acc749b5c100637ed382de138c606.
* refactor audio encoder
* inline remaining sequence packing functions and lint
* Removed GenerationDataClean class and Action logic
* inlined default args
* removed dead code and refactoring
* drop pipeline-helper @no_grad, inline derive helper, move guidance check to check_inputs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop unused list bookkeeping from PackedSequence
attn_modes was never read; sample_lens collapses into the existing
sequence_length int (we only pack a single sample at a time); split_lens
collapses into a single und_len int (only split_lens[0] was ever read).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* extract pack-time build state from PackedSequence dataclass
curr, _use_mrope, _mrope_temporal_offset, _mrope_reset_spatial were
transient counters used while building the joint sequence, not part of
the finalized output. Thread them through _pack_*_tokens as positional
args/returns so the dataclass only carries fields the pipeline actually
reads back.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop private-API isinstance/shape asserts
These were build-mode guards on PackedSequence internals and shape
checks on tensors the pipeline itself constructs, both flagged as
noise in private code per reviewer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* build PackedSequence tensors on target device, drop to_cuda
Thread device through pack_input_sequence and _pack_*_tokens helpers
so all torch.tensor/zeros/arange calls land on the target device
directly. Move CPU-side mRoPE tensors over with .to(device) at the
append site. Pass device to finalize so list-to-tensor conversion lands
on device too. Delete PackedSequence.to_cuda() and the helper
_modality_to_cuda; drop the corresponding call sites in __call__.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop SequencePlan, skip_text_tokens, and bos_token_id branch
SequencePlan's has_text and has_vision were True at every construction
site and has_sound was derivable from x0_tokens_sound is not None.
condition_frame_indexes_vision is now passed directly as a List[int]
arg to pack_input_sequence. Removed the skip_text_tokens flag (never
True) and the dead bos_token_id shift branch in _pack_text_tokens
(special_tokens never carries one in this pipeline).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* use retrieve_latents helper in _encode_video
Copied from stable_diffusion_img2img matching cosmos2_5 convention so
make fix-copies keeps it synced. Functionally identical to the prior
.latent_dist.mode() call but handles latent_dist/latents attribute
variants uniformly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
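The attribute dispatch that a `retrieve_latents`-style helper performs can be sketched as follows. This is a simplified illustration, not the copied helper itself: the stand-in `FakeDist` class and the string return values are hypothetical, and the real helper in the diffusers source may differ in detail.

```python
from types import SimpleNamespace


def retrieve_latents(encoder_output, generator=None, sample_mode="sample"):
    """Pull latents out of an encoder output, whichever attribute it exposes."""
    if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
        return encoder_output.latent_dist.sample(generator)
    if hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
        return encoder_output.latent_dist.mode()
    if hasattr(encoder_output, "latents"):
        return encoder_output.latents
    raise AttributeError("Could not access latents of provided encoder_output")


class FakeDist:
    """Hypothetical stand-in for a VAE posterior distribution."""

    def sample(self, generator=None):
        return "sampled"

    def mode(self):
        return "mode"


dist_output = SimpleNamespace(latent_dist=FakeDist())
plain_output = SimpleNamespace(latents="precomputed")

print(retrieve_latents(dist_output, sample_mode="argmax"))
print(retrieve_latents(plain_output))
```

With `sample_mode="argmax"` the distribution branch reduces to a `.mode()` call, which is why the commit can describe the swap as functionally identical for that case.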
* drop get_data_and_condition + data_batch dict scaffolding
normalize_video_databatch_inplace, augment_image_dim_inplace and
remove_padding_from_latent were no-ops once is_preprocessed=True
(always set by this pipeline) and the pipeline never pads.
get_data_and_condition just orchestrated those plus a never-taken
multi-vision branch.
Replaced the whole chain with a few lines inline in prepare_latents:
build vision_tensor on device, call _encode_video, set fps_vision.
prepare_latents no longer needs input_caption_key, input_video_key,
input_image_key, or the prompt kwarg.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop _load_image_as_tensor; use VideoProcessor for conditioning frame
Image loading is the caller's responsibility (load_image from
diffusers.utils), matching the cosmos2_5 example. The pipeline registers
a VideoProcessor in __init__ and calls preprocess() to resize + normalize
caller-supplied PIL / np / tensor inputs to [1, 3, H, W] in [-1, 1].
prepare_latents fills the temporal dim in two lines (single frame at
t=0, repeat-pad the rest) preserving the prior i2v behavior.
Inference script updated to call load_image() before passing to the
pipeline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* move encode/decode helpers + transformer forward into Cosmos3OmniTransformer
Pull encode_text / encode_vision / encode_sound_tokens / decode_vision /
decode_sound_tokens (and their pure-tensor helpers patchify_and_pack_latents,
unpatchify_and_unpack_latents, apply_timestep_embeds_to_noisy_tokens,
_pack_sound_latents, _unpack_sound_latents) from the pipeline into
Cosmos3OmniTransformer as methods. The transformer's forward(packed_seq)
now runs the full per-step pass: encode text/vision/sound, rotary +
layer loop, decode vision/sound — returns (preds_vision, preds_sound).
The pipeline's CFG loop drops the encode_*/decode_* method calls and
the manual und/gen split/concat; each pass is now a single
self.transformer(packed_seq) call. No self.transformer.{embed_tokens,
vae2llm, llm2vae, sound2llm, llm2sound, time_embedder, time_proj,
sound_modality_embed} access remains in the pipeline.
Cosmos3VLTextModel is kept as a structural wrapper for now — flattening
it would break the published checkpoint layout. Tracked separately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* remove Cosmos3VLTextModel; flatten transformer layout
embed_tokens / layers / norm / norm_moe_gen / rotary_emb are now direct
attributes of Cosmos3OmniTransformer. The converter strips the leading
`model.` prefix from the source language_model state-dict so new
conversions land at the flat layout natively.
Published Hub artifact (nvidia/Cosmos3-Nano) needs its transformer
safetensors + index.json re-keyed with the same prefix strip before
this code can load it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
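The prefix strip described above amounts to re-keying a mapping. A minimal sketch, assuming a plain dict of parameter names; `strip_prefix` and the example keys are hypothetical and are not taken from the converter script:

```python
def strip_prefix(mapping, prefix="model."):
    """Return a new dict with `prefix` removed from keys that start with it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in mapping.items()
    }


# Illustrative keys only; values stand in for tensors or shard file names.
nested = {
    "model.embed_tokens.weight": "shard-1",
    "model.layers.0.norm.weight": "shard-1",
    "time_embedder.weight": "shard-2",
}
flat = strip_prefix(nested)
print(sorted(flat))
```

The same function can be applied to the key-to-shard map in an index file, so the tensor names and the index stay consistent after re-keying.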
* move save_img_or_video/save_wav to cosmos/export_utils.py
Per reviewer guidance, custom video/audio export helpers belong in a
pipeline-local export_utils.py (mirroring pipelines/ltx2/export_utils.py)
rather than living inside the pipeline file. Pipeline imports trim the
now-unused pathlib/numpy/export_to_video; inference example updated to
import from the new location.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop @torch.no_grad on pipeline __call__
Diffusers pipeline convention: __call__ does not wear a torch.no_grad
decorator; the responsibility for grad context sits with the caller.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* follow standard transformer conventions in Cosmos3OmniTransformer
- Declare _no_split_modules, _repeated_blocks, _skip_layerwise_casting_patterns,
_keep_in_fp32_modules, _supports_gradient_checkpointing on the transformer.
- Wire self.gradient_checkpointing + the _gradient_checkpointing_func branch in
forward so the flag is honest (models.md gotcha #3).
- Add PeftAdapterMixin and AttentionMixin to the mixin set so LoRA loading and
the attention-backend setters work.
- CosmosAttnProcessor3_0 now declares _attention_backend / _parallel_config and
forwards them to dispatch_attention_fn, matching the pattern in models.md
and transformer_wan.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* restore @torch.no_grad() on Cosmos3OmniDiffusersPipeline.__call__
Reverting bea4eecf9 — removing the decorator causes GPU OOM during inference
because the autograd graph accumulates across the full denoising loop (35
steps × dual cond/uncond passes × full transformer). pipelines.md gotcha #2
documents this exact failure mode and the convention is upheld by every
sibling pipeline (pipeline_flux.py:652, pipeline_qwenimage.py:462,
pipeline_wan.py:381, pipeline_ltx.py:535).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop training-only dummy projection paths in transformer decode helpers
_decode_vision and _decode_sound both guarded a "no noisy tokens" branch
that ran zeroed projections to keep the autograd graph intact. Those
branches only fire when a pure-conditioning step has no MSE-loss tokens,
which never happens in the inference pipeline — every workflow has at
least one noisy vision frame, and _decode_sound is gated on has_sound
which itself requires noisy sound tokens. Deleting per CLAUDE.md:
"delete training-time code paths… only keep the inference path."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* inline modality encode/decode helpers into transformer forward
models.md "Coding style": all layer calls should be visible directly in
forward — avoid helper functions that hide nn.Module calls. Inlines
_encode_text / _encode_vision / _encode_sound / _decode_vision /
_decode_sound into forward so embed_tokens, vae2llm, sound2llm, llm2vae,
llm2sound, and time_embedder are all visible at the call site. Pure-
tensor helpers (_patchify_and_pack_latents, _unpatchify_and_unpack_latents,
_pack_sound_latents, _unpack_sound_latents, _apply_timestep_embeds_to_noisy_tokens)
stay as methods since they don't hide layer state.
Also drops the inference-unreachable guards while collapsing the helpers:
the "vision is None" / "sound is None" / "mse_loss_indexes.numel() > 0"
branches never fire because the pipeline always packs vision, only routes
to the sound branch when sound is present, and condition_frame_indexes
never covers the entire stream.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* trim dead flags and idioms in Cosmos3 pipeline __call__
- Drop assert config.use_moe — use_moe is never read by the model and
asserts vanish under python -O; if !use_moe is unsupported the place
to surface it is check_inputs, not a stripped assert.
- Delete the joint_attn_implementation == "flex" path entirely (the
include_end_of_generation_token branch in pack_input_sequence, the
include_eog hoist, and both call-site kwargs). The published config
is "two_way"; the flex branch and the end_of_generation special token
were dead under every shipped checkpoint.
- Drop torch._inductor.cudagraph_mark_step_begin() from the step loop —
cudagraph stepping belongs in the caller's torch.compile wrapper, not
the inference pipeline.
- Replace four int(torch.prod(torch.tensor(shape))) idioms with
math.prod(shape) — no tensor allocation, no .item() sync, and math
is already imported.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
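The `math.prod` replacement mentioned in the last bullet can be illustrated with the standard library alone. The shape below is an illustrative value, not one from the pipeline:

```python
import math

shape = (2, 16, 30, 52)  # illustrative shape, assumed for this example

# Pure-Python integer product: no tensor is allocated and nothing is
# synchronized with an accelerator.
num_elements = math.prod(shape)
print(num_elements)

# An empty shape yields the multiplicative identity.
print(math.prod(()))
```

`math.prod` is available from Python 3.8 onward and returns a plain `int` for integer inputs, so no `int(...)` wrapper is needed.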
* introduce Cosmos3Condition for prepare_latents return shape
Replace the 7-tuple return from prepare_latents with a Cosmos3Condition
dataclass that carries the encoded conditioning latents (vision + optional
sound), their fps tensors, the conditioning frame indices, and the
num_vision_items count. The denoising loop and _postprocess_latents now
read these as named attributes instead of positional tuple unpacking.
Addresses reviewer thread huggingface/diffusers-new-model-addition-cosmos#1
comment 3278569263 ("create something like Cosmos3Condition class for
condition input").
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* restructure pack helpers as data-returning pipeline methods
Replace the four module-level pack functions (_pack_text_tokens,
_pack_vision_tokens, _pack_sound_tokens, pack_input_sequence) with
methods on Cosmos3OmniDiffusersPipeline. The three per-segment methods
now build and return their own data (text_ids/mrope_ids tuple for text,
a populated ModalityData for vision/sound) instead of mutating a shared
PackedSequence builder; pack_input_sequence orchestrates them.
Other cleanups along the way:
- Drop dead branches that the published config never exercises:
use_mrope=False (model is always unified_3d_mrope), has_generation=False
(always True), multi-vision-items (num_vision_items always 1), and
the curr_rope_id non-mrope path.
- Move the bf16 cast of per-step noisy tokens to before pack_input_sequence
so the build-then-mutate pattern on packed_seq.vision.tokens disappears.
- Drop the latent_patch_size / config hoists in __call__ that are now
read directly inside the pack methods.
Addresses reviewer threads
huggingface/diffusers-new-model-addition-cosmos#1 comments 3278871807,
3278908766, 3278918514.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* collapse builder-pattern PackedSequence/ModalityData into flat dataclasses
ModalityData and PackedSequence carried a list-or-tensor union for every
field so they could double as builders during packing, with finalize()
converting the lists to tensors at the end. Now that the pack methods
each build their segment in one shot, finalize() is just a list->tensor
conversion the pack methods can do themselves.
- Rename ModalityData to _ModalityData (internal) with all-tensor fields
(lists only for per-item entries like tokens / condition_mask).
- Rename PackedSequence to Cosmos3PackedSequence and drop its finalize()
method; fields are direct tensors at construction time.
- _pack_vision_tokens / _pack_sound_tokens now build finalized
_ModalityData directly via torch.arange / torch.tensor / torch.full.
- pack_input_sequence builds the final Cosmos3PackedSequence in one
return statement; no more two-stage build-then-finalize.
- Update the transformer's forward docstring to reference the new name.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* exclude dtype from saved transformer config via ignore_for_config
ModelMixin.from_pretrained injects dtype into init_dict
(configuration_utils.py:289) whenever it appears in the loader's
unused_kwargs, so Cosmos3OmniTransformer.__init__ has to accept dtype.
But the default @register_to_config decorator was also serializing it
into config.json on every save — leaving a stray "dtype": "bfloat16"
key that doesn't describe the architecture, just the load-time runtime
preference.
Adding ignore_for_config = ["dtype"] keeps the decorator from
registering dtype while still accepting it on the init signature.
New saves omit dtype; existing checkpoints that have it log a
warning at load time but the value is re-injected and __init__
ignores it, so loading is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
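The effect of `ignore_for_config` can be sketched with a toy decorator. This is an illustration of the idea only: `toy_register_to_config` and `ToyTransformer` are hypothetical stand-ins written for this example and do not reproduce the diffusers implementation.

```python
import functools
import inspect


def toy_register_to_config(init):
    """Toy decorator: record __init__ arguments on self.config, minus ignored names."""

    @functools.wraps(init)
    def wrapper(self, *args, **kwargs):
        bound = inspect.signature(init).bind(self, *args, **kwargs)
        bound.apply_defaults()
        ignored = set(getattr(self, "ignore_for_config", []))
        self.config = {
            name: value
            for name, value in bound.arguments.items()
            if name != "self" and name not in ignored
        }
        init(self, *args, **kwargs)

    return wrapper


class ToyTransformer:
    # dtype is accepted by __init__ but kept out of the saved config.
    ignore_for_config = ["dtype"]

    @toy_register_to_config
    def __init__(self, hidden_size=64, dtype=None):
        self.hidden_size = hidden_size


model = ToyTransformer(hidden_size=128, dtype="bfloat16")
print(model.config)
```

The constructor still accepts `dtype`, so a loader that injects it keeps working, while the recorded config describes only the architecture.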
* add Cosmos3 pipeline and transformer docs pages; retire JSON example inputs
- New docs/source/en/api/pipelines/cosmos3.md with copy-pasteable
text-to-image, text-to-video, image-to-video, and text-to-video-with-sound
example snippets. The docs page is now the canonical reference for
application code instead of the JSON-driven inputs/ directory.
- New docs/source/en/api/models/cosmos3_omni_transformer.md describing the
MoT dual-pathway architecture and showing a from_pretrained snippet.
- Wire both pages into docs/source/en/_toctree.yml.
- Export Cosmos3OmniDiffusersPipeline from the top-level diffusers package
(matching every sibling pipeline) and add the corresponding dummy class
for torch/transformers-unavailable environments.
- Delete examples/cosmos3/inputs/omni/{t2i,t2v,i2v}.json and rewrite
inference_cosmos3.py to take --prompt / --vision-path / --num-frames
directly as CLI args. The script stays as a development smoke-test
runner; canonical usage now lives in the docs.
- Refresh examples/cosmos3/README.md to point at the docs page and
reflect the new CLI surface.
Addresses reviewer thread
huggingface/diffusers-new-model-addition-cosmos#1 comment 3278258567.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* restore full prompts in cosmos3 docs from original example JSONs
The text-to-video and image-to-video example prompts were shortened
when porting from examples/cosmos3/inputs/omni/*.json into the docs
page. Restore them verbatim from the JSONs so the docs reflect the
prompts the model was actually demonstrated against, and so users
copying from the docs get the same conditioning the example was tuned
for.
Also align the text-to-video-with-sound example: it now reuses the
exact same prompt as the text-to-video block with only enable_sound=True
added, instead of a hand-written waterfall prompt.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Renamed Cosmos3 module attributes
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* bugfix
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Removed unnecessary helper function; added extra comments
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Moved from encode_prompt to tokenize_prompt
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Bring back video system prompt
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Remove multi frame conditioning for now
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Removed default negative prompts from code
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Remove Cosmos3condition; simplify sequence pack
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Clean up multiple parameters
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Simplify decode video; remove remnants of batching
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* simple renames
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Refactored schedulers for sound
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Remove unnecessary autocast
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Update sound example
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Simplify loops in transformer_cosmos3.py
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Remove unused config attributes
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Cleanup audio decoder
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Reuse encoder_video from LTX2 for Cosmos3
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Fixed a few nits
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Moved to RMSNorm for Cosmos3
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Remove meta_tensor usage
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Improved rope handling
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Improved prompt templates
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Added extra docs for templates
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* remove dataclasses
* Cleanup after merging
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Added guardrails
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* bugfixed guardrails
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Bugfix guardrails v2
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Simplified input_timestep
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Add TODO
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* simplify conditional mask generation
* Inlined _postprocess_latents
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* removed pack_input_sequence helper
* restore export utils with deprecation warning
* moved sampling rate to pipeline attribute
* inlined sound and image condition mask
* separating static and timestep based sound and vision token packing
* unpack transformer args
* enabled selection of cosmos3 super
* Update src/diffusers/pipelines/cosmos/pipeline_cosmos3_omni.py
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Update src/diffusers/pipelines/cosmos/pipeline_cosmos3_omni.py
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Apply suggestions from code review
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* fixed sound and vision conditioning use from prepare latents
* ruff format and doc builder
* ran fix copies
* move typing to python3.10
* move special token application from pack_text_tokens to tokenize_prompt
* rename packing methods to process methods
* remove guidance_scale check
* fix nits
* respect vae dtype in the pipeline
* use vae dtype for vae normalization stats
* skip CFG if guidance_scale is 1
* Remove unnecessary parameter
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* fix CFG for sound
* bugfix for sound CFG
* ruff format
* Fix apply_chat_template return dict arg to return BatchEncoding
* added option to select attention processor
* docs: refresh Cosmos 3 pipeline intro
Replace the terse architectural lede with the launch-style positioning
(unified WFM for Physical AI, consolidating Predict/Reason/Transfer
into one omni-model) and split out "What's new" and "Available
checkpoints" sections so the page leads with capability rather than
repo IDs.
* docs: document Cosmos3OmniPipeline.__call__ arguments
Add the missing Args block to Cosmos3OmniPipeline.__call__ so
utils/check_forward_call_docstrings.py passes — covers all 21
parameters from prompt through enable_safety_check, plus a Returns
section describing the output dataclass.
* style: doc-builder reflow on Cosmos3OmniPipeline.__call__ docstring
---------
Signed-off-by: Maciej Bala <mbala@nvidia.com>
Co-authored-by: Yuliya Zhautouskaya <yzhautouskay@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Maciej Bala <mbala@nvidia.com>
Co-authored-by: Dima Zhylko <dzhylko@nvidia.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
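Note on the "skip CFG if guidance_scale is 1" commit above: the log does not show the change itself, so the following is only a minimal sketch, assuming the standard classifier-free guidance formula `uncond + scale * (cond - uncond)`. With a scale of 1 that expression reduces to `cond`, so the unconditional forward pass can be skipped. The names `guided_prediction` and `predict` are hypothetical and are not taken from the Cosmos3 pipeline.

```python
from typing import Callable, Sequence


def guided_prediction(
    predict: Callable[[str], Sequence[float]],
    prompt: str,
    negative_prompt: str,
    guidance_scale: float,
) -> list[float]:
    """Combine conditional and unconditional predictions (hypothetical helper).

    `predict` stands in for one denoiser forward pass; it is an assumption
    made for illustration, not an API from the pipeline in this PR.
    """
    cond = predict(prompt)

    # With guidance_scale == 1 the formula below returns `cond` unchanged,
    # so the second (unconditional) forward pass is skipped entirely.
    if guidance_scale == 1.0:
        return list(cond)

    uncond = predict(negative_prompt)
    # Standard classifier-free guidance: move away from the unconditional
    # prediction, toward the conditional one, by `guidance_scale`.
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]
```

In this sketch the saving is one forward pass per denoising step whenever guidance is effectively disabled.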
No description provided.