Adding Cosmos 3 to Diffusers #13818
This reverts commit 1b8b99a95e5acc749b5c100637ed382de138c606.
…eck to check_inputs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
attn_modes was never read; sample_lens collapses into the existing sequence_length int (we only pack a single sample at a time); split_lens collapses into a single und_len int (only split_lens[0] was ever read). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
curr, _use_mrope, _mrope_temporal_offset, _mrope_reset_spatial were transient counters used while building the joint sequence, not part of the finalized output. Thread them through _pack_*_tokens as positional args/returns so the dataclass only carries fields the pipeline actually reads back. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
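The refactor described in this commit can be sketched roughly as follows. This is a simplified illustration, not the PR's code: `PackedSequence`, `pack_input_sequence`, and `curr` are names from the commit message, while `_pack_tokens`, the dataclass fields, and the list-based token representation are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PackedSequence:
    # Only fields the caller reads back after packing (hypothetical fields);
    # build-time counters such as `curr` are deliberately not stored here.
    token_ids: List[int] = field(default_factory=list)
    position_ids: List[int] = field(default_factory=list)


def _pack_tokens(packed: PackedSequence, tokens: List[int], curr: int) -> int:
    """Append `tokens` starting at position `curr` and return the updated counter."""
    packed.token_ids.extend(tokens)
    packed.position_ids.extend(range(curr, curr + len(tokens)))
    return curr + len(tokens)


def pack_input_sequence(text_tokens: List[int], video_tokens: List[int]) -> PackedSequence:
    packed = PackedSequence()
    curr = 0  # transient counter lives in the builder, not on the dataclass
    curr = _pack_tokens(packed, text_tokens, curr)
    curr = _pack_tokens(packed, video_tokens, curr)
    return packed
```

Passing the counter in and out keeps the finished object free of state that only matters while the sequence is being built.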
These were build-mode guards on PackedSequence internals and shape checks on tensors the pipeline itself constructs, both flagged as noise in private code per reviewer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thread device through pack_input_sequence and _pack_*_tokens helpers so all torch.tensor/zeros/arange calls land on the target device directly. Move CPU-side mRoPE tensors over with .to(device) at the append site. Pass device to finalize so list-to-tensor conversion lands on device too. Delete PackedSequence.to_cuda() and the helper _modality_to_cuda; drop the corresponding call sites in __call__. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Maciej Bala <mbala@nvidia.com>
Replace the terse architectural lede with the launch-style positioning (unified WFM for Physical AI, consolidating Predict/Reason/Transfer into one omni-model) and split out "What's new" and "Available checkpoints" sections so the page leads with capability rather than repo IDs.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
yiyixuxu left a comment
thanks, I somehow missed the guidance_scale
our pipeline has a do_classifier_free_guidance property to indicate whether to apply CFG or not
https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cosmos/pipeline_cosmos2_5_predict.py#L529
let me know if there is a reason we didn't do that for cosmos3
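For context, the property pattern being referred to looks roughly like the sketch below. `ToyPipeline` is a hypothetical stand-in, not the Cosmos 3 pipeline, and the `> 1.0` threshold is an assumption based on the usual convention that a guidance scale of 1.0 makes CFG a no-op.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline; not the Cosmos 3 implementation."""

    def __init__(self):
        self._guidance_scale = 1.0

    @property
    def guidance_scale(self) -> float:
        return self._guidance_scale

    @property
    def do_classifier_free_guidance(self) -> bool:
        # Assumption: at scale 1.0 the guided output equals the conditional one,
        # so the unconditional pass can be skipped.
        return self._guidance_scale > 1.0

    def __call__(self, cond: float, uncond: float, guidance_scale: float = 1.0) -> float:
        self._guidance_scale = guidance_scale
        if not self.do_classifier_free_guidance:
            return cond
        # Standard CFG combination of unconditional and conditional predictions.
        return uncond + self.guidance_scale * (cond - uncond)
```

Scalars stand in for the noise predictions here; in a real pipeline these would be tensors from two transformer passes.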
can you run
Add the missing Args block to Cosmos3OmniPipeline.__call__ so utils/check_forward_call_docstrings.py passes — covers all 21 parameters from prompt through enable_safety_check, plus a Returns section describing the output dataclass.
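The kind of check this commit satisfies can be illustrated with a small sketch. It is not the implementation of utils/check_forward_call_docstrings.py, whose actual logic is not shown in this thread; `missing_documented_args`, `example_call`, and the default values are hypothetical, and the regex assumes entries of the form `name (type):` under `Args:`.

```python
import inspect
import re


def missing_documented_args(func) -> list:
    """Return parameter names of `func` that have no `name (...)` entry in its docstring."""
    doc = inspect.getdoc(func) or ""
    # Assumed entry format: "param_name (`type`, ...):" at the start of a line.
    documented = set(re.findall(r"^\s*(\w+)\s*\(", doc, flags=re.MULTILINE))
    params = [p for p in inspect.signature(func).parameters if p != "self"]
    return [p for p in params if p not in documented]


def example_call(prompt, num_inference_steps=50, enable_safety_check=True):
    """
    Args:
        prompt (`str`):
            Text prompt.
        num_inference_steps (`int`):
            Number of denoising steps.
    """


print(missing_documented_args(example_call))
```

Here `enable_safety_check` is in the signature but absent from the docstring, so it is the one name reported.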
@bot /style
Can we also get this model into ComfyUI please so we can use it in our workflows?
What does this PR do?
Adds the Cosmos 3 omni pipeline to Diffusers. Cosmos 3 is NVIDIA's unified world foundation model for Physical AI: a
single Mixture-of-Transformers (MoT) model that combines world generation, physical reasoning, and action generation
into one forward pass, replacing the separate Predict / Reason / Transfer models from earlier Cosmos releases.
The integration ships:
- Cosmos3OmniPipeline: one pipeline class supporting four workflows.
- Cosmos3OmniTransformer: MoT backbone running a Qwen-style causal "understanding" stream in parallel with a bi-directional "generation" stream over video + (optional) sound latents, joined by a 3D multimodal RoPE.
- Cosmos3AVAEAudioTokenizer: decoder-only audio tokenizer (Oobleck-style Snake1d + weight-norm conv stack, inlined for self-containment).
- AutoencoderKLWan for the video VAE and UniPCMultistepScheduler for diffusion; no new scheduler required.
- CosmosSafetyChecker (from cosmos_guardrail) wired up by default per the NVIDIA Open Model License. Disable-at-construction (enable_safety_checker=False) and disable-per-call (enable_safety_check=False) flags exist for test/dev workflows.
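How the two safety flags might interact can be sketched with a toy class. Only the flag names `enable_safety_checker` and `enable_safety_check` come from this description; `ToyGuardedPipeline`, the string-matching "checker", and the rule that the check runs only when both flags are on are assumptions for illustration, not the pipeline's actual behavior.

```python
class ToyGuardedPipeline:
    """Hypothetical stand-in; not Cosmos3OmniPipeline or CosmosSafetyChecker."""

    def __init__(self, enable_safety_checker: bool = True):
        # Construction-time switch: when disabled, no checker is created at all.
        self.safety_checker = (lambda text: "unsafe" not in text) if enable_safety_checker else None

    def __call__(self, prompt: str, enable_safety_check: bool = True) -> str:
        # Per-call switch: skip the check for this call even if a checker exists.
        if self.safety_checker is not None and enable_safety_check:
            if not self.safety_checker(prompt):
                raise ValueError("prompt rejected by safety checker")
        return f"generated:{prompt}"


pipe = ToyGuardedPipeline()
print(pipe("a robot arm stacking blocks"))
```

The construction-time flag avoids loading a checker at all, while the per-call flag leaves it loaded but bypasses it for a single invocation.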
Refactor opportunistically included: promoted LTX2's PyAV-based encode_video helper to diffusers.utils.export_utils so Cosmos 3 (and any future pipeline that needs to mux audio into MP4) can reuse it.
diffusers.pipelines.ltx2.export_utils.encode_video remains as a deprecation shim with a 0.40.0 removal target, so existing user code keeps working.
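A deprecation shim of this kind generally warns and then forwards to the new location. The sketch below uses the standard-library `warnings` module and toy functions; the real shim may use Diffusers' own `deprecate` utility instead, and the `encode_video` signature shown here is a hypothetical placeholder.

```python
import warnings


def encode_video(frames, fps, output_path):
    """Stand-in for the shared helper (hypothetical signature and behavior)."""
    return f"{len(frames)} frames @ {fps} fps -> {output_path}"


def encode_video_deprecated(*args, **kwargs):
    """Old import location: emit a warning, then forward to the shared helper."""
    warnings.warn(
        "`diffusers.pipelines.ltx2.export_utils.encode_video` is deprecated and will be "
        "removed in 0.40.0; import `encode_video` from `diffusers.utils.export_utils` instead.",
        FutureWarning,
        stacklevel=2,  # point the warning at the caller, not at the shim
    )
    return encode_video(*args, **kwargs)
```

Because the shim forwards all arguments unchanged, callers of the old path get identical results plus a warning until the removal version.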
The docs page (docs/source/en/api/pipelines/cosmos3.md) covers all four workflows with per-model tabs (Nano / Super), plus metadata-template and safety-checker controls. A minimal smoke-test runner lives at examples/cosmos3/inference_cosmos3.py with a --model {nano,super} flag.
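A `--model {nano,super}` flag like the one on the smoke-test runner is typically declared with `argparse` choices. This is a sketch of that flag only, not the contents of examples/cosmos3/inference_cosmos3.py; `build_parser` and the choice to make the flag required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Cosmos 3 smoke test (sketch)")
    parser.add_argument(
        "--model",
        choices=["nano", "super"],  # argparse rejects any other value
        required=True,  # assumption: the real script may define a default instead
        help="Which model size to run.",
    )
    return parser


args = build_parser().parse_args(["--model", "super"])
print(args.model)
```

Mapping the chosen size to a checkpoint would happen after parsing; no repo IDs are shown here because none appear in this description.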