Adding Cosmos 3 to Diffusers by atharvajoshi10 · Pull Request #13818 · huggingface/diffusers

atharvajoshi10 · 2026-05-27T21:18:03Z

What does this PR do?

Adds the Cosmos 3 omni pipeline to Diffusers — NVIDIA's unified world foundation model for Physical AI. Cosmos 3 is a
single Mixture-of-Transformers (MoT) model that combines world generation, physical reasoning, and action generation
into one forward pass, replacing the separate Predict / Reason / Transfer models from earlier Cosmos releases.

The integration ships:

- Cosmos3OmniPipeline: one pipeline class supporting four workflows:
  - text-to-image (num_frames=1)
  - text-to-video
  - image-to-video
  - text+image-to-video-with-sound (when the checkpoint carries a sound tokenizer; enable_sound=True)
- Cosmos3OmniTransformer: MoT backbone running a Qwen-style causal "understanding" stream in parallel with a bi-directional "generation" stream over video + (optional) sound latents, joined by a 3D multimodal RoPE.
- Cosmos3AVAEAudioTokenizer: decoder-only audio tokenizer (Oobleck-style Snake1d + weight-norm conv stack, inlined for self-containment).
- Reuses AutoencoderKLWan for the video VAE and UniPCMultistepScheduler for diffusion, so no new scheduler is required.
- CosmosSafetyChecker (from cosmos_guardrail) is wired up by default per the NVIDIA Open Model License. Disable-at-construction (enable_safety_checker=False) and disable-per-call (enable_safety_check=False) flags exist for test/dev workflows.
- Two checkpoints on the Hub: nvidia/Cosmos3-Nano (smaller, faster) and nvidia/Cosmos3-Super (larger, higher quality).
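
The four workflows are distinguished by call arguments rather than by separate classes. A minimal sketch of that dispatch follows; `select_workflow` is a hypothetical helper, not something this PR adds, and it assumes that the presence of an input image and the `enable_sound` flag are what separate the cases.

```python
from typing import Optional


def select_workflow(num_frames: int, image: Optional[object] = None, enable_sound: bool = False) -> str:
    """Hypothetical helper: name the workflow implied by a set of call arguments."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if enable_sound:
        # Assumption: sound is only listed for the text+image-to-video workflow.
        if image is None or num_frames == 1:
            raise ValueError("sound generation assumes an input image and more than one frame")
        return "text+image-to-video-with-sound"
    if image is not None:
        return "image-to-video"
    # A single frame is treated as an image; anything longer is a video.
    return "text-to-image" if num_frames == 1 else "text-to-video"
```

The real pipeline may validate these combinations differently; the sketch only illustrates how a single entry point can cover all four cases.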

Refactor opportunistically included: promoted LTX2's PyAV-based encode_video helper to diffusers.utils.export_utils so
Cosmos 3 (and any future pipeline that needs to mux audio into MP4) can reuse it.
diffusers.pipelines.ltx2.export_utils.encode_video remains as a deprecation shim with a 0.40.0 removal target, so
existing user code keeps working.
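
A deprecation shim of this kind typically warns and then forwards to the new location. The sketch below uses the standard `warnings` module and a stand-in `encode_video` with a made-up signature; the real helper's arguments and the exact deprecation mechanism used in Diffusers are not shown in this PR description, so treat both as assumptions.

```python
import warnings


def encode_video(frames, fps, output_path):
    """Stand-in for the shared helper (signature is illustrative only)."""
    return f"{len(frames)} frames @ {fps} fps -> {output_path}"


def encode_video_deprecated(*args, **kwargs):
    """Shim kept at the old import path: warn, then forward to the new location."""
    warnings.warn(
        "Importing encode_video from the old LTX2 location is deprecated and is scheduled "
        "for removal in 0.40.0; import it from diffusers.utils.export_utils instead.",
        FutureWarning,
        stacklevel=2,
    )
    return encode_video(*args, **kwargs)
```

Existing callers get the same return value plus a warning, which is what lets the old import path survive until the removal target.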

Docs page (docs/source/en/api/pipelines/cosmos3.md) covers all four workflows with per-model tabs (Nano / Super), plus metadata-template and safety-checker controls. A minimal smoke-test runner lives at examples/cosmos3/inference_cosmos3.py with a --model {nano,super} flag.
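
For illustration, a `--model {nano,super}` flag can be mapped to the two Hub repositories as sketched below. This is not the contents of examples/cosmos3/inference_cosmos3.py; the function names and the `nano` default are assumptions.

```python
import argparse

# Repo IDs as named in the PR description.
MODEL_REPOS = {"nano": "nvidia/Cosmos3-Nano", "super": "nvidia/Cosmos3-Super"}


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Cosmos 3 smoke test (sketch)")
    # Assumption: default to the smaller checkpoint when no flag is given.
    parser.add_argument("--model", choices=sorted(MODEL_REPOS), default="nano")
    return parser.parse_args(argv)


def resolve_repo(argv=None):
    """Return the Hub repo ID selected by the --model flag."""
    return MODEL_REPOS[parse_args(argv).model]
```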

attn_modes was never read; sample_lens collapses into the existing sequence_length int (we only pack a single sample at a time); split_lens collapses into a single und_len int (only split_lens[0] was ever read).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

curr, _use_mrope, _mrope_temporal_offset, _mrope_reset_spatial were transient counters used while building the joint sequence, not part of the finalized output. Thread them through _pack_*_tokens as positional args/returns so the dataclass only carries fields the pipeline actually reads back.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
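
The pattern described in this commit message, keeping a transient counter out of the dataclass by passing it in and returning it, can be sketched as follows. The field names and the `pack_tokens` helper are illustrative assumptions, not the PR's actual `_pack_*_tokens` code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PackedSequence:
    """Only carries what the caller reads back after packing."""
    token_ids: List[int] = field(default_factory=list)
    position_ids: List[int] = field(default_factory=list)


def pack_tokens(packed: PackedSequence, tokens: List[int], curr: int) -> int:
    """Append tokens starting at offset `curr` and return the advanced offset."""
    packed.token_ids.extend(tokens)
    packed.position_ids.extend(range(curr, curr + len(tokens)))
    return curr + len(tokens)


def pack_input_sequence(text_tokens: List[int], video_tokens: List[int]) -> PackedSequence:
    packed = PackedSequence()
    curr = 0  # transient counter: lives in this function, not on the dataclass
    curr = pack_tokens(packed, text_tokens, curr)
    curr = pack_tokens(packed, video_tokens, curr)
    return packed
```

Because `curr` never becomes a field, the finalized object cannot be read in a half-built state through it.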

Thread device through pack_input_sequence and _pack_*_tokens helpers so all torch.tensor/zeros/arange calls land on the target device directly. Move CPU-side mRoPE tensors over with .to(device) at the append site. Pass device to finalize so list-to-tensor conversion lands on device too. Delete PackedSequence.to_cuda() and the helper _modality_to_cuda; drop the corresponding call sites in __call__.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the terse architectural lede with the launch-style positioning (unified WFM for Physical AI, consolidating Predict/Reason/Transfer into one omni-model) and split out "What's new" and "Available checkpoints" sections so the page leads with capability rather than repo IDs.

HuggingFaceDocBuilderDev · 2026-05-27T21:50:50Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

yiyixuxu

Thanks, I somehow missed the guidance_scale.
Our pipelines have a do_classifier_free_guidance property to indicate whether we want to apply CFG or not:
https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cosmos/pipeline_cosmos2_5_predict.py#L529
Let me know if there is a reason we didn't do that for Cosmos 3.
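
For readers unfamiliar with the convention mentioned here, a minimal sketch of a `do_classifier_free_guidance` property follows. The class is a toy stand-in operating on plain lists, and the `> 1.0` threshold is an assumption based on the common convention, not a copy of the linked file.

```python
from typing import List


class GuidedPipelineSketch:
    """Toy stand-in for a pipeline; not the Cosmos 3 implementation."""

    def __init__(self, guidance_scale: float = 1.0):
        self._guidance_scale = guidance_scale

    @property
    def guidance_scale(self) -> float:
        return self._guidance_scale

    @property
    def do_classifier_free_guidance(self) -> bool:
        # With a scale of 1 the CFG formula reduces to the conditional prediction,
        # so the unconditional pass can be skipped.
        return self._guidance_scale > 1.0

    def combine(self, cond: List[float], uncond: List[float]) -> List[float]:
        if not self.do_classifier_free_guidance:
            return list(cond)
        s = self._guidance_scale
        return [u + s * (c - u) for c, u in zip(cond, uncond)]
```

Gating on the property keeps the decision in one place, so the denoising loop does not need its own comparison against `guidance_scale`.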

yiyixuxu · 2026-05-27T21:52:53Z

can you run make fix-copies?

Add the missing Args block to Cosmos3OmniPipeline.__call__ so utils/check_forward_call_docstrings.py passes — covers all 21 parameters from prompt through enable_safety_check, plus a Returns section describing the output dataclass.
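
The kind of check this commit refers to can be approximated as below. This is not the logic of utils/check_forward_call_docstrings.py, which is not reproduced here; `missing_documented_args` and `example_call` are hypothetical names, and the regular expression assumes entries of the form ``name (`type`): description``.

```python
import inspect
import re


def missing_documented_args(func):
    """Return parameters of `func` that have no entry in its docstring."""
    doc = inspect.getdoc(func) or ""
    # An entry looks like "name (`type`, ...):" or "name:" at the start of a line.
    documented = set(re.findall(r"^\s*(\w+)\s*(?:\(.*?\))?:", doc, flags=re.MULTILINE))
    params = [p for p in inspect.signature(func).parameters if p != "self"]
    return [p for p in params if p not in documented]


def example_call(prompt, num_frames=1, guidance_scale=1.0):
    """
    Args:
        prompt (`str`): Text prompt.
        num_frames (`int`, defaults to 1): Number of frames.
    """


# guidance_scale has no Args entry, so it is the only name reported.
undocumented = missing_documented_args(example_call)
```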

yiyixuxu · 2026-05-27T22:23:16Z

@bot /style

MeiYi-dev · 2026-06-05T08:17:34Z

Can we also get this model into ComfyUI please so we can use it in our workflows?

atharvajoshi10 and others added 30 commits May 27, 2026 13:54

Adding Cosmos 3

ee8931b

removed dead code

da55428

Change custom TimeEmbedding layer to Diffusers time embedding

786dbd4

removed dependency on hugging face transformers

2415b39

refactor 1

5d4d453

Fixed Attention Pattern

059644d

Removed from Pretrain overrides

277ef7b

Removing normalization from the audio Tokenizer

2aa39f7

fixed diffusers checkpoint

1b73e0b

fixed video save uint conversion

dc6460f

added forward hook for cpu offload case

6cd13c9

removed dead params for sound tokenizer

3c5b60e

renaming audio encoder for readability

807344f

ruff format

722f4ee

Fix checkpoint conversion script for sound tokenizer

0d5391a

Audio Decoder trim and removing some dead code

f28e468

removing dead sequence packing code

409a3a4

refactor pipeline to diffusers style formatting

5f5f72a

removing use of cosmos3 audio encoder

da0b661

Revert "removing use of cosmos3 audio encoder"

d774d04

This reverts commit 1b8b99a95e5acc749b5c100637ed382de138c606.

refactor audio encoder

b367226

inline remaining sequence packing functions and lint

fda144f

Removed GenerationDataClean class and Action logic

9529ac5

inlined default args

0e8e1ae

removed dead code and refactoring

a008d0c

drop pipeline-helper @no_grad, inline derive helper, move guidance check to check_inputs

2ab7f6d

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

drop private-API isinstance/shape asserts

721bcb7

These were build-mode guards on PackedSequence internals and shape checks on tensors the pipeline itself constructs, both flagged as noise in private code per reviewer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Zhylkaaa and others added 13 commits May 27, 2026 14:00

rename packing methods to process methods

03811a6

remove guidance_scale check

ac6006f

fix nits

2b3b006

respect vae dtype in the pipeline

de6ad6d

use vae dtype for vae normalization stats

8808a5b

skip CFG if guidance_scale is 1

8153623

Remove unnecessary parameter

94bc454

Signed-off-by: Maciej Bala <mbala@nvidia.com>

fix CFG for sound

d1084cb

bugfix for sound CFG

017f628

ruff format

83e87ea

Fix apply_chat_template return dict arg to return BatchEncoding

d2c92e3

added option to select attention processor

9f99529

github-actions bot added the documentation (Improvements or additions to documentation), models, utils, pipelines, examples, and size/L (PR with diff > 200 LOC) labels May 27, 2026

atharvajoshi10 changed the title ~~Adding Cosmos 3 Model~~ Adding Cosmos 3 to Diffusers May 27, 2026

yiyixuxu approved these changes May 27, 2026

Three comment threads on src/diffusers/pipelines/cosmos/pipeline_cosmos3_omni.py

style: doc-builder reflow on Cosmos3OmniPipeline.__call__ docstring

c2eb4cb

yiyixuxu merged commit a1c7df4 into huggingface:main May 28, 2026
20 of 49 checks passed

atharvajoshi10 deleted the cosmos3/video-gen-with-sound branch June 1, 2026 18:29

HaomingSong mentioned this pull request Jun 9, 2026

feat(policies): add Cosmos3 drafts huggingface/lerobot#3745

Draft

yiyixuxu merged 112 commits into huggingface:main from atharvajoshi10:cosmos3/video-gen-with-sound

7 participants
