[PyTorch][torch.compile] Make quantizers opaque value objects by pggPL · Pull Request #7 · pggPL/TransformerEngine

pggPL · 2026-06-06T12:14:03Z

Description

Tensorless quantizers in TE (MXFP8, FP8 blockwise, FP8 current-scaling, NVFP4)
are fully described by a handful of plain, reproducible scalars — they hold no
live tensors and no process groups. This PR turns them into opaque value
objects so torch.compile can treat them as baked-in constants: two
quantizers with the same configuration become interchangeable, hashable, and
reconstructible inside an FX graph.

Quantizers that hold live state (delayed-scaling Float8Quantizer, which keeps
scale/amax tensors) and any user-defined quantizer keep the default
identity semantics, so the change is opt-in and backward compatible. On older
PyTorch builds without the opaque-object API the registration is a graceful
no-op.
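The split between value and identity semantics described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not TE's code: the method names `_value_fields` and `_value_key` come from this PR, while the class names, constructor arguments, and the shape of the key are made up for the example.

```python
class SketchQuantizer:
    """Illustrative base class; NOT the real transformer_engine Quantizer."""

    def _value_fields(self):
        # Default: None means "no value identity", so object identity applies.
        return None

    def _value_key(self):
        fields = self._value_fields()
        if fields is None:
            return None
        # Include the concrete type so different quantizer classes never collide.
        return (type(self), tuple((name, getattr(self, name)) for name in fields))

    def __eq__(self, other):
        key = self._value_key()
        if key is None:
            return self is other
        return isinstance(other, SketchQuantizer) and key == other._value_key()

    def __hash__(self):
        key = self._value_key()
        return object.__hash__(self) if key is None else hash(key)


class SketchBlockQuantizer(SketchQuantizer):
    """Hypothetical tensorless quantizer described only by plain scalars."""

    def __init__(self, fp8_dtype="e4m3", rowwise=True, columnwise=True):
        self.fp8_dtype = fp8_dtype
        self.rowwise = rowwise
        self.columnwise = columnwise

    def _value_fields(self):
        return ("fp8_dtype", "rowwise", "columnwise")


class SketchStatefulQuantizer(SketchQuantizer):
    """Hypothetical quantizer with live state; it keeps identity semantics."""

    def __init__(self):
        self.amax_history = [0.0]
```

With this shape, two `SketchBlockQuantizer()` instances compare and hash equal, while two `SketchStatefulQuantizer()` instances stay distinct because `_value_fields()` returns None for them.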

Along the way this also un-breaks the existing test_torch_compile.py suite:
that file lived on main but was never wired into CI, and its
test_autocast_nested_custom case (nested te.autocast with multiple
CustomRecipe instances) was failing because of the CustomRecipe state-caching
bug fixed here. The file is now run in CI and passes.
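One plausible reading of the CustomRecipe state-caching bug, sketched without TE: a cache check keyed on the state type alone can never notice that a different recipe instance is now active. All names below are hypothetical stand-ins, and the real `set_meta_tensor` has a different signature and more responsibilities.

```python
class SketchCustomRecipe:
    """Hypothetical stand-in for CustomRecipe: each instance has its own qfactory."""

    def __init__(self, qfactory):
        self.qfactory = qfactory


class SketchCustomRecipeState:
    """Every custom recipe shares this one state type."""

    def __init__(self, recipe):
        self.recipe = recipe
        self.quantizers = [recipe.qfactory(role) for role in ("input", "weight")]


class SketchModule:
    def __init__(self):
        self.state = None

    def set_meta_tensor_buggy(self, recipe):
        # Reuses the cache whenever the state *type* matches, which is always
        # true for custom recipes, so the first recipe's quantizers stick.
        if isinstance(self.state, SketchCustomRecipeState):
            return
        self.state = SketchCustomRecipeState(recipe)

    def set_meta_tensor_fixed(self, recipe):
        # Also require that the cached state was built from this recipe instance.
        if isinstance(self.state, SketchCustomRecipeState) and self.state.recipe is recipe:
            return
        self.state = SketchCustomRecipeState(recipe)


outer = SketchCustomRecipe(lambda role: f"outer:{role}")
inner = SketchCustomRecipe(lambda role: f"inner:{role}")

buggy = SketchModule()
buggy.set_meta_tensor_buggy(outer)
buggy.set_meta_tensor_buggy(inner)   # nested region: stale quantizers survive

fixed = SketchModule()
fixed.set_meta_tensor_fixed(outer)
fixed.set_meta_tensor_fixed(inner)   # nested region: quantizers are rebuilt
```

In the sketch, `buggy.state.quantizers` still holds the outer recipe's quantizers after entering the inner region, whereas `fixed.state.quantizers` comes from the inner recipe's `qfactory`.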

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

- Add opt-in value-object identity to the base Quantizer
  (_value_fields / _value_key / __eq__ / __hash__). Returning None
  from _value_fields() (the default) keeps identity semantics.
- New module transformer_engine/pytorch/dynamo.py holding the
  torch.compile glue: __fx_repr__, value-key reconstruction and
  register_value_opaque_quantizer (gracefully no-op without PyTorch's
  opaque-object API).
- Register MXFP8Quantizer, Float8BlockQuantizer,
  Float8CurrentScalingQuantizer and NVFP4Quantizer as value opaque types
  (the deprecated amax_reduction_group is never part of the value).
- Fix CustomRecipe state caching in TransformerEngineBaseModule.set_meta_tensor:
  rebuild quantizers when the CustomRecipe instance changes (e.g. nested
  te.autocast regions) instead of reusing the first recipe's state, since
  every CustomRecipe shares the CustomRecipeState type but carries its own
  qfactory. This fixes the previously-failing test_autocast_nested_custom.
- Enable tests/pytorch/test_torch_compile.py in the L0_pytorch_unittest QA
  suite (it existed on main but was never run in CI), and add the quantizer
  value-object tests to it. Bringing it into CI required fixing the existing
  CustomRecipe torch.compile path: the qfactory now dispatches on
  QuantizerRole.tensor_type supplied by ToyLinear.get_quantizer_roles.
- Guard the value-object path against a stored amax reduction group: __fx_repr__
  already rejects any quantizer holding a process group, and __eq__ / __hash__
  now raise too. The group is excluded from the value key, so a stored group would
  otherwise compare/hash equal to a groupless quantizer and let torch.compile
  reuse a graph that skips the reduction. Pass the group per quantize call instead.
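The process-group guard in the last item can be illustrated with a small stand-alone sketch. It assumes the guard raises TypeError (the PR states this for `__fx_repr__`; the exception type for `__eq__` / `__hash__` is an assumption here), and it uses a dummy `FakeProcessGroup` class so it runs without torch.distributed.

```python
class FakeProcessGroup:
    """Stand-in for torch.distributed.ProcessGroup so the sketch runs without torch."""


class SketchCurrentScalingQuantizer:
    """Hypothetical quantizer; only fp8_dtype is part of its value."""

    def __init__(self, fp8_dtype="e4m3", amax_reduction_group=None):
        self.fp8_dtype = fp8_dtype
        self.amax_reduction_group = amax_reduction_group

    def _value_key(self):
        # The group is deliberately excluded from the key, so a stored group
        # must be rejected; otherwise it would look equal to a groupless one.
        if self.amax_reduction_group is not None:
            raise TypeError(
                "value quantizer must not store a process group; "
                "pass the group per quantize call instead"
            )
        return (type(self), self.fp8_dtype)

    def __eq__(self, other):
        key = self._value_key()  # raises first if this quantizer stores a group
        return isinstance(other, SketchCurrentScalingQuantizer) and key == other._value_key()

    def __hash__(self):
        return hash(self._value_key())


def raises_type_error(fn):
    """Small helper for the demo: report whether calling fn raises TypeError."""
    try:
        fn()
    except TypeError:
        return True
    return False


plain_a = SketchCurrentScalingQuantizer()
plain_b = SketchCurrentScalingQuantizer()
grouped = SketchCurrentScalingQuantizer(amax_reduction_group=FakeProcessGroup())
```

Here `plain_a == plain_b` holds, while hashing or comparing `grouped` raises instead of silently matching a groupless quantizer.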

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

kshitij12345

LGTM

…ompile
Give tensorless quantizers (MXFP8, FP8 blockwise, FP8 current-scaling, NVFP4) value-object semantics so torch.compile can treat them as baked-in constants:
- Add opt-in value identity to the base Quantizer (_value_fields / _value_key / __eq__ / __hash__). Quantizers holding live tensors (delayed-scaling Float8Quantizer) and custom quantizers keep identity semantics.
- New transformer_engine/pytorch/dynamo.py houses the torch.compile glue: __fx_repr__, value-key reconstruction and register_value_opaque_quantizer (gracefully a no-op on PyTorch builds without the opaque-object API).
- Register the four tensorless quantizers as value opaque types.
Also fix CustomRecipe state caching in TransformerEngineBaseModule: set_meta_tensor now rebuilds quantizers when the CustomRecipe instance changes (e.g. nested te.autocast regions) instead of reusing the first recipe's state, since every CustomRecipe shares the CustomRecipeState type but carries its own qfactory.
Move the quantizer value-object tests into tests/pytorch/test_torch_compile.py and add that file to the L0 pytorch unittest QA suite.
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

…globals
Follow-up to the value-opaque quantizer support:
- Remove the module-level _QUANTIZER_VALUE_REGISTRY (qualname -> class) and _quantizer_from_value_key. __fx_repr__ now captures the quantizer class directly in the FX globals and reconstructs via _rebuild_quantizer(cls, items), matching how PyTorch's own value opaque types (e.g. DTensor placements) reconstruct themselves. This removes global mutable state and the qualname collision risk.
- Consolidate the quantizer value-object tests in test_torch_compile.py down to two functions and exercise reconstruction through the public __fx_repr__ path instead of internal helpers.
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
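The reconstruction path in this commit can be sketched as follows. The sketch assumes that `__fx_repr__` returns an evaluable expression string together with a dict of the names that expression needs; that contract should be checked against PyTorch's opaque-object documentation. `_rebuild_quantizer(cls, items)` is named in the commit, but its body here, the `_value_items` helper, and the class are hypothetical.

```python
def _rebuild_quantizer(cls, items):
    """Recreate an instance from (name, value) pairs without calling __init__."""
    obj = cls.__new__(cls)
    for name, value in items:
        setattr(obj, name, value)
    return obj


class SketchMXFP8Quantizer:
    """Hypothetical value quantizer described by two scalars."""

    def __init__(self, fp8_dtype="e4m3", rowwise=True):
        self.fp8_dtype = fp8_dtype
        self.rowwise = rowwise

    def _value_items(self):
        return (("fp8_dtype", self.fp8_dtype), ("rowwise", self.rowwise))

    def __eq__(self, other):
        return type(other) is type(self) and self._value_items() == other._value_items()

    def __hash__(self):
        return hash((type(self), self._value_items()))

    def __fx_repr__(self):
        # Assumed protocol: (expression string, globals needed to evaluate it).
        # The class object itself travels in the globals dict, so no
        # module-level name -> class registry is needed.
        cls = type(self)
        expr = f"_rebuild_quantizer({cls.__name__}, {self._value_items()!r})"
        return expr, {"_rebuild_quantizer": _rebuild_quantizer, cls.__name__: cls}


q = SketchMXFP8Quantizer(fp8_dtype="e5m2", rowwise=False)
expr, fx_globals = q.__fx_repr__()
rebuilt = eval(expr, dict(fx_globals))  # what a generated graph module would do
```

Because the class is resolved through the captured globals rather than by qualified name, two classes with the same name in different modules cannot be confused.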

Replace the single dynamo.py module with a dynamo/ package so the torch.compile glue can grow with a clear responsibility split across the stacked branches. This branch owns the value-opaque quantizer layer.
* dynamo/quantizer_opaque.py -- register_value_opaque_quantizer and helpers
* dynamo/__init__.py -- re-exports the public API so callers keep importing from transformer_engine.pytorch.dynamo unchanged
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

A value-opaque quantizer must not carry live distributed state. Scan the quantizer attributes in __fx_repr__ and raise TypeError if any holds a torch.distributed.ProcessGroup (e.g. a non-None deprecated amax_reduction_group), so it cannot be silently baked into a torch.compile FX graph. Clarify the related comments accordingly. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

NVFP4Quantizer is registered as a value-opaque quantizer but was missing from the value-semantics / __fx_repr__ round-trip test. Add it to _VALUE_QUANTIZERS (skipped without CUDA, which it needs to construct). Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

…__/__hash__ The amax reduction group is excluded from the value key, so a value quantizer that stored one would compare/hash equal to a groupless one and let torch.compile reuse a graph that skips the reduction. __eq__/__hash__ now raise (mirroring __fx_repr__, which already rejects any process-group-bearing quantizer). The group should be passed per quantize call, not stored on the quantizer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

Add is_value_opaque_quantizer() + the _te_compile_value_opaque flag stamped at registration, so dynamo-traced code can detect registered quantizers (and fall back to eager for unregistered ones). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

…fp4 value key
- Narrow register_opaque_type except to (RuntimeError, TypeError): the API is already imported above, so ImportError/AttributeError there only mask real errors.
- Add test_quantizer_value_object_fullgraph exercising torch.compile(fullgraph=True) end-to-end to verify opaque-type registration took effect.
- Restore missing NVFP4Quantizer._with_random_sign_mask assignment required by _value_fields()/_value_key().
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

…trip _rebuild_quantizer only restores value-key fields, so a reconstructed NVFP4Quantizer was missing the derived rht_matrix tensor (not hashable, so not in the value key) and failed at copy()/quantize time. Add a _rebuild_derived_state hook (called by _rebuild_quantizer) that NVFP4Quantizer uses to rebuild rht_matrix from _with_random_sign_mask (lru_cache -> cheap). Extend test_quantizer_value_object to also quantize with the original and the rebuilt quantizer and require bit-exact results (gated on HW support), so a field the kernel needs but the value key omits can no longer slip through. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
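The derived-state hook can be sketched in plain Python. The hook name `_rebuild_derived_state` and the attribute names `rht_matrix` and `_with_random_sign_mask` come from the commit; everything else, including the mask value, the `build_sign_pattern` function, and the use of a tuple in place of a tensor, is an assumption chosen so the sketch runs without torch.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def build_sign_pattern(mask: int, size: int = 4):
    """Hypothetical derived state: +/-1 signs decoded from an int mask.

    The real rht_matrix is a tensor (not hashable); a tuple stands in for it.
    """
    return tuple(-1 if (mask >> i) & 1 else 1 for i in range(size))


class SketchNVFP4Quantizer:
    def __init__(self, with_random_sign_mask=True):
        self._with_random_sign_mask = with_random_sign_mask
        self._rebuild_derived_state()

    def _value_items(self):
        # Only the scalar flag is part of the value; rht_matrix is derived from it.
        return (("_with_random_sign_mask", self._with_random_sign_mask),)

    def _rebuild_derived_state(self):
        mask = 0b0101 if self._with_random_sign_mask else 0  # made-up mask value
        self.rht_matrix = build_sign_pattern(mask)


def rebuild_with_hook(cls, items):
    """Restore value fields, then let the class recompute anything derived."""
    obj = cls.__new__(cls)
    for name, value in items:
        setattr(obj, name, value)
    hook = getattr(obj, "_rebuild_derived_state", None)
    if hook is not None:
        hook()
    return obj


original = SketchNVFP4Quantizer()
restored = rebuild_with_hook(SketchNVFP4Quantizer, original._value_items())
```

Without the hook call, `restored` would have no `rht_matrix` attribute at all, which mirrors the failure at copy()/quantize time described in the commit; with `lru_cache`, the rebuild is a cache hit.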

Move the ProcessGroup guard out of the (overridable) __fx_repr__ into Quantizer._value_key -- the single point every value-materialization path (__eq__/__hash__/__fx_repr__) goes through -- so a custom __fx_repr__ can no longer bypass it. Generalizes the old amax-only check to any field holding a ProcessGroup. Add a test that a value quantizer carrying a live group raises. Addresses review on NVIDIA#3152. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

…assthrough Replace the trivial pass-through fullgraph test with one that drives each production quantizer through a minimal custom op (quantize + dequantize) under torch.compile(fullgraph=True) and compares to eager -- so the opaque-type registration is actually exercised inside the graph (a graph break would make fullgraph=True raise). Op registration sits right before the test. Also drop stale comments referencing the old __fx_repr__-side process-group guard. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

…paque flag
- rht_matrix_random_sign_mask_t is a device-independent int derived from _with_random_sign_mask (the device only places a throwaway tensor); fix the misleading comment.
- Explain why registration uses a class attribute, not a registry set: is_value_opaque_quantizer is traced inside the compile graph and dynamo can bake a getattr constant but cannot do 'type(q) in set' on the opaque class.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

is_opaque_value_type(cls) sat between the import guard and the register_opaque_type guard, so on a partial/experimental opaque-object build it could raise RuntimeError/TypeError and crash TE import. Move it inside the same except so the 'registration never crashes import' promise holds for both calls. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

…m-mem zero-copy (NVIDIA#3035)
* Expert Parallelism: PyTorch wrapper + autograd ops with symm-mem zero-copy
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

…NS_PER_RANK (NVIDIA#3150)
* nccl with relax num_dispatch_tokens%64!=0
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Skip EP tests/examples on nodes without NVLink
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

…VIDIA#3141)
* Preserve fprop operands for dequantized backward override
  Signed-off-by: Evgeny <etsykunov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Add test_grouped_linear_backward_override_high_precision_forces_save_original_input test
  Signed-off-by: root <root@prenyx0017.a51.clusters.nvidia.com>
---------
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Signed-off-by: root <root@prenyx0017.a51.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: root <root@prenyx0017.a51.clusters.nvidia.com>

* Make quantized-tensor __repr__ fake-safe under torch.compile
  Under torch.compile, TE quantized-tensor __repr__ methods are invoked on FakeTensors during AOT autograd's structured logging. The repr bodies call self._scale_inv.item() and/or self.dequantize() (which dispatches to the raw C++ op tex.dequantize), both of which access a FakeTensor's data pointer and raise:
    RuntimeError: Cannot access data pointer of Tensor (e.g. FakeTensor, FunctionalTensor) ...
  This was the sole cause of six fp8 failures in tests/pytorch/test_torch_compile.py.
  Fix: add one shared helper, safe_quantized_repr, in tensor/_quantization_helpers.py (a safe leaf module importing only torch) that builds a metadata-only repr string. Each data-touching __repr__ now wraps its existing body in a try/except and falls back to the helper when the data cannot be materialized. The eager (non-fake) repr output is unchanged; only a fallback path is added.
  Wrapped reprs: Float8Tensor, Float8BlockwiseQTensor, MXFP8Tensor, NVFP4Tensor and their *Storage counterparts.
  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Make quantized __repr__ fallback universal, drop FakeTensor-specific logic
  Remove the FakeTensor-specific heuristic (_is_fake_data_access_error) and the warning path from safe_quantized_repr. The fallback is now a plain metadata-only repr triggered by any exception while materializing data, with each attribute access individually guarded so __repr__ never raises.
  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
  Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
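The repr fallback described in this commit can be sketched without torch. The helper name `safe_quantized_repr` is from the commit, but the version below (`safe_quantized_repr_sketch`), the metadata attributes it prints, and the toy tensor class are assumptions for illustration only.

```python
def safe_quantized_repr_sketch(obj, attrs=("shape", "dtype")):
    """Build a metadata-only repr; each attribute access is guarded individually."""
    parts = []
    for name in attrs:
        try:
            parts.append(f"{name}={getattr(obj, name)!r}")
        except Exception:
            parts.append(f"{name}=<unavailable>")
    return f"{type(obj).__name__}({', '.join(parts)})"


class SketchQuantizedTensor:
    """Toy stand-in for a quantized tensor; `fake` mimics a FakeTensor."""

    def __init__(self, data, shape, dtype, fake=False):
        self._data = data
        self.shape = shape
        self.dtype = dtype
        self._fake = fake

    def dequantize(self):
        if self._fake:
            # Mimics the error raised when touching a FakeTensor's storage.
            raise RuntimeError("Cannot access data pointer of Tensor")
        return list(self._data)

    def __repr__(self):
        try:
            # Existing eager body: needs real data.
            return f"{type(self).__name__}(data={self.dequantize()!r})"
        except Exception:
            # Any failure to materialize data falls back to metadata only.
            return safe_quantized_repr_sketch(self)


real = SketchQuantizedTensor([1.0, 2.0], shape=(2,), dtype="fp8")
fake = SketchQuantizedTensor(None, shape=(2,), dtype="fp8", fake=True)
```

The eager repr of `real` is unchanged by the wrapper, while `repr(fake)` returns a metadata-only string instead of raising.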

…` with `total_recv_tokens_per_rank` placeholder (NVIDIA#3154)
* versioning EP C configs
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Rename EP prepare token_counts to recv_tokens_per_expert
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Add total_recv_tokens_per_rank placeholder to nvte_ep_prepare
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Adapt PyTorch EP binding to versioned nvte_ep C config API
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Rename EP group config max_num_sms to num_comm_sms
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

Move the _VALUE_OPAQUE_FLAG setattr to the end of register_value_opaque_quantizer, after register_opaque_type succeeds (or the type is already opaque). Previously the flag was set up front, so is_value_opaque_quantizer reported True even when the opaque-object API was missing or registration raised, since both paths are swallowed. Eager value semantics (__eq__/__hash__/__fx_repr__) are independent of the flag, so this only tightens the predicate to mean torch actually knows the type as opaque. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

_check_value_has_no_process_group ran on every guard eval (via __eq__/__hash__) and scanned all of vars(self) recursively. The only attribute that can hold a ProcessGroup is the deprecated amax_reduction_group, so check it directly (O(1)) and drop the _contains_process_group helper. Same guarantee, off the hot path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

# Conflicts: # transformer_engine/pytorch/tensor/float8_blockwise_tensor.py # transformer_engine/pytorch/tensor/float8_tensor.py # transformer_engine/pytorch/tensor/mxfp8_tensor.py # transformer_engine/pytorch/tensor/nvfp4_tensor.py

Remove the a==b / hash / dict-key block that just exercised Python's own dict semantics; equality and hashing are still covered by the __fx_repr__ round-trip (rebuilt == a, hash match) and the bit-exact kernel check. other_kwargs is now unused, so drop it from the parametrization and both test signatures. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

kshitij12345 approved these changes Jun 9, 2026


Comment thread tests/pytorch/test_torch_compile.py

Comment thread transformer_engine/pytorch/dynamo/quantizer_opaque.py Outdated

pggPL force-pushed the remove_process_group_from_quantizers branch from e9097d6 to 948cd6d Compare June 16, 2026 12:23

pggPL requested a review from cyanguwa as a code owner June 16, 2026 12:23

pggPL force-pushed the remove_process_group_from_quantizers branch from b8c1bec to 6c9b986 Compare June 16, 2026 14:56

pggPL force-pushed the make_qunatizers_opaque branch from 33e9d73 to d341eeb Compare June 16, 2026 15:21

pggPL force-pushed the make_qunatizers_opaque branch 2 times, most recently from adc65f6 to c7bbc83 Compare June 29, 2026 07:33

pggPL and others added 7 commits June 29, 2026 11:25

pggPL force-pushed the make_qunatizers_opaque branch from c7bbc83 to f592cbb Compare June 29, 2026 09:26

pggPL changed the base branch from remove_process_group_from_quantizers to main June 29, 2026 09:26

pggPL closed this Jun 29, 2026

pggPL reopened this Jun 29, 2026

pggPL force-pushed the make_qunatizers_opaque branch from f592cbb to 945f62d Compare June 29, 2026 09:34

pggPL and others added 10 commits June 29, 2026 12:05

Reword opaque-flag comment: self-contained, no Linear reference

2c3c5df

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

phu0ngng and others added 7 commits June 30, 2026 19:45

Drop verbose comments around value-opaque flag stamping

9db604f

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

Shorten amax_reduction_group check comment

fe5e5db

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

pggPL wants to merge 25 commits into main from make_qunatizers_opaque


4 participants
