[PyTorch][torch.compile] Add TensorProto mechanism by pggPL · Pull Request #8 · pggPL/TransformerEngine

pggPL · 2026-06-06T14:52:17Z

Description

This PR introduces TensorProto — a data-free prototype of a tensor (or quantized tensor) that captures everything needed to reason about and rebuild a tensor without holding any storage: its logical shape/dtype and, for quantized tensors, the value-opaque quantizer defining the layout.

The key property is that TensorProto.create_tensor() materializes a quantized tensor purely in Python (via Quantizer.alloc_tensors + the storage's __tensor_unflatten__), so it traces under torch.compile(fullgraph=True) with no graph break — unlike make_empty, which goes through the opaque C++ tex.create_empty_quantized_tensor. This is the foundation for writing torch.library custom-op fake implementations of quantized ops.

This builds on the value-opaque quantizer work (so a TensorProto is itself safe to treat as a compile-time constant).
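
To make the idea of a data-free prototype concrete, here is a minimal sketch in plain Python. It is not the PR's TensorProto: the names ProtoSketch and linear_forward_proto are hypothetical, and strings stand in for torch.dtype and for the quantizer so the example runs without PyTorch. It only illustrates how an output description can be derived from input descriptions without allocating any storage.

```python
from dataclasses import dataclass
from math import prod
from typing import Optional, Tuple


@dataclass(frozen=True)  # frozen => hashable, so it can act like a constant
class ProtoSketch:
    """Hypothetical storage-free description of a tensor (not the PR's class)."""

    shape: Tuple[int, ...]
    dtype: str                       # assumption: stand-in for torch.dtype
    quantizer: Optional[str] = None  # assumption: stand-in for a quantizer object
    requires_grad: bool = False
    device: str = "cpu"

    @property
    def is_quantized(self) -> bool:
        return self.quantizer is not None

    def numel(self) -> int:
        return prod(self.shape)


def linear_forward_proto(x: ProtoSketch, w: ProtoSketch) -> ProtoSketch:
    """Describe the output of y = x @ w.T using metadata only."""
    if x.shape[-1] != w.shape[-1]:
        raise ValueError(f"in_features mismatch: {x.shape[-1]} vs {w.shape[-1]}")
    return ProtoSketch(
        shape=x.shape[:-1] + (w.shape[0],),
        dtype=x.dtype,
        requires_grad=x.requires_grad or w.requires_grad,
        device=x.device,
    )


x = ProtoSketch(shape=(4, 8, 16), dtype="bfloat16")
w = ProtoSketch(shape=(32, 16), dtype="bfloat16", quantizer="fp8-sketch", requires_grad=True)
y = linear_forward_proto(x, w)
print(y.shape, y.requires_grad, w.is_quantized)
```

A fake implementation in this style never touches data, which is the property a custom-op fake impl relies on during tracing.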

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

- dynamo.py: Add TensorProto dataclass (shape, dtype, quantizer, requires_grad, device) with is_quantized, inner_names(), create_metadata() and create_tensor(), plus a to_tensor_proto() helper that builds a proto from a plain torch.Tensor or a QuantizedTensorStorage/QuantizedTensor.
- quantized_tensor.py:
  - Add the PyTorch wrapper-subclass flatten protocol (__tensor_flatten__ / __tensor_unflatten__) to QuantizedTensorStorage, driven by a per-class _FLATTEN_TENSOR_BUFFERS declaration of (attribute_name, constructor_kwarg) pairs.
  - Add a _STORAGE_REGISTRY (populated via __init_subclass__) so __tensor_unflatten__ can resolve a concrete storage/wrapper class from its qualname inside an FX graph.
  - Add pure-Python, traceable allocation hooks to Quantizer: alloc_tensors, create_metadata, and the opt-in overrides _describe_buffers, _storage_scalars, _resolve_storage_cls.
- Quantizers: Implement the allocation hooks for Float8CurrentScalingQuantizer, MXFP8Quantizer and Float8BlockQuantizer.
- Storage classes: Declare _FLATTEN_TENSOR_BUFFERS for Float8TensorStorage, MXFP8TensorStorage and Float8BlockwiseQTensorStorage.
- ops/basic/basic_linear.py: Add allocation-free _functional_forward_fake / _functional_backward_fake that operate on TensorProto and return output/gradient protos, as a basis for custom-op fake impls (single-device only; TP/SP shape effects not yet modeled).
- Tests: Add tests/pytorch/test_tensor_proto.py (CPU smoke tests for _describe_buffers/alloc_tensors/create_metadata, flatten round-trip, and to_tensor_proto) and torch.compile fullgraph tests in test_torch_compile.py.

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Squashed PR #8 (tensor_proto_mechanism) onto the rebased base. Adds TensorProto (pure-Python, torch.compile-traceable quantized-tensor allocation via Quantizer.alloc_tensors + storage __tensor_flatten__/__tensor_unflatten__), Linear fake fwd/bwd impls for the custom-op path, and tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

kshitij12345

Overall looks good

Would it be possible to reduce the duplication between _linear_forward_impl_fake and _linear_forward_impl?

The cached FP8 weight is the same tensor returned as new_weight_workspace (cache miss) or passed in as weight_workspace (cache hit). A custom op may not return a tensor that aliases an input or another return, so mark those slots and reconstruct wt_save in _linear_setup_ctx instead of saving it twice. Mirrored in the fake impl so the saved-slot layout matches.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
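
The mark-and-reconstruct approach from this commit message can be illustrated with a small plain-Python sketch. Everything here is hypothetical (SavedAlias, forward_sketch, setup_ctx_sketch, and the tuple standing in for a quantized weight); it only shows the pattern of returning a marker in place of an object that would alias an input or another output, then resolving the marker afterwards.

```python
from enum import Enum
from typing import Any, List, Optional, Tuple


class SavedAlias(Enum):
    """Hypothetical markers for saved slots that would alias another value."""

    INPUT_WORKSPACE = "weight_workspace"      # cache hit: same object as the input
    NEW_WORKSPACE = "new_weight_workspace"    # cache miss: same object as an output


def forward_sketch(weight: Any, weight_workspace: Optional[Any]) -> Tuple[Optional[Any], List[Any]]:
    """Return (new_weight_workspace, saved_slots) without duplicating any object."""
    if weight_workspace is None:
        # Cache miss: build the workspace once and return it exactly once.
        new_workspace = ("quantized", weight)  # assumption: stand-in for an FP8 weight
        saved = [SavedAlias.NEW_WORKSPACE]
    else:
        # Cache hit: nothing new to return; the saved weight is the input itself.
        new_workspace = None
        saved = [SavedAlias.INPUT_WORKSPACE]
    return new_workspace, saved


def setup_ctx_sketch(weight_workspace: Optional[Any], new_workspace: Optional[Any], saved: List[Any]) -> List[Any]:
    """Replace each marker with the object it refers to."""
    resolved = []
    for item in saved:
        if item is SavedAlias.INPUT_WORKSPACE:
            resolved.append(weight_workspace)
        elif item is SavedAlias.NEW_WORKSPACE:
            resolved.append(new_workspace)
        else:
            resolved.append(item)
    return resolved


# Cache miss: the saved weight resolves to the freshly returned workspace.
miss_ws, miss_saved = forward_sketch([1.0, 2.0], None)
miss_wt = setup_ctx_sketch(None, miss_ws, miss_saved)[0]

# Cache hit: the saved weight resolves to the workspace that was passed in.
cached = ("quantized", [1.0, 2.0])
hit_ws, hit_saved = forward_sketch([1.0, 2.0], cached)
hit_wt = setup_ctx_sketch(cached, hit_ws, hit_saved)[0]
print(miss_wt is miss_ws, hit_wt is cached)
```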

NVFP4Quantizer._describe_buffers grouped each amax right after its scale (per-usage), diverging from NVFP4TensorStorage._FLATTEN_TENSOR_BUFFERS (amax buffers last). The order is functionally irrelevant (buffers are consumed by name in alloc_tensors and reordered in TensorProto.inner_names), but aligning it makes describe/flatten agree and fixes test_to_tensor_proto_quantized[nvfp4].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

…upport

- TensorProto.inner_names now raises if the quantizer describes buffer(s) absent from the storage's _FLATTEN_TENSOR_BUFFERS, instead of silently appending them.
- Gate the nvfp4 proto-quantizer param on nvfp4_available so it skips on hardware without NVFP4 support rather than failing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
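
The ordering and validation behavior described in the two commits above might look something like the sketch below. It is an assumption-laden illustration rather than the PR's TensorProto.inner_names: the function name inner_names_sketch and the buffer names are hypothetical. It reorders the quantizer's described buffers into the storage's flatten order and raises on any buffer the storage does not declare.

```python
from typing import Iterable, List, Sequence, Tuple


def inner_names_sketch(
    described: Iterable[str],
    flatten_buffers: Sequence[Tuple[str, str]],
) -> List[str]:
    """Return described buffer names in the storage's declared flatten order."""
    described = list(described)
    flatten_order = [attr for attr, _ in flatten_buffers]
    unknown = [name for name in described if name not in flatten_order]
    if unknown:
        # Fail loudly instead of silently appending undeclared buffers.
        raise ValueError(f"buffers not declared by the storage: {unknown}")
    wanted = set(described)
    return [name for name in flatten_order if name in wanted]


# Hypothetical storage declaration: amax buffers come last.
FLATTEN = (
    ("_rowwise_data", "rowwise_data"),
    ("_rowwise_scale_inv", "rowwise_scale_inv"),
    ("_columnwise_data", "columnwise_data"),
    ("_columnwise_scale_inv", "columnwise_scale_inv"),
    ("_amax_rowwise", "amax_rowwise"),
    ("_amax_columnwise", "amax_columnwise"),
)

# Hypothetical quantizer description: each amax grouped right after its scale.
per_usage = [
    "_rowwise_data", "_rowwise_scale_inv", "_amax_rowwise",
    "_columnwise_data", "_columnwise_scale_inv", "_amax_columnwise",
]

ordered = inner_names_sketch(per_usage, FLATTEN)
print(ordered)

try:
    inner_names_sketch(["_rowwise_data", "_bogus"], FLATTEN)
    raised = False
except ValueError:
    raised = True
print(raised)
```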

…escribe_buffers

Access NVFP4Quantizer @staticmethods (convert_shape_for_fp4, get_columnwise_shape) via the class instead of the instance. Under torch.compile, instance access of a @staticmethod on a value-opaque object crashes Dynamo guard generation with "'function' object has no attribute '__func__'" (pytorch/pytorch#182741). Temporary workaround until the PyTorch-side fix lands.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
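
The difference between the two access styles mentioned in this commit can be shown in plain Python. The class below is a hypothetical stand-in, and the body of convert_shape_for_fp4 (halving the last dimension) is an assumption made for illustration, not the library's implementation. In eager Python both styles return the same result; the commit switches to class access only to sidestep the Dynamo issue cited above.

```python
from typing import Tuple


class QuantizerSketch:
    """Hypothetical stand-in for a quantizer with a static shape helper."""

    @staticmethod
    def convert_shape_for_fp4(shape: Tuple[int, ...]) -> Tuple[int, ...]:
        # Assumption: two 4-bit values are packed into each byte of the last dim.
        return shape[:-1] + (shape[-1] // 2,)

    def describe_via_instance(self, shape: Tuple[int, ...]) -> Tuple[int, ...]:
        # Instance access: the pattern the commit moves away from.
        return self.convert_shape_for_fp4(shape)

    def describe_via_class(self, shape: Tuple[int, ...]) -> Tuple[int, ...]:
        # Class access: the workaround pattern described in the commit.
        return QuantizerSketch.convert_shape_for_fp4(shape)


q = QuantizerSketch()
via_instance = q.describe_via_instance((4, 16))
via_class = q.describe_via_class((4, 16))
print(via_instance, via_class)
```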

pggPL force-pushed the make_qunatizers_opaque branch from 33e9d73 to d341eeb Compare June 16, 2026 15:21

pggPL requested a review from cyanguwa as a code owner June 16, 2026 15:21

pggPL force-pushed the tensor_proto_mechanism branch from 2cccc30 to 2e252f9 Compare June 16, 2026 15:31

pggPL force-pushed the tensor_proto_mechanism branch from 2e252f9 to ba92f5b Compare June 16, 2026 16:05

pggPL force-pushed the tensor_proto_mechanism branch from ba92f5b to b1273ea Compare June 16, 2026 16:12

kshitij12345 reviewed Jun 22, 2026

Comment thread transformer_engine/pytorch/dynamo/tensor_proto.py Outdated

Comment thread tests/pytorch/test_torch_compile.py Outdated

pggPL force-pushed the make_qunatizers_opaque branch from e4a879b to adc65f6 Compare June 29, 2026 07:14

pggPL force-pushed the tensor_proto_mechanism branch from 85355a6 to c1e40b2 Compare June 29, 2026 07:16

pggPL force-pushed the make_qunatizers_opaque branch from adc65f6 to c7bbc83 Compare June 29, 2026 07:33

pggPL force-pushed the tensor_proto_mechanism branch from c1e40b2 to e760487 Compare June 29, 2026 07:34

pggPL force-pushed the make_qunatizers_opaque branch from c7bbc83 to f592cbb Compare June 29, 2026 09:26

pggPL force-pushed the tensor_proto_mechanism branch from e760487 to 50d5c21 Compare June 29, 2026 09:26

pggPL force-pushed the make_qunatizers_opaque branch from f592cbb to 945f62d Compare June 29, 2026 09:34

pggPL force-pushed the tensor_proto_mechanism branch from 50d5c21 to da709e7 Compare June 29, 2026 09:35

pggPL force-pushed the tensor_proto_mechanism branch 3 times, most recently from 5131ebc to 77831be Compare June 29, 2026 10:24

pggPL force-pushed the tensor_proto_mechanism branch from 77831be to 29e5245 Compare June 29, 2026 12:47

pggPL force-pushed the tensor_proto_mechanism branch from 29e5245 to 99c1377 Compare June 29, 2026 13:10

pggPL force-pushed the tensor_proto_mechanism branch from 99c1377 to afa86ff Compare June 29, 2026 13:29

pggPL force-pushed the tensor_proto_mechanism branch from afa86ff to 9e78a6c Compare June 29, 2026 13:30

pggPL and others added 4 commits June 29, 2026 15:45

pggPL force-pushed the tensor_proto_mechanism branch from 9e78a6c to 50c11cd Compare June 29, 2026 13:46

pggPL wants to merge 5 commits into make_qunatizers_opaque from tensor_proto_mechanism

2 participants
