[PyTorch] Add torch.compile custom-op path for Linear #9
2cccc30 to 2e252f9
Squashed PR #9 (linear_compile) onto the rebased base. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
2e252f9 to ba92f5b
9a81f4a to dcf5b2f
ba92f5b to b1273ea
kshitij12345
left a comment
TODO: Review the custom_op.py file.
    torch._dynamo.reset()
    # dynamic=False for now: a symbolic shape would land in an OpaqueValueBundle
    # (value-opaque op arg) whose hash chokes on non-nested SymInt. Force static
    # shapes (recompile per shape) until the bundle handles symbolic shapes.
"dynamic=False for now": what happens with dynamic=True? Is it a hard error or a graph break?
    if isinstance(input_quantizer, NVFP4Quantizer):
        rht_matrix = get_rht_matrix(
            input_quantizer._with_random_sign_mask, inp.device.index
        )
Curious to know: are these still required if the warm-up already allocates these tensors? As I understand it, the point is to keep them out of the CUDA graph's private memory pool, so that CUDAGraph Trees doesn't error out (or silently ignore the problem) because an allocated tensor was not returned.
    _TE_OP_NAMESPACE = "transformer_engine_compile"
Optional: it would probably be good to have a comment explaining the abstractions in this file, their purpose, and how they fit together.
    """Whether ``value`` may be stored inside an instance (recursive)."""
    if isinstance(value, cls.PRIMITIVE_TYPES):
        return True
    if isinstance(value, Enum):

    # Workaround for PyTorch issue: FxGraphCachePickler handles FakeScriptObject
    # but not the real ProcessGroup that appears in example_inputs at inductor
    # compile time. Register a copyreg reducer so the pickler can hash the key.
Note to self: Should this be supported by PyTorch?
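For readers unfamiliar with the technique named in the workaround comment, here is a minimal standard-library sketch of registering a copyreg reducer so that an otherwise unpicklable object can be serialized into a stable key. `HypotheticalProcessGroup` and its fields are assumptions made for illustration; this is not the PR's reducer and says nothing about how `FxGraphCachePickler` is wired internally.

```python
import copyreg
import pickle
import threading


class HypotheticalProcessGroup:
    """Stand-in for an object that the default pickler cannot handle."""

    def __init__(self, name, ranks):
        self.name = name
        self.ranks = ranks
        # A lock cannot be pickled, which mimics an opaque native handle.
        self._handle = threading.Lock()


def _reduce_process_group(pg):
    # Reduce to stable identity fields only. Unpickling yields a plain
    # tuple rather than a live group, which is enough for a cache key.
    return tuple, (("process_group", pg.name, pg.ranks),)


# Register the reducer globally; pickle consults copyreg's dispatch table
# before falling back to the object's own __reduce_ex__.
copyreg.pickle(HypotheticalProcessGroup, _reduce_process_group)

pg = HypotheticalProcessGroup("tp", (0, 1))
key_bytes = pickle.dumps(pg)
```

The reducer deliberately drops the native handle, so the pickled form is only suitable for hashing or cache lookup, not for reconstructing a usable group.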
    _register_autograd_for_op(
        fwd_op=inner_fwd_def, bwd_op_name=inner_bwd_name, **autograd_common
    )
    _register_autograd_for_op(
Note to self: does the outer op require a grad rule?
        """
        raise NotImplementedError

    def pack(self, owner: Any) -> List[Tuple[str, Any]]:
I think this should be called flatten (or maybe unpack). The name pack suggests that we are packing several objects into a single object, but this seems to do the opposite.
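To make the naming point concrete, here is a small sketch of what a flatten/unflatten pair with the same return type as `pack` might look like. Everything in it (`flatten`, `unflatten`, `HypotheticalOwner`, the field names) is hypothetical and only illustrates the direction of the operation; it is not the PR's implementation.

```python
from typing import Any, List, Tuple


class HypotheticalOwner:
    """Illustrative object holding two named values."""

    def __init__(self, scale: float, amax: float):
        self.scale = scale
        self.amax = amax


def flatten(owner: Any, field_names: List[str]) -> List[Tuple[str, Any]]:
    # One object in, many (name, value) pairs out: the "opposite of packing".
    return [(name, getattr(owner, name)) for name in field_names]


def unflatten(owner: Any, pairs: List[Tuple[str, Any]]) -> None:
    # Inverse direction: write the pairs back onto a single object.
    for name, value in pairs:
        setattr(owner, name, value)


pairs = flatten(HypotheticalOwner(2.0, 4.0), ["scale", "amax"])
restored = HypotheticalOwner(0.0, 0.0)
unflatten(restored, pairs)
```

Under this reading, the method in the diff that returns `List[Tuple[str, Any]]` plays the role of `flatten`.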
29e5245 to 99c1377
9e78a6c to 50c11cd
…stants; fix SP memory leak; test suite hook-up

Wrap CommOverlapCore pybind11 methods that return compile-time constants so torch.compile(fullgraph=True) can trace through them without graph breaks:
- `is_fp8_ubuf()` → `ub_is_fp8()` / `get_ub_is_fp8()` in base.py; `_ub_is_fp8()` in gemm.py
- `with_cublasmp()` → `ub_is_cublasmp()` in base.py

All callers in linear.py, layernorm_linear.py, layernorm_mlp.py, base.py, gemm.py, userbuffers_backward_linear.py and userbuffers_forward_linear.py updated.

Fix quantized grad_output not being freed early for column-parallel SP backward. Row-parallel SP already called clear_tensor_data(grad_output) to release the gathered tensor; column-parallel SP quantizes grad_output to Float8TensorStorage but never freed it before returning. Under torch.compile reduce-overhead this leaves 3 live pool tensors at recording end and triggers "Detected 3 tensor(s) in the cudagraph pool not tracked as outputs". Extend the existing clear_tensor_data guard to cover both parallel modes.

Fix custom-recipe quantizer state being re-initialised on every forward call even when the recipe object has not changed. The existing early-exit for CustomRecipeState was missing an identity check on the recipe object, so any repeated call with the same recipe would bypass the early-return and rebuild quantizers unnecessarily. Add `if recipe_state.recipe is recipe: return` to restore the intended caching behaviour.

Add test_torch_compile.py to L0_pytorch_unittest so the autocast and existing compile tests run in CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
(cherry picked from commit bfce3a7)
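The recipe identity check described in the commit message above can be sketched in plain Python. This is an illustration only: `HypotheticalRecipe`, `HypotheticalRecipeState`, `HypotheticalModule` and `init_recipe_state` are made-up names, and the real quantizer setup is replaced by a placeholder list.

```python
class HypotheticalRecipe:
    """Placeholder for a user-supplied custom recipe object."""


class HypotheticalRecipeState:
    def __init__(self, recipe):
        self.recipe = recipe
        # Stand-in for the expensive quantizer construction.
        self.quantizers = ["q_input", "q_weight"]


class HypotheticalModule:
    def __init__(self):
        self.recipe_state = None
        self.rebuilds = 0

    def init_recipe_state(self, recipe):
        state = self.recipe_state
        # Identity check: the very same recipe object means the cached
        # state is still valid, so return early without rebuilding.
        if state is not None and state.recipe is recipe:
            return
        self.recipe_state = HypotheticalRecipeState(recipe)
        self.rebuilds += 1


module = HypotheticalModule()
recipe = HypotheticalRecipe()
module.init_recipe_state(recipe)
first_state = module.recipe_state
module.init_recipe_state(recipe)  # same object: no rebuild
same_state = module.recipe_state
module.init_recipe_state(HypotheticalRecipe())  # new object: rebuild
```

Using `is` rather than `==` keeps the check cheap and avoids depending on whether a recipe class defines equality; the trade-off is that an equal but distinct recipe object still triggers a rebuild.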
for more information, see https://pre-commit.ci (cherry picked from commit afe364b)
ToyLinear now overrides get_quantizer_roles so CustomRecipeState doesn't hit the no-roles warning, which graph-breaks under fullgraph=True. qfactory dispatches on role.tensor_type instead of a pre-baked string key. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> (cherry picked from commit 22f80e4)
…tom_op

Replace the low-level torch.library.Library (_TE_LIB.define/.impl + functional register_fake/register_autograd/register_torch_dispatch with lib=) with the standard torch.library.custom_op API, passing the dynamically built schema explicitly via schema=. register_fake/register_autograd/register_torch_dispatch are now methods on the returned CustomOpDef. Drops the TOR901 Library usage and is robust to re-registration (get_library_allowing_overwrite).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
(cherry picked from commit d590560)
…e sentinel Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Build a torch.compile custom-op framework in dynamo.py that traces Linear forward+backward as single graph nodes (no graph break into the eager autograd.Function):
Tests: test_te_linear_compiles (bf16 + every recipe), quantized FP8 weight input. Backward through a Float8Tensor output is a strict xfail (AOTAutograd demands a subclass cotangent and the linear backward has no FP8-cotangent path).