XNNPACK: Lift constant mul scalars for partitioning by mansnils · Pull Request #20515 · pytorch/executorch

mansnils · 2026-06-25T14:20:34Z

XNNPACK supports the tensor overload for multiply, but plain aten.mul.Scalar is not selected by the XNNPACK partitioner.

Adds a narrow scalar-lifting pass that rewrites aten.mul.Scalar to aten.mul.Tensor by registering the scalar as a small constant buffer. This avoids introducing an aten.full op while allowing the existing multiply lowering path to partition the op.

Keep SDPA scale multipliers as aten.mul.Scalar so ConvertToSDPAPass can still recover the attention scale before replacing the pattern. Add test coverage for that guard.

Allow the XNNPACK tester to pass transform passes through to to_edge_transform_and_lower. This keeps op tests on the same path as existing XNNPACK model tests that already use explicit transform passes.

For DeIT Tiny, this removes 24 portable aten.mul.Scalar nodes and reduces delegate count from 62 to 50. In current local timing checks the latency impact is modest: about 1% faster on both Android SME2 and the aarch64 XNNPACK/KleidiAI NEON-class host runner. These are modest uplifts but may introduce more opportunities for improvements.

cc @GregoryComer @digantdesai @cbilgin @freddan80 @per @zingo @oscarandersson8218 @Sebastian-Larsson @robell @rascani

XNNPACK supports the tensor overload for multiply, but plain aten.mul.Scalar is not selected by the XNNPACK partitioner. Adds a narrow scalar-lifting pass that rewrites aten.mul.Scalar to aten.mul.Tensor by registering the scalar as a small constant buffer. This avoids introducing an aten.full op while allowing the existing multiply lowering path to partition the op. Keep SDPA scale multipliers as aten.mul.Scalar so ConvertToSDPAPass can still recover the attention scale before replacing the pattern. Add test coverage for that guard. Allow the XNNPACK tester to pass transform passes through to to_edge_transform_and_lower. This keeps op tests on the same path as existing XNNPACK model tests that already use explicit transform passes. For DeIT Tiny, this removes 24 portable aten.mul.Scalar nodes and reduces delegate count from 62 to 50. In current local timing checks the latency impact is modest: about 1% faster on both Android SME2 and the aarch64 XNNPACK/KleidiAI NEON-class host runner. These are modest uplifts but may introduce more opportunities for improvements. Signed-off-by: Måns Nilsson <mans.nilsson@arm.com> Change-Id: I83b6ad53925edb72afdf0077b5dbb99b5d9c4648

pytorch-bot · 2026-06-25T14:20:39Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20515

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

[ROCm] MI350 CI runner label rename: rebase PRs using old linux.rocm.gpu.gfx950.* labels

❌ 4 New Failures, 4 Unrelated Failures, 1 Unclassified Failure

As of commit 2861e10 with merge base 82a605d ():

NEW FAILURES - The following jobs have failed:

pull / unittest / linux / linux-job (gh)
RuntimeError: Command docker exec -t 3e58ac4d0cec98995ef3c5a078a96d22f60153772b2e4a6e704d99c04fd8d686 /exec failed with exit code 1
pull / unittest-editable / linux / linux-job (gh)
RuntimeError: Command docker exec -t 8997a47eaaa8535bf85b05c4a58efc635313183bb5b18b3cfdd2ffc6ec0146e6 /exec failed with exit code 1
trunk / test-qnn-model (fp32, mv2) / linux-job (gh)
RuntimeError: Command docker exec -t eac48221f5c3aaa47b77e6496bf3d63dfa36c8a6b589d178d1a5eea00cd73c0e /exec failed with exit code 92
trunk / unittest-release / linux / linux-job (gh)
RuntimeError: Command docker exec -t 25407f2c096ddc2672d29ca56cc8c11679a7ab714454e9f240419ecfeabe9f64 /exec failed with exit code 1

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

pull / test-arm-backend-public-api-backward-compatibility / linux-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: Command docker exec -t b080c50ffb6763d14e257455dc166affdc1e5ab6bc853fb125998c8f23f2735c /exec failed with exit code 127

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / unittest / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
pull / unittest-editable / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
trunk / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (trunk failure)
trunk / unittest-release / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

mansnils · 2026-06-25T14:21:10Z

@claude review this PR

claude · 2026-06-25T14:21:56Z

Claude finished @mansnils's task in 9m 47s —— View job

PR Review: XNNPACK — Lift constant mul scalars for partitioning

Reviewed the diff against origin/main. Overall this is a clean, well-scoped change with good test coverage and a thoughtful SDPA guard. A few points worth considering before merge.

Correctness / design

Constant added as module buffer + get_attr rather than a lifted constant placeholder — lift_constant_scalar_operands_pass.py:55-76 uses register_buffer + graph.get_attr. In a lifted edge ExportedProgram, the idiomatic way to introduce a constant is create_constant_placeholder in backends/transforms/utils.py:98, which also registers it in exp_program.constants / the graph signature. Because this pass subclasses ExportPass (GraphModule-only call()), it has no access to the ExportedProgram, which is presumably why the buffer route was taken.

In practice XNNPACK serialization handles raw get_attr nodes (node_visitor.py:577 → get_param_tensor has a get-attr fallback at utils.py:89-94), so the delegated path works. The risk is the non-delegated path: if a mapped mul.Scalar is not partitioned, the orphaned get_attr has no graph-signature entry and the ExecuTorch emitter may not handle it. The dtype guard (point 3) makes this unlikely for mul, but it's worth confirming the XNNPACK CI job covers a case where the lifted mul falls back to portable. Worth a comment explaining why a buffer + get_attr is safe here.
_feeds_sdpa_qk_bmm passthrough set is narrow and coupled to SDPA internals (lift_constant_scalar_operands_pass.py:85-110). The traversal only walks through expand_copy/view_copy to reach a bmm. If the real SDPA graph has any other op (transpose/permute/to_copy) between the scale mul.Scalar and the QK^T bmm, the guard misses it, the scalar gets lifted, and ConvertToSDPAPass.get_scale (convert_to_sdpa.py:33-52) silently fails its assert scale is not None. The unit test only exercises the canonical pattern from partition/graphs/sdpa.py. Since this generic pass now encodes knowledge of the XNNPACK SDPA pattern, consider deriving the guard from the same sdpa.get_graphs() source, or at least add a test with a pattern that has an intervening op. A false positive (leaving a non-SDPA mul→bmm scalar unlifted) is only a missed optimization, so the asymmetry is in the safe direction — but a false negative breaks SDPA.

Tester change

transform_passes silently dropped when a prebuilt stage is passed (test/tester/tester.py, Tester.to_edge_transform_and_lower). In the else branch (caller supplies to_edge_and_lower_stage), partitioners and edge_compile_config are propagated onto the stage but transform_passes is not. Passing both a stage and transform_passes= would silently ignore the passes. Either propagate it (to_edge_and_lower_stage.transform_passes = transform_passes) for consistency, or document that they're mutually exclusive. Fix this →

Minor

self._modified can be a local (lift_constant_scalar_operands_pass.py:53,113,143,149-150). It's set/reset entirely within call() and never read by helpers, so a local modified would be simpler and avoids surprising state on a reused pass instance. Per CLAUDE.md ("choose simpler, more concise"), worth collapsing.
Production lowering path doesn't pick this up. get_transform_passes() now includes the pass, but the XNNPACK recipe (recipes/xnnpack_recipe_provider.py:103-106, LoweringRecipe) sets no edge_transform_passes, so the standard recipe-based export does not benefit — only callers that explicitly pass get_transform_passes() (tests, size_analysis_tool). If the DeIT delegate-count/latency wins are meant to reach normal users, consider wiring it into the LoweringRecipe. If opt-in is intentional for now, a one-line note in the PR description would help.
0-d constant tensor — torch.tensor(value, ...) (:66) produces a rank-0 tensor relying on mul.Tensor broadcasting. The end-to-end test_fp32_mul_scalar covers this, but it's worth confirming the XNNPACK CI job is green since 0-dim operands have historically been finicky in some backends.

Tests

test/passes/test_lift_constant_scalar_operands_pass.py nicely covers lift, unmapped-op passthrough, and the SDPA guard.
test_fp32_mul_scalar exercises the full export→serialize→run path. Consider adding a case that asserts a non-canonical SDPA-like graph still recovers scale, to lock in point 2.

Nothing here is blocking on its own; #1 and #2 are the items I'd most want addressed (or explicitly reasoned about in a comment) before merge.
· branch ops

Copilot

Pull request overview

This PR improves XNNPACK partitioning coverage by rewriting constant aten.mul.Scalar occurrences into aten.mul.Tensor with a lifted constant buffer, enabling existing multiply lowering to pick them up without introducing runtime aten.full. It also extends the XNNPACK test harness to allow passing explicit transform passes into to_edge_transform_and_lower, and adds tests to protect SDPA scale handling.

Changes:

Add LiftConstantScalarOperandsPass to lift constant scalar operands (currently focused on mul.Scalar) into small tensor buffers to enable XNNPACK partitioning.
Enable XNNPACK Tester.to_edge_transform_and_lower(...) to thread transform_passes through to executorch.exir.to_edge_transform_and_lower.
Add unit and op-test coverage for scalar lifting and for the SDPA scale guard behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
backends/xnnpack/utils/configs.py	Adds the scalar-lifting pass to the default XNNPACK transform pass pipeline.
backends/xnnpack/test/tester/tester.py	Extends the XNNPACK tester stage to forward explicit `transform_passes` into `to_edge_transform_and_lower`.
backends/xnnpack/test/passes/test_lift_constant_scalar_operands_pass.py	New unit tests validating scalar lifting and guarding SDPA scale multipliers.
backends/xnnpack/test/ops/test_multiply.py	Adds an op test to ensure mul-scalar gets delegated via the transform-pass path.
backends/xnnpack/_passes/lift_constant_scalar_operands_pass.py	Implements the scalar-to-tensor lifting pass with an SDPA-specific preservation guard.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

    def __init__(
        self,
        partitioners: Optional[List[Partitioner]] = None,
        edge_compile_config: Optional[EdgeCompileConfig] = None,
+        transform_passes: Optional[List[PassType]] = None,
    ):


+    def to_edge_transform_and_lower(
+        self,
+        to_edge_and_lower_stage: Optional[BaseStages.ToEdgeTransformAndLower] = None,
+        generate_etrecord: bool = False,
+        *,
+        partitioners: Optional[List[Partitioner]] = None,
+        edge_compile_config: Optional[EdgeCompileConfig] = None,
+        transform_passes: Optional[List[PassType]] = None,
+    ):
+        if to_edge_and_lower_stage is None:
+            to_edge_and_lower_stage = ToEdgeTransformAndLower(
+                partitioners=partitioners,
+                edge_compile_config=edge_compile_config,
+                transform_passes=transform_passes,
+            )
+        else:
+            if partitioners is not None:
+                to_edge_and_lower_stage.partitioners = partitioners
+            if edge_compile_config is not None:
+                to_edge_and_lower_stage.edge_compile_conf = edge_compile_config
+        return super().to_edge_transform_and_lower(
+            to_edge_and_lower_stage,
+            generate_etrecord=generate_etrecord,
+        )


 from executorch.backends.xnnpack.utils.configs import get_xnnpack_edge_compile_config
-from executorch.exir import EdgeCompileConfig
+from executorch.exir import EdgeCompileConfig, to_edge_transform_and_lower
 from executorch.exir.backend.partitioner import Partitioner
 from torch._export.pass_base import PassType
+from torch.export import ExportedProgram
 from torchao.quantization.pt2e.quantizer import Quantizer


+            torch.export.export(module, (torch.randn(2, 3),), strict=True),
+            compile_config=get_xnnpack_edge_compile_config(skip_dim_order=True),
+        )
+        return edge.transform([LiftConstantScalarOperandsPass()]).exported_program()


mansnils requested a review from digantdesai as a code owner June 25, 2026 14:20

Copilot AI review requested due to automatic review settings June 25, 2026 14:20

mansnils added module: xnnpack Issues related to xnnpack delegation and the code under backends/xnnpack/ partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm ciflow/trunk release notes: xnnpack Changes to the XNNPack backend delegate labels Jun 25, 2026

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 25, 2026

Copilot started reviewing on behalf of mansnils June 25, 2026 14:21 View session

Copilot AI reviewed Jun 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

XNNPACK: Lift constant mul scalars for partitioning#20515

XNNPACK: Lift constant mul scalars for partitioning#20515
mansnils wants to merge 1 commit into
pytorch:mainfrom
mansnils:ops

mansnils commented Jun 25, 2026 •

edited by pytorch-bot Bot

Loading

Uh oh!

pytorch-bot Bot commented Jun 25, 2026 •

edited

Loading

Uh oh!

mansnils commented Jun 25, 2026

Uh oh!

claude Bot commented Jun 25, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

mansnils commented Jun 25, 2026 • edited by pytorch-bot Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20515

❗ 1 Active SEVs

❌ 4 New Failures, 4 Unrelated Failures, 1 Unclassified Failure

Uh oh!

mansnils commented Jun 25, 2026

Uh oh!

claude Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: XNNPACK — Lift constant mul scalars for partitioning

Correctness / design

Tester change

Minor

Tests

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mansnils commented Jun 25, 2026 •

edited by pytorch-bot Bot

Loading

pytorch-bot Bot commented Jun 25, 2026 •

edited

Loading

claude Bot commented Jun 25, 2026 •

edited

Loading