[ExecuTorch][WebGPU] SDPA: skip QK contraction for fully-masked causal tiles by JulianCloudNTH · Pull Request #20492 · pytorch/executorch

JulianCloudNTH · 2026-06-24T19:53:38Z

Stack from ghstack (oldest at bottom):

Skip the QK contraction for fully-masked causal tiles. At S=128 prefill, ~48% of the (query, key) tiles lie entirely above the diagonal and contribute nothing; this change elides their dot products (prefill-only; bit-identical output).

Problem: For causal prefill, ~half the (query S-tile, key context-tile) pairs are entirely above the diagonal, yet the kernel still computes their full `d4` dot product before masking the result to `NEG_INF`.
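
The share of fully-masked tiles can be checked by counting tile pairs directly. The following is a minimal sketch, not the kernel's code: the function name `fully_masked_fraction` is hypothetical, and the 4x4 tile shape (`TM=4`, `TN=4`) is an assumption, since the description does not state the tile size.

```python
def fully_masked_fraction(S, context_len, input_pos, TM, TN):
    """Fraction of (query tile, key tile) pairs entirely above the causal diagonal."""
    total = 0
    masked = 0
    for s0 in range(0, S, TM):
        for c0 in range(0, context_len, TN):
            total += 1
            # Fully masked: the tile's first key comes after its last query row.
            if c0 > s0 + (TM - 1) + input_pos:
                masked += 1
    return masked / total


# Assumed tile shape: 4x4 (not stated in the PR description).
frac = fully_masked_fraction(S=128, context_len=128, input_pos=0, TM=4, TN=4)
print(f"{frac:.3f}")  # 496 of 1024 tiles -> 0.484
```

Under that assumed tile shape the count comes to 496 of 1024 tiles, which is in line with the ~48% quoted above; larger tiles give a somewhat smaller share.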

Solution: Skip the contraction for fully-masked tiles; the existing per-element mask still writes the sentinel:

- Before: every `(s0, c0)` tile runs the full `d4` dot-product loop, then `store_qk` masks above-diagonal elements to `NEG_INF`.
- After: a fully-masked tile (`c0 > s0 + (TM - 1) + input_pos`) breaks the `d4` loop immediately (`acc` stays 0); `store_qk` masks every element to `NEG_INF` exactly as before.
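
To see why the single tile-level comparison is enough, it can be compared against a brute-force check of every element in the tile. This is an illustrative sketch: the helper names are hypothetical, and the per-element rule `c > s + input_pos` is the usual causal mask, assumed here to match what `store_qk` applies.

```python
def element_masked(s, c, input_pos):
    # Query row s sits at absolute position s + input_pos and may not
    # attend to any key position after it (assumed causal rule).
    return c > s + input_pos


def tile_fully_masked(s0, c0, TM, input_pos):
    # Tile-level predicate from the PR description.
    return c0 > s0 + (TM - 1) + input_pos


def brute_force_fully_masked(s0, c0, TM, TN, input_pos):
    # True only if every (query, key) element of the tile is masked.
    return all(
        element_masked(s, c, input_pos)
        for s in range(s0, s0 + TM)
        for c in range(c0, c0 + TN)
    )


def predicates_agree(TM=4, TN=4, max_pos=8, n_tiles=6):
    # Exhaustively compare both predicates over a small grid of tiles.
    for input_pos in range(max_pos):
        for s0 in range(0, TM * n_tiles, TM):
            for c0 in range(0, TN * n_tiles, TN):
                fast = tile_fully_masked(s0, c0, TM, input_pos)
                slow = brute_force_fully_masked(s0, c0, TM, TN, input_pos)
                if fast != slow:
                    return False
    return True


print(predicates_agree())
```

The two agree because the least-masked element of a tile is its first key paired with its last query row; if that element is masked, every other element is too.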

Implementation:

- Add `skip_tile = c0 > s0 + (TM - 1) + params.input_pos`, folded into the `d4` loop break condition.
- The store loop is unchanged and runs unconditionally, so no scratch entry is left stale.
- Mirrors Vulkan `sdpa_compute_attn_weights_tiled.glsl` (`tile_in_mask_region`).
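
The control flow described above can be emulated on the CPU to check the bit-identical claim. This is a rough Python sketch, not the WGSL shader: the function `qk_tiled`, the `skip_masked_tiles` flag, the omission of any scaling factor, and the assumption that `S`, the context length, and the head dimension divide evenly by the tile sizes are all illustrative.

```python
import random

NEG_INF = -1.0e30  # finite sentinel, as in the PR description


def qk_tiled(q, k, input_pos, TM, TN, skip_masked_tiles):
    S, C, D = len(q), len(k), len(q[0])
    out = [[None] * C for _ in range(S)]  # None marks "never written"
    dot_chunks = 0  # number of 4-wide chunks actually contracted
    for s0 in range(0, S, TM):
        for c0 in range(0, C, TN):
            skip_tile = skip_masked_tiles and c0 > s0 + (TM - 1) + input_pos
            acc = [[0.0] * TN for _ in range(TM)]
            for d4 in range(0, D, 4):
                if skip_tile:
                    break  # fully masked: leave acc at 0
                dot_chunks += 1
                for i in range(TM):
                    for j in range(TN):
                        for d in range(d4, d4 + 4):
                            acc[i][j] += q[s0 + i][d] * k[c0 + j][d]
            # The store loop runs for every tile, skipped or not.
            for i in range(TM):
                for j in range(TN):
                    s, c = s0 + i, c0 + j
                    out[s][c] = NEG_INF if c > s + input_pos else acc[i][j]
    return out, dot_chunks


random.seed(0)
S = C = 16
D = 8
q = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(S)]
k = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(C)]
baseline, full_work = qk_tiled(q, k, 0, 4, 4, skip_masked_tiles=False)
skipped, less_work = qk_tiled(q, k, 0, 4, 4, skip_masked_tiles=True)
print(baseline == skipped, full_work, less_work)
```

Because the store loop still visits skipped tiles, every output slot is written, and the values in a skipped tile are all overwritten with the sentinel regardless of what `acc` holds. Only the amount of contraction work differs between the two runs.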

Constraints:

- No KV-cache, host, dispatch, or uniform change (all tiles still launch; the skip is in-shader).
- Prefill-only: decode (`S=1`, so `s0 = 0`) never triggers it, since `c0 <= input_pos <= input_pos + TM - 1`.
- `NEG_INF` stays the WGSL-safe `-1.0e30` (WGSL forbids a literal `-inf`); it does not copy Vulkan's `-1.0/0.0`.
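
The prefill-only claim can be checked by enumerating decode steps. This is a small sketch with a hypothetical helper name, assuming that a decode step at `input_pos` sees keys `0..input_pos` and that its single query row lives in the tile starting at `s0 = 0`.

```python
def decode_ever_skips(TM, TN, max_input_pos):
    """Return True if any decode step (S=1) would treat a key tile as fully masked."""
    for input_pos in range(max_input_pos + 1):
        context_len = input_pos + 1  # assumption: keys 0..input_pos are visible
        s0 = 0  # the single query row is in the first (only) S-tile
        for c0 in range(0, context_len, TN):
            if c0 > s0 + (TM - 1) + input_pos:
                return True
    return False


print(decode_ever_skips(TM=4, TN=4, max_input_pos=64))
```

Every key tile start satisfies `c0 <= input_pos`, so the skip predicate cannot fire during decode for any tile height `TM >= 1`.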

Co-authored with Claude Code.
@exported-using-ghexport

Differential Revision: D109517773

[ghstack-poisoned]

pytorch-bot · 2026-06-24T19:53:42Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20492

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

❌ 2 New Failures, 1 Unrelated Failure

As of commit 173487b with merge base e03f777:

NEW FAILURES - The following jobs have failed:

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh)
RuntimeError: Command docker exec -t 7071cd2c3ce3f3047964eccadfece7c7607b320effc835909074470b1b678fe6 /exec failed with exit code 137
pull / unittest-buck / linux / linux-job (gh)
RuntimeError: Command docker exec -t 6e24395ccd72e7ad6cd56fae0ec41acbc82501c76e7eb56617de5c9478458df1 /exec failed with exit code 3

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / unittest-buck / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 3

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-24T19:54:27Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

[ghstack-poisoned]

SS-JIA

Review automatically exported from Phabricator review in Meta.

…l tiles
Pull Request resolved: #20492
ghstack-source-id: 396792509
@exported-using-ghexport
Differential Revision: [D109517773](https://our.internmc.facebook.com/intern/diff/D109517773/)

Update

40acc76

[ghstack-poisoned]

meta-cla Bot added the CLA Signed label Jun 24, 2026

JulianCloudNTH mentioned this pull request Jun 24, 2026

[ExecuTorch][WebGPU] Register-tile the SDPA QK/AV kernels #20405

Merged

JulianCloudNTH temporarily deployed to cadence June 24, 2026 19:53 — with GitHub Actions Inactive

This was referenced Jun 24, 2026

[ExecuTorch][WebGPU] Coalesce SDPA AV V-cache reads along contiguous head-dim #20459

Merged

[ExecuTorch][WebGPU] SDPA: branchless aligned/tail loads in the QK/AV kernels #20493

Merged

JulianCloudNTH requested a review from psiddh June 24, 2026 21:46

psiddh approved these changes Jun 24, 2026

Update

173487b

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 25, 2026 02:35 — with GitHub Actions Inactive

meta-codesync Bot added the meta-exported label Jun 25, 2026

SS-JIA approved these changes Jun 25, 2026

meta-codesync Bot merged commit 1a9fe0a into gh/JulianCloudNTH/62/base Jun 25, 2026
184 of 195 checks passed

meta-codesync Bot deleted the gh/JulianCloudNTH/62/head branch June 25, 2026 06:43

meta-codesync Bot temporarily deployed to cherry-pick-bot June 25, 2026 06:43 Inactive

pytorchbot mentioned this pull request Jun 25, 2026

[ExecuTorch][WebGPU] SDPA: skip QK contraction for fully-masked causal tiles #20509

Merged

meta-codesync[bot] merged 2 commits into gh/JulianCloudNTH/62/base from gh/JulianCloudNTH/62/head
