[ExecuTorch][WebGPU] Add et_vk.apply_rotary_emb (interleaved RoPE) + ValueList multi-output by JulianCloudNTH · Pull Request #20264 · pytorch/executorch

JulianCloudNTH · 2026-06-13T00:08:51Z

Stack from ghstack (oldest at bottom):

[ExecuTorch][WebGPU] et_vk.prepack test suite (export + native golden) #20292
[ExecuTorch][WebGPU] Add et_vk.prepack (constant-tensor packing) for E2E weight loading #20265
[ExecuTorch][WebGPU] et_vk.apply_rotary_emb test suite (export + native golden) #20290
-> [ExecuTorch][WebGPU] Add et_vk.apply_rotary_emb (interleaved RoPE) + ValueList multi-output #20264
[ExecuTorch][WebGPU] et_vk.embedding_q4gsw test suite (export + native golden) #20289
[ExecuTorch][WebGPU] Add et_vk.embedding_q4gsw (4-bit groupwise-symmetric quantized embedding) #20263

Adds the WebGPU backend handler for et_vk.apply_rotary_emb.default (interleaved Llama rotary positional embedding) plus the ValueList graph-value support its multi-output signature requires.

The op rotates the query and key tensors by a shared freqs_cos/freqs_sin pair and is composed of two dispatches of one WGSL kernel: each thread handles one (even, odd) element pair of a head row (out[2i] = x[2i]*cos - x[2i+1]*sin, out[2i+1] = x[2i]*sin + x[2i+1]*cos), one dispatch writing xq_out and one writing xk_out, mirroring the Vulkan apply_rotary_emb reference (buffer-only, fp32, the interleaved .default variant). Each dispatch owns a distinct compute pipeline (the graph destructor releases per dispatch, so a shared handle would double-free); the workgroup size is a wg_size pipeline-override constant clamped to the device limit, both 1D dispatch counts go through WebGPUUtils::compute_1d_workgroup_count and are validated before any GPU-object allocation, and the embedded WGSL header is generated by gen_wgsl_headers.py.
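
As a reading aid for the rotation formula above, here is a minimal pure-Python sketch of the interleaved pairing on a single head row. It is not the WGSL kernel from this PR; `rope_interleaved` and its arguments are hypothetical names chosen for illustration, and each loop iteration plays the role of one GPU thread.

```python
import math


def rope_interleaved(x, cos, sin):
    """Rotate one head row with the interleaved (even, odd) pairing.

    x holds head_dim values; cos and sin hold head_dim // 2 values each.
    Hypothetical helper for illustration, not code from the PR.
    """
    assert len(x) % 2 == 0
    assert len(cos) == len(sin) == len(x) // 2
    out = [0.0] * len(x)
    for i in range(len(x) // 2):  # one iteration corresponds to one thread
        even, odd = x[2 * i], x[2 * i + 1]
        out[2 * i] = even * cos[i] - odd * sin[i]
        out[2 * i + 1] = even * sin[i] + odd * cos[i]
    return out


# Pair 0 is rotated by 0 radians, pair 1 by a quarter turn.
theta = [0.0, math.pi / 2]
row = [1.0, 0.0, 1.0, 0.0]
rotated = rope_interleaved(
    row, [math.cos(t) for t in theta], [math.sin(t) for t in theta]
)
```

Each pair is a plain 2D rotation, so the length of every (even, odd) pair is preserved; that property makes a convenient sanity check when comparing against a golden output.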

The two outputs (xq_out, xk_out) are serialized by the Vulkan exporter as a single ValueList graph value, which the runtime did not previously model. This adds the ValueType::ValueList value kind, a value_lists_ table populated during build(), and a get_value_list accessor the handler uses to resolve the output ids. While in that code path it also closes a latent gap: a constant tensor whose constant_id is set but whose constants table is missing or out of range now throws (fail-loud) rather than silently leaving the buffer uninitialized.
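
The two behaviors described above can be sketched in Python under stated assumptions. `MiniGraph`, `resolve_rope_outputs`, and `load_constant` are hypothetical stand-ins, not the runtime's C++ API; only the names `value_lists_`, `get_value_list`, and `constant_id` come from the description, and the null-data check reflects the follow-up raised in review rather than this paragraph.

```python
class MiniGraph:
    """Toy model of a per-value table of ValueList member ids."""

    def __init__(self, num_vals):
        # One slot per graph value; non-ValueList values keep an empty list.
        self.value_lists = [[] for _ in range(num_vals)]

    def add_value_list(self, value_id, member_ids):
        self.value_lists[value_id] = list(member_ids)

    def get_value_list(self, value_id):
        return self.value_lists[value_id]


def resolve_rope_outputs(graph, out_list_id):
    """Return (xq_out_id, xk_out_id) or fail if the list has the wrong arity."""
    ids = graph.get_value_list(out_list_id)
    if len(ids) != 2:
        raise ValueError(f"expected 2 outputs, got {len(ids)}")
    return ids[0], ids[1]


def load_constant(constant_id, constants, nbytes):
    """Fail loud instead of leaving a constant's buffer uninitialized."""
    if constant_id < 0 or nbytes == 0:
        return None  # not a constant, or nothing to copy
    if constants is None or constant_id >= len(constants):
        raise ValueError("constants table missing or constant_id out of range")
    data = constants[constant_id]
    if data is None:
        raise ValueError("constant has no data")
    return data[:nbytes]


graph = MiniGraph(num_vals=4)
graph.add_value_list(3, [1, 2])  # value 3 is the ValueList (xq_out, xk_out)
xq_out_id, xk_out_id = resolve_rope_outputs(graph, 3)
```

Raising at graph-build time keeps a malformed program from producing silently wrong output later, which is the point of the fail-loud change.
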
@exported-using-ghexport

Differential Revision: D108428756


pytorch-bot · 2026-06-13T00:08:55Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20264

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

[ROCm] MI350 CI jobs will have longer queue times due to CI migration

❌ 3 New Failures, 3 Unrelated Failures

As of commit df04a6b with merge base 0e65ba6:

NEW FAILURES - The following jobs have failed:

pull / test-lora-linux / linux-job (gh)
RuntimeError: Command docker exec -t e44eb90c67e406e8607f25bbaaa9261d8a242eb584e35669bf505fd914cc1cac /exec failed with exit code 1
pull / unittest / linux / linux-job (gh)
RuntimeError: Command docker exec -t 7bc4676341dbbe36de6bbd9a1fb3ef1f02f578e2a96d2be8c801eef7d08d0063 /exec failed with exit code 1
pull / unittest-editable / linux / linux-job (gh)
RuntimeError: Command docker exec -t 0694b50b2ea4b152e760ab32d691a00c50fc848bcda0a05de0fda2c6c7cc2b1b /exec failed with exit code 1

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (trunk failure)
pull / unittest / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
pull / unittest-editable / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-13T00:10:07Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


JulianCloudNTH · 2026-06-15T19:03:14Z

@claude review

claude · 2026-06-15T19:04:09Z

Claude finished @JulianCloudNTH's task in 2m 34s

Code Review: `et_vk.apply_rotary_emb` + ValueList

Reviewed against origin/gh/JulianCloudNTH/26/base. Overall this is a clean, well-guarded addition — extensive shape/dtype validation, validate-before-allocate ordering, a faithful port of the interleaved Vulkan RoPE, and a thoughtfully designed test (NH≠NKV so an output swap is caught by numel). A few notes, none blocking.

Correctness — looks solid

Kernel indexing (rotary_embedding.wgsl): the pair → (b, s, head, pair_i) decomposition and base/freqs_idx arithmetic match the interleaved out[2i]=x[2i]·cos − x[2i+1]·sin contract, and freqs_idx = s·half_dim + pair_i correctly omits batch/head since freqs is shared. The pair >= num_pairs guard handles the workgroup-count rounding. Per-dispatch uniform buffers carry the distinct n_heads_q/n_heads_k, so the shared shader is correct for GQA. 👍
Validate-before-allocate: both compute_1d_workgroup_count calls run before any shader/pipeline/buffer creation, so a throw on an over-limit dispatch can't leak GPU objects. The clamped wg_size is used consistently for both the override constant and the workgroup-count math.
The WGPUBindGroupLayout is released (line 271) only after both bind groups are created — fine, since the bind groups retain it via refcount.
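
The indexing and rounding guard discussed above can be modeled on the CPU. This is a sketch under assumptions, not the shader: it assumes a contiguous [batch, seq, n_heads, head_dim] layout, uses 65535 as an illustrative per-dimension workgroup limit, and all function names are hypothetical.

```python
def decompose(pair, seq, n_heads, half_dim):
    """Split a flat pair index into (b, s, head, pair_i)."""
    pair_i = pair % half_dim
    t1 = pair // half_dim
    head = t1 % n_heads
    t2 = t1 // n_heads
    return t2 // seq, t2 % seq, head, pair_i


def workgroup_count_1d(num_threads, wg_size, max_groups=65535):
    """Ceil-divide and validate before anything is allocated."""
    count = -(-num_threads // wg_size)
    if count > max_groups:
        raise ValueError("dispatch exceeds the workgroup-count limit")
    return count


def run_dispatch(x, freqs_cos, freqs_sin, seq, n_heads, head_dim, wg_size=64):
    half_dim = head_dim // 2
    num_pairs = len(x) // 2
    out = [0.0] * len(x)
    groups = workgroup_count_1d(num_pairs, wg_size)
    for pair in range(groups * wg_size):  # every launched thread
        if pair >= num_pairs:
            continue  # threads created only by rounding up do nothing
        b, s, head, pair_i = decompose(pair, seq, n_heads, half_dim)
        base = ((b * seq + s) * n_heads + head) * head_dim + 2 * pair_i
        f = s * half_dim + pair_i  # freqs are shared across batch and head
        c, sn = freqs_cos[f], freqs_sin[f]
        out[base] = x[base] * c - x[base + 1] * sn
        out[base + 1] = x[base] * sn + x[base + 1] * c
    return out


# Shape loosely modeled on the "multi" test case: S=5, NH=8, head_dim=4.
seq, n_heads, head_dim = 5, 8, 4
x = [float(i) for i in range(1 * seq * n_heads * head_dim)]
identity_cos = [1.0] * (seq * head_dim // 2)
identity_sin = [0.0] * (seq * head_dim // 2)
unrotated = run_dispatch(x, identity_cos, identity_sin, seq, n_heads, head_dim)
```

With 80 pairs and a workgroup size of 64, two workgroups launch 128 threads, so 48 of them hit the guard. A zero-angle rotation must return the input unchanged, which checks the index arithmetic independently of the trigonometry.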

Notes / suggestions

1. Double pipeline creation (RotaryEmbedding.cpp:237-241). The comment "a shared handle would double-free" is accurate given the destructor (WebGPUGraph.cpp:188-195) calls wgpuComputePipelineRelease per dispatch. But creating two identical pipelines (same module, layout, constants) just to satisfy ownership is a bit wasteful: every other op goes through get_or_create_pipeline/pipeline_cache_, which the destructor frees separately (lines 201-204) and which dedupes. Consider routing through the cache instead of two raw wgpuDeviceCreateComputePipeline calls; then a single cached pipeline can back both dispatches without a double-free. Not blocking, but it would match the surrounding pattern and avoid the redundant compile.
2. Redundant num_pairs derivation (RotaryEmbedding.cpp:60 vs 184-193). add_rope_dispatch recomputes num_pairs = numel_of(out.dims)/2 while the caller independently computes workgroup_count from xq_numel/2. They're consistent only because output shape == input shape (which is asserted earlier), so it's safe. It is still worth a one-liner noting the invariant, since a future shape change would silently desync the guard from the dispatch size.
3. Fail-loud constant gap (WebGPUGraph.cpp:317-321). Good catch closing the out-of-range/missing-table case. One residual: the new else lives inside if (constant_id >= 0 && constant_data && tensor.nbytes > 0), so a tensor with constant_id >= 0 but a null constant_data pointer still silently leaves the buffer uninitialized, which is the same class of latent bug this PR is fixing. If that combination is meant to be unreachable, a brief comment would help; otherwise it's worth the same throw.
4. Native-test coverage (test_webgpu_native.cpp). The C++ kernel is only executed at the multi shape (S=5, NH=8, NKV=2). The decode shape (S=1, GQA 32:8) is exercised only in Python (export + golden-vs-eager), not against the actual GPU kernel. The S=1 path is exactly where the s = t2 % seq / batch decomposition is most likely to break, so running export_rope_model with shape_name="decode" and adding a second native case would close that gap. export_rope_model already takes shape_name, so this is mostly CI-script plumbing.
5. get_value_list bounds. value_lists_[id] is safe given resize(num_vals), and a non-ValueList id returns an empty vector, which is caught by the size() != 2 check. Fine. No validation that out_list[0]/[1] are themselves tensor ids before get_tensor, but a malformed id would fault there deterministically; acceptable.
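
The ownership trade-off in the pipeline-creation note can be illustrated with a small Python model. It is a sketch only: `Pipeline`, `PipelineCache`, and `teardown_owned` are hypothetical stand-ins for the handle, the cache, and the per-dispatch release in the graph destructor, and no WebGPU API is called.

```python
class Pipeline:
    """Stand-in for a GPU pipeline handle that must be released exactly once."""

    def __init__(self, key):
        self.key = key
        self.released = False

    def release(self):
        if self.released:
            raise RuntimeError(f"double free of pipeline {self.key!r}")
        self.released = True


class PipelineCache:
    """Owns pipelines; dispatches only borrow them."""

    def __init__(self):
        self._pipelines = {}
        self.compiles = 0

    def get_or_create(self, key):
        if key not in self._pipelines:
            self.compiles += 1  # compile only on a cache miss
            self._pipelines[key] = Pipeline(key)
        return self._pipelines[key]

    def release_all(self):
        for pipeline in self._pipelines.values():
            pipeline.release()
        self._pipelines.clear()


def teardown_owned(dispatch_pipelines):
    """Model of a destructor that releases one pipeline per dispatch."""
    for pipeline in dispatch_pipelines:
        pipeline.release()


# Sharing one owned handle across two dispatches fails at teardown.
shared = Pipeline("rope")
try:
    teardown_owned([shared, shared])
    shared_ok = True
except RuntimeError:
    shared_ok = False

# Borrowing from a cache: one compile, one release, two dispatches.
cache = PipelineCache()
key = ("rotary_embedding.wgsl", ("wg_size", 64))
q_pipeline = cache.get_or_create(key)
k_pipeline = cache.get_or_create(key)
cache.release_all()
```

The cache key here includes the override constant because two pipelines are interchangeable only when module, layout, and constants all match.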

Minor

RotaryParams carries three _pad u32s to reach 32 bytes; the static_assert is a nice guard. Matches the WGSL Params (8×u32).
WGSL .wgsl and the generated _wgsl.h are byte-identical and the wgsl-sha256 is present — assuming gen_wgsl_headers.py was rerun, good.

Nothing here blocks merge; items 1 and 4 are the most worth addressing.
· branch gh/JulianCloudNTH/26/head


JulianCloudNTH · 2026-06-15T21:59:27Z

@claude review

claude · 2026-06-15T22:00:15Z

Claude finished @JulianCloudNTH's task in 1m 55s

Code Review: `et_vk.apply_rotary_emb` + ValueList (re-review)

Reviewed against origin/gh/JulianCloudNTH/26/base. This is a clean, well-guarded addition. Since the prior review, one of the flagged items has been resolved in code; the rest still stand as optional polish. Nothing blocks merge.

Resolved since last review

Null constant_data gap (was item 3). WebGPUGraph.cpp:322-326 now throws when constant_id >= 0 && tensor.nbytes > 0 but constant_data is null, closing the silent-uninitialized-buffer case. Combined with the out-of-range/missing-table throw at 317-321, the constant path is now fail-loud across all branches. 👍

Correctness — solid

Kernel indexing (rotary_embedding.wgsl): the pair → (b, s, head, pair_i) decomposition and base/freqs_idx = s*half_dim + pair_i arithmetic faithfully implement interleaved out[2i]=x[2i]·cos − x[2i+1]·sin, with freqs correctly shared across batch/head. The pair >= num_pairs guard handles workgroup-count rounding.
Validate-before-allocate: both compute_1d_workgroup_count calls (RotaryEmbedding.cpp:185-194) run before any shader/pipeline/buffer creation, so a throw can't leak GPU objects. wg_size is clamped to the device limit and used consistently for both the override constant and the WGC math.
GQA: per-dispatch uniform buffers carry distinct n_heads_q/n_heads_k, so the shared shader is correct for grouped-query attention.
WGSL hash verified: sha256(rotary_embedding.wgsl) = c60f1ce1… matches the wgsl-sha256 in the generated header — gen_wgsl_headers.py output is current and byte-identical.
Cleanup ordering: bgl is released (:272) only after both bind groups are built; bind groups retain it by refcount. Pipelines are owned by their dispatches and freed in the graph dtor (WebGPUGraph.cpp:188-195).

Remaining optional notes (carried over, non-blocking)

1. Double pipeline creation (RotaryEmbedding.cpp:239-242). Two identical pipelines (same module, layout, constants) are compiled solely to satisfy per-dispatch ownership in the destructor. Every other op routes through get_or_create_pipeline/pipeline_cache_, which dedupes and is freed separately (WebGPUGraph.cpp:201-204). Routing through the cache would let one cached pipeline back both dispatches without the redundant compile.
2. num_pairs derivation desync risk (RotaryEmbedding.cpp:61 vs 185-194). add_rope_dispatch recomputes num_pairs = numel_of(out.dims)/2, while the caller derives workgroup_count from xq_numel/2. These agree only because output shape == input shape (asserted at :170-180). Safe today; the existing comment at :60 documents the invariant.
3. Native-test coverage. The C++ kernel is exercised only at the multi shape (S=5). The decode shape (S=1, GQA 32:8), which is exactly where the s = t2 % seq / batch decomposition is most fragile, is covered only in Python (export + golden). Adding a native case with shape_name="decode" would close that gap; export_rope_model already takes shape_name.
4. get_value_list member-id types. out_list[0]/[1] aren't validated as tensor ids before get_tensor; a malformed id faults deterministically there, so acceptable. The size() != 2 guard handles the wrong-arity case (a non-ValueList id returns an empty vector).

Summary: the resolved null-pointer throw is the meaningful change since last review. Item 1 (pipeline cache) is the only remaining suggestion with real upside; items 2-4 are documentation/coverage polish.
· branch gh/JulianCloudNTH/26/head


#20292)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom):

* __->__ #20292
* #20265
* #20290
* #20264
* #20289
* #20263

Test suite for the `et_vk.prepack` constant-materialization op, split into its own diff (op below, tests above) per the per-op test-split convention.

The prepack op is how a serialized constant becomes a GPU tensor: the constant arrives as a CPU-side reference (sizes + a pointer into the .pte bytes), and the prepack node is the sole materialization, a single CPU->GPU transfer straight into the consumer's buffer. The model `M(x) = x + w` (w a constant) routes `w` through a prepack node, so the delegate must run the materialization for the output to equal `x + w` rather than `x + 0`.

@exported-using-ghexport

Differential Revision: [D108678631](https://our.internmc.facebook.com/intern/diff/D108678631/)

…ValueList multi-output

Pull Request resolved: #20264

ghstack-source-id: 395549282

@exported-using-ghexport

Differential Revision: [D108428756](https://our.internmc.facebook.com/intern/diff/D108428756/)

Update

f2d1ae0

[ghstack-poisoned]

JulianCloudNTH requested review from kirklandsign and larryliu0820 as code owners June 13, 2026 00:08

meta-cla Bot added the CLA Signed label Jun 13, 2026

Update

5d188d6

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 15, 2026 17:23 — with GitHub Actions Inactive

meta-codesync Bot added the meta-exported label Jun 15, 2026

Update

0b7f637

[ghstack-poisoned]

JulianCloudNTH mentioned this pull request Jun 15, 2026

[ExecuTorch][WebGPU] et_vk.embedding_q4gsw test suite (export + native golden) #20289

Merged

JulianCloudNTH temporarily deployed to cadence June 15, 2026 21:53 — with GitHub Actions Inactive

JulianCloudNTH mentioned this pull request Jun 15, 2026

[ExecuTorch][WebGPU] et_vk.apply_rotary_emb test suite (export + native golden) #20290

Merged

Update

9b09f89

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 15, 2026 22:25 — with GitHub Actions Inactive

JulianCloudNTH mentioned this pull request Jun 15, 2026

[ExecuTorch][WebGPU] et_vk.prepack test suite (export + native golden) #20292

Merged

psiddh approved these changes Jun 21, 2026

View reviewed changes

Update

df04a6b

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 22, 2026 05:03 — with GitHub Actions Inactive

meta-codesync Bot merged commit ba07493 into gh/JulianCloudNTH/26/base Jun 22, 2026
176 of 183 checks passed

meta-codesync Bot deleted the gh/JulianCloudNTH/26/head branch June 22, 2026 15:33

meta-codesync Bot temporarily deployed to cherry-pick-bot June 22, 2026 15:33 Inactive

pytorchbot mentioned this pull request Jun 22, 2026

[ExecuTorch][WebGPU] Add et_vk.apply_rotary_emb (interleaved RoPE) + ValueList multi-output #20426

Merged

meta-codesync[bot] merged 5 commits into gh/JulianCloudNTH/26/base from gh/JulianCloudNTH/26/head
