[ExecuTorch][WebGPU] Add et_vk.apply_rotary_emb (interleaved RoPE) + ValueList multi-output by pytorchbot · Pull Request #20426 · pytorch/executorch

pytorchbot · 2026-06-22T15:34:46Z

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #20264 by @JulianCloudNTH
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/26/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/26/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/28/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/26/orig

@diff-train-skip-merge

…ValueList multi-output

Pull Request resolved: #20264

Adds the WebGPU backend handler for `et_vk.apply_rotary_emb.default` (interleaved Llama rotary positional embedding) plus the `ValueList` graph-value support its multi-output signature requires.

The op rotates the query and key tensors by a shared `freqs_cos`/`freqs_sin` pair and is composed of two dispatches of one WGSL kernel: each thread handles one (even, odd) element pair of a head row (`out[2i] = x[2i]*cos - x[2i+1]*sin`, `out[2i+1] = x[2i]*sin + x[2i+1]*cos`), one dispatch writing `xq_out` and one writing `xk_out`, mirroring the Vulkan `apply_rotary_emb` reference (buffer-only, fp32, the interleaved `.default` variant).

Each dispatch owns a distinct compute pipeline (the graph destructor releases per dispatch, so a shared handle would double-free); the workgroup size is a `wg_size` pipeline-override constant clamped to the device limit, both 1D dispatch counts go through `WebGPUUtils::compute_1d_workgroup_count` and are validated before any GPU-object allocation, and the embedded WGSL header is generated by `gen_wgsl_headers.py`.

The two outputs (`xq_out`, `xk_out`) are serialized by the Vulkan exporter as a single `ValueList` graph value, which the runtime did not previously model. This adds the `ValueType::ValueList` value kind, a `value_lists_` table populated during `build()`, and a `get_value_list` accessor the handler uses to resolve the output ids.

While in that code path it also closes a latent gap: a constant tensor whose `constant_id` is set but whose constants table is missing or out of range now throws (fail-loud) rather than silently leaving the buffer uninitialized.

ghstack-source-id: 395549282
@exported-using-ghexport

Differential Revision: [D108428756](https://our.internmc.facebook.com/intern/diff/D108428756/)
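For readers who want to check the per-pair formula by hand, here is a minimal CPU-side sketch in plain Python. It is not the WGSL kernel from this PR: `apply_rotary_emb_interleaved` and `workgroup_count_1d` are hypothetical names, and the clamp-then-ceil-divide logic is only an assumption about what a 1D dispatch-count helper such as `WebGPUUtils::compute_1d_workgroup_count` might compute.

```python
import math


def apply_rotary_emb_interleaved(x, freqs_cos, freqs_sin):
    """Rotate one head row: each (even, odd) pair is rotated by its own angle."""
    assert len(x) % 2 == 0
    assert len(freqs_cos) == len(freqs_sin) == len(x) // 2
    out = [0.0] * len(x)
    for i in range(len(x) // 2):  # one loop iteration plays the role of one thread
        c, s = freqs_cos[i], freqs_sin[i]
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c
    return out


def workgroup_count_1d(num_threads, wg_size, device_max_wg_size):
    """Hypothetical helper: clamp the workgroup size, then ceil-divide (assumption)."""
    wg = min(wg_size, device_max_wg_size)
    return (num_threads + wg - 1) // wg


# The same cos/sin pair is applied to query and key in two independent passes,
# analogous to the two dispatches described above.
angles = [0.0, math.pi / 2]
cos = [math.cos(a) for a in angles]
sin = [math.sin(a) for a in angles]
xq_out = apply_rotary_emb_interleaved([1.0, 2.0, 3.0, 4.0], cos, sin)
xk_out = apply_rotary_emb_interleaved([0.5, 0.0, 0.0, 1.0], cos, sin)
```

With an angle of 0 the first pair passes through unchanged, and with an angle of π/2 the second pair `(a, b)` becomes approximately `(-b, a)`, which is a quick sanity check on the signs.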

pytorch-bot · 2026-06-22T15:34:50Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20426

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

[ROCm] MI350 CI jobs will have longer queue times due to CI migration

This comment was automatically generated by Dr. CI and updates every 15 minutes.

…ve golden)

Pull Request resolved: #20290

Splits the `et_vk.apply_rotary_emb` tests into their own diff (op below, tests above), matching the `sdpa`/`update_cache`/`linear_q4gsw` convention, and brings them to the same rigor: a multi-shape config sweep run on-device (prefill + decode) and a library dual-oracle at both shapes.

ghstack-source-id: 395549287
@exported-using-ghexport

Differential Revision: [D108668384](https://our.internmc.facebook.com/intern/diff/D108668384/)
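The actual tests are not shown on this page, so the following is only a sketch of how a shape sweep with an independent oracle could look. It assumes a flat `[seq_len, n_heads, head_dim]` buffer and a `[seq_len, head_dim / 2]` frequency table; the shapes, the helper names, and the choice of complex multiplication as the second oracle are all illustrative assumptions, not the PR's test code.

```python
import cmath
import math


def rope_flat(x, cos, sin, seq_len, n_heads, head_dim):
    """Interleaved rotation over a flat buffer, indexed by pair id like a 1D kernel."""
    half = head_dim // 2
    out = [0.0] * len(x)
    for t in range(seq_len * n_heads * half):  # t is the global pair index
        i = t % half                  # pair index within the head row
        pos = t // (n_heads * half)   # sequence position of this pair
        c, s = cos[pos * half + i], sin[pos * half + i]
        a, b = x[2 * t], x[2 * t + 1]
        out[2 * t] = a * c - b * s
        out[2 * t + 1] = a * s + b * c
    return out


def rope_complex(x, angles, seq_len, n_heads, head_dim):
    """Independent oracle: treat each pair as a complex number and multiply by e^{i*angle}."""
    half = head_dim // 2
    out = []
    for pos in range(seq_len):
        for _ in range(n_heads):
            base = len(out)  # start of this head row in the flat buffer
            for i in range(half):
                z = complex(x[base + 2 * i], x[base + 2 * i + 1])
                z *= cmath.exp(1j * angles[pos * half + i])
                out.extend((z.real, z.imag))
    return out


def max_abs_error(seq_len, n_heads, head_dim):
    half = head_dim // 2
    n = seq_len * n_heads * head_dim
    x = [math.sin(0.37 * k) for k in range(n)]  # deterministic input data
    angles = [pos / (10000.0 ** (2 * i / head_dim))
              for pos in range(seq_len) for i in range(half)]
    cos = [math.cos(a) for a in angles]
    sin = [math.sin(a) for a in angles]
    got = rope_flat(x, cos, sin, seq_len, n_heads, head_dim)
    want = rope_complex(x, angles, seq_len, n_heads, head_dim)
    return max(abs(g - w) for g, w in zip(got, want))


# Hypothetical prefill (seq_len > 1) and decode (seq_len == 1) shapes.
SHAPES = [(8, 4, 16), (1, 4, 16)]
errors = {shape: max_abs_error(*shape) for shape in SHAPES}
```

Comparing against an oracle that is derived differently (complex multiplication rather than the expanded real formula) catches sign and indexing mistakes that a copy of the same formula would silently share.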

…E2E weight loading

Pull Request resolved: #20265

Adds the WebGPU backend handler for `et_vk.prepack.default`, the node the VulkanPartitioner wraps around every constant feeding a delegated op so the constant is materialized into its dedicated GPU buffer before inference.

For the WebGPU backend's buffer-flat/fp32 model, prepack is an identity layout (same dims, dtype, and bytes), so the handler runs no compute shader: it validates that `src` and `out` match (dims, `elem_size`, `nbytes`, non-null buffers; every check throws fail-loud) and records a one-time `src`->`out` buffer-to-buffer copy via the new `WebGPUGraph::add_prepack_copy`.

The recorded copies run once in a new `build()` Phase 4 (after the op-dispatch chain is recorded), mirroring the Vulkan delegate's separate `prepack()` init phase (distinct from per-inference `execute()`). Ordering is guaranteed by the WebGPU queue -- the prepack submit precedes the first `execute()` submit on the same queue, so the copied data is visible without an explicit device poll (Dawn has no `wgpuDevicePoll`, and the backend relies on queue ordering plus the output-map wait elsewhere).

`src.elem_size` is the `WebGPUTensor` field added by the embedding op lower in this stack, so prepack stacks above it.

ghstack-source-id: 395549289
@exported-using-ghexport

Differential Revision: [D108428754](https://our.internmc.facebook.com/intern/diff/D108428754/)
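The validate-then-record flow described above can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the C++ handler: `FakeTensor` and `FakeGraph` are hypothetical stand-ins for `WebGPUTensor` and `WebGPUGraph`, byte arrays stand in for GPU buffers, and the `submits` list only mimics the idea that work submitted earlier on one queue runs first.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeTensor:  # hypothetical stand-in for WebGPUTensor
    dims: tuple
    elem_size: int
    nbytes: int
    buffer: Optional[bytearray]


class FakeGraph:  # hypothetical stand-in for WebGPUGraph
    def __init__(self):
        self.prepack_copies = []
        self.submits = []

    def add_prepack_copy(self, src, out):
        # Only record the copy here; it runs later, once, during build().
        self.prepack_copies.append((src, out))

    def build(self):
        # Earlier build phases are omitted; this models only the final copy phase.
        for src, out in self.prepack_copies:
            out.buffer[:] = src.buffer
        self.submits.append("prepack")

    def execute(self):
        self.submits.append("execute")


def prepack_handler(graph, src, out):
    """Identity-layout prepack: fail loudly on any mismatch, then record a copy."""
    if src.buffer is None or out.buffer is None:
        raise ValueError("prepack: null buffer")
    if src.dims != out.dims:
        raise ValueError("prepack: dims mismatch")
    if src.elem_size != out.elem_size:
        raise ValueError("prepack: elem_size mismatch")
    if src.nbytes != out.nbytes:
        raise ValueError("prepack: nbytes mismatch")
    graph.add_prepack_copy(src, out)


graph = FakeGraph()
src = FakeTensor((2, 2), 4, 16, bytearray(range(16)))
out = FakeTensor((2, 2), 4, 16, bytearray(16))
prepack_handler(graph, src, out)
graph.build()    # the recorded copy runs once here
graph.execute()  # per-inference work is submitted afterwards
```

Recording the copy instead of performing it immediately keeps the handler free of side effects until `build()` decides when the one-time transfer is submitted.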

Pull Request resolved: #20292

Test suite for the `et_vk.prepack` constant-materialization op, split into its own diff (op below, tests above) per the per-op test-split convention.

The prepack op is how a serialized constant becomes a GPU tensor: the constant arrives as a CPU-side reference (sizes + a pointer into the .pte bytes), and the prepack node is the sole materialization: one CPU->GPU transfer straight into the consumer's buffer. The model `M(x) = x + w` (w a constant) routes `w` through a prepack node, so the delegate must run the materialization for the output to equal `x + w` rather than `x + 0`.

ghstack-source-id: 395555139
@exported-using-ghexport

Differential Revision: [D108678631](https://our.internmc.facebook.com/intern/diff/D108678631/)
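The reasoning behind the `M(x) = x + w` model can be illustrated with a toy simulation. This is not the test suite from the PR: `run_delegate` is a hypothetical function, lists stand in for GPU buffers, and the zero-initialized consumer buffer is an assumption made so that a skipped prepack shows up as `x + 0`.

```python
def run_delegate(x, w_constant, run_prepack):
    """Toy delegate for M(x) = x + w, where w must be materialized by prepack."""
    w_buffer = [0.0] * len(w_constant)  # consumer buffer, zeroed in this sketch
    if run_prepack:
        w_buffer[:] = w_constant        # the single materializing transfer
    return [xi + wi for xi, wi in zip(x, w_buffer)]


x = [1.0, 2.0, 3.0]
w = [10.0, 20.0, 30.0]  # nonzero so a missing prepack is detectable
with_prepack = run_delegate(x, w, run_prepack=True)
without_prepack = run_delegate(x, w, run_prepack=False)
```

Because `w` is nonzero, the two outputs differ, so a comparison against `x + w` fails whenever the materialization step is skipped.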

pytorchbot requested review from kirklandsign and larryliu0820 as code owners June 22, 2026 15:34

meta-cla Bot added the CLA Signed label Jun 22, 2026

pytorchbot temporarily deployed to cadence June 22, 2026 15:35 — with GitHub Actions Inactive

JulianCloudNTH self-requested a review June 22, 2026 16:28

JulianCloudNTH approved these changes Jun 22, 2026


JulianCloudNTH added 3 commits June 22, 2026 11:38

JulianCloudNTH merged commit 1c65102 into gh/JulianCloudNTH/28/orig Jun 22, 2026
144 of 151 checks passed

JulianCloudNTH temporarily deployed to cadence June 22, 2026 18:39 — with GitHub Actions Inactive

JulianCloudNTH deleted the gh/JulianCloudNTH/26/orig branch June 22, 2026 18:39

JulianCloudNTH merged 4 commits into gh/JulianCloudNTH/28/orig from gh/JulianCloudNTH/26/orig
