[ExecuTorch][WebGPU] Add et_vk.apply_rotary_emb (interleaved RoPE) + ValueList multi-output#20264
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20264

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV
There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 3 New Failures, 3 Unrelated Failures
As of commit df04a6b with merge base 0e65ba6:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@claude review

Claude finished @JulianCloudNTH's task in 2m 34s. Code Review:

@claude review

Claude finished @JulianCloudNTH's task in 1m 55s. Code Review:
ba07493 into gh/JulianCloudNTH/26/base
#20292)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom):
* __->__ #20292
* #20265
* #20290
* #20264
* #20289
* #20263

Test suite for the `et_vk.prepack` constant-materialization op, split into its own diff (op below, tests above) per the per-op test-split convention.

The prepack op is how a serialized constant becomes a GPU tensor: the constant arrives as a CPU-side reference (sizes + a pointer into the .pte bytes), and the prepack node is the sole materialization, a single CPU->GPU transfer straight into the consumer's buffer. The model `M(x) = x + w` (w a constant) routes `w` through a prepack node, so the delegate must run the materialization for the output to equal `x + w` rather than `x + 0`.

@exported-using-ghexport

Differential Revision: [D108678631](https://our.internmc.facebook.com/intern/diff/D108678631/)
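To illustrate why `M(x) = x + w` catches a skipped materialization, here is a minimal pure-Python sketch. It is not the actual test suite: `run_model` and its `run_prepack` flag are hypothetical stand-ins for the delegate, and the consumer's buffer is assumed to start zero-filled.

```python
from typing import List, Sequence


def run_model(x: Sequence[float], w_serialized: Sequence[float], run_prepack: bool) -> List[float]:
    """Hypothetical stand-in for a delegate executing M(x) = x + w."""
    # Assumption of this sketch: the consumer's buffer starts out zero-filled.
    w_buffer = [0.0] * len(w_serialized)
    if run_prepack:
        # The prepack node is the only place the constant is materialized:
        # one copy from the serialized bytes into the consumer's buffer.
        w_buffer[:] = w_serialized
    return [xi + wi for xi, wi in zip(x, w_buffer)]


x = [1.0, 2.0, 3.0]
w = [10.0, 20.0, 30.0]  # non-zero, so x + w differs from x + 0
with_prepack = run_model(x, w, run_prepack=True)
without_prepack = run_model(x, w, run_prepack=False)
```

With the prepack step the result is `x + w`; without it the result degenerates to `x`, which is only detectable because `w` is non-zero.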
…ValueList multi-output

Pull Request resolved: #20264

Adds the WebGPU backend handler for `et_vk.apply_rotary_emb.default` (interleaved Llama rotary positional embedding) plus the `ValueList` graph-value support its multi-output signature requires.

The op rotates the query and key tensors by a shared `freqs_cos`/`freqs_sin` pair and is composed of two dispatches of one WGSL kernel: each thread handles one (even, odd) element pair of a head row (`out[2i] = x[2i]*cos - x[2i+1]*sin`, `out[2i+1] = x[2i]*sin + x[2i+1]*cos`), one dispatch writing `xq_out` and one writing `xk_out`, mirroring the Vulkan `apply_rotary_emb` reference (buffer-only, fp32, the interleaved `.default` variant). Each dispatch owns a distinct compute pipeline (the graph destructor releases per dispatch, so a shared handle would double-free); the workgroup size is a `wg_size` pipeline-override constant clamped to the device limit, both 1D dispatch counts go through `WebGPUUtils::compute_1d_workgroup_count` and are validated before any GPU-object allocation, and the embedded WGSL header is generated by `gen_wgsl_headers.py`.

The two outputs (`xq_out`, `xk_out`) are serialized by the Vulkan exporter as a single `ValueList` graph value, which the runtime did not previously model. This adds the `ValueType::ValueList` value kind, a `value_lists_` table populated during `build()`, and a `get_value_list` accessor the handler uses to resolve the output ids. While in that code path it also closes a latent gap: a constant tensor whose `constant_id` is set but whose constants table is missing or out of range now throws (fail-loud) rather than silently leaving the buffer uninitialized.

ghstack-source-id: 395549282
@exported-using-ghexport

Differential Revision: [D108428756](https://our.internmc.facebook.com/intern/diff/D108428756/)
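The per-pair rotation in the commit message can be sketched on the CPU as follows. This is a pure-Python reference for a single head row, not the WGSL kernel: the helper name `rotate_interleaved` is hypothetical, and indexing `freqs_cos`/`freqs_sin` by pair index `i` within one row is an assumption of the sketch.

```python
import math
from typing import List, Sequence, Tuple


def rotate_interleaved(x: Sequence[float], cos: Sequence[float], sin: Sequence[float]) -> List[float]:
    """Rotate one head row; pair i is (x[2i], x[2i+1]) and uses cos[i], sin[i]."""
    if len(x) % 2 != 0:
        raise ValueError("head row length must be even")
    half = len(x) // 2
    if len(cos) != half or len(sin) != half:
        raise ValueError("need one cos/sin entry per (even, odd) pair")
    out = [0.0] * len(x)
    for i in range(half):  # each iteration plays the role of one GPU thread
        even, odd = x[2 * i], x[2 * i + 1]
        out[2 * i] = even * cos[i] - odd * sin[i]
        out[2 * i + 1] = even * sin[i] + odd * cos[i]
    return out


def apply_rotary_emb(
    xq_row: Sequence[float],
    xk_row: Sequence[float],
    freqs_cos: Sequence[float],
    freqs_sin: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Two passes over the same routine, standing in for the two dispatches."""
    xq_out = rotate_interleaved(xq_row, freqs_cos, freqs_sin)
    xk_out = rotate_interleaved(xk_row, freqs_cos, freqs_sin)
    return xq_out, xk_out


angles = [0.0, math.pi / 2]  # pair 0 is left unrotated, pair 1 turns a quarter circle
freqs_cos = [math.cos(a) for a in angles]
freqs_sin = [math.sin(a) for a in angles]
xq_out, xk_out = apply_rotary_emb([1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 1.0, 0.0], freqs_cos, freqs_sin)
```

Up to floating-point rounding, `xq_out` is `[1, 2, -4, 3]`: the first pair is unchanged and the second pair `(3, 4)` is rotated to `(-4, 3)`, which preserves the pair's length.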
Stack from ghstack (oldest at bottom):
Adds the WebGPU backend handler for `et_vk.apply_rotary_emb.default` (interleaved Llama rotary positional embedding) plus the `ValueList` graph-value support its multi-output signature requires.

The op rotates the query and key tensors by a shared `freqs_cos`/`freqs_sin` pair and is composed of two dispatches of one WGSL kernel: each thread handles one (even, odd) element pair of a head row (`out[2i] = x[2i]*cos - x[2i+1]*sin`, `out[2i+1] = x[2i]*sin + x[2i+1]*cos`), one dispatch writing `xq_out` and one writing `xk_out`, mirroring the Vulkan `apply_rotary_emb` reference (buffer-only, fp32, the interleaved `.default` variant). Each dispatch owns a distinct compute pipeline (the graph destructor releases per dispatch, so a shared handle would double-free); the workgroup size is a `wg_size` pipeline-override constant clamped to the device limit, both 1D dispatch counts go through `WebGPUUtils::compute_1d_workgroup_count` and are validated before any GPU-object allocation, and the embedded WGSL header is generated by `gen_wgsl_headers.py`.

The two outputs (`xq_out`, `xk_out`) are serialized by the Vulkan exporter as a single `ValueList` graph value, which the runtime did not previously model. This adds the `ValueType::ValueList` value kind, a `value_lists_` table populated during `build()`, and a `get_value_list` accessor the handler uses to resolve the output ids. While in that code path it also closes a latent gap: a constant tensor whose `constant_id` is set but whose constants table is missing or out of range now throws (fail-loud) rather than silently leaving the buffer uninitialized.

@exported-using-ghexport
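The `ValueList` lookup and the fail-loud constant check described above might look roughly like this in a Python model of the graph. It is a sketch of the idea only, not the C++ runtime: the `Graph` class, its methods other than `get_value_list`, the `resolve_constant` helper, and the convention that a negative `constant_id` means "not a constant" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict, List, Optional, Sequence


class ValueType(Enum):
    TENSOR = auto()
    VALUE_LIST = auto()


@dataclass
class Graph:
    """Hypothetical model of the graph's value tables."""
    value_types: Dict[int, ValueType] = field(default_factory=dict)
    value_lists: Dict[int, List[int]] = field(default_factory=dict)

    def add_tensor(self, value_id: int) -> None:
        self.value_types[value_id] = ValueType.TENSOR

    def add_value_list(self, value_id: int, items: Sequence[int]) -> None:
        # Populated while the graph is built; items are ids of other values.
        self.value_types[value_id] = ValueType.VALUE_LIST
        self.value_lists[value_id] = list(items)

    def get_value_list(self, value_id: int) -> List[int]:
        if self.value_types.get(value_id) is not ValueType.VALUE_LIST:
            raise TypeError(f"value {value_id} is not a ValueList")
        return self.value_lists[value_id]


def resolve_constant(constants: Optional[Sequence[bytes]], constant_id: int) -> Optional[bytes]:
    """Return the constant's bytes, or None when the tensor is not a constant."""
    if constant_id < 0:  # assumption: negative id means "no constant attached"
        return None
    if constants is None or constant_id >= len(constants):
        # Fail loud instead of leaving the buffer uninitialized.
        raise RuntimeError(f"constant_id {constant_id} has no entry in the constants table")
    return constants[constant_id]


graph = Graph()
graph.add_tensor(4)               # stands in for xq_out
graph.add_tensor(5)               # stands in for xk_out
graph.add_value_list(6, [4, 5])   # the op's single multi-output value
xq_out_id, xk_out_id = graph.get_value_list(6)
```

A handler would use the two resolved ids as the destinations of its two dispatches, and a tensor whose `constant_id` points outside the table raises at build time rather than producing garbage later.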
Differential Revision: D108428756
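The dispatch sizing mentioned in the description (a clamped `wg_size`, with both 1D counts validated before any allocation) can be sketched as below. This does not reproduce the signature of `WebGPUUtils::compute_1d_workgroup_count`; the function names, the limit parameters, and the tensor shapes are illustrative assumptions.

```python
from typing import Tuple


def clamp_wg_size(requested: int, device_max: int) -> int:
    """Clamp the workgroup-size override to what the device allows."""
    return max(1, min(requested, device_max))


def workgroup_count_1d(num_threads: int, wg_size: int, max_workgroups: int) -> int:
    """Ceil-divide threads into workgroups and reject counts the device cannot dispatch."""
    if num_threads <= 0 or wg_size <= 0:
        raise ValueError("thread count and workgroup size must be positive")
    count = (num_threads + wg_size - 1) // wg_size
    if count > max_workgroups:
        raise ValueError(f"{count} workgroups exceeds the per-dimension limit {max_workgroups}")
    return count


def plan_dispatches(
    xq_numel: int, xk_numel: int, requested_wg: int, device_max_wg: int, max_workgroups: int
) -> Tuple[int, int, int]:
    """Validate both dispatch counts up front, before any resource would be created."""
    wg_size = clamp_wg_size(requested_wg, device_max_wg)
    # One thread per (even, odd) pair, so half as many threads as elements.
    xq_groups = workgroup_count_1d(xq_numel // 2, wg_size, max_workgroups)
    xk_groups = workgroup_count_1d(xk_numel // 2, wg_size, max_workgroups)
    return wg_size, xq_groups, xk_groups


# Illustrative shapes: 4 tokens, head_dim 128, 32 query heads and 8 key heads.
plan = plan_dispatches(
    xq_numel=4 * 32 * 128,
    xk_numel=4 * 8 * 128,
    requested_wg=64,
    device_max_wg=256,
    max_workgroups=65535,
)
```

Computing both counts before touching the GPU means a shape that is too large for one dispatch fails without leaving half-created pipelines or buffers behind.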