[ExecuTorch][WebGPU] Add et_vk.embedding_q4gsw (4-bit groupwise-symmetric quantized embedding) by JulianCloudNTH · Pull Request #20263 · pytorch/executorch

JulianCloudNTH · 2026-06-13T00:08:45Z

Stack from ghstack (oldest at bottom):

[ExecuTorch][WebGPU] et_vk.prepack test suite (export + native golden) #20292
[ExecuTorch][WebGPU] Add et_vk.prepack (constant-tensor packing) for E2E weight loading #20265
[ExecuTorch][WebGPU] et_vk.apply_rotary_emb test suite (export + native golden) #20290
[ExecuTorch][WebGPU] Add et_vk.apply_rotary_emb (interleaved RoPE) + ValueList multi-output #20264
[ExecuTorch][WebGPU] et_vk.embedding_q4gsw test suite (export + native golden) #20289
-> [ExecuTorch][WebGPU] Add et_vk.embedding_q4gsw (4-bit groupwise-symmetric quantized embedding) #20263

Adds the WebGPU backend handler for et_vk.embedding_q4gsw.default (a 4-bit groupwise-symmetric quantized embedding gather) plus the host-side integer-input infra it requires.

The op is a single compute dispatch with one stage. Each thread handles one 32-element block of a gathered row and dequantizes the packed 4-bit table (q = (nibble - 8) * scale; even dim = high nibble, odd dim = low nibble) into the fp32 output, mirroring the Vulkan embedding_q4gsw reference (flat buffer-backed weight; is_linear_weight=true is unsupported and throws). The workgroup size is a wg_size pipeline-override constant clamped to the device limit via WebGPUUtils::clamp_workgroup_size. The 1D dispatch count goes through WebGPUUtils::compute_1d_workgroup_count and is validated before any GPU-object allocation. The embedded WGSL string header is generated by gen_wgsl_headers.py.
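As a reading aid, the dequant rule above can be sketched on the CPU. This is a minimal sketch, not the backend's code: the function name `dequant_row_q4gsw`, its parameters, and the assumption of one fp32 scale per group per row stored row-major are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for one gathered row; not the backend's code.
// weight: packed 4-bit table, embed_dim / 2 bytes per row.
// scales: embed_dim / group_size floats per row (assumed layout).
std::vector<float> dequant_row_q4gsw(
    const std::vector<uint8_t>& weight,
    const std::vector<float>& scales,
    uint32_t token,
    uint32_t embed_dim,
    uint32_t group_size) {
  assert(embed_dim % 2 == 0 && group_size > 0 && embed_dim % group_size == 0);
  const size_t row_byte_base = static_cast<size_t>(token) * (embed_dim / 2);
  const size_t groups_per_row = embed_dim / group_size;
  std::vector<float> out(embed_dim);
  for (uint32_t dim = 0; dim < embed_dim; ++dim) {
    const uint8_t b = weight[row_byte_base + dim / 2];
    // Even dim reads the high nibble, odd dim reads the low nibble.
    const int nibble = (dim % 2 == 0) ? (b >> 4) : (b & 0xF);
    const float scale = scales[token * groups_per_row + dim / group_size];
    // Stored nibbles are shifted by +8, so subtract 8 before scaling.
    out[dim] = static_cast<float>(nibble - 8) * scale;
  }
  return out;
}
```

A host-side reference of this shape is convenient for comparing against GPU output in a test.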

Embedding indices arrive as int64 at the program boundary but the serialized graph stores them as int32, so the shared input path is extended with a host-side InputData view ({data, nbytes, host_is_int64}) and copy_inputs gains three branches: a byte-for-byte fast path when host and GPU sizes match, an int64->int32 narrowing copy when the buffer is int32 and the host input is twice as wide (mirrors the Vulkan kLong->kInt staging cast), and a fail-loud throw otherwise. WebGPUTensor gains elem_size/is_int to drive the narrowing decision, and update_symints_from_inputs takes the same InputData vector so execute() builds a single input list consumed by both.
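The three branches can be sketched as follows. This is a minimal sketch under stated assumptions: `InputData` is reproduced from the field list above, while `stage_input`, `gpu_nbytes`, and `gpu_is_int32` are hypothetical stand-ins for the real tensor metadata and upload call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Stand-in for the host-side view named in the description.
struct InputData {
  const void* data;
  size_t nbytes;
  bool host_is_int64;
};

// Hypothetical helper: returns the bytes that would be uploaded to a GPU
// buffer of gpu_nbytes. gpu_is_int32 stands in for elem_size/is_int.
std::vector<uint8_t> stage_input(
    const InputData& in, size_t gpu_nbytes, bool gpu_is_int32) {
  assert(in.data != nullptr || in.nbytes == 0);
  std::vector<uint8_t> staged(gpu_nbytes);
  if (in.nbytes == gpu_nbytes) {
    // Fast path: sizes match, copy byte for byte.
    if (gpu_nbytes > 0) {
      std::memcpy(staged.data(), in.data, gpu_nbytes);
    }
  } else if (gpu_is_int32 && in.host_is_int64 && in.nbytes == gpu_nbytes * 2) {
    // Narrowing path: each int64 host value becomes one int32 GPU value.
    const auto* src = static_cast<const uint8_t*>(in.data);
    const size_t numel = gpu_nbytes / sizeof(int32_t);
    for (size_t i = 0; i < numel; ++i) {
      int64_t wide;
      std::memcpy(&wide, src + i * sizeof(int64_t), sizeof(wide));
      const int32_t narrow = static_cast<int32_t>(wide);
      std::memcpy(staged.data() + i * sizeof(int32_t), &narrow, sizeof(narrow));
    }
  } else {
    // Fail loud: any other size relationship is unsupported.
    throw std::runtime_error("input size does not match GPU buffer");
  }
  return staged;
}
```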
@exported-using-ghexport

Differential Revision: D108428753

[ghstack-poisoned]

pytorch-bot · 2026-06-13T00:08:48Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20263

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

[ROCm] MI350 CI jobs will have longer queue times due to CI migration

❌ 3 New Failures, 8 Pending, 1 Unrelated Failure

As of commit 7dea4c4 with merge base 0e65ba6:

NEW FAILURES - The following jobs have failed:

Build Presets / apple (llm) / build (gh)
The process '/opt/homebrew/bin/git' failed with exit code 1
pull / unittest / macos / macos-job (gh)
export/tests/test_target_recipes.py::TestTargetRecipes::test_mv3_model
pull / unittest-editable / macos / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (trunk failure)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-13T00:09:36Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

[ghstack-poisoned]

JulianCloudNTH · 2026-06-15T19:03:01Z

@claude review

claude · 2026-06-15T19:03:51Z

Claude finished @JulianCloudNTH's task in 2m 1s

Code Review: `et_vk.embedding_q4gsw` (WebGPU)

Read PR diff and changed files
Review op implementation (EmbeddingQ4gsw.cpp + WGSL)
Review host-side input infra (WebGPUGraph, copy_inputs, InputData)
Review tests
Post review feedback

Overall this is a clean, well-validated addition. The op does extensive up-front shape/byte guards before allocating any GPU object, the WGSL nibble unpacking matches the documented pack convention, and the int64→int32 narrowing path is well reasoned. One correctness bug and a few smaller notes below.

🔴 `copy_inputs` can throw across the backend boundary

WebGPUBackend::execute deliberately wraps the symint refresh in a try/catch with the comment "Fail loud as a runtime Error so a throw never crosses the backend boundary" — but copy_inputs is called outside that block:

graph->copy_inputs(inputs);          // WebGPUBackend.cpp:109  — NOT guarded

try {
  graph->update_symints_from_inputs(inputs);   // :112-118 — guarded
  graph->propagate_resize();
} catch (const std::exception& e) { ... return Error::Internal; }

This PR newly makes copy_inputs throw std::runtime_error (the "fail-loud" branch in WebGPUGraph.cpp). Before this PR copy_inputs was non-throwing, so the unguarded call was fine — now an unsupported/mismatched input will let a C++ exception escape execute(), which returns Error and is not expected to throw. That contradicts the design comment two lines below and can crash the runtime instead of returning a clean error.

Suggest moving the copy_inputs call inside the existing try/catch (the input-gathering loop can stay where it is):

try {
  graph->copy_inputs(inputs);
  graph->update_symints_from_inputs(inputs);
  graph->propagate_resize();
} catch (const std::exception& e) {
  ET_LOG(Error, "WebGPU input copy/symint refresh failed: %s", e.what());
  return Error::Internal;
}


🟡 Out-of-range indices read out of bounds (silent wrong output)

In the WGSL, token = u32(t_indices[indices_idx]) is used directly to index t_weight and t_scales with no clamp against vocab size. A token ≥ vocab (or a negative int32 that wraps to a huge u32) indexes past the weight/scales buffers. WebGPU bounds-checking will return 0 rather than crash, so the failure is silent. This mirrors the Vulkan reference's trust in valid indices, so it may be acceptable — but worth a one-line comment noting indices are assumed in-range, since unlike a CPU gather there's no bounds error to surface the mistake.
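To make the wrap-around concrete, here is a small host-side sketch of an optional pre-check. It is an assumption for illustration only: the PR documents the in-range requirement rather than enforcing it, and `indices_in_range` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side check; the PR itself only documents the assumption.
// Returns true when every index lands inside [0, vocab_size).
bool indices_in_range(const std::vector<int32_t>& indices, uint32_t vocab_size) {
  assert(vocab_size > 0);
  for (const int32_t idx : indices) {
    // Same bit reinterpretation as the shader's u32(...): -1 becomes 4294967295.
    const uint32_t token = static_cast<uint32_t>(idx);
    if (token >= vocab_size) {
      return false;
    }
  }
  return true;
}
```

A single unsigned comparison covers both failure modes, because a negative int32 reinterpreted as u32 is always larger than any realistic vocab size.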

🟡 `total_blocks` narrowed to `uint32` without a guard

total_blocks is computed as uint64_t but stored into the uint32_t EmbeddingParams::total_blocks and passed to compute_1d_workgroup_count via a static_cast<uint32_t>. There's no guard that it fits in 32 bits. All current test shapes are tiny (256 blocks), so this is theoretical, but a > UINT32_MAX check alongside the other guards would keep the fail-loud invariant consistent for large gathers.
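One way such a guard could look, as a minimal sketch: `checked_total_blocks` is a hypothetical helper, and it assumes `embed_dim` is a multiple of the 32-element block size.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper: derives the block count in 64 bits and refuses to
// narrow it when the result would not fit in the uint32_t uniform field.
uint32_t checked_total_blocks(uint64_t num_indices, uint64_t embed_dim) {
  constexpr uint64_t kBlockElems = 32;  // one thread per 32-element block
  assert(embed_dim % kBlockElems == 0);  // assumption of this sketch
  const uint64_t blocks_per_row = embed_dim / kBlockElems;
  constexpr uint64_t kMax = std::numeric_limits<uint32_t>::max();
  // Compare by division so the 64-bit product itself cannot overflow.
  if (blocks_per_row != 0 && num_indices > kMax / blocks_per_row) {
    throw std::runtime_error("embedding_q4gsw: total_blocks exceeds UINT32_MAX");
  }
  return static_cast<uint32_t>(num_indices * blocks_per_row);
}
```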

🟢 Minor: per-execute allocation in the narrowing path

copy_inputs allocates a fresh std::vector<int32_t> narrowed(numel) on every execute for the int64→int32 case. Fine for correctness; if the embedding-gather is on a hot path you may later want a reusable scratch buffer. Not blocking.

Things that look good

EmbeddingParams static_assert(sizeof == 32) + explicit _pad matches std140 uniform alignment — nice.
Validation happens before any wgpuDeviceCreate* call, so failures don't leak GPU objects.
The Python test computes the golden two independent ways (et_vk reference vs. a hand-rolled dequant+F.embedding) in test_golden_matches_eager, so a buggy golden can't mask a buggy kernel — good defensive testing.
embedding_q4gsw_wgsl.h carries the wgsl-sha256 and DO NOT EDIT banner; assuming gen_wgsl_headers.py regenerated it, the embedded copy matches the .wgsl source (I diffed them — identical).
The native C++ test exercises the int64→int32 narrowing path end-to-end by feeding an int64 tensor at the boundary.

Question

The native test (test_webgpu_native.cpp) only runs the small shape (embed=64). The export helper supports llama1b (embed=2048, group=64) and the Python test_golden_matches_eager covers it, but the native GPU kernel is never run at that shape. Was that intentional (keep CI light), or worth adding so the per-group scale indexing is validated on-device too?

[ghstack-poisoned]

JulianCloudNTH · 2026-06-15T21:59:14Z

@claude review

claude · 2026-06-15T22:00:06Z

Claude finished @JulianCloudNTH's task in 3m 45s

Code Review: `et_vk.embedding_q4gsw` (WebGPU)

Read PR diff and changed files
Review op implementation (EmbeddingQ4gsw.cpp + WGSL)
Verify dequant convention against the Vulkan reference
Review host-side input infra (WebGPUGraph, copy_inputs, InputData)
Re-check the prior review's findings against current HEAD

This is a clean, well-guarded addition. I re-verified the dequant math against the Vulkan flat-buffer reference and it matches exactly; the up-front validation discipline is good. Notes below, including which items from the earlier review are now resolved.

✅ Resolved since the last review

copy_inputs now inside the try/catch. WebGPUBackend.cpp:110-117 moves graph->copy_inputs(inputs) into the guarded block, so the new fail-loud std::runtime_error branch in WebGPUGraph::copy_inputs is converted to Error::Internal and never crosses the backend boundary. The previous 🔴 is fixed.
Out-of-range index behavior is now documented. embedding_q4gsw.wgsl:30 carries the // token assumed in-range (mirrors Vulkan; no vocab clamp) comment, which is what the earlier 🟡 asked for. Still a silent-zero on OOB (WebGPU bounds-checks to 0 rather than faulting), but it now matches the Vulkan contract explicitly.

✔️ Dequant convention verified against Vulkan

I diffed the unpacking against backends/vulkan/runtime/graph/ops/glsl/embedding_q4gsw.glsl (flat-buffer / non-linear path):

Byte→nibble: even dim = high nibble (b >> 4), odd dim = low nibble (b & 0xF) — matches Vulkan's b0>>4 / b0&0xF pairing.
q = (nibble - 8) * scale matches the +8-shifted pack.
row_byte_base = token * (embed_dim/2) lines up with Vulkan's embedding_idx * blocks_per_row of 16-byte blocks, and little-endian byte addressing (word[byte_idx>>2], shift (byte_idx&3)*8) reads the same byte sequence as Vulkan's ivec4 indexing.
Scale index token * groups_per_row + dim / group_size matches Vulkan exactly.
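The word/shift addressing in the third point can be checked on the host with a short sketch. `load_byte` is a hypothetical helper that models the packed weight buffer as an array of 32-bit words, as a WGSL `array<u32>` would.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper mirroring the addressing quoted above:
// byte i lives in word i >> 2 at bit offset (i & 3) * 8.
uint8_t load_byte(const std::vector<uint32_t>& words, uint32_t byte_idx) {
  assert((byte_idx >> 2) < words.size());
  const uint32_t word = words[byte_idx >> 2];
  const uint32_t shift = (byte_idx & 3u) * 8u;
  return static_cast<uint8_t>((word >> shift) & 0xFFu);
}
```

With words `{0x44332211, 0x88776655}`, byte indices 0 through 7 yield 0x11, 0x22, ..., 0x88 in order, which is the little-endian byte sequence a byte-addressed read of the same memory would see.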

The embedded embedding_q4gsw_wgsl.h is byte-identical to the .wgsl source (sha256 banner present).

🟡 `total_blocks` truncated to `uint32` without a guard (still open)

EmbeddingQ4gsw.cpp:110-111 computes total_blocks as uint64_t, then :139 passes static_cast<uint32_t>(total_blocks) to compute_1d_workgroup_count and :148 stores it into the uint32_t field. If total_blocks > UINT32_MAX the cast silently truncates before the dispatch-limit check, so the fail-loud invariant the rest of this function maintains is bypassed for very large gathers. All current shapes are tiny, so it's theoretical — but a one-line > UINT32_MAX guard alongside the other byte/shape checks would keep it consistent.


🟢 Per-execute allocation in the narrowing path (non-blocking)

WebGPUGraph.cpp:522 allocates a fresh std::vector<int32_t> narrowed(numel) on every execute for the int64→int32 case. Fine for correctness; if embedding-gather lands on a hot path, a reusable scratch buffer would avoid the per-call allocation. Not blocking.
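A reusable scratch buffer could look roughly like this. It is a minimal sketch: `NarrowingScratch` is a hypothetical class, not part of the PR, and it assumes a single thread calls `narrow` at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scratch holder; the PR allocates a fresh vector per call.
class NarrowingScratch {
 public:
  // Narrows numel int64 values into a reused int32 buffer and returns it.
  // The pointer stays valid until the next call to narrow().
  const int32_t* narrow(const int64_t* src, size_t numel) {
    assert(src != nullptr || numel == 0);
    // resize() only reallocates when numel exceeds the current capacity.
    buf_.resize(numel);
    for (size_t i = 0; i < numel; ++i) {
      buf_[i] = static_cast<int32_t>(src[i]);
    }
    return buf_.data();
  }

  size_t capacity() const { return buf_.capacity(); }

 private:
  std::vector<int32_t> buf_;
};
```

After the first call at the largest input size, later calls reuse the same allocation.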

🟢 Minor: `Params.num_indices` is unused in the WGSL

The uniform carries num_indices (set at EmbeddingQ4gsw.cpp:144) but the shader never reads it — the kernel derives indices_idx from block / blocks_per_row instead. Harmless (it's part of the 32-byte std140 layout), just dead in the shader. Worth a comment or dropping into _pad if you want the struct to reflect only what's consumed.
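For orientation, a 32-byte uniform of this kind might be laid out as below. This is a sketch only: `num_indices`, `total_blocks`, and `_pad` are named in the review, while the struct name, the other four fields, and their order are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout: eight u32 scalars packed at 4-byte offsets.
struct EmbeddingParamsSketch {
  uint32_t embed_dim;       // hypothetical field
  uint32_t group_size;      // hypothetical field
  uint32_t blocks_per_row;  // hypothetical field
  uint32_t groups_per_row;  // hypothetical field
  uint32_t num_indices;     // named in the review; unread by the shader
  uint32_t total_blocks;    // named in the review
  uint32_t _pad[2];         // explicit padding up to 32 bytes
};
static_assert(sizeof(EmbeddingParamsSketch) == 32, "uniform must be 32 bytes");
static_assert(offsetof(EmbeddingParamsSketch, total_blocks) == 20,
              "host offsets must match the shader-side struct");

// Value-initializes the struct so the padding words are deterministic.
EmbeddingParamsSketch make_params(
    uint32_t embed_dim,
    uint32_t group_size,
    uint32_t num_indices,
    uint32_t total_blocks) {
  assert(group_size > 0 && embed_dim % group_size == 0 && embed_dim % 32 == 0);
  EmbeddingParamsSketch p{};
  p.embed_dim = embed_dim;
  p.group_size = group_size;
  p.blocks_per_row = embed_dim / 32;
  p.groups_per_row = embed_dim / group_size;
  p.num_indices = num_indices;
  p.total_blocks = total_blocks;  // assumed already range-checked by the caller
  return p;
}
```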

Note on tests

This PR's diff (7 files) contains no test files. The test_webgpu_native.cpp / test_golden_matches_eager coverage referenced earlier appears to live elsewhere in the ghstack (#20264 / #20289 / #20290), not here. Flagging only to confirm that landing the op and its tests separately is intentional and not an omission. If the native test only exercises the small shape (embed=64), it's still worth running the llama1b shape (embed=2048, group=64) on-device at least once so the per-group scale indexing is validated on GPU and not just in the Python golden.

Looks good

static_assert(sizeof(EmbeddingParams) == 32) + explicit _pad for std140 alignment.
All shape/byte validation runs before any wgpuDeviceCreate*, so failures don't leak GPU objects.
wg_size clamped to the device limit and the 1D dispatch count validated against maxComputeWorkgroupsPerDimension before allocation.
The int64→int32 narrowing path is well-reasoned: byte-for-byte fast path, exact nbytes == tensor.nbytes * 2 width check for the cast, and a fail-loud throw otherwise.
• Branch: gh/JulianCloudNTH/25/head
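The workgroup-size clamp and 1D dispatch validation noted in the list above can be sketched as follows. The helper names and signatures here are hypothetical and are not the `WebGPUUtils` API; the device limits are passed in as plain integers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical helper: caps the requested workgroup size at the device limit.
uint32_t clamp_wg_size(uint32_t requested, uint32_t device_max) {
  assert(requested > 0 && device_max > 0);
  return std::min(requested, device_max);
}

// Hypothetical helper: ceiling-divides threads into workgroups and throws
// when the count exceeds the per-dimension dispatch limit.
uint32_t workgroup_count_1d(
    uint32_t total_threads, uint32_t wg_size, uint32_t max_groups_per_dim) {
  assert(wg_size > 0);
  // Written as quotient plus remainder so the sum cannot wrap in 32 bits.
  const uint32_t count =
      total_threads / wg_size + (total_threads % wg_size != 0 ? 1u : 0u);
  if (count > max_groups_per_dim) {
    throw std::runtime_error("1D dispatch exceeds maxComputeWorkgroupsPerDimension");
  }
  return count;
}
```

Running both checks before creating any GPU object keeps a failed validation from leaving half-built pipeline state behind.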

[ghstack-poisoned]

#20292) Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom):

* __->__ #20292
* #20265
* #20290
* #20264
* #20289
* #20263

Test suite for the `et_vk.prepack` constant-materialization op, split into its own diff (op below, tests above) per the per-op test-split convention.

The prepack op is how a serialized constant becomes a GPU tensor: the constant arrives as a CPU-side reference (sizes + a pointer into the .pte bytes), and the prepack node is the sole materialization, a single CPU->GPU transfer straight into the consumer's buffer. The model `M(x) = x + w` (w a constant) routes `w` through a prepack node, so the delegate must run the materialization for the output to equal `x + w` rather than `x + 0`.

@exported-using-ghexport

Differential Revision: [D108678631](https://our.internmc.facebook.com/intern/diff/D108678631/)

@JulianCloudNTH

…tric quantized embedding) (#20414)

This PR was created by the merge bot to help merge the original PR into the main branch.

ghstack PR number: #20263 by @JulianCloudNTH
^ Please use this as the source of truth for the PR details, comments, and reviews

ghstack PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/25/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/25/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/25/orig

@diff-train-skip-merge

---------

Co-authored-by: Julian Ng-Thow-Hing <juliannth@meta.com>

Update

0d9b542

[ghstack-poisoned]

JulianCloudNTH requested review from kirklandsign and larryliu0820 as code owners June 13, 2026 00:08

meta-cla Bot added the CLA Signed label Jun 13, 2026

Update

819232f

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 15, 2026 17:23 — with GitHub Actions Inactive

meta-codesync Bot added the meta-exported label Jun 15, 2026

Update

9d3f3f1

[ghstack-poisoned]

This was referenced Jun 15, 2026

[ExecuTorch][WebGPU] et_vk.embedding_q4gsw test suite (export + native golden) #20289

Merged

[ExecuTorch][WebGPU] et_vk.apply_rotary_emb test suite (export + native golden) #20290

Merged

JulianCloudNTH temporarily deployed to cadence June 15, 2026 21:53 — with GitHub Actions Inactive

Update

0943c93

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 15, 2026 22:25 — with GitHub Actions Inactive

JulianCloudNTH mentioned this pull request Jun 15, 2026

[ExecuTorch][WebGPU] et_vk.prepack test suite (export + native golden) #20292

Merged

psiddh approved these changes Jun 21, 2026


Update

7dea4c4

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 22, 2026 05:05 — with GitHub Actions Inactive

meta-codesync Bot merged commit f3d16c3 into gh/JulianCloudNTH/25/base Jun 22, 2026
175 of 183 checks passed

meta-codesync Bot deleted the gh/JulianCloudNTH/25/head branch June 22, 2026 06:46

meta-codesync Bot temporarily deployed to cherry-pick-bot June 22, 2026 06:46 Inactive

pytorchbot mentioned this pull request Jun 22, 2026

[ExecuTorch][WebGPU] Add et_vk.embedding_q4gsw (4-bit groupwise-symmetric quantized embedding) #20414

Merged

meta-codesync[bot] merged 5 commits into gh/JulianCloudNTH/25/base from gh/JulianCloudNTH/25/head
