[ExecuTorch][WebGPU] Add slice_copy op (aten.slice_copy.Tensor) by JulianCloudNTH · Pull Request #20394 · pytorch/executorch

JulianCloudNTH · 2026-06-18T21:35:42Z

Stack from ghstack (oldest at bottom):

Adds `aten.slice_copy.Tensor` to the WebGPU delegate as a gather: each output element is mapped back to its source input element along the sliced dim via `start + coord * step`.

Composition (single compute dispatch):

- `runtime/ops/slice/Slice.cpp`: reads `args = [self, dim, start, end, step, out]` via `read_scalar` (static `Int`/`Null`-sentinel default; throws on dynamic `SymInt`); normalizes negative `dim`/`start`, clamps `start` to `[0, in_size]`; builds two `TensorMeta` UBOs + a `SliceParams{dim, start, step}` uniform; guards fp32; dispatches over `compute_1d_workgroup_count(out.numel)` with `override wg_size`; releases all uniforms after the bind group.
- `runtime/ops/slice/slice.wgsl`: delinearizes the output index over the contiguous output strides, maps the sliced-dim coordinate back to the input (`start + coord*step`), relinearizes over the input strides.
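The gather described above can be sketched on the CPU as follows. This is a hypothetical reference, not the code in `Slice.cpp` or `slice.wgsl`: the names `contiguous_strides` and `slice_gather` are invented for illustration, and it assumes contiguous row-major input and output and already-normalized `dim`, `start`, and `step`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major (contiguous) strides for a shape. Illustrative helper, not from the PR.
std::vector<uint32_t> contiguous_strides(const std::vector<uint32_t>& sizes) {
  std::vector<uint32_t> strides(sizes.size(), 1);
  for (size_t d = sizes.size(); d-- > 1;) {
    strides[d - 1] = strides[d] * sizes[d];
  }
  return strides;
}

// Hypothetical CPU reference of the slice gather: one loop iteration per
// output element, mirroring one shader invocation per output element.
std::vector<float> slice_gather(
    const std::vector<float>& input,
    const std::vector<uint32_t>& in_sizes,
    const std::vector<uint32_t>& out_sizes,
    uint32_t dim,
    uint32_t start,
    uint32_t step) {
  const std::vector<uint32_t> in_strides = contiguous_strides(in_sizes);
  const std::vector<uint32_t> out_strides = contiguous_strides(out_sizes);
  uint32_t out_numel = 1;
  for (uint32_t s : out_sizes) {
    out_numel *= s;
  }
  std::vector<float> output(out_numel);
  for (uint32_t out_idx = 0; out_idx < out_numel; ++out_idx) {
    uint32_t rem = out_idx;
    uint32_t in_idx = 0;
    for (size_t d = 0; d < out_sizes.size(); ++d) {
      // Delinearize: recover the output coordinate along dim d.
      uint32_t coord = rem / out_strides[d];
      rem %= out_strides[d];
      // Only the sliced dim is remapped; other coordinates pass through.
      if (d == dim) {
        coord = start + coord * step;
      }
      // Relinearize over the input strides.
      in_idx += coord * in_strides[d];
    }
    output[out_idx] = input[in_idx];
  }
  return output;
}
```

For a 2x8 input holding 0..15, `slice_gather(in, {2, 8}, {2, 4}, 1, 0, 2)` corresponds to `x[:, 0:8:2]` and yields 0, 2, 4, 6, 8, 10, 12, 14.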
@exported-using-ghexport

Differential Revision: D108793168

[ghstack-poisoned]

pytorch-bot · 2026-06-18T21:35:46Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20394

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 2 Pending, 2 Unrelated Failures

As of commit fad5528 with merge base e03f777:

NEW FAILURES - The following jobs have failed:

Propose to merge ghstack orig PRs to main / Try to create a PR with ghstack /orig branch (gh)
Process completed with exit code 1.
pull / test-qnn-models-linux (mv3) / linux-job (gh)
RuntimeError: Command docker exec -t 31e2c1d82991b5e4eb9e96dd3ca4e65d4d335c38da2d5d21062292477877e9a4 /exec failed with exit code 92
pull / unittest-buck / linux / linux-job (gh)
RuntimeError: Command docker exec -t 9f8438410ffc2d982eabb4986e0a924b1d68be590149764f3a4535f63caaf3d2 /exec failed with exit code 3

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (trunk failure)
pull / unittest-buck / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 3

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-18T21:36:20Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

JulianCloudNTH · 2026-06-18T21:54:06Z

@claude review

claude · 2026-06-18T21:54:51Z

Claude finished @JulianCloudNTH's task in 2m 27s

Code Review: `slice_copy` WebGPU op

Read changed files
Check conventions vs sibling ops & TensorMeta
Analyze correctness, edge cases, style
Post review

Reviewed against the sibling select op (runtime/ops/select/), which this closely and correctly mirrors. The gather approach is sound: contiguous-output delinearization → map sliced-dim coord via start + coord*step → relinearize over input strides. The std140 TensorMeta UBO usage, fp32 guard, workgroup sizing, uniform lifetime handling, and bind-group layout all match the established pattern. Nice work — the negative-dim/start normalization and the Int/Null/SymInt sentinel handling in read_scalar are handled carefully.

A few points below; only the first is a real (if low-likelihood) gap.

Suggestions

1. step is not validated — a non-positive value silently misbehaves (Slice.cpp:73, 86-88)
step is read and cast straight to uint32_t. The other parameters get explicit range checks (dim out of range, start clamp), but step doesn't. A negative step would wrap to a huge u32 and produce out-of-bounds reads in the shader (there is no bounds check on in_bufi), and a step of 0 would make every output element along the sliced dim read the same input element. PyTorch and the partitioner guarantee step >= 1 for slice_copy.Tensor, so this check is defensive, but it's cheap and matches the surrounding validation style:

const int64_t step = read_scalar(graph, args.at(4), 1, "step");
if (step < 1) {
  throw std::runtime_error("slice: step must be >= 1");
}
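To make the failure mode concrete, here is a small standalone sketch of what the narrowing cast does with and without the guard. The helper names `unchecked_step` and `checked_step` are hypothetical and exist only for this illustration; the real code reads `step` through `read_scalar`.

```cpp
#include <cstdint>
#include <stdexcept>

// What an unguarded narrowing cast produces. Conversion to an unsigned type
// is modular, so -1 becomes 4294967295 rather than an error.
uint32_t unchecked_step(int64_t step) {
  return static_cast<uint32_t>(step);
}

// Hypothetical guarded variant: reject non-positive steps before narrowing.
uint32_t checked_step(int64_t step) {
  if (step < 1) {
    throw std::runtime_error("slice: step must be >= 1");
  }
  return static_cast<uint32_t>(step);
}
```

With these definitions, `unchecked_step(-1)` returns 4294967295 and `unchecked_step(0)` returns 0, while `checked_step` throws for both.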

2. No bounds check on in_bufi in the shader (slice.wgsl:39)
The computed input index is used to read input[in_bufi] without a guard. This relies entirely on AOT-correct shapes + params (1 above). That's consistent with select.wgsl, so it's acceptable as-is — just noting the invariant the C++ side must uphold.
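The invariant can be checked on the host with a short sketch. It assumes PyTorch-style slice semantics (negative `start`/`end` wrap by the dim size, both clamp to `[0, in_size]`, `step >= 1`); the `end` handling and the names `SliceRange`, `normalize_slice`, and `max_input_coord` are assumptions for illustration, not taken from `Slice.cpp`.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative container for a normalized slice along one dim.
struct SliceRange {
  int64_t start;
  int64_t end;
  int64_t length;
};

// Assumed PyTorch-style normalization; requires step >= 1.
SliceRange normalize_slice(int64_t in_size, int64_t start, int64_t end, int64_t step) {
  if (start < 0) start += in_size;
  if (end < 0) end += in_size;
  start = std::max<int64_t>(0, std::min(start, in_size));
  end = std::max(start, std::min(end, in_size));
  // Ceiling division: number of elements visited by start, start+step, ...
  const int64_t length = (end - start + step - 1) / step;
  return {start, end, length};
}

// Largest input coordinate the gather reads along the sliced dim,
// or -1 when the slice is empty.
int64_t max_input_coord(const SliceRange& r, int64_t step) {
  return r.length == 0 ? -1 : r.start + (r.length - 1) * step;
}
```

Because `length` is the ceiling of `(end - start) / step`, `(length - 1) * step` is strictly less than `end - start`, so the largest coordinate stays below `end` and therefore below `in_size`. That is the property the shader relies on when it reads without a bounds check.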

3. Test coverage and partitioner allowlist appear to be missing from this PR

aten.slice_copy.Tensor is not added to WEBGPU_SUPPORTED_OPS in backends/webgpu/test/tester.py (cf. select_copy.int at line 26).
No SliceModule/_SLICE_CONFIGS op-test was added — the _fn_config_suite builder in op_tests/cases.py:149 already lists slice in its docstring (view/select/slice), so wiring one up should be a near-copy of the select suite.

If those land in another PR of this ghstack, ignore this — but the runtime op isn't exercised by the op-test framework without them.

Nits

The bind-group / pipeline / layout boilerplate (Slice.cpp:106-182) is byte-for-byte identical to Select.cpp. This is the existing per-op convention, so duplicating it is the right call here, but the WebGPU backend may eventually want a shared helper as the op count grows. No action needed.
slice_wgsl.h carries a wgsl-sha256 — make sure test_wgsl_codegen.py passes (it should, since the header matches slice.wgsl).

Overall: clean, idiomatic, and consistent with the backend. The only thing I'd actually change before merge is the step guard (#1).
· branch gh/JulianCloudNTH/43/head

[ghstack-poisoned]

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * #20397 * #20396 * #20395 * #20394 * #20393 * #20392 * #20391 * #20390 * #20363 * __->__ #20362 * #20361 * #20360 * #20359

Adds `aten.select_copy.int` to the WebGPU delegate as a gather: picks a fixed index along one dim, producing an output of rank (input rank - 1).

Composition (single dispatch):

- `select/Select.cpp`: reads `[self, dim, index, out]` (static `Int` via `read_scalar`; throws on dynamic `SymInt`), normalizes + bounds-checks dim/index, builds 2 `TensorMeta` UBOs + a `SelectParams{dim,index}`, fp32 guard, 1D-dispatch over `numel`, releases uniforms after the bind group.
- `select/select.wgsl`: seeds the input offset with `index * in.strides[dim]`, delinearizes the output index, maps each out dim to its in dim (shifted past the selected dim), relinearizes on input strides.

@exported-using-ghexport

Differential Revision: [D108793166](https://our.internmc.facebook.com/intern/diff/D108793166/)
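As a rough illustration of the select index mapping in this commit message, the following CPU sketch seeds the input offset with `index * in_strides[dim]` and shifts output dims past the selected one. The function names are hypothetical, and contiguous row-major layout is assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major (contiguous) strides for a shape. Illustrative helper.
std::vector<uint32_t> contiguous_strides(const std::vector<uint32_t>& sizes) {
  std::vector<uint32_t> strides(sizes.size(), 1);
  for (size_t d = sizes.size(); d-- > 1;) {
    strides[d - 1] = strides[d] * sizes[d];
  }
  return strides;
}

// Hypothetical CPU reference of select along `dim` at a normalized `index`.
std::vector<float> select_gather(
    const std::vector<float>& input,
    const std::vector<uint32_t>& in_sizes,
    uint32_t dim,
    uint32_t index) {
  const std::vector<uint32_t> in_strides = contiguous_strides(in_sizes);
  // Output shape is the input shape with `dim` removed.
  std::vector<uint32_t> out_sizes;
  for (size_t d = 0; d < in_sizes.size(); ++d) {
    if (d != dim) out_sizes.push_back(in_sizes[d]);
  }
  const std::vector<uint32_t> out_strides = contiguous_strides(out_sizes);
  uint32_t out_numel = 1;
  for (uint32_t s : out_sizes) out_numel *= s;

  std::vector<float> output(out_numel);
  for (uint32_t out_idx = 0; out_idx < out_numel; ++out_idx) {
    // Seed the offset with the fixed index along the selected dim.
    uint32_t in_idx = index * in_strides[dim];
    uint32_t rem = out_idx;
    for (size_t d = 0; d < out_sizes.size(); ++d) {
      const uint32_t coord = rem / out_strides[d];
      rem %= out_strides[d];
      // Output dims at or after `dim` map to the next input dim.
      const size_t in_d = d < dim ? d : d + 1;
      in_idx += coord * in_strides[in_d];
    }
    output[out_idx] = input[in_idx];
  }
  return output;
}
```

For a 2x3 input holding 0..5, `select_gather(in, {2, 3}, 1, 2)` returns 2, 5 and `select_gather(in, {2, 3}, 0, 1)` returns 3, 4, 5.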

…ework) (#20363)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * #20397 * #20396 * #20395 * #20394 * #20393 * #20392 * #20391 * #20390 * __->__ #20363 * #20362 * #20361 * #20360 * #20359

Registers `aten.select_copy.int` in the `cases.py` op-test framework: a `_select_suite` of 4 configs (leading/middle/last dim + negative index) that `generate_op_tests` exports and compares to a torch golden on Dawn. Also adds `test/ops/select/test_select.py` (`SelectModule` + `CONFIGS` + an export-delegation/eager smoke test) and the `aten.select_copy.int` partitioner-allowlist entry in `tester.py`.

@exported-using-ghexport

Differential Revision: [D108793161](https://our.internmc.facebook.com/intern/diff/D108793161/)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * #20397 * #20396 * #20395 * #20394 * #20393 * #20392 * #20391 * __->__ #20390 * #20363 * #20362 * #20361 * #20360 * #20359

Adds `aten.sigmoid.default` to the WebGPU delegate: element-wise `1/(1+exp(-x))` over a flat fp32 buffer. On the Llama critical path (`F.silu` -> `sigmoid` + `mul`).

Composition (single dispatch):

- `sigmoid/UnaryOp.cpp`: binds input (storage, read-only) + output (storage) + a `Params{num_elements}` uniform, 1D-dispatches over `num_elements` with `override wg_size` (clamped to the device limit); mirrors the `add` op (uniform mapped-at-creation, released after the bind group).
- `sigmoid/sigmoid.wgsl`: guards `idx >= num_elements` and writes the logistic of each element.

@exported-using-ghexport

Differential Revision: [D108793157](https://our.internmc.facebook.com/intern/diff/D108793157/)

…k) (#20391)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * #20397 * #20396 * #20395 * #20394 * #20393 * #20392 * __->__ #20391 * #20390 * #20363 * #20362 * #20361 * #20360 * #20359

Registers `aten.sigmoid.default` in the `cases.py` op-test framework: a `_sigmoid_suite` (hard-coded shapes + a saturation case over a `linspace(-12, 12)` input) that `generate_op_tests` exports and compares to an fp64 torch golden on Dawn. Also adds `test/ops/sigmoid/test_sigmoid.py` (`SigmoidModule` + `N` + `_det_input` + an export-delegation/eager smoke test) and the `aten.sigmoid.default` partitioner-allowlist entry in `tester.py`.

@exported-using-ghexport

Differential Revision: [D108793159](https://our.internmc.facebook.com/intern/diff/D108793159/)

…20392)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * #20397 * #20396 * #20395 * #20394 * #20393 * __->__ #20392 * #20391 * #20390 * #20363 * #20362 * #20361 * #20360 * #20359

Adds `aten.squeeze_copy.dims` and `aten.unsqueeze_copy.default` to the WebGPU delegate. Both are numel-preserving shape ops; on a dense row-major buffer backend they are the same flat copy as `view_copy`, and only the shape metadata differs (mirrors the Vulkan delegate, which routes both through `add_view_copy_node`).

Composition (no new kernel):

- `squeeze/Squeeze.cpp`: reads `args = [self, dims, out]`, ignores the AOT-fixed `dims`, calls `add_flat_copy(graph, in, out)` from `runtime/ops/view_copy/view_copy.h`.
- `unsqueeze/Unsqueeze.cpp`: reads `args = [self, dim, out]`, ignores the AOT-fixed `dim`, calls `add_flat_copy(graph, in, out)`.

@exported-using-ghexport

Differential Revision: [D108793153](https://our.internmc.facebook.com/intern/diff/D108793153/)

….py op-test framework) (#20393)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * #20397 * #20396 * #20395 * #20394 * __->__ #20393 * #20392 * #20391 * #20390 * #20363 * #20362 * #20361 * #20360 * #20359

Registers `aten.squeeze_copy.dims` and `aten.unsqueeze_copy.default` in the `cases.py` op-test framework: a `_squeeze_suite` of 3 configs (squeeze leading/middle/multiple size-1 dims) and a `_unsqueeze_suite` of 3 configs (insert dim at front/middle/last) that `generate_op_tests` exports via `VulkanPartitioner` and compares to a torch golden on Dawn. Also adds `test/ops/squeeze/test_squeeze.py` (`SqueezeModule` + `CONFIGS` + `_op_delegated` smoke test), `test/ops/unsqueeze/test_unsqueeze.py` (`UnsqueezeModule` + `CONFIGS` + `_op_delegated` smoke test), and the two partitioner-allowlist entries in `tester.py`.

@exported-using-ghexport

Differential Revision: [D108793152](https://our.internmc.facebook.com/intern/diff/D108793152/)

…work) (#20395)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * #20397 * #20396 * __->__ #20395 * #20394 * #20393 * #20392 * #20391 * #20390 * #20363 * #20362 * #20361 * #20360 * #20359

Registers `aten.slice_copy.Tensor` in the `cases.py` op-test framework: a `_slice_suite` of 4 configs (leading-dim slice `[:,1:5]`, last-dim slice `[...,1:3]`, step-2 `[:,0:8:2]`, negative-end `[:,1:-1]`) that `generate_op_tests` exports via `VulkanPartitioner` and compares to a torch golden on Dawn. Also adds `test/ops/slice/test_slice.py` (`SliceModule` + `CONFIGS` + export-delegation/eager smoke test) and the `aten.slice_copy.Tensor` partitioner-allowlist entry in `tester.py`.

@exported-using-ghexport

Differential Revision: [D108793151](https://our.internmc.facebook.com/intern/diff/D108793151/)

…ermute_copy.default) (#20396)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * #20397 * __->__ #20396 * #20395 * #20394 * #20393 * #20392 * #20391 * #20390 * #20363 * #20362 * #20361 * #20360 * #20359

Adds `aten.permute_copy.default` (a coordinate-reorder gather) to the WebGPU delegate, and the `IntList` graph value type it needs to read its `dims` argument.

Composition:

- `runtime/WebGPUGraph.{h,cpp}`: adds `ValueType::IntList` backed by `std::vector<std::vector<int64_t>> int_lists_` + `get_int_list(int)`; `build()` deserializes `vkgraph::GraphTypes::IntList` via `value_as_IntList()->items()` (int64, matching the FlatBuffer `[long]`); mirrors the existing scalar value plumbing.
- `runtime/ops/permute/Permute.cpp`: reads the permutation via `get_int_list`, normalizes negative dims, validates it is a permutation of `[0, ndim)`, builds two `TensorMeta` UBOs + a `PermuteParams{perm: vec4<u32>}` uniform, guards fp32 + rank≤4, dispatches over `compute_1d_workgroup_count(out.numel)` with `override wg_size`; releases all uniforms after the bind group.
- `runtime/ops/permute/permute.wgsl`: delinearizes the output index over the contiguous output strides, reads `input` at `in.strides[perm[d]]` per dim (mirrors Vulkan `permute_buffer.glsl`).
- Registers both `aten.permute_copy.default` and `aten.permute.default` to the same handler.

@exported-using-ghexport

Differential Revision: [D108793162](https://our.internmc.facebook.com/intern/diff/D108793162/)
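The permute mapping in this commit message can be sketched on the CPU in the same style: each output coordinate along dim `d` is multiplied by the input stride of dim `perm[d]`. The names below are hypothetical, `perm` is assumed to be an already-validated permutation, and contiguous row-major layout is assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major (contiguous) strides for a shape. Illustrative helper.
std::vector<uint32_t> contiguous_strides(const std::vector<uint32_t>& sizes) {
  std::vector<uint32_t> strides(sizes.size(), 1);
  for (size_t d = sizes.size(); d-- > 1;) {
    strides[d - 1] = strides[d] * sizes[d];
  }
  return strides;
}

// Hypothetical CPU reference of permute: output dim d takes its extent
// and its stride from input dim perm[d].
std::vector<float> permute_gather(
    const std::vector<float>& input,
    const std::vector<uint32_t>& in_sizes,
    const std::vector<uint32_t>& perm) {
  const std::vector<uint32_t> in_strides = contiguous_strides(in_sizes);
  std::vector<uint32_t> out_sizes(perm.size());
  for (size_t d = 0; d < perm.size(); ++d) {
    out_sizes[d] = in_sizes[perm[d]];
  }
  const std::vector<uint32_t> out_strides = contiguous_strides(out_sizes);
  uint32_t out_numel = 1;
  for (uint32_t s : out_sizes) out_numel *= s;

  std::vector<float> output(out_numel);
  for (uint32_t out_idx = 0; out_idx < out_numel; ++out_idx) {
    uint32_t rem = out_idx;
    uint32_t in_idx = 0;
    for (size_t d = 0; d < out_sizes.size(); ++d) {
      const uint32_t coord = rem / out_strides[d];
      rem %= out_strides[d];
      in_idx += coord * in_strides[perm[d]];
    }
    output[out_idx] = input[in_idx];
  }
  return output;
}
```

For a 2x3 input holding 0..5, `permute_gather(in, {2, 3}, {1, 0})` is a transpose and returns 0, 3, 1, 4, 2, 5.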

…mework) (#20397)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * __->__ #20397 * #20396 * #20395 * #20394 * #20393 * #20392 * #20391 * #20390 * #20363 * #20362 * #20361 * #20360 * #20359

Registers `aten.permute_copy.default` in the `cases.py` op-test framework: a `_permute_suite` of 4 configs (3D rotation, 4D middle-dim transpose, 2D transpose, full 4D shuffle) that `generate_op_tests` exports via `VulkanPartitioner` and compares to a torch golden on Dawn. Also adds `test/ops/permute/test_permute.py` (`PermuteModule` + `CONFIGS` + `_op_delegated` smoke test) and the `aten.permute_copy.default` partitioner-allowlist entry in `tester.py`.

@exported-using-ghexport

Differential Revision: [D108793156](https://our.internmc.facebook.com/intern/diff/D108793156/)

Pull Request resolved: pytorch#20394

Adds `aten.slice_copy.Tensor` to the WebGPU delegate as a gather: each output element is mapped back to its source input element along the sliced dim via `start + coord * step`.

Composition (single compute dispatch):

- `runtime/ops/slice/Slice.cpp`: reads `args = [self, dim, start, end, step, out]` via `read_scalar` (static `Int`/`Null`-sentinel default; throws on dynamic `SymInt`); normalizes negative `dim`/`start`, clamps `start` to `[0, in_size]`; builds two `TensorMeta` UBOs + a `SliceParams{dim, start, step}` uniform; guards fp32; dispatches over `compute_1d_workgroup_count(out.numel)` with `override wg_size`; releases all uniforms after the bind group.
- `runtime/ops/slice/slice.wgsl`: delinearizes the output index over the contiguous output strides, maps the sliced-dim coordinate back to the input (`start + coord*step`), relinearizes over the input strides.

ghstack-source-id: 397026527

@exported-using-ghexport

Differential Revision: [D108793168](https://our.internmc.facebook.com/intern/diff/D108793168/)

### Summary

Manual merge of four WebGPU-delegate op PRs that landed internally but could not auto-merge to `main`. These are stacked ghstack PRs: when the lower PRs in the stack merged, their head branches were deleted and these four PRs' base branches were orphaned, so the orig-PR proposer failed with `422 base invalid`.

This PR re-lands the same four commits (identical content to the originals, flat test layout) as a clean stack on top of current `main`:

- [#20394](#20394): Add `slice_copy` op (`aten.slice_copy.Tensor`)
- [#20395](#20395): `slice_copy` op test suite (cases.py op-test framework)
- [#20396](#20396): Add `permute_copy` + `IntList` graph support (`aten.permute_copy.default`)
- [#20397](#20397): `permute_copy` op test suite (cases.py op-test framework)

### Test plan

Each op ships with its `cases.py` op-test suite (exported via `VulkanPartitioner`, compared to a torch golden on Dawn) plus an export-delegation smoke test, exercised by the WebGPU op-test CI (`etvk-*`). Verified internally; content is identical to the original four PRs.

@diff-train-skip-merge

Update

408569b

[ghstack-poisoned]

JulianCloudNTH requested review from kirklandsign and larryliu0820 as code owners June 18, 2026 21:35

JulianCloudNTH temporarily deployed to cadence June 18, 2026 21:36 — with GitHub Actions Inactive

This was referenced Jun 18, 2026

[ExecuTorch][WebGPU] mul op test suite (cases.py op-test framework) #20359

Merged

[ExecuTorch][WebGPU] Add view_copy op (aten.view_copy.default) #20360

Merged

meta-cla Bot added the CLA Signed label Jun 18, 2026

Update

92bf354

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 18, 2026 22:25 — with GitHub Actions Inactive

meta-codesync Bot added the meta-exported label Jun 18, 2026

Update

1cc010f

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 22, 2026 20:40 — with GitHub Actions Inactive

JulianCloudNTH mentioned this pull request Jun 22, 2026

[ExecuTorch][WebGPU] Flatten landed-op test dirs to test/ops/test_<op>.py #20435

Open

Update

4b0352a

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 23, 2026 20:34 — with GitHub Actions Inactive

This was referenced Jun 23, 2026

[ExecuTorch][WebGPU] Add clone op (aten.clone.default) #20463

Open

[ExecuTorch][WebGPU] Add aten.index.Tensor (1D-self gather) #20464

Open

[ExecuTorch][WebGPU] aten.index.Tensor test suite (export + native golden) #20465

Open

Update

233332b

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 23, 2026 22:25 — with GitHub Actions Inactive

Update

fad5528

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 25, 2026 17:24 — with GitHub Actions Inactive

psiddh approved these changes Jun 26, 2026

JulianCloudNTH merged commit 284d8a4 into gh/JulianCloudNTH/43/base Jun 26, 2026
178 of 183 checks passed

JulianCloudNTH deleted the gh/JulianCloudNTH/43/head branch June 26, 2026 17:08

JulianCloudNTH had a problem deploying to cherry-pick-bot June 26, 2026 17:08 — with GitHub Actions Failure

JulianCloudNTH mentioned this pull request Jun 26, 2026

Manual merge of PRs #20394–#20397 (slice_copy + permute_copy) #20550

Merged

JulianCloudNTH merged 6 commits into gh/JulianCloudNTH/43/base from gh/JulianCloudNTH/43/head
