[ExecuTorch][WebGPU] Add view_copy op (aten.view_copy.default)#20360
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20360

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures as of commit d5d024e with merge base e03f777.

BROKEN TRUNK: the unrelated failures were also present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
da3f976 into gh/JulianCloudNTH/35/base
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * #20397 * #20396 * #20395 * #20394 * #20393 * #20392 * #20391 * #20390 * #20363 * __->__ #20362 * #20361 * #20360 * #20359

Adds `aten.select_copy.int` to the WebGPU delegate as a gather: picks a fixed index along one dim, producing an output of rank (input rank - 1).

Composition (single dispatch):
- `select/Select.cpp`: reads `[self, dim, index, out]` (static `Int` via `read_scalar`; throws on dynamic `SymInt`), normalizes and bounds-checks dim/index, builds 2 `TensorMeta` UBOs + a `SelectParams{dim,index}`, guards fp32, 1D-dispatches over `numel`, releases uniforms after the bind group.
- `select/select.wgsl`: seeds the input offset with `index * in.strides[dim]`, delinearizes the output index, maps each out dim to its in dim (shifted past the selected dim), relinearizes on input strides.

@exported-using-ghexport

Differential Revision: [D108793166](https://our.internmc.facebook.com/intern/diff/D108793166/)
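The index mapping described for `select.wgsl` can be sketched on the CPU as follows. This is a minimal reference model in plain Python, not the shader itself; `contiguous_strides` and `select_copy` are hypothetical helper names, and the input is assumed to be a dense row-major buffer.

```python
def contiguous_strides(shape):
    """Row-major strides, in elements, for a dense tensor of this shape."""
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    return strides


def select_copy(flat_in, in_shape, dim, index):
    """Hypothetical CPU model of the select gather: returns (flat_out, out_shape)."""
    ndim = len(in_shape)
    if dim < 0:
        dim += ndim  # normalize negative dim
    if not 0 <= dim < ndim:
        raise IndexError("dim out of range")
    if index < 0:
        index += in_shape[dim]  # normalize negative index
    if not 0 <= index < in_shape[dim]:
        raise IndexError("index out of range")

    in_strides = contiguous_strides(in_shape)
    out_shape = list(in_shape[:dim]) + list(in_shape[dim + 1:])
    out_strides = contiguous_strides(out_shape)
    numel = 1
    for s in out_shape:
        numel *= s

    flat_out = [0.0] * numel
    for out_idx in range(numel):
        # Seed the input offset with the fixed index along the selected dim.
        in_off = index * in_strides[dim]
        rem = out_idx
        for d in range(len(out_shape)):
            # Delinearize the output index one dim at a time.
            coord = rem // out_strides[d]
            rem %= out_strides[d]
            # Output dims at or past `dim` map to the next input dim.
            in_d = d if d < dim else d + 1
            in_off += coord * in_strides[in_d]
        flat_out[out_idx] = flat_in[in_off]
    return flat_out, out_shape
```

For example, `select_copy(list(range(24)), [2, 3, 4], 1, 2)` picks index 2 along the middle dim of a 2x3x4 tensor and yields a 2x4 result.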
…ework) (#20363)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * #20397 * #20396 * #20395 * #20394 * #20393 * #20392 * #20391 * #20390 * __->__ #20363 * #20362 * #20361 * #20360 * #20359

Registers `aten.select_copy.int` in the `cases.py` op-test framework: a `_select_suite` of 4 configs (leading/middle/last dim + negative index) that `generate_op_tests` exports and compares to a torch golden on Dawn. Also adds `test/ops/select/test_select.py` (`SelectModule` + `CONFIGS` + an export-delegation/eager smoke test) and the `aten.select_copy.int` partitioner-allowlist entry in `tester.py`.

@exported-using-ghexport

Differential Revision: [D108793161](https://our.internmc.facebook.com/intern/diff/D108793161/)
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * #20397 * #20396 * #20395 * #20394 * #20393 * #20392 * #20391 * __->__ #20390 * #20363 * #20362 * #20361 * #20360 * #20359

Adds `aten.sigmoid.default` to the WebGPU delegate: element-wise `1/(1+exp(-x))` over a flat fp32 buffer. On the Llama critical path (`F.silu` -> `sigmoid` + `mul`).

Composition (single dispatch):
- `sigmoid/UnaryOp.cpp`: binds input (storage, read-only) + output (storage) + a `Params{num_elements}` uniform, 1D-dispatches over `num_elements` with `override wg_size` (clamped to the device limit); mirrors the `add` op (uniform mapped-at-creation, released after the bind group).
- `sigmoid/sigmoid.wgsl`: guards `idx >= num_elements` and writes the logistic of each element.

@exported-using-ghexport

Differential Revision: [D108793157](https://our.internmc.facebook.com/intern/diff/D108793157/)
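As a rough illustration of the guard-then-write kernel and of why sigmoid sits on the `F.silu` path, here is a small CPU sketch in Python. The function names are hypothetical and the arithmetic is Python double precision rather than fp32, so it models the structure of the op, not its exact numerics.

```python
import math


def sigmoid_kernel(idx, inp, out, num_elements):
    """One simulated invocation: guard first, then write the logistic."""
    if idx >= num_elements:  # invocations past the end of the buffer do nothing
        return
    out[idx] = 1.0 / (1.0 + math.exp(-inp[idx]))


def sigmoid(inp):
    """Run the kernel once per element of a flat buffer."""
    out = [0.0] * len(inp)
    for idx in range(len(inp)):
        sigmoid_kernel(idx, inp, out, len(inp))
    return out


def silu(inp):
    """silu(x) = x * sigmoid(x), i.e. the sigmoid + mul pair."""
    return [x * s for x, s in zip(inp, sigmoid(inp))]
```

Note that `math.exp` raises `OverflowError` for very large negative inputs; the sketch assumes moderate magnitudes.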
…k) (#20391)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * #20397 * #20396 * #20395 * #20394 * #20393 * #20392 * __->__ #20391 * #20390 * #20363 * #20362 * #20361 * #20360 * #20359

Registers `aten.sigmoid.default` in the `cases.py` op-test framework: a `_sigmoid_suite` (hard-coded shapes + a saturation case over a `linspace(-12, 12)` input) that `generate_op_tests` exports and compares to an fp64 torch golden on Dawn. Also adds `test/ops/sigmoid/test_sigmoid.py` (`SigmoidModule` + `N` + `_det_input` + an export-delegation/eager smoke test) and the `aten.sigmoid.default` partitioner-allowlist entry in `tester.py`.

@exported-using-ghexport

Differential Revision: [D108793159](https://our.internmc.facebook.com/intern/diff/D108793159/)
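The idea of comparing an fp32 result against an fp64 golden over a saturating input range can be sketched without torch or a GPU. Everything below is an assumption made for illustration: the helper names, the 257-point grid, and the use of "compute in double, then round to fp32" as a stand-in for the device result. The real suite's shapes and tolerances are not stated in the description.

```python
import math
import struct


def to_fp32(x):
    """Round a Python float to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def linspace(lo, hi, n):
    """n evenly spaced points from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def sigmoid64(x):
    """Double-precision golden."""
    return 1.0 / (1.0 + math.exp(-x))


def max_abs_error(xs):
    """Largest gap between the fp32-rounded result and the fp64 golden."""
    worst = 0.0
    for x in xs:
        x32 = to_fp32(x)               # the device would see an fp32 input
        got = to_fp32(sigmoid64(x32))  # stand-in for the fp32 device output
        worst = max(worst, abs(got - sigmoid64(x32)))
    return worst
```

At the ends of the range the logistic is within about 1e-5 of 0 or 1, which is what makes this a useful saturation check.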
…20392)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * #20397 * #20396 * #20395 * #20394 * #20393 * __->__ #20392 * #20391 * #20390 * #20363 * #20362 * #20361 * #20360 * #20359

Adds `aten.squeeze_copy.dims` and `aten.unsqueeze_copy.default` to the WebGPU delegate. Both are numel-preserving shape ops; on a dense row-major buffer backend they are the same flat copy as `view_copy`, and only the shape metadata differs (mirrors the Vulkan delegate, which routes both through `add_view_copy_node`).

Composition (no new kernel):
- `squeeze/Squeeze.cpp`: reads `args = [self, dims, out]`, ignores the AOT-fixed `dims`, calls `add_flat_copy(graph, in, out)` from `runtime/ops/view_copy/view_copy.h`.
- `unsqueeze/Unsqueeze.cpp`: reads `args = [self, dim, out]`, ignores the AOT-fixed `dim`, calls `add_flat_copy(graph, in, out)`.

@exported-using-ghexport

Differential Revision: [D108793153](https://our.internmc.facebook.com/intern/diff/D108793153/)
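The claim that squeeze and unsqueeze reduce to the same flat copy can be checked with a short Python sketch: for a dense row-major tensor, walking the coordinates in order visits linear offsets 0, 1, 2, ... regardless of how many size-1 dims are inserted or removed. The helper names are hypothetical.

```python
from itertools import product


def contiguous_strides(shape):
    """Row-major strides, in elements, for a dense tensor of this shape."""
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    return strides


def linear_offsets(shape):
    """Buffer offset of every element, visited in row-major coordinate order."""
    strides = contiguous_strides(shape)
    return [
        sum(c * s for c, s in zip(coords, strides))
        for coords in product(*(range(n) for n in shape))
    ]
```

Since `linear_offsets([2, 3])`, `linear_offsets([2, 1, 3])`, and `linear_offsets([1, 2, 3, 1])` are all `[0, 1, 2, 3, 4, 5]`, element `i` of the output buffer is element `i` of the input buffer, which is exactly a flat copy.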
….py op-test framework) (#20393)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * #20397 * #20396 * #20395 * #20394 * __->__ #20393 * #20392 * #20391 * #20390 * #20363 * #20362 * #20361 * #20360 * #20359

Registers `aten.squeeze_copy.dims` and `aten.unsqueeze_copy.default` in the `cases.py` op-test framework: a `_squeeze_suite` of 3 configs (squeeze leading/middle/multiple size-1 dims) and a `_unsqueeze_suite` of 3 configs (insert dim at front/middle/last) that `generate_op_tests` exports via `VulkanPartitioner` and compares to a torch golden on Dawn. Also adds `test/ops/squeeze/test_squeeze.py` (`SqueezeModule` + `CONFIGS` + `_op_delegated` smoke test), `test/ops/unsqueeze/test_unsqueeze.py` (`UnsqueezeModule` + `CONFIGS` + `_op_delegated` smoke test), and the two partitioner-allowlist entries in `tester.py`.

@exported-using-ghexport

Differential Revision: [D108793152](https://our.internmc.facebook.com/intern/diff/D108793152/)
Pull Request resolved: #20360

**Add `aten.view_copy.default` as a native buffer-to-buffer DMA.** A contiguous reshape on the dense row-major buffer backend is a flat copy, so it needs no shader.

**Problem:** a reshape only relabels shape metadata; the bytes are unchanged. Launching a compute dispatch (shader module + pipeline + bind group + uniform) just to run `output[i] = input[i]` is wasted setup for every view/clone/squeeze/unsqueeze in the graph.

**Solution:**
- Before: a `view_copy.wgsl` compute kernel dispatched over `num_elements`, with its own pipeline/bind-group/uniform per copy.
- After: `add_flat_copy` records a `wgpuCommandEncoderCopyBufferToBuffer` DMA in graph order, with no shader, no pipeline, and no uniform.

**Implementation:**
- `WebGPUGraph` gains a `WebGPUDispatch::Kind::Copy` command + `add_buffer_copy(src, dst, nbytes)`; `execute()` emits the encoder-level copy between compute passes (both single-shot and chunked paths), preserving the existing per-pass read-after-write ordering.
- `add_flat_copy` (declared in `view_copy.h`, reused by the stacked clone/squeeze/unsqueeze) keeps the fail-loud guards (both tensors, fp32 4-byte alignment, equal `nbytes`) and treats an aliased in/out buffer as a no-op.
- Tensor buffers already carry `CopySrc | CopyDst`, so no usage-flag change is needed.
- Mirrors Vulkan `add_view_copy_node` (`backends/vulkan/runtime/graph/ops/impl/View.cpp`): Vulkan always dispatches `view_buffer.glsl`, but that shader is needed only to remap non-contiguous layouts, which the buffer-only WebGPU backend never produces, so the contiguous DMA is the equivalent path.

ghstack-source-id: 397026498

@exported-using-ghexport

Differential Revision: [D108793164](https://our.internmc.facebook.com/intern/diff/D108793164/)
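The recording scheme described above (guards at record time, copies replayed in graph order alongside compute work) can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the delegate's code: `Buffer`, the tuple-based command list, and these Python versions of `add_flat_copy` and `execute` are hypothetical stand-ins, and the guards are simplified to equal size, a 4-byte size multiple, and the aliasing no-op.

```python
import struct


class Buffer:
    """Hypothetical stand-in for a GPU buffer: a block of bytes."""

    def __init__(self, nbytes):
        self.data = bytearray(nbytes)


def add_flat_copy(commands, src, dst):
    """Record a whole-buffer copy after fail-loud checks."""
    if len(src.data) != len(dst.data):
        raise ValueError("flat copy requires equal nbytes")
    if len(src.data) % 4 != 0:
        raise ValueError("copy size must be a multiple of 4 bytes")
    if src is dst:
        return  # aliased in/out: nothing to copy, record nothing
    commands.append(("copy", (src, dst, len(src.data))))


def execute(commands):
    """Replay commands in recorded order so a copy sees earlier writes."""
    for kind, payload in commands:
        if kind == "dispatch":
            payload()  # simulated compute pass
        elif kind == "copy":
            src, dst, nbytes = payload
            dst.data[:nbytes] = src.data[:nbytes]


# Usage: a simulated compute pass writes src, then the copy reads it.
src, dst = Buffer(16), Buffer(16)
commands = [("dispatch", lambda: src.data.__setitem__(
    slice(0, 16), struct.pack("4f", 1.0, 2.0, 3.0, 4.0)))]
add_flat_copy(commands, src, dst)
execute(commands)
```

Because the copy is replayed after the dispatch that produced `src`, `dst` ends up holding the four floats, which is the read-after-write ordering the description says is preserved.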
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * #20397 * #20396 * #20395 * __->__ #20394 * #20393 * #20392 * #20391 * #20390 * #20363 * #20362 * #20361 * #20360 * #20359

Adds `aten.slice_copy.Tensor` to the WebGPU delegate as a gather: each output element is mapped back to its source input element along the sliced dim via `start + coord * step`.

Composition (single compute dispatch):
- `runtime/ops/slice/Slice.cpp`: reads `args = [self, dim, start, end, step, out]` via `read_scalar` (static `Int`/`Null`-sentinel default; throws on dynamic `SymInt`); normalizes negative `dim`/`start`, clamps `start` to `[0, in_size]`; builds two `TensorMeta` UBOs + a `SliceParams{dim, start, step}` uniform; guards fp32; dispatches over `compute_1d_workgroup_count(out.numel)` with `override wg_size`; releases all uniforms after the bind group.
- `runtime/ops/slice/slice.wgsl`: delinearizes the output index over the contiguous output strides, maps the sliced-dim coordinate back to the input (`start + coord*step`), relinearizes over the input strides.

@exported-using-ghexport

Differential Revision: [D108793168](https://our.internmc.facebook.com/intern/diff/D108793168/)
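A CPU sketch of the slice gather in Python may make the `start + coord * step` mapping concrete. The helper names are hypothetical. One deliberate difference from the description: this sketch derives the output length from `end` itself, whereas the delegate only passes `dim`, `start`, and `step` to the shader and takes the output shape from the output tensor.

```python
def contiguous_strides(shape):
    """Row-major strides, in elements, for a dense tensor of this shape."""
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    return strides


def slice_copy(flat_in, in_shape, dim, start, end, step=1):
    """Hypothetical CPU model of the slice gather: returns (flat_out, out_shape)."""
    if step <= 0:
        raise ValueError("step must be positive")
    ndim = len(in_shape)
    if dim < 0:
        dim += ndim
    size = in_shape[dim]
    if start < 0:
        start += size
    if end < 0:
        end += size
    start = min(max(start, 0), size)    # clamp start to [0, in_size]
    end = min(max(end, start), size)    # assumption: clamp end the same way
    out_len = (end - start + step - 1) // step

    out_shape = list(in_shape)
    out_shape[dim] = out_len
    in_strides = contiguous_strides(in_shape)
    out_strides = contiguous_strides(out_shape)
    numel = 1
    for s in out_shape:
        numel *= s

    flat_out = [0.0] * numel
    for out_idx in range(numel):
        rem, in_off = out_idx, 0
        for d in range(ndim):
            coord = rem // out_strides[d]   # delinearize over output strides
            rem %= out_strides[d]
            if d == dim:
                coord = start + coord * step  # map back along the sliced dim
            in_off += coord * in_strides[d]   # relinearize over input strides
        flat_out[out_idx] = flat_in[in_off]
    return flat_out, out_shape
```

With a 2x8 input holding 0..15, `slice_copy(data, [2, 8], 1, 0, 8, 2)` corresponds to `[:, 0:8:2]` and keeps every second column.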
…work) (#20395)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * #20397 * #20396 * __->__ #20395 * #20394 * #20393 * #20392 * #20391 * #20390 * #20363 * #20362 * #20361 * #20360 * #20359

Registers `aten.slice_copy.Tensor` in the `cases.py` op-test framework: a `_slice_suite` of 4 configs (leading-dim slice `[:,1:5]`, last-dim slice `[...,1:3]`, step-2 `[:,0:8:2]`, negative-end `[:,1:-1]`) that `generate_op_tests` exports via `VulkanPartitioner` and compares to a torch golden on Dawn. Also adds `test/ops/slice/test_slice.py` (`SliceModule` + `CONFIGS` + export-delegation/eager smoke test) and the `aten.slice_copy.Tensor` partitioner-allowlist entry in `tester.py`.

@exported-using-ghexport

Differential Revision: [D108793151](https://our.internmc.facebook.com/intern/diff/D108793151/)
…ermute_copy.default) (#20396)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * #20397 * __->__ #20396 * #20395 * #20394 * #20393 * #20392 * #20391 * #20390 * #20363 * #20362 * #20361 * #20360 * #20359

Adds `aten.permute_copy.default` (a coordinate-reorder gather) to the WebGPU delegate, and the `IntList` graph value type it needs to read its `dims` argument.

Composition:
- `runtime/WebGPUGraph.{h,cpp}`: adds `ValueType::IntList` backed by `std::vector<std::vector<int64_t>> int_lists_` + `get_int_list(int)`; `build()` deserializes `vkgraph::GraphTypes::IntList` via `value_as_IntList()->items()` (int64, matching the FlatBuffer `[long]`); mirrors the existing scalar value plumbing.
- `runtime/ops/permute/Permute.cpp`: reads the permutation via `get_int_list`, normalizes negative dims, validates it is a permutation of `[0, ndim)`, builds two `TensorMeta` UBOs + a `PermuteParams{perm: vec4<u32>}` uniform, guards fp32 + rank ≤ 4, dispatches over `compute_1d_workgroup_count(out.numel)` with `override wg_size`; releases all uniforms after the bind group.
- `runtime/ops/permute/permute.wgsl`: delinearizes the output index over the contiguous output strides, reads `input` at `in.strides[perm[d]]` per dim (mirrors Vulkan `permute_buffer.glsl`).
- Registers both `aten.permute_copy.default` and `aten.permute.default` to the same handler.

@exported-using-ghexport

Differential Revision: [D108793162](https://our.internmc.facebook.com/intern/diff/D108793162/)
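The permute gather can be sketched on the CPU in Python as below. The names are hypothetical and the rank ≤ 4 and fp32 guards are omitted; the sketch keeps only the normalization, the permutation check, and the `in.strides[perm[d]]` lookup described above.

```python
def contiguous_strides(shape):
    """Row-major strides, in elements, for a dense tensor of this shape."""
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    return strides


def permute_copy(flat_in, in_shape, perm):
    """Hypothetical CPU model of the permute gather: returns (flat_out, out_shape)."""
    ndim = len(in_shape)
    perm = [p + ndim if p < 0 else p for p in perm]  # normalize negative dims
    if sorted(perm) != list(range(ndim)):
        raise ValueError("perm must be a permutation of [0, ndim)")

    in_strides = contiguous_strides(in_shape)
    out_shape = [in_shape[p] for p in perm]
    out_strides = contiguous_strides(out_shape)
    numel = 1
    for s in out_shape:
        numel *= s

    flat_out = [0.0] * numel
    for out_idx in range(numel):
        rem, in_off = out_idx, 0
        for d in range(ndim):
            coord = rem // out_strides[d]  # delinearize over output strides
            rem %= out_strides[d]
            # Output dim d walks input dim perm[d].
            in_off += coord * in_strides[perm[d]]
        flat_out[out_idx] = flat_in[in_off]
    return flat_out, out_shape
```

A 2D transpose is the simplest case: `permute_copy([0, 1, 2, 3, 4, 5], [2, 3], [1, 0])` reads the 2x3 input column by column and returns a 3x2 result.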
…mework) (#20397)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom): * #20465 * #20464 * #20463 * #20435 * #20399 * #20398 * __->__ #20397 * #20396 * #20395 * #20394 * #20393 * #20392 * #20391 * #20390 * #20363 * #20362 * #20361 * #20360 * #20359

Registers `aten.permute_copy.default` in the `cases.py` op-test framework: a `_permute_suite` of 4 configs (3D rotation, 4D middle-dim transpose, 2D transpose, full 4D shuffle) that `generate_op_tests` exports via `VulkanPartitioner` and compares to a torch golden on Dawn. Also adds `test/ops/permute/test_permute.py` (`PermuteModule` + `CONFIGS` + `_op_delegated` smoke test) and the `aten.permute_copy.default` partitioner-allowlist entry in `tester.py`.

@exported-using-ghexport

Differential Revision: [D108793156](https://our.internmc.facebook.com/intern/diff/D108793156/)
Stack from ghstack (oldest at bottom):
Adds `aten.view_copy.default` to the WebGPU delegate. A contiguous reshape on a dense row-major buffer backend is a flat copy `output[i] = input[i]`, so the op is a single 1D-dispatch copy kernel.

Composition (single compute dispatch):
- `runtime/ops/view_copy/view_copy.h`: declares `add_flat_copy(graph, in_id, out_id)`: fail-loud guards (both tensors, fp32, numel-preserving) + the `view_copy.wgsl` dispatch over `compute_1d_workgroup_count(num_elements)` with `override wg_size`; mirrors Vulkan `add_view_copy_node`.
- `runtime/ops/view_copy/ViewCopy.cpp`: reads `args = [self, size, out]`, ignores the AOT-fixed `size` value-id (output shape comes from `out_tensor.dims`), calls `add_flat_copy`.
- `runtime/ops/view_copy/view_copy.wgsl`: guards `idx >= num_elements`, writes `output[idx] = input[idx]`.

`add_flat_copy` is factored into the header so the stacked squeeze/unsqueeze ops reuse it without a new kernel.

@exported-using-ghexport
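To show why the kernel needs the `idx >= num_elements` guard, here is a small Python model of a 1D dispatch. It assumes the workgroup count is a ceiling division of the element count by the workgroup size; the two-argument `compute_1d_workgroup_count` signature, the default sizes, and `view_copy_dispatch` are hypothetical and chosen only for illustration.

```python
def compute_1d_workgroup_count(num_elements, wg_size):
    """Ceiling division: enough workgroups to cover every element."""
    return (num_elements + wg_size - 1) // wg_size


def view_copy_dispatch(inp, requested_wg_size=256, device_limit=64):
    """Simulate the flat copy kernel; returns (out, invocations_launched)."""
    wg_size = min(requested_wg_size, device_limit)  # clamp to the device limit
    n = len(inp)
    out = [None] * n
    launched = compute_1d_workgroup_count(n, wg_size) * wg_size
    for idx in range(launched):
        if idx >= n:  # the last workgroup usually overshoots the buffer
            continue
        out[idx] = inp[idx]
    return out, launched
```

For 100 elements and a workgroup size of 64, two workgroups launch 128 invocations, so the last 28 must be rejected by the guard rather than writing out of bounds.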
Differential Revision: D108793164