Remove the need for allowscalar by blegat · Pull Request #80 · blegat/ArrayDiff.jl

blegat · 2026-05-28T20:00:40Z

Something fun, I noticed that adding debug printing to the code was making it faster, and it has no effect now after this PR.
This was because, before this PR, the scalar access were forcing the cpu to wait the gpu to synchronize. Our hypothesis with Claude is that, by doing this printing, it would arrive at a state where the gpu was already ready so no need to way. And without the synchronization, it would need to wait and the cost of waiting would somehow exceed the time taken to do the printing.

Before

julia --project=perf perf/gpu_bench.jl
Precompiling ArrayDiff finished.
  1 dependency successfully precompiled in 5 seconds. 53 already precompiled.
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  242.748 μs …  6.517 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     355.868 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   341.301 μs ± 81.345 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂▅                                ▃██▅▁                     
  ▂▃██▆▃▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▁▁▁▁▁▁▂▁▂▁▂▃▄▆█████▆▅▄▃▃▃▃▃▄▄▄▄▃▃▃▂▂▂▂ ▃
  243 μs          Histogram: frequency by time          421 μs <

 Memory estimate: 31.61 KiB, allocs estimate: 1004.
Profiler ran for 31.12 ms, capturing 1264 events.

Host-side activity: calling CUDA APIs took 555.99 µs (1.79% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                                                 │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────────────────────────────────┤
│    0.43% │   132.8 µs │    14 │   9.49 µs ± 13.8   (  4.77 ‥ 57.22)   │ cuLaunchKernelEx                                     │
│    0.21% │    63.9 µs │     4 │  15.97 µs ± 10.82  (  9.06 ‥ 31.95)   │ cuMemcpyDtoHAsync                                    │
│    0.15% │    45.3 µs │    11 │   4.12 µs ± 6.19   (  1.43 ‥ 22.65)   │ cuMemAllocFromPoolAsync                              │
│    0.13% │   41.48 µs │     9 │   4.61 µs ± 1.52   (   3.1 ‥ 8.11)    │ cuMemcpyHtoDAsync                                    │
│    0.10% │   30.76 µs │     4 │   7.69 µs ± 5.14   (  3.81 ‥ 15.26)   │ cuMemsetD32Async                                     │
│    0.07% │   21.46 µs │     2 │  10.73 µs ± 3.03   (  8.58 ‥ 12.87)   │ cuMemcpyDtoDAsync                                    │
│    0.06% │   18.84 µs │     3 │   6.28 µs ± 1.85   (  4.77 ‥ 8.34)    │ cuLaunchKernel                                       │
│    0.04% │   11.21 µs │     9 │   1.25 µs ± 1.18   (  0.48 ‥ 4.29)    │ cuStreamSynchronize                                  │
│    0.03% │    8.82 µs │     1 │                                       │ cudaLaunchKernelExC                                  │
│    0.02% │    7.63 µs │     1 │                                       │ cudaLaunchKernel                                     │
│    0.02% │    4.77 µs │     8 │ 596.05 ns ± 845.34 (   0.0 ‥ 2622.6)  │ cudaGetLastError                                     │
│    0.01% │    4.29 µs │     1 │                                       │ cudaEventRecord                                      │
│    0.01% │     3.1 µs │     3 │   1.03 µs ± 0.55   (  0.72 ‥ 1.67)    │ cuKernelGetFunction                                  │
│    0.01% │    2.62 µs │     3 │  874.2 ns ± 902.64 (238.42 ‥ 1907.35) │ cudaGetDevice                                        │
│    0.01% │    1.91 µs │     3 │ 635.78 ns ± 364.19 (238.42 ‥ 953.67)  │ cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    0.01% │    1.91 µs │     3 │ 635.78 ns ± 137.65 (476.84 ‥ 715.26)  │ cudaDeviceGetAttribute                               │
│    0.00% │    1.43 µs │     4 │ 357.63 ns ± 238.42 (238.42 ‥ 715.26)  │ cuCtxPushCurrent                                     │
│    0.00% │  953.67 ns │     4 │ 238.42 ns ± 0.0    (238.42 ‥ 238.42)  │ cuCtxPopCurrent                                      │
│    0.00% │  715.26 ns │     1 │                                       │ cuKernelGetName                                      │
│    0.00% │  715.26 ns │     4 │ 178.81 ns ± 119.21 (   0.0 ‥ 238.42)  │ cuCtxGetDevice                                       │
│    0.00% │  476.84 ns │     4 │ 119.21 ns ± 238.42 (   0.0 ‥ 476.84)  │ cuDeviceGet                                          │
└──────────┴────────────┴───────┴───────────────────────────────────────┴──────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 146.15 µs (0.47% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                              ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────
│    0.08% │   25.75 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_32x128_8x5_nn_align1> ⋯
│    0.06% │   17.64 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.06% │   17.64 µs │     1 │                                      │ void magma_sgemmEx_kernel<float, float, float, true, false, 6, 4, ⋯
│    0.04% │   13.11 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.04% │   11.92 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1> ⋯
│    0.04% │   11.21 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nn_align1> ⋯
│    0.04% │   10.97 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.03% │     9.3 µs │     4 │   2.32 µs ± 3.39   (  0.48 ‥ 7.39)   │ [set device memory]                                               ⋯
│    0.02% │    4.77 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.01% │    4.53 µs │     4 │   1.13 µs ± 0.12   (  0.95 ‥ 1.19)   │ [copy device to pageable memory]                                  ⋯
│    0.01% │    2.62 µs │     2 │   1.31 µs ± 0.17   (  1.19 ‥ 1.43)   │ [copy device to device memory]                                    ⋯
│    0.01% │    2.38 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, void, CartesianIndices< ⋯
│    0.01% │    1.91 µs │     9 │ 211.93 ns ± 143.27 (   0.0 ‥ 476.84) │ [copy pageable to device memory]                                  ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(_, add_sum, Float32, CartesianIndices<2, T ⋯
│    0.01% │    1.67 µs │     1 │                                      │ void cublasLt::splitKreduce_kernel<32, 16, int, float, float, flo ⋯
│    0.00% │    1.43 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, Float32, CartesianIndic ⋯
│    0.00% │    1.43 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  476.84 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────

After

$ julia --project=perf perf/gpu_bench.jl
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  226.312 μs …  4.894 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     239.389 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   266.470 μs ± 67.781 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄▇██▇▆▄▃▂▃▃▃▃▂▂                     ▁▂▄▅▅▅▄▃▂▂▁▁▁ ▁▁▁▂▂▁▁▁  ▃
  ▅████████████████▆▆▅▆▁▄▅▄▁▁▁▃▅▃▁▅▄▄▅███████████████████████▇ █
  226 μs        Histogram: log(frequency) by time       379 μs <

 Memory estimate: 31.97 KiB, allocs estimate: 1004.
Profiler ran for 12.53 ms, capturing 1316 events.

Host-side activity: calling CUDA APIs took 617.5 µs (4.93% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                                                 │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────────────────────────────────┤
│    1.29% │  162.12 µs │    15 │  10.81 µs ± 18.5   (  4.77 ‥ 77.49)   │ cuLaunchKernelEx                                     │
│    0.49% │   61.51 µs │     2 │  30.76 µs ± 23.27  ( 14.31 ‥ 47.21)   │ cuMemcpyDtoHAsync                                    │
│    0.35% │   44.11 µs │    11 │   4.01 µs ± 5.05   (  1.67 ‥ 19.07)   │ cuMemAllocFromPoolAsync                              │
│    0.34% │    42.2 µs │     4 │  10.55 µs ± 9.31   (  4.05 ‥ 24.32)   │ cuMemsetD32Async                                     │
│    0.30% │   37.67 µs │     8 │   4.71 µs ± 1.37   (   3.1 ‥ 7.39)    │ cuMemcpyHtoDAsync                                    │
│    0.21% │    26.7 µs │     2 │  13.35 µs ± 4.05   ( 10.49 ‥ 16.21)   │ cuMemcpyDtoDAsync                                    │
│    0.17% │   20.98 µs │     3 │   6.99 µs ± 1.62   (  5.72 ‥ 8.82)    │ cuLaunchKernel                                       │
│    0.08% │   10.01 µs │     1 │                                       │ cudaLaunchKernelExC                                  │
│    0.07% │    8.58 µs │     1 │                                       │ cudaLaunchKernel                                     │
│    0.06% │    8.11 µs │     5 │   1.62 µs ± 1.38   (  0.72 ‥ 4.05)    │ cuStreamSynchronize                                  │
│    0.04% │    4.53 µs │     8 │ 566.24 ns ± 583.13 (   0.0 ‥ 1907.35) │ cudaGetLastError                                     │
│    0.03% │    4.29 µs │     1 │                                       │ cudaEventRecord                                      │
│    0.03% │    3.81 µs │     3 │   1.27 µs ± 0.73   (  0.48 ‥ 1.91)    │ cuKernelGetFunction                                  │
│    0.02% │    2.86 µs │     3 │ 953.67 ns ± 825.91 (476.84 ‥ 1907.35) │ cudaGetDevice                                        │
│    0.01% │    1.67 µs │     4 │ 417.23 ns ± 357.63 (238.42 ‥ 953.67)  │ cuCtxPushCurrent                                     │
│    0.01% │    1.67 µs │     3 │ 556.31 ns ± 137.65 (476.84 ‥ 715.26)  │ cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    0.01% │    1.43 µs │     3 │ 476.84 ns ± 0.0    (476.84 ‥ 476.84)  │ cudaDeviceGetAttribute                               │
│    0.01% │  715.26 ns │     4 │ 178.81 ns ± 228.27 (   0.0 ‥ 476.84)  │ cuCtxPopCurrent                                      │
│    0.01% │  715.26 ns │     4 │ 178.81 ns ± 119.21 (   0.0 ‥ 238.42)  │ cuCtxGetDevice                                       │
│    0.00% │  476.84 ns │     1 │                                       │ cuKernelGetName                                      │
│    0.00% │  476.84 ns │     4 │ 119.21 ns ± 238.42 (   0.0 ‥ 476.84)  │ cuDeviceGet                                          │
└──────────┴────────────┴───────┴───────────────────────────────────────┴──────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 145.2 µs (1.16% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                              ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────
│    0.21% │   25.75 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_32x128_8x5_nn_align1> ⋯
│    0.14% │   18.12 µs │     1 │                                      │ void magma_sgemmEx_kernel<float, float, float, true, false, 6, 4, ⋯
│    0.14% │   17.64 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.11% │   13.35 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.10% │   11.92 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1> ⋯
│    0.09% │   11.21 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.09% │   10.97 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nn_align1> ⋯
│    0.08% │    9.54 µs │     4 │   2.38 µs ± 3.5    (  0.48 ‥ 7.63)   │ [set device memory]                                               ⋯
│    0.03% │    4.05 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.03% │    3.81 µs │     2 │   1.91 µs ± 0.34   (  1.67 ‥ 2.15)   │ [copy device to pageable memory]                                  ⋯
│    0.02% │    2.38 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.02% │    2.15 µs │     2 │   1.07 µs ± 0.17   (  0.95 ‥ 1.19)   │ [copy device to device memory]                                    ⋯
│    0.02% │    1.91 µs │     1 │                                      │ void cublasLt::splitKreduce_kernel<32, 16, int, float, float, flo ⋯
│    0.02% │    1.91 µs │     8 │ 238.42 ns ± 0.0    (238.42 ‥ 238.42) │ [copy pageable to device memory]                                  ⋯
│    0.01% │    1.67 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, void, CartesianIndices< ⋯
│    0.01% │    1.67 µs │     1 │                                      │ partial_mapreduce_grid(_, add_sum, Float32, CartesianIndices<2, T ⋯
│    0.01% │    1.43 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, Float32, CartesianIndic ⋯
│    0.01% │    1.43 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.01% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.01% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.01% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.01% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  476.84 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  476.84 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────

codecov · 2026-05-28T20:31:53Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.23%. Comparing base (e349782) to head (27eb297).

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #80      +/-   ##
==========================================
- Coverage   92.23%   92.23%   -0.01%     
==========================================
  Files          25       25              
  Lines        3220     3219       -1     
==========================================
- Hits         2970     2969       -1     
  Misses        250      250

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

blegat added 5 commits May 28, 2026 21:49

Remove the need for allowscalar

c4c8a05

Fixes

adef74a

Fix format

4ded416

Fix format

1ac55a2

Remove last allowscalar

27eb297

blegat merged commit 4c81236 into main May 29, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove the need for allowscalar#80

Remove the need for allowscalar#80
blegat merged 5 commits into
mainfrom
bl/disallow_scalar

blegat commented May 28, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 28, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

blegat commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

blegat commented May 28, 2026 •

edited

Loading

codecov Bot commented May 28, 2026 •

edited

Loading