Skip to content

Remove the need for allowscalar#80

Merged
blegat merged 5 commits into
mainfrom
bl/disallow_scalar
May 29, 2026
Merged

Remove the need for allowscalar#80
blegat merged 5 commits into
mainfrom
bl/disallow_scalar

Conversation

@blegat
Copy link
Copy Markdown
Owner

@blegat blegat commented May 28, 2026

Something fun, I noticed that adding debug printing to the code was making it faster, and it has no effect now after this PR.
This was because, before this PR, the scalar access were forcing the cpu to wait the gpu to synchronize. Our hypothesis with Claude is that, by doing this printing, it would arrive at a state where the gpu was already ready so no need to way. And without the synchronization, it would need to wait and the cost of waiting would somehow exceed the time taken to do the printing.

Before

julia --project=perf perf/gpu_bench.jl
Precompiling ArrayDiff finished.
  1 dependency successfully precompiled in 5 seconds. 53 already precompiled.
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  242.748 μs …  6.517 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     355.868 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   341.301 μs ± 81.345 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂▅                                ▃██▅▁                     
  ▂▃██▆▃▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▁▁▁▁▁▁▂▁▂▁▂▃▄▆█████▆▅▄▃▃▃▃▃▄▄▄▄▃▃▃▂▂▂▂ ▃
  243 μs          Histogram: frequency by time          421 μs <

 Memory estimate: 31.61 KiB, allocs estimate: 1004.
Profiler ran for 31.12 ms, capturing 1264 events.

Host-side activity: calling CUDA APIs took 555.99 µs (1.79% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                                                 │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────────────────────────────────┤
│    0.43% │   132.8 µs │    14 │   9.49 µs ± 13.8   (  4.77 ‥ 57.22)   │ cuLaunchKernelEx                                     │
│    0.21% │    63.9 µs │     4 │  15.97 µs ± 10.82  (  9.06 ‥ 31.95)   │ cuMemcpyDtoHAsync                                    │
│    0.15% │    45.3 µs │    11 │   4.12 µs ± 6.19   (  1.43 ‥ 22.65)   │ cuMemAllocFromPoolAsync                              │
│    0.13% │   41.48 µs │     9 │   4.61 µs ± 1.52   (   3.1 ‥ 8.11)    │ cuMemcpyHtoDAsync                                    │
│    0.10% │   30.76 µs │     4 │   7.69 µs ± 5.14   (  3.81 ‥ 15.26)   │ cuMemsetD32Async                                     │
│    0.07% │   21.46 µs │     2 │  10.73 µs ± 3.03   (  8.58 ‥ 12.87)   │ cuMemcpyDtoDAsync                                    │
│    0.06% │   18.84 µs │     3 │   6.28 µs ± 1.85   (  4.77 ‥ 8.34)    │ cuLaunchKernel                                       │
│    0.04% │   11.21 µs │     9 │   1.25 µs ± 1.18   (  0.48 ‥ 4.29)    │ cuStreamSynchronize                                  │
│    0.03% │    8.82 µs │     1 │                                       │ cudaLaunchKernelExC                                  │
│    0.02% │    7.63 µs │     1 │                                       │ cudaLaunchKernel                                     │
│    0.02% │    4.77 µs │     8 │ 596.05 ns ± 845.34 (   0.0 ‥ 2622.6)  │ cudaGetLastError                                     │
│    0.01% │    4.29 µs │     1 │                                       │ cudaEventRecord                                      │
│    0.01% │     3.1 µs │     3 │   1.03 µs ± 0.55   (  0.72 ‥ 1.67)    │ cuKernelGetFunction                                  │
│    0.01% │    2.62 µs │     3 │  874.2 ns ± 902.64 (238.42 ‥ 1907.35) │ cudaGetDevice                                        │
│    0.01% │    1.91 µs │     3 │ 635.78 ns ± 364.19 (238.42 ‥ 953.67)  │ cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    0.01% │    1.91 µs │     3 │ 635.78 ns ± 137.65 (476.84 ‥ 715.26)  │ cudaDeviceGetAttribute                               │
│    0.00% │    1.43 µs │     4 │ 357.63 ns ± 238.42 (238.42 ‥ 715.26)  │ cuCtxPushCurrent                                     │
│    0.00% │  953.67 ns │     4 │ 238.42 ns ± 0.0    (238.42 ‥ 238.42)  │ cuCtxPopCurrent                                      │
│    0.00% │  715.26 ns │     1 │                                       │ cuKernelGetName                                      │
│    0.00% │  715.26 ns │     4 │ 178.81 ns ± 119.21 (   0.0 ‥ 238.42)  │ cuCtxGetDevice                                       │
│    0.00% │  476.84 ns │     4 │ 119.21 ns ± 238.42 (   0.0 ‥ 476.84)  │ cuDeviceGet                                          │
└──────────┴────────────┴───────┴───────────────────────────────────────┴──────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 146.15 µs (0.47% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                              ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────
│    0.08% │   25.75 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_32x128_8x5_nn_align1> ⋯
│    0.06% │   17.64 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.06% │   17.64 µs │     1 │                                      │ void magma_sgemmEx_kernel<float, float, float, true, false, 6, 4, ⋯
│    0.04% │   13.11 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.04% │   11.92 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1> ⋯
│    0.04% │   11.21 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nn_align1> ⋯
│    0.04% │   10.97 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.03% │     9.3 µs │     4 │   2.32 µs ± 3.39   (  0.48 ‥ 7.39)   │ [set device memory]                                               ⋯
│    0.02% │    4.77 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.01% │    4.53 µs │     4 │   1.13 µs ± 0.12   (  0.95 ‥ 1.19)   │ [copy device to pageable memory]                                  ⋯
│    0.01% │    2.62 µs │     2 │   1.31 µs ± 0.17   (  1.19 ‥ 1.43)   │ [copy device to device memory]                                    ⋯
│    0.01% │    2.38 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, void, CartesianIndices< ⋯
│    0.01% │    1.91 µs │     9 │ 211.93 ns ± 143.27 (   0.0 ‥ 476.84) │ [copy pageable to device memory]                                  ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(_, add_sum, Float32, CartesianIndices<2, T ⋯
│    0.01% │    1.67 µs │     1 │                                      │ void cublasLt::splitKreduce_kernel<32, 16, int, float, float, flo ⋯
│    0.00% │    1.43 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, Float32, CartesianIndic ⋯
│    0.00% │    1.43 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  476.84 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────

After

$ julia --project=perf perf/gpu_bench.jl
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  226.312 μs …  4.894 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     239.389 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   266.470 μs ± 67.781 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄▇██▇▆▄▃▂▃▃▃▃▂▂                     ▁▂▄▅▅▅▄▃▂▂▁▁▁ ▁▁▁▂▂▁▁▁  ▃
  ▅████████████████▆▆▅▆▁▄▅▄▁▁▁▃▅▃▁▅▄▄▅███████████████████████▇ █
  226 μs        Histogram: log(frequency) by time       379 μs <

 Memory estimate: 31.97 KiB, allocs estimate: 1004.
Profiler ran for 12.53 ms, capturing 1316 events.

Host-side activity: calling CUDA APIs took 617.5 µs (4.93% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                                                 │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────────────────────────────────┤
│    1.29% │  162.12 µs │    15 │  10.81 µs ± 18.5   (  4.77 ‥ 77.49)   │ cuLaunchKernelEx                                     │
│    0.49% │   61.51 µs │     2 │  30.76 µs ± 23.27  ( 14.31 ‥ 47.21)   │ cuMemcpyDtoHAsync                                    │
│    0.35% │   44.11 µs │    11 │   4.01 µs ± 5.05   (  1.67 ‥ 19.07)   │ cuMemAllocFromPoolAsync                              │
│    0.34% │    42.2 µs │     4 │  10.55 µs ± 9.31   (  4.05 ‥ 24.32)   │ cuMemsetD32Async                                     │
│    0.30% │   37.67 µs │     8 │   4.71 µs ± 1.37   (   3.1 ‥ 7.39)    │ cuMemcpyHtoDAsync                                    │
│    0.21% │    26.7 µs │     2 │  13.35 µs ± 4.05   ( 10.49 ‥ 16.21)   │ cuMemcpyDtoDAsync                                    │
│    0.17% │   20.98 µs │     3 │   6.99 µs ± 1.62   (  5.72 ‥ 8.82)    │ cuLaunchKernel                                       │
│    0.08% │   10.01 µs │     1 │                                       │ cudaLaunchKernelExC                                  │
│    0.07% │    8.58 µs │     1 │                                       │ cudaLaunchKernel                                     │
│    0.06% │    8.11 µs │     5 │   1.62 µs ± 1.38   (  0.72 ‥ 4.05)    │ cuStreamSynchronize                                  │
│    0.04% │    4.53 µs │     8 │ 566.24 ns ± 583.13 (   0.0 ‥ 1907.35) │ cudaGetLastError                                     │
│    0.03% │    4.29 µs │     1 │                                       │ cudaEventRecord                                      │
│    0.03% │    3.81 µs │     3 │   1.27 µs ± 0.73   (  0.48 ‥ 1.91)    │ cuKernelGetFunction                                  │
│    0.02% │    2.86 µs │     3 │ 953.67 ns ± 825.91 (476.84 ‥ 1907.35) │ cudaGetDevice                                        │
│    0.01% │    1.67 µs │     4 │ 417.23 ns ± 357.63 (238.42 ‥ 953.67)  │ cuCtxPushCurrent                                     │
│    0.01% │    1.67 µs │     3 │ 556.31 ns ± 137.65 (476.84 ‥ 715.26)  │ cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    0.01% │    1.43 µs │     3 │ 476.84 ns ± 0.0    (476.84 ‥ 476.84)  │ cudaDeviceGetAttribute                               │
│    0.01% │  715.26 ns │     4 │ 178.81 ns ± 228.27 (   0.0 ‥ 476.84)  │ cuCtxPopCurrent                                      │
│    0.01% │  715.26 ns │     4 │ 178.81 ns ± 119.21 (   0.0 ‥ 238.42)  │ cuCtxGetDevice                                       │
│    0.00% │  476.84 ns │     1 │                                       │ cuKernelGetName                                      │
│    0.00% │  476.84 ns │     4 │ 119.21 ns ± 238.42 (   0.0 ‥ 476.84)  │ cuDeviceGet                                          │
└──────────┴────────────┴───────┴───────────────────────────────────────┴──────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 145.2 µs (1.16% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                              ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────
│    0.21% │   25.75 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_32x128_8x5_nn_align1> ⋯
│    0.14% │   18.12 µs │     1 │                                      │ void magma_sgemmEx_kernel<float, float, float, true, false, 6, 4, ⋯
│    0.14% │   17.64 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.11% │   13.35 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.10% │   11.92 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1> ⋯
│    0.09% │   11.21 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.09% │   10.97 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nn_align1> ⋯
│    0.08% │    9.54 µs │     4 │   2.38 µs ± 3.5    (  0.48 ‥ 7.63)   │ [set device memory]                                               ⋯
│    0.03% │    4.05 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.03% │    3.81 µs │     2 │   1.91 µs ± 0.34   (  1.67 ‥ 2.15)   │ [copy device to pageable memory]                                  ⋯
│    0.02% │    2.38 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.02% │    2.15 µs │     2 │   1.07 µs ± 0.17   (  0.95 ‥ 1.19)   │ [copy device to device memory]                                    ⋯
│    0.02% │    1.91 µs │     1 │                                      │ void cublasLt::splitKreduce_kernel<32, 16, int, float, float, flo ⋯
│    0.02% │    1.91 µs │     8 │ 238.42 ns ± 0.0    (238.42 ‥ 238.42) │ [copy pageable to device memory]                                  ⋯
│    0.01% │    1.67 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, void, CartesianIndices< ⋯
│    0.01% │    1.67 µs │     1 │                                      │ partial_mapreduce_grid(_, add_sum, Float32, CartesianIndices<2, T ⋯
│    0.01% │    1.43 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, Float32, CartesianIndic ⋯
│    0.01% │    1.43 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.01% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.01% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.01% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.01% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  476.84 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  476.84 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────

@codecov
Copy link
Copy Markdown

codecov Bot commented May 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.23%. Comparing base (e349782) to head (27eb297).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #80      +/-   ##
==========================================
- Coverage   92.23%   92.23%   -0.01%     
==========================================
  Files          25       25              
  Lines        3220     3219       -1     
==========================================
- Hits         2970     2969       -1     
  Misses        250      250              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@blegat blegat merged commit 4c81236 into main May 29, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant