Preload non-block values as well by blegat · Pull Request #78 · blegat/ArrayDiff.jl

blegat · 2026-05-28T19:00:27Z

Before

$ julia --project=perf perf/gpu_bench.jl
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  362.407 μs …  10.112 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     387.509 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   397.109 μs ± 101.367 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▂▆█▇▄▁                                                  
  ▁▁▁▂▃▄▅██████▆▄▃▂▂▂▂▂▂▂▂▃▄▄▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  362 μs           Histogram: frequency by time          504 μs <

 Memory estimate: 32.41 KiB, allocs estimate: 1051.
Profiler ran for 31.22 ms, capturing 1406 events.

Host-side activity: calling CUDA APIs took 678.78 µs (2.17% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                      │ Name                                                 │
├──────────┼────────────┼───────┼────────────────────────────────────────┼──────────────────────────────────────────────────────┤
│    0.52% │  161.89 µs │    14 │  11.56 µs ± 22.1   (  4.29 ‥ 88.21)    │ cuLaunchKernelEx                                     │
│    0.27% │   84.16 µs │     5 │  16.83 µs ± 14.64  (  8.11 ‥ 42.68)    │ cuMemcpyDtoHAsync                                    │
│    0.16% │    51.5 µs │    11 │   4.68 µs ± 8.43   (  1.19 ‥ 30.04)    │ cuMemAllocFromPoolAsync                              │
│    0.15% │   45.54 µs │    10 │   4.55 µs ± 1.78   (   3.1 ‥ 7.87)     │ cuMemcpyHtoDAsync                                    │
│    0.13% │   40.77 µs │     4 │  10.19 µs ± 9.2    (  3.81 ‥ 23.84)    │ cuMemsetD32Async                                     │
│    0.09% │    26.7 µs │     3 │    8.9 µs ± 6.35   (  4.77 ‥ 16.21)    │ cuLaunchKernel                                       │
│    0.08% │   25.27 µs │     2 │  12.64 µs ± 4.38   (  9.54 ‥ 15.74)    │ cuMemcpyDtoDAsync                                    │
│    0.04% │   13.59 µs │    11 │   1.24 µs ± 1.43   (  0.48 ‥ 5.48)     │ cuStreamSynchronize                                  │
│    0.03% │    9.06 µs │     1 │                                        │ cudaLaunchKernelExC                                  │
│    0.03% │    7.87 µs │     1 │                                        │ cudaLaunchKernel                                     │
│    0.02% │    5.25 µs │     8 │ 655.65 ns ± 1086.98 (238.42 ‥ 3337.86) │ cudaGetLastError                                     │
│    0.02% │    4.77 µs │     1 │                                        │ cudaEventRecord                                      │
│    0.01% │    3.58 µs │     3 │   1.19 µs ± 1.04   (  0.48 ‥ 2.38)     │ cuKernelGetFunction                                  │
│    0.01% │    2.62 µs │     3 │  874.2 ns ± 688.26 (476.84 ‥ 1668.93)  │ cudaGetDevice                                        │
│    0.01% │    2.15 µs │     3 │ 715.26 ns ± 238.42 (476.84 ‥ 953.67)   │ cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    0.01% │    1.91 µs │     3 │ 635.78 ns ± 275.3  (476.84 ‥ 953.67)   │ cudaDeviceGetAttribute                               │
│    0.00% │  715.26 ns │     4 │ 178.81 ns ± 228.27 (   0.0 ‥ 476.84)   │ cuCtxPushCurrent                                     │
│    0.00% │  715.26 ns │     4 │ 178.81 ns ± 228.27 (   0.0 ‥ 476.84)   │ cuDeviceGet                                          │
│    0.00% │  476.84 ns │     1 │                                        │ cuKernelGetName                                      │
│    0.00% │  476.84 ns │     4 │ 119.21 ns ± 137.65 (   0.0 ‥ 238.42)   │ cuCtxPopCurrent                                      │
│    0.00% │     0.0 ns │     4 │    0.0 ns ± 0.0    (   0.0 ‥ 0.0)      │ cuCtxGetDevice                                       │
└──────────┴────────────┴───────┴────────────────────────────────────────┴──────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 148.06 µs (0.47% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                              ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────
│    0.08% │   25.51 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_32x128_8x5_nn_align1> ⋯
│    0.06% │   17.88 µs │     1 │                                      │ void magma_sgemmEx_kernel<float, float, float, true, false, 6, 4, ⋯
│    0.06% │    17.4 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.04% │   13.11 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.04% │   11.68 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1> ⋯
│    0.04% │   11.21 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.04% │   10.97 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nn_align1> ⋯
│    0.03% │    9.06 µs │     4 │   2.26 µs ± 3.43   (  0.24 ‥ 7.39)   │ [set device memory]                                               ⋯
│    0.02% │    6.68 µs │     5 │   1.34 µs ± 0.21   (  1.19 ‥ 1.67)   │ [copy device to pageable memory]                                  ⋯
│    0.02% │    5.01 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.01% │    2.62 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.01% │    2.15 µs │     2 │   1.07 µs ± 0.17   (  0.95 ‥ 1.19)   │ [copy device to device memory]                                    ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, Float32, CartesianIndic ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, void, CartesianIndices< ⋯
│    0.01% │    1.91 µs │    10 │ 190.73 ns ± 100.53 (   0.0 ‥ 238.42) │ [copy pageable to device memory]                                  ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(_, add_sum, Float32, CartesianIndices<2, T ⋯
│    0.01% │    1.67 µs │     1 │                                      │ void cublasLt::splitKreduce_kernel<32, 16, int, float, float, flo ⋯
│    0.00% │    1.43 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────

After

$ julia --project=perf perf/gpu_bench.jl
Precompiling ArrayDiff finished.
  1 dependency successfully precompiled in 4 seconds. 53 already precompiled.
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  261.185 μs …   9.858 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     390.277 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   377.380 μs ± 112.290 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄                             ▅█▃                            
  ▂▄██▄▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▂▂▂▂▂▂▃▄█████▅▄▃▃▄▄▆▅▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  261 μs           Histogram: frequency by time          496 μs <

 Memory estimate: 32.20 KiB, allocs estimate: 1040.
Profiler ran for 15.65 ms, capturing 1762 events.

Host-side activity: calling CUDA APIs took 370.03 µs (2.36% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                                                 │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────────────────────────────────┤
│    0.54% │   83.92 µs │    14 │   5.99 µs ± 9.27   (  2.86 ‥ 38.15)   │ cuLaunchKernelEx                                     │
│    0.33% │   51.02 µs │     5 │   10.2 µs ± 6.71   (  6.68 ‥ 22.17)   │ cuMemcpyDtoHAsync                                    │
│    0.17% │   27.18 µs │     4 │   6.79 µs ± 3.72   (   3.1 ‥ 10.97)   │ cuMemsetD32Async                                     │
│    0.16% │   25.03 µs │    11 │   2.28 µs ± 4.0    (  0.72 ‥ 14.31)   │ cuMemAllocFromPoolAsync                              │
│    0.14% │   22.17 µs │     9 │   2.46 µs ± 0.83   (  1.67 ‥ 4.29)    │ cuMemcpyHtoDAsync                                    │
│    0.08% │   11.92 µs │     2 │   5.96 µs ± 2.36   (  4.29 ‥ 7.63)    │ cuMemcpyDtoDAsync                                    │
│    0.07% │   10.97 µs │     3 │   3.66 µs ± 0.77   (   3.1 ‥ 4.53)    │ cuLaunchKernel                                       │
│    0.04% │     6.2 µs │     1 │                                       │ cudaLaunchKernelExC                                  │
│    0.04% │    5.48 µs │    11 │ 498.51 ns ± 634.88 (238.42 ‥ 2384.19) │ cuStreamSynchronize                                  │
│    0.03% │    5.25 µs │     1 │                                       │ cudaLaunchKernel                                     │
│    0.02% │    3.34 µs │     1 │                                       │ cudaEventRecord                                      │
│    0.02% │    2.86 µs │     3 │ 953.67 ns ± 630.8  (476.84 ‥ 1668.93) │ cuKernelGetFunction                                  │
│    0.02% │    2.38 µs │     8 │ 298.02 ns ± 305.59 (   0.0 ‥ 953.67)  │ cudaGetLastError                                     │
│    0.01% │    1.91 µs │     3 │ 635.78 ns ± 688.26 (238.42 ‥ 1430.51) │ cudaGetDevice                                        │
│    0.01% │    1.67 µs │     3 │ 556.31 ns ± 137.65 (476.84 ‥ 715.26)  │ cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    0.00% │  476.84 ns │     4 │ 119.21 ns ± 137.65 (   0.0 ‥ 238.42)  │ cuCtxPushCurrent                                     │
│    0.00% │  476.84 ns │     1 │                                       │ cuKernelGetName                                      │
│    0.00% │  476.84 ns │     4 │ 119.21 ns ± 137.65 (   0.0 ‥ 238.42)  │ cuCtxGetDevice                                       │
│    0.00% │  476.84 ns │     3 │ 158.95 ns ± 137.65 (   0.0 ‥ 238.42)  │ cudaDeviceGetAttribute                               │
│    0.00% │  238.42 ns │     4 │   59.6 ns ± 119.21 (   0.0 ‥ 238.42)  │ cuDeviceGet                                          │
│    0.00% │     0.0 ns │     4 │    0.0 ns ± 0.0    (   0.0 ‥ 0.0)     │ cuCtxPopCurrent                                      │
└──────────┴────────────┴───────┴───────────────────────────────────────┴──────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 147.58 µs (0.94% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                              ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────
│    0.16% │   25.51 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_32x128_8x5_nn_align1> ⋯
│    0.12% │   18.12 µs │     1 │                                      │ void magma_sgemmEx_kernel<float, float, float, true, false, 6, 4, ⋯
│    0.11% │   17.88 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.08% │   13.11 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.08% │    12.4 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1> ⋯
│    0.07% │   11.44 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.07% │   10.97 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nn_align1> ⋯
│    0.06% │    8.82 µs │     4 │   2.21 µs ± 3.47   (  0.24 ‥ 7.39)   │ [set device memory]                                               ⋯
│    0.04% │    6.68 µs │     5 │   1.34 µs ± 0.13   (  1.19 ‥ 1.43)   │ [copy device to pageable memory]                                  ⋯
│    0.03% │    4.05 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.02% │    2.38 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.02% │    2.38 µs │     2 │   1.19 µs ± 0.0    (  1.19 ‥ 1.19)   │ [copy device to device memory]                                    ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(_, add_sum, Float32, CartesianIndices<2, T ⋯
│    0.01% │    1.67 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, Float32, CartesianIndic ⋯
│    0.01% │    1.67 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, void, CartesianIndices< ⋯
│    0.01% │    1.67 µs │     1 │                                      │ void cublasLt::splitKreduce_kernel<32, 16, int, float, float, flo ⋯
│    0.01% │    1.43 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.01% │    1.43 µs │     9 │ 158.95 ns ± 119.21 (   0.0 ‥ 238.42) │ [copy pageable to device memory]                                  ⋯
│    0.01% │    1.19 µs │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────

codecov · 2026-05-28T19:14:32Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.23%. Comparing base (229d888) to head (4fec915).

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #78   +/-   ##
=======================================
  Coverage   92.23%   92.23%           
=======================================
  Files          25       25           
  Lines        3219     3220    +1     
=======================================
+ Hits         2969     2970    +1     
  Misses        250      250

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

blegat added 2 commits May 28, 2026 20:59

preload non-block values as well

691f835

Fix format

4fec915

blegat merged commit 0c6cc31 into main May 28, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preload non-block values as well#78

Preload non-block values as well#78
blegat merged 2 commits into
mainfrom
bl/preload

blegat commented May 28, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 28, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

blegat commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

blegat commented May 28, 2026 •

edited

Loading

codecov Bot commented May 28, 2026 •

edited

Loading