Skip to content

Preload non-block values as well#78

Merged
blegat merged 2 commits into
mainfrom
bl/preload
May 28, 2026
Merged

Preload non-block values as well#78
blegat merged 2 commits into
mainfrom
bl/preload

Conversation

@blegat
Copy link
Copy Markdown
Owner

@blegat blegat commented May 28, 2026

Before

$ julia --project=perf perf/gpu_bench.jl
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  362.407 μs …  10.112 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     387.509 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   397.109 μs ± 101.367 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▂▆█▇▄▁                                                  
  ▁▁▁▂▃▄▅██████▆▄▃▂▂▂▂▂▂▂▂▃▄▄▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  362 μs           Histogram: frequency by time          504 μs <

 Memory estimate: 32.41 KiB, allocs estimate: 1051.
Profiler ran for 31.22 ms, capturing 1406 events.

Host-side activity: calling CUDA APIs took 678.78 µs (2.17% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                      │ Name                                                 │
├──────────┼────────────┼───────┼────────────────────────────────────────┼──────────────────────────────────────────────────────┤
│    0.52% │  161.89 µs │    14 │  11.56 µs ± 22.1   (  4.29 ‥ 88.21)    │ cuLaunchKernelEx                                     │
│    0.27% │   84.16 µs │     5 │  16.83 µs ± 14.64  (  8.11 ‥ 42.68)    │ cuMemcpyDtoHAsync                                    │
│    0.16% │    51.5 µs │    11 │   4.68 µs ± 8.43   (  1.19 ‥ 30.04)    │ cuMemAllocFromPoolAsync                              │
│    0.15% │   45.54 µs │    10 │   4.55 µs ± 1.78   (   3.1 ‥ 7.87)     │ cuMemcpyHtoDAsync                                    │
│    0.13% │   40.77 µs │     4 │  10.19 µs ± 9.2    (  3.81 ‥ 23.84)    │ cuMemsetD32Async                                     │
│    0.09% │    26.7 µs │     3 │    8.9 µs ± 6.35   (  4.77 ‥ 16.21)    │ cuLaunchKernel                                       │
│    0.08% │   25.27 µs │     2 │  12.64 µs ± 4.38   (  9.54 ‥ 15.74)    │ cuMemcpyDtoDAsync                                    │
│    0.04% │   13.59 µs │    11 │   1.24 µs ± 1.43   (  0.48 ‥ 5.48)     │ cuStreamSynchronize                                  │
│    0.03% │    9.06 µs │     1 │                                        │ cudaLaunchKernelExC                                  │
│    0.03% │    7.87 µs │     1 │                                        │ cudaLaunchKernel                                     │
│    0.02% │    5.25 µs │     8 │ 655.65 ns ± 1086.98 (238.42 ‥ 3337.86) │ cudaGetLastError                                     │
│    0.02% │    4.77 µs │     1 │                                        │ cudaEventRecord                                      │
│    0.01% │    3.58 µs │     3 │   1.19 µs ± 1.04   (  0.48 ‥ 2.38)     │ cuKernelGetFunction                                  │
│    0.01% │    2.62 µs │     3 │  874.2 ns ± 688.26 (476.84 ‥ 1668.93)  │ cudaGetDevice                                        │
│    0.01% │    2.15 µs │     3 │ 715.26 ns ± 238.42 (476.84 ‥ 953.67)   │ cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    0.01% │    1.91 µs │     3 │ 635.78 ns ± 275.3  (476.84 ‥ 953.67)   │ cudaDeviceGetAttribute                               │
│    0.00% │  715.26 ns │     4 │ 178.81 ns ± 228.27 (   0.0 ‥ 476.84)   │ cuCtxPushCurrent                                     │
│    0.00% │  715.26 ns │     4 │ 178.81 ns ± 228.27 (   0.0 ‥ 476.84)   │ cuDeviceGet                                          │
│    0.00% │  476.84 ns │     1 │                                        │ cuKernelGetName                                      │
│    0.00% │  476.84 ns │     4 │ 119.21 ns ± 137.65 (   0.0 ‥ 238.42)   │ cuCtxPopCurrent                                      │
│    0.00% │     0.0 ns │     4 │    0.0 ns ± 0.0    (   0.0 ‥ 0.0)      │ cuCtxGetDevice                                       │
└──────────┴────────────┴───────┴────────────────────────────────────────┴──────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 148.06 µs (0.47% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                              ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────
│    0.08% │   25.51 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_32x128_8x5_nn_align1> ⋯
│    0.06% │   17.88 µs │     1 │                                      │ void magma_sgemmEx_kernel<float, float, float, true, false, 6, 4, ⋯
│    0.06% │    17.4 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.04% │   13.11 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.04% │   11.68 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1> ⋯
│    0.04% │   11.21 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.04% │   10.97 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nn_align1> ⋯
│    0.03% │    9.06 µs │     4 │   2.26 µs ± 3.43   (  0.24 ‥ 7.39)   │ [set device memory]                                               ⋯
│    0.02% │    6.68 µs │     5 │   1.34 µs ± 0.21   (  1.19 ‥ 1.67)   │ [copy device to pageable memory]                                  ⋯
│    0.02% │    5.01 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.01% │    2.62 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.01% │    2.15 µs │     2 │   1.07 µs ± 0.17   (  0.95 ‥ 1.19)   │ [copy device to device memory]                                    ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, Float32, CartesianIndic ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, void, CartesianIndices< ⋯
│    0.01% │    1.91 µs │    10 │ 190.73 ns ± 100.53 (   0.0 ‥ 238.42) │ [copy pageable to device memory]                                  ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(_, add_sum, Float32, CartesianIndices<2, T ⋯
│    0.01% │    1.67 µs │     1 │                                      │ void cublasLt::splitKreduce_kernel<32, 16, int, float, float, flo ⋯
│    0.00% │    1.43 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────

After

$ julia --project=perf perf/gpu_bench.jl
Precompiling ArrayDiff finished.
  1 dependency successfully precompiled in 4 seconds. 53 already precompiled.
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  261.185 μs …   9.858 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     390.277 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   377.380 μs ± 112.290 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄                             ▅█▃                            
  ▂▄██▄▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▂▂▂▂▂▂▃▄█████▅▄▃▃▄▄▆▅▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  261 μs           Histogram: frequency by time          496 μs <

 Memory estimate: 32.20 KiB, allocs estimate: 1040.
Profiler ran for 15.65 ms, capturing 1762 events.

Host-side activity: calling CUDA APIs took 370.03 µs (2.36% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                                                 │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────────────────────────────────┤
│    0.54% │   83.92 µs │    14 │   5.99 µs ± 9.27   (  2.86 ‥ 38.15)   │ cuLaunchKernelEx                                     │
│    0.33% │   51.02 µs │     5 │   10.2 µs ± 6.71   (  6.68 ‥ 22.17)   │ cuMemcpyDtoHAsync                                    │
│    0.17% │   27.18 µs │     4 │   6.79 µs ± 3.72   (   3.1 ‥ 10.97)   │ cuMemsetD32Async                                     │
│    0.16% │   25.03 µs │    11 │   2.28 µs ± 4.0    (  0.72 ‥ 14.31)   │ cuMemAllocFromPoolAsync                              │
│    0.14% │   22.17 µs │     9 │   2.46 µs ± 0.83   (  1.67 ‥ 4.29)    │ cuMemcpyHtoDAsync                                    │
│    0.08% │   11.92 µs │     2 │   5.96 µs ± 2.36   (  4.29 ‥ 7.63)    │ cuMemcpyDtoDAsync                                    │
│    0.07% │   10.97 µs │     3 │   3.66 µs ± 0.77   (   3.1 ‥ 4.53)    │ cuLaunchKernel                                       │
│    0.04% │     6.2 µs │     1 │                                       │ cudaLaunchKernelExC                                  │
│    0.04% │    5.48 µs │    11 │ 498.51 ns ± 634.88 (238.42 ‥ 2384.19) │ cuStreamSynchronize                                  │
│    0.03% │    5.25 µs │     1 │                                       │ cudaLaunchKernel                                     │
│    0.02% │    3.34 µs │     1 │                                       │ cudaEventRecord                                      │
│    0.02% │    2.86 µs │     3 │ 953.67 ns ± 630.8  (476.84 ‥ 1668.93) │ cuKernelGetFunction                                  │
│    0.02% │    2.38 µs │     8 │ 298.02 ns ± 305.59 (   0.0 ‥ 953.67)  │ cudaGetLastError                                     │
│    0.01% │    1.91 µs │     3 │ 635.78 ns ± 688.26 (238.42 ‥ 1430.51) │ cudaGetDevice                                        │
│    0.01% │    1.67 µs │     3 │ 556.31 ns ± 137.65 (476.84 ‥ 715.26)  │ cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    0.00% │  476.84 ns │     4 │ 119.21 ns ± 137.65 (   0.0 ‥ 238.42)  │ cuCtxPushCurrent                                     │
│    0.00% │  476.84 ns │     1 │                                       │ cuKernelGetName                                      │
│    0.00% │  476.84 ns │     4 │ 119.21 ns ± 137.65 (   0.0 ‥ 238.42)  │ cuCtxGetDevice                                       │
│    0.00% │  476.84 ns │     3 │ 158.95 ns ± 137.65 (   0.0 ‥ 238.42)  │ cudaDeviceGetAttribute                               │
│    0.00% │  238.42 ns │     4 │   59.6 ns ± 119.21 (   0.0 ‥ 238.42)  │ cuDeviceGet                                          │
│    0.00% │     0.0 ns │     4 │    0.0 ns ± 0.0    (   0.0 ‥ 0.0)     │ cuCtxPopCurrent                                      │
└──────────┴────────────┴───────┴───────────────────────────────────────┴──────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 147.58 µs (0.94% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                              ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────
│    0.16% │   25.51 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_32x128_8x5_nn_align1> ⋯
│    0.12% │   18.12 µs │     1 │                                      │ void magma_sgemmEx_kernel<float, float, float, true, false, 6, 4, ⋯
│    0.11% │   17.88 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.08% │   13.11 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.08% │    12.4 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1> ⋯
│    0.07% │   11.44 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.07% │   10.97 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nn_align1> ⋯
│    0.06% │    8.82 µs │     4 │   2.21 µs ± 3.47   (  0.24 ‥ 7.39)   │ [set device memory]                                               ⋯
│    0.04% │    6.68 µs │     5 │   1.34 µs ± 0.13   (  1.19 ‥ 1.43)   │ [copy device to pageable memory]                                  ⋯
│    0.03% │    4.05 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.02% │    2.38 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.02% │    2.38 µs │     2 │   1.19 µs ± 0.0    (  1.19 ‥ 1.19)   │ [copy device to device memory]                                    ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(_, add_sum, Float32, CartesianIndices<2, T ⋯
│    0.01% │    1.67 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, Float32, CartesianIndic ⋯
│    0.01% │    1.67 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, void, CartesianIndices< ⋯
│    0.01% │    1.67 µs │     1 │                                      │ void cublasLt::splitKreduce_kernel<32, 16, int, float, float, flo ⋯
│    0.01% │    1.43 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.01% │    1.43 µs │     9 │ 158.95 ns ± 119.21 (   0.0 ‥ 238.42) │ [copy pageable to device memory]                                  ⋯
│    0.01% │    1.19 µs │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────

@codecov
Copy link
Copy Markdown

codecov Bot commented May 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.23%. Comparing base (229d888) to head (4fec915).

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #78   +/-   ##
=======================================
  Coverage   92.23%   92.23%           
=======================================
  Files          25       25           
  Lines        3219     3220    +1     
=======================================
+ Hits         2969     2970    +1     
  Misses        250      250           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@blegat blegat merged commit 0c6cc31 into main May 28, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant