Read exponent from CPU RAM by blegat · Pull Request #79 · blegat/ArrayDiff.jl

blegat · 2026-05-28T19:18:51Z

Before

$ julia --project=perf perf/gpu_bench.jl
Precompiling ArrayDiff finished.
  1 dependency successfully precompiled in 4 seconds. 53 already precompiled.
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  261.185 μs …   9.858 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     390.277 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   377.380 μs ± 112.290 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄                             ▅█▃                            
  ▂▄██▄▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▂▂▂▂▂▂▃▄█████▅▄▃▃▄▄▆▅▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  261 μs           Histogram: frequency by time          496 μs <

 Memory estimate: 32.20 KiB, allocs estimate: 1040.
Profiler ran for 15.65 ms, capturing 1762 events.

Host-side activity: calling CUDA APIs took 370.03 µs (2.36% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                                                 │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────────────────────────────────┤
│    0.54% │   83.92 µs │    14 │   5.99 µs ± 9.27   (  2.86 ‥ 38.15)   │ cuLaunchKernelEx                                     │
│    0.33% │   51.02 µs │     5 │   10.2 µs ± 6.71   (  6.68 ‥ 22.17)   │ cuMemcpyDtoHAsync                                    │
│    0.17% │   27.18 µs │     4 │   6.79 µs ± 3.72   (   3.1 ‥ 10.97)   │ cuMemsetD32Async                                     │
│    0.16% │   25.03 µs │    11 │   2.28 µs ± 4.0    (  0.72 ‥ 14.31)   │ cuMemAllocFromPoolAsync                              │
│    0.14% │   22.17 µs │     9 │   2.46 µs ± 0.83   (  1.67 ‥ 4.29)    │ cuMemcpyHtoDAsync                                    │
│    0.08% │   11.92 µs │     2 │   5.96 µs ± 2.36   (  4.29 ‥ 7.63)    │ cuMemcpyDtoDAsync                                    │
│    0.07% │   10.97 µs │     3 │   3.66 µs ± 0.77   (   3.1 ‥ 4.53)    │ cuLaunchKernel                                       │
│    0.04% │     6.2 µs │     1 │                                       │ cudaLaunchKernelExC                                  │
│    0.04% │    5.48 µs │    11 │ 498.51 ns ± 634.88 (238.42 ‥ 2384.19) │ cuStreamSynchronize                                  │
│    0.03% │    5.25 µs │     1 │                                       │ cudaLaunchKernel                                     │
│    0.02% │    3.34 µs │     1 │                                       │ cudaEventRecord                                      │
│    0.02% │    2.86 µs │     3 │ 953.67 ns ± 630.8  (476.84 ‥ 1668.93) │ cuKernelGetFunction                                  │
│    0.02% │    2.38 µs │     8 │ 298.02 ns ± 305.59 (   0.0 ‥ 953.67)  │ cudaGetLastError                                     │
│    0.01% │    1.91 µs │     3 │ 635.78 ns ± 688.26 (238.42 ‥ 1430.51) │ cudaGetDevice                                        │
│    0.01% │    1.67 µs │     3 │ 556.31 ns ± 137.65 (476.84 ‥ 715.26)  │ cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    0.00% │  476.84 ns │     4 │ 119.21 ns ± 137.65 (   0.0 ‥ 238.42)  │ cuCtxPushCurrent                                     │
│    0.00% │  476.84 ns │     1 │                                       │ cuKernelGetName                                      │
│    0.00% │  476.84 ns │     4 │ 119.21 ns ± 137.65 (   0.0 ‥ 238.42)  │ cuCtxGetDevice                                       │
│    0.00% │  476.84 ns │     3 │ 158.95 ns ± 137.65 (   0.0 ‥ 238.42)  │ cudaDeviceGetAttribute                               │
│    0.00% │  238.42 ns │     4 │   59.6 ns ± 119.21 (   0.0 ‥ 238.42)  │ cuDeviceGet                                          │
│    0.00% │     0.0 ns │     4 │    0.0 ns ± 0.0    (   0.0 ‥ 0.0)     │ cuCtxPopCurrent                                      │
└──────────┴────────────┴───────┴───────────────────────────────────────┴──────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 147.58 µs (0.94% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                              ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────
│    0.16% │   25.51 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_32x128_8x5_nn_align1> ⋯
│    0.12% │   18.12 µs │     1 │                                      │ void magma_sgemmEx_kernel<float, float, float, true, false, 6, 4, ⋯
│    0.11% │   17.88 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.08% │   13.11 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.08% │    12.4 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1> ⋯
│    0.07% │   11.44 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.07% │   10.97 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nn_align1> ⋯
│    0.06% │    8.82 µs │     4 │   2.21 µs ± 3.47   (  0.24 ‥ 7.39)   │ [set device memory]                                               ⋯
│    0.04% │    6.68 µs │     5 │   1.34 µs ± 0.13   (  1.19 ‥ 1.43)   │ [copy device to pageable memory]                                  ⋯
│    0.03% │    4.05 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.02% │    2.38 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.02% │    2.38 µs │     2 │   1.19 µs ± 0.0    (  1.19 ‥ 1.19)   │ [copy device to device memory]                                    ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(_, add_sum, Float32, CartesianIndices<2, T ⋯
│    0.01% │    1.67 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, Float32, CartesianIndic ⋯
│    0.01% │    1.67 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, void, CartesianIndices< ⋯
│    0.01% │    1.67 µs │     1 │                                      │ void cublasLt::splitKreduce_kernel<32, 16, int, float, float, flo ⋯
│    0.01% │    1.43 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.01% │    1.43 µs │     9 │ 158.95 ns ± 119.21 (   0.0 ‥ 238.42) │ [copy pageable to device memory]                                  ⋯
│    0.01% │    1.19 µs │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────

After

$ julia --project=perf perf/gpu_bench.jl
Precompiling ArrayDiff finished.
  1 dependency successfully precompiled in 5 seconds. 53 already precompiled.
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  242.748 μs …  6.517 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     355.868 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   341.301 μs ± 81.345 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂▅                                ▃██▅▁                     
  ▂▃██▆▃▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▁▁▁▁▁▁▂▁▂▁▂▃▄▆█████▆▅▄▃▃▃▃▃▄▄▄▄▃▃▃▂▂▂▂ ▃
  243 μs          Histogram: frequency by time          421 μs <

 Memory estimate: 31.61 KiB, allocs estimate: 1004.
Profiler ran for 31.12 ms, capturing 1264 events.

Host-side activity: calling CUDA APIs took 555.99 µs (1.79% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                                                 │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────────────────────────────────┤
│    0.43% │   132.8 µs │    14 │   9.49 µs ± 13.8   (  4.77 ‥ 57.22)   │ cuLaunchKernelEx                                     │
│    0.21% │    63.9 µs │     4 │  15.97 µs ± 10.82  (  9.06 ‥ 31.95)   │ cuMemcpyDtoHAsync                                    │
│    0.15% │    45.3 µs │    11 │   4.12 µs ± 6.19   (  1.43 ‥ 22.65)   │ cuMemAllocFromPoolAsync                              │
│    0.13% │   41.48 µs │     9 │   4.61 µs ± 1.52   (   3.1 ‥ 8.11)    │ cuMemcpyHtoDAsync                                    │
│    0.10% │   30.76 µs │     4 │   7.69 µs ± 5.14   (  3.81 ‥ 15.26)   │ cuMemsetD32Async                                     │
│    0.07% │   21.46 µs │     2 │  10.73 µs ± 3.03   (  8.58 ‥ 12.87)   │ cuMemcpyDtoDAsync                                    │
│    0.06% │   18.84 µs │     3 │   6.28 µs ± 1.85   (  4.77 ‥ 8.34)    │ cuLaunchKernel                                       │
│    0.04% │   11.21 µs │     9 │   1.25 µs ± 1.18   (  0.48 ‥ 4.29)    │ cuStreamSynchronize                                  │
│    0.03% │    8.82 µs │     1 │                                       │ cudaLaunchKernelExC                                  │
│    0.02% │    7.63 µs │     1 │                                       │ cudaLaunchKernel                                     │
│    0.02% │    4.77 µs │     8 │ 596.05 ns ± 845.34 (   0.0 ‥ 2622.6)  │ cudaGetLastError                                     │
│    0.01% │    4.29 µs │     1 │                                       │ cudaEventRecord                                      │
│    0.01% │     3.1 µs │     3 │   1.03 µs ± 0.55   (  0.72 ‥ 1.67)    │ cuKernelGetFunction                                  │
│    0.01% │    2.62 µs │     3 │  874.2 ns ± 902.64 (238.42 ‥ 1907.35) │ cudaGetDevice                                        │
│    0.01% │    1.91 µs │     3 │ 635.78 ns ± 364.19 (238.42 ‥ 953.67)  │ cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    0.01% │    1.91 µs │     3 │ 635.78 ns ± 137.65 (476.84 ‥ 715.26)  │ cudaDeviceGetAttribute                               │
│    0.00% │    1.43 µs │     4 │ 357.63 ns ± 238.42 (238.42 ‥ 715.26)  │ cuCtxPushCurrent                                     │
│    0.00% │  953.67 ns │     4 │ 238.42 ns ± 0.0    (238.42 ‥ 238.42)  │ cuCtxPopCurrent                                      │
│    0.00% │  715.26 ns │     1 │                                       │ cuKernelGetName                                      │
│    0.00% │  715.26 ns │     4 │ 178.81 ns ± 119.21 (   0.0 ‥ 238.42)  │ cuCtxGetDevice                                       │
│    0.00% │  476.84 ns │     4 │ 119.21 ns ± 238.42 (   0.0 ‥ 476.84)  │ cuDeviceGet                                          │
└──────────┴────────────┴───────┴───────────────────────────────────────┴──────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 146.15 µs (0.47% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                              ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────
│    0.08% │   25.75 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_32x128_8x5_nn_align1> ⋯
│    0.06% │   17.64 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.06% │   17.64 µs │     1 │                                      │ void magma_sgemmEx_kernel<float, float, float, true, false, 6, 4, ⋯
│    0.04% │   13.11 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.04% │   11.92 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1> ⋯
│    0.04% │   11.21 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nn_align1> ⋯
│    0.04% │   10.97 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.03% │     9.3 µs │     4 │   2.32 µs ± 3.39   (  0.48 ‥ 7.39)   │ [set device memory]                                               ⋯
│    0.02% │    4.77 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.01% │    4.53 µs │     4 │   1.13 µs ± 0.12   (  0.95 ‥ 1.19)   │ [copy device to pageable memory]                                  ⋯
│    0.01% │    2.62 µs │     2 │   1.31 µs ± 0.17   (  1.19 ‥ 1.43)   │ [copy device to device memory]                                    ⋯
│    0.01% │    2.38 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, void, CartesianIndices< ⋯
│    0.01% │    1.91 µs │     9 │ 211.93 ns ± 143.27 (   0.0 ‥ 476.84) │ [copy pageable to device memory]                                  ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(_, add_sum, Float32, CartesianIndices<2, T ⋯
│    0.01% │    1.67 µs │     1 │                                      │ void cublasLt::splitKreduce_kernel<32, 16, int, float, float, flo ⋯
│    0.00% │    1.43 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, Float32, CartesianIndic ⋯
│    0.00% │    1.43 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  476.84 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────
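
To make the two reports easier to compare, here is a small sketch that computes the relative change in the BenchmarkTools statistics. The numbers are copied from the Before and After outputs above; the names `before`, `after` and `reduction` are only illustrative and are not part of ArrayDiff.jl or `perf/gpu_bench.jl`.

```julia
# Times in μs, copied from the two BenchmarkTools.Trial reports in this PR.
before = (min = 261.185, median = 390.277, mean = 377.380)
after  = (min = 242.748, median = 355.868, mean = 341.301)

# Relative reduction in percent for one statistic (hypothetical helper).
reduction(b, a) = 100 * (b - a) / b

for k in keys(before)
    pct = round(reduction(before[k], after[k]); digits = 1)
    println(rpad(string(k), 7), pct, " % less time")
end
```

This gives a reduction of roughly 7 % on the minimum, 9 % on the median and 10 % on the mean. The profiler traces point the same way: `cuMemcpyDtoHAsync` goes from 5 calls to 4, `cuStreamSynchronize` from 11 to 9, and the allocation estimate from 1040 to 1004. The per-call host timings come from a single trace each and look noisy (total CUDA API time is higher in the After run), so the call counts are probably the more reliable signal.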

codecov · 2026-05-28T19:35:55Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.23%. Comparing base (0c6cc31) to head (bbebd29).

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #79   +/-   ##
=======================================
  Coverage   92.23%   92.23%           
=======================================
  Files          25       25           
  Lines        3220     3220           
=======================================
  Hits         2970     2970           
  Misses        250      250

Read exponent from CPU RAM

bbebd29

blegat merged commit e349782 into main May 28, 2026
5 checks passed

blegat merged 1 commit into main from bl/exp_cst
