Skip to content

Read exponent from CPU RAM#79

Merged
blegat merged 1 commit into
mainfrom
bl/exp_cst
May 28, 2026
Merged

Read exponent from CPU RAM#79
blegat merged 1 commit into
mainfrom
bl/exp_cst

Conversation

@blegat
Copy link
Copy Markdown
Owner

@blegat blegat commented May 28, 2026

Before

$ julia --project=perf perf/gpu_bench.jl
Precompiling ArrayDiff finished.
  1 dependency successfully precompiled in 4 seconds. 53 already precompiled.
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  261.185 μs …   9.858 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     390.277 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   377.380 μs ± 112.290 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄                             ▅█▃                            
  ▂▄██▄▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▂▂▂▂▂▂▃▄█████▅▄▃▃▄▄▆▅▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  261 μs           Histogram: frequency by time          496 μs <

 Memory estimate: 32.20 KiB, allocs estimate: 1040.
Profiler ran for 15.65 ms, capturing 1762 events.

Host-side activity: calling CUDA APIs took 370.03 µs (2.36% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                                                 │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────────────────────────────────┤
│    0.54% │   83.92 µs │    14 │   5.99 µs ± 9.27   (  2.86 ‥ 38.15)   │ cuLaunchKernelEx                                     │
│    0.33% │   51.02 µs │     5 │   10.2 µs ± 6.71   (  6.68 ‥ 22.17)   │ cuMemcpyDtoHAsync                                    │
│    0.17% │   27.18 µs │     4 │   6.79 µs ± 3.72   (   3.1 ‥ 10.97)   │ cuMemsetD32Async                                     │
│    0.16% │   25.03 µs │    11 │   2.28 µs ± 4.0    (  0.72 ‥ 14.31)   │ cuMemAllocFromPoolAsync                              │
│    0.14% │   22.17 µs │     9 │   2.46 µs ± 0.83   (  1.67 ‥ 4.29)    │ cuMemcpyHtoDAsync                                    │
│    0.08% │   11.92 µs │     2 │   5.96 µs ± 2.36   (  4.29 ‥ 7.63)    │ cuMemcpyDtoDAsync                                    │
│    0.07% │   10.97 µs │     3 │   3.66 µs ± 0.77   (   3.1 ‥ 4.53)    │ cuLaunchKernel                                       │
│    0.04% │     6.2 µs │     1 │                                       │ cudaLaunchKernelExC                                  │
│    0.04% │    5.48 µs │    11 │ 498.51 ns ± 634.88 (238.42 ‥ 2384.19) │ cuStreamSynchronize                                  │
│    0.03% │    5.25 µs │     1 │                                       │ cudaLaunchKernel                                     │
│    0.02% │    3.34 µs │     1 │                                       │ cudaEventRecord                                      │
│    0.02% │    2.86 µs │     3 │ 953.67 ns ± 630.8  (476.84 ‥ 1668.93) │ cuKernelGetFunction                                  │
│    0.02% │    2.38 µs │     8 │ 298.02 ns ± 305.59 (   0.0 ‥ 953.67)  │ cudaGetLastError                                     │
│    0.01% │    1.91 µs │     3 │ 635.78 ns ± 688.26 (238.42 ‥ 1430.51) │ cudaGetDevice                                        │
│    0.01% │    1.67 µs │     3 │ 556.31 ns ± 137.65 (476.84 ‥ 715.26)  │ cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    0.00% │  476.84 ns │     4 │ 119.21 ns ± 137.65 (   0.0 ‥ 238.42)  │ cuCtxPushCurrent                                     │
│    0.00% │  476.84 ns │     1 │                                       │ cuKernelGetName                                      │
│    0.00% │  476.84 ns │     4 │ 119.21 ns ± 137.65 (   0.0 ‥ 238.42)  │ cuCtxGetDevice                                       │
│    0.00% │  476.84 ns │     3 │ 158.95 ns ± 137.65 (   0.0 ‥ 238.42)  │ cudaDeviceGetAttribute                               │
│    0.00% │  238.42 ns │     4 │   59.6 ns ± 119.21 (   0.0 ‥ 238.42)  │ cuDeviceGet                                          │
│    0.00% │     0.0 ns │     4 │    0.0 ns ± 0.0    (   0.0 ‥ 0.0)     │ cuCtxPopCurrent                                      │
└──────────┴────────────┴───────┴───────────────────────────────────────┴──────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 147.58 µs (0.94% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                              ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────
│    0.16% │   25.51 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_32x128_8x5_nn_align1> ⋯
│    0.12% │   18.12 µs │     1 │                                      │ void magma_sgemmEx_kernel<float, float, float, true, false, 6, 4, ⋯
│    0.11% │   17.88 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.08% │   13.11 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.08% │    12.4 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1> ⋯
│    0.07% │   11.44 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.07% │   10.97 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nn_align1> ⋯
│    0.06% │    8.82 µs │     4 │   2.21 µs ± 3.47   (  0.24 ‥ 7.39)   │ [set device memory]                                               ⋯
│    0.04% │    6.68 µs │     5 │   1.34 µs ± 0.13   (  1.19 ‥ 1.43)   │ [copy device to pageable memory]                                  ⋯
│    0.03% │    4.05 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.02% │    2.38 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.02% │    2.38 µs │     2 │   1.19 µs ± 0.0    (  1.19 ‥ 1.19)   │ [copy device to device memory]                                    ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(_, add_sum, Float32, CartesianIndices<2, T ⋯
│    0.01% │    1.67 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, Float32, CartesianIndic ⋯
│    0.01% │    1.67 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, void, CartesianIndices< ⋯
│    0.01% │    1.67 µs │     1 │                                      │ void cublasLt::splitKreduce_kernel<32, 16, int, float, float, flo ⋯
│    0.01% │    1.43 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.01% │    1.43 µs │     9 │ 158.95 ns ± 119.21 (   0.0 ‥ 238.42) │ [copy pageable to device memory]                                  ⋯
│    0.01% │    1.19 µs │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────

After

julia --project=perf perf/gpu_bench.jl
Precompiling ArrayDiff finished.
  1 dependency successfully precompiled in 5 seconds. 53 already precompiled.
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  242.748 μs …  6.517 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     355.868 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   341.301 μs ± 81.345 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂▅                                ▃██▅▁                     
  ▂▃██▆▃▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▁▁▁▁▁▁▂▁▂▁▂▃▄▆█████▆▅▄▃▃▃▃▃▄▄▄▄▃▃▃▂▂▂▂ ▃
  243 μs          Histogram: frequency by time          421 μs <

 Memory estimate: 31.61 KiB, allocs estimate: 1004.
Profiler ran for 31.12 ms, capturing 1264 events.

Host-side activity: calling CUDA APIs took 555.99 µs (1.79% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                                                 │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────────────────────────────────┤
│    0.43% │   132.8 µs │    14 │   9.49 µs ± 13.8   (  4.77 ‥ 57.22)   │ cuLaunchKernelEx                                     │
│    0.21% │    63.9 µs │     4 │  15.97 µs ± 10.82  (  9.06 ‥ 31.95)   │ cuMemcpyDtoHAsync                                    │
│    0.15% │    45.3 µs │    11 │   4.12 µs ± 6.19   (  1.43 ‥ 22.65)   │ cuMemAllocFromPoolAsync                              │
│    0.13% │   41.48 µs │     9 │   4.61 µs ± 1.52   (   3.1 ‥ 8.11)    │ cuMemcpyHtoDAsync                                    │
│    0.10% │   30.76 µs │     4 │   7.69 µs ± 5.14   (  3.81 ‥ 15.26)   │ cuMemsetD32Async                                     │
│    0.07% │   21.46 µs │     2 │  10.73 µs ± 3.03   (  8.58 ‥ 12.87)   │ cuMemcpyDtoDAsync                                    │
│    0.06% │   18.84 µs │     3 │   6.28 µs ± 1.85   (  4.77 ‥ 8.34)    │ cuLaunchKernel                                       │
│    0.04% │   11.21 µs │     9 │   1.25 µs ± 1.18   (  0.48 ‥ 4.29)    │ cuStreamSynchronize                                  │
│    0.03% │    8.82 µs │     1 │                                       │ cudaLaunchKernelExC                                  │
│    0.02% │    7.63 µs │     1 │                                       │ cudaLaunchKernel                                     │
│    0.02% │    4.77 µs │     8 │ 596.05 ns ± 845.34 (   0.0 ‥ 2622.6)  │ cudaGetLastError                                     │
│    0.01% │    4.29 µs │     1 │                                       │ cudaEventRecord                                      │
│    0.01% │     3.1 µs │     3 │   1.03 µs ± 0.55   (  0.72 ‥ 1.67)    │ cuKernelGetFunction                                  │
│    0.01% │    2.62 µs │     3 │  874.2 ns ± 902.64 (238.42 ‥ 1907.35) │ cudaGetDevice                                        │
│    0.01% │    1.91 µs │     3 │ 635.78 ns ± 364.19 (238.42 ‥ 953.67)  │ cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    0.01% │    1.91 µs │     3 │ 635.78 ns ± 137.65 (476.84 ‥ 715.26)  │ cudaDeviceGetAttribute                               │
│    0.00% │    1.43 µs │     4 │ 357.63 ns ± 238.42 (238.42 ‥ 715.26)  │ cuCtxPushCurrent                                     │
│    0.00% │  953.67 ns │     4 │ 238.42 ns ± 0.0    (238.42 ‥ 238.42)  │ cuCtxPopCurrent                                      │
│    0.00% │  715.26 ns │     1 │                                       │ cuKernelGetName                                      │
│    0.00% │  715.26 ns │     4 │ 178.81 ns ± 119.21 (   0.0 ‥ 238.42)  │ cuCtxGetDevice                                       │
│    0.00% │  476.84 ns │     4 │ 119.21 ns ± 238.42 (   0.0 ‥ 476.84)  │ cuDeviceGet                                          │
└──────────┴────────────┴───────┴───────────────────────────────────────┴──────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 146.15 µs (0.47% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                              ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────
│    0.08% │   25.75 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_32x128_8x5_nn_align1> ⋯
│    0.06% │   17.64 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.06% │   17.64 µs │     1 │                                      │ void magma_sgemmEx_kernel<float, float, float, true, false, 6, 4, ⋯
│    0.04% │   13.11 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.04% │   11.92 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1> ⋯
│    0.04% │   11.21 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nn_align1> ⋯
│    0.04% │   10.97 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.03% │     9.3 µs │     4 │   2.32 µs ± 3.39   (  0.48 ‥ 7.39)   │ [set device memory]                                               ⋯
│    0.02% │    4.77 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.01% │    4.53 µs │     4 │   1.13 µs ± 0.12   (  0.95 ‥ 1.19)   │ [copy device to pageable memory]                                  ⋯
│    0.01% │    2.62 µs │     2 │   1.31 µs ± 0.17   (  1.19 ‥ 1.43)   │ [copy device to device memory]                                    ⋯
│    0.01% │    2.38 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, void, CartesianIndices< ⋯
│    0.01% │    1.91 µs │     9 │ 211.93 ns ± 143.27 (   0.0 ‥ 476.84) │ [copy pageable to device memory]                                  ⋯
│    0.01% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(_, add_sum, Float32, CartesianIndices<2, T ⋯
│    0.01% │    1.67 µs │     1 │                                      │ void cublasLt::splitKreduce_kernel<32, 16, int, float, float, flo ⋯
│    0.00% │    1.43 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, Float32, CartesianIndic ⋯
│    0.00% │    1.43 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, Dyna ⋯
│    0.00% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
│    0.00% │  476.84 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, Dynamic ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────

@codecov
Copy link
Copy Markdown

codecov Bot commented May 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.23%. Comparing base (0c6cc31) to head (bbebd29).

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #79   +/-   ##
=======================================
  Coverage   92.23%   92.23%           
=======================================
  Files          25       25           
  Lines        3220     3220           
=======================================
  Hits         2970     2970           
  Misses        250      250           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@blegat blegat merged commit e349782 into main May 28, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant