fix(benchmarks): dedicated WASM timing tolerance in regression guard #1255
…oise
The publish-time regression guard tripped on two more WASM timing metrics
that are CI runner noise, not real regressions:
- No-op rebuild (build): 15 → 25 (+67%), a 10ms delta at the noise floor
on a sub-30ms NOISY metric; historical wasm range is 5–22ms.
- Full build (incremental): 7664 → 9833 (+28%); wasm full-build history
spans 7.2s–14.0s, so 9.8s is inside the envelope.
Native figures did not trip and no build/incremental codepath changed
between 3.10.0 and 3.11.1, confirming runner variance. Same shape and
root cause as the existing 3.10.0/3.11.0 No-op rebuild and Full build
exemptions. Remove once 3.12.0+ data is captured against a committed
3.11.x baseline.
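The percentages quoted above follow from the raw figures. As a minimal sketch (the helper name `percentChange` is hypothetical and not taken from the repository), the arithmetic is:

```typescript
// Relative change of a benchmark metric, as a percentage of the baseline.
// `percentChange` is an illustrative helper, not part of the guard itself.
function percentChange(baseline: number, current: number): number {
  return ((current - baseline) / baseline) * 100;
}

// Figures quoted in the commit message above.
const noOpRebuild = percentChange(15, 25); // 10 / 15
const fullBuild = percentChange(7664, 9833); // 2169 / 7664

console.log(`No-op rebuild: +${Math.round(noOpRebuild)}%`); // +67%
console.log(`Full build: +${Math.round(fullBuild)}%`); // +28%
```

A 10ms absolute delta on a 15ms baseline is enough to produce the +67% swing, which is why small timing metrics are so sensitive to runner noise.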
Greptile Summary
This PR replaces the per-version whack-a-mole KNOWN_REGRESSIONS entries for WASM timing noise with a structural fix: a dedicated
Confidence Score: 5/5
Safe to merge. The change is confined to the regression guard test file and only widens tolerances for WASM timing metrics; it cannot cause false negatives on native benchmarks, which retain the strict 25%/50% thresholds. The logic in thresholdFor is straightforward: WASM timing metrics return 70%, size metrics fall through to the strict 25% regardless of engine, and native always uses the original thresholds. The three call-sites in the test suites all pass engineKey consistently. No files require special attention.
Reviews (2): Last reviewed commit: "fix(benchmarks): add dedicated WASM timi..."
Replaces the per-version, per-metric KNOWN_REGRESSIONS whack-a-mole for WASM timing noise with a structural fix: timing metrics measured under the WASM engine get a wider WASM_TIMING_THRESHOLD (70%) via an engine-aware thresholdFor.

WASM wall-clock is 3-5x slower than native and dominated by interpreter + GC overhead, so identical shared-runner jitter lands as a far larger percentage swing (observed +27-67% run-to-run on byte-identical code). The native engine shares all extraction/resolution/query logic and keeps the strict 25%/50% thresholds, so it remains the canary for real regressions; the WASM widening still flags the 100-220% catastrophes the guard exists to catch.

Size metrics (DB bytes/file) are engine-independent and excluded via SIZE_METRICS so they keep the strict threshold.

Removes the now-superseded 3.11.1 No-op rebuild and Full build entries (both WASM-only timing trips). The remaining 3.11.x entries are kept: fnDeps depth 3/5 trip the native engine too (24.3->34.7, 24.7->34.7) and DB bytes/file is a size metric; neither is covered by the WASM widening.

Verified by injecting the publish-run 3.11.1 numbers: the guard passes with the widening and fails on exactly No-op rebuild + Full build without it.
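The threshold selection described above could look roughly like the sketch below. Only `thresholdFor`, `WASM_TIMING_THRESHOLD`, and `SIZE_METRICS` are named in the PR; `STRICT_THRESHOLD`, `NOISY_THRESHOLD`, `NOISY_METRICS`, and the `Engine` type are assumptions added for illustration, as is the guess that the 50% native threshold applies to metrics flagged as noisy and that every non-size metric is a timing metric.

```typescript
// Hypothetical engine identifiers; the real engineKey values are not shown in the PR.
type Engine = "native" | "wasm";

const WASM_TIMING_THRESHOLD = 0.7; // 70%, from the PR description
const STRICT_THRESHOLD = 0.25; // assumed name for the strict 25% threshold
const NOISY_THRESHOLD = 0.5; // assumed name for the 50% threshold

// Size metrics are engine-independent, so they always stay strict.
const SIZE_METRICS = new Set<string>(["DB bytes/file"]);
// Assumption: which metrics are treated as noisy under the native engine.
const NOISY_METRICS = new Set<string>(["No-op rebuild"]);

function thresholdFor(label: string, engine: Engine): number {
  // Size metrics keep the strict threshold regardless of engine.
  if (SIZE_METRICS.has(label)) return STRICT_THRESHOLD;
  // Every remaining metric is assumed to be a timing metric.
  if (engine === "wasm") return WASM_TIMING_THRESHOLD;
  // Native keeps the original 25%/50% split.
  return NOISY_METRICS.has(label) ? NOISY_THRESHOLD : STRICT_THRESHOLD;
}

console.log(thresholdFor("Full build", "wasm")); // 0.7
console.log(thresholdFor("DB bytes/file", "wasm")); // 0.25
console.log(thresholdFor("Full build", "native")); // 0.25
```

Checking the size set first means a size metric can never be widened by accident, whichever engine produced it.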
Summary
The v3.11.1 publish workflow kept failing the test:regression-guard gate on WASM timing metrics, one or two different metrics each run, because WASM wall-clock jitter on shared CI runners is large in percentage terms. PR #1248 already exempted three 3.11.1 metrics by hand; this PR replaces that whack-a-mole with a structural fix.
- Adds WASM_TIMING_THRESHOLD (70%), applied to timing metrics measured under the WASM engine via an engine-aware thresholdFor(label, engine). The three benchmark suites (build/query/incremental) now pass engineKey through to assertNoRegressions.
- Excludes size metrics (SIZE_METRICS = DB bytes/file): they are engine-independent and deterministic, so they keep the strict threshold regardless of engine.
- Removes the now-superseded 3.11.1:No-op rebuild and 3.11.1:Full build entries (both WASM-only timing trips). Keeps 3.11.1:fnDeps depth 3/5 (trip the native engine too: 24.3→34.7, 24.7→34.7) and 3.11.1:DB bytes/file (size metric); neither is covered by the WASM widening.
Test plan
- RUN_REGRESSION_GUARD=1 vitest run on committed data: 17/17 pass
- biome check: clean
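The injection check mentioned in the commit message can be approximated with the two WASM figures quoted in this PR. This is a sketch under stated assumptions, not the repository's test: `Sample`, `findRegressions`, `strict`, and `widened` are hypothetical names, and the strict policy assumes No-op rebuild uses the 50% noisy threshold while Full build uses 25%.

```typescript
interface Sample {
  label: string;
  baseline: number;
  current: number;
}

// Returns the labels whose relative increase exceeds the threshold for that label.
function findRegressions(
  samples: Sample[],
  thresholdOf: (label: string) => number,
): string[] {
  return samples
    .filter((s) => (s.current - s.baseline) / s.baseline > thresholdOf(s.label))
    .map((s) => s.label);
}

// WASM timing figures quoted in this PR (3.11.1 publish run).
const wasmRun: Sample[] = [
  { label: "No-op rebuild", baseline: 15, current: 25 },
  { label: "Full build", baseline: 7664, current: 9833 },
];

// Assumed pre-change policy: 50% for the noisy metric, 25% otherwise.
const strict = (label: string): number => (label === "No-op rebuild" ? 0.5 : 0.25);
// Post-change policy for WASM timing metrics: 70% across the board.
const widened = (_label: string): number => 0.7;

console.log(findRegressions(wasmRun, strict)); // ["No-op rebuild", "Full build"]
console.log(findRegressions(wasmRun, widened)); // []
```

Under these assumptions both metrics trip the old thresholds (+67% against 50%, +28% against 25%) and neither trips the 70% widening, which matches the behavior the PR reports.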