fix(benchmarks): dedicated WASM timing tolerance in regression guard #1255

Merged
carlos-alm merged 2 commits into main from fix/regression-guard-3.11.1-noop-fullbuild
May 30, 2026

@carlos-alm carlos-alm commented May 30, 2026

Summary

The v3.11.1 publish workflow kept failing the test:regression-guard gate on WASM timing metrics — one or two different metrics each run — because WASM wall-clock jitter on shared CI runners is large in percentage terms. PR #1248 already exempted three 3.11.1 metrics by hand; this PR replaces that whack-a-mole with a structural fix.

  • Adds WASM_TIMING_THRESHOLD (70%), applied to timing metrics measured under the WASM engine via an engine-aware thresholdFor(label, engine). The three benchmark suites (build/query/incremental) now pass engineKey through to assertNoRegressions.
  • WASM runs every parse/query through the tree-sitter-wasm interpreter (3–5× slower than native, dominated by interpreter + GC overhead), so identical ±10–20ms runner jitter lands as a much larger percentage swing. Observed +27–67% run-to-run on byte-identical code.
  • Native is the canary. It shares all extraction/resolution/query logic with WASM and keeps the strict 25%/50% thresholds, so a real algorithmic regression still trips it. The WASM widening still flags the 100–220% catastrophes the guard exists to catch.
  • Size metrics excluded (SIZE_METRICS = DB bytes/file) — engine-independent and deterministic, so they keep the strict threshold regardless of engine.
  • Removes the now-superseded 3.11.1:No-op rebuild and 3.11.1:Full build entries (both WASM-only timing trips). Keeps 3.11.1:fnDeps depth 3/5 (trip the native engine too: 24.3→34.7, 24.7→34.7) and 3.11.1:DB bytes/file (size metric) — neither is covered by the WASM widening.
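
The threshold selection described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: only `WASM_TIMING_THRESHOLD`, `SIZE_METRICS`, and the `thresholdFor(label, engine)` shape come from the description. The names `STRICT_THRESHOLD`, `NOISY_THRESHOLD`, and `NOISY_METRICS`, and the assumption that the 50% tier applies to a set of noisy metrics such as No-op rebuild, are hypothetical.

```typescript
// Hypothetical sketch of an engine-aware threshold lookup.
// Names marked "assumed" do not appear in the PR text.
type Engine = "native" | "wasm";

const STRICT_THRESHOLD = 0.25; // assumed name for the strict 25% tier
const NOISY_THRESHOLD = 0.5; // assumed name for the looser 50% tier
const WASM_TIMING_THRESHOLD = 0.7; // 70%, as stated in the PR

// Size metrics are engine-independent and deterministic.
const SIZE_METRICS = new Set<string>(["DB bytes/file"]);
// Assumed membership: the PR only calls No-op rebuild a "NOISY metric".
const NOISY_METRICS = new Set<string>(["No-op rebuild"]);

function thresholdFor(label: string, engine: Engine): number {
  // Size metrics keep the strict threshold regardless of engine.
  if (SIZE_METRICS.has(label)) return STRICT_THRESHOLD;
  // Any timing metric measured under WASM gets the wider tolerance.
  if (engine === "wasm") return WASM_TIMING_THRESHOLD;
  // Native keeps the original strict/noisy tiers.
  return NOISY_METRICS.has(label) ? NOISY_THRESHOLD : STRICT_THRESHOLD;
}
```

Checking the size-metric case first is what keeps `DB bytes/file` strict even when the engine is WASM; reversing the first two checks would silently widen it to 70%.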

Test plan

  • RUN_REGRESSION_GUARD=1 vitest run on committed data — 17/17 pass
  • biome check — clean
  • Injected the publish-run 3.11.1 numbers (No-op 15→25, Full build 7664→9833, fnDeps/DB bytes): guard passes with the widening
  • Negative test: with the widening disabled, guard fails on exactly No-op rebuild + Full build (the original publish failures), confirming the tolerance is load-bearing
  • Re-run the publish workflow; regression-guard gate passes

…oise

The publish-time regression guard tripped on two more WASM timing metrics
that are CI runner noise, not real regressions:

  - No-op rebuild (build): 15 → 25 (+67%), a 10ms delta at the noise floor
    on a sub-30ms NOISY metric; historical wasm range is 5–22ms.
  - Full build (incremental): 7664 → 9833 (+28%); wasm full-build history
    spans 7.2s–14.0s, so 9.8s is inside the envelope.

Native figures did not trip and no build/incremental codepath changed
between 3.10.0 and 3.11.1, confirming runner variance. Same shape and
root cause as the existing 3.10.0/3.11.0 No-op rebuild and Full build
exemptions. Remove once 3.12.0+ data is captured against a committed
3.11.x baseline.
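
The percentages quoted in this commit message follow from a plain relative-change calculation, sketched below with the two publish-run figures. The helper name `relativeChange` is hypothetical, and the pairing of thresholds to metrics on the strict side (50% for No-op rebuild, 25% for Full build) is an assumption based on the PR calling No-op rebuild a NOISY metric.

```typescript
// Relative change of a measurement against its baseline, as a fraction.
function relativeChange(baseline: number, current: number): number {
  return (current - baseline) / baseline;
}

const noOpRebuild = relativeChange(15, 25); // 10 / 15
const fullBuild = relativeChange(7664, 9833); // 2169 / 7664

console.log(Math.round(noOpRebuild * 100)); // 67
console.log(Math.round(fullBuild * 100)); // 28

// Both stay under the 70% WASM tolerance...
console.log(noOpRebuild > 0.7, fullBuild > 0.7); // false false
// ...but exceed the assumed strict tiers (50% noisy, 25% default).
console.log(noOpRebuild > 0.5, fullBuild > 0.25); // true true
```

The No-op rebuild case shows why percentage thresholds are fragile near the noise floor: a 10 ms delta on a 15 ms baseline already reads as +67%.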

greptile-apps Bot commented May 30, 2026

Greptile Summary

This PR replaces the per-version whack-a-mole KNOWN_REGRESSIONS entries for WASM timing noise with a structural fix: a dedicated WASM_TIMING_THRESHOLD of 70% applied to all WASM timing metrics except deterministic size metrics (DB bytes/file), which keep the strict 25% threshold regardless of engine.

  • Adds WASM_TIMING_THRESHOLD = 0.7, SIZE_METRICS, and an engine-aware thresholdFor(label, engine) that returns the appropriate tolerance. All three benchmark suites (build, query, incremental) now pass engineKey down to assertNoRegressions.
  • Cleans up KNOWN_REGRESSIONS: removes the WASM-only timing exemptions for 3.11.1:No-op rebuild and 3.11.1:Full build, retaining only entries that also trip the native engine (3.11.1:fnDeps depth 3/5) or are engine-independent size metrics (3.11.1:DB bytes/file).

Confidence Score: 5/5

Safe to merge. The change is confined to the regression guard test file and only widens tolerances for WASM timing metrics — it cannot cause false negatives on native benchmarks, which retain the strict 25%/50% thresholds.

The logic in thresholdFor is straightforward: WASM timing metrics return 70%, size metrics fall through to the strict 25% regardless of engine, and native always uses the original thresholds. The three call-sites in the test suites all pass engineKey consistently.

No files require special attention.

Important Files Changed

Filename: tests/benchmarks/regression-guard.test.ts
Overview: Adds WASM_TIMING_THRESHOLD (70%), SIZE_METRICS exclusion set, and propagates engine key through thresholdFor/assertNoRegressions to structurally handle WASM timing jitter. Logic is consistent and well-guarded. No issues found.

Reviews (2): Last reviewed commit: "fix(benchmarks): add dedicated WASM timi..."

Replaces the per-version, per-metric KNOWN_REGRESSIONS whack-a-mole for
WASM timing noise with a structural fix: timing metrics measured under the
WASM engine get a wider WASM_TIMING_THRESHOLD (70%) via an engine-aware
thresholdFor.

WASM wall-clock is 3-5x slower than native and dominated by interpreter +
GC overhead, so identical shared-runner jitter lands as a far larger
percentage swing (observed +27-67% run-to-run on byte-identical code). The
native engine shares all extraction/resolution/query logic and keeps the
strict 25%/50% thresholds, so it remains the canary for real regressions;
the WASM widening still flags the 100-220% catastrophes the guard exists to
catch. Size metrics (DB bytes/file) are engine-independent and excluded via
SIZE_METRICS so they keep the strict threshold.

Removes the now-superseded 3.11.1 No-op rebuild and Full build entries (both
WASM-only timing trips). The remaining 3.11.x entries are kept: fnDeps depth
3/5 trip the native engine too (24.3->34.7, 24.7->34.7) and DB bytes/file is
a size metric — neither is covered by the WASM widening.

Verified by injecting the publish-run 3.11.1 numbers: the guard passes with
the widening and fails on exactly No-op rebuild + Full build without it.
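
How the retained exemption entries interact with the per-engine threshold can be sketched as below. This is an illustrative assumption about the guard's shape, not its actual implementation: the `version:label` key format and the `KNOWN_REGRESSIONS` name come from the PR text, while `findRegressions`, the `Sample` type, the reading of "fnDeps depth 3/5" as two separate entries, and passing the threshold function as a parameter are all hypothetical.

```typescript
type Engine = "native" | "wasm";

interface Sample {
  label: string;
  baseline: number;
  current: number;
}

// Entries the PR says are kept; splitting "depth 3/5" into two keys is assumed.
const KNOWN_REGRESSIONS = new Set<string>([
  "3.11.1:fnDeps depth 3",
  "3.11.1:fnDeps depth 5",
  "3.11.1:DB bytes/file",
]);

// Returns the labels that regressed beyond tolerance and are not exempted.
function findRegressions(
  version: string,
  engine: Engine,
  samples: Sample[],
  thresholdFor: (label: string, engine: Engine) => number,
): string[] {
  const failures: string[] = [];
  for (const s of samples) {
    // Hand-written exemptions are checked before any threshold.
    if (KNOWN_REGRESSIONS.has(`${version}:${s.label}`)) continue;
    const change = (s.current - s.baseline) / s.baseline;
    if (change > thresholdFor(s.label, engine)) failures.push(s.label);
  }
  return failures;
}
```

Under this sketch, the fnDeps depth 3 figures (24.3 to 34.7, roughly +43%) would exceed a 25% native threshold, which is why that entry still needs an explicit exemption rather than relying on the WASM widening.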
@carlos-alm carlos-alm changed the title from "fix(benchmarks): exempt 3.11.1 No-op rebuild and Full build wasm CI noise" to "fix(benchmarks): dedicated WASM timing tolerance in regression guard" May 30, 2026
@carlos-alm carlos-alm merged commit 8b8b93c into main May 30, 2026
21 checks passed
@carlos-alm carlos-alm deleted the fix/regression-guard-3.11.1-noop-fullbuild branch May 30, 2026 21:06
@github-actions github-actions Bot locked and limited conversation to collaborators May 30, 2026