fix(benchmarks): exempt 3.11.1 regression-guard entries that fail on publish #1248
Default model claude-sonnet-4-20250514 is deprecated; API rejects it with 0 tokens causing automated-review to fail with CLAUDE_SUCCESS=false.
…publish
3.11.0 has no query benchmark data in committed history, so findLatestPair falls back to 3.10.0 as the baseline for 3.11.1. The 3.10.0 numbers predate the corpus-scope change from #1134 (resolution fixtures excluded from the build sweep), making DB bytes/file and fnDeps depth 3/5 appear as regressions against the older baseline. The per-PR gate uses version 'dev', which triggers the assertNoRegressions baseline-version fallback so KNOWN_REGRESSIONS entries for the baseline release also apply, masking the failures in CI. Publish uses the real semver (3.11.1), so that fallback doesn't fire and the guard fails.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude encountered an error. I'll analyze this and get back to you.
Greptile Summary
This PR exempts three
Confidence Score: 5/5
Safe to merge: the changes are narrowly scoped exemptions with exhaustive root-cause documentation and a self-cleaning stale-entry mechanism already in place.
The three new KNOWN_REGRESSIONS entries directly address the described publish-gate failure, the exemption logic in assertNoRegressions is straightforward and correct, the stale-entry test will automatically surface these entries for pruning after 3.12.0, and the claude.yml SHA-pin is a supply-chain improvement with no downside. The only finding is a minor wording inconsistency in a doc comment.
The doc comment on the 3.11.1:DB bytes/file entry in regression-guard.test.ts references query benchmark data for a metric that lives in the build benchmark history; it is worth confirming which history file is actually missing 3.11.0 data before the entry is pruned at 3.12.0.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["publish.yml runs\n(version = 3.11.1)"] --> B["regression-guard.test.ts\nassertNoRegressions()"]
B --> C{"KNOWN_REGRESSIONS\n.has('3.11.1:metric')?"}
C -- "Before this PR → false" --> D["❌ Gate fails"]
C -- "After this PR → true" --> E["✅ Exempted"]
F["Per-PR CI gate\n(version = 'dev')"] --> G["baseline fallback\n.has('3.11.0:metric')"]
G --> H["✅ Already covered"]
I["Stale-entry test"] --> J{"minorGap > 1?"}
J -- "After 3.12.0" --> K["⚠️ 3.11.1 entries flagged → pruned"]
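The exemption decision in the flowchart can be sketched as follows. This is a minimal illustration, not the code in regression-guard.test.ts: the function name isExempt, its parameters, and the two example sets are hypothetical, and the only behaviour assumed is what the PR describes (keys of the form "version:metric", plus a baseline-version fallback that applies to 'dev' runs only).

```typescript
// Hypothetical stand-ins for KNOWN_REGRESSIONS before and after this PR.
// Keys follow the "version:metric" form shown in the flowchart.
const beforePr = new Set<string>(["3.11.0:fnDeps depth 3"]);
const afterPr = new Set<string>([...beforePr, "3.11.1:fnDeps depth 3"]);

// Assumed logic: a metric is exempt when an entry is keyed to the version
// under test, or (for 'dev' runs only) to the baseline version.
function isExempt(
  known: ReadonlySet<string>,
  version: string,
  baselineVersion: string,
  metric: string,
): boolean {
  if (known.has(`${version}:${metric}`)) return true;
  return version === "dev" && known.has(`${baselineVersion}:${metric}`);
}

// Per-PR gate: 'dev' falls back to the baseline's entry.
console.log(isExempt(beforePr, "dev", "3.11.0", "fnDeps depth 3")); // true
// Publish gate before this PR: real semver, no fallback, no matching key.
console.log(isExempt(beforePr, "3.11.1", "3.10.0", "fnDeps depth 3")); // false
// Publish gate after this PR: the 3.11.1 key exists.
console.log(isExempt(afterPr, "3.11.1", "3.10.0", "fnDeps depth 3")); // true
```

Under these assumptions the same metric passes on the per-PR gate and fails on publish until a 3.11.1-keyed entry exists, which matches the two paths drawn above.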
Reviews (3): Last reviewed commit: "fix(ci): pin claude-code-action to SHA d..."
Summary
Add 3.11.1:DB bytes/file, 3.11.1:fnDeps depth 3, and 3.11.1:fnDeps depth 5 to KNOWN_REGRESSIONS in regression-guard.test.ts
Why this fails on publish but not in CI
The per-PR benchmark gate runs with --version dev. assertNoRegressions has a dev-only fallback: when comparing dev vs a baseline, KNOWN_REGRESSIONS entries keyed to the baseline version also exempt the metric. So dev vs 3.11.0 was covered by 3.11.0:fnDeps depth 3 etc.
When publish.yml runs, it uses the real semver (3.11.1). The fallback doesn't fire for non-dev versions, and KNOWN_REGRESSIONS.has('3.11.1:fnDeps depth 3') is false → gate fails.
Why the baseline is 3.10.0, not 3.11.0
3.11.0 has no query benchmark data in committed history, so findLatestPair skips it and falls back to 3.10.0. The 3.10.0 numbers predate the corpus-scope change from #1134 (resolution fixtures excluded from the build sweep), so DB bytes/file and fnDeps values look inflated against that older baseline.
Exemption rationale
3.11.1:DB bytes/file
3.11.1:fnDeps depth 3/5
All three entries carry "remove once 3.12.0+ data confirms stable numbers against a 3.11.x baseline" in their doc comments; the stale-entry test will flag them automatically after 3.12.0 ships.
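The baseline fallback described under "Why the baseline is 3.10.0, not 3.11.0" can be illustrated with a small sketch. It is not the real findLatestPair: the HistoryEntry shape, the newest-first ordering, and the name findLatestPairSketch are assumptions made for illustration, and the metric values are placeholders rather than benchmark results.

```typescript
// Hypothetical shape of one benchmark-history entry; the real schema may differ.
interface HistoryEntry {
  version: string;
  // null models a release with no committed data for this benchmark.
  metrics: Record<string, number> | null;
}

// Assumed behaviour: history is ordered newest-first, and the latest entry
// with data is paired with the next older entry that also has data.
function findLatestPairSketch(
  history: HistoryEntry[],
): { latest: HistoryEntry; baseline: HistoryEntry } | null {
  const withData = history.filter((entry) => entry.metrics !== null);
  const latest = withData[0];
  const baseline = withData[1];
  if (!latest || !baseline) return null; // fewer than two usable releases
  return { latest, baseline };
}

const pair = findLatestPairSketch([
  { version: "3.11.1", metrics: { "fnDeps depth 3": 1 } }, // placeholder value
  { version: "3.11.0", metrics: null }, // no query benchmark data committed
  { version: "3.10.0", metrics: { "fnDeps depth 3": 1 } }, // placeholder value
]);

console.log(pair?.baseline.version); // "3.10.0"
```

Because 3.11.0 is skipped, 3.11.1 is compared against numbers recorded before the corpus-scope change from #1134, which is why the three metrics need 3.11.1-keyed exemptions.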
Test plan
RUN_REGRESSION_GUARD=1 npm run test:regression-guard (17/17 pass locally)