
fix(benchmarks): exempt 3.11.1 regression-guard entries that fail on publish #1248

Merged
carlos-alm merged 5 commits into main from fix/claude-action-model
May 30, 2026

Conversation

@carlos-alm

Contributor

Summary

  • Adds 3.11.1:DB bytes/file, 3.11.1:fnDeps depth 3, and 3.11.1:fnDeps depth 5 to KNOWN_REGRESSIONS in regression-guard.test.ts

Why this fails on publish but not in CI

The per-PR benchmark gate runs with --version dev. assertNoRegressions has a dev-only fallback: when comparing dev vs a baseline, KNOWN_REGRESSIONS entries keyed to the baseline version also exempt the metric. So dev vs 3.11.0 was covered by 3.11.0:fnDeps depth 3 etc.

When publish.yml runs, it uses the real semver (3.11.1). The fallback doesn't fire for non-dev versions, and KNOWN_REGRESSIONS.has('3.11.1:fnDeps depth 3') is false → gate fails.
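The difference between the two gates can be sketched as follows. This is a minimal illustration of the behaviour described above, not the actual assertNoRegressions code: the helper name `isExempt`, its signature, and the sample sets are assumptions made for the example.

```typescript
// Hypothetical sketch; the helper name and signature are assumptions,
// not the code in regression-guard.test.ts.
function isExempt(
  known: ReadonlySet<string>,
  version: string,
  baselineVersion: string,
  metric: string,
): boolean {
  // Direct match: the entry is keyed to the version under test.
  if (known.has(`${version}:${metric}`)) return true;
  // Dev-only fallback: entries keyed to the baseline release also apply.
  if (version === "dev" && known.has(`${baselineVersion}:${metric}`)) return true;
  return false;
}

// Illustrative entry sets before and after this PR.
const knownBefore = new Set(["3.11.0:fnDeps depth 3"]);
const knownAfter = new Set(["3.11.0:fnDeps depth 3", "3.11.1:fnDeps depth 3"]);

// Per-PR gate: version is 'dev', so the baseline-keyed entry applies.
const devCovered = isExempt(knownBefore, "dev", "3.11.0", "fnDeps depth 3");
// Publish gate before this PR: real semver, no fallback, no direct entry.
const publishBefore = isExempt(knownBefore, "3.11.1", "3.10.0", "fnDeps depth 3");
// Publish gate after this PR: the direct 3.11.1 entry matches.
const publishAfter = isExempt(knownAfter, "3.11.1", "3.10.0", "fnDeps depth 3");
```

Under these assumptions, `devCovered` and `publishAfter` are true while `publishBefore` is false, which matches the failure seen only on publish.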

Why the baseline is 3.10.0, not 3.11.0

3.11.0 has no query benchmark data in committed history, so findLatestPair skips it and falls back to 3.10.0. The 3.10.0 numbers predate the corpus-scope change from #1134 (resolution fixtures excluded from the build sweep), so DB bytes/file and fnDeps values look inflated against that older baseline.
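A rough sketch of that skip-and-fall-back behaviour is shown below. The real findLatestPair and the benchmark history format are not shown in this PR, so the `HistoryEntry` shape, the newest-first ordering, and the function name `findLatestPairSketch` are all assumptions; the metric values are placeholders, not measurements.

```typescript
// Hypothetical history shape; the real history file format is an assumption here.
interface HistoryEntry {
  version: string;
  metrics?: Record<string, number>; // absent or empty when no data was committed
}

function hasData(entry: HistoryEntry): boolean {
  return entry.metrics !== undefined && Object.keys(entry.metrics).length > 0;
}

// Assumes `history` is ordered newest first.
function findLatestPairSketch(
  history: HistoryEntry[],
  current: string,
): { latest: HistoryEntry; baseline: HistoryEntry } | null {
  const i = history.findIndex((e) => e.version === current);
  if (i === -1 || !hasData(history[i])) return null;
  // Walk towards older releases and take the first one that has data.
  for (let j = i + 1; j < history.length; j++) {
    if (hasData(history[j])) return { latest: history[i], baseline: history[j] };
  }
  return null;
}

// Placeholder values only; 3.11.0 has no query data, so it is skipped.
const history: HistoryEntry[] = [
  { version: "3.11.1", metrics: { "fnDeps depth 3": 1 } },
  { version: "3.11.0" },
  { version: "3.10.0", metrics: { "fnDeps depth 3": 1 } },
];
const pair = findLatestPairSketch(history, "3.11.1");
const baselineVersion = pair ? pair.baseline.version : null;
```

With this input the sketch pairs 3.11.1 with 3.10.0, mirroring the baseline choice described above.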

Exemption rationale

| Entry | Root cause |
| --- | --- |
| 3.11.1:DB bytes/file | Corpus denominator drop from #1134: ~745 files → ~607 files; bytes constant, per-file ratio inflates |
| 3.11.1:fnDeps depth 3/5 | 3.10.0 baseline predates 3.11.0 steady-state; no fn_deps implementation change |
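As a worked example of the DB bytes/file effect, the snippet below uses the approximate file counts from the exemption rationale above. The byte total is a placeholder chosen for the example (only its constancy matters), so the result illustrates the size of the artifact rather than reproducing measured benchmark values.

```typescript
// File counts are the approximate figures quoted in the rationale;
// the byte total is a hypothetical placeholder held constant across both runs.
const totalBytes = 1000000;
const filesBefore = 745; // corpus size before #1134 (approximate)
const filesAfter = 607;  // corpus size after #1134 (approximate)

const perFileBefore = totalBytes / filesBefore;
const perFileAfter = totalBytes / filesAfter;

// With bytes constant, the ratio reduces to filesBefore / filesAfter.
const inflation = perFileAfter / perFileBefore;
const percentIncrease = (inflation - 1) * 100;

console.log(`DB bytes/file inflates by about ${percentIncrease.toFixed(1)}%`);
```

With these counts the per-file figure rises by roughly 22.7% even though the database size is unchanged, which is the kind of apparent regression the exemption covers.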

All three entries carry "remove once 3.12.0+ data confirms stable numbers against a 3.11.x baseline" in their doc comments — the stale-entry test will flag them automatically after 3.12.0 ships.

Test plan

  • RUN_REGRESSION_GUARD=1 npm run test:regression-guard — 17/17 pass locally
  • publish.yml regression-guard step passes on the next release run

carlos-alm and others added 3 commits May 29, 2026 20:00
The default model claude-sonnet-4-20250514 is deprecated; the API rejects it
with 0 tokens, causing automated-review to fail with CLAUDE_SUCCESS=false.
…publish

3.11.0 has no query benchmark data in committed history, so findLatestPair
falls back to 3.10.0 as the baseline for 3.11.1. The 3.10.0 numbers predate
the corpus-scope change from #1134 (resolution fixtures excluded from the
build sweep), making DB bytes/file and fnDeps depth 3/5 appear as regressions
against the older baseline.

The per-PR gate uses version 'dev', which triggers the assertNoRegressions
baseline-version fallback so KNOWN_REGRESSIONS entries for the baseline
release also apply — masking the failures in CI. Publish uses the real semver
(3.11.1), so that fallback doesn't fire and the guard fails.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude

claude Bot commented May 30, 2026

Claude encountered an error.


I'll analyze this and get back to you.

@greptile-apps

greptile-apps Bot commented May 30, 2026

Contributor

Greptile Summary

This PR exempts three 3.11.1-versioned metrics from the publish-time regression guard — DB bytes/file, fnDeps depth 3, and fnDeps depth 5 — and pins the claude-code-action GitHub Action to a commit SHA instead of the mutable @beta tag.

  • regression-guard.test.ts: Adds 3.11.1:DB bytes/file, 3.11.1:fnDeps depth 3, and 3.11.1:fnDeps depth 5 to KNOWN_REGRESSIONS with detailed doc comments explaining that the apparent regressions are measurement artifacts: the missing 3.11.0 build/query benchmark history causes findLatestPair to fall back to the 3.10.0 baseline, which predates #1134 (perf(bench): exclude resolution-benchmark fixtures from dogfooding sweep). Against that baseline, the smaller file-corpus denominator inflates DB bytes/file and the older numbers inflate the fnDeps values. The existing stale-entry test at line 593 will automatically surface these exemptions for pruning once a 3.12.0 baseline is committed.
  • claude.yml: Replaces the mutable @beta tag on anthropics/claude-code-action with a pinned commit SHA — a supply-chain hardening improvement unrelated to the benchmark fix.

Confidence Score: 5/5

Safe to merge — the changes are narrowly scoped exemptions with exhaustive root-cause documentation and a self-cleaning stale-entry mechanism already in place.

The three new KNOWN_REGRESSIONS entries directly address the described publish-gate failure, the exemption logic in assertNoRegressions is straightforward and correct, the stale-entry test will automatically surface these entries for pruning after 3.12.0, and the claude.yml SHA-pin is a supply-chain improvement with no downside. The only finding is a minor wording inconsistency in a doc comment.

The doc comment on the 3.11.1:DB bytes/file entry in regression-guard.test.ts references query benchmark data for a metric that lives in the build benchmark history — worth confirming which history file is actually missing 3.11.0 data before the entry is pruned at 3.12.0.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/benchmarks/regression-guard.test.ts | Adds three well-documented KNOWN_REGRESSIONS entries for 3.11.1; logic is correct for the publish-time gate, with a minor doc-comment wording inconsistency (says "query benchmark data" for the build-benchmark-derived DB bytes/file metric). |
| .github/workflows/claude.yml | Pins claude-code-action from the mutable @beta tag to a specific commit SHA: standard supply-chain hardening, no functional change. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["publish.yml runs\n(version = 3.11.1)"] --> B["regression-guard.test.ts\nassertNoRegressions()"]
    B --> C{"KNOWN_REGRESSIONS\n.has('3.11.1:metric')?"}
    C -- "Before this PR → false" --> D["❌ Gate fails"]
    C -- "After this PR → true" --> E["✅ Exempted"]
    F["Per-PR CI gate\n(version = 'dev')"] --> G["baseline fallback\n.has('3.11.0:metric')"]
    G --> H["✅ Already covered"]
    I["Stale-entry test"] --> J{"minorGap > 1?"}
    J -- "After 3.12.0" --> K["⚠️ 3.11.1 entries flagged → pruned"]


Reviews (3): Last reviewed commit: "fix(ci): pin claude-code-action to SHA d..."

carlos-alm and others added 2 commits May 29, 2026 23:41
@beta is a moving tag; leaving the action unpinned caused the automated-review job to pick up
a version with a deprecated default model (claude-sonnet-4-20250514), which
the API rejected with 0 tokens and CLAUDE_SUCCESS=false. Pinning to the
SHA that @beta currently resolves to locks in the working version.
@carlos-alm

Contributor Author

@claude

@claude

claude Bot commented May 30, 2026

Claude encountered an error.


I'll analyze this and get back to you.

@carlos-alm carlos-alm merged commit d93b257 into main May 30, 2026
21 checks passed
@carlos-alm carlos-alm deleted the fix/claude-action-model branch May 30, 2026 06:15
@github-actions github-actions Bot locked and limited conversation to collaborators May 30, 2026