chore(evals): Update model evaluations 2026-06-23 by rhacs-bot · Pull Request #139 · stackrox/stackrox-mcp

rhacs-bot · 2026-06-23T07:38:55Z

Automated weekly model evaluation update.

Models evaluated: gpt-5-mini
Date: 2026-06-23

This PR was automatically generated by the Model Evaluation workflow.

coderabbitai · 2026-06-23T07:39:10Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d3b17893-6323-4a7c-977f-04bedb75dba5

📥 Commits

Reviewing files that changed from the base of the PR and between 9a7312e and fcce2ee.

📒 Files selected for processing (1)

docs/model-evaluation.md

📝 Walkthrough

Summary by CodeRabbit

Documentation
- Updated the model evaluation documentation with the latest benchmark run: pass rate, per-task results, and token counts now reflect the most recent evaluation.

Walkthrough

The gpt-5-mini subsection of docs/model-evaluation.md is updated from the 2026-06-16 run to the 2026-06-23 run. The reported pass rate drops from 11/11 (100%) to 10/11 (90%), cve-nonexistent is now the one failing task, and the total input/output token counts are revised to match the new run.
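
For reference, 10/11 is about 90.91%, so the 90% quoted in the walkthrough is consistent with truncating to a whole percent rather than rounding (which would give 91%). How the documentation actually derives the figure is not stated in this thread; the sketch below only illustrates the arithmetic.

```python
# Pass counts taken from the walkthrough: 10 of 11 tasks passed.
passed, total = 10, 11

truncated = passed * 100 // total      # integer division drops the fraction
rounded = round(passed * 100 / total)  # nearest whole percent
precise = f"{passed / total:.2%}"      # two decimal places

print(truncated, rounded, precise)     # prints: 90 91 90.91%
```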

Changes

gpt-5-mini Evaluation Results Update

| Layer / File(s) | Summary |
| --- | --- |
| gpt-5-mini 2026-06-23 evaluation results `docs/model-evaluation.md` | Replaces the 2026-06-16 subsection with 2026-06-23 run data: pass rate updated to 10/11 (90%), `cve-nonexistent` marked as the failing task, and token counts revised. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: updating model evaluations with a specific date (2026-06-23), which matches the changeset in docs/model-evaluation.md. |
| Description check | ✅ Passed | The description is directly related to the changeset, providing relevant context about the automated evaluation update, models evaluated, and date, which aligns with the changes made to the evaluation results document. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

Review ran into problems

🔥 Problems

Central YAML configuration was ignored because it failed validation, so its settings were not applied. Fix the errors below and re-run the review:
unknown tag !<!skip-coderabbit-review> in ".coderabbit.yaml" (13:5)


codecov-commenter · 2026-06-23T07:42:37Z

❌ 2 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --- | --- | --- | --- |
| 380 | 2 | 378 | 12 |

View the full list of 2 ❄️ flaky test(s)

::policy 1
Flake rate in main: 100.00% (Passed 0 times, Failed 48 times)
Stack Traces | 0s run time
- test violation 1
- test violation 2
- test violation 3

::policy 4
Flake rate in main: 100.00% (Passed 0 times, Failed 48 times)
Stack Traces | 0s run time
- testing multiple alert violation messages 1
- testing multiple alert violation messages 2
- testing multiple alert violation messages 3


github-actions · 2026-06-23T07:50:15Z

E2E Test Results

Commit: fcce2ee
Workflow Run: View Details
Artifacts: Download test results & logs

=== Evaluation Summary ===

  ✓ list-clusters (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ✗ cve-nonexistent (assertions: 3/3)
      one or more verification steps failed
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)

Tasks:      10/11 passed (90.91%)
Assertions: 32/32 passed (100.00%)
Tokens:     ~51306 (estimate - excludes system prompt & cache)
MCP schemas: ~12562 (included in token total)
Agent used tokens:
  Input:  10920 tokens
  Output: 19906 tokens
Judge used tokens:
  Input:  60875 tokens
  Output: 56677 tokens
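
The task and assertion totals can be re-derived from the per-task lines. The following is a minimal Python sketch, assuming the summary is available as plain text in the format shown above. The regular expression and the `parse_summary` helper are illustrative and not part of the evaluation workflow. Note that a task's status comes from its ✓/✗ mark, not from its assertion count: `cve-nonexistent` reports 3/3 assertions but still fails on a verification step.

```python
import re

# Per-task lines copied from the evaluation summary above.
SUMMARY = """\
  ✓ list-clusters (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ✗ cve-nonexistent (assertions: 3/3)
      one or more verification steps failed
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
"""

# Hypothetical pattern for one task line: mark, task name, assertions passed/total.
TASK_LINE = re.compile(r"^\s*([✓✗])\s+(\S+)\s+\(assertions:\s*(\d+)/(\d+)\)")


def parse_summary(text):
    """Return (name, task_passed, assertions_passed, assertions_total) per task."""
    tasks = []
    for line in text.splitlines():
        match = TASK_LINE.match(line)
        if match:  # detail lines such as the failure reason are skipped
            mark, name, ok, total = match.groups()
            tasks.append((name, mark == "✓", int(ok), int(total)))
    return tasks


tasks = parse_summary(SUMMARY)
tasks_passed = sum(1 for _, passed, _, _ in tasks if passed)
assertions_passed = sum(ok for _, _, ok, _ in tasks)
assertions_total = sum(total for _, _, _, total in tasks)
failing = [name for name, passed, _, _ in tasks if not passed]

print(f"Tasks:      {tasks_passed}/{len(tasks)} passed ({tasks_passed / len(tasks):.2%})")
print(f"Assertions: {assertions_passed}/{assertions_total} passed "
      f"({assertions_passed / assertions_total:.2%})")
print(f"Failing:    {failing}")
```

Run against the lines above, this reproduces the reported 10/11 tasks (90.91%) and 32/32 assertions (100.00%).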

Update model evaluations 2026-06-23

fcce2ee

rhacs-bot requested a review from janisz as a code owner June 23, 2026 07:38

rhacs-bot wants to merge 1 commit into main from chore/update-model-evaluation-2026-06-23

3 participants
