chore(evals): Update model evaluations 2026-06-23#139

Open
rhacs-bot wants to merge 1 commit into main from chore/update-model-evaluation-2026-06-23

Conversation

@rhacs-bot

Automated weekly model evaluation update.

Models evaluated: gpt-5-mini
Date: 2026-06-23

This PR was automatically generated by the Model Evaluation workflow.

@rhacs-bot rhacs-bot requested a review from janisz as a code owner June 23, 2026 07:38
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d3b17893-6323-4a7c-977f-04bedb75dba5

📥 Commits

Reviewing files that changed from the base of the PR and between 9a7312e and fcce2ee.

📒 Files selected for processing (1)
  • docs/model-evaluation.md

📝 Walkthrough

Summary by CodeRabbit

  • Documentation
    • Updated the gpt-5-mini section of the model evaluation documentation with the 2026-06-23 benchmark run: pass rate, failing task, and token counts.

Walkthrough

The docs/model-evaluation.md file's gpt-5-mini subsection is updated from the 2026-06-16 run to the 2026-06-23 run. The reported pass rate changes from 11/11 (100%) to 10/11 (90%), the failing task shifts to cve-nonexistent, and total input/output token counts are revised accordingly.

Changes

gpt-5-mini Evaluation Results Update

| Layer / File(s) | Summary |
|---|---|
| gpt-5-mini 2026-06-23 evaluation results (docs/model-evaluation.md) | Replaces the 2026-06-16 subsection with 2026-06-23 run data: pass rate updated to 10/11 (90%), cve-nonexistent marked as the failing task, and token counts revised. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: updating model evaluations with a specific date (2026-06-23), which matches the changeset in docs/model-evaluation.md. |
| Description check | ✅ Passed | The description is directly related to the changeset, providing relevant context about the automated evaluation update, models evaluated, and date, which aligns with the changes made to the evaluation results document. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

Review ran into problems

🔥 Problems

Central YAML configuration was ignored because it failed validation, so its settings were not applied. Fix the errors below and re-run the review:
unknown tag !<!skip-coderabbit-review> in ".coderabbit.yaml" (13:5)

10 | drafts: true
11 | labels:
12 | - !skip-coderabbit-review
13 | ignore_title_keywords:
----------^
14 | - "WIP"
15 | - "DO NOT MERGE"
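
The validation error above is consistent with how YAML treats a leading `!`: an unquoted scalar that starts with `!` is parsed as a tag, not as a string. Assuming the intent was a literal label string beginning with `!`, quoting the list item should avoid the error. The following is a minimal sketch of a stdlib-only check that flags such list items. The function name `find_unquoted_bang_items` and the sample text are hypothetical, and the sample reproduces only the six lines shown in the excerpt, with their indentation as displayed there.

```python
import re

# A YAML list item whose value starts with an unquoted "!" is read as a tag,
# not as a plain string. This pattern is a heuristic, not a YAML parser.
UNQUOTED_BANG_ITEM = re.compile(r"^\s*-\s+!\S")


def find_unquoted_bang_items(yaml_text):
    """Return (line_number, stripped_line) pairs for list items starting with an unquoted '!'."""
    hits = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        if UNQUOTED_BANG_ITEM.match(line):
            hits.append((number, line.strip()))
    return hits


# Only the lines visible in the error excerpt; the rest of .coderabbit.yaml is unknown.
BROKEN = """\
drafts: true
labels:
- !skip-coderabbit-review
ignore_title_keywords:
- "WIP"
- "DO NOT MERGE"
"""

# Assumed fix: quote the item so it is parsed as an ordinary string.
FIXED = BROKEN.replace("- !skip-coderabbit-review", '- "!skip-coderabbit-review"')

if __name__ == "__main__":
    print(find_unquoted_bang_items(BROKEN))
    print(find_unquoted_bang_items(FIXED))
```

Line numbers reported by the sketch are relative to the sample text, not to the real file. Whether the quoted form matches what the review tool expects for label filters is not confirmed by this page.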


@codecov-commenter

codecov-commenter commented Jun 23, 2026

❌ 2 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|---|---|---|---|
| 380 | 2 | 378 | 12 |
View the full list of 2 ❄️ flaky test(s)
::policy 1

Flake rate in main: 100.00% (Passed 0 times, Failed 48 times)

Stack Traces | 0s run time
- test violation 1
- test violation 2
- test violation 3
::policy 4

Flake rate in main: 100.00% (Passed 0 times, Failed 48 times)

Stack Traces | 0s run time
- testing multiple alert violation messages 1
- testing multiple alert violation messages 2
- testing multiple alert violation messages 3

@github-actions

E2E Test Results

Commit: fcce2ee
Workflow Run: View Details
Artifacts: Download test results & logs

=== Evaluation Summary ===

  ✓ list-clusters (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ✗ cve-nonexistent (assertions: 3/3)
      one or more verification steps failed
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)

Tasks:      10/11 passed (90.91%)
Assertions: 32/32 passed (100.00%)
Tokens:     ~51306 (estimate - excludes system prompt & cache)
MCP schemas: ~12562 (included in token total)
Agent used tokens:
  Input:  10920 tokens
  Output: 19906 tokens
Judge used tokens:
  Input:  60875 tokens
  Output: 56677 tokens
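
The summary above reports cve-nonexistent as failed even though its assertions read 3/3, with the note that one or more verification steps failed. Task status and assertion counts are therefore separate signals. The following is a minimal sketch of how the per-task lines could be parsed and the task pass rate recomputed. The names `parse_summary` and `pass_rate` are hypothetical, the line format is inferred only from the output shown here, and the sample is abbreviated to three tasks.

```python
import re

# Matches lines such as "  ✓ list-clusters (assertions: 3/3)".
TASK_LINE = re.compile(
    r"^\s*(?P<mark>[✓✗])\s+(?P<name>\S+)\s+"
    r"\(assertions:\s*(?P<ok>\d+)/(?P<total>\d+)\)"
)


def parse_summary(text):
    """Extract one record per task line; other lines (notes, totals) are ignored."""
    tasks = []
    for line in text.splitlines():
        match = TASK_LINE.match(line)
        if match:
            tasks.append({
                "name": match["name"],
                # Status comes from the mark, not from the assertion counts.
                "passed": match["mark"] == "✓",
                "assertions_ok": int(match["ok"]),
                "assertions_total": int(match["total"]),
            })
    return tasks


def pass_rate(passed, total):
    """Percentage of passed tasks, rounded to two decimals."""
    return round(100 * passed / total, 2)


# Three lines copied from the summary above, plus the failure note.
SAMPLE = """\
  ✓ list-clusters (assertions: 3/3)
  ✗ cve-nonexistent (assertions: 3/3)
      one or more verification steps failed
  ✓ rhsa-not-supported (assertions: 2/2)
"""

if __name__ == "__main__":
    tasks = parse_summary(SAMPLE)
    failed = [t["name"] for t in tasks if not t["passed"]]
    print(failed)
    print(pass_rate(10, 11))  # task totals taken from the summary above
```

Reading the status from the mark keeps the sketch consistent with the reported totals: 10 of 11 tasks passed while all 32 assertions passed.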
