Add blog post: Comparing frontier Claude models (June 2026) #1144

Merged: TomasTomecek merged 1 commit into packit:main from TomasTomecek:compare-models on Jun 9, 2026

@TomasTomecek commented on Jun 9, 2026

Member

Summary

  • New blog post comparing Sonnet 4.6, Opus 4.6, and Opus 4.8 on a fixed set of 5 RHEL triage issues using the e2e test harness
  • Adds recharts-based interactive charts (duration, tool calls, input tokens, output tokens, cost)
  • Introduces src/components/ModelEvalCharts/ as the first reusable chart component in the repo
  • Post converted to .mdx to support JSX chart components

Test plan

  • Charts render correctly in browser (verified locally via npm start)
  • All 60 data points in charts verified against original raw tables
  • All GitHub profile links in acknowledgments verified (HTTP 200)
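
For anyone who wants to repeat the data-point check, here is a minimal sketch. It assumes both the chart data and the raw tables can be loaded as dictionaries keyed by `(model, issue, metric)`. The helper name `find_mismatches` and the sample values are placeholders for illustration; they are not part of this PR and are not actual measurements.

```python
import math

Key = tuple[str, str, str]  # (model, issue, metric)


def find_mismatches(chart: dict[Key, float], raw: dict[Key, float],
                    rel_tol: float = 1e-9) -> list[Key]:
    """Return keys that are missing on one side or whose values differ."""
    mismatches = []
    for key in sorted(set(chart) | set(raw)):
        if key not in chart or key not in raw:
            mismatches.append(key)  # present in only one source
        elif not math.isclose(chart[key], raw[key], rel_tol=rel_tol):
            mismatches.append(key)  # same key, different value
    return mismatches


# Placeholder values only; these are not results from the blog post.
raw_table = {
    ("model-a", "ISSUE-1", "tool_calls"): 10.0,
    ("model-b", "ISSUE-1", "tool_calls"): 12.0,
}
chart_data = dict(raw_table)
print(find_mismatches(chart_data, raw_table))  # prints []
```

An empty list means every chart value has a matching entry in the raw tables; any returned key points at the exact data point to recheck.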

Example:

image

🤖 Generated with Claude Code

This was done on a fixed set of five RHEL triage issues using our [end-to-end
test harness](https://github.com/packit/ai-workflows/blob/main/e2e-ci-setup.md).
The 4.6 models ran with `REASONING_EFFORT=high`, Opus 4.8 doesn't
support this via the BeeAI framework yet so we used the default. The harness runs

Member

I'm a bit confused, the default for 4.6 is high. Is the default for 4.8 different?

Member

To be clear, the `REASONING_EFFORT` variable controls whether we enable native thinking or not, but if it is enabled, `high` is the default value for the effort parameter passed to Anthropic models, IIUC.
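
A minimal sketch of the behaviour described in this comment, assuming a hypothetical helper `reasoning_kwargs` (it is not taken from the ai-workflows code base). When `REASONING_EFFORT` is unset, nothing is passed and native thinking is not requested; when it is set, the value is forwarded as the `reasoning_effort` completion parameter. The set of accepted values is also an assumption.

```python
import os
from typing import Mapping

# Assumed set of accepted values; adjust to whatever the backend supports.
ALLOWED_EFFORTS = {"low", "medium", "high"}


def reasoning_kwargs(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Build extra completion kwargs from REASONING_EFFORT (hypothetical helper)."""
    effort = env.get("REASONING_EFFORT", "").strip().lower()
    if not effort:
        # Variable unset or empty: pass nothing, so thinking is not requested.
        return {}
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"unsupported REASONING_EFFORT: {effort!r}")
    return {"reasoning_effort": effort}


print(reasoning_kwargs({}))                            # prints {}
print(reasoning_kwargs({"REASONING_EFFORT": "high"}))  # prints {'reasoning_effort': 'high'}
```

The returned dictionary would be merged into the keyword arguments of the completion call, so commenting out that merge is equivalent to running with the variable unset.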

Member Author

actually 4.8 can't have the reasoning changed because it's using adaptive reasoning; I haven't checked if the latest beeai already supports this, but sadly the one we are using doesn't

also there is a bug between litellm and beeai: litellm uses `vertex_ai` and beeai uses `vertexai`, so it was impossible to enable adaptive thinking

the only thing that worked was commenting out setting reasoning
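
To illustrate the naming mismatch mentioned above, here is a purely hypothetical sketch of normalizing provider names before they are compared. Neither `PROVIDER_ALIASES` nor `normalize_provider` exists in LiteLLM or BeeAI; the only detail taken from this thread is that one side spells the provider `vertex_ai` and the other `vertexai`.

```python
# Hypothetical alias table: maps alternative spellings to one canonical name.
PROVIDER_ALIASES = {"vertexai": "vertex_ai"}


def normalize_provider(name: str) -> str:
    """Return a canonical provider name so both spellings compare equal."""
    key = name.strip().lower()
    return PROVIDER_ALIASES.get(key, key)


print(normalize_provider("vertexai") == normalize_provider("vertex_ai"))  # prints True
```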

Member

> I haven't checked if the latest beeai already supports this, but sadly the one we are using doesn't

I don't think this is BeeAI-related at all, it's a LiteLLM thing.

> the only thing that worked was commenting out setting reasoning

Without `reasoning_effort` set, native thinking should be disabled, so that would mean Opus 4.8 didn't use it, unless there is a different default (that's certainly possible).

@jpodivin left a comment

Good, some points could be clarified but good.

Comment on lines +22 to +23
The 4.6 models ran with `REASONING_EFFORT=high`, Opus 4.8 doesn't
support this via the BeeAI framework yet so we used the default. The harness runs

Link to an issue would be nice.

Member Author

good point, let me actually find this (or report it)

Member

I doubt you'll find anything about a feature that doesn't exist upstream 😅 But it's really just a config parameter that is passed to LiteLLM, BeeAI doesn't do anything with it.


### Takeaway

For the triage workload we tested, Sonnet 4.6 offers the best price-performance

Do we have exact numbers, or at least round figures, to work with?

Member Author

the embedded chart shows them:

image


The stark difference between the numbers is not just caused by the model
evolution but also the fact how non-deterministic task this triage is. We are
also not utilizing Opus 4.8's adaptive thinking.

Member

Are we sure about that? Did Opus 4.8 run with native thinking disabled?

Member Author

okay, let me check, something I forgot to verify

Member Author

yes, there was no native thinking; Jirka had a good point about linking to an upstream issue, which I cannot find, so I'm opening one

@lbarcziova left a comment

Member

I love the graphs! Just one note.

Comment on lines +20 to +21
This was done on a fixed set of five RHEL triage issues using our [end-to-end
test harness](https://github.com/packit/ai-workflows/blob/main/e2e-ci-setup.md).

Member

what about adding one more sentence for context, explaining what RHEL triage issues are, i.e. what the triage process involves (explaining that it's a complex process with multiple decision trees)?


### Takeaway

For the triage workload we tested, Sonnet 4.6 offers the best price-performance

Member

interesting, I didn't expect this!

Member Author

exactly, surprised me quite a bit; but at the same time, I hope we'll be able to do this analysis on a much bigger scale

On the other hand, this is an evaluation harness, so we need to make a real
judgement in our day to day work while processing real issues.

None of this analysis would be possible without the incredible work of the

Member

🫶

Signed-off-by: Tomas Tomecek <ttomecek@redhat.com>
Assisted-by: Claude

@TomasTomecek merged commit 9f9a406 into packit:main on Jun 9, 2026
4 checks passed
@github-project-automation (bot) moved this from New to Done in Packit pull requests on Jun 9, 2026
@TomasTomecek deleted the compare-models branch on June 9, 2026 11:53

6 participants