Add blog post: Comparing frontier Claude models (June 2026) by TomasTomecek · Pull Request #1144 · packit/packit.dev

TomasTomecek · 2026-06-09T08:13:46Z

Summary

New blog post comparing Sonnet 4.6, Opus 4.6, and Opus 4.8 on a fixed set of 5 RHEL triage issues using the e2e test harness
Adds recharts-based interactive charts (duration, tool calls, input tokens, output tokens, cost)
Introduces src/components/ModelEvalCharts/ as the first reusable chart component in the repo
Post converted to .mdx to support JSX chart components

Test plan

Charts render correctly in browser (verified locally via npm start)
All 60 data points in charts verified against original raw tables
All GitHub profile links in acknowledgments verified (HTTP 200)
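A check like the data-point verification in the test plan could be automated roughly as follows. This is a minimal sketch, not the procedure used for this PR: the record layout (`model`, `issue`, `metric`, `value`) and the names `index_points` and `find_mismatches` are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Mapping

# Hypothetical key for one data point: (model, issue, metric).
Key = tuple[str, str, str]


def index_points(rows: Iterable[Mapping]) -> dict[Key, float]:
    """Index rows by (model, issue, metric) so two sources can be compared."""
    return {(r["model"], r["issue"], r["metric"]): float(r["value"]) for r in rows}


def find_mismatches(raw_rows, chart_rows, tolerance: float = 1e-9) -> list[str]:
    """Return a description of every data point that differs between sources."""
    raw = index_points(raw_rows)
    chart = index_points(chart_rows)
    problems = []
    for key in sorted(raw.keys() | chart.keys()):
        if key not in chart:
            problems.append(f"{key}: missing from chart data")
        elif key not in raw:
            problems.append(f"{key}: missing from raw table")
        elif abs(raw[key] - chart[key]) > tolerance:
            problems.append(f"{key}: raw={raw[key]} chart={chart[key]}")
    return problems
```

An empty list from `find_mismatches` would mean that every point present in either source matches the other within the tolerance.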

🤖 Generated with Claude Code

centosinfra-prod-github-app · 2026-06-09T08:16:26Z

Build succeeded.
https://gateway-cloud-softwarefactory.apps.ocp.cloud.ci.centos.org/zuul/t/packit-service/buildset/c923847bb7164959a6441df4b67b4f4d

✔️ pre-commit SUCCESS in 1m 48s

nforro · 2026-06-09T08:19:49Z

+This was done on a fixed set of five RHEL triage issues using our [end-to-end
+test harness](https://github.com/packit/ai-workflows/blob/main/e2e-ci-setup.md).
+The 4.6 models ran with `REASONING_EFFORT=high`, Opus 4.8 doesn't
+support this via the BeeAI framework yet so we used the default. The harness runs


I'm a bit confused, the default for 4.6 is high. Is the default for 4.8 different?

To be clear, the REASONING_EFFORT variable controls whether we enable native thinking or not, but if enabled high is the default value for the effort parameter passed to Anthropic models, IIUC.

actually 4.8 can't have the reasoning effort changed because it uses adaptive reasoning; I haven't checked whether the latest BeeAI already supports this, but sadly the version we are using doesn't

also there is a bug between LiteLLM and BeeAI: LiteLLM uses vertex_ai and BeeAI uses vertexai, so it was impossible to enable adaptive thinking

the only thing that worked was commenting out setting reasoning

I actually haven't checked if latest beeai already supports this but sadly the one we are using isn't

I don't think this is BeeAI-related at all, it's a LiteLLM thing.

the only thing that worked was commenting out setting reasoning

Without reasoning_effort set native thinking should be disabled, so that would mean Opus 4.8 didn't use it, unless there is a different default (that's certainly possible).
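The behaviour discussed in this thread can be sketched as follows. This is an illustration under stated assumptions, not the harness code: the helper name `reasoning_kwargs` is hypothetical, and the sketch simplifies by forwarding the variable's value directly as the `reasoning_effort` parameter, leaving it out entirely when the variable is unset.

```python
import os
from typing import Mapping, Optional


def reasoning_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build the optional reasoning-related keyword arguments for a model call.

    Assumption: an unset or empty REASONING_EFFORT means the parameter is not
    sent at all, so the provider's own default behaviour applies.
    """
    env = os.environ if env is None else env
    effort = env.get("REASONING_EFFORT", "").strip().lower()
    if not effort:
        # Nothing is passed; this mirrors "commenting out setting reasoning".
        return {}
    return {"reasoning_effort": effort}
```

The result would be merged into the completion call's keyword arguments. Whether a model then thinks natively when the parameter is absent depends on that model's default, which is the open question raised above.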

jpodivin

Good, some points could be clarified but good.

jpodivin · 2026-06-09T08:21:23Z

+The 4.6 models ran with `REASONING_EFFORT=high`, Opus 4.8 doesn't
+support this via the BeeAI framework yet so we used the default. The harness runs


Link to an issue would be nice.

good point, let me actually find this (or report)

I doubt you'll find anything about a feature that doesn't exist upstream 😅 But it's really just a config parameter that is passed to LiteLLM, BeeAI doesn't do anything with it.

jpodivin · 2026-06-09T08:21:58Z

+
+### Takeaway
+
+For the triage workload we tested, Sonnet 4.6 offers the best price-performance


Do we have exact numbers, or at least round figures to work with?

the embedded chart shows them:

nforro · 2026-06-09T08:25:38Z

+
+The stark difference between the numbers is not just caused by the model
+evolution but also the fact how non-deterministic task this triage is. We are
+also not utilizing Opus 4.8's adaptive thinking.


Are we sure about that? Did Opus 4.8 run with native thinking disabled?

okay, let me check, something I forgot to verify

yes, there was no native thinking; Jirka had a good point about linking to an upstream issue; which I cannot find so I'm opening one

lbarcziova

I love the graphs! Just one note.

lbarcziova · 2026-06-09T08:28:30Z

+This was done on a fixed set of five RHEL triage issues using our [end-to-end
+test harness](https://github.com/packit/ai-workflows/blob/main/e2e-ci-setup.md).


what about adding one more sentence for context, explaining what RHEL triage issues are, i.e. what the triage process involves (explaining that it's a complex process with multiple decision trees)

lbarcziova · 2026-06-09T08:28:44Z

+
+### Takeaway
+
+For the triage workload we tested, Sonnet 4.6 offers the best price-performance


interesting, I didn't expect this!

exactly, surprised me quite a bit; but at the same time, I hope we'll be able to do this analysis on a much bigger scale

lbarcziova · 2026-06-09T08:29:10Z

+On the other hand, this is an evaluation harness, so we need to make a real
+judgement in our day to day work while processing real issues.
+
+None of this analysis would be possible without the incredible work of the


Signed-off-by: Tomas Tomecek <ttomecek@redhat.com> Assisted-by: Claude

centosinfra-prod-github-app · 2026-06-09T09:23:37Z

Build succeeded.
https://gateway-cloud-softwarefactory.apps.ocp.cloud.ci.centos.org/zuul/t/packit-service/buildset/34312253e3674682ab0ec1baaa040876

✔️ pre-commit SUCCESS in 1m 16s

usercont-release-bot added this to Packit pull requests Jun 9, 2026

github-project-automation Bot moved this to New in Packit pull requests Jun 9, 2026

majamassarini approved these changes Jun 9, 2026

nforro reviewed Jun 9, 2026

jpodivin approved these changes Jun 9, 2026

nforro reviewed Jun 9, 2026

lbarcziova reviewed Jun 9, 2026

new blog post: Comparing frontier Claude models

06065b4

Signed-off-by: Tomas Tomecek <ttomecek@redhat.com> Assisted-by: Claude

TomasTomecek force-pushed the compare-models branch from 3da32d5 to 06065b4 Compare June 9, 2026 09:21

lbarcziova approved these changes Jun 9, 2026

nforro approved these changes Jun 9, 2026

TomasTomecek merged commit 9f9a406 into packit:main Jun 9, 2026
4 checks passed

github-project-automation Bot moved this from New to Done in Packit pull requests Jun 9, 2026

TomasTomecek deleted the compare-models branch June 9, 2026 11:53

6 participants
