Add blog post: Comparing frontier Claude models (June 2026) #1144
Build succeeded. ✔️ pre-commit SUCCESS in 1m 48s
| This was done on a fixed set of five RHEL triage issues using our [end-to-end | ||
| test harness](https://github.com/packit/ai-workflows/blob/main/e2e-ci-setup.md). | ||
| The 4.6 models ran with `REASONING_EFFORT=high`, Opus 4.8 doesn't | ||
| support this via the BeeAI framework yet so we used the default. The harness runs |
I'm a bit confused, the default for 4.6 is high. Is the default for 4.8 different?
To be clear, the `REASONING_EFFORT` variable controls whether we enable native thinking or not; if it is enabled, `high` is the default value for the effort parameter passed to Anthropic models, IIUC.
Actually, 4.8 can't have its reasoning effort changed because it uses adaptive reasoning. I actually haven't checked if the latest BeeAI already supports this, but sadly the one we are using doesn't.
Also, there is a naming mismatch between LiteLLM and BeeAI: LiteLLM uses `vertex_ai` while BeeAI uses `vertexai`, so it was impossible to enable adaptive thinking.
The only thing that worked was commenting out setting reasoning.
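The naming mismatch described in this comment can be illustrated with a small sketch. The alias table and the `normalize_model_id` helper are hypothetical and are not part of LiteLLM or BeeAI; the only detail taken from the comment is that one side spells the provider `vertex_ai` and the other `vertexai`.

```python
# Hypothetical helper for illustration only: neither LiteLLM nor BeeAI ships this.
# Assumption: model identifiers have the form "provider/model-name".
PROVIDER_ALIASES = {"vertexai": "vertex_ai"}


def normalize_model_id(model_id: str) -> str:
    """Rewrite the provider prefix to the spelling listed in PROVIDER_ALIASES."""
    provider, sep, model = model_id.partition("/")
    if not sep:
        # No provider prefix present, so there is nothing to normalize.
        return model_id
    return f"{PROVIDER_ALIASES.get(provider, provider)}/{model}"
```

A shim like this would only paper over the mismatch locally; a fix in whichever project owns the provider name is likely the more durable option.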
> I actually haven't checked if the latest BeeAI already supports this, but sadly the one we are using doesn't

I don't think this is BeeAI-related at all; it's a LiteLLM thing.

> the only thing that worked was commenting out setting reasoning

Without `reasoning_effort` set, native thinking should be disabled, so that would mean Opus 4.8 didn't use it, unless there is a different default (that's certainly possible).
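The behaviour discussed in this thread can be sketched roughly as follows. This is a simplified illustration, not the harness's actual code: the `reasoning_kwargs` helper is hypothetical, and it assumes that an unset `REASONING_EFFORT` means no reasoning-related argument is sent at all. Whether the model then applies a default of its own is exactly the open question raised above.

```python
import os


def reasoning_kwargs(env=None) -> dict:
    """Build the optional reasoning arguments for a chat-completion call.

    Assumption: an unset or empty REASONING_EFFORT disables native thinking,
    so the returned dict is empty and nothing reasoning-related is passed on.
    """
    env = os.environ if env is None else env
    effort = env.get("REASONING_EFFORT", "").strip().lower()
    if not effort:
        return {}
    # The key name follows the `reasoning_effort` parameter mentioned in the thread.
    return {"reasoning_effort": effort}
```

Commenting out the code that sets reasoning corresponds to the empty-dict branch here.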
jpodivin left a comment
Good, some points could be clarified but good.
| The 4.6 models ran with `REASONING_EFFORT=high`, Opus 4.8 doesn't | ||
| support this via the BeeAI framework yet so we used the default. The harness runs |
good point, let me actually find this (or report)
I doubt you'll find anything about a feature that doesn't exist upstream 😅 But it's really just a config parameter that is passed to LiteLLM, BeeAI doesn't do anything with it.
| ### Takeaway | ||
| For the triage workload we tested, Sonnet 4.6 offers the best price-performance |
Do we have exact numbers, or at least round figures to work with?
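One way to make "price-performance" concrete is a cost-per-successful-triage figure. The metric and the `cost_per_success` helper below are assumptions for illustration; the post under review does not state which metric it used, and no measured values are implied.

```python
def cost_per_success(total_cost_usd: float, successes: int) -> float:
    """Return the spend per successfully triaged issue (hypothetical metric).

    A model with zero successes gets an infinite cost, so it always
    ranks last when models are sorted by this value.
    """
    if successes <= 0:
        return float("inf")
    return total_cost_usd / successes
```

Reporting this per model, alongside the raw success counts, would give readers round figures to compare.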
| The stark difference between the numbers is not just caused by the model | ||
| evolution but also the fact how non-deterministic task this triage is. We are | ||
| also not utilizing Opus 4.8's adaptive thinking. |
Are we sure about that? Did Opus 4.8 run with native thinking disabled?
okay, let me check, something I forgot to verify
Yes, there was no native thinking. Jirka had a good point about linking to an upstream issue; I cannot find one, so I'm opening one.
lbarcziova left a comment
I love the graphs! Just one note.
| This was done on a fixed set of five RHEL triage issues using our [end-to-end | ||
| test harness](https://github.com/packit/ai-workflows/blob/main/e2e-ci-setup.md). |
What about adding one more sentence for context, explaining what RHEL triage issues are, i.e. what the triage process is about (explaining that it's a complex process with multiple decision trees)?
| ### Takeaway | ||
| For the triage workload we tested, Sonnet 4.6 offers the best price-performance |
interesting, I didn't expect this!
exactly, surprised me quite a bit; but at the same time, I hope we'll be able to do this analysis on a much bigger scale
| On the other hand, this is an evaluation harness, so we need to make a real | ||
| judgement in our day to day work while processing real issues. | ||
| None of this analysis would be possible without the incredible work of the |
Signed-off-by: Tomas Tomecek <ttomecek@redhat.com>
Assisted-by: Claude
3da32d5 to 06065b4
Build succeeded. ✔️ pre-commit SUCCESS in 1m 16s

Summary
src/components/ModelEvalCharts/ as the first reusable chart component in the repo
.mdx to support JSX chart components
Test plan
npm start)
Example:
🤖 Generated with Claude Code