Benchmarks Tell You the Ceiling

June 27, 2026

Claude Opus 4.8 has been out about a month. It’s the most capable model Anthropic has shipped, built for long-running autonomous work. (Fable/Mythos was available for a brief moment.) Everyone posts the benchmark numbers, but nobody seems to post about what it does in production.

I run Claude Sonnet as the backbone of a personal AI system on a Raspberry Pi 4. CRM, signal monitoring, email drafting, incident management, daily briefings. A second brain. Real workloads, not demos.

And that’s the point: the backbone is Sonnet, not Opus. The workhorse isn’t the frontier model.

Every new model is genuinely better, but the bigger unlock was never the model. It’s figuring out which tasks actually belong to a frontier model versus a lightweight one, and which tasks are ready to be handed over at all.
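To make that split concrete, here is a minimal sketch of a task router. Everything in it is an assumption for illustration: the `Task` fields, the `route` function, the tier labels, and the flags on the example tasks are hypothetical, not a description of my actual setup or of any Anthropic API.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to whichever model IDs you actually run.
LIGHTWEIGHT = "lightweight"
FRONTIER = "frontier"
HUMAN = "human"


@dataclass(frozen=True)
class Task:
    name: str
    needs_long_horizon: bool   # multi-step, autonomous work
    high_stakes: bool          # a mistake is costly or hard to undo
    ready_to_hand_over: bool   # do you trust a model with it at all?


def route(task: Task) -> str:
    """Pick a tier for a task, or keep it with a human if it isn't ready."""
    if not task.ready_to_hand_over:
        return HUMAN
    if task.needs_long_horizon or task.high_stakes:
        return FRONTIER
    return LIGHTWEIGHT


# Illustrative tasks; the flags are made up for the example.
tasks = [
    Task("daily briefing", False, False, True),
    Task("email drafting", False, False, True),
    Task("multi-step incident triage", True, True, True),
    Task("sending email without review", False, True, False),
]

for t in tasks:
    print(f"{t.name}: {route(t)}")
```

The order of the checks is the point: "is this ready to hand over?" comes before "which model?", and the frontier tier is the exception rather than the default.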

Benchmarks tell you the ceiling. Production tells you where the ceiling actually matters.

So the real question: what workload in your stack would you actually trust to a frontier model today?
