Hacker News

Do your benchmark results indicate any level of regression on Opus 4.6 or 4.5 since their first release?


We only have some basic time filtering (https://gertlabs.com/?days=30), but most of our samples are from the last 2 months. A proper over-time visualization is something we plan to add once we've collected more historical data.

But we did heavily resample Claude Opus 4.6 during the height of the degraded performance fiasco, and my takeaway is that API-based eval performance was... about the same. Claude Opus 4.6 was just never significantly better than 4.5.
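To make "about the same" concrete: one simple way to check whether two resampled eval runs actually differ is a permutation test on per-task scores. This is a minimal sketch, not the commenter's actual methodology, and the score lists below are hypothetical:

```python
import random

def permutation_test(a, b, n_iters=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns an approximate p-value for the hypothesis that score
    samples `a` and `b` come from the same distribution.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iters):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return hits / n_iters

# Hypothetical per-task pass rates from two eval runs of the same model
run_before = [0.82, 0.79, 0.85, 0.80, 0.83, 0.81]
run_after  = [0.80, 0.81, 0.84, 0.79, 0.82, 0.80]
print(permutation_test(run_before, run_after))
```

A large p-value here just means the two runs are indistinguishable at this sample size; it can't rule out a small regression.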

But we don't really know whether you get a different model when authenticated via OAuth on a subscription versus calling the API at usage prices. I definitely noticed performance issues recently, too, so I suspect they had more to do with subscription-only degradation and/or hastily shipped harness changes.


"but most of our samples are from the last 2 months."

There's your major issue. That's well within the brutal quantization window.




