Useful benchmark. I've noticed o3-high hallucinates too often for such a good model, but it is usually great with search. In my experience, Claude Opus and Sonnet 4 consistently lie, cheat, and try to hide their tracks. Maybe they are good at writing code, but I don't trust them with other things.

