My first thought after seeing this post was that it's a real-world eval. We have been running out of evals lately (the ARC-AGI test, then the sudden jump on FrontierMath, etc.), so it's good to have real-world tests like this that show how far along we are.

