garbage benchmark, inconsistent mix of "agent tools" and models. if you wanted t...

paradite · 2025-08-22T05:00:28 1755838828

Hey. I like your roast on benchmarks.

I also publish my own evals on new models (using coding tasks that I curated myself, without tools, rated by human with rubrics). Would love you to check out and give your thoughts:

Example recent one on GPT-5:

https://eval.16x.engineer/blog/gpt-5-coding-evaluation-under...

All results:

https://eval.16x.engineer/evals/coding

jstummbillig · 2025-08-22T18:56:11 1755888971

Which benchmarks are not garbage?

I don't consider myself super special. I think it should be doable to create a benchmark that beats me having to test every single new model.