
> Building your own evaluations makes sense if you're serving an LLM up to customers and want to know how it performs, but if you are the user... use it and see how it goes. It's all subjective anyway.

I'd really caution against this approach, mainly because humans suck at removing emotions and other "human" factors when judging how well something works, but also because comparing across models gets a lot easier when you can see 77/100 vs 91/100 as a score over the tasks you actually use the LLMs for. Just don't share this benchmark publicly once you're using it for measurements.
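To make the "score over your own tasks" idea concrete, here's a minimal sketch of a personal eval harness. All names are hypothetical, and `ask_model` is a stand-in for whatever call you make to your LLM provider; a fake model is used below so the sketch runs on its own.

```python
# Minimal sketch of a personal LLM eval harness (all names hypothetical).
# `ask_model` stands in for whatever function calls your LLM API and
# returns the reply as a string.

def score_model(ask_model, tasks):
    """Run each task's prompt through the model and check the reply.

    `tasks` is a list of (prompt, check) pairs, where `check` is a
    predicate over the model's reply. Returns the percent of tasks
    passed, so models are directly comparable as e.g. 77 vs 91.
    """
    passed = sum(1 for prompt, check in tasks if check(ask_model(prompt)))
    return round(100 * passed / len(tasks))

# Self-contained example with a fake "model" in place of a real API call:
tasks = [
    ("What is 2 + 2?", lambda reply: "4" in reply),
    ("Name a prime number below 10.", lambda reply: any(p in reply for p in "2357")),
]
fake_model = lambda prompt: "4" if "2 + 2" in prompt else "7"
print(score_model(fake_model, tasks))
```

In practice you'd keep the task list private (per the comment above) and swap `fake_model` for a thin wrapper around each model you want to compare.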



So what? I'm the one that's using it, I happen to be a human, my human factor is the only one that matters.

At this point, anyone using these LLMs every day has seen those benchmark numbers go up without an appreciable improvement in the day-to-day experience.


> So what? I'm the one that's using it, I happen to be a human, my human factor is the only one that matters.

Yeah, you're right: if consistency isn't important to you as a human, then it doesn't matter. Personally, I don't trust my "humanness", and correctness is the most important thing for me when working with LLMs, so that's what my benchmarks focus on.

> At this point anyone using these LLMs every day have seen those benchmark numbers go up without an appreciable improvement in the day to day experience.

Yes, this is exactly my point. The benchmarks from the makers of these LLMs always seem to show better and better scores, yet the top scores in my own benchmarks have been more or less the same for the last 1.5 years, and I'm trying every LLM I come across. The "best LLM to date!" hardly ever actually is the best available LLM, and while you could reach that judgement by just playing around with LLMs, actually being able to point to specifically why that is, is something at least I find useful. YMMV.



