Hacker News

2.5 Flash is particularly cheap and fast. I think 2.5 Pro would have gotten all the answers correct - at least it gets this one right.


I get a lot of garbage out of 2.5 Pro and Claude Sonnet and ChatGPT. There's always this "this is how you solve it", I take a close look and it's clearly broken, I point it out and it's all "you're right, this is a common issue". Okay, so why do we have to do this song and dance a million times to arrive at the actually correct answer?


Why does Flash fail to get it correct, yet come up with plausible-sounding nonsense? That means it was trained on some texts in the area.

What would make 2.5 Pro (or anything else) categorically better would be if it could say "I don't know".

There will be things that Claude 3.7 or Gemini Pro will not know, and the interpolations they come up with will not make sense.


Model accuracy goes up as you use heavier models. Higher accuracy is always preferable, and the jump from Flash to Pro is considerable.

You must rely on your own internal model in your head to verify the answers it gives.

On hallucination: it is a problem, but again, it decreases as you use heavier models.


> You must rely on your own internal model in your head to verify the answers it gives

This is what significantly reduces the utility: if it can only be trusted to answer things I already know the answer to, why would I ask it anything?


It's the same reason I find it useful to read comments on Reddit and ask people for their advice and opinions.

I have written about it here: https://news.ycombinator.com/item?id=44712300


Verification is often easier/faster than coming up with the answer from scratch.


True! Generating an answer is much harder than verifying one. I wonder if a parallel can be drawn to the P vs NP problem.
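The parallel can be made concrete with subset sum, a classic NP-complete problem (my own illustrative example, not from the thread): verifying a proposed solution is a one-line sum, while finding one naively means searching exponentially many subsets.

```python
from itertools import combinations

def verify(nums, target, candidate):
    # Verification: linear in the size of the candidate subset.
    return all(x in nums for x in candidate) and sum(candidate) == target

def solve(nums, target):
    # Generation: brute force over all 2^n subsets.
    for r in range(len(nums) + 1):
        for subset in combinations(nums, r):
            if sum(subset) == target:
                return list(subset)
    return None

nums = [3, 34, 4, 12, 5, 2]
answer = solve(nums, 9)            # exponential search
print(answer, verify(nums, 9, answer))  # cheap check of the found answer
```

Same asymmetry with LLM output: checking an answer against your own mental model is usually far cheaper than producing it yourself.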



