I think this is a really interesting paper from Cohere, it really feels that at ...

AstroBen · 2025-04-30T13:09:40 1746018580

Any tips on coming up with good private evals?

pongogogo · 2025-04-30T13:28:50 1746019730

Yes, I wrote something up here on how Andrei Kaparthy evaluated grok 3 -> https://tomhipwell.co/blog/karpathy_s_vibes_check/

I would pick one of two parts of that analysis that are most relevant to you and zoom in. I'd choose something difficult that the model fails at, then look carefully at how the model failures change as you test different model generations.

ilrwbwrkhv · 2025-04-30T12:58:04 1746017884

Yup in my private evals I have repeatedly found that DeepSeek has the best models for everything and yet in a lot of these public ones it always seems like someone else is on the top. I don't know why.

__alexs · 2025-04-30T17:05:21 1746032721

Publishing them might help you find out.

refulgentis · 2025-04-30T17:54:33 1746035673

^ This.

If I had to hazard a guess, as a poor soul doomed to maintain several closed and open source models acting agentically, I think you are hyper focused on chat trivia use cases (DeepSeek has a very, very, hard time tool calling and they say as much themselves in their API docs)