> Program testing can be used to show the presence of bugs, but never to show their absence.
>
> Edsger W. Dijkstra

Maybe we need something similar for benchmarks, updated for today's LLMs:

> LLM benchmarks can be used to show what tasks a model can do, but never to show what tasks it cannot.