Maybe we need something similar for benchmarks, updated for today's LLMs, like:
> LLM benchmarks can be used to show what tasks a model can do, but never to show what tasks it cannot.