Maybe we need something similar for benchmarks, and updated for today's LLMs, like:
> LLM benchmarks can be used to show what tasks LLMs can do, but never to show what tasks they cannot.