Agreed. I don't like it when the prompt sets up a good portion of how to go about finding the answer by saying which tools to use and how. The LLM needs to decide when and how to use them, not the prompt.
I don't think it should be completely open-ended. I mean, you could have an "ask_hooman" tool that solves a ton of problems with current LLMs. But that wouldn't mean the LLM itself is capable at what the benchmark measures.
Why not? One of the most intelligent things to do when stuck on a problem is to get outside help.
If allowing this behaviour causes a problem, you can always add constraints to the benchmark, such as "the final answer must arrive within 15 seconds" or something. The LLM can then decide whether to ask around, weighing the time risk.
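A rough sketch of what I mean, with the harness, scoring, and 15-second budget all invented for illustration:

    import time

    TIME_LIMIT_S = 15  # illustrative budget, not from any real benchmark

    def score(question, reference, run_model):
        # run_model is a stand-in for whatever model/tool loop is being tested;
        # it may reason on its own, call tools, or go ask a human.
        start = time.monotonic()
        answer = run_model(question)
        elapsed = time.monotonic() - start
        if elapsed > TIME_LIMIT_S:
            return 0.0  # late answers score zero, however correct
        return 1.0 if answer == reference else 0.0

That way asking for outside help is allowed, but it only pays off when the model judges the round trip is worth the time risk.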
Because AIs are good at degenerating to whatever maximizes the score, regardless of test intent. For most problems "ask_hooman", or especially the plural, would be much more effective. So the degenerate case would dominate and tell you precisely zero about the intelligence of the AI. If a specific "tool" is more adept than the "AI", then "choose tool" will always be the correct answer. But I agree, a tight time constraint would help.
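A toy illustration of the collapse, with invented success rates:

    # If one tool dominates expected score on nearly every problem, a
    # score-maximizing policy always picks it, and the benchmark stops
    # telling you anything about the model itself.
    expected_success = {
        "reason_unaided": 0.55,
        "run_python": 0.70,
        "ask_hooman": 0.95,  # made-up numbers
    }

    best_tool = max(expected_success, key=expected_success.get)
    print(best_tool)  # "ask_hooman", every time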
On some level this makes sense, but on the other hand LLMs already have perfect recall of thousands of symbols built into them, which is what pencil and paper gives to a human test taker.
If you're not doing clever hacks for very long windows, I thought the basic design just feeds in the entire window, and it's up to the weights to use it properly.
I'm not addressing an argument, just noting that this is already a form of LLM testing done today, for anyone who wants to look at the difference in results, the same as in the human analogy.
> To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.
People interested can see the results of giving LLMs pen and paper today by looking at benchmarks with tools enabled. It's an addition to what you said, not an attack on a portion of your comment :).
I see now. My focus was on the effect of LLMs’ (and by analogy, humans’) reasoning abilities argued by bee_rider. The fact that tool use can enable more reliable handling of large numbers has no bearing on that, hence I found the reply confusing.
Hmm, maybe it depends on the specific test and the reasoning it exercises? I certainly think reasoning about how and when to use allowed tools, and when not to, is a big part of the reasoning and verification process. E.g. most human math exams allow pen-and-paper calculation, or even a calculator, and that can be a great way to, say, spot-check a symbolic derivative and see it needs to be revisited, without relying on the calculator/paper to do the actual reasoning for the testee. Or to see that the equation of motion for a system can't possibly be right when you plug in some test values (without which I'm not sure I'd have passed my mid-level physics course, haha).
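The derivative spot-check, for example, is just plugging in a few numbers; something like this, with the function and tolerance picked arbitrarily:

    import math

    def f(x):
        return x * math.sin(x)

    def claimed_dfdx(x):
        # the symbolic derivative being checked, not re-derived by the tool
        return math.sin(x) + x * math.cos(x)

    def finite_diff(g, x, h=1e-6):
        # central-difference estimate: the "calculator" part
        return (g(x + h) - g(x - h)) / (2 * h)

    for x in (0.5, 1.0, 2.0, -3.0):
        assert abs(claimed_dfdx(x) - finite_diff(f, x)) < 1e-4, f"revisit the derivative near x={x}"

The calculator only confirms or flags the result at a few test points; the derivation itself still has to come from the testee.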
At the very least, a human's scores on such a test with and without tools would differ, so comparing them against an LLM that isn't under the analogous constraints isn't apples to apples. Which is (IMO) a useful note when comparing reasoning abilities, and why I thought it was interesting that this kind of testing is just called testing with tools on the LLM side (not sure there's an equally standard term on the human testing side? Guess the same term could be used for both, though).
At the same time, I'm sure other reasoning tests don't gain much from, or expect, tool use at all, so it wouldn't be relevant for those.