
> Also the result is somewhat counterintuitive. We know that by low level of understanding, if we ask a student a hard question and he tried many times, the most accurate answer is often not the most popular one but a single answer.

This is a bad analogy.

Here’s what is actually happening, without the “common sense but wrong” framing:

- You have a set of probabilities per token.

- You perturb them: temperature rescales the distribution before sampling, flattening or sharpening it.

This is not a “bad student being asked multiple times”; it is a system sampling from a reshaped probability distribution.
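The two steps above can be sketched in a few lines. This is a minimal, generic illustration of temperature sampling, not the method from any particular paper; the logit values are made up:

```python
import math
import random

def sample_with_temperature(logits, temperature):
    """Sample a token index from temperature-scaled softmax probabilities.

    logits: raw model scores, one per token in the vocabulary.
    """
    if temperature == 0:
        # Zero temperature degenerates to greedy argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one index according to the reshaped distribution.
    return random.choices(range(len(logits)), weights=probs)[0]
```

At temperature 0 this always returns the single brightest token; at higher temperatures it draws from the whole (reshaped) distribution.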

If you want to see what a probability distribution looks like (e.g. an electron cloud), then sampling only once is the wrong way to do it.

You basically have two distributions: the first is the LLM’s learned distribution; the second is the shape you get after adding the random factor introduced by temperature.

This allows you to escape the local maxima encoded in the LLM’s distribution and find highly probable solutions that lie outside the zero-temperature sample space.
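You can see the “widening” numerically. A quick sketch (toy logits, not from any real model) showing that low temperature concentrates mass on the top token while high temperature pushes mass into the tail:

```python
import math

def softmax_temp(logits, t):
    """Softmax over logits divided by temperature t."""
    scaled = [l / t for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]
low = softmax_temp(logits, 0.5)   # sharper: top token gets ~0.84
high = softmax_temp(logits, 2.0)  # flatter: top token drops to ~0.48
```

The tail tokens that were effectively invisible at low temperature become reachable at high temperature, which is exactly what lets sampling leave the greedy solution’s neighborhood.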

If you want a better analogy, look up at a night sky full of stars. Draw a circle in the sky; that’s the LLM distribution.

The result at zero temperature will be the brightest point in that circle.

When you push the temperature up, you blur the sky randomly. Some points become brighter, some dimmer, but the radius of the circle increases.

If there is a very bright point outside the sample circle, say 10x brighter than the brightest point inside it, then repeated random samples will reliably find it.

It makes perfect sense that an expanded probability distribution sampled repeatedly could find a “good average solution” if that solution is significantly better than the best “zero temp” solution.
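The “repeated samples will find it” claim is just basic probability: if one sample hits the bright point with probability p, then k independent samples miss it with probability (1 - p)^k. A tiny sketch with made-up numbers:

```python
def hit_probability(p, k):
    """Chance that at least one of k independent samples finds an outcome
    that a single sample hits with probability p."""
    return 1 - (1 - p) ** k

# Even an outcome with only a 5% per-sample chance becomes likely
# once you draw enough samples:
p_one = hit_probability(0.05, 1)    # 0.05
p_many = hit_probability(0.05, 64)  # ≈ 0.96
```

So a solution that greedy decoding never reaches, but that high-temperature sampling hits a few percent of the time, is almost guaranteed to show up in a large enough batch of samples.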

This is the same reason we have temperature at all: widening the solution-space probability distribution lets you find better maxima. It turns out that sampling multiple times gives you more chances to find them.

This is more like "well that seems obviously like a good idea" than "somewhat counterintuitive"; it's just slow and expensive to do it.

You can also adjust the probability distribution by other existing methods, obviously. What’s surprising here is not that it works, but that it seems to work so well. Probably (and I note they did not try this in their paper) multi-sample + voting on the output from other methods would also be highly effective.
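The multi-sample + voting idea is simple to sketch. This is a generic majority-vote harness, not the paper’s implementation; `sample_fn` is a hypothetical stand-in for one full sampled LLM generation, and the toy answer distribution is made up:

```python
import random
from collections import Counter

def best_of_n_vote(sample_fn, n):
    """Draw n independent samples and return the most common answer."""
    answers = [sample_fn() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy setup: suppose the correct answer "42" comes back from any single
# sample 60% of the time, with two wrong answers splitting the rest.
random.seed(0)
sample_fn = lambda: random.choices(
    ["42", "41", "40"], weights=[0.6, 0.25, 0.15]
)[0]

# Voting over many samples recovers the majority answer with high probability.
result = best_of_n_vote(sample_fn, 101)
```

The point is that voting converts a per-sample plurality into a near-certain final answer, which is why it pairs so naturally with high-temperature sampling.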


