I recently tried a Fermi estimation problem on a bunch of LLMs and they all failed spectacularly. It crossed too many orders of magnitude; all the zeroes muddled them up.
E.g.: the right way to work with numbers like a “trillion trillion” is to concentrate on the powers of ten, not to write the number out in full.
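To make that concrete, here's a minimal sketch (in Python, with made-up placeholder quantities, not a real estimate) of what "working in powers of ten" looks like: multiplying estimates just adds exponents, so a "trillion trillion" is 10^(12+12) = 10^24 and never has to be written out in full.

```python
import math

# Work with exponents (powers of ten) instead of full decimal expansions.
# A "trillion" is 1e12, so a "trillion trillion" is 10^(12 + 12) = 10^24.
trillion_exp = 12
trillion_trillion_exp = trillion_exp + trillion_exp  # 24

# Fermi arithmetic: multiplying estimates adds exponents, dividing
# subtracts them -- no long strings of zeroes to miscount.
# (These exponents are arbitrary placeholders for illustration.)
estimate_exps = [24, -3, 2]          # e.g. 1e24 * 1e-3 * 1e2
result_exp = sum(estimate_exps)      # 23, i.e. roughly 1e23

print(f"trillion trillion ~ 1e{trillion_trillion_exp}")
print(f"combined estimate ~ 1e{result_exp}")

# If an input isn't a clean power of ten, log10 recovers its exponent:
print(round(math.log10(3e8)))        # ~3e8 m/s -> exponent 8
```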
Predicting the next character alone can't achieve this kind of compression: the probability distribution the model learns is tied to its training corpus, and backpropagation on that objective doesn't fully learn this kind of multi-scale compression and alignment of magnitudes.