
My attempt at an explanation:

An LLM is really a latent space plus the means to navigate it. Now a latent space is an n-dimensional space in which ideas and concepts are ordered so that those that are similar to each other (for example, "house" and "mansion") are placed near each other. This placing, by the way, happens during training and is derived from the training data, so the process of training is the process of creating the latent space.

To visualize this in an intuitive way, consider various concepts arranged on a 2D grid. You would have "house" and "mansion" next to each other, and something like "growling" in a totally different corner. A real latent space -- say, the one inside GPT-4 -- is just like this, only it has hundreds or thousands of dimensions (OpenAI's text-embedding-ada-002 model, for example, produces 1536-dimensional vectors; GPT-4's own internal dimensionality isn't published), and that difference in scale is what makes it a useful ordering of so much knowledge.
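The 2D-grid intuition is easy to play with in code. Here's a minimal sketch: the coordinates below are hand-placed for illustration (a real model learns them during training), and cosine similarity stands in for "nearness" in the space.

```python
import numpy as np

# Hypothetical 2D "latent space" positions, hand-placed for illustration.
# In a real model these coordinates are learned from the training data.
embeddings = {
    "house":    np.array([0.9, 0.8]),
    "mansion":  np.array([0.8, 0.9]),
    "growling": np.array([-0.9, -0.7]),
}

def cosine_similarity(a, b):
    # Similarity of direction: close to 1.0 for nearby concepts,
    # negative for concepts pointing in opposite directions.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_house_mansion = cosine_similarity(embeddings["house"], embeddings["mansion"])
sim_house_growling = cosine_similarity(embeddings["house"], embeddings["growling"])
print(sim_house_mansion > sim_house_growling)  # prints True: "mansion" is the nearer concept
```

The same arithmetic works unchanged in 1536 dimensions; the vectors just get longer.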

To go back to reading images: the training data included images of webpages alongside their code, and that code told the training process where to place each image-code pair. In general, accompanying labels and captions let the training process place images in latent space just as it places text. So when you give GPT-4 a new image of a website and ask for the corresponding HTML, it can place that image in latent space, where suitable HTML is lying nearby. (Strictly speaking, the model generates the HTML token by token rather than looking it up, but retrieval from a neighbourhood in latent space is a reasonable mental model.)
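As a toy version of that mental model, here is a nearest-neighbour lookup over a made-up index of (image embedding, HTML) pairs. Everything here is hypothetical: the vectors are invented, and a real vision model would produce the query embedding. It only illustrates what "the HTML lying nearby" means, not how GPT-4 actually generates output.

```python
import numpy as np

# Toy index of (image embedding, HTML) pairs. The vectors are made up;
# in reality they would come from a trained vision encoder.
index = [
    (np.array([0.9, 0.1, 0.2]), "<form><input><button>Sign in</button></form>"),
    (np.array([0.1, 0.9, 0.3]), "<ul><li>Item</li><li>Item</li></ul>"),
]

def nearest_html(query_embedding):
    # Nearest-neighbour lookup by Euclidean distance: return the HTML
    # whose image sits closest to the query in the latent space.
    dists = [np.linalg.norm(query_embedding - emb) for emb, _ in index]
    return index[int(np.argmin(dists))][1]

# A hypothetical embedding of a screenshot of a login form:
query = np.array([0.85, 0.15, 0.25])
print(nearest_html(query))  # prints the login-form HTML, the closest neighbour
```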


