>> I cannot rectify this with people saying it only looks one word ahead. One
word must come next, but to do a good job modeling what that word will be,
wouldn’t you need to consider further ahead than that?
No, because you don't predict the probability of a token, you predict the
probability of a token _given_ a preceding sequence of tokens.
So, to decide whether to follow "I climbed the apple tree and picked" with "a"
or "an", you calculate the following (which are conditional probabilities;
read p(A|B) as "probability of A given B"):

P₁ = p("a" | "I climbed the apple tree and picked")
P₂ = p("an" | "I climbed the apple tree and picked")

Now, if P₁ > P₂, you generate "a", otherwise you generate "an".

Note that the sentence "I climbed the apple tree and picked" is different than
the sentence "I climbed the pear tree and picked", so the following are
different probabilities, also:

P₃ = p("a" | "I climbed the pear tree and picked")
P₄ = p("an" | "I climbed the pear tree and picked")
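The decision rule above can be written down directly. The probability table here is invented purely for illustration (a real language model estimates these values from data):

```python
# Hypothetical conditional probabilities p(token | context), made up for
# illustration; a trained model would estimate them from a corpus.
P = {
    "I climbed the apple tree and picked": {"a": 0.1, "an": 0.9},
    "I climbed the pear tree and picked":  {"a": 0.85, "an": 0.15},
}

def next_token(context):
    """Greedy decoding: emit whichever token has the higher conditional probability."""
    dist = P[context]
    return max(dist, key=dist.get)

print(next_token("I climbed the apple tree and picked"))  # an
print(next_token("I climbed the pear tree and picked"))   # a
```

The point is that the whole preceding sequence is the conditioning context, so "apple" and "pear" lead to different distributions over the very next token.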
It’s just overly reductive. LLMs do not work like the Bayesian or Markov models that fail to scale. To predict what word a human will say next past a certain accuracy, you actually have to model more than text: you have to model that person’s behavior, in a sense, or otherwise you can’t reach that accuracy. (The tree example would not require that much - I used it since I had seen it before.)
I like gwern’s[0] example of a dozen people jockeying for power at a dinner table. Part of the next token prediction includes predicting how these Machiavellian schemes play out and why. Next word prediction may give the model a license and reason to figure all that out.
>> It’s just overly reductive. LLMs do not work like Bayesian or Markov models that fail to scale.
Yes, they do. That's how language modelling works. It's the same principle whether it's a small or large language model. Scale only improves predictive accuracy, but it doesn't change what is predicted.
In fact, you don't need a large language model to model the use of a/an correctly. An n-gram model will do that just fine.
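To make that concrete, here is a sketch of such an n-gram (here, bigram) model on a deliberately tiny toy corpus:

```python
from collections import Counter, defaultdict

# Deliberately tiny toy corpus; a real model trains on far more text.
corpus = ("i climbed the apple tree and picked an apple . "
          "i climbed the pear tree and picked a pear .").split()

# Count how often each token follows each other token (bigram counts).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def p(nxt, prev):
    """Estimate the conditional probability p(nxt | prev) from the counts."""
    total = sum(follows[prev].values())
    return follows[prev][nxt] / total if total else 0.0

print(p("apple", "an"))  # 1.0 -- the corpus contains "an apple"
print(p("apple", "a"))   # 0.0 -- it never contains "a apple"
```

Even these raw bigram counts never place a vowel-initial word after "a", which is all the a/an rule amounts to at this scale.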
Edit:
See Equation 1, page 3, for the pre-training objective in the original GPT; see Equation 4, same page, for the fine-tuning objective:
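For reference, Equation 1 there is the standard autoregressive log-likelihood, L(U) = Σᵢ log P(uᵢ | uᵢ₋ₖ, …, uᵢ₋₁; Θ). A minimal sketch of that sum, with a uniform stand-in for the model (not a trained network):

```python
import math

def log_likelihood(tokens, model, k):
    """Sum of log P(u_i | u_{i-k}, ..., u_{i-1}) over the sequence."""
    total = 0.0
    for i, tok in enumerate(tokens):
        context = tuple(tokens[max(0, i - k):i])  # at most k preceding tokens
        total += math.log(model(tok, context))
    return total

def uniform_model(tok, context, vocab_size=4):
    # Stand-in: every token gets probability 1/vocab_size regardless of context.
    return 1.0 / vocab_size

print(log_likelihood(["a", "b", "c"], uniform_model, k=2))  # 3 * log(1/4)
```

Maximizing this sum over the training corpus is the entire pre-training objective: predict each token given the tokens before it, nothing more.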
Thanks for the offer, but it feels to me like you are missing the point I would like to be making. Maybe we are all just stochastic parrots in a sense? (FWIW, I do basically understand the notation).
I think the universe might be computable like Stephen Wolfram says - if this is the case - in theory someone could eventually walk you through the math and series of computations to make and then say, "see - humans are not intelligent, they are just biological machines without free will. With unlimited time and computation resources, we can run a human on a Turing machine in a virtual universe."
Even if the universe is not fully computable to the point we can perfectly simulate it on a Turing machine, we already do a pretty good job simulating lots of things, and it's not clear the loss of fidelity would make much difference. As I said, "intelligence" could be an emergent property of certain highly complex networks. I don't think the substrate of the network matters (digital vs biological) - it's just certain states may come with a feeling for the network.
Being able to explain the output of the system by math is very powerful - but it fails to capture emergent properties of the universe we would generally prefer to believe in, like consciousness.
Anyway - we see slime molds act as path optimizers; lots of algorithms we find useful have physical analogies. I don't know how the brain learns, but biologically, there must be some kind of gradient to follow and reinforce some neural connections while reducing others. Math is the language of the universe, but we believe in free will. Explaining the math of LLMs will not bring us into an agreement - this is unfortunately philosophical I think.
Thanks for sharing your perspective but (I don't guess you'll be surprised) I don't agree. Maths, and formal languages in general (which are basically maths), are how we understand any process that is beyond our immediate experience, i.e. our senses. We can't see the quantum world, for example, but we can describe it with maths and try to understand its behaviour.
So while there may or may not be emergent behaviours in LLMs and so on, we will not really know that until we've put it down in maths, pointed to the maths, and said "there, that's the emergent behaviours". Until then all we have is speculation.
And I don't like speculation. I like to know how things work. I think that's the thing done in science, also, and that this is how we have made any progress as a species. I think I'm in good company in wanting to know how things work, I mean.
I don't agree we are "just stochastic parrots" either, not in any sense. Because we can abstract, and formalise, and prove, and demonstrate. You can't do that just by predicting. That's a power over and beyond modelling data.
I don't pretend to understand that "power", and I don't expect I will in my lifetime, but that's OK.
> .. we can abstract, and formalise, and prove, and demonstrate. You can't do that just by predicting. That's a power over and beyond modelling data.
I think we will find this not to be the case. And of course, many people like me do think that the ability to predict well enough implies the ability to do everything else.
Since I wrote my reply, I found two other posts on HN that seem relevant to that if you want to read more:
Here is a quote from that last one - I am still reading it, since it is long:
> ... I call this the prediction orthogonality thesis: A model whose objective is prediction can simulate agents who optimize toward any objectives, with any degree of optimality (bounded above but not below by the model’s power).
> This is a corollary of the classical orthogonality thesis, which states that agents can have any combination of intelligence level and goal, combined with the assumption that agents can in principle be predicted. A single predictive model may also predict multiple agents, either independently (e.g. in different conditions), or interacting in a multi-agent simulation. A more optimal predictor is not restricted to predicting more optimal agents: being smarter does not make you unable to predict stupid systems, nor things that aren’t agentic like the weather.
>> I think we will find this not to be the case. And of course, many people
like me do think that the ability to predict well enough implies the ability
to do everything else.
That's a debate that's been going on for a long time. For a while I've kind of
waded into it unbeknownst to me, because it's mainly a thing in the philosophy
of science and I have no background in that sort of thing. So far I've only got
a whiff of a much larger discussion, when I watched Lex Fridman's interview with
Vladimir Vapnik. Here's a link:
Vapnik basically says there are two groups in science: the instrumentalists,
who are happy to build models that are only predictive, and the realists, who
want to build explanatory models.
To clarify, a predictive model is one that can only predict future
observations, based on past observations, possibly with high accuracy. An
explanatory model is one that not only predicts, but also explains past and
future observations, according to some pre-existing scientific theory.
For me, it makes sense that explanatory models are more powerful, by
definition. An explanatory model is also predictive, but a predictive model is
not explanatory. And once an explanatory model is found, once we understand
why things turn out the way they do, our ability to predict also improves,
tremendously so.
My favourite example of this is the epicyclical model of astronomy, which
dominated for a couple of thousand years. Literally. It went on from classical
Greece and Rome, all the way to Copernicus, who may have taken the Earth out of
the center of the universe, but still kept the planets on perfect circular
orbits with epicycles. Epicycles persisted for so long because they were damn good at
predicting future observations, but they had no explanatory power and, as it
turned out, the whole theory was mere overfitting. It took Kepler, with his
laws of planetary motion, to finally explain what was going on. And then of
course, Newton waltzed in with his law of universal gravitation, and explained
Kepler's laws as a consequence of that law. I guess I don't have to wax
lyrical about Newton and why his explanatory, and not simply predictive,
theory, changed everything, astronomy being just one science that was swept
away in the epochal wave.
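The "overfitting" point about epicycles can be made concrete with a toy numerical analogy (nothing to do with real orbital mechanics): a model with enough free parameters fits every past observation exactly, yet extrapolates badly, while a simpler law generalizes.

```python
# Toy analogy for epicycles vs Kepler: a flexible model can fit every past
# observation exactly and still extrapolate badly; a simple "law" generalizes.

def true_law(x):
    return 3 * x + 1  # the underlying "law" generating the data

xs = [0, 1, 2, 3, 4, 5]
ys = [true_law(x) + 0.5 * (-1) ** x for x in xs]  # observations with small noise

def interpolate(x):
    """Degree-5 Lagrange polynomial through all six points: zero 'training' error."""
    total = 0.0
    for i in range(len(xs)):
        term = ys[i]
        for j in range(len(xs)):
            if j != i:
                term *= (x - xs[j]) / (xs[i] - xs[j])
        total += term
    return total

def fitted_line(x):
    """Least-squares straight line: the simple 'law'."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return my + slope * (x - mx)

# Both fit the past observations well...
print(max(abs(interpolate(x) - y) for x, y in zip(xs, ys)))  # ~0.0
# ...but only the simple law predicts the future:
print(abs(interpolate(8) - true_law(8)))   # hundreds off
print(abs(fitted_line(8) - true_law(8)))   # well under 1
```

The wiggly interpolant is the "epicycles": perfect hindsight, terrible foresight; the straight line is the "law" that actually predicts.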
So, no, I don't agree: prediction is what you do until you have an
explanation. It's not the final goal, and it's certainly not enough. The only
thing that will ever be enough is to figure out how the world works.
It’s an interesting debate to bring up, but I think not really the same kind of prediction without explanation I am talking about.
The second post I linked turned out to be really interesting; it aligns with my thoughts while also adding new ideas and concepts. It makes a distinction between GPT and the agents it simulates: the simulator and the simulacra.
A good enough simulator can simulate an entity capable of explaining lots of things, depending on the limits of the simulator.
So you'll get a different token generated depending on whether P₃ > P₄ or not. In short, yeah, you can choose whether it's going to be "a" or "an" according to the context of the sentence so far.

But even in cases where you can't do that, when you don't have the context of "pear" or "apple", what you _can_ do is generate this string:

"I climbed the pear tree and picked a"

And then calculate the following probabilities:

P₅ = p("ball" | "I climbed the pear tree and picked a")
P₆ = p("abacus" | "I climbed the pear tree and picked a")

And if P₅ > P₆ you'll generate "ball", otherwise generate "abacus". So you don't have to know what's coming next, just what you've generated so far.

Does that help?