>> I cannot rectify this with people saying it only looks one word ahead. One
word must come next, but to do a good job modeling what that word will be,
wouldn’t you need to consider further ahead than that?
No, because you don't predict the probability of a token, you predict the
probability of a token _given_ a preceding sequence of tokens.
So, to decide whether to follow "I climbed the apple tree and picked" with "a"
or "an", you calculate the following (which are conditional probabilities;
read p(A|B) as "probability of A given B"):

P₁ = p("a" | "I climbed the apple tree and picked")
P₂ = p("an" | "I climbed the apple tree and picked")

Now, if P₁ > P₂, you generate "a", otherwise you generate "an".

Note that the sentence "I climbed the apple tree and picked" is different than
the sentence "I climbed the pear tree and picked", so the following are
different probabilities, also:

P₃ = p("a" | "I climbed the pear tree and picked")
P₄ = p("an" | "I climbed the pear tree and picked")
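The decision rule above can be written down directly. The probability table here is invented purely for illustration (a real language model estimates these values from data):

```python
# Hypothetical conditional probabilities p(token | context), made up for
# illustration; a trained model would estimate them from a corpus.
P = {
    "I climbed the apple tree and picked": {"a": 0.1, "an": 0.9},
    "I climbed the pear tree and picked":  {"a": 0.85, "an": 0.15},
}

def next_token(context):
    """Greedy decoding: emit whichever token has the higher conditional probability."""
    dist = P[context]
    return max(dist, key=dist.get)

print(next_token("I climbed the apple tree and picked"))  # an
print(next_token("I climbed the pear tree and picked"))   # a
```

The point is that the whole preceding sequence is the conditioning context, so "apple" and "pear" lead to different distributions over the very next token.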
It’s just overly reductive. LLMs do not work like the Bayesian or Markov models that fail to scale. To predict what word a human will say next past a certain accuracy, you actually have to model more than text: you have to model that person’s behavior, in a sense, or otherwise you can’t reach that accuracy. (The tree example would not require that much - I used it since I had seen it before.)
I like gwern’s[0] example of a dozen people jockeying for power at a dinner table. Part of the next token prediction includes predicting how these Machiavellian schemes play out and why. Next word prediction may give the model a license and reason to figure all that out.
>> It’s just overly reductive. LLMs do not work like Bayesian or Markov models that fail to scale.
Yes, they do. That's how language modelling works. It's the same principle whether it's a small or large language model. Scale only improves predictive accuracy, but it doesn't change what is predicted.
In fact, you don't need a large language model to model the use of a/an correctly. An n-gram model will do that just fine.
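To make that concrete, here is a sketch of such an n-gram (here, bigram) model on a deliberately tiny toy corpus:

```python
from collections import Counter, defaultdict

# Deliberately tiny toy corpus; a real model trains on far more text.
corpus = ("i climbed the apple tree and picked an apple . "
          "i climbed the pear tree and picked a pear .").split()

# Count how often each token follows each other token (bigram counts).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def p(nxt, prev):
    """Estimate the conditional probability p(nxt | prev) from the counts."""
    total = sum(follows[prev].values())
    return follows[prev][nxt] / total if total else 0.0

print(p("apple", "an"))  # 1.0 -- the corpus contains "an apple"
print(p("apple", "a"))   # 0.0 -- it never contains "a apple"
```

Even these raw bigram counts never place a vowel-initial word after "a", which is all the a/an rule amounts to at this scale.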
Edit:
See Equation 1, page 3, for the pre-training objective in the original GPT; see Equation 4, same page, for the fine-tuning objective:
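For reference, Equation 1 there is the standard autoregressive log-likelihood, L(U) = Σᵢ log P(uᵢ | uᵢ₋ₖ, …, uᵢ₋₁; Θ). A minimal sketch of that sum, with a uniform stand-in for the model (not a trained network):

```python
import math

def log_likelihood(tokens, model, k):
    """Sum of log P(u_i | u_{i-k}, ..., u_{i-1}) over the sequence."""
    total = 0.0
    for i, tok in enumerate(tokens):
        context = tuple(tokens[max(0, i - k):i])  # at most k preceding tokens
        total += math.log(model(tok, context))
    return total

def uniform_model(tok, context, vocab_size=4):
    # Stand-in: every token gets probability 1/vocab_size regardless of context.
    return 1.0 / vocab_size

print(log_likelihood(["a", "b", "c"], uniform_model, k=2))  # 3 * log(1/4)
```

Maximizing this sum over the training corpus is the entire pre-training objective: predict each token given the tokens before it, nothing more.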
Thanks for the offer, but it feels to me like you are missing the point I would like to be making. Maybe we are all just stochastic parrots in a sense? (FWIW, I do basically understand the notation).
I think the universe might be computable like Stephen Wolfram says - if this is the case - in theory someone could eventually walk you through the math and series of computations to make and then say, "see - humans are not intelligent, they are just biological machines without free will. With unlimited time and computation resources, we can run a human on a Turing machine in a virtual universe."
Even if the universe is not fully computable to the point we can perfectly simulate it on a Turing machine, we already do a pretty good job simulating lots of things, and it's not clear the loss of fidelity would make much difference. As I said, "intelligence" could be an emergent property of certain highly complex networks. I don't think the substrate of the network matters (digital vs biological) - it's just certain states may come with a feeling for the network.
Being able to explain the output of the system by math is very powerful - but it fails to capture emergent properties of the universe we would generally prefer to believe in, like consciousness.
Anyway - we see slime molds act as path optimizers; lots of algorithms we find useful have physical analogies. I don't know how the brain learns, but biologically, there must be some kind of gradient to follow and reinforce some neural connections while reducing others. Math is the language of the universe, but we believe in free will. Explaining the math of LLMs will not bring us into an agreement - this is unfortunately philosophical I think.
Thanks for sharing your perspective but (I don't guess you'll be surprised) I don't agree. Maths, and formal languages in general (which are basically maths), are how we understand any process that is beyond our immediate experience, i.e. our senses. We can't see the quantum world, for example, but we can describe it with maths and try to understand its behaviour.
So while there may or may not be emergent behaviours in LLMs and so on, we will not really know that until we've put it down in maths, pointed to the maths, and said "there, that's the emergent behaviours". Until then all we have is speculation.
And I don't like speculation. I like to know how things work. I think that's the thing done in science, also, and that this is how we have made any progress as a species. I think I'm in good company in wanting to know how things work, I mean.
I don't agree we are "just stochastic parrots" either, not in any sense. Because we can abstract, and formalise, and prove, and demonstrate. You can't do that just by predicting. That's a power over and beyond modelling data.
I don't pretend to understand that "power", and I don't expect I will in my lifetime, but that's OK.
> .. we can abstract, and formalise, and prove, and demonstrate. You can't do that just by predicting. That's a power over and beyond modelling data.
I think we will find this not to be the case. And of course, many people like me do think that the ability to predict well enough implies the ability to do everything else.
Since I wrote my reply, I found two other posts on HN that seem relevant to that if you want to read more:
Here is a quote from that last one - I am still reading it, since it is long:
> ... I call this the prediction orthogonality thesis: A model whose objective is prediction can simulate agents who optimize toward any objectives, with any degree of optimality (bounded above but not below by the model’s power).
> This is a corollary of the classical orthogonality thesis, which states that agents can have any combination of intelligence level and goal, combined with the assumption that agents can in principle be predicted. A single predictive model may also predict multiple agents, either independently (e.g. in different conditions), or interacting in a multi-agent simulation. A more optimal predictor is not restricted to predicting more optimal agents: being smarter does not make you unable to predict stupid systems, nor things that aren’t agentic like the weather.
>> I think we will find this not to be the case. And of course, many people
like me do think that the ability to predict well enough implies the ability
to do everything else.
That's a debate that's been going on for a long time. For a while I've kind of
waded into it unbeknownst to me, because it's mainly a thing in the philosophy
of science and I have no background in that sort of thing. So far I've only got
a whiff of a much larger discussion, when I watched Lex Fridman's interview with
Vladimir Vapnik. Here's a link:
Vapnik basically says there are two groups in science: the instrumentalists,
who are happy to build models that are only predictive, and the realists, who
want to build explanatory models.
To clarify, a predictive model is one that can only predict future
observations, based on past observations, possibly with high accuracy. An
explanatory model is one that not only predicts, but also explains past and
future observations, according to some pre-existing scientific theory.
For me, it makes sense that explanatory models are more powerful, by
definition. An explanatory model is also predictive, but a predictive model is
not explanatory. And once an explanatory model is found, once we understand
why things turn out the way they do, our ability to predict also improves,
tremendously so.
My favourite example of this is the epicyclical model of astronomy, which
dominated for a couple of thousand years. Literally. It went on from classical
Greece and Rome, all the way to Copernicus, who may have taken the Earth out of
the center of the universe, but still kept the planets on perfect circular
orbits with epicycles. Epicycles persisted for so long because they were damn good at
predicting future observations, but they had no explanatory power and, as it
turned out, the whole theory was mere overfitting. It took Kepler, with his
laws of planetary motion, to finally explain what was going on. And then of
course, Newton waltzed in with his law of universal gravitation, and explained
Kepler's laws as a consequence of that law. I guess I don't have to wax
lyrical about Newton and why his explanatory, and not simply predictive,
theory, changed everything, astronomy being just one science that was swept
away in the epochal wave.
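The "overfitting" point about epicycles can be made concrete with a toy numerical analogy (nothing to do with real orbital mechanics): a model with enough free parameters fits every past observation exactly, yet extrapolates badly, while a simpler law generalizes.

```python
# Toy analogy for epicycles vs Kepler: a flexible model can fit every past
# observation exactly and still extrapolate badly; a simple "law" generalizes.

def true_law(x):
    return 3 * x + 1  # the underlying "law" generating the data

xs = [0, 1, 2, 3, 4, 5]
ys = [true_law(x) + 0.5 * (-1) ** x for x in xs]  # observations with small noise

def interpolate(x):
    """Degree-5 Lagrange polynomial through all six points: zero 'training' error."""
    total = 0.0
    for i in range(len(xs)):
        term = ys[i]
        for j in range(len(xs)):
            if j != i:
                term *= (x - xs[j]) / (xs[i] - xs[j])
        total += term
    return total

def fitted_line(x):
    """Least-squares straight line: the simple 'law'."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return my + slope * (x - mx)

# Both fit the past observations well...
print(max(abs(interpolate(x) - y) for x, y in zip(xs, ys)))  # ~0.0
# ...but only the simple law predicts the future:
print(abs(interpolate(8) - true_law(8)))   # hundreds off
print(abs(fitted_line(8) - true_law(8)))   # well under 1
```

The wiggly interpolant is the "epicycles": perfect hindsight, terrible foresight; the straight line is the "law" that actually predicts.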
So, no, I don't agree: prediction is what you do until you have an
explanation. It's not the final goal, and it's certainly not enough. The only
thing that will ever be enough is to figure out how the world works.
It’s an interesting debate to bring up, but I think not really the same kind of prediction without explanation I am talking about.
The second post I linked turned out to be really interesting; it aligns with my thoughts while also adding new ideas and concepts. It makes a distinction between GPT and the agents it simulates: the simulator and the simulacra.
A good enough simulator can simulate an entity capable of explaining lots of things, depending on the limits of the simulator.
So you'll get a different token generated depending on whether P₃ > P₄ or not. In short, yeah, you can choose whether it's going to be "a" or "an" according to the context of the sentence so far.

But even in cases where you can't do that, when you don't have the context of "pear" or "apple", what you _can_ do is generate this string:

"I climbed the pear tree and picked a"

And then calculate the following probabilities:

P₅ = p("ball" | "I climbed the pear tree and picked a")
P₆ = p("abacus" | "I climbed the pear tree and picked a")

And if P₅ > P₆ you'll generate "ball", otherwise generate "abacus". So you don't have to know what's coming next, just what you've generated so far.

Does that help?