
> My current (tentative) resolution of the surprise is that language encoded way more information about reality than we thought it did.

I think you are close to the mark, but you have been subtly misled: language is not the data we are working with. We are working with text.

Once you fix that particular failure of word choice, everything else becomes much clearer: text contains much more information than language.

We aren't dealing with just any text, either: that would be noise. We're training LLMs on written text.

Natural language is infamous for one specific feature: ambiguity. There are many possible ways to write something, but we can only write one. We must choose: in doing so, we record the choice itself, and all of the entropy that informed it.

That entropy is the secret sauce: the extra data that LLMs are sometimes able to model. We don't see it, because we read language, not text.

The big surprise is that LLMs aren't able to write language: they can only write text. They don't get tripped up reading ambiguity, but they can't avoid writing it, either. Who chooses what an LLM writes? Is it a mystery character who lives in a black box, or a continuation of the entropy that was encoded into the text that LLM was trained on?



There’s an exercise that some people do when learning programming, which is to write down the steps to make a sandwich. Then the teacher follows the exact instructions to make a sandwich; most people don’t include enough detail for a computer to follow (i.e. open the fridge, etc.), so the teacher runs around bumping into things. That used to be a teaching exercise to show people the amount of precision required when telling a machine what to do.

Now with LLMs, I think one of the great leaps is the idea that it’s no longer necessary to be “pedantic” when giving computers instructions, because LLMs have somehow learned to fill in the blanks with a shared “understanding” of the world similar to ours (i.e. cheese is stored in the fridge, so you have to go open the fridge to fetch the cheese for the sandwich).


I don't get the "magic" people are seeing. It makes sense.

>LLMs have somehow learned to fill in the blanks

It's not somehow, it's because they have read a ton of books, documents, etc and can make enough links between cheese and refrigerator and follow that back to know that a refrigerator needs to be opened.

I have seen a lot of very clever AI examples using the latest tools, but I haven't seen anything that seems difficult to deconstruct.


> Now with LLMs, I think one of the great leaps is the idea that it’s no longer necessary to be “pedantic” when giving computers instructions

Yes, but they also can't. They can't be pedantic or follow explicit instructions. That's the other side of the coin that isn't being presented.

They can present the right elements of the story in the right places, but they can't perform it.


It depends on the task. For certain programs you absolutely need to be pedantic in describing what needs to happen. There is a reason we don't program in natural language and that won't change with LLMs.


Don’t forget they hold vector spaces. So fridge and cheese score high together for cohesiveness, but fridge and Antarctica less so, though both have something to do with cold. Together with all the training on text, this creates a good ability to make inferences and “conclusions”. It has a net of lines of meaning across all the concepts we fed it that gives it the ability it has, without actually understanding.
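
To make "score high together" concrete: this is basically cosine similarity over embedding vectors. A minimal sketch below - the vectors and their "axes" are invented for illustration, since real embeddings have hundreds or thousands of learned dimensions with no human-readable labels.

    import math

    def cosine_similarity(a, b):
        # Cosine of the angle between two vectors: closer to 1.0 means "closer in meaning".
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    # Toy embeddings along invented axes: (food-ness, cold-ness, place-ness)
    fridge     = [0.6, 0.9, 0.2]
    cheese     = [0.9, 0.5, 0.1]
    antarctica = [0.0, 0.9, 0.9]

    print(cosine_similarity(fridge, cheese))      # ~0.89: fridge and cheese sit close together
    print(cosine_similarity(fridge, antarctica))  # ~0.71: further apart, but both lean "cold"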


> It has a net of lines of meaning across all the concepts we fed it that gives it the ability it has, without actually understanding.

I'm increasingly convinced this is what understanding fundamentally is.


That conflates perception with perceiver. LLMs have only internalized [/encoded] our perceptions and expressions. From a model of the mind pov, the 'self' that we sense has an internal LLM-like tool. And it is that self that understands and not the tool.


> That conflates perception with perceiver.

I'm not sure I understand. Can you elaborate?

> From a model of the mind pov, the 'self' that we sense has an internal LLM-like tool. And it is that self that understands and not the tool.

I'm starting to think it's the other way around. I think it's somewhat widely accepted that our brains do most of the "thinking" and "understanding" unconsciously - our conscious self is more of an observer / moderator, occasionally hand-holding the thought process when the topic of interest is hard, and one isn't yet proficient[0] in it.

Keeping that in mind, if you - like me - feel that LLMs are best compared to our "inner voice", i.e. the bit on the boundary between conscious and unconscious that uses language as an interface to the former, then it's not unreasonable to expect that LLMs may, in fact, understand things. Not emulate, but actually understand.

The whole deal with a hundred thousand dimensional latent space? I have a growing suspicion that this is exactly the fundamental principle behind how understanding, thinking in concepts, and thinking in general works for humans too. Sure, we have multiple senses feeding into our "thinking" bit, but that doesn't change much.

At a conceptual, handwavy level (I don't know the actual architecture and math details well enough to offer more concrete explanations/stories), I feel there are too many coincidences to ignore.

Is it a coincidence that someone trained an LLM and an image network, and found their independently learned latent spaces map to each other with a simple transform? Maybe[1], but this also makes sense - both networks segmented data about the same view of reality humans have. There is no reason for LLMs to have an entirely different way of representing "understanding" than img2txt or txt2img networks.
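
(To sketch what "a simple transform" could mean in practice: fit an ordinary least-squares linear map between paired embeddings from two spaces and test it on held-out pairs. The data below is synthetic by construction, so this only illustrates the method, not the result of that paper.)

    import numpy as np

    rng = np.random.default_rng(0)
    dim, n_pairs = 8, 200

    # Synthetic stand-ins for two independently learned latent spaces: here the
    # "image" space is, by construction, a linear transform of the "text" space plus noise.
    true_map = rng.normal(size=(dim, dim))
    text_emb = rng.normal(size=(n_pairs, dim))
    image_emb = text_emb @ true_map + 0.05 * rng.normal(size=(n_pairs, dim))

    # Fit a single linear transform W on half the pairs (ordinary least squares).
    train, test = slice(0, 100), slice(100, 200)
    W, *_ = np.linalg.lstsq(text_emb[train], image_emb[train], rcond=None)

    # If the spaces really are related by a simple transform, held-out pairs align too.
    pred = text_emb[test] @ W
    err = np.linalg.norm(pred - image_emb[test]) / np.linalg.norm(image_emb[test])
    print(f"relative alignment error on held-out pairs: {err:.3f}")  # small => aligned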

Assuming the above is true, is it a coincidence that it offers a decent explanation for how humans developed language? You start with an image/sound/touch/other-senses acquisition and association system forming a basic brain, predicting next sensations, driving actions. As it evolves in size and complexity, the dimensionality of its representation space grows, and at some point the associations cluster into something of a world model. Let evolution iterate some (couple hundred thousand years) more, and you end up with brains that can build more complex world models, working with more complex associations (e.g. vibration -> sound -> tone -> grunt -> phrase/song). At this level, language seems like an obvious thing - it's taking complex associations of basic sensory input and associating them wholesale with different areas of the latent space, so that e.g. a specific grunt now associates with danger, a different one with safety, etc. Once you have brains able to do that naturally, it's pretty much a straight line to a proper language.

Yes, this probably comes across as a lot of hand-waving; I don't have the underlying insights properly sorted yet. But a core observation I want to communicate, and recommend people ponder, is continuity. This process gains capabilities in a continuous fashion as it scales - which is exactly the kind of system you'd expect evolution to lock onto.

--

[0] - What is "proficiency" anyway? To me, being proficient in a field of interest is mostly about... shifting understanding of that field to the unconscious level as much as possible.

[1] - This was one paper I am aware of; they probably didn't do a good enough control, so it might turn out to be happenstance.


[I may have to take you up on your profile offer of out of band continuation of this as there is a lot here to delve into and it would make for interesting conversation.]

The model of the psyche that I subscribe to is ~Jungian, with some minor modifications. I distinguish between the un-conscious, the sub-conscious, and consciousness. The content of the unconscious is atemporal, whereas the content of the (sub-)conscious is temporal. In this model, background processing occurs in the sub-conscious, -not- the un-conscious. The unconscious is a space of ~types which become reified in the temporal regime of (sub-)consciousness [via the process of projection].

The absolute center of the psyche is the Self, and this resides in the unconscious; the Self and the unconscious content are not directly accessible to us (but can be approached via contemplation, meditation, prayer, dreams, and visions: these processes introduce unconscious content into the conscious realm, which, when successfully integrated, engenders 'psychological wholeness'). The ego -- the ("suffering") observer -- is the central point of consciousness. Self-realization occurs when the ego assumes a subordinate position to the Self, abandons "attachment" to perceived phenomena and disavows "lordship", i.e. the false assumption of its central position, at which point the suffering ends.

This process, in various guises, is the core of most spiritual schools. And we cannot discount these aspects of human mental experience, even if we choose to assume a critical distance from the theologies that are built around these widely reported phenomena. I am not claiming that this is a quality of all minds, but it seems it is characteristic of human minds.

The absolute minimum point that you should take away from this (even if the above model is unappealing or unacceptable or woo to you /g) is that we can always meaningfully speak of a psychology when considering minds. If we cannot discern a psychology in the subject of our inquiry, then it should not be considered a mind.

I do -not- think that we can attribute a psychology to large language models.

~

Your comment on the mapping of the latent spaces is interesting, but as you note we should probably wait until this has been established before jumping to conclusions.

And please excuse the handwavy material in my comment as well. We're all groping in the semidarkness here.


Yeah, I guess you could see it that way. Object <> Symbolism (aka words, thoughts, concepts, art) <> Meaning. Meaning is knowing how an object relates to others. Language is a kind of information web already, where each word is a hyperlink into meaning.


This is a new idea that I had (or at least consciously noticed) for the first time a few days ago, but - I really don't think the meaning is in words. The words/terms themselves are more like information-free[0] points. The meaning is entirely determined by links. This works, because the links eventually lead you to ground truth - sensory inputs.


Even then, you can see some of the "pedantic" cases when it comes to actually understanding the nature of the connections between those concepts. For example, it's very easy to get it to reverse shorter/taller or younger/older in clearly defined relationships.


One experiment that I would love to see is an LLM-like model for audio. Feed it hours and hours of lectures, sound effects, animal calls, music etc. You would be able to talk to it and it would ingest the raw waveform then produce audio as a response. Would it learn the fundamentals of music theory? Would it learn to produce "the sound of a bowling ball hitting a dozen windchimes?" Would it learn to talk in English and communicate with whales? We've already done text and images, now someone please do sound!


Uhhh...this is out there, from like a dozen different groups. Not going to do a full Googling for you on my phone because it's literally everywhere but "LLM for audio" gives https://ai.googleblog.com/2022/10/audiolm-language-modeling-... as the first result...some of this stuff is already really impressive.


> Would it learn the fundamentals of music theory?

No, but you might convince yourself it did.

It would map the patterns that exist in its training set. It would then follow those patterns. The result would look like a human understanding music theory, but it would not be that.

It would be stumbling around exactly the domain we gave it: impressive because that domain is not noise, it's good data. It still wouldn't be able to find its way around, only stumble.


> The result would look like a human understanding music theory, but it would not be that.

The question then becomes: what is understanding? Is what a human does any different from what this LLM is doing?


Objectivity.

A human can do something with the model. An LLM can only present the model to you.


Not sure how that's any different than a model doing something with another model, as in AutoGPT. What part is objective? A model can be wrong just like a human can be wrong or spread falsehoods too.


A model can't be right or wrong, because it doesn't actually make any logical decisions.

These are categorizations that we make after the fact. If the model could do the same categorization work, then it could actively choose correct over incorrect.


Models could potentially make logical decisions too, if we connect them to something like a classical computer or a rules engine. I don't see any fundamental barriers to making models and computers in general similar to humans' way of understanding and reasoning too.


The idea that text contains more information than language is fascinating, wow. Thank you!


I don't really understand your distinction between language and text, but it sounds intriguing. Would you be able to give more detail? I searched but couldn't find anything that seemed to explain it.


Text is an instance of language. Think of it as the difference between the Python language and a large collection of Python programs. The language describes syntactic and semantic rules; the collection is a sampling of possible programs that encodes a significant amount of information about the world. You could learn a lot about the laws of nature, the internet, even human society and laws by examining all the Python programs ever written.

An extreme version of the same idea is the difference between understanding DNA vs the genome of every individual organism that has lived on earth. The species record encodes a ton of information about the laws of nature and the composition and history of our planet. From that information you could deduce physical laws and constants, wars and natural disasters, economic performance, historical natural boundaries, the industrial revolution and a lot more.


This is a fascinating point.

Let me see if I can play this back.

If a student studies DNA sequencing, they’ll learn about the compounds that make up DNA, how traits get encoded, etc.

Therefore the student might expect an AI trained on people’s DNA to be able to tell you about whether certain traits are more prevalent in one geography or the other.

However, since DNA responds to changes in environment, the AI would start to see time, population, and geography-based patterns emerge.

The AI for example could infer that a given person in the US who’s settled in NYC had ancestors from a given region of the world who left due to an environmental disaster just by looking at a given DNA sequence.

To the student this result would look like magic. But in the end, it’s a result of individuals’ DNA having much more information encoded in it than just human traits.


text and language intersect. in some ways, text is a superset of language, mostly due to social, or what is also called pragmatic, factors that complement semantics. also, the semantics/syntax interface is anything but clear cut, at least in natural human languages.


That relationship seems backwards to me...

Any text corpus is a subset of the language, under the normal definition that a language is the set of all possible sentences (or a set of rules to recognize or generate that set of possibilities). This text subset has an intrinsic bias as to which sentences were selected to represent real language use, which would be significant as a training set for an ML model.

So, perhaps you are saying that the text corpus carries more "world" information than the language, because of the implications you can draw from this selection process? The full language tells us how to encode meaning into sentences, but not what sentences are important to a population who uses language to describe their world. So, if we took a fuzz-tester and randomly generated possible texts to train a large language model, we would no longer expect it to predict use by an actual population. It would probably be more like a Markov chain model, generating bizarre gibberish that merely has valid syntax.

And this also seems to apply if you train the model on a selection from one population but then try to use the model to predict a different population. Wouldn't it be progressively less able to predict usage as the populations have less overlap in their own biased use of language?
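
(For the Markov chain comparison: a minimal word-bigram sketch, with a tiny made-up corpus. It strings together locally plausible words with no world model behind them, which is roughly the "valid syntax, bizarre gibberish" failure mode.)

    import random
    from collections import defaultdict

    # Tiny made-up corpus; a real model would see billions of words.
    corpus = ("the cheese is in the fridge . the fridge is cold . "
              "antarctica is cold . the sandwich needs cheese .").split()

    # Count which word follows which (bigram transitions).
    transitions = defaultdict(list)
    for current, nxt in zip(corpus, corpus[1:]):
        transitions[current].append(nxt)

    def generate(start="the", length=12):
        word, out = start, [start]
        for _ in range(length):
            followers = transitions.get(word)
            if not followers:
                break
            word = random.choice(followers)
            out.append(word)
        return " ".join(out)

    print(generate())  # locally plausible word sequences, no global meaning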


regarding the relationship: yes, and in most ways it probably is a subset. is there really such a set of rules that generates all possible sentences? in any case i wanted to say that materiality and cultural activity heavily influence what can and will be put into text, and that is not strictly language. "selection process" might capture some of it, though i'm not sure whether all of it!


I think about this as shape and color. No one ever saw a shape that wasn’t colored and likewise there are no colored things that do not have a shape. Also, displaying text without a font is not possible. Text is the surface of the ocean where waves emerge, and while they have their own properties and may seem to naively have agency, they are an expression of the underlying ocean.


nicely put! many aspects of text at least historically have much to do with its materiality (also in a cognitive development sense, learning how to write etc.). what we can think about nowadays is that text and speech might not be a necessary materiality of language. language might depend more on conceptual systems. more like a substrate of intelligence and that might as well be nonhuman (to stay on topic).


Not the poster, but for me it comes down to a mix of clarity and permanence.

I teach both verbally (interactive question/answer) and I've also written text books.

Verbal language is "loose". I'll say class when I mean object, unicode when I mean utf-8, and so on. Sentences are not all well formed, and sometimes change mid-thought. It's very "real time".

Writing is a lot more deliberate. I have to be sure of each fact I state. I often re-test things I'm only 95% sure about. I edit, restructure, remove, add, until I'm happy.

Of course all communication falls on a spectrum. Think phone call at one end, textbook on the other. When I do a verbal lecture I'm usually careful with my speech, and when I post on hacker-news I'm less rigorous.

Language covers all of it. Text skews to the more deliberate side. Cunningly the language models are trained using (mostly) text, not speech. That will have an impact on them.


from a linguistic standpoint a text is a whole lot more than language: it is an externalisation of thought that is fixed onto a medium using writing utensils and, most of all, cultural norms, in the form of a wild variety of different genres and forms of text, ranging from something like a stream of consciousness to something like a speech act. furthermore, text can be conceptually written or spoken, and with the internet we got an explosion of text that is conceptually spoken. those are the things OP might be referring to in regards to the "entropy" that encodes much more than just the tokens themselves.


Spoken language has pitch, stress, mood, etc. Written text contains some of that, but not all.

Text, on the other hand, can be presented in lists or tables, with varied formatting and indentation. You can't reproduce those in speech.


This has Embassytown vibes and I don't like it.



