
>“It’s a glorified word predictor” is becoming increasingly maddening to read. Do tell— how can you prove humans are any different?

One difference between humans and LLMs is that humans have a wide range of inputs and outputs beyond language. The claim that humans are word predictors is not something I would want to dispute.

The claim that humans are nothing more than word predictors is obviously wrong though. When I go to buy food, it's not because I'm predicting the words "I'm hungry". It's because I'm predicting that I'll be hungry.

For me, the most interesting question is whether the way in which language is related to our perception of the physical and social world as well as our perception of ourselves in this world is a precondition for fully understanding the meaning of language.



Then this implies that you’d maybe think differently if LLMs could have different inputs, correct?

Which they are currently doing. GPT-4 can take visual input.

I totally agree that humans are far more complex than that, but just extend your timeline further and you’ll start to see how the gap in complexity / input variety will narrow.


> Then this implies that you’d maybe think differently if LLMs could have different inputs, correct?

They will not be LLMs then, though. But some other iteration of AI. Interfacing current LLMs with APIs does not solve the fundamental issue, as it is still just language they are based on and use.


>They will not be LLMs then, though.

Multi-modal LLMs are still called LLMs because they don't "interface with APIs" to add visual, audio, touch, etc. input and output. They just encode pictures, sounds, and motor senses using the same tokens they encode text with and then feed it to the same unmodified LLM and it learns to handle those types of data just fine.

There are no APIs involved and the model is unchanged. It was designed as an LLM, the design hasn't changed, it still is an LLM, it's just had data fed to it that it can't tell from text and is running the same exact LLM inference process on it.

I can download any open source LLM right now and fine-tune it on images faster than I could train an ImageNet classifier from scratch, because of something called transfer learning. Humans transfer-learned speech after millions of generations of using other senses. That's not at all surprising or different from the way LLMs work.
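
A minimal sketch of how that kind of fine-tune can work, assuming a LLaVA-style setup (hypothetical names and dims, and a HuggingFace-style inputs_embeds interface; not any particular library's exact API): a frozen vision encoder's patch embeddings get projected into the frozen LLM's token-embedding space, so images arrive as just another stretch of "tokens".

    import torch
    import torch.nn as nn

    class VisionToTokenAdapter(nn.Module):
        """Projects a frozen vision encoder's patch embeddings into a
        frozen LLM's token-embedding space (illustrative, not a real API)."""
        def __init__(self, vision_dim: int, llm_dim: int):
            super().__init__()
            self.proj = nn.Linear(vision_dim, llm_dim)

        def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
            # (batch, patches, vision_dim) -> (batch, patches, llm_dim)
            return self.proj(patch_embeds)

    def multimodal_forward(llm, embed_tokens, adapter, image_patches, text_ids):
        image_tokens = adapter(image_patches)  # images become pseudo-tokens
        text_tokens = embed_tokens(text_ids)   # ordinary text embeddings
        inputs = torch.cat([image_tokens, text_tokens], dim=1)
        # The LLM itself is unchanged: it runs its usual next-token
        # inference over a sequence that happens to start with image tokens.
        return llm(inputs_embeds=inputs).logits

Only the small adapter needs training; the text-pretrained weights carry over unchanged, which is where the transfer-learning speedup comes from.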


But you’re talking about something they are not today, and quite likely we won’t be calling them LLMs, as the architecture is likely to change quite a lot before we reach a point where they are comparable to human capabilities.


CLIP, which powers diffusion models, creates a joint embedding space for text and images. There's a lot of active work on extending these multimodal embedding spaces to audio and video. Microsoft published a paper just a week or so ago showing that LLMs with joint embeddings trained on images can do pretty amazing things, and (iirc) with better data efficiency than a text-only model.

These things are already here; it's just a matter of when they get out of the research labs... Which is happening fast.

https://arxiv.org/abs/2302.14045
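
A minimal illustration of that joint embedding space using OpenAI's released CLIP package (the calls below match its published README, but treat the details as a sketch):

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # The image and the candidate captions land in the SAME vector space.
    image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
    texts = clip.tokenize(["a dog", "a cat", "a diagram"]).to(device)

    with torch.no_grad():
        image_emb = model.encode_image(image)
        text_emb = model.encode_text(texts)
        image_emb /= image_emb.norm(dim=-1, keepdim=True)
        text_emb /= text_emb.norm(dim=-1, keepdim=True)
        # Cosine similarity in the joint space ranks captions for the image.
        probs = (100.0 * image_emb @ text_emb.T).softmax(dim=-1)

    print(probs)  # highest probability on the best-matching caption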


Multiple so-called modalities don’t necessarily address the shortcomings; if anything, they just highlight that there are many steps, and each step has typically required significant changes to the prior architecture!


>Then this implies that you’d maybe think differently if LLMs could have different inputs, correct?

Yes, ultimately it does imply that. Probably not the current iteration of the technology, but I believe that there will one day be AIs that will close the loop so to speak.

It will require interacting with the world not just because someone gave them a command and a limited set of inputs, but because they decide to take action based on their own experience and goals.


It gets scary when AI is so advanced that it can keep getting continuous input and output through visual, audio, and even sensations like pressure and temperature in a 3D setting.


It will get scary when that happens _and_ it has continuous learning and better short-term memory :) Right now the models are all quite static.


> One difference between humans and LLMs is that humans have a wide range of inputs and outputs beyond language.

I share the ability to move around and feel pain with apes and cats.

What I'm interested in is the ability to "reason": analyze, synthesize knowledge, formulate plans, etc.

And LLMs have demonstrated those abilities.

As for movement and so on, please check PaLM-E and Gato. It's already done, it's boring.

> it's not because I'm predicting the words "I'm hungry". It's because I'm predicting that I'll be hungry.

The way LLM-based AI is implemented gives us the ability to separate the feeling part from the reasoning part. It's possible to integrate them into one acting entity, as was demonstrated in SayCan and PaLM-E. Does your understanding of the constituent parts make it inferior?
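
Roughly how SayCan wires those two parts together, as an illustrative sketch (not the authors' code; llm_score and affordance are hypothetical stand-ins for the paper's language-model likelihood and learned value function):

    def saycan_step(instruction, state, skills, llm_score, affordance):
        # llm_score(instruction, skill): how useful the skill sounds for
        #   the instruction, according to the LLM (the "reasoning" part).
        # affordance(state, skill): how likely the skill is to succeed
        #   from the current physical state (the grounded "feeling" part).
        best_skill, best_value = None, float("-inf")
        for skill in skills:
            value = llm_score(instruction, skill) * affordance(state, skill)
            if value > best_value:
                best_skill, best_value = skill, value
        return best_skill

    # e.g. skills = ["pick up the sponge", "go to the counter", "done"]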

E.g. ancient people thought that emotions were processed in the heart or stomach. Now that we know that emotions are processed mostly in the brain, are we less human?


>What I'm interested in is the ability to "reason": analyze, synthesize knowledge, formulate plans, etc. And LLMs have demonstrated those abilities.

I disagree that they have demonstrated that. In my interactions with them, I have often found that they correct themselves when I push back, only to say something that logically implies exactly the same incorrect claim.

They have no model of the subject they're talking about and therefore they don't understand when they are missing information that is required to draw the right conclusions. They are incapable of asking goal-driven questions to fill those gaps.

They can only mimic reasoning in areas where the sequence of reasoning steps has been verbalised many times over, such as with simple maths examples or logic puzzles that have been endlessly repeated online.


> I share the ability to move around and feel pain with apes and cats.

> What I'm interested in is the ability to "reason": analyze, synthesize knowledge, formulate plans, etc.

It's great that you are interested in that specific aspect. Many of us are. However, ignoring the far greater richness of human and animal existence doesn't give any more weight to the argument that humans are "just word predictors".


> I share the ability to move around and feel pain with apes and cats.

You share the ability to predict words with LLMs.

Something being able to do [a subset of things another thing can do] does not make them the same thing.


But maybe the "I'm hungry" inner monologue is just word prediction, and this could be the most important thing about being human. Transforming some digestive nerve stimulus into a trigger (prompt?) for those words might not be important.


> One difference between humans and LLMs is that humans have a wide range of inputs and outputs beyond language.

So do Bing and multimodal models.

> The claim that humans are word predictors is not something I would want to dispute.

We have forward predictive models in our brains; see David Eagleman.

> The claim that humans are nothing more than word predictors is obviously wrong though. When I go to buy food, it's not because I'm predicting the words "I'm hungry". It's because I'm predicting that I'll be hungry.

Your forward predictive model is doing just that, but that's not the only model and circuit that's operating in the background. Our brains are ensembles of all sorts of different circuits with their own desires and goals, be it short or long term.

It doesn't mean the models are any different when they make predictions. In fact, any NN with N outputs is an "ensemble" of N predictors, dependent on each other, but still an ensemble of predictors. It just so happens that these predictors predict tokens, but that's only because that is the medium.
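
To make that concrete, a small PyTorch sketch: a language model's output head emits one logit per vocabulary item, so each output unit is effectively its own predictor, coupled to the others through the shared hidden state and the softmax normalization.

    import torch
    import torch.nn as nn

    vocab_size, hidden_dim = 50_000, 768
    head = nn.Linear(hidden_dim, vocab_size)  # N = vocab_size "predictors"

    hidden = torch.randn(1, hidden_dim)       # shared representation
    logits = head(hidden)                     # one score per token...
    probs = logits.softmax(dim=-1)            # ...coupled by normalization
    assert torch.isclose(probs.sum(), torch.tensor(1.0))
    # Raising one token's probability necessarily lowers the others':
    # a dependent "ensemble" of N predictors, as described above.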

> fully understanding the meaning of language.

What does "fully" mean? It is well established that we all have different representations of language and the different tokens in our heads, with vastly different associations.


>So do Bing and multimodal models.

I'm not talking about getting fed pictures and videos. I'm talking about interacting with others in the physical world, having social relations, developing goals and interests, taking the initiative, perceiving how the world responds to all of that.

>What does "fully" mean?

Being able to draw conclusions that are not possible to draw from language alone. The meaning of language is not just more language or pictures or videos. Language refers to stuff outside of itself that can only be understood based on a shared perception of physical and social reality.


I fail to see how the first is useful.

For all intents and purposes your brain might as well be a Boltzmann brain / in a jar getting electrical stimuli. Your notion of reality is a mere interpretation of electrical signals / information.

This implies that all such information can be encoded via language or whatever else.

You also don’t take initiative. Every action that you take is dependent upon all previous actions as your brain is not devoid of operations until you “decide” to do something.

You merely call the outcome of your brain’s competing circuits “taking initiative”.

GPT “took initiative” to pause and ask me for more details instead of just spitting an answer out.

As for the latter, I don’t think that holds. Language is just information. None of our brains are even grounded in reality either. We are grounded in what we perceive as reality.

A blind person has no notion of colour yet we don’t claim they are not sentient or generally intelligent. A paraplegic person who lacks proprioception and motor movements is not “as grounded” in reality as we are.

You see where this is going.

With all due respect, you are in denial.


> You merely call the outcome of your brain’s competing circuits “taking initiative”.

We give names to all kinds of outcomes of our brains' competing circuits. But our brains' competing circuits have evolved to solve a fundamentally different set of problems than an LLM was designed for: the problems of human survival.

> A blind person has no notion of colour yet we don’t claim they are not sentient or generally intelligent.

Axiomatic anthropocentrism is warranted when comparing humans and AI.

Even if every known form of human sensory input, from language to vision, sound, pheromones, pain, etc., were digitally encoded and fed into its own large <signal> model, and they were all connected and attached to a physical form like C3PO, the resulting artificial being - even if it were marvelously intelligent - should still not be used to justify the diminishment of anyone's humanity.

If that sounds like a moral argument, that's because it is. Any materialist understands that we biological life forms are ultimately just glorified chemical information systems struggling in vain against entropy's information-destroying effects. But in this context, that's sort of trite and beside the point.

What matters is what principles guide what we do with the technology.


> We give names to all kinds of outcomes of our brains' competing circuits. But our brains' competing circuits have evolved to solve a fundamentally different set of problems than an LLM was designed for: the problems of human survival.

Our brain did not evolve to do anything. It just happened that a scaled-up primate brain is useful for DNA propagation, that's it. The brain cannot purposefully drive its own evolution just yet, and we have collectively deemed it unethical because a crazy dude used it to justify murdering and torturing millions.

If we are being precise, we are driving the evolution of said models based on their usefulness to us, thus their capacity to propagate and metaphorically survive is entirely dependent on how useful they are to their environment.

Your fundamental mistake is thinking that training a model to do xyz is akin to our brains "evolving". The better analogy would be that as a model trains through interactions with its environment, it changes. The same thing happens to humans; it's just that our update rules are a bit different.

The evolution is across iterations and generations of models, not their parameters.

> should still not be used to justify the diminishment of anyone's humanity.

I am not doing that; on the contrary, I am elevating the models. The fact that you took it as a diminishment of the human is not really my fault nor my intention.

The belief that elevating a machine or information to humanity is the reduction of some people's humanity or of humanity as a whole, is entirely your issue.

From my perspective, this only shows the sheer ingenuity of humans, and just how much effort it took for millions of humans to create something analogous to us, and eventually build a potential successor to humanity.


> The belief that elevating a machine or information to humanity is the reduction of some people's humanity or of humanity as a whole, is entirely your issue.

It's not just my issue, it's all of our issue. As you yourself alluded to in your comment invoking the Holocaust above, humans don't need much of a reason to diminish the humanity of other humans, even without the presence of AIs that marvelously exhibit aspects of human intelligence.

As an example, we're not far from some arguing against the existence of a great many people because an AI can objectively do their jobs better. In the short term, many of those people might be seen as a cost rather than people who should benefit from the time and leisure that offloading work to an AI enables.


> As an example, we're not far from some arguing against the existence of a great many people because an AI can objectively do their jobs better.

We are already here.

The problem is that everyone seems to take capitalism as the default state of the world: we don't live to live, we live to create, and our value in society depends on our capacity to produce value for the ruling class.

People want to limit machines that can enable us to live to experience, to create, to love and share just so they keep a semblance of power and avoid a conflict with the ruling class.

This whole conundrum and these complaints have absolutely nothing to do with the models' capacity to meet or surpass us, but everything to do with fear of losing jobs, because we are terrified of standing up to the ruling class.


>You also don’t take initiative. Every action that you take is dependent upon all previous actions as your brain is not devoid of operations until you “decide” to do something.

You would say that, wouldn't you? ;-)


The answer perhaps depends on how you define "understanding" and "meaning", and whether these concepts are separable from language at all.


That is what modalities mean.

These are being added on.

In particular, we can add many more than humans are able to handle.



