There obviously is in humans. When you visually simulate things or e.g. simulate how food will taste in your mind as you add different seasonings, you are modeling (part of) the world. This is presumably done by having associations in our brain between all the different qualia sequences and other kinds of representations in our mind. E.g. we know we do some visuospatial reasoning tasks using sequences of (imagined) images. Imagery is one aspect of our world model(s).
We know LLMs can't be doing visuospatial reasoning using imagery, because they only work with text tokens. A VLM or other multimodal model might be able to do so, but an LLM can't, and so an LLM can't have a visual world model. LLMs might in special cases be able to construct a linguistic model that lets them do some computer vision tasks, but that model will itself still only be using tokenized words.
There are all sorts of other sensory modalities and things that humans use when thinking (e.g. actual logic and reasoning, which goes beyond mere semantics and might include things like logical or other forms of consistency, such as consistency with a relevant mental image), and the "world model" concept is supposed, in part, to point to these things that are more than just language and tokens.
> Obviously not true because of RL environments.
Right, AI in general can have much more complex world models than LLMs. An LLM can't even handle e.g. sensor data without significant architectural and training modification (https://news.ycombinator.com/item?id=46948266), at which point it is no longer an LLM.
> When you visually simulate things or e.g. simulate how food will taste in your mind as you add different seasonings, you are modeling (part of) the world.
Modeling something as an action is not "having a world model". A model is a consistently existing thing, but humans don't construct consistently existing models because it'd be a waste of time. You don't need to know what's in your trash in order to take the trash bags out.
> We know LLMs can't be doing visuospatial reasoning using imagery, because they only work with text tokens.
All frontier LLMs are multimodal to some degree. ChatGPT thinking uses it the most.
> Modeling something as an action is not "having a world model".
It literally is; this is definitional. See how these terms are used in e.g. the V-JEPA-2 paper (https://arxiv.org/pdf/2506.09985). EDIT: Maybe you are unaware of what the term means and how it is used. It does not mean "a model of all of reality", i.e. we don't have a single world model, but many world models that are used in different contexts.
> A model is a consistently existing thing, but humans don't construct consistently existing models because it'd be a waste of time. You don't need to know what's in your trash in order to take the trash bags out.
Both sentences are obviously just completely wrong here. I need to know what is in my trash, and how much, to decide if I need to take it out, and how heavy it is may change how I take it out too. We construct models all the time, some temporary and forgotten, some which we hold within us for life.
> All frontier LLMs are multimodal to some degree. ChatGPT thinking uses it the most.
LLMs by definition are not multimodal. Frontier models are multimodal, but only in a very weak and limited sense, as I address in other comments (https://news.ycombinator.com/item?id=46939091, https://news.ycombinator.com/item?id=46940666). For the most part, none of the text outputs you get from a frontier model are informed by or use any of the embeddings or semantics learned from images and video (in part due to lack of data and the cost of processing visual data), and only certain tasks will trigger e.g. the underlying VLMs. This is not like humans, where we use visual reasoning and visual world models constantly (unless you are a wordcel).
And most VLM architectures are still multimodal in only a very limited or simplistic way, with lots of separately pre-trained backbones (https://huggingface.co/blog/vlms-2025). Frontier models are nowhere near multimodal in the way that human thinking and reasoning is.