> There's no such thing as a "world model" There obviously is in humans. When yo...

astrange · 2026-02-09T22:27:43

> When you visually simulate things or e.g. simulate how food will taste in your mind as you add different seasonings, you are modeling (part of) the world.

Modeling something as an action is not "having a world model". A model is a consistently existing thing, but humans don't construct consistently existing models because it'd be a waste of time. You don't need to know what's in your trash in order to take the trash bags out.

> We know LLMs can't be doing visuospatial reasoning using imagery, because they only work with text tokens.

All frontier LLMs are multimodal to some degree. ChatGPT thinking uses it the most.

D-Machine · 2026-02-09T22:59:49

> Modeling something as an action is not "having a world model".

It literally is; this is definitional. See, e.g., how these terms are used in the V-JEPA 2 paper (https://arxiv.org/pdf/2506.09985). EDIT: Maybe you are unaware of what the term means and how it is used. It does not mean "a model of all of reality"; i.e., we don't have a single world model, but many world models that are used in different contexts.

> A model is a consistently existing thing, but humans don't construct consistently existing models because it'd be a waste of time. You don't need to know what's in your trash in order to take the trash bags out.

Both sentences are obviously just completely wrong here. I need to know what is in my trash, and how much, to decide if I need to take it out, and how heavy it is may change how I take it out too. We construct models all the time, some temporary and forgotten, some which we hold within us for life.

> All frontier LLMs are multimodal to some degree. ChatGPT thinking uses it the most.

LLMs by definition are not multimodal. Frontier models are multimodal, but only in a very weak and limited sense, as I address in other comments (https://news.ycombinator.com/item?id=46939091, https://news.ycombinator.com/item?id=46940666). For the most part, the text outputs you get from a frontier model are not informed by, and do not use, any of the embeddings or semantics learned from images and video (in part due to lack of data and the cost of processing visual data), and only certain tasks will trigger, e.g., the underlying VLMs. This is not like humans, where we use visual reasoning and visual world models constantly (unless you are a wordcel).

And most VLM architectures are still multimodal in only a very limited or simplistic way, with lots of separately pre-trained backbones (https://huggingface.co/blog/vlms-2025). Frontier models are nowhere near multimodal in the way that human thinking and reasoning is.