
Digital data are only a tiny part of the influx of information that people interact with. It's the same error platforms made by releasing movies straight to their services. Yes, people watch more movies at home, but going to the cinema is a whole experience that encompasses more than just watching a movie. Yes, books are great, but traveling and tutoring are much more impactful.


Sorry, what does that have to do with an OpenAI tech moat?


>>> The company that captures the most human-AI interaction data will have a TREMENDOUS moat.

>> When the big companies say they're running out of data, I think they mean it literally. They have hoovered up everything external and internal and are now facing the overwhelming mediocrity that synthetic data provides.

> Digital data are only a tiny part of the influx of information that people interact with.


I'm not sure how you'd get that non-digital data, though. Fundamentally that sounds like a process that doesn't scale to the level that they need. Can you explain more?


Sorry, I wasn't clear enough. I'm saying that for most problems, a lot of the relevant data is not digital. I'm a software developer, and most of the time the task is to transcribe some real-world process into a digital equivalent. But most of the time, you lose the richness of the interactions in exchange for repeatability, correctness, speed, and so on.

So what people bother writing down is just a pale reflection of what actually happened, and the reader has to rely on their own experience and imagination to recreate it. Take drawing, for example: you may read every book on the subject, but you still have to practice to properly internalize that knowledge. The same goes for music, or even pure science (the axioms you start with are grounded in reality).

I believe LLMs are great at extracting patterns from written text and other forms of notation. They may even be good at translating between them. But as any polyglot may attest, literal translation is often inadequate because a lot of terms have no true equivalent. Without experiencing the full semantic meaning of both, you'll always be at risk of being confusing.

With traditional software, we were the ones providing meaning so that different tools could interact with each other (when I click this icon, a page will be printed out). LLMs are mostly translation machines: a thin veneer of syntax rules and relationships between terms, but with no actual meaning, because of all the information they lack.


I actually think LLMs' power comes as a result of their deep semantic understanding. For example, embeddings of gendered language, like "king" and "queen," have a very similar vector difference to "man" and "woman". This is true across all sorts of concepts if you really dive into the embeddings. That doesn't come without semantic understanding.
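The "king − man + woman ≈ queen" relationship can be sketched with a toy example. The vectors below are hand-picked 2-D stand-ins (a real model like word2vec learns hundreds of dimensions from text), but the analogy-by-vector-arithmetic mechanics are the same:

```python
import numpy as np

# Toy embeddings: hypothetical hand-chosen values, NOT from a trained model.
# Axis 0 roughly encodes "royalty", axis 1 roughly encodes "gender".
emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),  # distractor word
}

def cosine(a, b):
    # Cosine similarity between two vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Solve a - b + c ~= ?, excluding the three query words themselves."""
    target = emb[a] - emb[b] + emb[c]
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("king", "man", "woman"))  # -> queen
```

With real pretrained vectors you'd do the same thing via a library such as gensim's `KeyedVectors.most_similar(positive=["king", "woman"], negative=["man"])`, which implements exactly this offset-plus-nearest-neighbor search.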

As another example, LLMs are kind of magical when it comes to what I'd call "bad memory spelunking". Is there a video game, book, or movie from your childhood of which you retain only vague fragments, and which you'd like to rediscover? Format those fragments into a request for a list of candidates, and if your description contains just enough detail, you'll activate that semantic understanding and uncover what you were looking for.

I'd encourage you to check out 3blue1brown's LLM series for more on this!

I think it's true they lack a lot of information and understanding, and that they probably won't get better without more data, which we are running out of. That's sort of the point I was originally trying to make.



