malwrar · 2026-06-06T04:43:41 1780721021

Back when ChatGPT came out, I was so shocked by how _good_ it was for an “AI” product that I simply had to know how it worked. Over the next month I ended up drawing out a block diagram on a whiteboard I have in my office, with the math involved written next to each step. I’d puzzle over each step along the way, and the triumph of completing the drawing came with a sense of deep understanding. I kept that drawing up for many months after, and would often gaze at it in wonder during meetings and idle moments.

This is to say: the autoregressive decoder-only transformer llm architecture as pioneered by openai is wildly simple for how revolutionary its results are. I was reading about non-learned classical SLAM systems (uses video + handcrafted math to produce 3d mappings of physical spaces while also locating the camera in those spaces) at the time, and comparatively speaking I’d say the math is about as complicated as ONE of the components in those complex formulations. The only reason frontier LLMs need 6-figure computers to run is because the model designers made the middle bit in those models REALLY BIG, dimensionally speaking. They just took the steam engine, made a few gargantuan versions of it, and are selling them as the ultimate source of power.

This was openai’s entire breakthrough. Making this particular model architecture larger leads to emergent capabilities like being able to pick the best ending to a story/set of instructions or answer questions about broad factual knowledge. I’ve been meanwhile watching these AI companies attempt, successfully, to sell this capability as some sort of robot consciousness hand-crafted by supergeniuses. The fact that they are getting away with it is almost as shocking to me as the discovery itself.
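For anyone who wants to see the "autoregressive" part on its own, here is a minimal sketch of the generation loop. The "model" is a toy bigram counter, which is purely an assumption for illustration: a real transformer conditions on the whole context, not just the last token. The names `train_bigram`, `next_token_probs` and `generate` are hypothetical.

```python
import random

def train_bigram(tokens):
    """Count next-token frequencies: a toy stand-in for a trained model."""
    counts = {}
    for prev, nxt in zip(tokens, tokens[1:]):
        counts.setdefault(prev, {})
        counts[prev][nxt] = counts[prev].get(nxt, 0) + 1
    return counts

def next_token_probs(counts, context):
    """Toy 'model': P(next | last token only)."""
    options = counts.get(context[-1], {})
    total = sum(options.values())
    return {tok: c / total for tok, c in options.items()}

def generate(counts, prompt, max_new_tokens, rng):
    """Autoregressive loop: sample one token, append it, feed it back in."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(counts, out)
        if not probs:  # no known continuation, so stop early
            break
        toks = list(probs)
        weights = [probs[t] for t in toks]
        out.append(rng.choices(toks, weights=weights, k=1)[0])
    return out

corpus = "the cat sat on the mat and the cat ate".split()
model = train_bigram(corpus)
sample = generate(model, ["the"], 5, random.Random(0))
print(" ".join(sample))
```

The loop itself stays the same whatever sits behind `next_token_probs`; in this sketch only the function that produces the distribution would need to be swapped for a real model.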

ekunazanu · 2026-06-06T08:16:00 1780733760

> This was openai’s entire breakthrough. Making this particular model architecture larger leads to emergent capabilities

Basically, the bitter lesson: https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...

williamstein · 2026-06-06T16:23:00 1780762980

This interview https://youtu.be/oWOz2htozfI?si=qdQ0uZRoZOYeThOn from 2 days ago with a top researcher from OpenAI directly addresses the bitter lesson argument and the importance of scaling for the history of their models.

jochembrouwer · 2026-06-06T21:27:18 1780781238

So the take-away here is that we (as humans) try to model these AIs like humans, but eventually these AIs get better. Which to me seems like a logical conclusion if they can do "things" (like "learning" or pattern matching) much faster than we can (the compute). Then language in LLMs is a bottleneck, the AI is constrained by the language, and thus if we want to scale further we could let AI create its own language (we would then have to translate whatever it creates back to a language we understand). It is the same, for instance, if we look at the language of the Inuit (people who live in the north and make temporary shelters like igloos in the snow): they have multiple words/verbs to describe the snow, while in English we only have one (?): snow. In English we don't need more words (we can explain the state of the snow using multiple words), but for the Inuit language it makes sense to create these new terms (it would also make it easier and faster to communicate). So in some sense, all languages are then "newspeak" relative to whatever general language researchers or AI might come up with. If this sounds dumb let me know, but if you know some research in this general language direction (I'd assume general AI research) I would love to see it!

Vachyas · 2026-06-07T02:42:21 1780800141

I think one hypothesis along these lines is that, if allowed, due to the limitations of human language you described, LLMs will gravitate towards "inventing" their own language (which, due to training pressures, may even resemble english from the outside, but contain deeper, "true", meaning within), but that we should do our best to prevent this even if it bottlenecks reasoning capabilities since it would cut off our ability to read its "true" thoughts and detect misalignment

See: https://openai.com/index/chain-of-thought-monitoring/

Quote below:

  Chain-of-thought (CoT) reasoning models “think” in natural language understandable by humans. Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving up when a problem is too hard.

  We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future.

  We have further found that directly optimizing the CoT to adhere to specific criteria (e.g. to not think about reward hacking) may boost performance in the short run; however, it does not eliminate all misbehavior and can cause a model to hide its intent. We hope future research will find ways to directly optimize CoTs without this drawback, but until then

  We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring.

  We understand that leaving CoTs unrestricted may make them unfit to be shown to end-users, as they might violate some misuse policies. Still, if one wanted to show policy-compliant CoTs directly to users while avoiding putting strong supervision on them, one could use a separate model, such as a CoT summarizer or sanitizer, to accomplish that.

cavemandaveman · 2026-06-07T02:14:41 1780798481

My understanding is that the Inuit snow claim is a bit of a myth. Beyond English words like slush, sleet, powder, hardpack, flurry, blizzard, etc you can also say "fluffy snow", "wet snow", etc. The Inuit language is just smooshing the adjective so you get something like "wetsnow" as one word.

xnx · 2026-06-06T16:23:24 1780763004

Isn't the bitter lesson basically the same as "The Unreasonable Effectiveness of Data" from 2009?

swyx · 2026-06-06T18:55:03 1780772103

not exactly, bitter lesson is one meta-level up from "scale eats everything". this is a common misunderstanding of bitter lesson that rich sutton has been fighting ever since the thing was written. in rich's own words[1], the modern summary is

> Don’t be distracted by human knowledge, as AI has been historically.

> Instead focus on methods for creating knowledge that scale with computation, like search and learning.

so the lesson is choose methods that scale with computation, not just that blindly scaling up anything (data, params, people, whatever) works, it is choosing the right x axis and the right scaling laws consistently wins out in the long run despite short term wins from other methods.

1: https://x.com/RichardSSutton/status/2056419165502935198

jfim · 2026-06-06T05:06:47 1780722407

Indeed. It's pretty interesting to realize after implementing GPT-2 that the frontier models are scaled up versions of that, with various tweaks to improve performance, model-wise.

The secret sauce though is all the datasets, RL training, knowledge of what works from doing all kinds of ablation experiments, and a massive compute moat.

gobdovan · 2026-06-06T07:00:31 1780729231

The secret sauce is also having the necessary 'creativity' to not get ceased and desisted into oblivion and jail from all the copyrighted material you trained your model on. Btw, not making a moral judgement, [0] shows Michael and Dalton from YC discussing why Ilya Sutskever had to leave Google to pursue what's now ChatGPT

[0] https://youtu.be/E8pvgN1j-Ck?t=748

root-parent · 2026-06-06T13:53:53 1780754033

There is a whole moral judgement to be made here...lets hope Ilya wont get too pissed off if somebody leaks the work of his new initiative...information wants to be free and all that...

Also would love to know if the same Legal team advised on Gemini...

someguyiguess · 2026-06-06T14:27:30 1780756050

And to make anyone who threatened to expose them “commit suicide”

miltonlost · 2026-06-06T15:02:18 1780758138

He's a massive massive thief that people who have stolen far less from a convenience store have gone to prison for. The man is a villain.

achrono · 2026-06-06T06:16:44 1780726604

How do we know that today's frontier models are merely scaled up versions of that? Genuine question, since the labs have narrowed what they share over the years to now almost nothing, in terms of how the model was trained and how it works under the hood.

HarHarVeryFunny · 2026-06-06T14:16:09 1780755369

We know for sure the architecture of the open weights models since llama.cpp understands the architecture it needs to build to plug the weights into to run them. It's always possible that the latest closed model is doing something architecturally different than the open weights ones we know about, but judging by how close the large open weight models such as DeepSeek are to SOTA performance, this seems unlikely. When OpenAI first came out with their near-mythical "Strawberry" (aka "o1") thinking model there was all sorts of speculation that they had made some sort of architectural breakthrough, but then DeepSeek replicated the capability and published how they did it, proving that it was just better training, not any architectural change.

There have been minor changes to the architecture over the years, but these are basically all efficiency tweaks such as various types of attention (some pioneered in the open by DeepSeek) that better scale to large context lengths, and the confusingly named "mixture of experts" architecture, but what's more notable really is how little the architecture has changed. The capability gains have been coming from better training and better data.

gobdovan · 2026-06-06T07:10:00 1780729800

DeepSeek research:

- V3 https://arxiv.org/abs/2412.19437

- V2 https://arxiv.org/abs/2405.04434

- R1 https://arxiv.org/abs/2501.12948 (RL applied to ML models was well-known beforehand, but they show it in the open, at scale, on big models)

Then, there's the incentive analysis. If you can see that these models empirically get better with scale, why would you swap the main architecture? Those events will be pretty rare. I'm not saying there's no one cooking a new architecture, just that it is a pretty rare event. And it would have to come from some researchers that would be happy to not publish their findings, which is not really what a sizable portion of elite researchers (obviously not all) are incentivized to do.

Of course, it's a bit of a verbal compression to claim simply 'scaled up'. They are recognisably scaled-up transformers, and most new models come with a few tricks, but we're at the point where those tricks are usually not an architectural rewrite. They are added to solve an explicit problem, like hallucination, not for big new capability gains.

swyx · 2026-06-06T18:57:29 1780772249

> If you can see that these models empirically get better with scale, why would you swap the main architecture? Those events will be pretty rare

cf. the hardware lottery: https://arxiv.org/abs/2009.06489

matusp · 2026-06-06T07:11:59 1780729919

There are thousands of people working in top level labs. Somebody would leak it

ai_slop_hater · 2026-06-06T06:27:15 1780727235

No they are clearly not just scaled up versions of gpt 2; there are different LLM architectures like mixture of experts etc that appeared relatively recently. I am not an expert though, far from it.

otabdeveloper4 · 2026-06-06T06:35:53 1780727753

MoE and such are basically performance enhancements, they don't make the model smarter.
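To make the "performance enhancement" point concrete, here is a rough sketch of top-k expert routing, assuming a plain dot-product router and a softmax over the selected scores. All names (`moe_layer`, `make_expert`, `router_weights`) are hypothetical, and real MoE layers add things like load balancing that are left out here.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_layer(x, router_weights, experts, k=2):
    """Route one token vector x to its top-k experts and mix their outputs.

    router_weights: one weight vector per expert (assumed learned elsewhere)
    experts: list of callables, each mapping a vector to a vector
    """
    # Router score per expert: dot(router_weights[e], x)
    scores = [sum(w * xi for w, xi in zip(ws, x)) for ws in router_weights]
    top = sorted(range(len(experts)), key=lambda e: scores[e], reverse=True)[:k]
    gates = softmax([scores[e] for e in top])
    out = [0.0] * len(x)
    for gate, e in zip(gates, top):
        y = experts[e](x)  # only the k selected experts are ever executed
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top

calls = []  # records which experts actually ran

def make_expert(scale, name):
    def expert(v):
        calls.append(name)
        return [scale * vi for vi in v]
    return expert

experts = [make_expert(s, i) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y, chosen = moe_layer([2.0, 1.0], router_weights, experts, k=2)
print(chosen, calls)
```

The parameter count covers all four toy experts, but each token only pays the compute for two of them, which is where the saving comes from.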

yababa_y · 2026-06-06T07:23:30 1780730610

separately trained experts can surpass performance in their activated regime and DOES result in a smarter model, the Claude system cards talk about this and eg there is https://openreview.net/forum?id=iydmH9boLb to read...

jmalicki · 2026-06-06T13:16:03 1780751763

Performance enhancements are huge though.

If you can make the existing model faster, you can then save your inference budget to then make your model bigger, which then makes it smarter.

A lot of how smart the models can be comes down to budget. If you can make your existing thing cheaper, you can instead make it bigger for the same price.

TheHalfDeafChef · 2026-06-06T13:42:13 1780753333

Not really “smarter” though? It’s just a big probability engine.

(Not trying to flame bait or anything. I just wouldn’t call LLM as exhibiting intelligence. It is great at making connections based on probability but doesn’t have a semantic understanding of what it is doing)

otabdeveloper4 · 2026-06-06T14:15:04 1780755304

> to then make your model bigger, which then makes it smarter

There's diminishing returns and at some point making a model bigger makes it dumber.

fizx · 2026-06-06T20:59:00 1780779540

Performance enhancements are what allow you to train a bigger model.

locknitpicker · 2026-06-06T15:01:30 1780758090

> The secret sauce though is all the datasets, RL training, knowledge of what works from doing all kinds of ablation experiments, and a massive compute moat.

ReAct loops and tool-calling are the critical development feature. They turn a model from something that generates text into something that can independently influence the world around them.

Without agent features, you have just a chatbot.

galaxyLogic · 2026-06-06T19:59:55 1780775995

The big breakthrough is we can interact with the agents using natural language - because of the LLM.

It is the combination of LLM and agent-harnesses that make it look really smart. Agent-harness is a programmatic device that lets us tap into the vast knowledge in the LLM.

It is probably true that many TV commentators fail to appreciate this fact and therefore think LLMs are super-intelligent. No, it is the combination of LLM and the programmatic agent-harness that is the breakthrough.

An interesting thought is that the LLM could in theory code the agent-harness and start it running every time we interact with it. Currently the agent-harness is pretty static, I think. In theory it could be dynamically created for every task. Would that make it better? I don't know.

locknitpicker · 2026-06-07T10:27:21 1780828041

> The big breakthrough is we can interact with the agents using natural language - because of the LLM.

Without ReAct and tool calling, all you have is a chatbot. That's useful, but it's just a chatbot.

ReAct loops and tool calling are what unblock high-value use cases. They enable systems to actually address free-form problem statements, gather data that is not a part of their training set, inspect the current state of services, and trigger actions in external systems. This goes well beyond mere chatbots.
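As a rough illustration of what such a loop looks like, here is a minimal sketch. The "model" is a scripted stand-in function rather than a real LLM call, and the `Action:` / `Observation:` / `Final Answer:` text format, the `add` tool and all function names are assumptions made up for this example, not any particular framework's API.

```python
def scripted_model(transcript):
    """Hypothetical stand-in for an LLM call: returns the next step as text."""
    last = transcript.strip().splitlines()[-1]
    if last.startswith("Observation:"):
        return "Final Answer: " + last[len("Observation:"):].strip()
    return "Action: add 2 3"

# Tool registry: name -> callable taking string arguments, returning a string
TOOLS = {"add": lambda a, b: str(int(a) + int(b))}

def react_loop(question, model, tools, max_steps=5):
    """Alternate model steps and tool calls until the model gives an answer."""
    transcript = "Question: " + question
    for _ in range(max_steps):
        step = model(transcript)
        transcript += "\n" + step
        if step.startswith("Final Answer:"):
            return step[len("Final Answer:"):].strip(), transcript
        if step.startswith("Action:"):
            name, *args = step[len("Action:"):].split()
            tool = tools.get(name)
            result = tool(*args) if tool else "unknown tool " + name
            # Feed the tool result back so the next model step can use it
            transcript += "\nObservation: " + result
    return None, transcript  # step budget exhausted without an answer

answer, log = react_loop("What is 2 + 3?", scripted_model, TOOLS)
print(log)
```

The harness is just this outer loop plus the tool registry; the model only ever sees and emits text.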

> It is the combination of LLM and agent-harnesses that make it look really smart.

It's really not about "smart". It's about autonomous systems, and being able to consume and analyze new data, and trigger actions in external systems.

forestsitter · 2026-06-06T16:19:02 1780762742

Same. I recall reading a paper by Stephen Wolfram after ChatGPT came out where he goes over how it works and what it does. Such a good piece and really got me going with this stuff. https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...

antirez · 2026-06-06T08:15:48 1780733748

There is a different way to look at this: the Transformer is actually a minimal complication of what the base model is. In theory the neural network could be just a huge FFN, which is anyway the part of the Transformer that does the heavy lifting. But this would be impossible to train, both numerically and computationally, so the Transformer encodes enough priors for it to work: the causal attention, and the math tricks like the residuals and so forth. But the bottom line of all this is that the Transformer works because of the incredible semantical power of simple/huge FFNs.

dist-epoch · 2026-06-06T11:14:11 1780744451

Isn't that over-simplifying it a bit too much?

You can go another step - a FFN can be simulated on a Turing machine, thus it just exemplifies the incredible semantical power of the Turing machine model of computation. (in fact you don't even need a Turing machine, since there is no looping in one forward pass).

In theory you can run a huge FFN on the tiniest Turing machine, in practice it's much better to run a Transformer on the latest NVIDIA hardware. Or as they say "quantity (performance) has a quality all its own"

musebox35 · 2026-06-06T12:21:08 1780748468

I was about to post your last point / quote. Going multi-GPU is relatively not so tough, but once you go multi-node you have a distributed storage/IO/compute system, which is highly non-trivial. Add the long training times and you now have robustness and fault-tolerance concerns with hardware failures and restarts. Today’s training systems are engineering marvels.

zbendefy · 2026-06-06T11:38:21 1780745901

Good point!

There is also the case for Markov chains being theoretically able to do these if tuned well. Or even SAT problem.

CGMthrowaway · 2026-06-06T14:57:31 1780757851

"LLM is just fancy autocomplete"

galaxyLogic · 2026-06-06T20:03:54 1780776234

LLM is an Oracle

10GBps · 2026-06-06T05:01:38 1780722098

Yep. It's nearly identical to the neural nets we were using in the 90s. Back then even a supercomputer wasn't big enough or fast enough to do what we do today.

I have to wonder though. Is this all a human brain is? A similar thing to an LLM just scaled exponentially larger. I mean a brain is not just neurons with simple connections to each other. The neurons, axons, dendrites, <insert_unexplained_thing>, etc in a brain are all holding and processing information in different ways and doing it nearly 100% in parallel. That's a really big model.

The biological discoveries show how complex a biological brain actually is. Even the tiny brains in a bee or spider are able to solve puzzles and use tools. That's crazy.

ctolsen · 2026-06-06T06:34:45 1780727685

No, it’s definitely not what a human brain is. That makes very little sense. The ways we interact with language (and thus conceptual memory) are completely and fundamentally different.

rfv6723 · 2026-06-06T07:30:52 1780731052

Is it different though?

If we look beyond written languages, which are late inventions of human civilization, oral languages are continuous and built from blocks, not words.

Chomskyan school misled the entire field of linguistics for decades by ignoring spoken languages.

0xbadcafebee · 2026-06-06T19:02:38 1780772558

Chomsky did the opposite of what you're saying. He didn't ignore spoken language. He said that human vocalization is independent of language, and that the way our brains can manipulate and use sound (a cognitive capability, not specifically an aural one) is the fundamental differentiator that allows us to make compound ideas, and our specific use of language is a byproduct.

Example: a programming language's capability to produce complex software does not come from some inherent quality of language. It comes from binary. 0's and 1's, representing basic logic, and that being built on top of with an abstract "tool" called a language. If the binary logic didn't work, the language wouldn't do anything.

A dolphin can make sounds, and technically has a language, but they can't manipulate or recursively compound concepts (as far as we can tell) in order to create modified ideas. If they could, they probably would have come up with vastly more advanced fishing methods than the (admittedly novel) ones they have now.

uoaei · 2026-06-06T08:28:20 1780734500

It is different, but there may be some universal principles that are relevant more abstractly among both cases. Of particular interest is the empirical notion that statistical models of a certain form will always tend to "average out noise" and "learn meaningful patterns" up to the capacity that those models have for representing said patterns. A parallel notion to this is the hypothesis dubbed "thermodynamic origins of life". The universal principle binding these two seemingly disparate topics is one that seems to underlie any sense of "learning" in physical systems: that semantics of those systems depend on their representational power, and the semantics they do come to represent are the results of adding up many pushes in one "direction" (phase space / state space / etc.) encoding a pattern, and adding up many random noise jiggles will cancel out but give you a first-order sense of variance of those semantic features as expressed by the environment.

As this description is so overly abstract, an exercise for the reader is to try to work through an explanation of how, say, a river delta comes to "learn" about its environment by "reacting" to the influences at its borders, and how it "encodes" whatever it is that it learns in the substrate that it inhabits.

zaphirplane · 2026-06-06T13:03:34 1780751014

But … how close a simulation is it. I can see why people are wondering

redox99 · 2026-06-06T09:06:46 1780736806

In the 90s you didn't have norm layers, residuals, attention, and some more.

So you're missing a lot of the building blocks that make LLMs. It's not a matter of just having the compute.

sirsinsalot · 2026-06-06T10:24:27 1780741467

I think the attention mechanism is so simple but so revolutionary that people forget it.
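For a sense of how small the core really is, here is a sketch of single-head causal self-attention in plain Python. It assumes the query/key/value vectors have already been projected (the learned projection matrices, multiple heads and batching are all left out), and `causal_self_attention` is a hypothetical name.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def causal_self_attention(q, k, v):
    """q, k, v: lists of T vectors (lists of floats), one per position."""
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Causal mask: position t may only look at positions 0..t
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors
        out.append([
            sum(w * v[s][j] for s, w in enumerate(weights))
            for j in range(len(v[0]))
        ])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_self_attention(x, x, x)

# Changing a later token leaves earlier outputs untouched
x_changed = x[:2] + [[9.0, -9.0]]
out_changed = causal_self_attention(x_changed, x_changed, x_changed)
print(out[:2] == out_changed[:2])
```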

Like the best leaps in thinking, once it is made, it is immediately obvious and intuitive.

redox99 · 2026-06-06T11:56:22 1780746982

Almost everything in ML is like that. It seems so obvious in hindsight. It's maybe what I love most.

Residual connections are so simple, so obvious and so vital. Yet nobody came up with them until 2015?

sirsinsalot · 2026-06-06T15:37:39 1780760259

I suspect it was considered many times, but the sheer computation scale would make it feel like obscene brute force. It feels like the right shape but too wild to think about implementing.

I think as time went on, and hardware got better, it seemed more reasonable to actually think about a viable implementation of what I think was a widespread intuition anyone in ML had that everything's context is everything.

It just seemed like a theoretical thing until hardware caught up. Maybe. Perhaps I'm applying a retrospective excuse to why it took so long.

redox99 · 2026-06-06T18:59:30 1780772370

People definitely wanted to train deep networks before, but didn't know how. They even tried things like training layers independently.

I don't think it was intuitive to anyone back then, the vanishing gradient problem was a big deal since the dawn of NNs. I'm not sure what you mean by sheer computation, residuals allow you to have deep networks instead of shallow and wide ones. You can have equivalent parameter count.
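A toy way to see the vanishing gradient point, assuming a scalar "network" of 50 tanh layers that all share one weight (a big simplification, real layers use matrices and separate weights): the sketch multiplies the per-layer derivatives with the chain rule, with and without a skip connection. `input_gradient` is a hypothetical helper.

```python
import math

def input_gradient(x, depth, w, residual):
    """d(output)/d(input) for a scalar stack of `depth` tanh layers."""
    h, grad = x, 1.0
    for _ in range(depth):
        a = math.tanh(w * h)
        local = w * (1.0 - a * a)  # derivative of tanh(w*h) with respect to h
        if residual:
            # h_next = h + tanh(w*h): the identity path contributes the +1
            h, grad = h + a, grad * (1.0 + local)
        else:
            # h_next = tanh(w*h): every layer multiplies by a factor below 1
            h, grad = a, grad * local
    return grad

plain = input_gradient(1.0, depth=50, w=0.5, residual=False)
skip = input_gradient(1.0, depth=50, w=0.5, residual=True)
print(plain, skip)
```

With these settings the plain stack's gradient is a product of 50 factors of at most 0.5, while every factor in the residual stack is at least 1, so the signal reaching the input does not shrink with depth.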

bonoboTP · 2026-06-06T06:06:19 1780725979

Attention layers were not used in the 90s.

spacebacon · 2026-06-06T07:02:55 1780729375

LLMs are semiotic infrastructure. You won’t find a better analogy. The cognitive frame won’t hold.

otabdeveloper4 · 2026-06-06T06:37:28 1780727848

> I mean a brain is not just neurons with simple connections to each other.

No, it's not. There are many animals that have extremely complex and even learned behaviour that have literally zero neurons.

Clearly "neurons" is an oversimplification just-so story, not a scientific theory.

adammarples · 2026-06-06T09:19:24 1780737564

Apparently even single-celled protozoa can show learned trial and error behaviour.

formerly_proven · 2026-06-06T08:16:12 1780733772

Do you consider fungi animals or do you perhaps mean animals that don't have a brain/CNS?

otabdeveloper4 · 2026-06-06T14:14:05 1780755245

Yes, protozoans don't have brains and yet they exhibit complex behavior.

foxes · 2026-06-06T05:28:05 1780723685

Probably better to not simply reduce it by just saying X is Y then if it has all that extra complexity and capacity.

crossroadsguy · 2026-06-06T08:39:34 1780735174

What hopes/paths does a mere CS bachelor (not deep into stats/maths), and mid level dev (native mobile only; 10-15 years exp.), have about not only understanding it (maybe not fully) but getting possibly into this as a career? Not expecting churning out models and AI systems from the first weeks/months but entry/employment into this field?

(If I can be honest, and I am not being disparaging about anything lest it might seem so, I am looking at it from a career breakthrough/move perspective rather than an intellectual pursuit.)

2muchcoffeeman · 2026-06-06T10:01:43 1780740103

I think you need to ask what you actually want to do with the AI.

If you want to be a researcher and come out with the next breakthrough, get ready to go back to school and learn some math.

If you just need to learn how to use it well and build things with it, then you probably just need to have a high level understanding.

Same as programming. I’d bet most programmers have no idea about the physics that makes computers work.

bluerooibos · 2026-06-06T12:43:47 1780749827

> I think you need to ask what you actually want to do with the AI.

What about improving the efficiency of token consumption, etc., basically opportunities for improving cost/performance?

I keep thinking there has to be a better way to share context with models than dumping entire gigantic skill files of raw text or otherwise into them - I'm betting there's a bunch of low-hanging fruit there.

coliveira · 2026-06-06T13:21:32 1780752092

There may be some low hanging fruit, but they're not available to people without deep understanding of how the math works. Well paid people already spend a lot of time thinking about this.

la_fayette · 2026-06-06T17:46:38 1780767998

i am not sure actually if the math is actually that complicated/important. the math around neural networks is calculus/chain rule etc., and for model comparison/validation one needs statistics. the required math for e.g. understanding transformers is quite accessible.

sirsinsalot · 2026-06-06T10:19:56 1780741196

You missed the third and most important reason to learn: fun.

Which sums up HN these days.

malwrar · 2026-06-06T17:02:19 1780765339

I’m also a mere mortal, and after putting a few years into it, I’d say people make it much more complicated than it actually is. I failed most of my math courses for lack of interest, but found passion later with the aforementioned SLAM stuff. I have no doubt you or any other programmer could learn this stuff, especially since you can ask ChatGPT clarifying questions.

I have no idea about careers at this point, I’m still doing fancy IT work as my day job and I look away from the future with dread. I also haven’t been looking for new roles on the open job market, so who knows, maybe there are multimillion pay packages for anyone who can articulate how attention works in an interview.

LatencyKills · 2026-06-06T12:58:17 1780750697

I have a BS in CS (and have been in the field for 25 years). I couldn't understand the transformer architecture until I built a few myself. Here are the books I worked through. I now feel I have a very good understanding of modern LLMs.

https://www.amazon.com/Build-Large-Language-Model-Scratch/dp...

https://www.amazon.com/Build-DeepSeek-Scratch-Abhijit-Dandek...

tinktank · 2026-06-07T04:48:03 1780807683

Has it given you enough of an understanding that you can pick up and follow research papers or did you have to do more to achieve that?

LatencyKills · 2026-06-07T10:40:41 1780828841

I went this route because I had difficulty visualizing the content of the Attention Is All You Need paper. After going through both books, I can now understand every part of that paper.

I'm currently working on a robotics project that uses Nvidia's GR00T N1 model, and I was able to understand the research paper. [0]

[0]: https://arxiv.org/abs/2503.14734

tinktank · 2026-06-07T14:50:30 1780843830

Thank you for the information.

wuschel · 2026-06-06T06:26:08 1780727168

Could you perhaps cite the core papers for LLMs beyond „Attention is all you need“?

sigmoid10 · 2026-06-06T06:45:31 1780728331

"Attention is all you need" is actually a bad paper if you want to learn about autoregressive LLMs specifically, because it describes a more complicated encoder-decoder architecture while modern LLMs are decoder only. So it's an unnecessarily hard way to get into the subject. "Language Models are Unsupervised Multitask Learners" is probably what you are looking for (aka the GPT-2 paper). This was the first time LLMs really showed what is possible, i.e. they can learn to generalize very well from unstructured data. So no more human labelling necessary, which until then was the primary bottleneck in ML. The paper also lists several key ingredients beyond transformers that are mostly still in place today. This also highlights that there was more to it than just "scaling the transformer algorithm" like many people claim. Most developments since then were about improving training data, until "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" drastically changed the architecture landscape again. Later big developments like thinking/reasoning/chain of thought/inference time compute (whatever you want to call it nowadays) are actually all about training again. They work using the exact same architecture.

redox99 · 2026-06-06T08:57:41 1780736261

Chain of Thought was kind of an obvious solution that everybody knew was necessary by the time chatgpt / gpt4 came out. It was just a matter of time that frontier labs actually shipped it.

MoE was also pretty straightforward, just a bit surprising how well it worked (that you can get away with just 1/32 active parameters), but most researchers would have come up with it on their own probably.

The true ground breaking papers are the first two you mentioned (transformers and gpt2), and InstructGPT was also very surprising that it worked so well.

sigmoid10 · 2026-06-06T23:11:38 1780787498

Reasoning is a little bit more than just "baked in" chain of thought prompting. The important takeaway here was that it is not realized at the architecture level of the neural network. And you could say that all these things regarding LLMs were pretty straightforward. But only in hindsight, otherwise there wouldn't have been so much time and effort spent on intermediaries. Breakthroughs mean people simply didn't know stuff before, even if it seems easy with the benefit of hindsight.

blackbear_ · 2026-06-06T06:55:00 1780728900

The GPT3 paper is a good starting point

Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165

I also enjoyed the papers for DeepSeek and GLM for an overview of all the tricks you need to make these things work

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models https://arxiv.org/abs/2512.02556

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models https://arxiv.org/abs/2508.06471

sharma-arjun · 2026-06-06T07:17:09 1780730229

Not a core paper, but I found Formal Algorithms for Transformers [1] (a Google paper from 2022) to have a great pedagogical style.

[1] https://arxiv.org/abs/2207.09238

barrenko · 2026-06-06T07:33:01 1780731181

I'll add in here https://web.stanford.edu/~jurafsky/slp3/, "Speech and Language Processing", with chapters that deal specifically with LLMs and transformers.

root-parent · 2026-06-06T13:51:06 1780753866

I had the same reaction as you, when I learned in detail, how all this works. But then I also learned about superposition and compressed sensing, and now...I am not so sure anymore...

"Beating Nyquist with Compressed Sensing" - https://youtu.be/A8W1I3mtjp8

GardenLetter27 · 2026-06-06T08:16:53 1780733813

It's not just the architecture but also the data - the decoder only approach lets you train in parallel over blocks of text (no RNN serial waiting), that allows you train on much, much more data.
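A small sketch of why that works, assuming the usual next-token setup: shifting one sequence by a single position gives a training target for every position at once, and a causal mask keeps each position from seeing its own answer. `training_pairs` and `causal_mask` are hypothetical helper names, and this only shows the data layout, not an actual training step.

```python
def training_pairs(tokens):
    """One sequence gives len(tokens) - 1 prediction problems in one pass."""
    return tokens[:-1], tokens[1:]

def causal_mask(T):
    """mask[t][s] is True when position t is allowed to see position s."""
    return [[s <= t for s in range(T)] for t in range(T)]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
inputs, targets = training_pairs(tokens)
mask = causal_mask(len(inputs))

# Each row is an independent (visible context -> next token) example, so all
# rows can be computed together instead of one step at a time as in an RNN.
for t, target in enumerate(targets):
    visible = [tok for tok, ok in zip(inputs, mask[t]) if ok]
    print(visible, "->", target)
```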

darksim905 · 2026-06-06T05:06:42 1780722402

For anyone who is curious about the first paragraph here, this is actually a great video overview of how LLM works and the tokenization part.

Tangentially related: This part always seemed fuzzy to me, especially when dealing with data scientists and how they talk about how 'ML' looks at problems. I had this issue when working at a SIEM vendor where they kept going on about use case development having to be designed a certain way to catch things. It was all very frustrating.

cloche · 2026-06-06T16:41:30 1780764090

> this is actually a great video overview of how LLMs work and the tokenization part

Did you mean to link to the video? I would be interested.

sesm · 2026-06-06T10:45:56 1780742756

I would argue that those are not emergent properties of the model, but a property of how humans find insights in a plausible guess.

bluerooibos · 2026-06-06T12:39:57 1780749597

Since you spent a month digging into this, can you recommend any materials/projects to look into to get a decent grasp of how they work?

malwrar · 2026-06-06T17:12:43 1780765963

I’d recommend my method of just drawing out the block diagram and drawing out + digging into the math at each step! I’m the kind of person who needs to take time to ask lots of questions before stuff clicks, and if you are too I strongly recommend it.

I picked it up from trying to teach myself that SLAM stuff. The papers are very short, but highly information dense and at the time there was no ChatGPT to help me. I got through them by just creeping my way through the math with a whiteboard, and something about drawing it out and having it there in my office made it all click. Trying to watch piecemeal lectures on YouTube or grind through foundational books like MVG just didn’t work for me, I used them instead as references for my drawings.

Same happened when I tried learning this GPT stuff. karpathy’s videos were out at the time, but I couldn’t really stay focused on them or connect the math with the code. Most other descriptions I could find were focused on getting you to use their inference library or harness. Assembling the picture together on my whiteboard by focusing on drawing out the block diagram continues to be my personal favorite method for deep understanding of complex systems.

LatencyKills · 2026-06-06T12:54:53 1780750493

Not OP but I worked through Sebastian Raschka's "Build a Large Language Model (From Scratch)" [0] and Raj Abhijit Dandekar's "Build a DeepSeek Model (From Scratch)" [1] books.

I don't think there is anything in a transformer I couldn't explain in the smallest detail now.

[0]: https://www.amazon.com/Build-Large-Language-Model-Scratch/dp...

[1]: https://www.amazon.com/Build-DeepSeek-Scratch-Abhijit-Dandek...

hackinthebochs · 2026-06-06T13:14:15 1780751655

>I don't think there is anything in a transformer I couldn't explain in the smallest detail now.

If you're up for it I would love to know how and why positional encodings work

root-parent · 2026-06-06T14:06:57 1780754817

Learn about superposition and then you will see nobody really knows why this stuff works. It's actually a good interview question to set the bar....

LatencyKills · 2026-06-06T13:32:26 1780752746

Well, as I suggested, working through the implementation yourself will give you that intuition. That said, I think the simplest way to explain why positional encodings are useful is that it gives the transformer just enough information to make attention meaningful without negatively impacting any parallel, content-based comparisons.

A vanilla self-attention layer treats its input as just a set of token vectors. Without positional info, swapping two tokens merely swaps the corresponding outputs, so attention cannot tell one ordering from another. We can "fix" this problem by using positional encodings. Text that has meaning isn't just a set of characters; the location and order of those characters is what provides meaning.
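To make that concrete, here is a rough sketch in plain Python. It assumes a single attention head with identity Q/K/V projections and no causal mask (a simplification, not how a real model is configured), made-up 4-dimensional token vectors, and the sinusoidal encoding from the original transformer paper. Without positions, swapping two tokens only swaps the matching output rows; with positions added, the output for the untouched third token changes too, so order has become visible to attention:

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head attention with identity Q/K/V projections (a simplifying assumption)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, x)) for c in range(d)])
    return out


def sinusoidal(pos, d):
    """Sinusoidal positional encoding: sin on even dimensions, cos on odd ones."""
    enc = []
    for i in range(d):
        if i % 2 == 0:
            enc.append(math.sin(pos / 10000 ** (i / d)))
        else:
            enc.append(math.cos(pos / 10000 ** ((i - 1) / d)))
    return enc


def add_positions(x):
    d = len(x[0])
    return [[a + p for a, p in zip(tok, sinusoidal(pos, d))] for pos, tok in enumerate(x)]


def close(a, b, tol=1e-9):
    return all(abs(u - v) < tol for u, v in zip(a, b))


# Made-up token embeddings, purely for illustration.
tokens = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.7], [0.5, 0.5, 0.1, 0.9]]
swapped = [tokens[1], tokens[0], tokens[2]]

plain, plain_swapped = self_attention(tokens), self_attention(swapped)
pos, pos_swapped = self_attention(add_positions(tokens)), self_attention(add_positions(swapped))

print("no positions, third token unchanged:", close(plain[2], plain_swapped[2]))
print("with positions, third token unchanged:", close(pos[2], pos_swapped[2]))
```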

Gmolomo · 2026-06-06T08:39:22 1780735162

Sooooo just because you are able to understand it, it's not worth anything?

It doesn't have any impact?

Ah wait it does. Mh weird.

Why are you not creating a startup and getting rich?

sarjann · 2026-06-06T08:47:16 1780735636

I mean there is a little something called compute. And other complexity that comes like writing code to efficiently distribute a model across machines.

pkoird · 2026-06-06T05:07:42 1780722462

aka "the bitter lesson"

dominotw · 2026-06-06T13:18:17 1780751897

> Over the next month I ended up drawing out a block diagram on a whiteboard I have in my office, with the math involved next to each step in the blackboard. I’d puzzle about each step along the way, and the triumph of completing the drawing was also that of this sense of deep understanding. I kept that drawing up for many months after, and would gaze at it often during meetings and idle moments in wonder.

How did you know about the steps, and that there was math involved? I am curious about your process and how you worked out what exactly to learn to unravel the mystery.

coliveira · 2026-06-06T13:17:10 1780751830

Don't forget the stolen data from books and papers. You'll never get anything intelligent without using the stolen data they had access to.

giardini · 2026-06-06T17:28:37 1780766917

golergka · 2026-06-06T09:33:49 1780738429

After building some toy LLMs on my own I came to realise that architecture is not the hard part. Training is.

dist-epoch · 2026-06-06T11:24:46 1780745086

That's easy to say AFTER you know the architecture.

Einstein's special relativity is taught these days in high schools. Doesn't mean it wasn't the very hard part at some point in time.

As they say, shoulders of giants.

faurroar · 2026-06-06T05:14:17 1780722857

Architectures have evolved significantly since then. DeepSeek v4 =/= GPT-3. Even then, a great deal of complexity lies in everything surrounding the architectures e.g. how do you implement them performantly on modern accelerators, how do you distribute the model across a set of accelerators, how do you post-train, etc. And pre-training itself is a dark art. If you legitimately think that frontier labs are doing something equivalent to whatever you wrote on your whiteboard, you’re clueless.

jumploops · 2026-06-06T05:26:03 1780723563

Those are all just optimizations.

We still don’t really know why they work, we just know how to build them.

trollbridge · 2026-06-06T05:31:29 1780723889

We don't really know why language works with humans, either. If you raise a baby from birth, you kind of observe how it is learning language, but the process is also rather mysterious. My eldest son's first word was to actually imitate a cow mooing, and then after that to imitate a motor noise of a tractor or truck. And then after that a meow. (His first complete sentence was "King Graham fell"...)

My next child took a completely different path to language, including skipping all the non-verbal imitations.

And then at some point, you just suddenly can two-way communicate with them when you couldn't before, and then after that, they can engage in reasoning.

jumploops · 2026-06-06T06:29:17 1780727357

Completely agree!

It’s interesting to me how similar attempting to understand LLMs is to neuroscience.

“When we turn this bit off, this other thing happens… if we change these weights the Eiffel Tower is now in Rome”

We’re basically just probing around and trying to reverse engineer an emergent system.

To your point, this system may be quite different from model to model (human to human) although some similarities likely occur.

The comment I was responding to tried to belittle the OP’s understanding of transformers, by mentioning that running an LLM at scale is much harder than the simple white board diagram.

My point was simply that we don’t know why they work, and all the extra optimizations aren’t the “thing” that makes it emergent.

Simply scaling the “GPT” is good enough to see it, so the OP’s awe should stand.

(On a side note, what other architectures can we scale to find similar emergent behavior?)

galaxyLogic · 2026-06-06T20:24:13 1780777453

Isn't the LLM simply predicting what should be the next sentences after the user's input, using its algorithm and the data it has extracted from existing texts on the internet? The algorithm that does that could have many different designs, some better, some worse, for the purpose of predicting what output makes the most sense next.

So what is it that we don't understand about why they work? The algorithm? We have the code. Why the specific algorithm makes such good predictions? I see it as a generalization of trying to predict who wins the Kentucky Derby.

trollbridge · 2026-06-06T10:12:16 1780740736

Computer vision ends up displaying emergent behaviour. It just "figures out" things.

ai_slop_hater · 2026-06-06T06:30:30 1780727430

Human brain capabilities are truly amazing, imagine if people didn’t treat their children as if they are stupid and didn’t constantly lie to them, because kids are stupid right, they wouldn’t understand. What heights could be reached.

baq · 2026-06-06T06:43:20 1780728200

We don’t treat children like they’re stupid, we treat children like they’re children. A stupid adult is treated very differently than any child.

Adults are expected to have their world models approximately correct in terms of physical environment so they won’t accidentally kill themselves by falling off a cliff; then there are the social norms which adults are expected to conform to so everyone is kinda predictable to everyone else so adults don’t kill each other too often over food or mates. Understanding of neither is expected from children.

Izkata · 2026-06-06T19:12:50 1780773170

Another example, my parents taught me to read at about 4 years old. When I started kindergarten (the year before 1st grade in the US), the teachers and principal didn't believe I could read and I had to prove it by reading a book to them I'd never seen before.

I think they're right that kids (at least in the US) are generally treated as less capable than they are, and it ends up slightly delaying their development.

ai_slop_hater · 2026-06-06T07:03:46 1780729426

You may have been raised properly since you don’t get what I mean. I really envy kids with “Chinese parents” that had them learn math early on and not some bullshit like that if you put your tooth under your pillow, then a tooth fairy will come.

mejutoco · 2026-06-06T07:30:43 1780731043

I think those 2 are orthogonal. Math still works with Santa or the tooth fairy.

ai_slop_hater · 2026-06-06T07:46:23 1780731983

Maybe math works but critical thinking doesn’t. There are people who have lived for many decades without ever questioning insane b.s. they were taught as kids.

beezlewax · 2026-06-06T07:40:24 1780731624

It is possible to have learned both things you know.

skydhash · 2026-06-06T12:04:41 1780747481

I had to learn maths early (not chinese or asian) and also a bunch of scary stories to make me behave. I would have been glad to learn about fairies.

trollbridge · 2026-06-06T10:10:15 1780740615

They aren't stupid, but they aren't quite ready to handle the full responsibilities of the world and worry about things they don't need to worry about.

My son is very worried about black holes lately when he learned anything that goes into one can't get out. He's pretty concerned astronauts could get stuck in one some day. So I explained to him that Hawking radiation does actually mean you can eventually get out; it just takes some time.

I didn't think it pertinent to mention spaghettification, the fact anywhere near a black hole will be really hot, or that cosmic censorship means whatever Hawking-radiates from a black hole wouldn't be an astronaut anymore.

It was also fun to hear Hawking speak. He wanted to know if Hawking was a robot. I said no, but he has a robot talk for him. Not quite true, but close enough.

pmg101 · 2026-06-06T07:16:39 1780730199

Because god forbid that childhood, the one time in your life when you don't have any responsibilities, should be fun.

ai_slop_hater · 2026-06-06T07:31:40 1780731100

Waste 22 years of life without learning anything and then slave away at a 9-5 job you hate. Brilliant strategy. At least you had “fun”. Then blame billionaires or something.

skydhash · 2026-06-06T12:09:00 1780747740

Childhood only lasts 13 to 15 years where I am. By the time you’re in high school, you can be expected to be responsible in some matters. By 22 you have 7 years of experience in making decisions for yourself.

slopinthebag · 2026-06-06T05:56:41 1780725401

Hm, I wonder if it's more that we're shocked such a simple thing (relatively speaking) can work so well.

malwrar · 2026-06-06T17:18:12 1780766292

It was precisely that for me! Another commenter captures it well; “the bitter lesson” indeed.

otabdeveloper4 · 2026-06-06T06:46:43 1780728403

We do know how they work. They predict the next statistically most likely token.

The "bitter lesson" is that fake-it-till-you-make-it is a valid way of doing knowledge work.

(Or not make it, then people will just claim you're holding the LLM wrong and it's not the AI's fault.)

throw310822 · 2026-06-06T07:43:18 1780731798

> statistically most likely token.

Statistically most likely in what context, given which preconditions? Because each prompt sequence is unique so the probability of any token following it is unknown.

skydhash · 2026-06-06T12:13:01 1780747981

It’s not unknown because that’s what the model computes. It’s matrix multiplication just like shaders.

throw310822 · 2026-06-06T12:34:54 1780749294

And how do you know that the model computes it correctly?

skydhash · 2026-06-06T12:59:23 1780750763

Correctness is based on axioms and rules. You need to define your axioms and rules first before you can determine correctness.

If you’re talking about matrix multiplication, I can use mathematical rules and axioms and prove formally that the multiplication is correct. For next token prediction, I can prove that the set of tokens is finite and that the next token is always part of that set.

But things like grammar correctness, or semantic consistency over a few sentences are not hardcoded rules in the model. They’re emergent properties, mostly due to the amount and quality of data available for training. Quantization is mostly about how much we can shed without losing particular emergent properties (like dithering or psychoacoustic audio compression).
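For a feel of what quantization "sheds", here is a rough sketch assuming simple symmetric round-to-nearest quantization with one scale per weight list. Real inference stacks use more elaborate grouped schemes; the weights and function names below are made up for illustration:

```python
def quantize(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using a single shared scale.

    Assumes at least one non-zero weight, otherwise the scale would be zero.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    return [v * scale for v in q]


def max_error(weights, bits):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical weights; a real layer has millions of these.
weights = [0.12, -0.5, 0.33, 0.9, -0.04]

err8 = max_error(weights, bits=8)
err4 = max_error(weights, bits=4)
print("max round-trip error at 8 bits:", err8)
print("max round-trip error at 4 bits:", err4)
```

Fewer bits means a coarser grid and a larger round-trip error. Whether a given emergent behaviour survives that error is an empirical question this sketch does not answer.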

klempner · 2026-06-06T13:43:05 1780753385

Sufficiently good iterated next token prediction is an AI hard problem.

perching_aix · 2026-06-06T14:45:00 1780757100

This "they just predict the next statistically most likely token" is such a handwavy and willfully misleading explanation, it's unreal, and I'm so fucking tired of seeing it so incessantly repeated. It's beyond asinine.

You know it perfectly damn well that a typical person's idea of statistics is not some insanely high cardinality stateful prediction, but a "well a coin toss is a 50:50, and a lottery win is a 1:100000000". You also know it perfectly damn well that as a result, people will just think that all the sentences chatbots ever produced to them were then just somewhere in the massive training set, letter by letter. This insinuation is often even explicitly appealed to.

And that picture is outright false. It's a statistical process, yes, so saying that it does what it does by "just doing statistics" is gonna be a generally correct description, but that's not at all inquisitive to how exactly does it do it, nor is it the zinger you think it is. If you did the aforementioned, you'd just get milquetoast nonsense, like you can see in the countless Markov-chain primers. And while the models do have a lot of the training set lossily captured, they do also absolutely generalize (that's how they can do that lossy compression), and you can quite literally find representations of those generalizations in them, and also see them activate.

It's like summarizing how any program works by just saying "well it just manipulates ones and zeroes". Not very informative, is it? Or how programs are written by just programmers sitting in a cushy office, rhythmically pressing keys on a keyboard. Not a very fair or insightful description, which you'll know if you've done any amount of programming in your life on your own. Extends to all other white collar jobs too.

It's also not even true in the most literal sense: models can and do absolutely choose a less than maximally likely next token, that's what the various decoding parameters are for. "Maximally likely next token" further conveniently skips over how that likelihood is established in the first place, i.e. the literal point of the question, going in a cute little circle.
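As an illustration of the decoding point, here is a minimal sketch of greedy decoding versus temperature sampling over one step's logits. The logits are made up, and real decoders usually layer top-k or top-p filtering on top of this:

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Higher temperature flattens the distribution, lower temperature sharpens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Always pick the single most likely token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, temperature, rng):
    """Draw a token index in proportion to its temperature-adjusted probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical scores for a three-token vocabulary.
logits = [2.0, 1.5, 0.2]
rng = random.Random(0)

draws = [sample(logits, 1.0, rng) for _ in range(1000)]
counts = [draws.count(i) for i in range(len(logits))]

print("greedy pick:", greedy(logits))
print("sampled counts per token:", counts)
```

Greedy always returns token 0 here, while sampling regularly returns the less likely tokens, which is the "less than maximally likely" behaviour described above.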

I'm so over this "stochastic parrot" bullshit.

stevenhuang · 2026-06-06T21:08:46 1780780126

I don't even try anymore. The people who still parrot the stochastic parrot bit this late in the game will simply never understand it.

otabdeveloper4 · 2026-06-06T21:11:45 1780780305

LLMs predict next token one at a time. (Stochastically.) Literally. It's what they do. That's how they literally work.

If you don't believe me, download llama.cpp and see for yourself.

P.S. I write inference backends in C++ every day. The gall of people like you who figured out how to prompt Claude and think they're hot shit now is simply unbelievable.

perching_aix · 2026-06-06T23:18:19 1780787899

So you work on inference engines, and don't see at all what'd be hilariously disingenuous and reductive about describing how LLMs operate as "just parroting the most statistically likely next token"? It is literally* what they do, yes. And only literally, with a big asterisk of "non-colloquial meaning" after the word "statistically". Like how "significant" means something pretty different, albeit related, in academic writing vs everyday speech.

It's equivalent to professing how you just make apple pies from scratch, while your first step is to always reinvent the universe.

You're further magically blind to this operational fact being weaponized as a trope for furthering anti-ai sentiment (i.e. that it's a political dogwhistle at this point), and to thus you participating in that every time you repeat it?

* Ignoring the decoding caveat I already mentioned, along with the countless ways they're steered. There isn't jack that's likely about some of the responses they produce, and intentionally so. Including the whole chat partner act.

stevenhuang · 2026-06-06T23:30:19 1780788619

Look at his comments here.

Safe to say there's a cognitive block and until he tries to approach this topic in good faith he'll simply never understand. Lol.

https://news.ycombinator.com/item?id=48429027

perching_aix · 2026-06-06T23:49:26 1780789766

It's so beyond tiresome. It's a classic case of someone being technically correct, and abusing the gap between that, and what people actually gather from it, for sentiment manipulation (willfully or otherwise). And I have a pretty hard time believing at this point that it's the otherwise.

I really don't know what's so interesting about auto-complete or next token prediction that it captures these people's attention so much. They're so blatantly not the salient quality to these products that is of interest to the common discourse, it's just baffling.

stevenhuang · 2026-06-06T22:44:02 1780785842

I help write optimized CUDA kernels for proprietary hardware. They may "literally" work this way, but that is quite beside the point.

If you don't see why then you have exactly demonstrated my point in how practitioners like you simply lack the foundational understanding in philosophy, information theory, human consciousness, human cognition, neuroscience, necessary to bridge this conceptual gap.

(Rather, it is that we know so little of how consciousness or what intelligence even is, that we cannot possibly use first principles to preclude LLMs from possessing these qualities)

You don't understand the argument, so you keep repeating first order mechanistic observations that are irrelevant. If you don't want to understand the argument, don't be surprised when people refuse to engage with you, especially when it's evident to those more knowledgeable the position you hold is the ignorant one.

firemelt · 2026-06-06T08:52:18 1780735938

fucking well said

robwwilliams · 2026-06-06T14:32:47 1780756367

Great, and won’t we all be just as surprised when human self-attentional control turns out to be just as simple or just as complex! Our minds as a strange fabric built of threads of recursions without the benefit of any explicit clock.

malwrar · 2026-06-03T02:27:09 1780453629

How do you verify someone’s age reliably without identifying them? Unless there’s some standard around zero-knowledge proofs they’re implementing that I’m not aware of, they’re probably going to end up identifying everyone as part of this system, since kids will try and bypass it and parents will demand it be made more robust. Kids will still bypass it no matter what we do.

pocksuppet · 2026-06-03T17:16:43 1780507003

You don't. The law we're discussing doesn't verify someone's age. Please stop calling it an age verification law, because that's completely disingenuous FUD.

malwrar · 2026-06-02T20:33:09 1780432389

I don’t understand how video evidence from a mass surveillance network would have helped here. They found your car without it! Shouldn’t your issue be with the prosecutor, and thus your ineffective local government?

Otherwise, what’s to stop them from just telling you video evidence isn’t enough, because jurors have become accustomed to thinking that video evidence can be faked by vindictive cops?

malwrar · 2026-06-01T11:32:33 1780313553

Really hate to say it, but I’ve stopped publishing my work too for this reason. I spend most of my time now building my own little software ark, and I aspire to no longer think of programming in the next few years. I feel like the creative economy in general will be unrecognizable in the near future, maybe nonexistent. I wonder what modes of collaboration on ideas might form in the next few years.

irdc · 2026-06-01T11:52:29 1780314749

Here is what the purveyors of AI don't seem to realise. You can bend copyright law all you want in order to train your models on whatever you can grab, but in the absence of genuine protection of their creative work authors are simply not going to be publishing at all.

buran77 · 2026-06-01T14:31:00 1780324260

I think they see it all too well. They still think they can make bank today while it lasts, whatever comes after is some other shareholder's problem. And if we're talking about open source, killing it might be a positive side effect, they'll be ready to sell you a closed source alternative when you no longer have options.

irdc · 2026-06-01T16:04:02 1780329842

I don't think we're going back to closed source. I think we're going back to guilds. Aka. closed knowledge.

lesostep · 2026-06-01T15:00:28 1780326028

Furthermore, if people not only stop publishing, but also take down already published works, it will create a moat around already existing Language Models

And the more they DDOS small websites — instead of respectfully scraping once — the more realistic my conspiracy theory looks.

egypturnash · 2026-06-01T14:13:27 1780323207

People who are making stuff because they want to share it are still going to be publishing. And fighting to be noticed in an unending torrent of slop.

irdc · 2026-06-01T14:21:00 1780323660

Without any material or immaterial benefits? And with one's work being ground up and turned into weights for the next version of the machine that's threatening one's employment?

egypturnash · 2026-06-02T04:10:08 1780373408

I personally am sharing stuff because I want people to read my comics, and maybe join my crowdfunding campaigns.

If I could put everyone pushing all this AI crap into a meat grinder, I would.

lelanthran · 2026-06-01T17:41:28 1780335688

> People who are making stuff because they want to share it are still going to be publishing.

Those people who do that are too few and far between to make a difference. The majority of open source devs aren't giving away the source without a license. That license is how they specify what they want in return.

dragonwriter · 2026-06-01T17:48:43 1780336123

> The majority of open source devs aren't giving away the source without a license.

100% of open source devs aren’t giving away the source without a license, since a license, the grant of permissions for what is otherwise exclusive to the author under the law, is what makes something open source.

> That license is how they specify what they want in return.

No, the license is how they legally give away permission to use material that is legally subject to their exclusive rights by virtue of creation. The license may be a contract license that, as you suggest, involves mutual exchange of value, but for many (especially permissive) open source licenses it is a gratuitous bounded grant of permission which has limits but does not involve giving something of value back to the creator.

lelanthran · 2026-06-01T18:28:02 1780338482

> No, the license is how they legally give away permission to use material that is legally subject to their exclusive rights by virtue of creation. The license may be a contract license that, as you suggest, involves mutual exchange of value, but for many (especially permissive) open source licenses it is a gratuitous bounded grant of permission which has limits but does not involve giving something of value back to the creator.

Wrong. What they want in return is either credit or derivatives of the software. It's disingenuous to suggest that all these authors specifying, in a legal document, the exact mechanism by which to pay them back don't know what they are asking.

If you're not happy with that trade, then don't make it.

dzhiurgis · 2026-06-01T12:41:57 1780317717

Great. More work for AI then.

kator · 2026-06-01T14:19:17 1780323557

The sad thing is I feel trapped on all sides of the debate, I wrote a book about LLMs and human creativity (spoiler Humans win for a long time) but I was going to do it as a blog series, instead I published https://www.amazon.com/dp/B0GXCSY4W8 because I felt at least I might get a bit back for literally 100’s of hours of my life I poured into the book and my editor and friends who read and provided reviews.

And I push a lot of open source code including a ton for the SWGEmu project, but now I’m of mixed mind to stop pushing anything public. I can’t decide, am I talking out of both sides of my mouth, it’s a confusing time to navigate for sure.

malwrar · 2026-06-01T16:11:58 1780330318

Indeed sad, congrats on publishing your book though. I’ve certainly felt a bit of that same angst myself.

I think SWGEmu (cool project, just learned of it from you!) does represent some optimism though. Maybe these sorts of passion projects will take over the space?

lelanthran · 2026-06-01T17:40:05 1780335605

> Really hate to say it, but I’ve stopped publishing my work too for this reason.

Me too; not that I've published a lot, but definitely more than most. That won't be happening anymore.

malwrar · 2026-05-31T18:23:12 1780251792

I don’t think people realize that these devices can even be used that way. I talk with people outside of the tech scene frequently, and they are routinely surprised when I tell them about this sort of capability. The ring doorbell Super Bowl commercial about finding lost dogs was a genuine shock to people! I think there’s a degree of visibility you need to get people’s attention on an issue, and it’s just difficult to see a doorbell as a threat for the average person.

malwrar · 2026-05-31T18:15:03 1780251303

You know how when you finish prompting some code generator to build something, and you look over what it has built and feel a sense of emptiness even if it does what you want? I think about what I wish the prototype looked like, and basically start describing details that I expect to exist (think longer versions of e.g. “this should be using our internal graph library, and I figure we can model this task as a traversal, how far have you strayed from this and why?”) and let the agent analyze what it built against my expectations. I’ve spent hours in conversation just “refining the context” this way, and then I channel that into an update process. I figure the prototype is just about proving out behavior, and this next phase is about refining it into the pieces I’ll use elsewhere. It’s kinda fun, I’d absolutely burn out a coworker if I grilled their PRs the way I roast AI contributions :P

malwrar · 2026-05-30T22:19:54 1780179594

I’m still pretty skeptical about OpenRouter. I have a client implemented for them so I can use them with my harnesses, but at the same time that client was generated and tested in an hour or so just like all of the other llm provider clients that I have. Using these services interchangeably by just swapping out clients has so far been working well for me. I think when it comes down to it, the only real inconvenience that they’re solving is where I put my credit card number. Is there something key that I’m missing about this service (besides it being a nexus of attention) that warrants this kind of investment? Or is this truly the bar for starting a successful AI company :P

malwrar · 2026-05-27T15:26:39 1779895599

I once had someone start _arguing with me_ about stuff using generative text on Slack or generated email replies. Not even to provide information, but just to write flat denials to even continue discussion on the subject. This person had a very distinctive writing style, and the shift to the AI writing style felt pretty obvious and uncanny.

I can’t describe how disturbing it was to realize that my voice suddenly no longer mattered, and that I was speaking to something that would never get tired of creatively dismissing my ideas without ever really addressing them. This behavior compounded and was unaddressed by anyone, no one I talked to seemed willing to try actually pushing back against it. Best solution they had was to have physical meetings w/ n>1 people on each side in the room. Trust plummeted, and eventually all meetings with that team were recorded and transcribed, and people started talking like they were on stage vs trying to solve shared problems. Work ground to a halt on even basic things, and I ended up leaving. This was on a pretty major project that has a name people here would know, but don’t ask me what!

thesamethrowawa · 2026-05-27T16:25:52 1779899152

This is honestly terrifying and a perfect example of where AI can end up leading us.

incognition · 2026-05-28T06:15:04 1779948904

usually it takes lawyers for this!

malwrar · 2026-05-25T11:55:40 1779710140

Lucky find! Just picked up one of those for a build, ohhhh boy was that a painful purchase. Thank god for my fortune to work in tech.

malwrar · 2026-05-09T19:44:38 1778355878

It has been funny to watch people’s attitudes on copyright change ever since ChatGPT blew up. All I used to hear and experience was copyright used by corporations to shut down open source projects threatening their business models, but now it is the savior of the little guy who is a victim of flagrant corporate violators. In the background, the wealthy and powerful disregard all of this and seem to do whatever they want, and the little guy looks at millions of dollars in legal costs to defend themselves in either case. Costs that are increasingly a rounding error to their opposition as they continue to grow by exploiting a broken system, and the “little guy” now includes whole industries.

I feel like adversarial interoperability more than free market capitalism should have been the death knell for most of the negatives highlighted in this post. Everyone is still so determined to make money from mere ideas however that we still use 1700s law designed to protect book publishers to enable the existence of “businesses” so warped in valuation that they are now trillion dollar entities yet always face the existential threat of copy+paste. What if the more profound truth is that tech is beneficial to humanity but inherently worthless to sell, and that our present woe’s shape is determined by the antiquated institutions built to service this illusion of value? In an inevitable future age of generative AI as an accessible technology, as opposed to a business model with a moat, what even is our goal for such institutions? What sorts of creativity do we want to motivate, and what meaningful regulatory constraints even are there to begin with? I hope we figure it out soon, because IP will be impossible to enforce post-deglobalization in any case.

pocksuppet · 2026-05-09T23:44:22 1778370262

Think it's just the hypocrisy. Either copyright for everybody or copyright for nobody is much more defensible than the current state of affairs, where infringing copyright is legal as long as you're rich. Some random guy in Nebraska had to pay $250,000 to a music company for downloading one MP3, but OpenAI can download all music that ever existed and pay nothing. Meanwhile they prosecute "Anna" who did the exact same thing, because "Anna" isn't politically well-connected.

Permit · 2026-05-10T09:11:16 1778404276

> where infringing copyright is legal as long as you're rich.

This isn’t true. A rich person and a poor person can train LLMs on copyrighted material in 2026. How they acquired those materials matters. Wealthy corporations hold no legal advantage in this space. For example, Anthropic recently settled for $1.5 billion due to acquiring books via piracy: https://www.nytimes.com/2025/09/05/technology/anthropic-sett...

My understanding is that an individual could likely pirate the same books without paying a dime (not due to differing legal standards but simply due to the fact it would be hard to identify them in many jurisdictions). In a practical sense it seems corporations are held to a higher standard in this regard.

The discrepancy is that some people equate training a model with piracy even though they are not the same thing. This is typically due to intellectual laziness (refusal to understand the differences) or willful misrepresentation (due to being ideologically opposed to generative AI). No need to make such a mistake here though.

Anamon · 2026-05-10T23:29:35 1778455775

Of course it's not the same thing -- it's way worse.

The piracy comes first, and it's exactly the same thing. GenAI Corp. can't train models on illicitly obtained media before illicitly obtaining said media. And that very thing is already what private individuals got and get sued for millions over.

The GenAI Corp., having gotten away with that unpunished, then goes on to commit further violations by commercially exploiting the media with neither a license to do so, nor any intentions to pay the rights-holders for their use.

By the media conglomerates' own math, these GenAI companies should all be drowning in lawsuits over kazillions of bajillions of dollars.

Permit · 2026-05-11T01:36:50 1778463410

> The piracy comes first, and it's exactly the same thing. GenAI Corp. can't train models on illicitly obtained media before illicitly obtaining said media.

My contention is that this is not happening. Most generative AI companies do not source their training data from illegal torrents and the few that do are currently paying for it. Further, I suspect the companies that get away with it today are _smaller_ not larger.

Training data is typically sourced by scraping the publicly available web.

> Of course it's not the same thing -- it's way worse.

Setting aside your own moral standards here, we should at least be able to agree that from a legal standpoint training a model is not copyright infringement.

DFHippie · 2026-05-11T10:47:42 1778496462

> A rich person and a poor person can train LLMs on copyrighted material in 2026.

Updating an old adage for the modern age:

“The law, in its majestic equality, forbids rich and poor alike to sleep under bridges, to beg in the streets, and to steal their bread.” ― Anatole France

happytoexplain · 2026-05-10T17:57:09 1778435829

As others have said, it's not a change. There's no inconsistency in applying copyright to protect people. When Gigantic Company uses copyright to bully the little guy who isn't doing anything to materially harm Gigantic Company, that's bad. When AI steals the little guy's work, that's bad. They're both bad. That's consistent. It's also obvious that it's consistent - i.e. I don't believe people making the "AI copyright complaints are funny" quip are being honest. I believe they are simply engaging in petty social politics.

krapp · 2026-05-09T20:06:26 1778357186

>It has been funny to watch people’s attitudes on copyright change ever since ChatGPT blew up. All I used to hear and experience was copyright used by corporations to shut down open source projects threatening their business models, but now it is the savior of the little guy who is a victim of flagrant corporate violators.

That isn't a change. Both claims are true.

malwrar · 2026-05-09T20:47:13 1778359633

I agree. My point in short is that we seem to reflexively frame right and wrong on an axis defined by copyright, and somehow we’ve lost sight of the fact that the law itself is used much differently than we might otherwise want.

Technolibertarians mistake free market capitalism via copyright-enabled businesses for a viable strategy for individual freedom, and we find with time that only bastards win in a competition with loose rules and high stakes. Those concerned for the continued flourishing of human creativity in the face of LLMs mistake copyright for a means for small creators to have some ownership over their work, when it actually just seems to be a cudgel that can only be wielded by the wealthiest. Same losing fight, different flavor. I ask: why do we continue to allow “ownership of ideas” to underlie the moral basis of our conversations to begin with?

krapp · 2026-05-09T21:15:36 1778361336

I think it's more that we see copyright as a necessary evil that can be used to defend our rights, but will be abused by the powerful, regardless.

To me, the biggest sin of cyberlibertarianism is the assumption that "cyberspace" is de facto another universe, separate from material reality, that doesn't need to be affected by the mundane and vulgar rules of "meatspace." John Barlow refers to "your governments" as if using a computer actually separates him from the state in some meaningful way, as if he has ascended beyond the flesh and now looks down upon the world as a being of pure Mind. But of course, "cyberspace" is just computers, servers, infrastructure using power and resources and thus is inextricably subject to government and systems of law. Zion was never an escape.

So yes, because cyberspace doesn't actually change the rules of the game, we have to play the game, crooked as it is, with the hand we're dealt. The legal pretense of ownership and copyright is all we have. If you want to abandon the idea of "ownership" altogether, then the wealthiest and most powerful still wind up controlling everything by virtue of their wealth and power. What do you suggest?

DonHopkins · 2026-05-09T21:40:00 1778362800

There’s a great moment from the Pirate Bay trial that captures this:

Lawyer: "When was the first time you met IRL?"

Peter Sunde: "We don't use the expression IRL. We say AFK."

Judge: "IRL?"

Lawyer: "In real life."

Peter Sunde: "We don't like that expression. We say AFK — Away From Keyboard. We think that the internet is for real."

— Peter Sunde, The Pirate Bay trial (as shown in TPB AFK)

TPB AFK: The Pirate Bay Away From Keyboard; Directed by Simon Klose (2013).

https://archive.org/details/TpbAfkThePirateBayAwayFromKeyboa...

https://en.wikipedia.org/wiki/TPB_AFK

flomo · 2026-05-10T03:16:46 1778383006

The whole thing just shows a huge lack of imagination, at least for something which is supposedly a 'founding document'. Barlow's "cyberspace" is for irrelevant shit like furry larping or talking about the latest Deep Space 9. It's not a place where you do banking (or even watch DS9).

malwrar · 2026-05-10T17:02:51 1778432571

> John Barlow refers to "your governments" as if using a computer actually separates him from the state in some meaningful way, as if he has ascended beyond the flesh and now looks down upon the world as a being of pure Mind. But of course, "cyberspace" is just computers, servers, infrastructure using power and resources and thus is inextricably subject to government and systems of law. Zion was never an escape.

I don't understand what you're trying to say here. Is it that "cyberspace" couldn't exist as anything "real" because governments can just shut down servers? That's why you can't buy drugs and credit card numbers online anymore, right? Sarcasm aside, you seem to be using the fallibility of the currently popular physical layer to dismiss the otherwise separate tangible "space" that does seem to exist when lots of people can communicate fluidly with each other across vast distances. Or is your critique centered on the ability of "cyberspace" to go beyond just communication and serve as a space one can actually "live" in?

> The legal pretense of ownership and copyright is all we have. If you want to abandon the idea of "ownership" altogether, then the wealthiest and most powerful still wind up controlling everything by virtue of their wealth and power.

Limiting abandonment of "ownership" to only "copyright" and IP generally, what do you propose the wealthy would control that would allow them to replicate present circumstances in "cyberspace"? The best I can think of would be communications infrastructure, and they didn't build that by themselves (at least in the US) to begin with.

For example, why would TikTok continue to be usable as a brainrot generator & propaganda tool when content is necessarily separate from the algorithm and presentation layers? Current bastards exploit their centralized control based on this house-of-cards ownership structure. Nothing is practically stopping users from cloning the contents from the CDN and writing a new frontend besides legal threats. This is true of almost every tech business that exists, and many of them themselves exploited this asymmetry during their founding. They exist because billionaires use the legal system to scare individual upstarts from threatening their business model.

roenxi · 2026-05-10T00:01:24 1778371284

> It has been funny to watch people’s attitudes on copyright change ever since ChatGPT blew up.

I doubt many individuals actually changed their opinions. Just that a large crowd of previously-silent people decided AI is a threat to them and they can attack it on copyright grounds. The AI revolution is a great argument against copyright law. The US's lax enforcement means that the incredible, world-changing tech could be built before the Luddites got organised to try and stop it. The productive path appears to be illegal, but they took it anyway and we're all the better off for it.

happytoexplain · 2026-05-10T17:58:33 1778435913

The Luddites were rational and correct.

roenxi · 2026-05-12T22:35:11 1778625311

Probably. But not helpful.

Being rational and correct is a low bar; it is likely that all sides of any given debate are rational and correct to some extent. Someone dumping their entire life savings into a casino can be said to be rational and correct if the person doing it genuinely prioritises short term pleasure enough - still a stupid thing to do.

goda90 · 2026-05-10T04:43:13 1778388193

Why do you think we're all better off?

roenxi · 2026-05-10T06:04:10 1778393050

The reasons that jump out at me are that, as a society, we're setting up to produce more stuff with less effort, provide higher quality advice to everyone at an absurdly low cost, revolutionise research, and it looks like we're going to be able to get a step-change improvement in the quality of economic management, which is huge in and of itself. The wins seem like they're going to be big.

goda90 · 2026-05-10T12:24:22 1778415862

> we're setting up to produce more stuff with less effort

According to the Jevons paradox[0], this would lead to more consumption of resources. We're already straining at the limits of the Earth. Depletion and collapse won't be good for anyone.

> provide higher quality advice to everyone at an absurdly low cost

Given every LLM's propensity to hallucinate, the only quality advice is that which can be followed back to a human expert-vetted source. But we already have people who don't check sources and get bad advice.

> revolutionise research

Maybe, but AI is also being used in a mass spread of misinformation.

> a step-change improvement in the quality of economic management

I don't know exactly what you mean by this, but from what I'm seeing so far, this looks like it will massively increase wealth disparity, which is bad for most people.

[0] https://en.wikipedia.org/wiki/Jevons_paradox