If OpenAI employees have an ounce of spine left, they had better demand that Sama take the same stance on this as Dario. No mass surveillance and no autonomous weapons.
You have to be a craven, hollowed out husk of a person if you let the DoD demand your AI be used for killing people or surveillance of Americans. Even if you believe America serves a positive role as world police, even if you're pro-Trump, you just have to see what a terrible precedent this sets.
Here's where I would expect the CEOs of the other AI labs to stand by Anthropic and say no.
1. Compete directly with their highest margin API customers
2. Buy up real businesses with the billions of equity and cash that they're raising
3. Lobby the government for regulations to stifle competition
4. Beg for a bailout or 0% loans when the music stops
5. Follow the Zuck playbook of copying competitors, spying on their users, spamming and addicting them, then squeezing the whales out of everything that they have (there's a reason why Anthropic and OpenAI have a bunch of Facebook execs leading their product groups)
It's the new underpaid employee that you're training to replace you.
People need to understand that we have the technology to train models to do anything you can do on a computer; the only thing that's missing is the data.
If you can record a human doing anything on a computer, we'll soon have a way to automate it.
My only objection here is that technology won't save us unless we also have a voice in how it is used. I don't think personal adaptation is enough for that. We need to adapt the ways we engage with power.
Both abundance and scarcity can be bad. If you can't imagine a world where an abundance of software is a very bad thing, I'd suggest you have a limited imagination.
Aggressively expanding solar would make electrical power a solved problem, and other previously hard-to-abate sources of kinetic energy are innovating to use it instead of fossil fuels.
It’s not worth it because we don’t have the Star Trek culture to go with it.
Given current political and business leadership across the world, we are headed to a dystopian hellscape and AI is speeding up the journey exponentially.
It's a strange, morbid economic dependency. AI companies promise incredible things, but AI agents cannot produce them themselves; they need to eat you slowly first.
Exactly. If there's any opportunity around AI it goes to those who have big troves of custom data (Google Workspace, Office 365, Adobe, Salesforce, etc.) or consultants adding data capture/surveillance of workers (especially high paid ones like engineers, doctors, lawyers).
> the new underpaid employee that you're training to replace you.
and who is also compiling a detailed log of your every action (and inaction) into a searchable data store -- which will certainly never, NEVER be used against you
i've been working in this field for a very long time, i promise you, if you can collect a dataset of a task you can train a model to repeat it.
the models do an amazing job interpolating and i actually think the lack of extrapolation is a feature that will allow us to have amazing tools and not as much risk of uncontrollable "AGI".
look at seedance 2.0, if a transformer can fit that, it can fit anything with enough data
How much practice have you got with agentic assistance in software development? Which rough edges, surprising failure modes, and unexpected strengths and weaknesses have you already identified?
How much do you wish someone else had done your favorite SOTA LLM's RLHF?
This benchmark doesn't have the latest models from the last two months, but Gemini 3 (with no tools) is already at 1750-1800 FIDE, which is probably around 1900-2000 USCF (about USCF expert level). This is enough to beat almost everyone at your local chess club.
Wait, I may be missing something here. These benchmarks are gathered by having models play each other, and the second illegal move forfeits the game. This seems like a flawed method, as the models that are more prone to illegal moves are going to bump up the ratings of the models that are less prone to them.
Additionally, how do we know the model isn't benchmaxxed to eliminate illegal moves?
For example, here is the list of games by Gemini-3-pro-preview. In 44 games it performed 3 illegal moves (if I counted correctly) but won 5 because its opponents forfeited due to illegal moves.
I suspect the ratings here may be significantly inflated due to a flaw in the methodology.
EDIT: I want to suggest a better methodology here (I am not gonna do it; I really really really don’t care about this technology). Have the LLMs play rated engines and rated humans, the first illegal move forfeits the game (same rules apply to humans).
The LLMs do play rated engines (maia and eubos). They provide the baselines. Gemini e.g. consistently beats the different maia versions.
The rest is taken care of by elo. That is they then play each other as well, but it is not really possible for Gemini to have a higher elo than maia with such a small sample size (and such weak other LLMs).
Elo doesn't let you inflate your score by playing low ranked opponents if there are known baselines (rated engines) because the rated engines will promptly crush your elo.
You could add humans into the mix, the benchmark just gets expensive.
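The standard Elo update makes this concrete. A minimal sketch (function names are mine; the K-factor of 32 is a common convention, not necessarily what this benchmark uses):

```python
# Sketch of the standard Elo update rule.
def expected_score(r_a, r_b):
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def updated_rating(r_a, r_b, score, k=32):
    """New rating for A after scoring `score` (1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score - expected_score(r_a, r_b))

# Beating a much weaker opponent barely moves your rating...
gain = updated_rating(1800, 1000, 1.0) - 1800
# ...while losing to an equally rated anchor engine costs far more.
loss = 1800 - updated_rating(1800, 1800, 0.0)
```

With these numbers the win over the 1000-rated opponent gains well under one point, while the loss to the 1800-rated anchor costs 16, which is why farming weak LLM opponents can't inflate a rating past what the anchor engines allow.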
I did indeed miss something. I learned after posting (but before my EDIT) that there are anchor engines that they play.
However these benchmarks still have flaws. The two illegal moves = forfeit is an odd rule which the authors of the benchmarks (which in this case was Claude Code) added[1] for mysterious reasons. In competitive play if you play an illegal move you forfeit the game.
Second (and this is a minor one) Maia 1900 is currently rated at 1774 on lichess[2], but is 1816 on the leaderboard, to the author’s credit they do admit this in their methodology section.
Third, and this is a curiosity: gemini-3-pro-preview seems to have played the same game twice against Maia 1900[3][4], and in both cases Maia 1900 blundered (quite suspiciously, might I add) mate in one when in a winning position with Qa3?? Another curiosity about this game: Gemini consistently played the top 2 moves on lichess. Until 16. ...O-O! (which has never been played on lichess), Gemini had played the most popular lichess move 14 times, and the second most popular twice. That said, I’m not gonna rule out that the fact that this game is listed twice might stem from an innocent data entry error.
And finally, apart from Gemini (and Survival bot for some reason?), LLMs seem unable to pass Maia-1100 (rated 1635 on lichess). The only anchor bot before that is random bot. And predictably LLMs cluster on both sides of it, meaning they play as well as random (apart from the illegal moves). This smells like benchmaxxing from Gemini. I would guess that the entire lichess repertoire features prominently in Gemini’s training data, and the model has memorized it really well. And is able to play extremely well if it only has to play 5-6 novel moves (especially when their opponent blunders checkmate in 1).
> The two illegal moves = forfeit is an odd rule which the authors of the benchmarks (which in this case was Claude Code) added[1] for mysterious reasons. In competitive play if you play an illegal move you forfeit the game.
This is not true. This is clearly spelled out in FIDE rules and is upheld at tournaments. First illegal move is a warning and reset. Second illegal move is forfeit. See here https://rcc.fide.com/article7/
I doubt GDM is benchmarkmaxxing on chess. Gemini is a weird model that acts very differently from other LLMs so it doesn't surprise me that it has a different capability profile.
>> 7.5.5 After the action taken under Article 7.5.1, 7.5.2, 7.5.3 or 7.5.4 for the first completed illegal move by a player, the arbiter shall give two minutes extra time to his/her opponent; for the second completed illegal move by the same player the arbiter shall declare the game lost by this player. However, the game is drawn if the position is such that the opponent cannot checkmate the player’s king by any possible series of legal moves.
I stand corrected.
I’ve never actually played competitive chess; I’ve just heard this from people who do. And I thought I remembered a case in the Icelandic championship where a player touched one piece but moved another, and was subsequently made to forfeit the game.
Replying in a split thread to clearly separate where I was wrong.
If Gemini is so good at chess because of a non-LLM feature of the model, then it is kind of disingenuous to rate it as an LLM and claim that LLMs are approaching 2000 ELO. But the fact it still plays illegal moves sometimes, is biased towards popular moves, etc. makes me think that chess is still handled by an LLM, and makes me suspect benchmaxxing.
But even if no foul play, and Gemini is truly a capable chess player with nothing but an LLM underneath it, then all we can conclude is that Gemini can play chess well, and we cannot generalize to other LLMs who play about the level of random bot. My fourth point above was my strongest point. There are only 4 anchor engines: one beats all LLMs, the second beats all except Gemini, the third beats all LLMs except Gemini and Survival bot (what is Survival bot even doing there?), and the fourth is random bot.
Gemini is an LLM. It playing chess is not relying on a non-LLM module of some sort. I'm just saying that as an LLM, Gemini has a peculiar profile compared to other LLMs (likely an artifact of its post-training process). In particular Gemini is very capable, but also quite misaligned (it will more often actively sabotage users).
> then all we can conclude is that Gemini can play chess well, and we cannot generalize to other LLMs who play about the level of random bot
That's overly reductive. That would be true if we didn't see improvement over time from the other LLMs but we clearly do. In particular, even if Gemini is benchmarkmaxxing, this means that LLMs from other labs will eventually get there as well. Benchmarkmaxxing can be thought of as "premature" reaching of benchmarks. But I can't think of a single benchmark that was benchmarkmaxxed that wasn't eventually saturated by every single LLM provider (because being able to benchmarkmaxx serves as an existence proof that there is an LLM capable of it and as more training gets done on the LLMs the other ones get there).
The problem with benchmaxxing is that it lies about the capabilities of the technology. If all we wanted was a machine that plays chess, we would just use a chess engine, which we have known how to make for decades. If Google wanted Gemini to be able to play chess, it would be much easier (and better; and a helluva lot cheaper) to stick a traditional chess engine into their product and defer all chess to that engine.
The claim here (way up thread) was: “we have the technology to train models to do anything that you can do on a computer, only thing that's missing is the data”, and the implication is that logic and reasoning are emergent properties of these models given enough data and enough parameters. However, the evidence seems to suggest otherwise. Logic and reasoning have to be specifically programmed into these models, and even with a dataset as vast as online chess games (lichess alone has 7.1 billion games), if that claim above were true, chess should be easy for LLMs, but it obviously isn’t. And that tells us something about the limitations of the technology.
That’s a devastating benchmark design flaw. Sick of these bullshit benchmarks designed solely to hype AI. AI boosters turn around and use them as ammo, despite not understanding them.
Relax. Anyone who's genuinely interested in the question will see with a few searches that LLMs can play chess fine, although the post-trained models mostly seem to be regressed. Problem is people are more interested in validating their own assumptions than anything else.
This exact game has been played 60 thousand times on lichess. The piece sacrifice Grok performed on move 6 has been played 5 million times on lichess. Every single move Grok made is also the top played move on lichess.
This reminds me of Stefan Zweig’s The Royal Game where the protagonist survived Nazi torture by memorizing every game in a chess book his torturers dropped (excellent book btw. and I am aware I just committed Godwin’s law here; also aware of the irony here). The protagonist became “good” at chess, simply by memorizing a lot of games.
1800 FIDE players do make illegal moves. I believe they make about one to two orders of magnitude fewer illegal moves than Gemini 3 does here. IIRC the usual statistic for expert chess play is that about 0.02% of expert chess games have an illegal move (I can look that up later if there's interest, to be sure), but that counts only the ones that made it into the final game notation (and weren't e.g. corrected at the board by an opponent or arbiter). So that should be a lower bound (hence why it could be up to one order lower, although I suspect two orders is still probably closer to the truth).
Whether or not we'll see LLMs continue to get a lower error rate to make up for those orders of magnitude remains to be seen (I could see it go either way in the next two years based on the current rate of progress).
I think LLMs are just fundamentally the wrong AI technique for games like this. You don't want a prediction for the next move; you want the best move given knowledge of how things would play out 18 moves ahead if both players played optimally. Outside of academic interest/curiosity, there isn't really a reason to use LLMs for chess other than thinking LLMs will turn into AGI (I doubt it).
A player at that level making an illegal move is either tired, distracted, drunk, etc. An LLM makes it because it does not really "understand" the rules of chess.
I suspect the majority of these illegal moves are in blitz or bullet tournaments in game 12 of the third day, where the player touches one piece but moves another, or hits the clock with the hand that didn’t make the move, or hits the clock without making a move. I don’t think any expert-level chess player grabs a captured rook and places it on the board, or moves a light-squared bishop to a dark square, unless they are hustling at the park, in which case (it can be argued) moves like this with a sleight of hand are part of the game.
Why do we care about this? Chess AI has long been a solved problem, and LLMs are just an overly brute-forced approach. They will never become very efficient chess players.
The correct solution is to have a conventional chess AI as a tool and use the LLM as a front end for humanized output. A software engineer who proposes just doing it all via raw LLM should be fired.
And so far I am only convinced that they have succeeded in appearing to have generalized reasoning. That is, when an LLM plays chess it is performing Searle’s Chinese room thought experiment while claiming to pass the Turing test.
It's not entirely clear how LLMs that can play chess do so, but it is clearly very different from the way other machines do it. They construct a board, they can estimate a player's skill and adjust accordingly, and unlike other machines, and similarly to humans, they are sensitive to how a certain position came to be when predicting the next move.
It’s very clear how: chess moves and positions are vector-encoded in their training data, and when they are prompted with a certain board state, they respond with the most probable response to it. There is no reasoning.
Because of how LLMs work. I don't know exactly how they're using it for chess, but here's a guess. If you consider the chess game a "conversation" between two opponents, the moves written out would be the context window. So you're asking the LLM, "given these last 30 moves, what's the most likely next move?". I.e., you're giving it a string like "1. e4 e5, 2. Nf3 Nc6, 3. Bb5 a6, 4..?".
That's basically what you're doing with LLMs in any context "Here's a set of tokens, what's the most likely continuation?". The problem is, that's the wrong question for a chess move. If you're going with "most likely continuation", that will work great for openings and well-studied move sequences (there are a lot of well studied move sequences!), however, once the game becomes "a brand new game", as chess streamers like to say when there's no longer a game in the database with that set of moves, then "what's the most likely continuation from this position?" is not the right question.
Non-LLM AIs have obviously mastered chess, so it doesn't really matter -- but I think chess shows how LLMs' lack of a world model, as Gary Marcus would say, is a problem.
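The "conversation" framing above can be sketched in a few lines. This is illustrative only (the helper name is mine, and a real harness would feed the resulting string to whatever completion API it uses):

```python
# Sketch: casting a chess game as a next-token continuation problem.
# The model never sees a board, only the PGN-style move text so far.
def moves_to_prompt(moves):
    """Turn a flat list of SAN moves into a PGN-style prompt string.

    Assumes an even number of half-moves, i.e. it is White's turn.
    """
    parts = []
    for i in range(0, len(moves), 2):
        pair = f"{i // 2 + 1}. {moves[i]}"
        if i + 1 < len(moves):
            pair += f" {moves[i + 1]}"
        parts.append(pair)
    # End the prompt at the next move number so the model "continues" the game.
    return " ".join(parts) + f" {len(moves) // 2 + 1}."

prompt = moves_to_prompt(["e4", "e5", "Nf3", "Nc6", "Bb5", "a6"])
# prompt == "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4."
```

The point of the sketch: the model is asked for the most likely continuation of that string, which is exactly the wrong question once the position leaves the well-trodden part of the opening book.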
Hm.. but do they need it? At this point, we do have custom tools that beat humans. In a sense, all LLMs need is a way to connect to that tool (and the same is true for counting and many other aspects).
Yeah, but you know that manually telling the LLM to operate other custom tools is not going to be a long-term solution. And if an LLM could design, create, and operate a separate model, and then return/translate its results to you, that would be huge, but it also seems far away.
But I'm ignorant here. Can anyone with a better background of SOTA ML tell me if this is being pursued, and if so, how far away it is? (And if not, what are the arguments against it, or what other approaches might deliver similar capacities?)
This has been happening for the past year on verifiable problems (did the change you made in your codebase work end-to-end, does this mathematical expression validate, did I win this chess match, etc...). The bulk of data, RL environment, and inference spend right now is on coding agents (or broadly speaking, tool use agents that can make their own tools).
this is mostly because RLVR is driving all of the recent gains, and you can continue improving the model by running it longer (+ adding new tasks / verifiers)
so we'll keep seeing more frequent flag-planting checkpoint releases so that no one can claim SOTA for too long
also i wasn't concerned about open chinese models till the latest iteration of agentic models.
most open claw users have no idea how easy it is to add backdoors to these models, and now they're getting free rein on your computer to do anything they want.
the risks were minimal with the last generation of chat models, but now that they do tool calling and long-horizon execution with little to no supervision, it's going to become a real problem
The only remaining risk? Considering the wide range of bad actors and their intent, stealing your API keys is the last thing I'd worry about. People have ended up in prison for things done on their computers, usually by them.
This is genuinely the only way to do it now in a way that will not virtually guarantee some new and exciting ways to subvert your system. I briefly toyed with the idea of giving the agent a VM playground, but I scrapped it after a while. I gave mine an old (by today's standards) Pentium box and a small local model to draw from, but, in truth, the only thing it really does is limit the amount of damage it can cause. The underlying issue remains in place.
it's not the end of software, there will be infinitely more of it
it's the end of 80-90% margins that the valley coasted on for the last 20 years. Salesforces of the world will not lose to an LLM, they will lose to thousands of tiny teams that outship them and beat them on cost
instead of 7 figure contracts you'll have customized tailored tools for enterprises, and on the other end you'll have a custom nearly free CRM for every persona
this also means that VCs will stop investing in it, unless it's a platform with network effects and heavy lock in
Alternative take, in light of upthread -- Salesforce, SAP, et al. are positioned to be the biggest beneficiaries of this.
Because their product is actually two things: (1) a UI/app & (2) a highly curated data model.
My imagined future... they just stop building (1), or invest much less in it, and focus on (2).
If they can build a compelling data foundation (ingest / processing / storage / exposing) + do much less work to still cover 80% of UI functionality + offload the remaining 20% of work onto customers, that looks defensible financially and strategically.
There's a ton of feature requests that are driven by a few customers. Aka the "You're using it wrong. We don't care, we want it to do X" cases
There are very few VP+'s out there that would take on strategic data-integrity risk in exchange for anything, and as new SaaS code quality likely goes down (let's be honest), the imprimatur of a "known name" on the data side becomes more important.
Agreed, and here's a real example from a tiny startup: Clickup's web app is too damn slow and bloated with features and UI, so we created emacs modes to access and edit Clickup workspaces (lists, kanban boards, docs, etc) via the API. Just some limited parts we care about. I was initially skeptical that it would work well or at all, but wow, it really has significantly improved the usefulness of Clickup by removing barriers.
Sure, depending on the particular product, having control and direct local access to the data would be desirable or deal breaker. But for this Clickup integration that's not so important to us (we can duplicate where necessary), while still using Clickup lets us use all the other features we can get via the web app.
> unless it's a platform with network effects and heavy lock in
I'm always slightly amused when buzzwords are thrown around vaguely such as "network effect" and "lock in". Those are not entirely a matter of a better sales pitch or bandwagoning. They're about the actual product.
> they will lose to thousands of tiny teams that outship them and beat them on cost
They won't, but this is the actual reason. Nobody likes dealing with support or maintenance, and having to reach out to tiny teams is death by a million papercuts for the end user too. The established players such as Salesforce, ServiceNow, etc. have a mature product that justifies the 7-figure contract price, and there are always lower tiers of the same product for those who are that price sensitive.
i'm talking about ubers, airbnbs, amazons, googles and facebooks of the world, marketplace software that aggregates supply and demand
> They won't, but this is the actual reason. Nobody likes dealing with support or maintenance, and having to reach out to tiny teams is death by a million papercuts for the end user too.
you will have thousands of linear like products eating the slow moving jiras of the world. great small product driven teams, not slop thrown together by your mom
AI raises the ceiling much further than the floor and it raises the floor a ton. the best software, movies, etc will still be produced by experts in their field, they'll just be able to do way more for less.
the bottleneck at large orgs is communication already, this will get even worse when time to produce stuff goes way down. big cos will drown in slop and are probably better off starting from scratch
Was it just me, or did Opus start producing incredibly long responses before the crash? I was asking basic questions and it wouldn't stop trying to spit out full codebases' worth of unrelated code. For some very simple questions about database schemas it ended up compacting twice in a 3-message conversation.
We really need new hardware optimized for sparse compute. Deep learning models would work way better with much higher-dimensional sparse vectors, but current hardware only excels at dense GEMMs and structured sparsity.
For what it's worth, we think it's unfortunately quite unlikely that frontier models will ever be trained with extreme unstructured sparsity, even with custom sparsity optimized hardware. Our main hope is that understanding sub-frontier models can still help a lot with ensuring safety of frontier models; an interpretable GPT-3 would be a very valuable object to have. It may also be possible to adapt our method to only explaining very small but important subsets of the model.
yeah it's not happening anytime soon, especially with the whole economy betting trillions of dollars on brute-force scaling of transformers on Manhattan-sized GPU farms that will use more energy than most Midwestern states.
Brains do it somehow, so sparsely / locally activated architectures are probably the way to go long term, but we're decades away from that being commercially viable.
I'm not an expert at hardware, so take this with a grain of salt, but there are two main reasons:
- Discrete optimisation is always going to be harder than continuous optimization. Learning the right sparsity mask is fundamentally a very discrete operation. So even just matching fully continuous dense models in optimization efficiency is likely to be difficult. Though perhaps we can get some hope from the fact that MoE is also similarly fundamentally discrete, and it works in practice (we can think of MoE as incurring some penalty from imperfect gating, which is more than offset by the systems benefits of not having to run all the experts on every forward pass). Also, the optimization problem gets harder when the backwards pass needs to be entirely sparsified computation (see appendix B).
- Dense matmuls are just fundamentally nicer to implement in hardware. Systolic arrays have nice predictable data flows that are very local. Sparse matmuls with the same number of flops nominally only need (up to a multiplicative factor) the same memory bandwidth as an equivalent dense matmul, but they need to be able to route data from any memory unit to any vector compute unit - the locality of dense matmuls means that the computation of each tile only requires a small slice of both input matrices, so we only need to load those slices into shared memory; on the other hand, because GPU-to-GPU transfers are way slower, when we op-shard matmuls, we replicate the data that is needed. Sparse matmuls would need either more replication within each compute die, or more all-to-all internal bandwidth. This means spending way more die space on huge crossbars and routing. This would cost a lot of die space, though thankfully, the crossbars consume much less power than actual compute, so perhaps this could match dense in energy efficiency and not make thermals worse.
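To make the "discrete" point concrete, here is a minimal sketch of magnitude pruning, about the crudest way to pick a sparsity mask (function names are mine): choosing which k entries survive is a combinatorial decision, with no gradient smoothly connecting one mask to a neighboring one.

```python
# Sketch: magnitude pruning as the simplest sparsity-mask selection.
# Picking which k weights survive is a discrete choice; there is no
# gradient that continuously deforms one mask into another.
def top_k_mask(weights, k):
    """Return a 0/1 mask keeping the k largest-magnitude weights."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(weights))]

def apply_mask(weights, mask):
    """Zero out the weights the mask dropped."""
    return [w * m for w, m in zip(weights, mask)]

w = [0.9, -0.05, 0.4, -1.2, 0.01]
mask = top_k_mask(w, 2)   # keeps -1.2 and 0.9
```

Methods like RigL make this choice repeatedly during training, which is exactly the discrete-inside-continuous optimization problem the bullet above is pointing at.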
It also seems very likely that once we create the interpretable GPT-1 (or 2, or 3) we will find that making everything unstructured sparse was overkill, and there are much more efficient pretraining constraints we can apply to models to 80/20 the interpretability. In general, a lot of my hope routes through learning things like this from the intermediate artifact (interpretable GPT-n).
To be clear, it doesn't seem literally impossible that with great effort, we could create custom hardware, and vastly improve the optimization algorithms, etc, such that weight-sparse models could be vaguely close in performance to weight-dense models. It's plausible that with better optimization the win from arbitrary connectivity patterns might offset the hardware difficulties, and I could be overlooking something that would make the cost less than I expect. But this would require immense effort and investment to merely match current models, so it seems quite unrealistic compared to learning something from interpretable GPT-3 that helps us understand GPT-5.
Yes it would require completely new hardware and most likely ditching gradient descent for alternative optimization methods, though I'm not convinced that we'd need to turn to discrete optimization.
Some recent works that people might find interesting:
A note on the hardware part: it does not require NN-specific hardware akin to neuromorphic chips. Sparse-compute-oriented architectures have already been developed for other reasons, such as large-scale graph analysis or inference. It will still require significant effort to use them to train large models, but it would not be starting from scratch.
Yes! I'd been advocating for it inside the industry for a decade, but it is an uphill battle. The researchers can't easily publish that kind of work (even Google researchers) because you don't have the hardware that can realistically train decently large models. The hardware companies don't want to take the risk of rethinking CPU or accelerator architectures for sparse compute because there are no large existing customers.
There also needs to be tools that can author that code!
I'm starting to dust off some ideas I developed over a decade ago to build such a toolkit. Recently realized "egads, my stuff can express almost every major GPU/CPU optimization that's relevant for modern deep learning... need to do a new version with an eye towards adoption in that area". Plus every flavor of sparse.
Also need to figure out if some of the open-core ideas I have in mind would be attractive to early-stage investors who focus on the so-called deep tech end of the space. Definitely looks like I'll have to do ye olde "ask friends and acquaintances if they can point me to those folks" approach, since cold outreach historically is full of fail.
There has been plenty of evidence over the years. I don't have my bibliography handy right now, but you can find it by looking for sparse training or lottery ticket hypothesis papers.
The intuition is that ANNs make better predictions on high-dimensional data, that sparse weights can train the sparsity pattern as you train the weights, that the effective part of dense models is actually sparse (cf. pruning/sparsification research), and that dense models grow too much in compute complexity to further increase model dimension sizes.
If you can give that bibliography I'd love to read it. I have the same intuition and a few papers seem to support it but more and explicit ones would be much better.
What do you mean by work better here? If it's for better accuracy then no they are not better at the same weight dimensions.
The big thing is that sparse models allow you to train models with significantly larger dimensionality, blowing up the dimensions several orders of magnitudes. More dimensions leading to better results does not seem to be under a lot of contention, the open questions are more about quantifying that. It's simply not shown experimentally because the hardware is not there to train it.
> The big thing is that sparse models allow you to train models with significantly larger dimensionality, blowing up the dimensions several orders of magnitudes.
Do you have any evidence to support this statement? Or are you imagining some not yet invented algorithms running on some not yet invented hardware?
Sparse matrices can increase in dimension while keeping the same number of non-zeroes; that part is self-evident. Sparse-weight models can be trained; you are probably already aware of RigL and SRigL, and there is similar related work on unstructured and structured sparse training. You could argue that those adapt their algorithms to be executable on GPUs and that none are training at x100 or x1000 dimensions. Yes, that is the part that requires access to sparse-compute hardware acceleration, which exists as prototypes [1] or is extremely expensive (Cerebras).
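The "same number of non-zeroes at much higher dimension" point is easy to see with a toy coordinate-format sparse matvec (illustrative only, not how real kernels store things): the arithmetic scales with the non-zero count, not with the nominal matrix size.

```python
# Sketch: a sparse matvec over a (row, col) -> value dict.
# Work is proportional to the number of non-zeroes (nnz), so the
# nominal dimension can grow by orders of magnitude at constant cost.
def sparse_matvec(nonzeros, x, n_rows):
    y = [0.0] * n_rows
    for (i, j), w in nonzeros.items():
        y[i] += w * x[j]
    return y

# Three non-zero weights in a nominally 1,000,000-column layer:
# the matvec below does exactly three multiply-adds.
n = 1_000_000
weights = {(0, 0): 2.0, (0, 999_999): -1.0, (5, 3): 0.5}
x = [0.0] * n
x[0], x[3], x[999_999] = 1.0, 4.0, 2.0
y = sparse_matvec(weights, x, n_rows=6)
```

The hardware problem discussed upthread is that, unlike a dense tile, those three accesses can land anywhere in memory, which is what makes the routing expensive.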
Unstructured sparsity cannot be implemented in hardware efficiently if you still want to do matrix multiplication. If you don’t want to do matrix multiplication you first need to come up with new algorithms, tested in software. This reminds me of what Numenta tried to do with their SDRs - note they didn’t quite succeed.
> Unstructured sparsity cannot be implemented in hardware efficiently if you still want to do matrix multiplication.
Hard disagree. It certainly is an order of magnitude harder to design hardware for sp x sp MM, yes; it requires a paradigm shift to do sparse compute efficiently, but there are hardware architectures, both in research and commercially available, that do it efficiently. The same kind of architecture is needed to scale op-graph compute. You see solutions at the smaller scale in FPGA and reconfigurable/dataflow accelerators, and at larger scale in Intel's PIUMA and Cerebras. I've been involved in co-design work on GraphBLAS on the software side and one of the aforementioned hardware platforms: the main issue with developing SpMSpM hardware lies more with the necessary capital and engineering investments being prioritized toward current frontier AI model accelerators, not with a lack of proven results.
All of the best open source LLMs right now use mixture-of-experts, which is a form of sparsity. They only use a small fraction of their parameters to process any given token.
Mixture of experts sparsity is very different from weight sparsity. In a mixture of experts, all weights are nonzero, but only a small fraction get used on each input. On the other hand, weight sparsity means only very few weights are nonzero, but every weight is used on every input. Of course, the two techniques can also be combined.
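To make the MoE side of that distinction concrete, here's a toy sketch (all names illustrative): every expert's weights are dense and nonzero, but only the top-k experts chosen by the router do any work on a given input.

```python
# Toy MoE forward pass: dense experts, sparse *activation* of them.
def route_top_k(scores, k):
    """Indices of the k highest-scoring experts."""
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

def moe_forward(experts, scores, x, k=2):
    active = route_top_k(scores, k)
    out = [0.0] * len(experts[0])
    for i in active:                      # the non-routed experts do zero work
        for r, row in enumerate(experts[i]):
            out[r] += scores[i] * sum(w * xj for w, xj in zip(row, x))
    return out

experts = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]]   # 3 experts, each 1x2
print(moe_forward(experts, scores=[0.1, 0.7, 0.2], x=[2.0, 3.0], k=2))
```

Note that every weight in every expert is nonzero; the sparsity is entirely in which experts run, which is exactly the contrast with weight sparsity drawn above.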
Correct. I was more focused on giving an example of sparsity being useful in general, because the comment I was replying to didn't specify which kind of sparsity.
For weight sparsity, I know the BitNet b1.58 paper claims improved performance by restricting weights to -1, 0, or 1, which eliminates the need to multiply by the weights and lets weights with a value of 0 be ignored entirely.
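The ternary idea is easy to sketch (this is just the arithmetic trick, not the paper's actual kernel or quantization scheme): with weights in {-1, 0, +1}, a "multiply" becomes add, skip, or subtract.

```python
# Hedged sketch of ternary (BitNet-style) matvec: no multiplications needed.
def ternary_matvec(W, x):
    out = []
    for row in W:
        acc = 0.0
        for w, xj in zip(row, x):
            if w == 1:        # +1 weight: add the input
                acc += xj
            elif w == -1:     # -1 weight: subtract the input
                acc -= xj
            # w == 0: skipped entirely, no work done
        out.append(acc)
    return out

W = [[1, 0, -1],
     [0, 1, 1]]
x = [2.0, 3.0, 5.0]
print(ternary_matvec(W, x))   # [-3.0, 8.0]
```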
Another kind of sparsity, while we're on the topic, is activation sparsity. I think there was an Nvidia paper that used a modified ReLU activation function to set more of the model's activations to 0.
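A minimal illustration of that idea (the actual Nvidia approach differs in the details, and the threshold here is made up): a shifted ReLU zeroes small activations, and the next layer can then skip the zero entries entirely.

```python
# Sketch of activation sparsity: threshold the ReLU, skip the zeros downstream.
def relu_thresholded(xs, tau=0.5):
    """Like ReLU, but anything at or below tau becomes exactly 0."""
    return [x if x > tau else 0.0 for x in xs]

def next_layer(W, acts):
    # only iterate over the surviving (nonzero) activations
    nz = [(j, a) for j, a in enumerate(acts) if a != 0.0]
    return [sum(W[r][j] * a for j, a in nz) for r in range(len(W))]

acts = relu_thresholded([0.1, 0.9, -2.0, 0.6], tau=0.5)
print(acts)   # half the entries are exactly zero, so half the work is skipped
```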
“Useful” does not mean “better”. It just means “we could not do dense”. All modern state-of-the-art models use dense layers (dense in both weights and inputs). Quantization is also used to make models smaller and faster, but never better in terms of quality.
Based on all examples I’ve seen so far in this thread it’s clear there’s no evidence that sparse models actually work better than dense models.
Yes, mixture of experts is basically structured activation sparsity. You could imagine concatenating the expert matrices into a huge block matrix and multiplying by an input vector where only the coefficients corresponding to activated experts are nonzero.
From that perspective, it's disappointing that the paper only enforces modest amounts of activation sparsity, since holding the maximum number of nonzero coefficients constant while growing the number of dimensions seems like a plausible avenue to increase representational capacity without correspondingly higher computation cost.
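That block-matrix perspective can be sketched directly (illustrative code, not from the paper): concatenated expert blocks form one big matrix, the input coefficient vector is zero outside the active experts, and the work scales with the number of active experts rather than the total count.

```python
# MoE as block-sparse matvec: cost tracks active experts, not total experts.
def block_sparse_matvec(blocks, coeffs, x):
    """blocks[i]: one expert's matrix; coeffs[i]: its gate coefficient."""
    out = [0.0] * len(blocks[0])
    flops = 0
    for b, c in zip(blocks, coeffs):
        if c == 0.0:
            continue                      # inactive expert: skipped entirely
        for r, row in enumerate(b):
            out[r] += c * sum(w * xj for w, xj in zip(row, x))
            flops += len(row)
    return out, flops

blocks = [[[1.0, 2.0]] for _ in range(1000)]    # 1000 experts...
coeffs = [0.0] * 1000
coeffs[3], coeffs[42] = 0.5, 0.5                # ...only 2 active
y, flops = block_sparse_matvec(blocks, coeffs, [1.0, 1.0])
print(y, flops)    # cost ~ 2 experts' worth of multiplies, not 1000
```

Holding the active count fixed while growing the number of blocks is exactly the "more dimensions at constant compute" avenue described above.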
EDIT: don't have time to write it up, but here's gemini 3 with a short explanation:
To simulate the brain's efficiency using Transformer-like architectures, we would need to fundamentally alter three layers of the stack: the *mathematical representation* (moving to high dimensions), the *computational model* (moving to sparsity), and the *physical hardware* (moving to neuromorphic chips).
Here is how we could simulate a "Brain-Like Transformer" by combining High-Dimensional Computing (HDC) with Spiking Neural Networks (SNNs).
### 1. The Representation: Hyperdimensional Computing (HDC)
Current Transformers use "dense" embeddings—e.g., a vector of 4,096 floating-point numbers (like `[0.1, -0.5, 0.03, ...]`). Every number matters.
To mimic the brain, we would switch to *Hyperdimensional Vectors* (e.g., 10,000+ dimensions), but make them *binary and sparse*.
* **Holographic Representation:** In HDC, concepts (like "cat") are stored as massive randomized vectors of 1s and 0s. Information is distributed "holographically" across the entire vector. You can cut the vector in half, and it still retains the information (just noisier), similar to how brain lesions don't always destroy specific memories.
* **Math without Multiplication:** In this high-dimensional binary space, you don't need expensive floating-point matrix multiplication. You can use simple bitwise operations:
* **Binding (Association):** XOR operations (`A ⊕ B`).
* **Bundling (Superposition):** Majority rule (voting).
* **Permutation:** Bit shifting.
* **Simulation Benefit:** This allows a Transformer to manipulate massive "context windows" using extremely cheap binary logic gates instead of energy-hungry floating-point multipliers.
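Those three primitives are small enough to sketch with Python integers standing in for bit vectors (a toy, not a real HDC system; the dimension and seed are arbitrary):

```python
# Toy hyperdimensional computing: binding = XOR, bundling = bitwise majority,
# permutation = rotation, similarity = Hamming distance.
import random

D = 10_000
MASK = (1 << D) - 1
rng = random.Random(0)

def rand_hv():
    """A random D-bit hypervector."""
    return rng.getrandbits(D)

def bind(a, b):                 # association; self-inverse: bind(bind(a,b),b) == a
    return a ^ b

def bundle3(a, b, c):           # superposition of three vectors by per-bit majority
    return (a & b) | (b & c) | (a & c)

def permute(v, k=1):            # position/sequence encoding via bit rotation
    return ((v << k) | (v >> (D - k))) & MASK

def hamming(a, b):              # small distance = similar
    return bin(a ^ b).count("1")

cat, animal = rand_hv(), rand_hv()
pair = bind(cat, animal)
assert bind(pair, animal) == cat           # unbinding recovers "cat" exactly
print(hamming(cat, rand_hv()))             # unrelated vectors differ in ~half their bits
```

The "holographic" robustness claim shows up here too: flipping a few thousand of the 10,000 bits of `cat` still leaves it far closer to `cat` than to any unrelated vector.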
### 2. The Architecture: "Spiking" Attention Mechanisms
Standard Attention is $O(N^2)$ because it forces every token to query every other token. A "Spiking Transformer" simulates the brain's "event-driven" nature.
* **Dynamic Sparsity:** Instead of a dense matrix multiplication, neurons would only "fire" (send a signal) if their activation crosses a threshold. If a token's relevance score is low, it sends *zero* spikes. The hardware performs *no* work for that connection.
* **The "Winner-Take-All" Circuit:** In the brain, inhibitory neurons suppress weak signals so only the strongest "win." A simulated Sparse Transformer would replace the Softmax function (which technically keeps all values non-zero) with a **k-Winner-Take-All** function.
* *Result:* The attention matrix becomes 99% empty (sparse). The system only processes the top 1% of relevant connections, similar to how you ignore the feeling of your socks until you think about them.
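The softmax-to-k-WTA swap can be sketched in a few lines (illustrative pseudocode-style Python operating on one attention row, not a drop-in kernel; tie-breaking is ignored):

```python
# Replacing softmax with k-winner-take-all: keep the top-k scores,
# zero the rest, renormalize the survivors.
import math

def softmax(scores):
    m = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]           # every entry strictly positive

def k_wta(scores, k):
    cutoff = sorted(scores, reverse=True)[k - 1]
    kept = [s if s >= cutoff else 0.0 for s in scores]
    z = sum(kept)
    return [s / z for s in kept]           # all but ~k entries are exactly 0

scores = [2.0, -1.0, 0.5, 3.0, -2.0, 0.1]
print(softmax(scores))    # every token gets nonzero weight
print(k_wta(scores, 2))   # only the top-2 survive; the rest cost no work
```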
### 3. The Hardware: Neuromorphic Substrate
Even if you write sparse code, a standard GPU (NVIDIA H100) is bad at running it. GPUs like dense, predictable blocks of numbers. To simulate the brain efficiently, we need *Neuromorphic Hardware* (like Intel Loihi or IBM NorthPole).
* **Address Event Representation (AER):** Instead of a "clock" ticking every nanosecond forcing all neurons to update, the hardware is asynchronous. It sits idle (consuming nanowatts) until a "spike" packet arrives at a specific address.
* **Processing-in-Memory (PIM):** To handle the high dimensionality (e.g., 100,000-dimensional vectors), the hardware moves the logic gates *inside* the RAM arrays. This eliminates the energy cost of moving those massive vectors back and forth.
### Summary: The Hypothetical "Spiking HD-Transformer"
I’m not sure why you’re talking about efficiency when the question is “do sparse models work better than dense models?” The answer is no, they don’t.
Even the old LTH paper you cited trains a dense model and then tries to prune it without too much quality loss. Pruning is a well known method to compress models - to make them smaller and faster, not better.
Before we had proper GPUs everyone said the same thing about Neural Networks.
Current model architectures are optimized to get the most out of GPUs, which is why we have transformers dominating as they're mostly large dense matrix multiplies.
There's plenty of work showing transformers improve with inner dimension size, but it's not feasible to scale them up further because it blows up parameter and activation sizes (including KV caches), so people turn to low-rank ("sparse") decompositions like MLA.
Lottery ticket hypothesis shows that most of the weights in current models are redundant and that we could get away with much smaller sparse models, but currently there's no advantage to doing so because on GPUs you still end up doing dense multiplies.
Yes, we know that large dense layers work better than small dense layers (up to a point). We also know how to train large dense models and then prune them. But we don’t know how to train large sparse models to be better than large dense models. If someone figures it out then we can talk about building hardware for it.
It isn't directly what you are asking for, but there is a similar relationship at work with respect to L_1 versus L_2 regularization. The number of samples required to train a model is O(log(d)) for L_1 and O(d) for L_2 where d is the dimensionality [1]. This relates to the standard random matrix results about how you can approximate high dimensional vectors in a log(d) space with (probably) small error.
At a very hand-wavy level, it seems reasonable that moving from L_1 to L_0 would have a similar relationship in learning complexity, but I don't think that has ever been addressed formally.
My last dive into matrix computations was years ago, but the need was the same back then. We could sparsify matrices pretty easily, but the infrastructure was lacking. Some things never change.
On the software side I can recommend https://github.com/DrTimothyAldenDavis/GraphBLAS
It is hard to make a sparse linear algebra framework, but Tim Davis has been doing a great job collecting the various optimal algorithms into a single framework that acts more like an algebra than a collection of kernels.