Yeah, I hear this a lot. Do people genuinely dismiss that there has been step-change progress on a 6-12 month timescale? I mean, it's night and day, look at the benchmark numbers… "yea I don't buy it" ok, but then don't pretend you're objective.
I think I'd be in the "don't buy it" camp, so maybe I can explain my thinking at least.
I don't deny that there have been huge improvements in LLMs over the last 6-12 months at all. I'm skeptical that the last 6 months have suddenly produced a 'category shift' in the kinds of problems LLMs can solve (I'm happy to be proved wrong!).
It seems to me like LLMs are better at solving the same problems that they could solve 6 months ago, and the same could be said comparing 6 months to 12 months ago.
The argument I'd dismiss isn't the improvement, it's the claim that a whole load of economic factors, or use cases, have suddenly been unlocked in the last 6 months because of the improvements in LLMs.
That's kind of a fuzzier point, and a hard one to know until we all have hindsight. But I think OP is right that people have been claiming "LLMs are fundamentally in a different category to where they were 6 months ago" for the last 2 years - and so far, none of those big improvements have unlocked a whole new category of use cases for LLMs.
To be honest, it's a very tricky thing to weigh in on, because the claims being made around LLMs vary wildly, from "we're 2 months away from all disease being solved" to "LLMs are basically just a bit better than old-school Markov chains". I'd argue that clearly neither of those is true, but it's hard to get oriented when both extremes are being claimed at the same time.
The improvement in LLMs has come in the form of more successful one-shots, more successful bug finding, more efficient code, and less time hand-holding the model.
"Problem solving" (which definitely has improved, but maybe has a spikey domain improvement profile) might not be the best metric, because you could probably hand hold the models of 12 months ago to the same "solution" as current models, but you would spend a lot of time hand holding.
> The argument I'd dismiss isn't the improvement, it's the claim that a whole load of economic factors, or use cases, have suddenly been unlocked in the last 6 months because of the improvements in LLMs.
Yes, I agree in principle here, at least in some cases: I think there are certainly problems that LLMs are now better at but that still don't reach the critical reliability threshold where you can say "it can do this". E.g. hallucinations, handling long context well (it's still best practice to reset the context window frequently), long-running tasks, etc.
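On the context point, here's a minimal sketch of the "reset/compact the window frequently" pattern, assuming a hypothetical client.chat(messages) -> str wrapper (not any particular vendor's API): summarize the transcript once it gets long, then continue in a fresh window seeded with the summary.

    # Minimal sketch of periodic context compaction. `client.chat(messages) -> str`
    # is a hypothetical wrapper around whatever LLM API you actually use.
    MAX_MESSAGES_BEFORE_COMPACTION = 20

    def compact(messages, client):
        """Collapse a long transcript into the system prompt plus a short summary."""
        summary = client.chat(messages + [{
            "role": "user",
            "content": "Summarize the task state, decisions made, and open questions.",
        }])
        return [
            messages[0],  # keep the original system prompt
            {"role": "user", "content": f"Summary of progress so far:\n{summary}"},
        ]

    def run_agent(task, client):
        messages = [
            {"role": "system", "content": "You are a coding agent."},
            {"role": "user", "content": task},
        ]
        while True:
            reply = client.chat(messages)
            messages.append({"role": "assistant", "content": reply})
            if "DONE" in reply:  # toy stopping condition
                return reply
            if len(messages) > MAX_MESSAGES_BEFORE_COMPACTION:
                messages = compact(messages, client)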
> That's kind of a fuzzier point, and a hard one to know until we all have hindsight. But I think OP is right that people have been claiming "LLMs are fundamentally in a different category to where they were 6 months ago" for the last 2 years - and so far, none of those big improvements have unlocked a whole new category of use cases for LLMs.
This is where I disagree (but again you are absolutely right for certain classes of capabilities and problems).
- Claude Code did not exist until 2025
- We have gone from people using coding agents for like ~10% of their workflow to like 90-100% pretty typically. Like code completion --> a reasonably good SWE (with caveats and pain points I know all too well). This is a big step change in what you can actually do; it's not like we're still only doing code completion and it's marginally better.
- Long-horizon task success rates have now gotten good enough to basically enable the above (a good SWE) for things like refactors, complicated debugging with competing hypotheses, looping attempts until success, etc. (see the sketch after this list)
- We have nascent UI agents now; they are fragile, but they will likely follow a similar path to coding agents, which opens up yet another universe of things you can only do through a UI
- Enterprise voice agents (for like frontline support) now have a low enough bounce rate that you can actually deploy them
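Re: the long-horizon bullet above, a rough sketch of the "loop attempts until success" pattern, where run_agent_attempt(task, feedback) is a hypothetical stand-in for whatever coding agent you drive and the verifier is just the project's test suite:

    # Rough sketch of looping a coding agent until a verifiable check passes.
    # `run_agent_attempt(task, feedback)` is a hypothetical stand-in for your agent.
    import subprocess

    def tests_pass(repo_dir):
        """Verifier: return (passed, tail of test output) for the repo's test suite."""
        result = subprocess.run(["pytest", "-q"], cwd=repo_dir,
                                capture_output=True, text=True)
        return result.returncode == 0, result.stdout[-2000:]

    def solve(task, repo_dir, run_agent_attempt, max_attempts=5):
        feedback = ""
        for attempt in range(1, max_attempts + 1):
            run_agent_attempt(task, feedback)   # agent edits the repo in place
            ok, log = tests_pass(repo_dir)
            if ok:
                return True
            feedback = f"Attempt {attempt} failed. Test output tail:\n{log}"
        return False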
So we've gone from "this looks promising" to production deployment and very serious usage. This may kind of be, like you say, "same capabilities, just getting gradually better", but at some point that becomes a step change. Above a certain failure rate (which may be hard to pin down explicitly) it's not tolerable to deploy, but as evidenced by adoption alone we've crossed that threshold, especially for coding agents. Even Sonnet 4 -> Opus 4.5 has for me personally (beyond just benchmark numbers) made full project loops possible, in a way where Sonnet 4 would have convinced you it could do it and then wasted 2 whole days of your time banging your head against the wall. The same failure mode still exists for Opus 4.5, but only on much larger tasks.
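One way to see why "gradually better" eventually reads as a step change: long tasks compound per-step reliability, so a small per-step gain can move you across the deployability line. Toy numbers, purely illustrative:

    # Toy illustration: per-step reliability compounds over long-horizon tasks.
    # The probabilities are made up to show the shape of the effect, not measured.
    for per_step_success in (0.90, 0.95, 0.99):
        for steps in (10, 50):
            task_success = per_step_success ** steps
            print(f"p={per_step_success:.2f}, steps={steps:3d} -> "
                  f"end-to-end success ~{task_success:.1%}")
    # At 50 steps: p=0.95 gives ~7.7% end-to-end, p=0.99 gives ~60.5%.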
> To be honest, it's a very tricky thing to weigh in on, because the claims being made around LLMs vary wildly, from "we're 2 months away from all disease being solved" to "LLMs are basically just a bit better than old-school Markov chains". I'd argue that clearly neither of those is true, but it's hard to get oriented when both extremes are being claimed at the same time.
Precisely. Lots and lots of hyperbole, some with varying degrees of underlying truth. But I would say: the underlying reality here is reasonably easy to follow with hard numbers if you look for them. Epoch.ai is one of my favorite sources for industry analysis, and e.g. Dwarkesh Patel is a true gift to the industry. Benchmarks are really quite terrible and shaky, so I don't necessarily fault people for "checking the vibes"; Simon Willison's pelican task, for example, is exactly the sort of thing that's both fun and important!
Well, calling people that disagree with you "shills" is maybe a bad start, and indicates you kind of just have an axe to grind. You're right that there can be serious local issues for data centers, but there are plenty of instances where it's a clear net positive. There's a lot of nuance that you're just breezing over, and then you characterize the people who point this out as "shills". Water and electricity demands do not have to be problematic; they are highly site-specific. In some cases there are real concerns (drought-prone areas like Arizona, impact on local grids, the possibility of rate increases for ordinary people, etc.), but in many cases they are not problematic (closed-loop or reclaimed water, independent power sources, etc.).
Well why would we ask you (no offense)? What is the questionable data you're talking about?
RAM and GPU prices going up, sure, ok, but again: if you're claiming there is no net benefit from AI, what is your evidence for that? These contracts are going through legally, so what basis do you have to prevent them from happening? Again, I'd say it's site-specific: there are plenty of instances where people have successfully blocked data centers in their area, and lots of problems do come up (especially because companies are secretive about details, so people may not be able to make informed judgements).
What are the massive problems besides RAM and GPU prices? And again, what is the societal impact of those?
I think I completely agree with you, except that HN folks seriously underestimate the rate of progress. Believe what you will about the magnitude of the capex, but it's coming and it's coming fast. And we are extremely, extremely close now. I agree we've constantly gotten timelines wrong, and it's easily possible SOME capabilities may take longer, but it's hard to overstate just how much progress will accelerate in the next year or two.
But yeah: self-driving cars are still not here; see e.g. all the other AI booms.
The difference here is that we're seeing it with our own eyes and using it right now. There's so much absolutely existential competition between companies (even within them!) and geopolitically.
Yeah, they are, even if you don't have one yet. We can rathole on whether they need to hit Level 5 before it "counts", but Waymos drive around multiple cities, today, and Tesla FSD works well enough that I'd rather drive next to a Tesla with FSD than a drunk driver.
If your evidence that AI isn't something to be worried about is saying self-driving cars aren't here, when they are, well then, we're fucked.
The future is here, it's just unevenly distributed. For cars, this manifests as them simply not being physically available everywhere yet. For programming, it's unevenly distributed according to how much training data there was, in that language and that domain, to scrape across the whole Internet.
Oh wait, I'm not sure I was clear. I just mean: yes, we've gotten lots of hyped claims wrong, like "FSD will be here in 5 years" circa 2014, but it is to our peril not to take the very short AI timelines seriously.
Also: I think your argument and the other comment's are great analogies to the AI situation; we can haggle over "ok, but what is {FSD, AGI} really?" when in many ways it's already here!
I agree totally and I would just point out we’re at an even more intense moment in the AI space
That's one of my tests for whether we've reached AGI. In many senses, self-driving cars are here. For the vast majority of tasks, self-driving likely works fine. It's when you get to the parts that need predictive capabilities, like figuring out what other idiots are about to do, or what some random object in the road is going to do, that our AI doesn't have the ability to cope.
> Because software developers typically understand how to implement a solution to a problem better than the client. If they don't have enough details to implement a solution, they will ask the client for details. If the developer decides to use an LLM to implement a solution, they have the ability to assess the end product.
Why do you think agents can't do that? They can't do it really well today, but if the distance we covered in 2025 is any guide, it'll be like a year before this starts getting decent and then another year before it's excellent.
> Sure, you will see a few people using LLMs to develop personalized software for themselves. Yet these will be people who understand how to specify the problem they are trying to solve clearly, will have the patience to handle the quirks and bugs in the software they create
Hallucinations are not solved, memory is not solved, prompt injection is not solved, context limits are waaay too low while tokens are way too expensive to take advantage of the context limits we do have, etc. These problems have existed since the very early days of GPT-4 and there is no clear path to them being solved any time soon.
You basically need AGI and we are nowhere close to AGI.
All of the issues you talk about are real, and I don't personally care about AGI (it's kind of a mishmash of a real thing and a nice package for investors); what I do care about is what has been released and what it can do.
All of the issues you talk about: they aren’t solved but we’ve made amazing progress on all of them. Continual learning is a big one and labs are likely close to some POCs.
Token cost per unit of performance goes down rapidly: GPT-4-level performance costs you 10x less today than two years ago. This will continue to be the case as we keep pushing efficiency up.
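A toy back-of-the-envelope for what that means for agent workflows that burn a lot of tokens (the prices below are placeholders, not any provider's real list prices):

    # Toy cost comparison; per-token prices are placeholders, not real list prices.
    old_price_per_mtok = 30.0    # hypothetical $/1M tokens, "GPT-4-level" two years ago
    new_price_per_mtok = 3.0     # hypothetical $/1M tokens for equivalent quality today
    tokens_per_task = 2_000_000  # e.g. one long agentic coding session

    old_cost = old_price_per_mtok * tokens_per_task / 1_000_000
    new_cost = new_price_per_mtok * tokens_per_task / 1_000_000
    print(f"then: ${old_cost:.2f}/task, now: ${new_cost:.2f}/task, "
          f"{old_cost / new_cost:.0f}x cheaper")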
The AGI question, "are we close": tbh, to me these questions are just rabbit holes and bait for flame wars, because no one can agree on what it means, and even if you do (e.g. superhuman performance on all economically valuable tasks is maybe a more solid starting point), everyone fights about the ecological validity of the evals.
All I'm saying is: taking coding in a complete vacuum, we're very, very close to the point where the benefit is so obvious, and failure rates for many things fall so far below the critical thresholds, that automating even the things people say make engineers unique (working with people to navigate ambiguous issues they can't articulate well, making the right tradeoffs, etc.) starts looking less like a research challenge and more like an exercise in deployment.
My experience with LLMs is that you can’t get them to clarify things by asking questions. They just assume facts and run with it.
My experience with software development is that a lot of things are learned by asking probing questions informed by your experience. How would an LLM understand the subtle political context behind a requirement or navigate unwritten rules discussed in meetings a year ago?
I honestly think the story would be much different with more product sense and better market intuition; Horizon is just a perfect example of pure idiocy. They may as well have just ported Chatroulette.
Once the Apple Vision Pro was released, I finally understood what VR could really be, which is an incredible immersive escape. Once I watched an Apple Immersive movie, and then even a completely regular old 2D movie in theater mode at night in Joshua Tree, I got it. The price is obviously completely unattainable, but to me the approach was very smart: go low volume, but execute the best version of your vision that you possibly can, and see how people respond to it. That proves out the vision, and then you can start working the price down.
The only thing Meta VR got right is gaming: it's the only use case that works with the resolution and hardware at the price point they're trying to occupy. The AVP could obviously do gaming too, but look: I've nearly punched out a window with my Quest Pro. Sitting and playing a game is weird; standing and playing is tiring. What I like infinitely better is just: watching a movie. Escaping. Relaxing.
I still use my Quest after a year, but mostly for the web and YouTube 360. YouTube 360 is actually quite cool given that no one really makes content for it.
I have no interest in games and anything inside Horizon is just not impressive.
I just don't understand how Meta spent this much money to get so little in return. VRChat has immense worlds compared to anything in Horizon. Everything in Horizon is just so amateur looking and lacking any kind of imagination.
I got the Quest because I wanted to try developing for VR, but that turned out to be a total nightmare. Horizon/Unity/Unreal are all different flavors of nightmare. I suspect this is actually the problem: development is just too hard to do much of anything interesting. Anything interesting I have made has been in vanilla JavaScript/three.js/react-three-fiber.
Vision Pro-level resolution + WebXR, I think, has a huge amount of potential. I even like wearing the Quest. The physical act of wearing the headset is really no issue to me at all, and that was what I figured I would get tired of.
The Quest is ultimately an amazing piece of hardware with amazingly bad software.
I'm definitely in the camp that this browser implementation is shit, but just a reminder: agent training does involve human coding data in the early stages to bootstrap it, but in the reinforcement learning phase it does not -- it learns closer to the way AlphaGo did, via self-play and verifiable rewards. This is why people are very bullish on agents: there is technically no limit to how well they can learn (unlike LLMs trained only to imitate human data), and we know we will reach superhuman skill. The crucial, crucial reason for this is verifiable rewards. You have those for coding; you do not have them for e.g. creative tasks.
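To make "verifiable rewards" concrete, a minimal sketch (my own toy example, not any lab's actual training code): for coding you can score a rollout mechanically by running a test suite, which is exactly what you can't do for, say, a poem.

    # Minimal sketch of a verifiable reward for a coding rollout: run the repo's
    # test suite and score the result mechanically, no human judgment involved.
    import subprocess

    def coding_reward(repo_dir: str) -> float:
        """Reward 1.0 if the test suite in `repo_dir` passes, else 0.0."""
        result = subprocess.run(["pytest", "--tb=no", "-q"], cwd=repo_dir,
                                capture_output=True, text=True)
        return 1.0 if result.returncode == 0 else 0.0

    # An RL loop would generate a candidate patch, apply it in a sandboxed copy of
    # the repo, call coding_reward() on that copy, and reinforce whatever behavior
    # produced passing patches. There is no equivalent mechanical check for
    # "write something beautiful", which is the point about creative tasks above.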
So agents will actually be able to build a {browser, library, etc.} that won't be an absolute slopfest; the real, crucial question is when. You need better and more efficient RL training, further scaling (Amodei thinks scaling is really the only thing you technically need here, and we have about 3-4 orders of magnitude of headroom left before we hit insurmountable limits), bigger context windows (that models actually handle well), and possibly continual-learning paradigms, but solutions to these problems are quite tangible now.
> those fully-paid-up members of the "AI revolution" cult
Ah ok, so an overly simplistic tribal take by someone with a clear axe to grind and no desire to consider any nuance. Anyone who disagrees with his take -- regardless of their actual positions on various AI-related topics -- is "fully-paid-up" and in a "cult". Is it so hard to consider the _possibility_ that Salesforce botching an AI rollout doesn't imply the whole industry is doing the same thing? Doesn't this completely ignore how varied real deployments are and how messy the reporting around them tends to be?
Overhyped claims abound -- Cursor, Google tweets about math problems being solved, agents cheating on SWE-bench because git logs weren't sanitized, etc. Some of it is careless, some probably dishonest, but the incentives cut both ways. When claims get debunked (e.g., the LMSYS/LMArena confusion around Llama 4 results), the reputational damage is immediate and brutal. No one benefits from making bad claims that are easy to fact-check; no one actually wants to do this. Lots of different stances and claims about how _close we are_ to various capabilities can easily be considered misleading -- fine! But are you going to completely ignore actual measured progress? The accomplishments that are defensible? The industry analysis that is careful and well thought out (see e.g. Epoch)?
> this dramatic deployment, followed by a rapid walk back, is happening across the entire economy.
Which companies? What deployments? Zero concrete cases. Firms make bad calls about AI for the same reason they make bad calls about M&A, pricing, or org design: leadership everywhere constantly misjudges reality, and that will be true until the end of time. It's a pretty big leap to conclude that this implies systemic delusion and that anyone who disagrees is in a cult.
Yet another completely ignored core issue is how distorted the coverage of these things is. I read everything written about the company I work for, and stories routinely flatten nuanced, defensible, even boring decisions into morality plays, because that's far more readable and engaging. Benioff could easily be overselling ordinary layoffs as AI transformation; it gives cover while also being a great opportunity to make Salesforce look extremely competent (idiotic, and it has completely backfired). Yet none of that tells us what is actually happening operationally...
Bayesian methods are very much canonical in most fields I've been involved with (cosmology is one of the most beautiful paradises for someone looking for maybe the coolest club of Bayesian applications). I'm surprised there are still holdouts, especially in fields where the stakes are so high. There are also plenty of blog articles and classroom lessons about how frequentist trial designs kill people: if you are not allowed to deviate from your experiment design, but you already have enough evidence to form a strong belief about which treatment is better, is that unethical? Maybe the reality is a bit less simplistic, but I've seen many instantiations of that argument around.
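To make the trial-design point concrete, a toy sketch of the Bayesian interim-monitoring argument, with made-up counts and Beta(1, 1) priors on each arm's response rate:

    # Toy sketch of Bayesian interim monitoring: posterior probability that arm A
    # beats arm B under independent Beta(1, 1) priors. Counts are made up.
    import random

    successes_a, failures_a = 18, 2    # hypothetical interim data, arm A
    successes_b, failures_b = 10, 10   # hypothetical interim data, arm B

    n_samples = 100_000
    wins = sum(
        random.betavariate(1 + successes_a, 1 + failures_a)
        > random.betavariate(1 + successes_b, 1 + failures_b)
        for _ in range(n_samples)
    )
    print(f"P(A better than B | data so far) ~ {wins / n_samples:.3f}")
    # If this is already ~0.99 mid-trial, the ethical question above is whether
    # you should keep randomizing patients to arm B just to finish the design.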
I think this is the right take. In some narrow but constantly broadening contexts, agents give you a huge productivity edge. But to leverage that you need to be skilled enough to steer, design the initial prompt, understand the impact of what you produce, etc. I don't see agents, in their current or medium-term incarnation, as a replacement for engineering work; I see them as a great reshuffling of engineering work.
In some business contexts, the impact of more engineering labor on output gets capped at some point. Meaning once agent quality reaches a certain point, the output increase is going to be minimal with further improvements. There, labor is not the bottleneck.
In other business contexts, labor is the bottleneck. For instance it's the bottleneck for you as an individual: what kind of revenue could you make if you had a large team of highly skilled senior SWEs that operate for pennies on the dollar?
What I think you'll see is labor shifting to where the ROI is highest.
To be fair, I can imagine a world where we eventually fully replace the "driver" of the agent, in that it becomes good enough to fill the role of a ~staff engineer: ingesting very high-level business context, strategy, and politics, and generating a high-level system design that can then be executed by one or more agents (or by other SWEs using agents). I don't (at this point) see some fundamental rule of physics/economics that prevents this, but it seems much further ahead of where we are now.
https://www.deeplearning.ai/the-batch/alphatensor-for-faster...
https://deepmind.google/blog/alphaevolve-a-gemini-powered-co...
https://www.rubrik.com/blog/ai/25/teaching-ai-to-write-gpu-c...