
Simon, you're starting to sound super disconnected from reality; this "I hit everything that looks like a nail with my LLM hammer" vibe is new.

My habits have changed quite a bit with Opus 4.5 in the past month. I need to write about it.

What's concerning to many of us is that you (and others) have said this same thing, s/Opus 4.5/some other model/

That feels more like chasing than a clear line of improvement. It's very different from something like "my habits have changed quite a bit since reading The Art of Computer Programming". They're categorically different.


It's because the models keep getting better! What you could do with GPT-4 was more impressive than what you could do with GPT 3.5. What you could do with Sonnet 3.5 was more impressive yet, and Sonnet 4, and Sonnet 4.5.

Some of these improvements have been minor, some of them have been big enough to feel like step changes. Sonnet 3.7 + Claude Code (they came out at the same time) was a big step change; Opus 4.5 similarly feels like a big step change.

(If you don't trust vibes, METR's task completion benchmark shows huge improvements, too.)

If you're sincerely trying these models out with the intention of seeing if you can make them work for you, and doing all the things you should do in those cases, then even if you're getting negative results somehow, you need to keep trying, because there will come a point where the negative turns positive for you.

If you're someone who's been using them productively for a while now, you need to keep changing how you use them, because what used to work is no longer optimal.


Models keep getting better but the argument I'm critiquing stays the same.

So does the comment I critiqued in the sibling comment to yours. I don't know why it's so hard to believe we just haven't tried. I have a Claude subscription. I'm an ML researcher myself. Trust me, I do try.

But that last part also makes me keenly aware of their limitations and failures. Frankly, I don't trust experts who aren't critiquing their own field. Leave the selling points to the marketing team. The engineer's and researcher's job is to be critical, to find problems. I mean, how the hell do you solve problems if you're unable to identify them? Should the marketing team lead development direction instead? That sounds like a bad way to solve problems.

  > benchmark shows huge improvements
Benchmarks are often difficult to interpret. It is really problematic that they got incorporated into marketing. If you don't understand what a benchmark measures, and more importantly, what it doesn't measure, then I promise you that you're misunderstanding what those numbers mean.

For METR I think they say a lot right here (emphasis my own) that reinforces my point

  > Current frontier AIs are vastly better than humans at text prediction and knowledge tasks. They outperform experts on most *exam-style problems* for a fraction of the cost. ... And yet the best AI agents are not currently able to carry out substantive projects by themselves or directly substitute for human labor. *They are unable to reliably handle even relatively low-skill*, computer-based work like remote executive assistance. It is clear that capabilities are increasing very rapidly in some sense, but it is unclear how this corresponds to real-world impact.
So make sure you're really careful to understand what is being measured. What improvement actually means. To understand the bounds.

It's great that they include longer tasks but also notice the biases and distribution in the human workers. This is important in properly evaluating.

Also remember what exactly I quoted. For a long time we've all known that being good at leetcode doesn't make one a good engineer. But it's an easy thing to test, and the test correlates with other skills that are likely to be learned while getting good at those tests (despite the possibility of metric hacking). We're talking about massive compression machines that pattern match. Pattern matching tends to get much more difficult as task time increases, but this is not a necessary condition.

Treat every benchmark adversarially. If you can't figure out how to metric-hack it, then you don't know what the benchmark is measuring (and knowing how to hack it doesn't mean you understand it, nor that that's what is actually being measured).


I think you should ask yourself: If it were true that 1) these things do in fact work, 2) these things are in fact getting better... what would people be saying?

The answer is: Exactly what we are saying. This is also why people keep suggesting that you need to try them out with a more open mind, or with different techniques: Because we know with absolute first-person iron-clad certainty what is possible, and if you don't think it's possible, you're missing something.


I don't understand what your argument is.

It seems to be "people keep saying the models are good"?

That's true. They are.

And the reason people keep saying it is because the frontier of what they do keeps getting pushed back.

Actual, working, useful code completion in the GPT 4 days? Amazing! It could automatically write entire functions for me!

The ability to write whole classes and utility programs in the Claude 3.5 days? Amazing! This is like having a junior programmer!

And now, with Opus 4.5 or Codex Max or Gemini 3 Pro we can write substantial programs one-shot from a single prompt and they work. Amazing!

But now we are beginning to see that programming in 6 months' time might look very different from now, because these AI systems code very differently from us. That's exactly the point.

So what is it you are arguing against?

I think you said you didn't like that people are saying the same thing, but in this post it seems more complicated?


> And now, with Opus 4.5 or Codex Max or Gemini 3 Pro we can write substantial programs one-shot from a single prompt and they work. Amazing!

People have been doing this parlor trick with various "substantial" programs [1] since GPT 3. And no, the models aren't better today, unless you're talking about being better at the same kinds of programs.

[1] If I have to see one more half-baked demo of a running game or a flight sim...


"And no, the models aren't better today"

Can you expand on that? It doesn't match my experience at all.


It’s a vague statement that I obviously cannot defend in all interpretations, but what I mean is: the performance of models at making non-trivial applications end-to-end, today, is not practically better than it was a few years ago. They’re (probably) better at making toys or one-shotting simple stuff, and they can definitely (sometimes) crank out shitty code for bigger apps that “works”, but they’re just as terrible as ever if you actually understand what quality looks like and care to keep your code from descending into entropy.

I think "substantial" is doing a lot of heavy lifting in the sentence I quoted. For example, I’m not going to argue that aspects of the process haven’t improved, or that Claude 4.5 isn't better than GPT 4 at coding, but I still can’t trust any of the things to work on any modestly complex codebase without close supervision, and that is what I understood the broad argument to be about. It's completely irrelevant to me if they slay the benchmarks or make killer one-shot N-body demos, and it's marginally relevant that they have better context windows or now hallucinate 10% less often (in that they're more useful as tools, which I don't dispute at all), but if you want to claim that they're suddenly super-capable robot engineers that I can throw at any "substantial" problem, you have to bring evidence, because that's a claim that defies my day-to-day experience. They're just constantly so full of shit, and that hasn't changed, at all.

FWIW, this line of argument usually turns into a motte-and-bailey fallacy, where someone makes an outrageous claim (e.g. "models have recently gained the ability to operate independently as a senior engineer!"), and when challenged on the hyperbole, retreats to a more reasonable position ("Claude 4.5 is clearly better than GPT 3!"), but with the speculative caveat that "we don't know where things will be in N years". I'm not interested in that kind of speculation.


Have you spent much time with Codex 5.1 or 5.2 in OpenAI Codex, or Claude Opus 4.5 in Claude Code, over the last ~6 weeks?

I think they represent a meaningful step change in what models can build. For me they are the moment we went from building relatively trivial things unassisted to building quite large and complex systems that take multiple hours, often still triggered by a single prompt.

Some personal examples from the past few weeks.

- A spec-compliant HTML5 parsing library by Codex 5.2: https://simonwillison.net/2025/Dec/15/porting-justhtml/

- A CLI-based transcript export and publishing tool by Opus 4.5: https://simonwillison.net/2025/Dec/25/claude-code-transcript...

- A full JavaScript interpreter in dependency-free Python (!) https://github.com/simonw/micro-javascript - and here's that transcript published using the above-mentioned tool: https://static.simonwillison.net/static/2025/claude-code-mic...

- A WebAssembly runtime in Python which I haven't yet published

The above projects all took multiple prompts, but were still mostly built by prompting Claude Code for web on my iPhone in between Christmas family things.

I have a single-prompt one:

- A Datasette plugin that integrates Cloudflare's CAPTCHA system: https://github.com/simonw/datasette-turnstile - transcript: https://gistpreview.github.io/?2d9190335938762f170b0c0eb6060...

I'm not confident any of these projects would have worked with the coding agents and models we had four months ago. There is no chance they would've worked with the models available in January 2025.


I’ve used Sonnet 4.5 and Codex 5 and 5.1, but not in their native environment [1].

Setting aside the fact that your examples are mostly “replicate this existing thing in language X” [2], again, I’m not saying that the models haven’t gotten better at crapping out code, or that they’re not useful tools. I use them every day. They're great tools, when someone actually intelligent is using them. I also freely concede that they're better tools than a year ago.

The devil is (as always) in the details: how many prompts did it take? what exactly did you have to prompt for? how closely did you look at the code? how closely did you test the end result? Remember that I can, with some amount of prompting, generate perfectly acceptable code for a complex, real-world app, using only GPT 4. But even the newest models generate absolute bullshit on a fairly regular basis. So telling me that you did something complex with an unspecified amount of additional prompting is fine, but not particularly responsive to the original claim.

[1] Copilot, with a liberal sprinkling of ChatGPT in the web UI. Please don’t engage in “you’re holding it wrong” or "you didn't use the right model" with me - I use enough frontier models on a regular basis to have a good sense of their common failings and happy paths. Also, I am trying to do something other than experiment with models, so if I have to switch environments every day, I’m not doing it. If I have to pay for multiple $200 memberships, I’m not doing it. If they require an exact setup to make them “work”, I am unlikely to do it. Finally, if your entire argument here hinges on a point release of a specific model in the last six weeks…yeah. Not gonna take that seriously, because it's the same exact argument, every six weeks. </caveats>

[2] Nothing really wrong with this -- most programming is an iterative exercise of replicating pre-existing things with minor tweaks -- but we're pretty far into the bailey now, I think. The original argument was that you can one-shot a complex application. Now we're in "I can replicate a large pre-existing thing with repeated hand-holding". Fine, and completely within my own envelope for model performance, but not really the original claim.


Is there an endpoint for AI improvement? If we can go from functions to classes to substantial programs then it seems like just a few more steps to rewriting whole software products and putting a lot of existing companies out of business.

"AI, I don't like paying for my SAP license, make me a clone with just the features I need".


Two things seem to be in contention:

  - Models keep getting better[0]
  - Models since GPT 3 are able to replace junior developers
It's true that both of these can be true at the same time, but they are still in contention. We're not seeing agents ready to replace mid-level engineers, and quite frankly I've yet to see a model actually ready to replace juniors. Possibly low-end interns, but the major utility of interns is as a trial run for employment. Frankly, it still seems like interns and juniors are advancing faster than these models in the types of skills that matter to companies (not to mention that institutional knowledge is quite valuable). But there are interns who started when GPT 3.5 came out that are seniors now.

The problem is we've been promised that these employees would be replaced[1] any day now, yet that's not happening.

People forget: it is harder to advance when you're already skilled. It's not hard to go from non-programmer to junior level. It's hard to go from junior to senior, and even harder to advance to staff. The difficulty only increases. This is true for most skills, and this is where there's a lot of naivety. We can be advancing faster while the actual capabilities begin to crawl forward rather than leap.

[0] Implication is not just at coding test style questions but also in more general coding development.

[1] Which has another problem in the pipeline. If you don't have junior devs and are unable to replace both mid and seniors by the time that a junior would advance to a senior then you have built a bubble. There's a lot of big bets being made that this will happen yet the evidence is not pointing that way.


Opus 4.5 is categorically a much better model from benchmarks and personal experience than Opus 4.1 & Sonnet models. The reason you're seeing a lot of people wax about O4.5 is that it was a real step change in reliable performance. It crossed for me a critical threshold in being able to solve problems by approaching things in systematic ways.

Why do you use the word "chasing" to describe this? I don't understand. Maybe you should try it and compare it to earlier models to see what people mean.


  > Why do you use the word "chasing" to describe this?
I think you'll get the answer to this if you read my comment and your response to understand why you didn't address mine.

Btw, I have tried it. It's annoying that people think the problem is not trying. It was getting old when GPT 3.5 came out. Let's update the argument...


Looking forward to hearing about how you're using Opus 4.5, from my experience and what I've heard from others, it's been able to overcome many obstacles that previous iterations stumbled on

Please do. I'm trying to help other devs in my company get more out of agentic coding, and I've noticed that not everyone is defaulting to Opus 4.5 or even Codex 5.2, and I'm not always able to give good examples to them for why they should. It would be great to have a blog post to point to…

Can you expound on Opus 4.5 a little? Is it so good that it's basically a superpower now? How does it differ from your previous LLM usage?

To repeat my other comment:

> Opus 4.5 is categorically a much better model from benchmarks and personal experience than Opus 4.1 & Sonnet models. The reason you're seeing a lot of people wax about O4.5 is that it was a real step change in reliable performance. It crossed for me a critical threshold in being able to solve problems by approaching things in systematic ways.


The reality is we went from LLMs as chatbots editing a couple of files per request with decent results, to running multiple coding agents in parallel implementing major features from a spec document and some clarifying questions, all in a year.

Even IF llms don't get any better there is a mountain of lemons left to squeeze in their current state.


> I don't know if I'd call it breakage to just... not use them where they should be used.

Accessibility only has 2 states: "Working" and "Broken", there's no third "I didn't bother".


> It's not X - it's Y, it's Z.

> Consider: consideration. That's what you should consider.

Come on man


The second point I agree with, but I hate that the "It's not ... it's ..." construct is a sign of LLMs now, because I use it all the time

You ignored the second part of their message. Imagine this:

> We cut prices by 50%! Before $30, now $20

Would it be pedantic to call that price cut bullshit?
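The two readings of "cut prices by 50%" in that hypothetical $30 → $20 example can be made concrete with a quick arithmetic sketch (figures from the comment above, nothing else assumed):

```python
# The $30 -> $20 example from the comment, under both readings.
old, new = 30, 20

# Ordinary reading: a cut is measured against the old price.
cut = (old - new) / old
print(f"{cut:.0%}")   # → 33%  (so calling it "50% off" is the complaint)

# "Undo a markup" reading: the cut equals the hike that would
# restore the old price from the new one.
hike = (old - new) / new
print(f"{hike:.0%}")  # → 50%
```

Under the ordinary reading the claimed 50% cut is really a 33% cut; only the "undoing a 50% markup" reading makes the claim come out true.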


Sorry, I thought that my answer to the second question was implied by my answer to the first question.

To answer your question, no, it would not be pedantic to question that claim. It conforms to no common usage that I am aware of.


> It conforms to no common usage that I am aware of.

It conforms to:

> “cut prices by 600%” is understood perfectly well by most people (but not pedants) to mean “we undid price hikes of 600%.”

which I agree is no common usage that I am aware of


No, it does not conform. As I wrote earlier, I have not seen that usage for less than 100%. So 600% conforms; 50% does not.

That is, expressions like "twice as slow/thin/short/..." or "2x as slow/thin/short/..." or "200% as slow/thin/short/..." have a well-established usage that is understood to mean "half as fast/thick/tall/..."

But "50% as slow/thin/short/..." or "half as slow/thin/short/..." have no such established usage.

For some evidence to support my claim, please see this 2008 discussion on Language Log:

https://languagelog.ldc.upenn.edu/nll/?p=463#:~:text=A%20fur...

Since HN has a tendency to trim URLs and might prevent this link from taking you to the relevant portion of a rather lengthy article, I'll quote the salient bits:

"A further complexity: in addition to the N times more/larger than usage, there is also a N times less/lower than [to mean] '1/Nth as much as' usage"

"[About this usage, the Merriam-Webster Dictionary of English Usage reports that] times has now been used in such constructions for about 300 years, and there is no evidence to suggest that it has ever been misunderstood."


> I have not seen that usage for less than 100%. So 600% conforms; 50% does not.

> For some evidence to support my claim

Please note that the 2008 discussion you linked does not support your claim in any way, so 50% does conform.


I guess we will have to agree to disagree.

I believe that the history of English language usage is replete with examples such as "X times less than" when X > 1, but similar constructions for X <= 1 do not appear with appreciable frequency.

In any case, I think that continuing our conversation is unlikely to be productive, so this will be my last reply.

I will just say in closing that our conversation is a good example of why the MAGA folks have probably chosen phrasing such as this.


To be fair our conversation can be summarized as:

> only pedants misunderstand this, here's a 2 decade old source that doesn't support my claim, I rather not continue the conversation

so it was never meant to be productive


For reference, considering the backup has 86 million music files, at an average of 3 minutes per file it would take you around 490 years to listen to all the tracks.
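A back-of-the-envelope check of that figure (assumed inputs from the comment: 86 million tracks, 3 minutes each, listening nonstop):

```python
# 86 million tracks at 3 minutes each, played back to back.
tracks = 86_000_000
minutes_per_track = 3

total_minutes = tracks * minutes_per_track       # 258,000,000 minutes
years = total_minutes / (60 * 24 * 365.25)       # nonstop, no sleep
print(f"~{years:.0f} years")
```

This comes out to roughly 490 years, matching the estimate above.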

> It’s less than $2 a month for 2TB.

What would be the egress fee to get your data back in case of disaster?


The cheapest, slowest egress, bulk retrieval, is $2.56 per terabyte.

Glacier is meant for break-glass-in-case-of-emergency use. You would use lifecycle policies on S3 to keep objects in faster, more expensive storage for, say, the first 90 days and then have them automatically transition to Glacier.

Yes I know it’s more complicated and nuanced. I’m purposefully yada yada yada’ing
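To make the lifecycle idea concrete, here is a minimal sketch of what such a rule looks like, plus the restore cost at the quoted bulk rate. The rule ID is hypothetical, the 90-day threshold and $2.56/TB figure come from the comments above, and actual AWS pricing and rule syntax should be checked against current docs:

```python
# Sketch of an S3 lifecycle rule: after 90 days, objects move to Glacier.
# (Shape matches what you'd pass to put_bucket_lifecycle_configuration;
# the rule ID is made up for illustration.)
lifecycle_rule = {
    "ID": "archive-after-90-days",
    "Status": "Enabled",
    "Filter": {"Prefix": ""},  # apply to the whole bucket
    "Transitions": [
        {"Days": 90, "StorageClass": "GLACIER"},
    ],
}

# Worst-case restore cost for the 2 TB backup at the quoted bulk rate.
bulk_retrieval_per_tb = 2.56  # USD, figure quoted above
backup_size_tb = 2
print(f"Bulk restore of {backup_size_tb} TB ≈ ${backup_size_tb * bulk_retrieval_per_tb:.2f}")
# → Bulk restore of 2 TB ≈ $5.12
```

So even the disaster-recovery path costs only a few dollars at this size; the trade-off is retrieval time, not money.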


> on this rare occasion couldn't resist the joke

It was unintentional as per author

> Ouch. That is what I get for pushing something out during a meeting, I guess. That was not my point; the experiment is done, and it was a success. I meant no more than that.


The “Ouch.” was in reference to being compared to Phoronix.

Has anyone found them to be inaccurate, or fluffy to the point it degraded the content?

I haven’t - but then again, I'm probably predominantly reading the best posts being shared on aggregators.


I don't know about the "ouch" but the rest of the comment seems pretty clear that they didn't intend to imply the clickbait.


That’s correct!

That’s why I was asking about the “Ouch.”: it was the only part of the comment that didn’t make sense to me in isolation, so I opened the context to see what he was replying to.


Nah I used to read Phoronix and the articles are a bit clickbaity sometimes but mostly it's fine. The real issue is the reader comments. They're absolute trash.


The comments section is the biggest problem, but also, in addition to clickbait, the site has a tendency to amplify and highlight anything that will produce drama, often creating a predictable tempest in a teapot.


> Edit: It's easy to downvote. I cited the relevant law. If I'm wrong, cite other law that explains why.

> Please don't comment about the voting on comments. It never does any good, and it makes boring reading.

https://news.ycombinator.com/newsguidelines.html


I don't think he's here looking for a solution.


Well I am looking for the laziest solution possible. Hoping not to have to build it, but I prefer building something than trying to configure 50 different things to make it work how I want it to for my whole family.

I just want one click albums. Apparently this is the hill I'll die on.


Storage template is not deprecated https://docs.immich.app/administration/storage-template/

Also according to https://immich.app/cursed-knowledge the notify issue was fixed July 2024.


Great news! I triggered the discussion with WAL at the time but lost interest. Great to see that the devs understood the problem! Kudos. Will give it another try ASAP!

