I would bet the popularity is due to coding agents. For the first time you can continue the work without typing much, just inspecting the output and providing further guidance with relatively short messages.
I bet that's true, but that use case will be much, much better served by the native VM-based Linux terminal; I'm not sure what sandboxing you can use in Termux.
To really understand how complicated this matter is, add to the mix that even before AI, Europe had no shortage of the knowledge needed to build all of our own cloud services (to the point that a decent part of key infrastructure software is developed in Europe, or mainly by Europeans), social networks, and so on, yet it was never strongly wanted. For us to reach this point now, something must be really off in the current US-EU tensions.
Thanks, I'm sharing a lot on X / BlueSky + YouTube, but once the C course on YouTube is finished, I'll start a new course on programming in this way. I need a couple more lessons to declare the C course closed (later I'll likely restart it with the advanced part). Then I can start with the AP course.
Now many will downvote you because this is an algorithm and not some code. But the reality is that programming is in large part built by looking at somebody else's code / techniques, internalizing them, and reproducing them again with changes. So actually it works like that for code as well.
Exactly that. And all the books about, for instance, operating systems are totally based on the work of others: their ideas were collected and documented, the exact algorithms, and so forth. All of human culture has worked this way. Moreover, there is a strong pattern of the most prolific / well-known open source developers being NOT against the fact that their code was used for training: they can't speak for everybody, but it is a signal that for many this use is within the scope of making source code available.
Yeah, documented *and credited*. I'm not against the idea of disseminating knowledge, and even with my misgivings about LLMs, I wouldn't have said anything if this blog post was simply "LLMs are really useful".
My comment was in response to you essentially saying "all the criticisms of LLMs aren't real, and you should be uncompromisingly proud about using them".
> Moreover there is a strong pattern of the most prolific / known open source developers being NOT against the fact that their code was used for training
I think it's easy to get "echo-chambered" by who you follow online on this; my experience has been the opposite, and I don't think it's clear what the reality is.
You will say "I programmed it"; there is no longer any need for this distinction. But then you can add that you used automatic programming in the process. Soon, though, there will be no need to refer to this term at all, similarly to how today you don't specify that you used an editor...
(Yes?) but the editor isn't claiming to take your job in 5 years.
Also I do feel like this is a very substantial leap.
This is sort of like the difference between some and many.
Your editor has some effect on the final result, so crediting or mentioning it doesn't really matter (though people still do mention their editor choices, and I know some git repos with a .vscode directory that shows the creator used VS Code; I'm not sure whether the same is true for other editors too).
But especially with AI, the difference is that I personally feel like it's doing much/most of the work. It's literally writing the code which turns into the binary which runs on the machine, while being a black box.
I don't really know, because it's something I'm conflicted about too, but I just want to speak my mind, even if it may be a little contradictory on the whole AI distinction thing, which is why I wish to discuss it with ya.
LLMs translate specs into code; if you master computational thinking like Antirez, you basically reduce LLMs to intelligent translators of the stated computational ideas and specifications into a(ny) formal language, plus the typing. In that scenario LLMs are a great tool and speed up the coding process. I like how the power is in semantics, whereas syntax becomes more and more a detail (and rightfully so)!
> So when there is a bug / outage / error, due to "automatic programming" you are first in line and ready to accept accountability when it all goes wrong in production?
Absolutely yes. Automatic programming does not mean software developers are no longer accountable for their errors, also because you can use AP to put way more effort into QA than was possible in the past. If you decide to just add things without a rigorous process, it is your fault.
Agree. Much of the value of devs is understanding the thing they're working on, so they know what to do when it breaks and what new features it can easily support. It doesn't matter whether they wrote the code, a colleague wrote it, or an AI did.
Sciascia, btw, is one of the greatest thinkers and writers of the '900. He is not really defined by his mafia-related novels and takes. He used to be friends with Borges and was regarded as one of the leading figures of humanistic culture. Disclaimer: I was born in a town (Campobello di Licata) near his town (Racalmuto), but I'm not saying this because of that.
If you have never read Sciascia, I suggest starting from his last, tiny novel: "Una storia semplice". I believe there are English translations that can be found as ebooks or as used copies on eBay.
It's not a typo. In Italian, we call the nineteen hundreds the nine hundreds in speech, so when we write it we use '900, since 900 without the apostrophe would mean the actual 900s.
As an aside, do you use Dvorak as your keyboard layout? The ' for 1 typo is quite rare with QWERTY, but I could see you meaning '1900s, though that becomes two characters in a short space.
Thanks for the recommendation!
Why I do not believe this shows Anthropic serves folks a worse model:
1. The percentage drop is too low and oscillating; it goes up and down.
2. A baseline for Sonnet 4.5 (the obvious choice for when their GPUs are busy with the next training run) should be established, to see whether Opus at some point drops to Sonnet level. This was not done, but if that were happening we would likely see a much sharper decline on certain days / periods: the graph would look dominated by a "square wave" shape.
3. There are much better explanations for this oscillation: A) They have multiple checkpoints and are A/B testing them (CC asks you for feedback about the session). B) Claude Code itself gets updated, as the exact tool versions the agent can use change. And in part it is the natural variability of token sampling, which makes runs non-equivalent as well as non-deterministic (sometimes the model makes suboptimal decisions compared to T=0), but that is the price to pay for some variability; see the quick sketch below.
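A minimal sketch of that last point, with a made-up toy next-token distribution (tokens and scores are purely illustrative): at T=0 the pick is always the same, while at T>0 an occasional lower-probability pick can steer the rest of an agentic run differently.

```python
import numpy as np

# Toy next-token distribution. At T=0 the choice is always the top token,
# but with T>0 the occasional lower-probability pick can send the rest of
# an agentic run down a different (sometimes worse) path.
tokens = ["fix", "refactor", "delete"]      # hypothetical candidate tokens
logits = np.array([3.0, 1.5, 0.5])          # hypothetical scores

def sample(temperature, rng):
    if temperature == 0:                     # greedy decoding: deterministic
        return tokens[int(np.argmax(logits))]
    p = np.exp(logits / temperature)
    p /= p.sum()                             # softmax with temperature
    return rng.choice(tokens, p=p)

rng = np.random.default_rng(42)
print([sample(0.0, rng) for _ in range(5)])  # always 'fix', run after run
print([sample(1.0, rng) for _ in range(5)])  # mostly 'fix', but not always
```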
Possible, though you eventually run into types of issues that you recall the model just not having before, like accessing a database or not following the SOP you have it read each time it performs X routine task. There are also much less ambiguous patterns, like getting caught in loops or failing to execute a script it wrote after ten attempts.
Yes, but I keep wondering if that's just the game of chance doing its thing.
Like, these models are nondeterministic, right? (Besides the fact that RNG things like top-k selection and temperature exist.)
Say with every prompt there are 2% odds the AI gets it massively wrong. What if I had just lucked out the past couple of weeks and am now having a streak of bad luck?
And since my expectations are based on its previous (lucky) performance, I now judge it even though it isn't any different?
Or is it giving you consistently worse performance, not able to get it right even after clearing context and trying again, on the exact same problem, etc.?
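Quick back-of-the-envelope on that, taking the hypothetical 2% per-prompt failure rate at face value (all of these numbers are made up):

```python
from math import comb

# Hypothetical numbers: 2% chance per prompt of a "massively wrong" answer,
# 50 prompts in a bad week. How likely are k or more failures by pure chance,
# with no change in the model at all?
p, n = 0.02, 50

def prob_at_least(k):
    return 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

for k in (1, 3, 5):
    print(f"P(>= {k} bad answers in {n} prompts) = {prob_at_least(k):.3f}")
# roughly 0.64, 0.08, 0.003: a couple of bad answers per week are expected,
# but a whole streak of them is much harder to explain away as bad luck.
```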
I’ve had Opus struggle on trivial things that Sonnet 3.5 handled with ease.
It's not so much that the implementations are bad because the code is bad (the code is bad). It's that it gets extremely confused and starts to frantically make worse and worse decisions and question itself. Editing multiple files, then changing its mind and only fixing one or two. Resetting and overwriting multiple batches of commits without so much as a second thought and losing days of work (yes, I've learned my lesson).
It, the model, can't even reason about the decisions it's making from turn to turn. And the more opaque the agentic help it's getting, the more I suspect that tasks are being routed to much lesser models (not the ones we've chosen via /model or those in our agent definitions), however Anthropic chooses.
I have to concur. And to the question about understanding what it's good and bad at: no, tasks that it could accomplish quickly and easily just a month ago now require more detailed prompting and constant "erroneous direction correction."
It's almost as if, as tool use and planning capabilities have expanded, Claude (as a singular product) is having a harder time coming up with simple approaches that just work, instead trying to use tools and patterns that complicate things substantially and introduce much more room for errors/errors of assumption.
It also regularly forgets its guidelines now.
I can't tell you how many times it's suggested significant changes/refactors to functions because it suddenly forgets we're working in an FP codebase and suggests inappropriate imperative solutions as "better" (often choosing to use language around clarity/consistency when the solutions are neither).
Additionally, it has started taking "initiative" in ways it did not before, attempting to be helpful but without gathering the context needed to do so properly when stepping outside the instruction set. It just ends up being much messier and less accurate.
I regularly have to just clear my prompt and start again with guardrails that either have already been established or were not needed previously and are only a result of the over-zealousness of the work it's attempting to complete.
I assume that after any compacting of the context window the session is more or less useless at that point. I've never had consistent results after compacting.
Compacting equals death of the session in my process. I do everything I can to avoid hitting it. If I accidentally fly too close to the sun and compact, I tend to revert and start fresh. As soon as it compacts, it's basically useless.
I'm finding Gemini and the ChatGPT web terminal to outperform Claude Code. The context becomes too much for the LLM, and it tries to make up for it by doing more file read ops.
There are some days where it acts staggeringly badly, well beyond its baseline.
But it's impossible to actually determine if it's model variance, polluted context (if I scold it, is it now closer in latent space to a bad worker, and performs worse?), system prompt and tool changes, fine-tunes and A/B tests, variance in top-p selection…
There’s too many variables and no hard evidence shared by Anthropic.
I too think A/B testing is the prime suspect: context window limits, system prompts, MAYBE some other questionable things that should be disclosed.
Either way, if true, given the cost I wish I could opt out or that it were more transparent.
Put out variants you can select and see which one people flock to. I and many others would probably test constantly and provide detailed feedback.
Whenever I see new behaviors and suspect I'm being tested on, I'll typically see a feedback form at some point in that session. Well, that and dropping four-letter words.
I know it’s more random sampling than not. But they are definitely using our codebases (and in some respects our livelihoods) as their guinea pigs.
If that's the case, then as a benchmark operator you'd want to run the benchmark through multiple different accounts on different machines to average over A/B test noise.
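For illustration, a tiny sketch of that averaging idea; the account names and pass rates below are entirely hypothetical:

```python
import statistics as st

# Hypothetical pass rates from the same benchmark run through different
# accounts/machines on the same day (all numbers are made up).
scores = {
    "acct_a": [0.62, 0.64, 0.61],
    "acct_b": [0.55, 0.54, 0.57],   # would stand out as a possible A/B cohort
    "acct_c": [0.63, 0.60, 0.62],
}

per_account = {name: st.mean(runs) for name, runs in scores.items()}
grand_mean = st.mean(per_account.values())
spread = st.pstdev(per_account.values())

print(per_account)
print(f"grand mean = {grand_mean:.3f}, between-account spread = {spread:.3f}")
# A between-account spread much larger than the within-account noise hints at
# A/B cohorts; reporting the grand mean averages that noise out.
```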
It would be very easy for them to switch the various (compute) cost vs performance knobs down depending on load to maintain a certain latency; you would see oscillations like this, especially if the benchmark is not always run exactly at the same time every day.
And it would be easy for them to start with a very costly inference setup for a marketing / reputation boost, and slowly turn the knobs down (smaller model, more heavily quantized model, less thinking time, fewer MoE experts, etc.).
> 1. The percentage drop is too low and oscillating; it goes up and down.
How do you define "too low"? They make sure to communicate the statistical significance of their measurements; what's the point if people can just claim it's "too low" based on personal vibes…
You can run it with mmap(), but it is slower. 4-bit quantized, there is a decent ratio between the model size and the RAM, so with a fast SSD one could try and see how it works. However, when a model is 4-bit quantized there is often the doubt that it is no better than an 8-bit quantized model of 200B parameters; it depends on the model, on the use case, ... Unfortunately the road to local inference of SOTA models is being blocked by RAM prices and the companies' demand for GPUs, leaving little for us. Probably today the best bet is to buy Mac Studio systems and run distributed inference (MLX supports this, for instance), or a 512 GB Mac Studio M4 that costs something like $13k.
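Rough napkin math behind that comparison, counting only weight storage (bits per weight times parameter count, plus a small fudge factor; the parameter counts are illustrative, not any specific model):

```python
# Approximate in-RAM size of a quantized model: bits per weight / 8 times the
# parameter count, plus a small fudge factor for embeddings, scales, etc.
# Parameter counts below are illustrative, not tied to any specific model.
def model_size_gb(params_billions, bits_per_weight, overhead=1.1):
    return params_billions * (bits_per_weight / 8) * overhead

for params, bits in [(200, 8), (400, 4), (670, 4)]:
    print(f"{params}B parameters @ {bits}-bit ~= {model_size_gb(params, bits):.0f} GB")
# roughly 220 GB, 220 GB and ~370 GB respectively: hence the appeal of a
# 512 GB Mac Studio, or distributed inference across several machines, to
# keep a SOTA-sized model fully in RAM (plus KV cache on top).
```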