FuckButtons's comments | Hacker News

Bold of you to assume this is a quick fix. How many software projects have you worked on that went from a buggy, poorly optimized mess to a streamlined, efficient system? I can think of exactly zero from personal experience; all the ones I’ve worked on that were performant at the end had that in mind from their inception.

Have you tried this? How did it go?

Appropriate username.

Given the announcement from a few days ago of Google trying to get external investment, this is the follow-up, showing what that investment is good for. It’s also pretty light on details that would be of much use to competitors. “We made an accurate simulation system to test our system in before deployment” would be pretty mundane if you were talking about any other field of engineering.

There have been recent advances (within the last year) in scaling deep RL by a significant amount; their announcement is in line with a timeline of running enough experiments to figure out how to leverage that in post-training.

Importantly, this isn’t just throwing more data at the problem in an unstructured way. AFAIK, companies are getting as many git histories as they can and doing something along the lines of: have an LLM checkpoint pull requests, features, etc. and convert those into plausible input prompts, then run deep RL with passing the acceptance criteria / tests as the reward signal.
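
As a rough sketch of what that loop might look like (my illustration only; names like pr_to_task are hypothetical and this is not any particular lab’s actual pipeline):

    # Illustrative sketch: mine a repo's history, turn each merged PR into a
    # synthetic task prompt, and use "do the tests pass?" as a binary reward.
    import subprocess

    def pr_to_task(pr: dict) -> str:
        """Hypothetical: turn a mined pull request into a plausible user prompt.
        In practice an LLM would write this from the diff and PR description."""
        return f"Implement the following change in this repo:\n{pr['title']}\n{pr['description']}"

    def reward(repo_dir: str) -> float:
        """Binary reward signal: 1.0 if the repo's test suite passes, else 0.0."""
        result = subprocess.run(["pytest", "-q"], cwd=repo_dir, capture_output=True)
        return 1.0 if result.returncode == 0 else 0.0

    # A policy-gradient-style loop would then sample candidate patches from the
    # model for each prompt, apply them to a checkout, and reinforce those that
    # score 1.0.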


So, the hidden mental model that the OP is expressing but failed to elucidate is that LLMs can be thought of as compressing related concepts into approximately orthogonal subspaces of the vector space that is upper bounded by the superposition of all of their weights. Since training has the effect of compressing knowledge into subspaces, a necessary corollary is that there are now regions of the vector space that contain nothing very much. Those are the valleys that need to be tunneled through, i.e. the model needs to activate disparate regions of its knowledge manifold simultaneously, which seems like it might be difficult to do. I’m not sure this is a good way of looking at things, though, because inference isn’t topology, and I’m not sure that abstract reasoning can be reduced to finding ways to connect concepts that have been learned in isolation.
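
A toy numerical aside on the "approximately orthogonal subspaces" part (my illustration, not the commenter's): in high dimension, even randomly chosen directions are nearly orthogonal, which is what lets a model pack far more concept directions than it has dimensions.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 4096                                  # hidden dimension
    a = rng.standard_normal((8, d))           # stand-ins for one concept's feature directions
    b = rng.standard_normal((8, d))           # stand-ins for an unrelated concept's directions

    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)

    # Max |cosine similarity| across the two sets is roughly 0.05 at d=4096,
    # i.e. nearly orthogonal; the "valleys" above are the sparsely used regions
    # between such subspaces.
    print(np.abs(a @ b.T).max())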

I asked Gemini Deep Research to project when that will likely happen based on historical precedent. It guessed October 2027.

No one chose the economy we had before either.

You are who, what and where you are by virtue of historical accident. The serfs of the Middle Ages sure as shit didn’t want that economy either, nor did the unwashed masses dying at such an alarming rate of dysentery and cholera that the populations of major cities were only sustained through mass migration during industrialization in the 18th and 19th centuries. Nor did more or less any slave in human history.

The last ~150 years of economic freedom and prosperity for a large percentage of the population, widespread across industrialized economies, is the exception, not the rule, and the difference is, and always has been, that industrialized economies have required large numbers of skilled specialists to make the systems they developed work.

If you change that calculus, there is no historical precedent to assume that human societies won’t revert to the mean.


>You are who, what and where you are by virtue of historical accident. The serfs of the Middle Ages sure as shit didn’t want that economy either

At least many of them fought hard to keep what they liked - plenty of peasant revolts, and they came after the feudal lords when they overdid it...

Early industrial workers didn't like their economy and conditions either; they fought hard, organized, and won many changes to labour laws and conditions.

Us?


idk, I feel like I read a few things in history about people rejecting the economy that was forced on them and demanding better - maybe it was AI generated?

You can ask Claude to work with you step by step and use /rewind. It only shows the diff, though, which hides some of the problem: diffs can seem fine in isolation but have obvious issues when viewed in context.

Yeah, I guess if you have the IDE open and monitor unstaged git changes, it's a similar workflow. The other Cursor feature I use heavily is the ability to add specific lines and ranges of a file to the context. It feels like in the CLI this would just be pasted text, and Claude would have to work a lot harder to resolve the source file and range.

From my own usage, the former is almost always better than the latter, because it’s less like a lobotomy and more like a hangover, though I have run some quantized models that still seem drunk.

For actually useful work, any model I can run in 128 GB at full precision is far inferior to the models I can just barely get to run after REAP + quantization.

I also read a paper a while back about improvements to model performance in contrastive learning when quantization was included during training as a form of perturbation, to force the model toward a smoother loss landscape. It made me wonder whether something similar might work for LLMs, which I think might be what the people over at MiniMax are doing with M2.1, since they released it in fp8.
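
Roughly, the mechanism would look like generic quantization-aware training with a straight-through estimator; a minimal sketch under that assumption (the paper I'm half-remembering and MiniMax's actual recipe may well differ):

    import torch

    def fake_quant(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
        """Round weights to a low-bit grid in the forward pass while letting
        gradients flow through unchanged (straight-through estimator)."""
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max() / qmax
        w_q = torch.round(w / scale).clamp(-qmax, qmax) * scale
        # Forward value is w_q; the backward gradient w.r.t. w is the identity.
        return w + (w_q - w).detach()

    # Using fake_quant(layer.weight) in the forward pass acts as a perturbation
    # that nudges training toward weights whose loss is flat under rounding
    # noise, i.e. a smoother loss landscape.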

In principle, if the model has been effective during learning at separating and compressing concepts into approximately orthogonal subspaces (and assuming the white-box transformer architecture approximates what typical transformers do), quantization should really only impact outliers, which are not well characterized during learning.
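
A quick toy check of the "quantization mostly hurts outliers" intuition (again my illustration, not from the comment): with one per-tensor scale chosen by a clipping percentile, the handful of outlier weights absorb most of the error.

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.standard_normal(100_000)
    w[:10] *= 20                               # a few outlier weights

    def quantize(x, bits=4, clip_pct=99.9):
        qmax = 2 ** (bits - 1) - 1
        scale = np.percentile(np.abs(x), clip_pct) / qmax
        return np.clip(np.round(x / scale), -qmax, qmax) * scale

    err = np.abs(quantize(w) - w)
    print(err[:10].mean(), err[10:].mean())    # outlier error dwarfs the rest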


Interesting.

If this were the case, however, why would labs go to the trouble of distilling their smaller models rather than releasing quantized versions of the flagships?


You can't quantize a 1T model down to "flash"-model speed and token price. 4 bpw is about the limit of reasonable quantization, so you get a 2-4x (fp8/16 -> 4 bpw) reduction in weight size. Easier to serve, sure, but maybe not cheap enough to offer as a free tier.

With distillation you're training a new model, so its size is arbitrary: say 1T -> 20B, a 50x reduction, which can then also be quantized. AFAIK distillation is also simply faster and cheaper than training from scratch.
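
Back-of-the-envelope arithmetic behind those numbers (mine, using round figures):

    def weight_gib(params: float, bits_per_weight: float) -> float:
        # raw weight storage, ignoring KV cache and activations
        return params * bits_per_weight / 8 / 2**30

    print(weight_gib(1e12, 16))   # ~1863 GiB: 1T params at fp16
    print(weight_gib(1e12, 4))    # ~466 GiB:  same model at 4 bpw, 4x smaller but still huge
    print(weight_gib(20e9, 4))    # ~9.3 GiB:  a distilled 20B model at 4 bpw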


Hanlon's razor.

"Never attribute to malice that which is adequately explained by stupidity."

Yes, I'm calling labs that don't distill smaller sized models stupid for not doing so.

