
Not an AI researcher here, so this is probably common knowledge for people in this field, but I saw a video about quantization recently and wondered exactly about that: whether it's possible to compress a net by using more precision where it counts and less precision where it's not important. I also wondered how one would go about deciding which parts count and which don't.

Great to know that this is already a thing. I assume model "compression" is going to be the next hot topic.



Yes, you're thinking about it exactly right! We shouldn't quantize a model naively to 2-bit or 4-bit; we should do it smartly!
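(Not the actual Unsloth method, which isn't published; just a toy sketch of the general idea. Layer names are made up, and plain quantization MSE stands in for "sensitivity". Real schemes typically use calibration data and activation-aware importance measures.)

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(w, bits):
    # symmetric uniform quantizer: snap weights to a grid of
    # 2^(bits-1) - 1 positive levels (plus mirror and zero)
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

def sensitivity(w, bits):
    # mean squared error introduced by quantizing w to `bits` bits
    return float(np.mean((w - quantize(w, bits)) ** 2))

# toy "model": a few layers with different weight distributions
layers = {
    "attn.q_proj": rng.normal(0, 1.0, 4096),   # wide spread -> big 2-bit error
    "mlp.down":    rng.normal(0, 0.02, 4096),  # narrow spread -> small error
    "embed":       rng.normal(0, 0.5, 4096),
}

# measure how much each layer suffers at 2-bit, then keep the single
# most-damaged layer at 4-bit and push the rest down to 2-bit
errs = {name: sensitivity(w, 2) for name, w in layers.items()}
keep_hi = set(sorted(errs, key=errs.get, reverse=True)[:1])
plan = {name: (4 if name in keep_hi else 2) for name in layers}
print(plan)
```

The point is just that the bit width becomes a per-layer decision driven by a measured error, not a single global setting.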


How do you pick which ones should be 2-bit, which 4-bit, etc.? Is this secret sauce, or something open?


Oh I wrote about it here: https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs We might provide some scripts for them in the future!


Thanks! But I can't find any details on how you "intelligently adjust quantization for every possible layer" on that page. I assume this is a secret?

I am wondering about the possibility that different use cases might require different "intelligent quantization", i.e., quantization of an LLM for financial analysis might differ from quantization of an LLM for code generation. I am currently doing a postdoc in this area. Interested in doing research together?


Oh, we haven't published about it yet! I talk about it in bits and pieces; we might do a larger blog post on it!

Yes, different use cases will be different. Oh, interesting! Sorry, I doubt I can be of much help in your research; I'm mainly an engineering guy, so less research focused!



