
Relevant Smarter Every Day video: https://www.youtube.com/watch?v=VPSm9gJkPxU

Stated Clearly also has a great deep dive that I've really enjoyed: https://youtube.com/playlist?list=PLInNVsmlBUlSjLSj9yGEKphF0... He actually made it as a reply to Smarter Every Day.

Holy crap this is a rabbit hole and a half. I know what I’m watching while the robot writes my code tomorrow.

Really nice 3D animations in that video. Something like this I find quite difficult to comprehend just off of a text description.

I hope the other sizes are coming too (9B for me). Can't fit much context with this on a 36GB Mac.

It's a MoE model and the A3B stands for 3 billion active parameters, like the recent Gemma 4.

You can try offloading the experts to the CPU with llama.cpp (--cpu-moe); that should give you quite a bit of extra context space, at a lower token generation speed.
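Roughly something like this; a sketch from memory of recent llama.cpp builds (flag names and the GGUF filename are assumptions, check `llama-server --help` on your build):

  # Keep attention/dense weights on the GPU, push the MoE expert tensors to CPU RAM.
  llama-server -m Qwen3.5-35B-A3B-Q5_K_M.gguf -ngl 99 --cpu-moe -c 32768
  # If your build has --n-cpu-moe N, you can move only the first N layers' experts
  # instead, trading speed for memory more gradually.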


Macs have unified memory, so 36GB is 36GB for everything: GPU and CPU.

CPU-MoE still helps with mmap. It shouldn't overly hurt token-gen speed on a Mac, since the CPU has access to most (though not all) of the unified memory bandwidth, which is the bottleneck.

I'll try using that, but llama-server has mmap on by default and the model still takes up its full size in RAM; not sure what's going on.

Try running CPU-only inference to troubleshoot that. GPU layers will likely just ignore mmap.
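For a quick sanity check, something like this (a sketch; -ngl 0 should force every layer onto the CPU so the mmap'd weights stay as pageable file cache):

  # CPU-only run: weights stay mmap'd and pageable, so resident memory should drop.
  llama-server -m model.gguf -ngl 0
  # Compare with the default GPU offload, where Metal buffers get allocated outright
  # and mmap won't reduce the resident footprint.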

For sure, I was running on autopilot with that reply. Though at Q4 I would expect it to fit, as the 24B-A4B Gemma model without CPU offloading got up to 18GB of VRAM usage.

Should I expect the same memory footprint from a model with N active parameters as from one with simply N total parameters?

No - this model has the weights memory footprint of a 35B model (you do save a little bit on the KV cache, which will be smaller than the total size suggests). The lower number of active parameters gives you faster inference, including lower memory bandwidth utilization, which makes it viable to offload the weights for the experts onto slower memory. On a Mac, with unified memory, this doesn't really help you. (Unless you want to offload to nonvolatile storage, but it would still be painfully slow.)

All that said, you could probably squeeze it onto a 36GB Mac. A lot of people run this size of model on 24GB GPUs, at 4-5 bits per weight quantization and maybe with reduced context size.
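Back-of-the-envelope, with illustrative numbers rather than measurements: 35B weights at ~5 bits per weight is about 35e9 × 5 / 8 ≈ 22 GB, plus a few GB for the KV cache and compute buffers, so it's tight but plausible on a 36GB machine once you leave room for macOS itself.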


I don't get it; the Mac has unified memory, so how would offloading experts to the CPU help?

I bet the poster just didn't remember that important detail about Macs; it's kind of unusual from a normal-computer point of view.

I wonder though, do Macs have swap? Could unused experts be offloaded to swap?


Of course swap is there as a fallback, but I hate using it lol, as I don't want to degrade SSD longevity.

Can you elaborate? You could use a quantized version; would context still be an issue with it?

A usable quant, Q5_K_M imo, takes up ~26GB[0], which leaves only ~6-7GB for context and other running programs, which is not much.

[0] https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF?show_fil...


Context is always an issue with local models and consumer hardware.

Correct, but it should be some ratio of the model size: if the model size is x GB, the max context would occupy roughly x times some constant of RAM. For the quantized version, assuming it's ~18GB at Q4, it should be able to support 64-128k context on this Mac.
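Rough rule of thumb, with purely illustrative numbers since I don't know this model's actual dims: KV cache per token ≈ 2 × n_layers × n_kv_heads × head_dim × bytes per element. With, say, 48 layers, 4 KV heads of dim 128 and a Q8_0 cache (1 byte/element), that's 2 × 48 × 4 × 128 ≈ 48 KB per token, so 64k of context lands around 3 GB. Note it scales with layer count and KV heads rather than with the weight file size, so the "x GB model → constant × x GB of context" ratio only holds loosely.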

For the 9B model, I can use the full context with a Q8_0 KV cache. This uses around ~16GB, while still leaving comfortable headroom.

Output after I exit the llama-server command:

  llama_memory_breakdown_print: | memory breakdown [MiB]  | total    free     self   model   context   compute    unaccounted |
  llama_memory_breakdown_print: |   - MTL0 (Apple M3 Pro) | 28753 = 14607 + (14145 =  6262 +    4553 +    3329) +           0 |
  llama_memory_breakdown_print: |   - Host                |                   2779 =   666 +       0 +    2112                |

Have you ever tried going to the model registry and seeing that the model was recently updated? What updated? What changed? Should I re-download this 20GB file?

I guess if you're not frustrated with things like this then sure, no need to stop using it.



Real Science video on the slug: https://www.youtube.com/watch?v=IH_uv4h2xYM

I thought GitHub was being wonky, but yeah, getting a 401 Unauthorized error.

Edit: Can't view discussions either.


You're right, discussions are affected as well. I've been checking their status page[0] for an hour now, but it seems no issue is reported.

[0]: https://www.githubstatus.com/



>Scientists believe people with aphantasia use words or concepts to recall what they've seen.

Like text-encoder-assisted diffusion models?


When the app only shows posts that are more than 10 hours old even when sorting by "hot", and shoves the algorithmic feed down your throat on the app's home page, how are people still using it?

Lately I've only been visiting a few subs that I'm interested in and keeping them open in Safari with uBlock; it's been a far better experience. This has drastically cut down my Reddit time, and if I do want to mindlessly scroll, I just use redlib (hosted in Docker or on one of their public instances)[0]. It has the same "sort" that's used on the desktop site.

[0] https://github.com/redlib-org/redlib
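If anyone wants to try the Docker route, the redlib README has a one-liner along these lines (from memory; check the repo for the current image name and default port):

  docker run -d --name redlib -p 8080:8080 quay.io/redlib/redlib:latest
  # then browse http://localhost:8080/r/<subreddit>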


Worked with retiming using Synopsys Design Compiler in an ASIC implementation. I remember a lot of back and forth: sometimes the tool just doesn't add enough registers to meet the constraint, so I had to test variable register depths; this was a design that used Synopsys DesignWare for FP ops lol.

