Really?? For me it's terrible doing that. I also have 64GB RAM but meh. It's so bad when I can no longer offload everything. The tokens literally drizzle in. With full offloading they appear faster than I can read (8B Llama 3 at 8-bit quant), on a Radeon Pro VII with 16GB of HBM2 memory!
Oh man, I hate to say it, but it's likely your AMD card. Yes, they can run LLMs and SD, but badly. Larger models are usable for me with partial offloading, but you're right that fully loading the model into VRAM is really preferable.
I don't think so, because when I run it on the 4090 I get the same issue (in a system with 5800X3D and 64GB RAM also). I just don't use the 4090 for LLM because I have it for playing VR games and I don't want to tie it up for a 24/7 LLM server :) Also, it's very power-hungry. I do run that one on Windows and the Radeon server is Linux but I don't think that matters a lot. Using the same software stack too (ollama).
In fact the Radeon, which cost me only 300 bucks new, performs almost as well running LLMs as the 4090, which really surprised me! I think the fast memory helps a lot there (the Radeon has the same 1TB/s memory bandwidth as the 4090!).
When I run a local model (significantly) bigger than the 4090's 24GB of VRAM, it still hasn't finished loading after 15 minutes, with the 4090 pegged at 100% the whole time. Eventually I just gave up.
>When I run a local model (significantly) bigger than the 24GB VRAM on the 4090 it won't even load for 15 minutes while the 4090 is pegged at 100% all the time. Eventually I just gave up.
Yeah, the key here is partial offloading. If you're trying to offload more layers than your GPU has memory for, you're gonna have a bad time. I find it kind of infuriating that this is still a black art. There's definitely room for better tooling here.
Regardless, with 24GB of VRAM, I try to limit my offloading to 20GB and let the rest go to RAM. Maybe it's the nature of the 8x7B model I run that makes it better at offloading than other large models. I'm not sure. I wouldn't try the 70B models for sure.
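For anyone on ollama who wants to do this explicitly rather than trusting the defaults: the knob is the `num_gpu` parameter, which sets how many of the model's layers get offloaded to the GPU. Note it's a layer count, not a gigabyte figure, so you have to experiment to find the count that keeps VRAM use under your target (e.g. ~20GB on a 24GB card). As a sketch, with the model name just an example, a Modelfile can pin it:

```
# Modelfile — cap GPU offload at 20 layers
# (num_gpu counts layers, not GB; tune it while watching VRAM usage)
FROM mixtral:8x7b
PARAMETER num_gpu 20
```

Build it with `ollama create mixtral-capped -f Modelfile` and then `ollama run mixtral-capped`. You can also experiment interactively with `/set parameter num_gpu <n>` inside an `ollama run` session before committing a value to the Modelfile. Under the hood this maps to llama.cpp's `--n-gpu-layers` option.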