>When I run a local model (significantly) bigger than the 24GB of VRAM on the 4090, it won't even finish loading after 15 minutes, with the 4090 pegged at 100% the whole time. Eventually I just gave up.
Yeah, the key here is partial offloading. If you try to offload more layers than your GPU has memory for, you're gonna have a bad time. I find it infuriating that this is still something of a black art; there's definitely room for better tooling here.
Regardless, with 24GB of VRAM, I try to limit my offloading to about 20GB and let the rest go to system RAM. Maybe it's the nature of the 8x7B model I run that makes it offload better than other large models; I'm not sure. I definitely wouldn't try the 70B models.
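In llama.cpp-style runners the knob for this is `--n-gpu-layers` (`-ngl`). Since there's no good tooling for picking the value, here's a rough back-of-the-envelope sketch for estimating it from a VRAM budget. Every number in it (model size, layer count, overhead) is a made-up assumption for illustration, not a measurement of any particular model:

```python
# Rough sketch: estimate how many transformer layers fit in a VRAM budget,
# assuming weights are spread evenly across layers and reserving a fixed
# chunk for KV cache, CUDA context, etc. All figures are illustrative.

def layers_that_fit(vram_budget_gb: float,
                    model_size_gb: float,
                    n_layers: int,
                    overhead_gb: float = 2.0) -> int:
    """Return how many layers fit in the budget after overhead."""
    per_layer_gb = model_size_gb / n_layers
    usable_gb = vram_budget_gb - overhead_gb
    if usable_gb <= 0:
        return 0
    return min(n_layers, int(usable_gb // per_layer_gb))

# e.g. a hypothetical ~26GB quantized 8x7B with 32 layers,
# targeting 20GB of a 24GB card:
print(layers_that_fit(20, 26, 32))  # → 22
```

You'd then pass that number as `-ngl 22` and adjust down if you still see VRAM pressure; real per-layer sizes aren't perfectly uniform, so treat this as a starting point, not a guarantee.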