
Can you describe your build?


2 x RTX 3090 (renewed) ~$1,800

128GB RAM ~$400

reasonable processor/mobo/PSU ~$600

2TB M.2 drive ~$94

In hindsight, I don't know that the second GPU was worth the spend. The C++ tooling does a very good job right now of spreading work between GPU VRAM and main RAM while still being fast enough. Even ~4-5 tokens per second is fast enough that you don't feel like you're waiting.

I'd suggest skipping the second card and dropping the price quite a bit (~$2,100 vs ~$2,900) unless you want to tune/train models.
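For reference, the partial-offload setup is mostly a matter of telling the runtime how many layers to keep on the GPU. A minimal sketch, assuming the C++ tooling is llama.cpp driven through the llama-cpp-python bindings; the model path and layer count are placeholders to tune for your own card:

    from llama_cpp import Llama

    # Partial offload: keep some transformer layers in GPU VRAM;
    # the remaining layers stay in main RAM and run on the CPU.
    llm = Llama(
        model_path="./llama-65b.Q4_0.gguf",  # placeholder path
        n_gpu_layers=40,   # however many layers fit in a single 3090's 24GB
        n_ctx=2048,
    )

    out = llm("Explain quantization in one paragraph.", max_tokens=128)
    print(out["choices"][0]["text"])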


Are you using the second GPU at all?

My experience is that only a few frameworks will share load across GPUs. I didn't bother with dual GPUs for that reason.

4-5 tokens per second is slower than what I get on my system; I'm getting rates in the teens. I'm a little surprised, since yours is newer, faster, and has way more RAM.


Yes, I am definitely using both GPUs. I can run the 4-bit quantized 65B models entirely in VRAM (they use about 40GB).

If I push everything into VRAM, I get 12.2 tokens/sec on average running 4-bit quantized LLaMA 65B.

If I run a smaller model I get considerably faster generation. For example, LLaMA 7B runs at 52 tokens/sec, but it's small enough that I don't need the second GPU.

For example, here's my nvidia-smi output while 65B is running:

https://imgur.com/a/JnaieKg
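Roughly, the dual-GPU, everything-in-VRAM configuration looks like this. Again just a sketch, assuming llama.cpp via the llama-cpp-python bindings; the tensor_split ratios and model path are illustrative, not exact values from my setup:

    from llama_cpp import Llama

    # Full offload: every layer goes to VRAM, split roughly evenly
    # across the two 3090s (~40GB total for the 4-bit 65B weights).
    llm = Llama(
        model_path="./llama-65b.Q4_0.gguf",  # placeholder path
        n_gpu_layers=-1,          # -1 = offload all layers to the GPUs
        tensor_split=[0.5, 0.5],  # fraction of the model per GPU
        n_ctx=2048,
    )

    print(llm("Hello", max_tokens=32)["choices"][0]["text"])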




