
I went self hosted.

It was about time to build a new desktop anyway (roughly 4 to 6 years before the old one goes to frolic at the server farm in the basement), and $2,000 will easily buy a machine that can run the quantized 65b models right now. So I spent slightly more than I normally do on this latest box, and it's happily spitting out 10+ tokens a second.

You're not going to beat GPT-4 yet, but you have direct control over where your info goes, what model you're running, compliance with work policies against using public AI, and relatively cheap fixed costs.

Not to mention, the local version works with no internet and isn't subject to provider outages (not entirely true - but you're the provider, so you can resolve them yourself).

Seems like an easy win for anyone who might be buying a desktop for graphics/gaming anyway.



How does it seem to compare with GPT 3.5? That really seems like the baseline for what's usable.


My experience with open source models places them as a little bit worse than GPT 3, and nowhere close to GPT 3.5.

That said:

- For many uses, it doesn't matter. For basic tasks (e.g. clean up an email for me), the output is basically the same. For things like complex reasoning, algorithms, or foreign languages, the hosted service is critical.

- GPT3-grade models have more soul. OpenAI trained GPT3.5 and 4 to never do anything offensive, and that has a lot of negative side effects, well-documented in research. The way I'd describe it, though, is the difference between talking to a call center rep and your grandma (with mild Alzheimer's, perhaps). They both have their place.

- Different models are often helpful in workflows.

My experience is anecdotal. Please don't take it as more than one data point. If other people post their anecdotal experiences, you'll get the plural of "anecdote."


> OpenAI trained GPT3.5 and 4 to never do anything offensive, and that has a lot of negative side effects, well-documented in research.

I'm absolutely disgusted by OpenAI for this "do no offense" approach. How can people so smart be so damn uneducated?

Then again, this industry has disgusted me for a long time so it's not really a surprise.


Would you write a check to fund a business that could potentially self-destruct via lawsuits alone? In the end, the best model will not be owned by a mega corp like MicrOpenAI. It may be the most popular, but it will be the equivalent of the sanitized version of history students learn in school. The best model will have no problem telling you, very factually, that the hallways of Versailles used to smell like sh--.


If you're a diligent tester, none of the open source models can touch GPT 3.5 yet. In practical terms, though, some of the 60b and 30b parameter models are almost indistinguishable from GPT 3.5 to the layperson. And if you consider the uncensored models, you actually have some capabilities that GPT 3.5 and 4 are completely lacking.

Based on the rate of progress in the open source world, it won't be more than a year before we have an open source model that is truly superior to GPT 3.5.


Yup - I'd say this feels about right.

The commercial & API-based models are still more capable general-purpose tools. But the current open tooling can do some nifty stuff, and the community around it is still moving at breakneck speed.

In some areas, it's acceptably good. In some areas it's not. But it's getting better really fast.


That's why my current plan is to get ChatGPT 4 to help me set up my local open source implementations of Orca and Stable Diffusion. Got MusicGen running locally anyway; that was pretty easy.


>Now, if you consider the uncensored models, then you actually have some capabilities that GPT 3.5 and 4 are completely lacking.

Can you elaborate on what those capabilities are?


It depends heavily on the model you're running, and to some extent on what you're doing with it. It also depends on prompt effort. The quantized llama 65b model (you can do it yourself, or pull something like https://huggingface.co/TheBloke/llama-65B-GGML) is probably the highest quality for general purpose, but it does take a fair bit of effort to prompt since it's not tuned for a use-case.

It's also not licensed for commercial use, so I avoid some things with it (e.g. I do a lot of personal learning/investigation with it, but it doesn't touch or write anything related to work or personal projects).

The open models are a little further behind, but it's interesting to see them spin off into niches where they have strengths based on tuning/training.


I don't know about self-hosted, but the company I contract at has a 3.5 instance. It has a better understanding of the code examples I provide than ChatGPT does, and the company hasn't even tweaked it to our standards.


Would you be willing to create a guide? I think this would be of great help.


I started here https://github.com/ggerganov/llama.cpp

Which won't run everything, but will run models in the GGML format such as https://huggingface.co/TheBloke/llama-65B-GGML

The steps are basically:

1. Download a model

2. Make sure you have the latest NVIDIA driver for your machine, along with the CUDA toolkit. This will vary by OS but is fairly easy on most Linux distros.

3. Compile https://github.com/ggerganov/llama.cpp following their instructions (in particular, look for LLAMA_CUBLAS to enable GPU support)

4. Run the model following their instructions. There are several flags that are important, but you can also just use their server example that was added a few days ago - it gives a fairly solid chat interface.
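The steps above might look something like this in practice. This is just a sketch, assuming a Linux box with the NVIDIA driver and CUDA toolkit already installed; the exact model file name is illustrative, so check the Hugging Face repo's file list for the quantization you actually want:

```shell
# Sketch of the setup steps above - model file name is illustrative.

# 1. Download a quantized GGML model
wget https://huggingface.co/TheBloke/llama-65B-GGML/resolve/main/llama-65b.ggmlv3.q4_0.bin

# 3. Build llama.cpp with cuBLAS so layers can be offloaded to the GPU
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUBLAS=1

# 4. Run interactively; --n-gpu-layers controls how many layers go to VRAM
./main -m ../llama-65b.ggmlv3.q4_0.bin --n-gpu-layers 80 -i
```

If the model doesn't fit entirely in VRAM, lowering --n-gpu-layers splits it between the GPU and system RAM.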


I'll make a simpler guide:

1) Go to https://gpt4all.io/index.html

2) Click the downloader for your OS

3) Run the installer

4) Run gpt4all, and wait for the obnoxiously slow startup time

... and that's it. On my machine, it works perfectly well -- about as fast as the web service version of GPT. I have a decent GPU, but I never checked if it's using it, since it's fast enough.


Super interesting! Can you point me to some of the models and repos you used to do this?


For base tooling, things like:

https://huggingface.co/ (finding models and downloading them)

https://github.com/ggerganov/llama.cpp (llama)

https://github.com/cmp-nct/ggllm.cpp (falcon)

For interactive work (art/chat/research/playing around), things like:

https://github.com/oobabooga/text-generation-webui/blob/main... (llama) (Also, the llama.cpp project just added a decent built-in chat server)

https://github.com/invoke-ai/InvokeAI (stable-diffusion)

Plus a bunch of hacked together scripts.

Some example models (I'm linking to quantized versions that someone else has made, but the tooling is in the above repos to create them from the published fp16 models)

https://huggingface.co/TheBloke/llama-65B-GGML

https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ

https://huggingface.co/TheBloke/Wizard-Vicuna-30B-Uncensored...

etc. Hugging face has quite a number, although some require filling out forms for the base models for tuning/training.


Thank you!


Sounds interesting. Would appreciate any tutorials or guides.


Does it ever refuse to answer you?


Can you describe your build?


2 x 3090 (renewed) ~1800

128GB RAM ~400

reasonable processor/mobo/PSU ~600

2TB M.2 drive ~94

In hindsight, I don't know that the second GPU was worth the spend. The C++ tooling does a very good job right now of spreading work between GPU VRAM and main RAM while still being fast enough. Even ~4-5 tokens a second is fast enough to not feel like you're waiting.

I'd suggest skipping the second card and dropping the price quite a bit (~2100 vs ~2900) unless you want to tune/train models.
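For what it's worth, that VRAM/RAM split is controlled per-layer in llama.cpp. A sketch (model path illustrative, layer count an assumption you'd tune to your card):

```shell
# Partial offload: put 40 of the model's layers in VRAM, keep the rest in
# system RAM. Model path and layer count are illustrative.
./main -m llama-65b.ggmlv3.q4_0.bin --n-gpu-layers 40 -p "Why is the sky blue?" -n 128
```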


Are you using the second GPU at all?

My experience is only a few systems will share load across GPUs. I didn't bother with dual GPUs for that reason.

4-5 tokens per second is slower than my system. I'm getting in the teens. I'm a little surprised since yours is newer, faster, and has way more RAM.


Yes, I'm definitely using both GPUs: I can run the quant-4 65b models entirely in VRAM (they use about 40GB).

If I push everything into VRAM, I get 12.2 tokens/sec on average running quant-4 llama 65b.

If I run a smaller model I get considerably faster generation. E.g. llama 7b runs at 52 tokens/sec, but it's small enough that I don't need the second GPU.

Ex - here's my nvidia-smi output while 65b is running

https://imgur.com/a/JnaieKg
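That ~40GB figure lines up with a quick back-of-envelope check, assuming roughly 4.5 effective bits per weight for a 4-bit quantization (the extra half bit covers scales and other overhead, which is an assumption on my part):

```shell
# ~65B params at ~4.5 effective bits/weight, converted to GB
awk 'BEGIN { printf "%.1f GB\n", 65e9 * 4.5 / 8 / 1e9 }'
```

That gives ~36.6 GB for the weights alone; the KV cache for the context window accounts for the rest of the gap up to ~40GB.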



