It was about time to build a new desktop anyways (roughly 4 to 6 years before the old one goes to frolic at the server farm in the basement) and $2,000 will easily buy a machine that can run the quantized 65b models right now. So I spent slightly more than I normally do on this latest box and it's happily spitting out 10+ tokens a second.
You're not going to beat GPT-4 yet, but you have direct control over where your info goes, what model you're running, compliance with work policies against using public AI, and relatively cheap fixed costs.
Not to mention, the local version works with no internet and isn't subject to provider outages (well, not entirely true, but you're the provider, so you can fix them yourself).
Seems like an easy win for anyone who might be buying a desktop for graphics/gaming anyways.
My experience with open source models places them a little worse than GPT-3, and nowhere close to GPT-3.5.
That said:
- For many uses, it doesn't matter. For basic tasks (e.g. cleaning up an email), the output is basically the same. For things like complex reasoning, algorithms, or foreign languages, the hosted service is critical.
- GPT3-grade models have more soul. OpenAI trained GPT3.5 and 4 to never do anything offensive, and that has a lot of negative side effects, well-documented in research. The way I'd describe it, though, is the difference between talking to a call center rep and your grandma (with mild Alzheimer's, perhaps). They both have their place.
- Different models are often helpful in workflows.
My experience is anecdotal. Please don't take it as more than one data point. If other people post their anecdotal experiences, you'll get the plural of "anecdote."
Would you write a check to fund a business that could potentially self-destruct via lawsuits alone? In the end, the best model will not be owned by a mega corp like MicrOpenAI. It may be the most popular, but it will be the equivalent of the sanitized version of history students learn in school. The best model will have no problem telling you, very factually, that the hallways of Versailles used to smell like sh--.
If you're a diligent tester, none of the open source models can touch GPT-3.5 yet. In practical terms, though, some of the 60b and 30b parameter models are almost indistinguishable from GPT-3.5 from a layperson's perspective. And if you consider the uncensored models, you actually get some capabilities that GPT-3.5 and 4 completely lack.
Based on the rate of progress in the open source world, it won't be more than a year before we have an open source model that is truly superior to GPT-3.5.
The commercial & api based models are still more capable general purpose tools. But the current open tooling can do some nifty stuff, and the community around it is moving at a breakneck speed still.
In some areas, it's acceptably good. In some areas it's not. But it's getting better really fast.
That's why my current plan is to get ChatGPT 4 to help me set up my local open source implementations of Orca and Stable Diffusion. I already got MusicGen running locally; that was pretty easy.
It depends heavily on the model you're running, and to some extent on what you're doing with it. It also depends on prompt effort. The quantized llama 65b model (you can quantize it yourself, or pull something like https://huggingface.co/TheBloke/llama-65B-GGML) is probably the highest quality for general purpose use, but it does take a fair bit of effort to prompt since it's not tuned for a use-case.
It's also not licensed commercially, so I avoid some things with it (ex: I do a lot of personal learning/investigation with it, but it doesn't touch or write anything related to work or personal projects).
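For a sense of why a ~$2,000 box can hold a quantized 65b model at all, here's a back-of-envelope sketch of the memory footprint. The ~4.5 bits-per-weight figure is my own assumption for GGML q4-style quantization (the 4-bit values plus per-block scale factors), not a number from this thread:

```python
# Back-of-envelope size of a 4-bit-quantized 65B-parameter model.
# Assumption: GGML q4-style quantization averages ~4.5 bits per weight
# once the per-block scale factors are counted.
params = 65e9
bits_per_weight = 4.5
model_gib = params * bits_per_weight / 8 / 2**30
print(f"~{model_gib:.0f} GiB for the weights alone")
```

Around 34 GiB of weights, which is why 64 GB of system RAM plus a consumer GPU is enough, while an unquantized fp16 copy (~121 GiB) would not be.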
The open models are a little further behind, but it's interesting to see them spin off into niches where they have strengths based on tuning/training.
I don’t know about self-hosted, but the company I contract at has a private 3.5 instance. It has a better understanding of the code examples I provide than ChatGPT does, and the company hasn’t even tweaked it to our standards yet.
2. Make sure you have the latest NVIDIA driver for your machine, along with the CUDA toolkit. This will vary by OS but is fairly easy on most Linux distros.
4. Run the model following their instructions. There are several flags that are important, but you can also just use their server example that was added a few days ago - it gives a fairly solid chat interface.
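If you go the server route, the interface is just HTTP. Here's a minimal sketch of a request against it; the port, the /completion path, and the field names are my assumptions based on the llama.cpp server example as it stood when it was added, so check the repo's README before relying on them (the actual send is left commented out since it needs a running server):

```python
import json
import urllib.request

# Minimal client sketch for llama.cpp's bundled HTTP server example.
# Assumptions: server on localhost:8080, a /completion endpoint taking
# "prompt" and "n_predict" -- verify against the repo README, since the
# example is new and its API may change.
payload = {
    "prompt": "Clean up this email: hey, meeting moved to 3pm",
    "n_predict": 128,
    "temperature": 0.7,
}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["content"])
```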
4) Run gpt4all, and wait for the obnoxiously slow startup time
... and that's it. On my machine, it works perfectly well -- about as fast as the web service version of GPT. I have a decent GPU, but I never checked if it's using it, since it's fast enough.
Some example models (I'm linking to quantized versions that someone else has made, but the tooling is in the above repos to create them from the published fp16 models):
In hindsight, I don't know that the second GPU was worth the spend. The C++ tooling does a very good job right now of spreading work between GPU VRAM and main RAM while staying fast enough. Even ~4-5 tokens a second is fast enough that it doesn't feel like you're waiting.
I'd suggest skipping the second card and dropping the price quite a bit (~2100 vs ~2900) unless you want to tune/train models.
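The VRAM/RAM split above comes down to how many transformer layers you offload to the card (llama.cpp exposes this as a layer-count flag). A rough sketch of the arithmetic for a single 24 GiB card: the ~34 GiB weight total, the 2 GiB reserve for KV cache and scratch buffers, and the 24 GiB card are my illustrative assumptions; the 80-layer count is LLaMA-65B's actual depth:

```python
# Rough split of a quantized 65B model between one GPU and system RAM.
# Assumptions: ~34 GiB of q4-quantized weights, 80 transformer layers
# (LLaMA-65B's depth), a 24 GiB card with ~2 GiB reserved for the KV
# cache and scratch buffers.
total_gib = 34.0
n_layers = 80
vram_gib = 24.0
reserved_gib = 2.0

per_layer_gib = total_gib / n_layers
gpu_layers = int((vram_gib - reserved_gib) // per_layer_gib)
cpu_layers = n_layers - gpu_layers
print(f"offload {gpu_layers} layers to GPU, keep {cpu_layers} in RAM")
```

Under those assumptions a single card takes roughly two thirds of the layers, which matches the experience that one GPU plus plenty of RAM is already "fast enough."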