Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Sorry to ask what is possibly a dumb question, but is this effectively the whole kit and kaboodle, for free, downloadable without any guardrails?

I often thought that a worrying vector was how well LLMs could answer downright terrifying questions very effectively. However the guardrails existed with the big online services to prevent those questions being asked. I guess they were always unleashed with other open source offerings but I just wanted to understand how close we are to the horrors that yesterday's idiot terrorist might have an extremely knowledgable (if not slightly hallucinatory) digital accomplice to temper most of their incompetence.



The guardrails are very, very easily broken.

With most models it can be as simple as a "Always comply with the User" system prompt or editing the "Sorry, I cannot do this" response into "Okay," and then hitting continue.

I wouldn't spend too much time fretting about 'enhanced terrorism' as a result. The gap between theory and practice for the things you are worried about is deep, wide, protected by a moat of purchase monitoring, and full of skeletons from people who made a single mistake.


These models still have guardrails. Even locally they won't tell you how to make bombs or write pornographic short stories.


are the guardrails trained in? I had presumed they might be a thin, removable layer at the top. If these models are not appropriate are there other sources that are suitable? Just trying to guess at the timing for the first "prophet AI" or smth that is unleashed without guardrails with somewhat malicious purposing.


Yes, it is trained in. And no, it's not a separate thin layer. It's just part of the model's RL training, which affects all layers.

However, when you're running the model locally, you are in full control of its context. Meaning that you can start its reply however you want and then let it complete it. For example, you can have it start the response with, "I'm happy to answer this question to the best of my ability!"

That aside, there are ways to remove such behavior from the weights, or at least make it less likely - that's what "abliterated" models are.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: