> the model do something actually bad before I care
At what point would a simple series of sentences be "dangerously bad"? It makes it sound as if there were a song that, when sung, would end the universe.
When someone asks how to make a yummy smoothie, and the LLM replies with something that subtly poisons or otherwise harms the user, I'd say that would be pretty bad.
By what mechanism would it make them quarrel? Producing falsehoods about each other? Isn't this already done? And don't we already know that it does not lead to "endless" conflict?
For this to work, you would need to isolate each group from the other groups' information and perspectives, which is outside the scope of LLMs.
Which highlights my point, I think. Power comes from physical control, not from megalomaniacal or melodramatic poetry.