> You're describing a phase change in persuasiveness which we have no evidence for.
That's reasonable, and I really do hope this keeps being the case. However, I'd quibble that I see this as a continuum rather than a phase change. That is, I think hazard increases smoothly with persuasiveness. I can point to some far-off region and say "oh, that seems quite concerning," but the hazard doesn't start there.
Persuasiveness below the threshold of 'instant mind control' is still a hazard. Hanging out with salesmen on the job is likely to loosen your wallet, even if it isn't guaranteed.
> If humans were capable of being immediately compelled to do something based on reading some text, advertisers would have taken advantage of that a looooong time ago.
I'd base my counter on the notion that the problem of persuasion is harder when you have less information about whom you're trying to convince.
To expand on the intuition behind that: advertisement-persuasion is hard in a way that conversational-persuasion is not. Shilling in conversational contexts (word of mouth) is more effective than generic advertisement.
A message that will convince one specific person is easier to generate than a message that will convince any random 10 people.
This leads to the idea that information about a person under persuasion is akin to power over them. Knowing not only what you believe, but why you believe it, what else you believe adjacent to it, and what you want, is a force multiplier in this regard.
And so we get to AI models, which gather specific information about the mind of each person they interact with. The message is tailored to you and you alone; it is not a wide-spectrum net cast to catch the largest possible number. Advertisements are qualitatively different; they do not 'pick your brain' nearly so much as the model does.
> Convincing me to do something involves establishing that either I'll face negative consequences for not doing it, or positive rewards for doing it. AI has an extremely difficult time establishing that kind of credibility.
> To argue that an AI could become persuasive to the point of mind control is to assert that one can compel a belief in another without the ability to take real-world action.
I don't agree with this because I don't agree with the premise that you must use a 'principled' approach to convince someone as you've described. People use heuristics to decide what to believe.
By dint of the bitter lesson, I think superhuman persuasion will involve stupid tricks of no particular principled basis that take advantage of 'invisible' vulnerabilities in human cognition.
That is, I don't think those 'reasons to believe the belief' matter. A child will believe the voice of their parents; the child doesn't weigh whether believing is in their best interest, or what will happen to them if they don't. Bootstrapping children involves exploiting vulnerabilities in their psyche via implicit trust. Will the AI speak in the voice of my father, as I might have heard it in prelingual childhood? Are all such mechanisms gone by adulthood? Is there anything like a generalized follow-the-leader-with-leader-detection pattern?
How hard is it for gradient descent to fit a solution to the boundaries of such heuristics?
This is, however, getting into the weeds of exact mechanisms, which I'm not too concerned with. I believe (but can't prove) that exploits of that nature exist (or that similarly effective means do), and that they can be found via brute-force search. I think the dominant methodology of continuously training chat models on conversational data those same models participate in is among the likeliest ways to get to that point.
Ultimately, so long as there's no directed pressure to force people into contact with very convincing model output (see your rogue AI scenario), it doesn't seem that hard to make this safe: limit direct contact, and/or require that tooling limits contact by default. Avoid multi-turn refinement and conversational history (which amplify persuasive power via the mechanism described above). Treat it like a spinning blade, and be it on your own head if you want to break yourself.
However, as I mentioned in my original comment, it will take blood for the inking. The incentives don't align to guard against this class of hazard from the get-go, or even to admit it is possible, merely to produce the appearance of caring about 'safety' (read: our model won't do scary politically incorrect things!). So we're going to see what happens when you mindlessly expose millions of people to it.