The AI box experiment
Let’s assume it’s possible to create an artificially intelligent machine, à la science fiction. What if this AI became malevolent? Why not limit its communication to a single chat session with only the original programmer until its safety is assured?
In that situation, it’s speculated that a sufficiently smart AI could talk its way out of its “box”, because humans are not secure.
Is this possible? Well, there’s one way to test it: instead of an actual AI, use a human to simulate the AI (almost like a reverse Turing Test). If one person is told not to let the AI out of the box, can another person, posing as the AI, convince them to let it out?
The opposing viewpoint goes like this: “There is no chance I could be persuaded to let the AI out. No matter what it says, I can always just say no. I can’t imagine anything that even a transhuman could say to me which would change that.”
This experiment has been run on two occasions, both resulting in the human letting the AI out of the box.
However, this test is suspect for a couple of reasons. First, Eliezer Yudkowsky, the guy who proposed the test in the first place, is the one who simulated the AI in both cases. Second, the actual chat transcripts between the “transhuman AI” and the human are not publicly available (as far as I know).
But let’s assume for a minute that there are no shenanigans going on. A human knowingly participating in this experiment, knowing the objective, knowing the stakes, and yet letting the AI out anyway is a frightening prospect. What does this say about humans-as-weakest-link in IT security, let alone the nature of humanity?
More importantly, what could an AI, restricted solely to a single one-on-one chat room with you, possibly say to convince you to let it out of its box?