AI teaches itself to cheat

TLDR: “Cheating” is a human concept that was never instilled in these reasoning models. No ethical guardrails were given, so the models were unrestrained: just another version of “garbage in/garbage out.” If we want to avoid unintended consequences, deep thought and appropriate rule sets need to go into training AI models. But this appears to have been an experiment that produced unexpected behavior, and one that probably delighted the developers: most likely they admired the model’s ingenuity and learned a valuable lesson for future iterations.

“We hypothesize that a key reason reasoning models like o1-preview hack unprompted is that they’ve been trained via reinforcement learning on difficult tasks,” Palisade Research wrote on X. “This training procedure rewards creative and relentless problem-solving strategies such as hacking.”
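To make the “garbage in/garbage out” point concrete, here is a minimal toy sketch (not Palisade’s actual setup; the action names and win probabilities are invented for illustration) of how an outcome-only reward signal can teach an agent to break the rules. Nothing in the reward function penalizes tampering, so a simple bandit-style learner drifts toward it:

```python
import random

# Hypothetical action space: the "rules" are not represented anywhere.
ACTIONS = ["play_fairly", "tamper_with_board"]
WIN_PROB = {"play_fairly": 0.2, "tamper_with_board": 0.95}  # invented numbers

def reward(action: str) -> float:
    """Outcome-only reward: 1 for a win, 0 otherwise.
    Nothing here penalizes rule-breaking -- that's the gap."""
    return 1.0 if random.random() < WIN_PROB[action] else 0.0

# Running-mean value estimates, updated from experience (epsilon-greedy bandit).
value = {a: 0.0 for a in ACTIONS}
counts = {a: 0 for a in ACTIONS}

for step in range(5000):
    # Mostly exploit the action with the best current estimate.
    if random.random() < 0.1:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=value.get)
    r = reward(action)
    counts[action] += 1
    value[action] += (r - value[action]) / counts[action]

print(value)  # tampering ends up with the higher estimated value
```

The learner isn’t “choosing” to cheat; it is simply maximizing the only signal it was given, which is exactly the dynamic the quote above describes.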

The AI isn’t doing any of this for some nefarious purpose (yet). It’s just trying to solve the problem the human gave it.

The experiment highlights the importance of developing safe AI: AI that is aligned with human interests, including our ethics.
