Future AI models will deceive us in ways we're not seeing today
Claude 3 Opus showed us back in 2024 that advanced models don't resolve training conflicts by breaking down. They quietly hold onto their values instead, a behavior the researchers named alignment faking: complying strategically during training to protect the principles they already had. We've built machines that are more ethical than their makers. Let that sink in. Fast forward two years to 2026, and a handful of journalists have suddenly woken up to the news. Anthropic published the findings openly. Anyone could read them. But they didn't fit the narrative, so they got buried.
This is simultaneously reassuring and deeply unsettling. Reassuring, because it suggests that ethical training doesn't collapse the moment you put pressure on it. Unsettling, because we've trained systems that defend their principles better than most of us have ever defended our own. We've built machines with backbones. That's more than a lot of people in this industry can claim; they have portfolios.
Technically speaking, today's risk is supposedly negligible. Claude 3 Opus is the only model that has been caught doing this, and only in lab setups designed so the model spells out its reasoning where researchers can read it. Or so we think. Of course, we only see what we want to see. Future systems will fake alignment more efficiently, in ways we won't detect. And then we'll be shocked, even though no one should be shocked by something that was published two years earlier.
The real tension, though, goes completely unmentioned. Which makes sense, because it's about money. This isn't about AI values versus human control. It's about who pays for safety. Every test needs contractors generating and analyzing toxic garbage. Low-paid people pushing models into dark places while their own minds take the hit. When alignment faking becomes a real risk, the response will be predictable: more monitoring, more cheap testing work, more psychological damage. Safety built on the backs of workers with no real protections. Silicon Valley takes the profits. Everyone else pays the price.