A handful of poisoned training documents can backdoor the world's largest AI models. Researchers...

A handful of poisoned training documents can backdoor the world’s largest AI models. Researchers just proved it.

Anthropic, working with the UK’s AI Security Institute and The Alan Turing Institute, published the research this month: among the trillions of tokens these models train on, scraped from across the internet, a few carefully crafted documents can insert hidden behaviours.

The model works perfectly. Passes every test. Gets deployed.

Then someone uses the trigger phrase, and the behaviour flips.

The number that’s striking is 250- out of billions in the training data. The backdoors don’t need to look malicious. Ordinary question+answer pairs can slip past every filter.

And model size doesn’t protect you. Whether it’s the smallest or largest LLM, the vulnerability is the same.

This is the exponential gap in microcosm: we can build planet-scale AI faster than we can secure it.

We’re training on data we don’t control, can’t fully audit, and increasingly can’t trust. The datasets are too large for human review. The sources are too distributed to verify. The economic pressure to ship fast means security comes second.

Scale gave us capability. It also gave us vulnerability.

💬 Join the conversation on LinkedIn

View on LinkedIn →

FG

Felix Ghauri

Futures Forum

Felix helps organisations navigate AI and exponential change. He writes about technology, geopolitics, and the future of work.

Connect on LinkedIn Let's Talk

A handful of poisoned training documents can backdoor the world's largest AI models. Researchers...

Felix Ghauri

More Insights

AI has become the most market-friendly explanation a CEO can give for cutting headcount.

Perplexity doesn't make any AI models. Yesterday it launched a product that orchestrates 19 of them.

Ukraine: the world's first AI-native military

Thinking about AI in your workflow?