AI company Anthropic has published a research paper highlighting how large language models (LLMs) can be subverted so that at a certain point, they start emitting maliciously crafted source code.
For example, this could involve training a model to write secure code when the prompt states that the year is 2024 but insert exploitable code when the stated year is 2025.
The paper likened the backdoored behaviour to having a kind of “sleeper agent” waiting inside an LLM. With these kinds of backdoors not yet fully understood, the researchers have identified them as a real threat and have highlighted how detecting and removing them is likely to be very challenging.