A discussion that is gaining momentum on Lobsters AI right now concerns something that should worry everyone building AI agents or deploying LLMs in production: prompt injection understood as role confusion.
Researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell have published an analysis arguing that LLMs essentially process all input as one large text stream. The model infers who is speaking from how the text sounds — not from the actual, technical source. That means if an attacker manages to write input that "sounds like" a system message or internal reasoning, the model genuinely interprets it as such.
This is not the usual "jailbreak" discussion about tricking a model into playing a character or circumventing content filters. It concerns something more fundamental: the model has no reliable internal mechanism for distinguishing between trusted and untrusted instructions. Jailbreaking is typically social manipulation. Role confusion is an architectural flaw.
The practical consequence is the attack they call "CoT Forgery" — where an attacker injects fake chains of thought (chain-of-thought reasoning) into the context. The model picks this up as its own internal logic and acts accordingly. In testing, this achieved an average success rate of 60% on the StrongREJECT benchmark and 61% on agent exfiltration scenarios. Up from near zero as a baseline. Those are significant numbers.

What makes this especially relevant right now is that AI agents — systems that use LLMs to retrieve data, execute code, and act autonomously — are becoming mainstream in the enterprise stack. If the model cannot trust its own understanding of who is issuing instructions, the chain of trust across the entire agent architecture is potentially compromised.
The source here is a discussion thread on Lobsters AI, which links to a dedicated research page. These are early community signals — not a published, peer-reviewed study yet, so treat them with that caveat in mind. But the engagement in the comments suggests the research community is taking this seriously.
This should be on the radar of everyone working with security in LLM applications — and especially those building systems where the model has access to sensitive data or can perform actions with consequences outside the sandbox.
