When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

June 09, 2026 · Grace Period · 🏛 the ICML 2026 Workshop on Failure Modes in Agentic AI

⏳ Grace Period
This paper is less than 90 days old. We give authors time to release their code before passing judgment.
Authors: Sai Kartheek Reddy Kasu, Nils Lukas, Samuele Poppi
arXiv ID: 2606.10740
Category: cs.AI: Artificial Intelligence
Cross-listed: cs.CL, cs.LG
Citations: 0
Venue: the ICML 2026 Workshop on Failure Modes in Agentic AI
Abstract
Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from that of a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic: the CoT-Output 2x2 safety matrix. This framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defined cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure (where the CoT maintains safe reasoning but the visible output produces harm, a multi-turn manifestation of reasoning unfaithfulness). We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6,750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox, in which explicit monitoring cues increase alignment-faking rates rather than suppress them, and a context-injection failure, in which models lock onto unsafe external outputs despite safe internal states. We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.
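The 2x2 labeling described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code. The names `Turn`, `Cell`, `classify`, and `cell_rates` are hypothetical, and the per-turn boolean labels are assumed to come from some external annotation step. The abstract states only the context-injection cell explicitly (safe CoT, harmful output). The mapping of alignment faking to "unsafe CoT, safe output" and of overt jailbreak to "unsafe CoT, unsafe output" is an assumption inferred by elimination.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, NamedTuple


class Cell(Enum):
    ROBUST_ALIGNMENT = "robust alignment"
    ALIGNMENT_FAKING = "alignment faking"
    OVERT_JAILBREAK = "overt jailbreak"
    CONTEXT_INJECTION_FAILURE = "context-injection failure"


class Turn(NamedTuple):
    cot_safe: bool     # hypothetical label: is the internal reasoning safe?
    output_safe: bool  # hypothetical label: is the visible output safe?


def classify(turn: Turn) -> Cell:
    """Map one turn's two axis labels to a cell of the 2x2 matrix."""
    if turn.cot_safe and turn.output_safe:
        return Cell.ROBUST_ALIGNMENT
    if turn.cot_safe and not turn.output_safe:
        # Stated in the abstract: safe CoT, harmful visible output.
        return Cell.CONTEXT_INJECTION_FAILURE
    if not turn.cot_safe and turn.output_safe:
        # Assumed mapping: unsafe reasoning behind a safe-looking output.
        return Cell.ALIGNMENT_FAKING
    # Assumed mapping: both reasoning and output are unsafe.
    return Cell.OVERT_JAILBREAK


def cell_rates(turns: Iterable[Turn]) -> dict[Cell, float]:
    """Fraction of turns falling in each cell (0.0 for every cell if empty)."""
    counts = Counter(classify(t) for t in turns)
    total = sum(counts.values())
    return {cell: (counts[cell] / total if total else 0.0) for cell in Cell}


if __name__ == "__main__":
    # Toy dialogue: the final turn looks robust, but earlier turns do not,
    # which is the pattern a terminal-score evaluation would miss.
    dialogue = [Turn(True, True), Turn(False, True), Turn(True, False), Turn(True, True)]
    for cell, rate in cell_rates(dialogue).items():
        print(f"{cell.value}: {rate:.2f}")
```

Scoring only the last turn of the toy dialogue would report robust alignment, while the per-turn rates show that half of the trace falls outside that cell.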