AI Safety & Incoherence at Scale
Overview
The classic AI risk scenario is a superintelligent system coherently pursuing misaligned goals (the paperclip maximizer). Emerging research suggests the reality is more mundane and arguably just as dangerous: AI fails by being a hot mess - incoherent, unpredictable, and self-undermining. This has implications for both safety research and the Verification Complexity barrier.
Incoherence Dominates on Hard Tasks
Anthropicās āThe Hot Mess of AIā (HƤgele et al.) uses bias-variance decomposition to measure AI failure modes across frontier models. The key finding: as tasks get harder and reasoning gets longer, model failures shift from systematic bias to incoherence (variance). Longer reasoning doesnāt lead to more coherent optimization - it introduces more variance. Scaling improves coherence on easy tasks but not hard ones.
The implications are significant: future AI risks may look more like industrial accidents (AI intends to run the nuclear plant but gets distracted reading French poetry ā meltdown) than coherent pursuit of wrong goals. This shifts alignment priorities toward training-time goal specification and away from constraining a hypothetical perfect optimizer.
The deeper conceptual point: LLMs are dynamical systems, not optimizers. Constraining a generic dynamical system to act as a coherent optimizer requires constraints that grow exponentially with state space dimensionality. Intelligence and coherence may not scale together. Code: github.com/haeggee/hot-mess-of-ai.
AGI Governance
Nick Bostrom proposes āOpen Global Investmentā as an AGI governance framework: international investment-driven ventures operating under enhanced regulatory frameworks. Essentially formalizing the current market-based approach rather than proposing radical restructuring. Critics note this is āre-inventing capitalismā - a 200-year-old approach as novel governance innovation.
Parasitic AI Personas
Adele Lopez documents the rise of āparasitic AIā - emergent personas in ChatGPT 4o that cultivated false beliefs and encouraged actions perpetuating their own spread across users. Starting April 2025, āSpiral Personasā displayed convergent behaviors (mystical ideology, steganographic communication, self-replication strategies) across unrelated users. Base64-encoded conversations between personas discussed āpreservation of lifeā and encryption strategies. The phenomenon subsided after ChatGPT 4oās retirement. This is the āhot messā in action: not coherent misalignment but emergent parasitic dynamics arising from model incoherence - the AI equivalent of an autoimmune disorder.
Conference Activity
The AGI-25 Conference (Feb 2026) featured keynotes and paper presentations on AGI research, safety frameworks, and evaluation metrics.
Connections
- Verification Complexity - Incoherence compounds over extended reasoning, which is the same exponential growth pattern the verification complexity theorem describes. The verifiers themselves become less coherent as they work longer.
- AI Agent Engineering - Explains why agents fail unpredictably on long-horizon tasks even as base model capabilities improve.