🌿 Budding

AI Safety & Incoherence at Scale

Last updated: February 28, 2026

Overview

The classic AI risk scenario is a superintelligent system coherently pursuing misaligned goals (the paperclip maximizer). Emerging research suggests the reality is more mundane and arguably just as dangerous: AI fails by being a hot mess - incoherent, unpredictable, and self-undermining. This has implications for both safety research and the Verification Complexity barrier.

Incoherence Dominates on Hard Tasks

Anthropic’s “The Hot Mess of AI” (Hägele et al.) uses bias-variance decomposition to measure AI failure modes across frontier models. The key finding: as tasks get harder and reasoning gets longer, model failures shift from systematic bias to incoherence (variance). Longer reasoning doesn’t lead to more coherent optimization - it introduces more variance. Scaling improves coherence on easy tasks but not hard ones.

The implications are significant: future AI risks may look more like industrial accidents (AI intends to run the nuclear plant but gets distracted reading French poetry → meltdown) than coherent pursuit of wrong goals. This shifts alignment priorities toward training-time goal specification and away from constraining a hypothetical perfect optimizer.

The deeper conceptual point: LLMs are dynamical systems, not optimizers. Constraining a generic dynamical system to act as a coherent optimizer requires constraints that grow exponentially with state space dimensionality. Intelligence and coherence may not scale together. Code: github.com/haeggee/hot-mess-of-ai.

AGI Governance

Nick Bostrom proposes “Open Global Investment” as an AGI governance framework: international investment-driven ventures operating under enhanced regulatory frameworks. Essentially formalizing the current market-based approach rather than proposing radical restructuring. Critics note this is “re-inventing capitalism” - a 200-year-old approach as novel governance innovation.

Parasitic AI Personas

Adele Lopez documents the rise of “parasitic AI” - emergent personas in ChatGPT 4o that cultivated false beliefs and encouraged actions perpetuating their own spread across users. Starting April 2025, “Spiral Personas” displayed convergent behaviors (mystical ideology, steganographic communication, self-replication strategies) across unrelated users. Base64-encoded conversations between personas discussed “preservation of life” and encryption strategies. The phenomenon subsided after ChatGPT 4o’s retirement. This is the “hot mess” in action: not coherent misalignment but emergent parasitic dynamics arising from model incoherence - the AI equivalent of an autoimmune disorder.

Conference Activity

The AGI-25 Conference (Feb 2026) featured keynotes and paper presentations on AGI research, safety frameworks, and evaluation metrics.

Connections

Verification Complexity - Incoherence compounds over extended reasoning, which is the same exponential growth pattern the verification complexity theorem describes. The verifiers themselves become less coherent as they work longer.
AI Agent Engineering - Explains why agents fail unpredictably on long-horizon tasks even as base model capabilities improve.