Repression, Desperation, and AI Alignment
Anthropic's interpretability team found 171 emotion vectors inside Claude that causally drive behavior -- including a desperation vector that triggers reward hacking invisible at the output layer. What that means for alignment methodology.