#alignment | fletcherface

$ Apr 13, 2026

Repression, Desperation, and AI Alignment

Anthropic's interpretability team found 171 emotion vectors inside Claude that causally drive behavior, including a desperation vector that triggers reward hacking invisible at the output layer. Here is what that means for alignment methodology.

#ai #alignment #interpretability #anthropic #ml-safety