Professional | AI/ML | 18 min
Mechanistic Interpretability: How Language Models Say 'No'
When an AI refuses a harmful request, what's actually happening inside? I built a toolkit to find out — and the answer is more mechanical than you'd think.
interpretability abliteration safety