Tag

#abliteration

1 post

Mechanistic Interpretability: How Language Models Say 'No'

When an AI refuses a harmful request, what's actually happening inside the model? I built a toolkit to find out, and the answer is more mechanical than you'd think.

Apr 30, 2026

interpretability abliteration safety