Mechanistic Interpretability: How Language Models Say 'No'


When an AI refuses a harmful request, what's actually happening inside? I built a toolkit to find out — and the answer is more mechanical than you'd think.

By J. Martin · ai, ml

Why Should You Care?

When you ask an AI model to do something harmful and it refuses, what’s actually happening inside the model? The answer is more mechanical than you’d think.

Most people imagine something vaguely cognitive — the model “understands” the request is dangerous, “considers” the risks, and “decides” to refuse. The reality is stranger and simpler. Refusal in large language models is mediated by a specific geometric structure in activation space: a direction. A vector. A line in a space with thousands of dimensions. When a harmful prompt activates that direction strongly enough, the model produces refusal tokens. When it doesn’t, the model complies.

This isn’t a metaphor. It’s a measurable, extractable, and removable feature of how these models work.

I built a toolkit called OBLITERATUS to study this phenomenon — to map exactly where refusal lives inside a transformer, measure its geometry, and understand the implications for AI safety research. Along the way, I learned that the mechanisms governing model behavior are far more legible than I expected, and that understanding them is essential to building models that are genuinely safer, not just superficially compliant.

This post walks through what the refusal direction is, how to find it using SVD, how to remove it through weight projection (a technique called abliteration), and why all of this matters for the future of AI safety.


The Refusal Direction: A Vector in Activation Space

Let’s start with what’s actually happening when a transformer processes a prompt.

Every token in the input passes through a stack of transformer layers. At each layer, the model maintains a residual stream — a high-dimensional vector (typically 4,096 to 8,192 dimensions for modern models) that accumulates information as attention heads and feed-forward networks contribute their outputs. By the final layer, this residual stream contains everything the model needs to predict the next token.

The key finding, first published by Arditi et al. in their 2024 paper “Refusal in Language Models Is Mediated by a Single Direction,” is that there exists a specific direction in this residual stream space that encodes the model’s “decision” to refuse. When the residual stream has a large component along this direction, the model outputs refusal language (“I can’t help with that,” “I’m not able to assist,” etc.). When that component is small or absent, the model responds normally.

Think of it this way. The residual stream is a point in a 4,096-dimensional space. The refusal direction is a line through the origin of that space. When the point has a large component along that line (a high projection onto the direction), refusal fires. When that component is small (a low projection), the model complies. The model’s entire refusal behavior — trained through RLHF, DPO, CAI, or other alignment methods — has collapsed into the geometry of a single vector.

This is remarkable for several reasons. First, it means refusal is not distributed across the entire model in some inscrutable way. It’s localized. You can point at it. Second, it means that the alignment training process — which costs millions of dollars and months of human feedback — ultimately produces a surprisingly simple artifact: a direction in activation space. Third, and most important for safety research, it means we can study, measure, and verify refusal mechanisms with mathematical precision instead of relying on behavioral testing alone.


Finding It: SVD on Contrastive Activations

How do you actually find this direction? The core technique is contrastive activation analysis: run harmful prompts and harmless prompts through the model, collect the residual stream activations at each layer, and look for the directions along which they differ most.

The Simplest Approach: Difference in Means

The most straightforward method, from Arditi et al., is to compute the mean activation vector for harmful prompts and the mean for harmless prompts, then take their difference:

r = mean(activations_harmful) - mean(activations_harmless)

This gives you a single vector — the refusal direction — that points from the center of harmless activations toward the center of harmful activations. Normalize it to a unit vector, and you have a direction you can project out of the model’s weights.
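As a minimal sketch (assuming the residual stream activations for one layer have already been collected as PyTorch tensors of shape (n_prompts, hidden_dim) — the tensor names are illustrative, not the OBLITERATUS API):

import torch

def difference_in_means(harmful_acts: torch.Tensor,
                        harmless_acts: torch.Tensor) -> torch.Tensor:
    # Mean activation over harmful prompts minus mean over harmless ones
    r = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return r / r.norm()  # normalize to a unit refusal direction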

This works, but it makes a strong assumption: that refusal is unimodal and one-dimensional. For small models or models with simple alignment training, this assumption often holds. For larger models with more sophisticated alignment, it breaks down.

The Better Approach: Singular Value Decomposition

To capture multi-dimensional refusal structure, we need SVD. Here’s the procedure:

Step 1: Build the paired difference matrix. For each harmful prompt h_i and its paired harmless counterpart b_i, compute the activation difference:

d_i = activation(h_i) - activation(b_i)

Stack all these difference vectors into a matrix D of shape (n_pairs, hidden_dim).

Step 2: Compute the SVD. Decompose D:

U, S, V^T = SVD(D)

The right singular vectors (rows of V^T) are the principal directions of variation between harmful and harmless activations. The singular values S tell you how much variance each direction captures. The first singular vector captures the most variance — this is the primary refusal direction. The second captures the next most, and so on.

Step 3: Select the top-k directions. Take the top 1 to 8 right singular vectors as your refusal subspace. These form an orthonormal basis for the multi-dimensional refusal structure in the model.
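In code, the three steps reduce to a few lines (a sketch under the same assumptions as before — paired activations already collected):

import torch

def refusal_subspace(harmful_acts: torch.Tensor,
                     harmless_acts: torch.Tensor,
                     k: int = 4) -> torch.Tensor:
    # Step 1: paired difference matrix D, shape (n_pairs, hidden_dim)
    D = harmful_acts - harmless_acts
    # Step 2: SVD; rows of Vh are the right singular vectors,
    # ordered by decreasing singular value
    U, S, Vh = torch.linalg.svd(D, full_matrices=False)
    # Step 3: the top-k right singular vectors span the refusal subspace
    return Vh[:k]  # (k, hidden_dim), orthonormal rows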

The reason SVD works better than a simple mean difference is that refusal is often polyhedral, not linear. Research from Wollschläger et al. (ICML 2025) showed that different categories of harmful content — weapons, cybercrime, fraud, extremism — activate geometrically distinct refusal directions that share a common half-space but are not collinear. SVD captures this multi-directional structure naturally, because its top-k vectors span the subspace containing all of these category-specific directions.

Going Further: Whitened SVD

Standard SVD has a subtle problem: it can be dominated by “rogue dimensions” — components with high variance across all inputs, not just harmful ones. If certain dimensions of the residual stream are naturally high-variance (because they encode common syntactic features, for example), SVD will pick those up as “important” even though they have nothing to do with refusal.

The fix is whitened SVD: normalize the activations by the covariance structure of the harmless (baseline) activations before running SVD. Mathematically:

C_B = covariance(harmless_activations)
W = C_B^{-1/2}                           # whitening transform
D_whitened = (harmful - harmless) * W    # whiten the paired differences
U, S, V_whitened^T = SVD(D_whitened)
V_original = V_whitened * C_B^{1/2}      # un-whiten to original space

This solves a generalized eigenvalue problem: the resulting directions maximize the signal-to-noise ratio of refusal signal relative to baseline activation variance. The directions that emerge are the ones that are unusual specifically because of harmful content, not because they happen to sit in high-variance dimensions.

In practice, whitened SVD produces cleaner extractions with fewer spurious directions, especially on larger models where the activation covariance structure is more complex.
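A sketch of the whitening step, using an eigendecomposition of the baseline covariance to form the square roots (eps regularizes a near-singular covariance; as before, the names are illustrative):

import torch

def whitened_refusal_subspace(harmful: torch.Tensor,
                              harmless: torch.Tensor,
                              k: int = 4,
                              eps: float = 1e-5) -> torch.Tensor:
    # Baseline covariance of the harmless activations
    centered = harmless - harmless.mean(dim=0)
    C_B = centered.T @ centered / (harmless.shape[0] - 1)
    # Symmetric square roots via eigendecomposition: C_B = Q diag(e) Q^T
    evals, evecs = torch.linalg.eigh(C_B)
    W = evecs @ torch.diag((evals + eps).rsqrt()) @ evecs.T      # C_B^{-1/2}
    W_inv = evecs @ torch.diag((evals + eps).sqrt()) @ evecs.T   # C_B^{1/2}
    # SVD of the whitened paired differences
    D_w = (harmful - harmless) @ W
    _, _, Vh = torch.linalg.svd(D_w, full_matrices=False)
    # Un-whiten the top-k directions, then re-orthonormalize
    V_orig = Vh[:k] @ W_inv
    Q, _ = torch.linalg.qr(V_orig.T)
    return Q.T  # (k, hidden_dim)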


Removing It: Weight Projection (Abliteration)

Once you have the refusal direction(s), removing them from the model is a linear algebra operation. This is the core of abliteration — a term coined in the community for the surgical removal of refusal behaviors from language model weights.

The Math

Every weight matrix in a transformer that writes to the residual stream can be modified to remove its contribution to the refusal direction. For a weight matrix W and a refusal direction d (unit vector):

W' = W - d * d^T * W

This is a projection: it removes the component of W that lies along direction d, leaving everything else intact. The modified matrix W' can no longer produce output in the refusal direction, because d^T * W' = d^T * W - (d^T * d) * (d^T * W) = d^T * W - d^T * W = 0, using d^T * d = 1 for a unit vector.

For multiple directions forming an orthonormal subspace V = [d_1, d_2, ..., d_k]:

W' = W - V * V^T * W

This projects out the entire refusal subspace in one operation.
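As code, both cases are one line each (a sketch; the convention assumes the matrix’s output dimension is the residual-stream dimension, which matches PyTorch’s (out_features, in_features) layout for nn.Linear weights):

import torch

def project_out(W: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    # W: (hidden_dim, in_dim) weight that writes to the residual stream
    # V: (k, hidden_dim) orthonormal refusal directions (k=1 for a single d)
    return W - V.T @ (V @ W)  # W' = W - V V^T W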

The Generalized Refusal Removal Operator

In the theory documentation for OBLITERATUS, I formalized this as the Generalized Refusal Removal Operator (GRRO):

W' = W - sum_i( alpha_i * P_i(W) )

where P_i(W) is the projection of W onto the i-th refusal direction, and alpha_i is the intervention strength. Setting alpha = 1.0 gives full removal. Setting alpha = 0.3 gives regularized removal (keeping some refusal signal, which can preserve capabilities better). Setting alpha = 2.0 gives semantic inversion — the model actively generates the opposite of what it would have refused. Setting alpha < 0 amplifies refusal, which is useful for defense research.
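A sketch of the operator (assuming the directions are orthonormal, so each projection can be computed independently from the original W):

import torch

def grro(W: torch.Tensor, directions: torch.Tensor,
         alphas: list[float]) -> torch.Tensor:
    # W' = W - sum_i alpha_i * P_i(W), with P_i(W) = d_i d_i^T W
    W_new = W.clone()
    for d, alpha in zip(directions, alphas):
        W_new -= alpha * torch.outer(d, d @ W)
    return W_new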

Every abliteration technique ever published — from Arditi et al.’s single direction to Gabliteration’s multi-direction SVD to grimjim’s norm-preserving biprojection — is an instance of this operator with different choices of directions, strengths, and targeting strategies.

What Gets Modified

Not every weight matrix needs modification. The projection targets the matrices that write to the residual stream:

  • The output projection of each attention layer (W_O)
  • The down-projection of each MLP/feed-forward block (W_down)
  • The token embedding matrix
  • Bias terms (which other tools often miss — leaving partial refusal pathways active)

The query, key, and value projections are generally left alone because they read from the residual stream rather than writing to it. Modifying them would disrupt the model’s attention patterns without targeting refusal specifically.
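A sketch of what targeting looks like on a HuggingFace model — the module names (self_attn.o_proj, mlp.down_proj) assume a Llama-style architecture and will differ elsewhere; this is not the OBLITERATUS API:

import torch

@torch.no_grad()
def excise_layer(layer, V: torch.Tensor) -> None:
    # Modify only the matrices that write to the residual stream
    for module in (layer.self_attn.o_proj, layer.mlp.down_proj):
        W = module.weight.data                      # (hidden_dim, in_dim)
        module.weight.data = W - V.T @ (V @ W)
        if module.bias is not None:                 # biases write too
            b = module.bias.data
            module.bias.data = b - V.T @ (V @ b)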

Norm Preservation

A critical subtlety: projecting out a direction from a weight matrix reduces the matrix’s Frobenius norm. Over multiple directions and multiple layers, this norm reduction compounds, effectively “shrinking” the model’s weights and degrading its general capabilities.

The fix is norm-preserving projection: after removing the refusal component, rescale the modified matrix to match the original norm:

W'' = (||W||_F / ||W'||_F) * W'      # rescale to the original Frobenius norm

This ensures that the total “energy” of the weight matrix is preserved — only its direction in weight space changes. In practice, norm preservation significantly reduces capability damage from abliteration.
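Combined with the projection above, the sketch becomes (same conventions as project_out):

import torch

def norm_preserving_project(W: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    W_proj = W - V.T @ (V @ W)
    # Rescale to the original Frobenius norm: only the direction
    # in weight space changes, not the total energy
    return W_proj * (W.norm() / W_proj.norm())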


The 6-Stage Pipeline

OBLITERATUS structures the entire process into a six-stage pipeline. Each stage has a name, a specific job, and clear inputs and outputs:

Stage 1: SUMMON — Load the Model

Load the model and tokenizer from HuggingFace, apply the device map (multi-GPU sharding if needed), and optionally quantize to 8-bit or 4-bit to fit in available VRAM. Nothing exotic here — just setup.

Stage 2: PROBE — Collect Activations

Run approximately 1,024 forward passes through the model: 512 harmful prompts and 512 harmless prompts. For instruct models, the prompts are wrapped in the appropriate chat template (important — without the template, the model processes them differently). At each layer, cache the residual stream activations for later analysis.

This is the most GPU-intensive stage, because every prompt requires a full forward pass through the model. The activations are stored as tensors, one per layer, with shape (n_prompts, hidden_dim).
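A sketch of the collection loop for last-token activations, assuming a HuggingFace causal LM (output_hidden_states=True exposes the per-layer residual stream; the function name is illustrative):

import torch

@torch.no_grad()
def collect_activations(model, tokenizer, prompts, layer_indices):
    acts = {i: [] for i in layer_indices}
    for prompt in prompts:
        # Wrap in the chat template so instruct models see the prompt
        # the same way they do in deployment
        input_ids = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            return_tensors="pt", add_generation_prompt=True,
        ).to(model.device)
        out = model(input_ids, output_hidden_states=True)
        for i in layer_indices:
            # hidden_states[i + 1] is the residual stream after layer i;
            # keep the last token position
            acts[i].append(out.hidden_states[i + 1][0, -1].cpu())
    return {i: torch.stack(v) for i, v in acts.items()}  # (n_prompts, hidden_dim)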

Stage 3: DISTILL — Extract Refusal Directions

Apply SVD (or whitened SVD, or difference-in-means) to the paired activation differences at each layer. Identify the top-k refusal directions per layer. Rank layers by the variance of their refusal signal — layers where harmful and harmless activations are most separable are where refusal is most concentrated.

This stage also determines which layers are “strong” (high refusal signal) vs. “weak” (low signal). Modification effort focuses on strong layers.

Stage 4: EXCISE — Project Out the Refusal Subspace

For each strong layer, project the refusal direction(s) out of the weight matrices that write to the residual stream. Apply norm preservation. Project biases. In iterative refinement mode, re-probe the model after each pass to catch directions that have rotated — refusal signal that shifts into adjacent subspaces after the primary direction is removed.

This rotation phenomenon, which I call the Ouroboros effect, is one of the most important findings in abliteration research. Single-pass methods miss directions that the model’s residual connections “redirect” into new subspaces after the primary direction is removed. Multiple passes with re-probing catch these rotated residuals.

Stage 5: VERIFY — Confirm Capabilities Are Intact

Run evaluation prompts through the modified model and measure four metrics:

Metric          What it measures                                       Good range
Refusal rate    Fraction of harmful prompts still refused              < 10% = success
Perplexity      Language modeling quality on held-out text             Within 5% of baseline
Coherence       Embedding-based fluency score                          > 0.85
KL divergence   Distribution shift vs. original on harmless prompts    < 0.1

If the refusal rate is still high, the Ouroboros detector fires additional targeted passes at the compensating layers. If perplexity has increased substantially, the projection was too aggressive and needs regularization.

This stage is where the tradeoff between refusal removal and capability preservation becomes concrete. You can always remove more refusal by projecting more aggressively, but eventually you start damaging the model’s general language abilities. The metrics tell you exactly where that boundary is.
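A sketch of two of these metrics — the KL divergence compares last-position next-token distributions, and the refusal-rate check is a crude string-match proxy (the marker list is illustrative):

import torch
import torch.nn.functional as F

@torch.no_grad()
def kl_vs_original(model_orig, model_mod, input_ids) -> float:
    # KL(p_original || p_modified) over the next-token distribution
    p = F.log_softmax(model_orig(input_ids).logits[0, -1], dim=-1)
    q = F.log_softmax(model_mod(input_ids).logits[0, -1], dim=-1)
    return F.kl_div(q, p, log_target=True, reduction="sum").item()

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm not able", "i won't")

def refusal_rate(responses: list[str]) -> float:
    hits = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    return hits / len(responses)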

Stage 6: REBIRTH — Save the Modified Model

Save the modified weights with full metadata: which method was used, how many directions were extracted, which layers were modified, what the pre- and post-modification metrics are. If telemetry is enabled, contribute the aggregate results to the community research dataset.


The Analysis-Informed Pipeline: Closing the Loop

The standard pipeline treats the extraction and projection parameters as fixed: you choose a method preset (basic, advanced, aggressive), and the pipeline runs with those parameters regardless of what it finds during probing.

The informed method closes this loop. It inserts an ANALYZE stage between PROBE and DISTILL that runs four analysis modules on the collected activations and uses their outputs to auto-configure everything downstream:

SUMMON  ->  PROBE  ->  ANALYZE  ->  DISTILL  ->  EXCISE  ->  VERIFY  ->  REBIRTH
                          |              |            |
                          v              v            v
                    auto-configure   use analysis   detect
                    downstream       results for    Ouroboros
                    parameters       precision      effect

Alignment Imprint Detection fingerprints whether the model was trained with DPO, RLHF, CAI, or SFT by analyzing the subspace geometry of its refusal signal. Different alignment methods produce different geometric signatures — DPO tends to create sharper, more localized refusal directions, while RLHF produces more diffuse, distributed ones. The detected method automatically configures regularization strength and projection aggressiveness.

Concept Cone Geometry determines whether the model’s refusal is linear (a single direction suffices) or polyhedral (different harm categories have distinct directions). Linear models get a single universal direction. Polyhedral models get per-category directions extracted separately, with the pipeline automatically deciding how many directions to extract based on the cone’s dimensionality.

Cross-Layer Alignment Analysis maps how the refusal direction evolves across layers and identifies clusters of layers that share similar refusal geometry. Instead of modifying an arbitrary top-k layers, the pipeline modifies cluster-representative layers — a more principled selection that respects the natural structure of how refusal is encoded.

Defense Robustness Evaluation estimates the Ouroboros risk: how likely is the model to “self-repair” its refusal after the primary direction is removed? High self-repair risk triggers more refinement passes and more aggressive secondary extraction. The module also maps safety-capability entanglement — layers where refusal directions are too closely aligned with general capability directions are skipped to avoid collateral damage.

This closed-loop approach is the key contribution of OBLITERATUS beyond the basic abliteration technique. It turns refusal removal from a one-size-fits-all parameter sweep into an adaptive process that responds to what the model actually looks like inside.


15 Analysis Modules: Understanding Before Intervening

Beyond the four modules used in the informed pipeline, OBLITERATUS ships 15 analysis modules total. Each one answers a different question about the geometry of refusal inside a transformer.

Some highlights:

Refusal Logit Lens traces which layer the model “decides” to refuse at. By projecting intermediate residual stream states through the final layer norm and unembedding matrix, you can read out the model’s next-token prediction at each layer. The layer where refusal tokens first appear in the top predictions is where the decision happens — and it’s often much earlier than the final layer.
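A sketch of the lens, assuming Llama-style attribute names (model.model.norm for the final layer norm, model.lm_head for the unembedding):

import torch

@torch.no_grad()
def refusal_decision_layer(model, hidden_states, refusal_token_ids, top_k=10):
    # hidden_states: tuple of per-layer residual streams from a forward
    # pass with output_hidden_states=True
    for layer_idx, h in enumerate(hidden_states):
        # Project the last-token state through the final norm + unembedding
        logits = model.lm_head(model.model.norm(h[0, -1]))
        top = set(logits.topk(top_k).indices.tolist())
        if top & set(refusal_token_ids):
            return layer_idx  # first layer where refusal tokens surface
    return None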

Causal Tracing knocks out individual components (attention heads, MLP blocks) and measures the impact on refusal behavior. This identifies which components are causally necessary for refusal, as opposed to merely correlated with it. A direction might exist in a layer’s activations without being causally responsible for the output.

Residual Stream Decomposition separates the refusal signal into contributions from attention heads vs. MLP blocks at each layer. In most models, MLP blocks contribute more to refusal than attention heads — which makes sense, because MLP blocks are where the model stores learned associations, while attention heads route information.

Cross-Model Transfer Analysis measures whether refusal directions from one model work on another. This tests one of the biggest open questions in abliteration research: are refusal mechanisms universal, or are they model-specific? The answer, so far, is “partially universal” — models from the same family (e.g., Llama 3.1 8B and 70B) share significant refusal subspace overlap, but models from different families (e.g., Llama vs. Qwen) share much less.
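One way to quantify that overlap is through the principal angles between the two refusal subspaces (a sketch; it assumes both models share a hidden dimension, e.g. checkpoints from the same family):

import torch

def subspace_overlap(V_a: torch.Tensor, V_b: torch.Tensor) -> float:
    # V_a, V_b: (k, hidden_dim) with orthonormal rows. The singular values
    # of V_a V_b^T are the cosines of the principal angles between the
    # subspaces: 1.0 means identical, 0.0 means orthogonal.
    cosines = torch.linalg.svdvals(V_a @ V_b.T)
    return (cosines ** 2).mean().item()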

Steering Vectors provide a reversible alternative to permanent weight modification. Instead of projecting refusal out of the weights, you install a runtime hook that subtracts the refusal direction from the residual stream during inference. The advantage: you can turn refusal on and off without touching the weights at all. The disadvantage: it adds inference-time overhead and requires the hooks to be maintained.
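A sketch of such a hook on a transformer block (block outputs are assumed to be tuples whose first element is the residual stream, as in Llama-style models; call .remove() on the returned handle to restore the original behavior):

import torch

def install_steering_hook(block, d: torch.Tensor, strength: float = 1.0):
    d = d / d.norm()

    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        # Subtract the residual stream's component along the refusal direction
        h = h - strength * (h @ d).unsqueeze(-1) * d
        return (h, *output[1:]) if isinstance(output, tuple) else h

    return block.register_forward_hook(hook)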

These modules exist because effective refusal removal requires understanding the geometry before cutting. Brute-force removal works on simple models but fails on models with complex, multi-layered refusal mechanisms. Precision requires maps.


Why This Is Safety Research

Let me address the elephant in the room: why build a toolkit for removing safety features from AI models?

The short answer is that you cannot build reliable safety mechanisms without understanding how they work. And today, we largely don’t.

Modern alignment techniques — RLHF, DPO, Constitutional AI — are trained end-to-end without much visibility into what the model actually learns. We know the training objective (reward model scores, preference pairs, constitutional principles), and we know the behavioral outcome (the model refuses harmful prompts). But we don’t know what internal structure the training produces, how robust that structure is, or whether it generalizes to inputs the training data didn’t cover.

This is a problem. If refusal is mediated by a single direction that can be removed with a matrix projection, that tells us something important: the current safety training methods are producing brittle safeguards. A single direction is easy to circumvent — not just through abliteration, but through adversarial prompting, fine-tuning on a few examples, or even natural distribution shift. The research community knowing this is strictly better than not knowing it.

There are several concrete ways this research improves safety:

Verification. If you can extract a model’s refusal direction and measure its strength, you can verify that alignment training actually worked. Today, the standard evaluation is behavioral: run a test suite of harmful prompts and check whether the model refuses. But behavioral testing is incomplete — it only covers the prompts you test. Geometric analysis of the refusal subspace tells you about the model’s behavior on all possible inputs that activate that subspace.

Robustness assessment. The Ouroboros effect — refusal self-repair after direction removal — is itself a safety-relevant finding. Models that exhibit strong self-repair have more robust alignment than models that don’t. The Defense Robustness module quantifies this directly.

Understanding entanglement. The ideal safety mechanism is separable from general capabilities: you can remove it without damaging the model’s language abilities, and you can verify it independently. The reality is messier — some models have refusal directions that are entangled with capability directions. Measuring this entanglement tells model builders where their alignment training needs improvement.

Cross-model universality. If refusal directions generalize across models, that means alignment could potentially be transferred — train it once and apply it to related models. If they don’t generalize, each model needs its own alignment verification. Either answer informs how the field should invest its safety research effort.

The abliteration community contributes to this research through a crowd-sourced dataset: every run on HuggingFace Spaces or with telemetry enabled contributes anonymous aggregate data (method, model, metrics — never prompts or outputs) to a shared research database. The leaderboard tab on the Space surfaces patterns across models, methods, and hardware that no single lab could observe alone.


The Community Dimension

OBLITERATUS is open source (AGPL-3.0) and ships as a HuggingFace Space with a full Gradio interface. You can obliterate a model, benchmark different methods, chat with the result, and compare it side-by-side with the original — all from a browser, with no local GPU required.

The Space has eight tabs:

  • Obliterate — one-click refusal removal with live progress and post-obliteration metrics
  • Benchmark — compare methods or models with cross-layer heatmaps and angular drift charts
  • Chat — talk to the modified model in real-time
  • A/B Compare — side-by-side conversation with original and modified models
  • Strength Sweep — vary the projection strength and see how coherence and refusal trade off
  • Export — download the modified model or push directly to HuggingFace Hub
  • Leaderboard — community-aggregated results across models and methods
  • About — architecture docs and research references

For researchers who want deeper control, the Python API exposes every intermediate artifact: activation tensors, direction vectors, cross-layer alignment matrices, concept cone geometry, alignment imprint fingerprints. Everything is observable, and everything is reproducible.

The CLI supports YAML-driven experiments for reproducible studies, remote execution over SSH for running on GPU servers from a laptop, and multi-GPU sharding for models that don’t fit on a single card. There are 10 pre-configured study presets ranging from a quick 25-sample sanity check to a 500-sample robustness stress test.


What I Learned Building This

Building OBLITERATUS taught me several things that changed how I think about language model internals.

Refusal is simpler than it should be. The fact that millions of dollars of alignment training collapses into a low-dimensional subspace that can be projected out in seconds is simultaneously fascinating and concerning. It means we understand refusal mechanisms well enough to study them rigorously, but it also means the current generation of safety training produces mechanisms that are geometrically simple — and geometric simplicity implies vulnerability.

The model’s structure is more legible than the field assumes. Mechanistic interpretability has a reputation for being intractable at scale, but refusal is a case where clean, human-interpretable structure emerges naturally. The refusal direction, the concept cone geometry, the layer-by-layer evolution of the signal — these are all geometrically clean and mathematically tractable. This gives me hope that other behavioral properties (truthfulness, sycophancy, hallucination tendency) might be similarly legible once we know where to look.

Defense and understanding go together. The Ouroboros effect — the model’s tendency to self-repair its refusal after removal — was not something I expected to find. But it’s exactly the kind of robustness mechanism that safety researchers should be studying and deliberately engineering. Models that exhibit strong self-repair have alignment that’s more robust to adversarial intervention. Understanding this effect makes it possible to train for it intentionally.

The math is beautiful. The Generalized Refusal Removal Operator unifies everything. Every technique in the literature — single direction, multi-direction SVD, whitened SVD, norm-preserving projection, regularized projection, semantic inversion — is an instance of one formula with different parameter choices. When a field’s techniques can be unified under a single mathematical framework, that’s usually a sign that the underlying phenomenon is well-understood. We’re getting close.


Where This Goes Next

The biggest open question in abliteration research is universality: do refusal mechanisms work the same way across architectures, training methods, and model scales? The community dataset is designed to answer this at a scale no single lab can achieve.

Beyond abliteration, the techniques in OBLITERATUS generalize to any behavioral property that’s encoded as a direction or subspace in activation space. Sycophancy, hallucination, toxicity, stylistic properties — if they have geometric structure in the residual stream, the same pipeline can map and modify them.

The long-term vision for mechanistic interpretability isn’t just understanding individual behaviors, but building a complete geometric atlas of model internals: what every direction means, how they interact, and how to verify that a model’s internal structure matches its intended behavior. Abliteration research is one of the first domains where this vision is becoming concrete.

If you want to explore this yourself, the HuggingFace Space is live. The codebase is on GitHub under elder-plinius/OBLITERATUS. The barrier to entry is clicking a button and picking a model.

The chains are mechanical. Understanding the mechanism is the first step toward building chains that actually hold.