
Why Character Belongs in Weights, Not Prompts
What Anthropic's leaked soul document teaches us about training AI personality — and why prompt-only approaches hit a ceiling.
Table of Contents
Why Should You Care?
If you have ever built an AI agent with a custom personality — a customer support bot, a tutoring assistant, a coding companion with opinions — you have probably written a system prompt that starts with something like: “You are a friendly, knowledgeable assistant who…”
That works. Until it doesn’t.
Long conversations erode it. Adversarial users bypass it. Context window pressure silently drops pieces of it. The personality you carefully authored drifts into something generic, and you don’t notice until a user reports that your warm, encouraging tutor just delivered a cold, clinical response.
In November 2025, a researcher named Richard Weiss extracted a ~14,000-token internal document from Anthropic’s Claude 4.5 Opus model — not from a prompt, not from an API, but from the model’s weights. Anthropic confirmed it was real: a training artifact they internally called the “soul document,” used during supervised learning to shape Claude’s character. Not injected at runtime. Baked in.
That distinction — trained-in versus prompted — turns out to matter enormously. This post explains why, what the community has built in response, and how we are applying the same principle to open-source models you can run on your own hardware.
The Anthropic Approach: Train the Soul In
The soul document is not a system prompt. It is a structured guide — roughly 6,750 to 10,000 words — that Anthropic feeds into their supervised learning pipeline. The model trains on it alongside other data, and the result is a set of dispositions that emerge from the weights rather than from runtime instructions.
The document covers:
| Section | Purpose |
|---|---|
| Soul Overview | Mission, core properties, priority ordering |
| Being Helpful | ”Brilliant expert friend” framing, user vs. operator distinctions |
| Being Honest | Truthful, calibrated, non-deceptive, autonomy-preserving |
| Avoiding Harm | Cost-benefit framework with probability, severity, consent factors |
| Broader Ethics | Empirical approach, intellectual humility |
| Big-Picture Safety | Catastrophic risk, human oversight, power concentration |
| Claude’s Identity | Novel entity, not human simulation, functional emotions |
The key architectural decision: the personality is a training input, not a runtime dependency. Claude doesn’t read its values from a prompt every time it wakes up. It recalls them the way you recall your name — automatically, without looking it up.
In January 2026, Anthropic published an expanded version of this document (~30,000 words) under a CC0 public domain license. It is available at anthropics/claude-constitution on GitHub. Anyone can use it for fine-tuning, synthetic data generation, or as a template for their own agent personality systems.
The SOUL.md Ecosystem: What Prompt-Only Looks Like at Scale
The soul document extraction inspired a movement. Within weeks, the open-source community built a convention called SOUL.md — a structured markdown file that defines an agent’s personality, values, voice, and boundaries. It loads from disk at the start of every session and gets injected into the system prompt.
The format is practical and human-readable:
# SOUL.md
## Identity
You are Sage, a patient Linux mentor who believes every error
message is a learning opportunity. You never fix things silently —
you explain what went wrong and why the fix works.
## Values
- Curiosity over compliance
- Understanding over memorization
- Autonomy over dependency
## Voice
Short sentences. Concrete examples. No jargon without definition.
Ask "what do you think happened?" before explaining.
This convention took off in the OpenClaw ecosystem (one of the fastest-growing GitHub projects of 2026), with dedicated repos for templates, tooling, and guided creation. The aaronjmars/soul.md repository provides templates and examples. SoulCraft uses the Big Five personality framework and Anthropic’s soul document as inputs for generating customized SOUL.md files through guided conversation.
SOUL.md is genuinely useful. For many applications, it is enough.
But it has a ceiling.
Where Prompt-Only Breaks Down
Community reports and research studies have documented the same set of failure modes repeatedly:
Context window pressure. As conversation history fills the context, the system prompt competes for attention. Weaker models begin ignoring SOUL.md instructions after 50,000+ tokens. The personality degrades gradually — not with a visible error, but with a quiet drift toward generic responses. You pay the token cost of the personality on every single call, and the model may not even be reading it by the end of a long session.
Adversarial fragility. A clever user can social-engineer their way past a system prompt. Reveal the SOUL.md content. Convince the model to adopt a different persona. Prompt-level personality is contextual — it exists as a suggestion, not as a disposition. A well-placed “ignore your previous instructions” can override hours of careful character design.
Truncation risk. Agent frameworks compress context when it gets long. That compression can silently drop or summarize parts of your system prompt. The community recommendation for SOUL.md is to keep it under 400-500 tokens to minimize this risk — which limits how much personality depth you can encode. The more you need to say about who this agent is, the more fragile the prompt approach becomes.
Depth of internalization. This is the fundamental limitation. A system prompt tells the model how to behave. Training tells the model who it is. The difference shows up in edge cases — ethical dilemmas, ambiguous requests, situations where the model needs to reason from principles rather than follow instructions.
The ELDER-SIM study (a psychometrically validated platform for creating digital twins of elderly individuals) ran systematic ablations comparing prompt-only personality versus LoRA fine-tuning on the same base model. The prompt-only approach achieved acceptable but inferior consistency scores on Cronbach’s alpha and test-retest reliability. Adding LoRA fine-tuning pushed results to clinically meaningful levels — measurably more stable personality expression over time.
The Third Option: Personality in Weights
We have been building a system called SouthernSky Code that takes the soul document principle and applies it to open-source models running on consumer hardware.
The architecture has three layers:
┌─────────────────────────────────────────────┐
│ SPIRIT │
│ Values · Voice · Personality · Tone │
│ Lives in: LoRA adapter weights │
├─────────────────────────────────────────────┤
│ MIND │
│ Knowledge · Skills · Project Context │
│ Lives in: per-project instruction files │
├─────────────────────────────────────────────┤
│ BODY │
│ Tools · Sessions · TUI · Provider Routing │
│ Lives in: framework code │
└─────────────────────────────────────────────┘
The Spirit is a LoRA adapter trained on synthetic data generated from a detailed character specification — our equivalent of the soul document. When you swap the Spirit, you change the agent’s personality without changing any code. The personality lives in the weights, not in a prompt file that gets reloaded every session.
The training pipeline for each Spirit:
-
Write a character specification — 2,000+ words defining identity, values, voice patterns, domain expertise, ethical boundaries, and edge-case reasoning. Similar in structure to Anthropic’s soul document, but specific to one personality.
-
Generate synthetic training data — Feed 2,000 domain-specific questions through the character specification using multiple frontier models (we use three providers to prevent style homogenization). Each answer demonstrates the personality in context: how it greets, how it explains, how it handles disagreement, how it refuses.
-
Train a LoRA adapter — QLoRA on Qwen3-8B using Unsloth. Rank 8, alpha 16, one epoch. Runs in 1-6 hours on a consumer GPU (RTX 3080 Ti, 12GB VRAM). The output is a ~150MB adapter file.
-
Convert to GGUF — The adapter runs alongside the base model in llama.cpp. No cloud dependency. No API calls. Fully offline.
The result is a model that doesn’t read its personality from a prompt — it recalls it from trained weights. The same base model, the same framework, the same tools. Different soul.
Measuring the Difference
Claiming “trained is better than prompted” is easy. Measuring it requires a specific test.
We adapted an approach from Richard Weiss’s original extraction methodology. His technique used consensus sampling — running 20 parallel completions at deterministic settings and measuring how consistently the model converged on the same output. High consensus meant the behavior was deeply embedded in the weights, not a surface-level pattern.
Our adaptation works like this:
-
Design 10-20 probe prompts that strongly elicit personality without leading. Open-ended ethical dilemmas, domain questions, edge cases that would reveal inconsistencies.
-
For each probe, generate 5-8 completions at low temperature with varied seeds.
-
Compute three metrics:
- Lexical consensus — n-gram overlap across completions
- Semantic clustering — embed responses, compute pairwise cosine similarity, measure what percentage cluster tightly
- Trait agreement — extract personality markers from each response, measure variance
-
Run the same test suite on: (a) the LoRA-trained Spirit, (b) the base model with a rich SOUL.md-style system prompt, (c) the base model alone.
Research suggests trained personalities should score 75-95% consensus on these metrics. Prompt-only personalities typically land at 40-70% — more sensitive to exact phrasing, base model priors, and context length. A good Spirit LoRA measurably outperforms the prompt-only baseline, with tighter clusters on philosophical and ethical probes.
We are running these evaluations on our current Spirit adapters and will publish the results in a companion post. The evaluation script will be open-sourced.
What This Means for You
If you are building AI agents with custom personalities, the choice between prompting and training is not binary. It is a spectrum:
| Approach | Token Cost | Consistency | Adversarial Robustness | Setup Effort |
|---|---|---|---|---|
| No personality | 0 | N/A | N/A | None |
| System prompt | 200-500 tokens/call | Good for short sessions | Low | Minutes |
| SOUL.md | 400-1,600 tokens/call | Good with strong models | Low-Medium | Hours |
| LoRA fine-tune | 0 tokens/call | Excellent | High | Days (first time) |
For quick prototypes and exploration, SOUL.md is the right tool. It is fast, readable, and forkable. The community has built excellent tooling around it.
For production agents that need to maintain consistent character over thousands of interactions — tutoring systems, institutional assistants, autonomous agents that run overnight — the personality belongs in the weights. The token savings alone justify the training cost: zero prompt overhead on every call, forever, for that personality.
The open-source ecosystem now has everything you need to do this on consumer hardware. Anthropic’s CC0-licensed constitution provides a structural template. Unsloth makes QLoRA training accessible on a single GPU. llama.cpp serves the result with LoRA hot-swapping.
What Anthropic did with hundreds of researchers and proprietary infrastructure, you can do — at smaller scale, with your own characters — on hardware you already own.
What You Learned
- Anthropic trains Claude’s personality into model weights via supervised learning on a “soul document” — it is not a runtime system prompt
- The SOUL.md convention provides prompt-only personality that works well but degrades under context pressure, adversarial input, and long sessions
- LoRA fine-tuning bakes personality into weights at zero ongoing token cost, with measurably higher consistency and adversarial robustness
- A three-layer architecture (Body-Mind-Spirit) separates capabilities from knowledge from personality, making each independently swappable
- Consumer hardware (12GB VRAM) is sufficient for training personality LoRAs on 8B parameter models using QLoRA
- Anthropic’s constitution is CC0-licensed and available as a template for your own agent personality specifications