
Self-Distillation: Mining Your AI Conversations for LoRA Data
Your real coding agent interactions are the highest-quality training data you'll ever have. Here's the 4-stage pipeline that converts conversation histories into structured LoRA training data — zero cost until Stage 3.
Table of Contents
The finest synthetic training data money can buy still can’t replicate something you already have for free: the actual record of how you work.
Over the past several months I’ve accumulated 76 Claude Code conversation histories — sessions where I debugged production deployments, built LoRA training pipelines, wrote data ingestion scripts, and fixed systemd services that refused to cooperate. Each of those conversations is a detailed trace of real developer cognition: the request, the tool chain, the wrong turn, the recovery, the final working state. That’s approximately 24,000 tool calls sitting in JSONL files on disk — structured, scored, and waiting to be used.
This post documents the pipeline I built — conversation-miner — that extracts, scores, and amplifies those interaction records into training data for Agent Core v5.2, my local tool-calling agent running on Qwen3-8B. Four stages, mostly zero-cost, producing 2,777 training episodes before a single GPU cycle is spent on amplification.
Why Should You Care?
Synthetic training data for tool-calling agents has a well-known problem: it’s too clean. The model learns to call tools in tidy sequences with perfect inputs, because that’s what a prompt engineer writes when generating examples at scale. Real developer work doesn’t look like that. Real work involves trying grep -r before realizing the flag order is wrong, noticing a permission error mid-chain, pivoting to find instead, and eventually arriving at the answer through a sequence that no benchmark ever anticipated.
That authentic messiness — especially the error-recovery patterns — is exactly what makes a local agent useful on actual tasks. And it’s sitting in your ~/.claude/projects/ directory right now.
Vercel’s 2026 research on agentic task performance found a practical ceiling around 10 tools for 8B-class models before accuracy degrades significantly. Below that ceiling, the limiting factor isn’t model size — it’s training data quality. Models trained on realistic tool-calling traces outperform those trained on synthetic-only data on real developer tasks. The gap is measurable, and the ingredient is authenticity.
The good news: you’ve already been generating that ingredient every time you opened a coding session.
Stage 1: Index Without Reading
The first design constraint I set for this pipeline: Stage 1 costs zero tokens, zero GPU, and completes in under 30 seconds. That means no LLM calls, no embedding generation — just structural parsing.
Each Claude Code conversation is stored as a JSONL file in ~/.claude/projects/. The format is line-delimited JSON, one record per message, with typed entries for user, assistant, ai-title, and a few bookkeeping types. The assistant turns contain tool_use blocks inside their content array — that’s all we need to build a useful index.
The indexer reads every conversation file and extracts a “skeleton”: user messages (capped at 500 characters each), tool call names and their input key signatures, and multi-tool chain sequences. It never reads full assistant responses, never reads thinking blocks, never reads tool results. Just the bones.
# From index-conversations.py — extract_skeleton()
elif msg_type == "assistant":
content = msg.get("message", {}).get("content", [])
if isinstance(content, list):
for block in content:
if isinstance(block, dict) and block.get("type") == "tool_use":
tool_name = block.get("name", "unknown")
tool_input = block.get("input", {})
skel.tool_calls.append({
"name": tool_name,
"input_keys": list(tool_input.keys())[:5],
"timestamp": timestamp,
})
current_chain.append(tool_name)
All skeletons land in a SQLite database with FTS5 over the user messages, plus tables for tool calls and chain patterns:
$ python3 index-conversations.py --stats
Conversations: 76
Total size: 295.0 MB
Total messages: 18,432
Total tool calls: 21,123
Total chains: 4,891
=== By Project ===
-home-user-projects-OMNI 12 convos | 3841 tool calls | avg score 8.2
-home-user-projects-southernsky 9 convos | 2107 tool calls | avg score 7.4
-home-user-tools-conversation-miner 4 convos | 892 tool calls | avg score 9.1
The FTS5 index makes it cheap to find conversations by topic without reading them:
$ python3 index-conversations.py --search "systemd fails to start"
12 matches for 'systemd fails to start':
[ 9.4] -home-user-projects | Debugging containerized web service
systemd fails to start after compose restart — journalctl shows...
session: a7f3c2e91b4d8f21...
[ 7.1] -home-user-tools | GPU power management service not loading
systemd fails to start nvidia-powerd.service on boot...
The scoring function runs during indexing, before any episodes are extracted. It weights tool diversity, chain length, multi-tool chain count, conversation depth, and — critically — error recovery signals:
def score_training_signal(skel: ConversationSkeleton) -> None:
error_keywords = ["error", "failed", "fix", "broken", "crash", "debug", "issue"]
recovery_keywords = ["fixed", "works", "resolved", "solved", "let me try"]
all_text = " ".join(skel.user_messages).lower()
has_errors = any(kw in all_text for kw in error_keywords)
has_recovery = any(kw in all_text for kw in recovery_keywords)
if has_errors and has_recovery:
score += 2.0 # Error-recovery bonus — these are premium examples
That +2.0 bonus reflects a specific finding from AgentHER (arXiv:2603.21357), which demonstrated that failed trajectories properly relabeled for hindsight are roughly 2x more data-efficient than successes alone. An episode where the agent tries something, gets an error, and recovers teaches error pattern recognition in a way that a clean success never can.
Stage 2: Episode Extraction
The indexer tells us which conversations are worth reading. The extractor reads them and segments each conversation into individual training episodes.
The definition of an episode: a user request, followed by the assistant’s reasoning and tool chain, followed by tool results, bounded by the next user message or 30 turns — whichever comes first. One episode becomes one training example in ShareGPT/Hermes format.
# From extract-trajectories.py — the episode boundary logic
if msg_type == "user":
user_text = extract_user_text(content)
if user_text and len(user_text) > 10:
finalize_episode(current_episode, episodes) # Seal the previous one
current_episode = Episode(
user_request=user_text[:1000],
session_id=session_id,
project=project,
domain_tags=list(domains.keys()),
)
Tool names get normalized to v5 equivalents during extraction. Claude Code uses names like Bash, Read, Edit, Grep, Agent, WebFetch — the v5 agent schema collapses these to bash, read, edit, search, dispatch, bridge. That normalization is the key transformation that makes these conversation records useful as training data for a different agent runtime.
TOOL_NAME_MAP = {
"Bash": "bash",
"Read": "read",
"Edit": "edit",
"Write": "write",
"Grep": "search",
"Glob": "search",
"Agent": "dispatch",
"WebFetch": "bridge",
"WebSearch": "bridge",
}
Tool inputs get summarized — not truncated arbitrarily, but structured by tool type. A bash call keeps the command (capped at 500 chars). A write call keeps the file path and a 200-character content preview plus the full content length. An edit call keeps both the old and new strings (capped at 300 chars each). The goal is keeping enough information that the training example is realistic, without ballooning file size with full file dumps.
Running extraction across all 76 conversations:
$ python3 extract-trajectories.py --all --min-score 1.0
Extracting from 59 conversations (domain=all)...
[1/59] score= 12.4 | 24.1MB | -home-user-projects-OMNI → 89 episodes, 67 qualified
[2/59] score= 11.8 | 18.7MB | -home-user-tools-conversation-miner → 44 episodes, 38 qualified
[3/59] score= 9.4 | 15.2MB | -home-user-projects-southernsky → 71 episodes, 54 qualified
...
============================================================
Total episodes extracted: 2777
Average score: 4.52
With error recovery: 77
Multi-tool chains (3+): 1,421
Tool distribution in episodes:
bash 12274
read 3591
edit 2423
write 1045
dispatch 699
search 638
bridge 451
Domain distribution:
infrastructure 2690
research 2659
devops 2532
web_dev 2413
database 2039
lora_training 1634
Score distribution across the 2,777 episodes:
| Score Range | Count | Notes |
|---|---|---|
| 1–2 | 394 | Single-tool, shallow interactions |
| 2–3 | 202 | Short chains, minimal diversity |
| 3–5 | 711 | Solid multi-tool episodes |
| 5–8 | 1,421 | High-signal, multi-domain chains |
| 8+ | 49 | Elite episodes — deep chains, error recovery, dispatch |
The 1,470 episodes scoring 5+ are the primary training corpus. That’s enough data to meaningfully influence fine-tuning behavior on tool selection and chaining patterns without exhausting a training budget.
Stage 3: LLM Classification
Keyword-based domain tagging from Stage 1 is fast but imprecise. A conversation about deploying a web app might hit both devops and web_dev keyword lists equally. For training data splits — where you want clean domain labels so you can balance curriculum — you need something smarter.
Stage 3 uses a local LLM (Gemma4 via Ollama) to classify each episode by its primary domain. The classifier sees only the user request, the list of tools used, and the turn count — not the full episode content. That keeps classification fast and cost-free at inference time.
CLASSIFY_PROMPT = """You are classifying a conversation episode for training data curation.
Given the user request and tool calls below, assign the SINGLE most specific domain label.
Available domains:
{domains}
Episode:
User request: {user_request}
Tools used: {tools}
Turn count: {turn_count}
Respond with ONLY the domain label (one word, lowercase)."""
The classification step runs only when the GPU is idle — it won’t compete with active LoRA training. The --skip-classify flag falls back to keyword labels when the GPU is busy, producing lower-quality splits but preserving pipeline continuity.
# Check GPU state before running
$ nvidia-smi --query-gpu=name,memory.used,memory.free --format=csv,noheader
NVIDIA GeForce RTX 3080 Ti, 1024 MiB, 11264 MiB
# GPU is free — run classification
$ python3 convert-to-training.py --min-score 5
Loading 1,470 high-signal episodes...
Classifying via gemma4:12b...
[147/1470] devops (0.3s/ep)
...
Domain split written to training-ready/
The output is domain-split JSONL, ready for SFTTrainer. Each record is in ShareGPT format with a metadata field that preserves provenance: source session ID, domain classification, original score, and whether the episode contains error recovery.
Stage 4: Seed Amplification
Here’s where the math gets interesting. If one developer’s conversation histories yield 1,470 high-signal training episodes, that’s a meaningful dataset — but it’s also a dataset that reflects one person’s patterns, one set of projects, one set of failure modes. Training too heavily on it risks overfitting to personal idioms rather than generalizing agent behavior.
The literature on self-distillation gives us a practical constraint. Research across AgentHER, SPIN (arXiv:2401.01335), and OPSD converges on the same rough ratio: 20–30% real data as a floor, 70–80% synthetic amplification as the ceiling. Below 20% real, you get model collapse — the synthetic data loop detaches from ground truth. Above 30%, you overfit to the specific developer’s patterns: their project names, their phrasing, their particular mix of tool calls.
The sweet spot is real seeds treated as templates, with synthetic variations filling the bulk of the training set.
The amplifier extracts a “workflow template” from each real episode — the tool chain sequence and the variable slots (project names, file paths, ports, service names, domains). It then prompts an LLM to generate variations that preserve the tool chain but vary everything else:
VARIATION_SYSTEM_PROMPT = """You are a training data generator for an AI coding agent called sscode.
sscode uses exactly 9 tools: bash, read, write, edit, search, query, store, dispatch, bridge.
Your job: given a REAL conversation episode (seed), generate a realistic VARIATION
that preserves the same tool-calling pattern but changes the context.
Rules:
1. KEEP the same tool chain order (the sequence of tools is the pattern we're teaching)
2. CHANGE the project name, file paths, domain names, ports, or task description
3. KEEP responses realistic — real bash output, real file content, real error messages
4. Tool calls must use valid v5 schema
5. Output ONLY valid JSON
6. Make it feel like a REAL developer working on REAL infrastructure"""
A real deployment episode becomes a template:
Seed (real):
User: "deploy the web app to the VPS"
→ read(deploy.mjs) → bash(npm run build) → write(Dockerfile)
→ bash(podman build) → bash(scp to server) → bash(docker compose up)
Variations (synthetic, same tool chain):
├─ "deploy the new API service to staging"
├─ "deploy to VPS but the TypeScript build fails — fix it first"
├─ "deploy to a different port, 3000 is already taken"
├─ "rollback: last deploy broke the site, restore previous version"
└─ "deploy with a database migration step before the service comes up"
The tool chain is the training signal. The context is the variation surface. This is the distinction that separates seed amplification from Evol-Instruct — Evol mutates the instruction, which can invalidate the entire tool chain. Seed amplification preserves the chain and varies only the context slots.
Projected output from the current corpus:
| Seed Threshold | Seeds | Variations | Total Examples |
|---|---|---|---|
| score ≥ 3.0 | 2,181 | × 3 | 6,543 |
| score ≥ 5.0 | 1,338 | × 3 | 4,014 |
| score ≥ 7.0 | 194 | × 5 | 970 |
The Stage 4 pipeline supports local generation (Gemma4 via Ollama) and cloud provider fallbacks (Grok, Gemini, OpenAI), with a checkpoint system that survives interruption:
# Dry run — no generation, just plan
$ python3 amplify-seeds.py --dry-run --min-score 5
Seeds loaded: 1338 (score >= 5.0)
Variations per seed: 3
Expected output: 4014 examples
Provider: local
Sample templates:
[ 9.1] bash → read → bash → write → bash → bash
"run the full deployment pipeline for the new service..."
Slots: projects=['my-project', 'api-service'], domains=['api.example.com']
[ 8.7] read → dispatch → bash → edit → bash
"the training loss is spiking after epoch 3 — check the config..."
Slots: paths=['/home/user/training/config.yaml']
Domain coverage in seeds:
devops 892 seeds → 2676 variations
web_dev 743 seeds → 2229 variations
lora_training 401 seeds → 1203 variations
infrastructure 377 seeds → 1131 variations
database 289 seeds → 867 variations
Each generated variation goes through a structural validator before it’s written to the output file: minimum 4 turns, maximum 40, at least 30% tool-chain overlap with the original seed, a changed user request (not a copy of the seed), and valid v5 tool schemas on every function call turn. Variations that fail validation are discarded — the manifest records the validation rate as a quality signal.
What Transfers and What Doesn’t
Running the Stage 1 chain analysis reveals something that matters for anyone thinking about contributing their conversation histories to a shared dataset: the transferable signal and the personal noise are separable.
$ python3 index-conversations.py --chains
=== Multi-Tool Chain Patterns (length >= 2) ===
847x bash → read
612x read → edit → bash
508x bash → bash
341x read → bash → bash → bash
287x bash → read → edit
219x bash → write → bash
178x read → dispatch → bash → read → bash
143x bash → read → bash → read → bash → edit
The patterns that repeat hundreds of times — read → edit → bash, bash → read → bash → read — are workflow archetypes. They represent how developers actually move through tasks: read the config, edit it, run it to see if it works, read the output, read the relevant code, edit again. These patterns are not personal. They’re structural. They transfer across developers, across projects, across domains.
What doesn’t transfer: project-specific file paths, personal naming conventions, particular phrasing habits, the specific mix of projects that appear in a corpus. A model trained only on one developer’s deployment episodes will have strong priors about paths and service names that don’t exist on anyone else’s machine.
The amplification step handles this by design — varying project names, paths, and domains across synthetic examples. But the deeper implication is that a multi-contributor corpus would be dramatically better than any single developer’s histories. One developer’s 76 conversations produce 2,777 usable episodes. One hundred contributors, with proper anonymization and consent, could produce hundreds of thousands — covering domain mixes, error patterns, and workflow styles that no single person’s work captures.
WildChat and LMSYS-Chat-1M proved this model works for general conversation data. The same approach is open for agentic tool-calling data — and the training gap between local 8B agents and hosted models likely closes faster through community-contributed real interaction data than through any amount of synthetic generation alone.
Integration with Agent Core v5.2
The mined data supplements, not replaces, the synthetic dataset that the Agent Core v5.2 training pipeline generates via three-provider generation. The distinction matters:
| Dimension | Synthetic (3-provider) | Mined (conversation-miner) |
|---|---|---|
| Source | API generation (Grok, Gemini, OpenAI) | Real Claude Code sessions |
| Tool results | Simulated | Actual tool output |
| Error patterns | Scripted injection | Genuine errors + recovery |
| Coverage | Uniform across categories | Weighted by real usage |
| Cost | ~$35 per 8,000 examples | Zero (already captured) |
The recommended blend for a training run: 10–20% mined data mixed with 80–90% synthetic, with mined error-recovery episodes weighted upward. That proportion keeps the model grounded in real patterns without overfitting to one developer’s specific workflows.
The conversation-miner output integrates directly into the existing SFTTrainer pipeline with no format conversion required — both datasets use the same ShareGPT/Hermes schema the pipeline already expects.
Running It Yourself
The pipeline adapts to any coding agent that stores conversation histories in JSONL format — not just Claude Code. The indexer needs to know where your conversation files live and which tool names your agent uses. The extractor’s episode boundary logic is format-agnostic as long as messages alternate between user and assistant types with tool calls in the assistant blocks. The amplifier works with any JSONL episode set regardless of source.
The full tool is at github.com/StankyDanko. The four scripts are self-contained Python with no dependencies beyond sqlite3 (stdlib) and httpx (for Stage 3–4). Stage 1 and 2 run on any machine. Stages 3 and 4 need either a local Ollama instance or API keys.
Start with the index:
$ python3 index-conversations.py
Building conversation index...
Found 76 conversation files
[1/76] a7f3c2e9... (24.1MB) → 47 user msgs, 412 tools, score=12.4
[2/76] b8e41d7f... (18.7MB) → 38 user msgs, 287 tools, score=11.8
...
Index built: conversation-index.db (892 KB)
Indexed 76 conversations
If the stats look reasonable, run extraction. If the score distribution looks thin, check whether your conversations have enough tool calls to produce meaningful episodes — conversations that are mostly text exchange without tool use will score near zero and get filtered.
The data is already there. It’s already yours. Once you run the index and see your own score distribution, you’ll have a clear picture of exactly which conversations are worth extracting — and you’ll likely find more high-signal material than you expected. The only decision left is whether to let it sit in ~/.claude/projects/ doing nothing, or turn it into an agent that’s measurably better at the work you actually do.
Key Takeaways
- Claude Code conversation histories are structured JSONL files containing complete tool-call traces — the raw material for agent training data is sitting on your disk right now.
- A 4-stage pipeline (index, extract, classify, amplify) can convert real interaction records into ShareGPT/Hermes training data without spending a single token until Stage 3.
- Error-recovery episodes are 2x more data-efficient than clean successes per AgentHER — the conversations where something went wrong are your most valuable training signal.
- The optimal real-to-synthetic ratio is 20–30% real data as a floor, with seed amplification filling the remainder — preserving tool chains while varying context slots prevents both model collapse and personal-pattern overfitting.
- Tool chain patterns (bash → read → edit → bash, read → dispatch → bash) transfer across developers; project-specific paths and naming conventions don’t — the structural signal generalizes, the personal noise doesn’t.