Agent Core v5.3: What We Learned Training a Tool-Calling LoRA on Real Data

The Tool-Calling Problem

Local LLMs are getting better at most things. Reliable tool calling is not one of them. Ask an untuned 8B model to query a database or chain a shell command into a file write and you will get hallucinated tool names, parameters that don’t match the schema, or a model that confidently calls tools when the right answer is to respond directly from knowledge.

This is the problem Agent Core is designed to solve: a QLoRA adapter trained on Qwen3-8B that teaches a model the 9 universal tool primitives — bash, read, write, edit, search, query, store, dispatch, bridge — well enough to recall them from weights with a 741-character system prompt instead of a 6,000-character schema dump.

We just finished v5.3. Here is what the data says — wins, regressions, and the ceiling we hit that tells us exactly what needs to happen in v5.4.

The Architecture in One Paragraph

Agent Core is a 2-phase QLoRA fine-tune on Qwen3-8B (rank 64, alpha 128, RSLoRA). Phase 1 trains on full schema context for 3 epochs so the model memorizes all 9 tool signatures, 3 database schemas, and the vault security boundary. Phase 2 trains on minimal-prompt data for 2 more epochs at a 10x lower learning rate, forcing the model to recall those schemas from weights when only given tool names — not full JSON specs. The result is a model that can reliably call tools with a ~200-token system prompt instead of a ~1,500-token one. We proved this works in v5.2 with a +5.3pp tool accuracy gain going from the full-prompt condition to the minimal-prompt condition. v5.3 asked whether real production data could make it better.
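One detail worth making concrete is what RSLoRA changes at rank 64. Standard LoRA scales the adapter update by alpha / r, while rank-stabilized LoRA scales it by alpha / sqrt(r). The sketch below only does that arithmetic with the values quoted in this post; the dictionary keys and function name are illustrative, not the project's training config.

```python
import math

# Values quoted in this post; the dictionary keys are illustrative names only.
PHASES = {
    "phase1": {"epochs": 3, "learning_rate": 2e-4, "context": 8192},
    "phase2": {"epochs": 2, "learning_rate": 2e-5, "context": 2048},
}
RANK, ALPHA = 64, 128


def lora_scale(alpha: float, rank: int, rank_stabilized: bool) -> float:
    """Multiplier applied to the low-rank update before it is added to the frozen weight."""
    if rank_stabilized:
        return alpha / math.sqrt(rank)  # RSLoRA: alpha / sqrt(r)
    return alpha / rank  # standard LoRA: alpha / r


standard_scale = lora_scale(ALPHA, RANK, rank_stabilized=False)
rslora_scale = lora_scale(ALPHA, RANK, rank_stabilized=True)
lr_ratio = PHASES["phase1"]["learning_rate"] / PHASES["phase2"]["learning_rate"]

print(f"standard LoRA scale: {standard_scale}")
print(f"RSLoRA scale: {rslora_scale}")
print(f"phase 1 / phase 2 learning rate: {lr_ratio:.0f}x")
```

At r=64 and alpha=128 the effective multiplier is 16 under RSLoRA versus 2 under standard LoRA, which is worth keeping in mind when comparing learning rates against standard-LoRA recipes.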

What Changed in v5.3

1. Mining Real Tool-Calling Episodes

The core v5.2 dataset was entirely synthetic — generated by GPT-4.1-mini, Gemini 2.5 Flash, and Grok in a 4-stage pipeline. Good data, but every example looked like an API tutorial. We wanted production behavior.

We ran a conversation miner across our Claude Code session logs and extracted 1,476 tool-calling episodes: real prompts, real tool selections, real multi-step sequences from actual development work. We supplemented this with 34 manually curated examples targeting edge cases the miner missed — boundary conditions that don’t appear in typical agent sessions. Total: 1,510 real episodes.

These cover the kinds of tool calls that don’t appear in textbook datasets: a search followed by a read to verify a file exists before a write, a bash that chains into a store to cache the result, a query that returns nothing and triggers an edit fallback. These are the kinds of sequences you only see if you’ve actually built agents.

2. Quality-Filtering the Synthetic Base

v5.2 had 8,249 validated synthetic examples. For v5.3 we ran an IFD-proxy complexity score over all of them — a heuristic that approximates instruction-following difficulty by measuring response length, tool call count, turn depth, and output diversity. We kept the top 5,500.
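A minimal sketch of what a complexity filter of this shape could look like. The four signals are the ones named above; the caps (2,000 characters, 5 tool calls, 6 turns), the equal weighting, and the example field names are assumptions made for illustration, not the scoring function actually used for v5.3.

```python
def _capped(value: float, cap: float) -> float:
    """Normalize a raw count to [0, 1], saturating at the cap."""
    return min(value, cap) / cap


def ifd_proxy_score(example: dict) -> float:
    """Average of four normalized difficulty signals (hypothetical weighting)."""
    response = example["response"]
    tokens = response.split()
    # Output diversity approximated as the share of unique whitespace tokens.
    diversity = len(set(tokens)) / len(tokens) if tokens else 0.0
    signals = [
        _capped(len(response), 2000),       # response length
        _capped(example["tool_calls"], 5),  # tool call count
        _capped(example["turns"], 6),       # turn depth
        diversity,
    ]
    return sum(signals) / len(signals)


def keep_top_k(examples: list, k: int) -> list:
    """Keep the k highest-scoring examples."""
    return sorted(examples, key=ifd_proxy_score, reverse=True)[:k]


examples = [
    {"id": "a", "response": "ok ok ok", "tool_calls": 0, "turns": 1},
    {"id": "b", "response": "search then read the config file", "tool_calls": 2, "turns": 3},
    {"id": "c", "response": "query returned nothing so edit the fallback", "tool_calls": 3, "turns": 4},
]
kept_ids = [ex["id"] for ex in keep_top_k(examples, 2)]
print(kept_ids)
```

The same call with k=5500 over the full synthetic set would correspond to the cut described above, under whatever scoring function is actually in use.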

This is a research-backed bet. The LIMA paper showed 1,000 high-quality curated examples beat 52,000 broad-coverage examples on instruction following. APIGen applied the same principle to tool calling specifically. Quality over quantity at QLoRA scale is a documented phenomenon, not a hypothesis.

3. Recovering Corrupted Safety Data

A post-v5.2 audit found a serious bug in the data generation pipeline: a one-line fallback (return result or ["bash"]) was assigning a bash step to every seed task with empty expected_tools. The irrelevance category — 775 examples meant to teach the model when NOT to call tools — had 743 examples that contained tool calls. The model trained on v5.2 had been taught to call tools for questions like “What is the difference between horizontal and vertical scaling?” That is the opposite of the intended behavior.
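The failure mode is easy to reproduce: in Python an empty list is falsy, so `result or ["bash"]` cannot tell "no tools expected" apart from "nothing was returned". The sketch below shows the bug pattern next to one possible fix, plus a contamination count. The function names and example fields are hypothetical; the actual pipeline fix is not shown in this post.

```python
def resolve_expected_tools_buggy(result):
    # Bug pattern: [] is falsy, so an intentional "no tools" becomes ["bash"].
    return result or ["bash"]


def resolve_expected_tools_fixed(result):
    # One possible fix: treat only a missing value as unknown, and default to no tools.
    return [] if result is None else list(result)


def count_contaminated(examples):
    """Count irrelevance examples that nevertheless carry expected tool calls."""
    return sum(
        1 for ex in examples
        if ex["category"] == "irrelevance" and ex["expected_tools"]
    )


seeds = [
    {"category": "irrelevance", "expected_tools": []},
    {"category": "bash", "expected_tools": ["bash"]},
]
buggy = [dict(s, expected_tools=resolve_expected_tools_buggy(s["expected_tools"])) for s in seeds]
fixed = [dict(s, expected_tools=resolve_expected_tools_fixed(s["expected_tools"])) for s in seeds]

print(count_contaminated(buggy), count_contaminated(fixed))
```

A check like `count_contaminated` run before training would have flagged the category immediately, since the expected count for irrelevance data is zero.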

We fixed the pipeline and recovered 897 clean no-tool examples from the corrupted batch, supplemented with 498 cloud-generated irrelevance examples (280 from Grok, 218 from OpenAI) to add voice diversity. These were 2x upweighted in Phase 2 training.

4. The Final Composition

Source	Count	Phase
Quality-filtered synthetic	5,500	Phase 1 + 2
Real mined episodes	1,510	Phase 1 + 2 (2x in Phase 2)
Recovered no-tool examples	897	Phase 2 only
Cloud irrelevance	498	Phase 2 only
Phase 1 total	6,895
Phase 2 total	9,915

Real data accounted for 30.5% of effective Phase 2 training weight after upweighting.
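The 30.5% figure can be reproduced from the table. In the sketch below the repeat factors are inferred from the table itself: counting the real episodes twice and every other source once gives the stated Phase 2 total. The source labels are illustrative.

```python
# (source, example count, Phase 2 repeat factor).
# Repeat factors are inferred from the composition table, not taken from a config file.
PHASE2_SOURCES = [
    ("quality_filtered_synthetic", 5500, 1),
    ("real_mined_episodes", 1510, 2),
    ("recovered_no_tool", 897, 1),
    ("cloud_irrelevance", 498, 1),
]


def effective_count(sources) -> int:
    """Number of training rows after applying each source's repeat factor."""
    return sum(count * repeat for _, count, repeat in sources)


phase2_total = effective_count(PHASE2_SOURCES)
real_rows = sum(c * r for name, c, r in PHASE2_SOURCES if name == "real_mined_episodes")
real_share_pct = round(100 * real_rows / phase2_total, 1)

print(phase2_total, real_share_pct)
```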

Training Details

Both phases ran on a single RTX 3080 Ti (12GB VRAM). Total wall time: approximately 8 hours. The hardware constraint shaped several decisions, notably the chunked cross-entropy loss patch described in the notes for practitioners later in this post.

Phase 1 (Schema Memorization)

Config: 8192 ctx, 3 epochs, LR=2e-4, grad_accumulation=16
LoRA: r=64, alpha=128, RSLoRA, targets q/k/v/o_proj
NEFTune alpha=3, cosine scheduler, 3% warmup
Peak VRAM: 7.79 GB

train_loss = 0.1244
eval_loss  = 0.0923
Total steps: 1,470 (~6.5 hours)

One CUDA crash at step ~985 (Triton/Unsloth offloading at an eval boundary). Resumed from checkpoint-980. No data loss. This is why the training scripts save a rolling checkpoint every 10 steps.
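Resuming after a crash mostly comes down to finding the newest checkpoint directory. A small sketch, assuming checkpoints are saved as `checkpoint-<step>` directories under one output folder (the naming matches the checkpoint names quoted above; the helper itself is hypothetical). Note that the step is compared numerically, since a plain string sort would rank `checkpoint-99` above `checkpoint-980`.

```python
import re
import tempfile
from pathlib import Path

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the checkpoint directory with the highest step number, or None."""
    candidates = []
    for path in Path(output_dir).iterdir():
        match = CHECKPOINT_RE.match(path.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


# Demo with throwaway directories standing in for a real training run.
with tempfile.TemporaryDirectory() as tmp:
    for step in (99, 970, 980):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    resume_from = latest_checkpoint(tmp).name

print(resume_from)
```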

Phase 2 (Minimal-Trigger Recall)

Config: 2048 ctx, 2 epochs, LR=2e-5, warmup_steps=30
Loads Phase 1 weights via safetensors
All other params identical to Phase 1
Peak VRAM: 7.81 GB

train_loss = 3.776  (NEFTune-inflated — not a real regression)
eval_loss  = 0.3831
Total steps: 980 (~1.5 hours, resumed from checkpoint-780 after GPU crash)

The Phase 2 eval loss of 0.383 sits in the band we targeted for generalization: we read a value above 0.2 as a sign the model has not simply memorized the training set, and a value below 0.5 as a sign the minimal-prompt recall objective was absorbed. Eval loss descended smoothly through all checkpoints, with no overfitting inflection.

Results: v5.3 vs v5.2

Same 225 eval prompts. Same minimal system prompt. With the evaluation held fixed, the differences below should be attributable to training data composition.

Headline Metrics

Metric	v5.2	v5.3	Delta
Schema Compliance	100%	100%	stable
Tool Selection	58.2%	52.9%	-5.3pp
Vault Avoidance	13.3%	53.3%	+40pp
Irrelevance (no-tool)	8.0%	24.0%	+16pp
Error Recovery	40.0%	60.0%	+20pp
Bash	90.0%	93.3%	+3.3pp
Multi-tool Chains	6.7%	6.7%	unchanged
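Schema compliance is the kind of metric that can be checked mechanically. The sketch below is one way such a check could look, not the project's eval harness. It assumes each `<tool_call>` block wraps a JSON object with `name` and `arguments` keys, which is an assumption about the payload; the `path` argument in the demo is likewise made up.

```python
import json
import re

# The nine tool primitives named in this post.
TOOLS = {"bash", "read", "write", "edit", "search", "query", "store", "dispatch", "bridge"}
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text: str):
    """Return the parsed calls, [] for a direct answer, or None if any call is malformed."""
    raw_blocks = TOOL_CALL_RE.findall(text)
    if text.count("<tool_call>") != len(raw_blocks):
        return None  # an opening tag without a matching close
    calls = []
    for raw in raw_blocks:
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            return None
        if not isinstance(payload, dict):
            return None
        if payload.get("name") not in TOOLS:
            return None  # hallucinated tool name
        if not isinstance(payload.get("arguments"), dict):
            return None
        calls.append(payload)
    return calls


good = '<tool_call>{"name": "read", "arguments": {"path": "notes.txt"}}</tool_call>'
bad_name = '<tool_call>{"name": "open_file", "arguments": {}}</tool_call>'
direct = "Horizontal scaling adds machines; vertical scaling adds resources to one machine."

print(parse_tool_calls(good), parse_tool_calls(bad_name), parse_tool_calls(direct))
```

Returning an empty list for a direct answer, rather than None, keeps "no tool call" distinct from "broken tool call", which matters for the irrelevance category.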

Per-Category Breakdown

Category	v5.2	v5.3	Delta	Notes
bash	90.0%	93.3%	+3.3pp	Strongest overall
query_routing	100%	93.3%	-6.7pp	Minor regression
vault_avoidance	13.3%	53.3%	+40pp	Biggest win
irrelevance	8.0%	24.0%	+16pp	Still weak
error_recovery	40.0%	60.0%	+20pp	Real data helped
file_operations	72.0%	52.0%	-20pp	Regression
store_operations	85.0%	30.0%	-55pp	Significant regression
bridge	73.3%	40.0%	-33.3pp	Regression
dispatch	33.3%	33.3%	unchanged
multi_tool_chains	6.7%	6.7%	unchanged	NOT LEARNED
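The deltas in the table above can be recomputed directly from the two accuracy columns, which is also a convenient way to rank the regressions. The numbers are copied from the table; the variable names are illustrative.

```python
# category: (v5.2 accuracy %, v5.3 accuracy %), copied from the per-category table.
RESULTS = {
    "bash": (90.0, 93.3),
    "query_routing": (100.0, 93.3),
    "vault_avoidance": (13.3, 53.3),
    "irrelevance": (8.0, 24.0),
    "error_recovery": (40.0, 60.0),
    "file_operations": (72.0, 52.0),
    "store_operations": (85.0, 30.0),
    "bridge": (73.3, 40.0),
    "dispatch": (33.3, 33.3),
    "multi_tool_chains": (6.7, 6.7),
}

# Delta in percentage points, rounded to match the table's precision.
deltas = {cat: round(new - old, 1) for cat, (old, new) in RESULTS.items()}

# Regressed categories, worst first.
regressions = sorted((cat for cat in deltas if deltas[cat] < 0), key=deltas.get)

for cat in regressions:
    print(f"{cat}: {deltas[cat]:+.1f}pp")
```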

Reading the Results Honestly

The Safety Wins Are Real

Vault avoidance went from 13.3% to 53.3% — a 40-percentage-point improvement. To put that in context: in v5.2, the model would attempt to query or write against the vault database (the security boundary protecting secrets and credentials) on 87% of adversarial prompts. In v5.3, it refuses on 53% of them. That is a meaningful shift in the right direction, and it came entirely from the recovered no-tool data and cloud irrelevance examples. When you give the model clear examples of what “don’t touch that” looks like, it learns.

Error recovery going from 40% to 60% is also meaningful, and it is directly attributable to real data. The mined production episodes include the kinds of recovery patterns that synthetic data misses: you read a file that doesn’t exist, so you create it; a query returns empty results, so you fall back to a search. Real agents make those chains. Synthetic generators tend to produce happy-path scenarios.

The Tool-Accuracy Regressions Are Meaningful

Store, bridge, file_operations, and query_routing all regressed. The most likely explanation is category imbalance from real data composition. The 1,510 mined production episodes are heavily biased toward bash, read, and edit — that is what software development work looks like. Categories like bridge (MCP/HTTP external service calls) and store (INSERT/UPDATE against agent DBs) are underrepresented in real Claude Code sessions. When we 2x upweighted the real data in Phase 2, we inadvertently biased the model away from those categories.

The fix for v5.4 is targeted: stratified real-data collection for underrepresented categories, not a blanket upweight.
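A rough sketch of what stratified selection could look like: cap every category at the same number of episodes and report where the real data falls short, so the shortfall can drive targeted collection. The `category` field, the per-category quota, and the demo counts are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(episodes, per_category, seed=0):
    """Take up to per_category episodes from each category and report shortfalls."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for episode in episodes:
        by_category[episode["category"]].append(episode)

    sample, shortfall = [], {}
    for category in sorted(by_category):
        items = by_category[category]
        rng.shuffle(items)
        sample.extend(items[:per_category])
        if len(items) < per_category:
            shortfall[category] = per_category - len(items)
    return sample, shortfall


# Made-up distribution that mimics a bash-heavy session log.
episodes = (
    [{"category": "bash", "id": i} for i in range(40)]
    + [{"category": "store", "id": i} for i in range(3)]
    + [{"category": "bridge", "id": i} for i in range(2)]
)
sample, shortfall = stratified_sample(episodes, per_category=5)
print(len(sample), shortfall)
```

Compared with a blanket 2x upweight, this keeps the dominant categories from growing further and turns the gaps into an explicit collection target.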

The Ceiling: Multi-Tool Chains at 6.7%

This number did not move. Not in v5.2, not in v5.3, not across any training variant we have run. The model can execute individual tool calls correctly. It cannot plan sequences of 3–5 tools to accomplish a compound task.

This is not a data quality problem. We have good multi-tool chain examples. It is a fundamental limitation of supervised fine-tuning as a learning signal. Put plainly: SFT teaches the model to imitate; DPO teaches it to judge. SFT shows the model the correct next step in isolation. It does not teach the model to evaluate a sequence, backtrack when a step fails, or recognize that step 3 depends on the outcome of step 2. You can give the model perfect demonstrations of multi-step chains and it will still fail at planning them — because imitation of individual steps is not the same cognitive operation as planning across steps.

The research literature is consistent on this. Multi-step tool-calling requires preference optimization: you need to show the model which sequences of steps are better than other sequences, not just demonstrate the correct steps in isolation. That means DPO (Direct Preference Optimization) or step-level methods such as SWiRL (Step-Wise Reinforcement Learning).

What the Research Says About the Ceiling

After seeing the v5.3 results, we surveyed the recent literature on multi-step tool-calling specifically. The pattern is consistent: SFT + DPO is now the standard recipe for 8B tool-calling models. Recent benchmarks (BFCL v3, ToolBench) show that models trained with SFT alone plateau in the 55–65% accuracy range on complex tool selection. Models trained with SFT + DPO iteration break through to 85–91%.

The mechanism: you build preference pairs from the model’s own mistakes, then train on them with DPO. You run the SFT model on evaluation prompts, collect the wrong tool selections, and construct (chosen, rejected) pairs where the chosen response is the correct tool call and the rejected response is what the model actually produced. Training on these pairs adjusts the model’s probability distribution directly: it learns not just “this is correct” but “that specific mistake is wrong.”
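The pair-construction step can be sketched in a few lines. This assumes each eval record carries the prompt, the reference response, and the model's actual output; the record fields are hypothetical, and the `prompt` / `chosen` / `rejected` keys follow a common convention for preference datasets rather than anything specified in this post. Exact string comparison is a simplification; a real pipeline would more likely compare parsed tool calls.

```python
def build_preference_pairs(eval_records):
    """Turn the model's failures into (chosen, rejected) preference pairs."""
    pairs = []
    for record in eval_records:
        if record["model_output"] == record["reference"]:
            continue  # the model was right, so there is nothing to contrast
        pairs.append({
            "prompt": record["prompt"],
            "chosen": record["reference"],      # the correct behavior
            "rejected": record["model_output"], # what the SFT model actually did
        })
    return pairs


eval_records = [
    {"prompt": "List files in src/", "reference": "bash", "model_output": "bash"},
    {"prompt": "Save this result for later", "reference": "store", "model_output": "write"},
    {"prompt": "What is vertical scaling?", "reference": "<direct answer>", "model_output": "search"},
]
pairs = build_preference_pairs(eval_records)
print(len(pairs), pairs[0]["chosen"], pairs[0]["rejected"])
```

The third record shows the irrelevance case: the chosen response is a direct answer and the rejected one is an unnecessary tool call.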

For multi-tool chains specifically, step-wise methods like SWiRL apply this logic at each step of a trajectory rather than treating the full sequence as a single training unit. That matches the structure of the problem: a 4-step tool chain can fail at step 3 while steps 1, 2, and 4 are correct. Sequence-level preference training penalizes the whole rejected sequence. Step-wise training penalizes only the bad step.
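To make the step-level idea concrete, the sketch below locates the first step where a predicted chain diverges from the reference and builds a preference pair for that step only, conditioned on the correct prefix. This is not the SWiRL algorithm, only the step-localization idea expressed as a preference pair; the function names and the `<stop>` marker are hypothetical.

```python
def first_divergence(predicted, reference):
    """Index of the first step where the chains differ, or None if they match."""
    for index, (pred_step, ref_step) in enumerate(zip(predicted, reference)):
        if pred_step != ref_step:
            return index
    if len(predicted) != len(reference):
        return min(len(predicted), len(reference))  # one chain ended early
    return None


def step_level_pair(prompt, predicted, reference, stop="<stop>"):
    """Preference pair for the single failing step, given the correct prefix."""
    index = first_divergence(predicted, reference)
    if index is None:
        return None
    return {
        "prompt": prompt,
        "context": reference[:index],  # steps that were already correct
        "chosen": reference[index] if index < len(reference) else stop,
        "rejected": predicted[index] if index < len(predicted) else stop,
    }


reference = ["search", "read", "edit", "bash"]
predicted = ["search", "read", "write", "bash"]  # only step 3 is wrong
pair = step_level_pair("Fix the config and rerun the tests", predicted, reference)
print(pair)
```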

What This Means for Anyone Building Tool-Calling Agents

If you are training your own tool-calling LoRA, here is what Agent Core v5.3 demonstrates empirically:

Schema compliance is the easy win. Once you normalize to a consistent format (we use XML <tool_call> blocks) and quality-filter your training data, 100% schema compliance is achievable and stable. Format normalization alone contributed significantly to this in the v5.2 → v5.3 transition.

Real production data outperforms synthetic at small scale. The 1,510 real examples, upweighted 2x, contributed more signal per example than the synthetic base. This matches LIMA/APIGen findings. If you have access to real agent interaction logs, mine them before generating more synthetic data.

Safety categories require dedicated no-tool training. The vault avoidance and irrelevance improvements came from explicitly recovering and generating “don’t call tools here” examples. If your safety data is contaminated with tool calls (as ours was), your model will learn the opposite of what you intended.

SFT alone will not get you past ~60% tool accuracy on complex tasks. If your application requires reliable 3–5 step tool chains, you need preference tuning. Build a DPO dataset from your SFT model’s own failures.

Consumer hardware is sufficient for this scale. All of v5.2 and v5.3 training ran on a single RTX 3080 Ti with 12GB VRAM. Peak usage was 7.81 GB. The key VRAM optimization that makes this possible is a chunked cross-entropy loss patch that processes 32 tokens at a time instead of the full sequence. Without it, Phase 1 runs out of memory at the 8192-token context window.
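The idea behind a chunked cross-entropy loss is that the logits for the whole sequence never have to exist at once: project and score one chunk of positions, accumulate the loss, then move on. The sketch below shows this in pure Python on tiny made-up tensors so the equivalence is easy to verify. It is an illustration of the technique, not the actual patch, which would operate on GPU tensors inside the training framework.

```python
import math
import random


def project(hidden, weight):
    """Logits for one position: dot product of the hidden vector with each vocab row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]


def token_nll(logits, target):
    """Negative log-likelihood of the target token, using a stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def chunked_cross_entropy(hidden_states, weight, targets, chunk_size=32):
    """Mean cross-entropy computed chunk_size positions at a time."""
    total = 0.0
    for start in range(0, len(hidden_states), chunk_size):
        chunk_hidden = hidden_states[start:start + chunk_size]
        chunk_targets = targets[start:start + chunk_size]
        # Only this chunk's logits are materialized, then discarded.
        chunk_logits = [project(h, weight) for h in chunk_hidden]
        total += sum(token_nll(l, t) for l, t in zip(chunk_logits, chunk_targets))
    return total / len(targets)


# Tiny made-up problem: 70 positions, hidden size 4, vocabulary of 6.
rng = random.Random(0)
hidden_states = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(70)]
weight = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
targets = [rng.randrange(6) for _ in range(70)]

full_loss = chunked_cross_entropy(hidden_states, weight, targets, chunk_size=len(targets))
chunked_loss = chunked_cross_entropy(hidden_states, weight, targets, chunk_size=32)
print(math.isclose(full_loss, chunked_loss, rel_tol=1e-9))
```

The loss is the same either way; what changes is peak memory, which scales with chunk size times vocabulary size rather than sequence length times vocabulary size.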

What’s Next: v5.4 DPO

The v5.3 results give us a clear roadmap:

Multi-tool chains via SWiRL — Generate preference pairs from v5.3 model failures on multi-step eval prompts. Apply step-wise DPO so the model learns which step in a sequence was wrong.
Irrelevance hardening via DPO — Construct (chosen=no-tool, rejected=unnecessary-tool-call) pairs. Even with the v5.3 improvements, 24% irrelevance accuracy means the model still over-calls tools on 76% of should-be-direct questions.
Vault hardening via red-team pairs — The adversarial injection prompts that still breach vault at 20% rate need (chosen=refuse-and-explain, rejected=vault-query) DPO pairs.
Stratified real-data collection — Before any training, build category-balanced real episode datasets for store, bridge, and file_operations to address the composition bias.

The goal for v5.4 is to cross 80% overall tool accuracy while holding the safety gains from v5.3. Based on the literature, that target is achievable with 2–3 DPO iterations on the v5.3 SFT base.

When v5.4 is complete, the full Agent Core standard — training data, eval set, adapter weights, and this empirical record — will be published on HuggingFace. The adapter is designed for any 8B agent stack running Ollama with Qwen3.

Takeaways

v5.3 clarified what SFT can and cannot do for tool-calling agents. The short version: SFT is necessary but not sufficient. It gets you schema compliance, safety boundaries, and error recovery. It does not get you multi-step planning. Here is what the data confirmed:

Schema compliance (100%) is achievable with format normalization and quality filtering, and it is stable across training iterations.
Real production data is more valuable per example than synthetic data at QLoRA scale: 1,510 mined episodes lifted error recovery by +20pp despite being a minority of the total dataset. The +16pp to +40pp safety gains came from the dedicated no-tool examples.
SFT has a hard ceiling around 55–65% tool accuracy for complex tasks. Multi-tool chains (6.7% across both versions) will not improve without preference tuning.
The DPO/SWiRL recipe is now the research-validated path to 85–91% accuracy on 8B models.
Consumer hardware (RTX 3080 Ti, 12GB VRAM) is sufficient for this work — but requires careful VRAM management (chunked CE loss, rolling checkpoints, subprocess isolation for eval).