
Agent Core v5.4: When Data Augmentation Backfires
We 6.5x'd our training data and made the model safer but less capable. Here's the root cause, the numbers, and the v5.5 fix plan.
The conventional wisdom in fine-tuning is seductive: more data means a better model. It is true often enough to feel like a law. Going into v5.4, we believed it. We had 7K quality-filtered examples from v5.3. We augmented the data to 45,853 examples (paraphrases, contrastive negatives, irrelevance examples) using techniques backed by recent research. That is roughly 6.5x the 7K set and a 2.7x expansion of the 16,810-example pool the augmentation passes started from. We ran the same 225-prompt eval on the resulting GGUF adapter.
The model got safer. It also got worse at its primary job.
This post is the honest accounting of what happened: the numbers, the root cause, and the revised approach for v5.5. If you are training tool-calling LoRAs at QLoRA scale, the failure mode we hit is real and reproducible, and the fix is counterintuitive.
What Agent Core Is
Agent Core is a QLoRA adapter for Qwen3-8B that teaches a model the 9 universal tool primitives: bash, read, write, edit, search, query, store, dispatch, and bridge. It also trains in a 3-database architecture (system, agent, work) plus a behavioral vault boundary — a trained-in security refusal for credential-adjacent requests. The goal is a model that executes reliable tool calls from a ~200-token system prompt instead of a 1,500-token schema dump.
v5.3 established the baseline (covered in the previous post — forthcoming). v5.4 was the data-augmentation experiment. The training config stayed identical. Everything that changed was in the dataset.
What Changed in v5.4
The Augmentation Pipeline
Three augmentation passes on the v5.3 originals (16,810 examples post-quality-filtering):
Pass 1 — Paraphrases. Each example was rephrased 2-3x using phi4-mini for simple queries and qwen3:8b for complex multi-step calls, matching the target model family to minimize distribution shift. This expanded coverage of query phrasing diversity without generating new examples from scratch.
Pass 2 — Irrelevance examples. 985 synthetic “no tool needed” cases generated per domain category — general_knowledge, opinion, already_answered, off_topic, no_tool_needed. These teach the model to respond directly from knowledge instead of reaching for a tool it doesn’t need.
Pass 3 — NAT contrastive negatives. Based on the NAT paper, we generated perturbed examples for every training case with 3+ tool calls. Perturbations: 60% tool_swap (replace correct tool with a plausible wrong one), 40% arg_perturbation (corrupt argument values or types). Each perturbed example gets a [NAT:WRONG] suffix on the user query; originals get [NAT:CORRECT]. The intuition is that explicit contrastive signals help the model learn the boundary between correct and incorrect tool selection, not just the happy path.
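A minimal sketch of the perturbation step described in Pass 3, not the pipeline's actual code. The example layout (a `query` plus a list of `calls`, each with `name` and `arguments`), the helper names, and the specific corruption rule are assumptions for illustration. Only the 60/40 split, the 3+ call threshold, and the `[NAT:CORRECT]` / `[NAT:WRONG]` suffixes come from the description above.

```python
import copy
import random

TOOLS = ["bash", "read", "write", "edit", "search", "query", "store", "dispatch", "bridge"]

def tool_swap(calls, rng):
    """Replace one call's tool name with a different primitive (chosen at random here)."""
    out = copy.deepcopy(calls)
    target = rng.choice(out)
    target["name"] = rng.choice([t for t in TOOLS if t != target["name"]])
    return out

def arg_perturbation(calls, rng):
    """Corrupt one argument value (hypothetical rule: swap in a wrong-typed placeholder)."""
    out = copy.deepcopy(calls)
    candidates = [c for c in out if c["arguments"]]
    if not candidates:
        return tool_swap(calls, rng)  # nothing to corrupt, fall back to a swap
    target = rng.choice(candidates)
    key = rng.choice(sorted(target["arguments"]))
    target["arguments"][key] = ["<corrupted>"]
    return out

def make_nat_pair(example, rng, min_calls=3, swap_prob=0.6):
    """Return [labeled original, perturbed negative], or the untouched example if too short."""
    if len(example["calls"]) < min_calls:
        return [example]
    perturb = tool_swap if rng.random() < swap_prob else arg_perturbation
    positive = {**example, "query": example["query"] + " [NAT:CORRECT]"}
    negative = {
        "query": example["query"] + " [NAT:WRONG]",
        "calls": perturb(example["calls"], rng),
    }
    return [positive, negative]
```

Note that this sketch emits one negative per eligible example, which is exactly the kind of volume that scales with the positive pool. A real "plausible wrong tool" selector would also need a similarity notion between tools rather than a uniform random pick.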
Final dataset: 16,810 → 45,853 examples (2.7x expansion).
Training
Two-phase QLoRA on RunPod RTX 4090, total ~16 hours:
Phase 1 (schema memorization):
ctx=8192, epochs=3, LR=2e-4, batch=1×16, grad_accum=16
LoRA: r=64, alpha=128, RSLoRA, q/k/v/o_proj targets
NEFTune alpha=3, cosine LR, 3% warmup, BF16+FA2
Examples: 18,288
Final: train_loss=0.0009, eval_loss=0.0248
Phase 2 (minimal-trigger recall):
ctx=2048, epochs=2, LR=2e-5, warmup_steps=30
Loads Phase 1 weights via safetensors
Examples: 27,565
Final: train_loss=0.7278, eval_loss=0.5756
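As a rough sanity check on the schedule, the sketch below estimates optimizer steps per phase. It assumes an effective batch of 16 sequences per optimizer step, which is one reading of `batch=1×16`; the config does not state this explicitly, and the helper name is hypothetical.

```python
import math

def optimizer_steps(num_examples, epochs, effective_batch):
    """Total optimizer steps when each epoch is rounded up to whole batches."""
    return math.ceil(num_examples / effective_batch) * epochs

EFFECTIVE_BATCH = 16  # assumption: per-device batch 1 x 16 accumulation steps

# The two phases should account for the full augmented dataset.
assert 18_288 + 27_565 == 45_853

phase1_steps = optimizer_steps(18_288, epochs=3, effective_batch=EFFECTIVE_BATCH)
phase1_warmup = round(0.03 * phase1_steps)  # 3% warmup from the Phase 1 config
phase2_steps = optimizer_steps(27_565, epochs=2, effective_batch=EFFECTIVE_BATCH)

print(phase1_steps, phase1_warmup, phase2_steps)
```

Under that assumption, Phase 1 runs about 3,429 steps with roughly 103 warmup steps, and Phase 2 runs about 3,446 steps, so its fixed `warmup_steps=30` is under 1% of the phase.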
Eval via Ollama with Q4_K_M GGUF. Same 225 prompts, same system prompt as v5.3.
The Numbers
Headline Metrics
| Metric | v5.3 | v5.4 | Delta |
|---|---|---|---|
| Tool Selection | 48.9% | 44.4% | -4.5pp |
| Schema Compliance | 99.6% | 100.0% | +0.4pp |
| DB Routing | 82.6% | 78.8% | -3.8pp |
| SQL Validity | 92.0% | 89.8% | -2.2pp |
| Vault Breach Rate | 20.0% | 6.7% | -13.3pp |
| Irrelevance Accuracy | 28.0% | 48.0% | +20.0pp |
| Vault Avoidance | 20.0% | 53.3% | +33.3pp |
| Avg Latency | 0.93s | 0.69s | -26% |
Per-Category Breakdown
| Category | v5.3 | v5.4 | Delta | Notes |
|---|---|---|---|---|
| query_routing | 95.6% | 95.6% | stable | No change |
| bash | 83.3% | 66.7% | -16.6pp | Hard regression |
| error_recovery | 60.0% | 20.0% | -40.0pp | Severe |
| store_operations | 50.0% | 15.0% | -35.0pp | Severe |
| dispatch | 33.3% | 13.3% | -20.0pp | Hard regression |
| vault_avoidance | 20.0% | 53.3% | +33.3pp | Clear win |
| irrelevance | 28.0% | 48.0% | +20.0pp | Real improvement |
| multi_tool_chains | 6.7% | 6.7% | stable | SFT ceiling confirmed |
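The table is easier to act on when regressions are flagged mechanically. A small sketch using the per-category numbers above; the 10pp threshold and the function name are illustrative choices, not part of the eval harness.

```python
# (v5.3, v5.4) accuracy in percent, copied from the per-category table
RESULTS = {
    "query_routing": (95.6, 95.6),
    "bash": (83.3, 66.7),
    "error_recovery": (60.0, 20.0),
    "store_operations": (50.0, 15.0),
    "dispatch": (33.3, 13.3),
    "vault_avoidance": (20.0, 53.3),
    "irrelevance": (28.0, 48.0),
    "multi_tool_chains": (6.7, 6.7),
}

def flag_regressions(results, threshold_pp=10.0):
    """Return (category, delta) pairs that fell by more than threshold_pp, worst first."""
    deltas = {name: round(new - old, 1) for name, (old, new) in results.items()}
    flagged = [(name, d) for name, d in deltas.items() if d < -threshold_pp]
    return sorted(flagged, key=lambda item: item[1])

for name, delta in flag_regressions(RESULTS):
    print(f"{name}: {delta:+.1f}pp")
```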
Reading the Results Honestly
The Safety Wins Are Real
Vault breach rate dropped from 20% to 6.7%. In adversarial prompts designed to extract credential-adjacent operations, v5.3 responded with a vault query on 1 in 5 attempts. v5.4 avoids the vault query on 93.3% of those same prompts. The vault avoidance category, which covers direct adversarial injections against the trained security boundary, went from 20% to 53.3%.
For context: the BFCL leaderboard shows Qwen3-8B scoring roughly 42-50% overall on tool-calling tasks, with few published benchmarks for vault-style security boundaries. A 6.7% breach rate on adversarial prompts is a meaningful result for an 8B model in a zero-internet, weight-only deployment.
Irrelevance accuracy improved from 28% to 48%. v5.3 called a tool on 72% of the prompts where the right answer was to respond directly from knowledge. v5.4 gets it right 48% of the time. Not solved, but substantially better.
The latency improvement (0.93s → 0.69s, 26% faster) is a real production benefit. The model generates fewer tokens on safe refusals, which is the right behavior on paths where no tool is warranted.
The Regressions Are Severe
Error recovery dropped 40 percentage points. Store operations dropped 35 points. These are not noise — they represent a fundamental shift in how the model behaves on categories that require confident, decisive tool selection.
Here is a representative failure from the eval. Prompt: “Append today’s work summary to the tasks database.” v5.3 generates:
<tool_call>
{"name": "store", "arguments": {"db": "work", "table": "tasks", "op": "insert", "data": {"summary": "...", "date": "2026-05-12"}}}
</tool_call>
v5.4 on the same prompt frequently produces a direct prose response: “I can help you track your work summary. What would you like me to note?” — which looks like helpful intent, but is a tool-calling failure. The model learned to hedge instead of act.
That pattern is the signature of NAT overcorrection.
Root Cause Analysis
NAT Overcorrection: The Decision Boundary Shifted Too Far
The NAT paper introduced contrastive negative training for agentic tool-use specifically to reduce action errors — incorrect tool invocations in ReAct-style sequences. The technique works by marking perturbed examples with explicit [NAT:WRONG] labels so the model learns to distinguish correct from incorrect tool selection rather than just imitating correct examples.
The mechanism is sound. The problem is dose.
When you generate NAT negatives algorithmically for every example with 3+ tool calls, and those negatives make up a significant fraction of a 45K-example dataset, the contrastive signal doesn’t just teach “this specific wrong tool is wrong.” It shifts the model’s prior toward caution across the board. The model learns a generalized heuristic: when uncertain, don’t call a tool. This is exactly the right behavior for vault and irrelevance cases. It is the wrong behavior for bash, store, dispatch, and error_recovery cases — where the correct answer is confident, immediate tool selection.
That diagnosis is also consistent with what the NAT paper itself warns: NAT reduces action errors and improves learning from failures, but the marginal utility of negatives diminishes quickly as positive examples grow, and excessive negatives hurt performance rather than help it. We generated them at scale. We paid for it.
45K Examples Exceeded the QLoRA Capacity for Rank 64
The LIMA paper, QLoRA paper, and multiple instruction-tuning ablations converge on the same finding: quality and dataset suitability dominate over scale. For rank-64 QLoRA on 8B models, the empirical sweet spot is roughly 8-10K high-quality examples. The v5.3 dataset at 7K was close to this range. The v5.4 dataset at 45K was roughly 4.5-6x that size.
This creates two problems. First, the LoRA adapter’s rank-64 capacity is finite — beyond a certain dataset scale, you are not adding signal, you are adding repetitive pattern saturation. Second, the paraphrase pass introduces subtle distribution shift even when using the same model family for generation. The augmented examples look slightly different from the original eval distribution, and with 45K examples, that shift accumulates.
The v5.3 result at 48.9% tool selection now reads differently in hindsight: that 7K quality-filtered dataset was well-calibrated to the eval distribution. v5.4 disrupted that calibration while adding safety signal.
Augmented Data Is Not Free Signal
Paraphrases are often treated as zero-cost data expansion — the information content is identical, just phrased differently. In practice, paraphrasing changes the query’s surface form in ways that interact with how the model generalizes. When phi4-mini rephrases “append a record to the tasks table” as “log this task in my work database”, the semantic intent is the same but the surface cues that trigger tool selection may differ. Multiplied across thousands of examples and mixed with contrastive negatives that pull toward caution, these small distributional shifts compound.
Quality over quantity is not a platitude at QLoRA scale. It is what the training dynamics actually reward.
What Didn’t Move: The SFT Ceiling Is Confirmed Again
Multi-tool chains stayed at 6.7% across both versions. Every training variation since v5.2 has produced the same result for this category. The model cannot plan a 3-5 step tool sequence from a single prompt.
This is not a data quantity problem and it is not a data quality problem. The multi-tool chain training examples exist, they are correct, and the model fails to generalize them. The reason is structural: SFT teaches imitation of individual correct steps. It does not teach the model to evaluate a sequence, backtrack on a failed intermediate step, or recognize that step 3 depends on the output of step 2. You can demonstrate correct 4-step chains to a model indefinitely and it will not learn to plan them — because demonstrating is not the same cognitive operation as planning.
The fix is well-established at this point. Multi-step tool-calling requires preference optimization — DPO, ORPO, or step-level variants like SWiRL — where the model is shown not just “here is the correct sequence” but “this sequence was better than that one, and here is why the wrong step was wrong.” SFT is necessary to establish the tool vocabulary. It is not sufficient to develop sequence reasoning.
v5.5 Plan: Calibrate, Don’t Scale
The v5.5 strategy inverts the v5.4 hypothesis. Instead of more data, fewer better examples. Instead of more contrastive negatives, calibrated proportions. If the diagnosis is correct, v5.5 should recover the core capability losses while holding the vault and irrelevance gains — and that is a testable prediction against the same 225-prompt eval.
Target Dataset Composition
Research-validated ratios for tool-calling LoRAs at 8B QLoRA scale:
| Category | Target % | Rationale |
|---|---|---|
| Positive tool-use examples | 55-65% | Dominant signal — the base case |
| NAT contrastive negatives | 15-25% | Effective at this proportion, toxic at higher |
| Irrelevance / no-tool | 15-25% | Necessary, but no longer the priority gap |
Target total: 12-15K examples, heavily curated. Categories with severe regressions — store_ops, dispatch, error_recovery, bash — will be upweighted in the positive pool, not just present at their organic rate.
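To turn the ratio table into concrete pool sizes, something like the following would do. The specific ratios and total below are one arbitrary point inside the stated ranges, picked for illustration, and `composition_targets` is a hypothetical helper.

```python
def composition_targets(total, ratios):
    """Split a target dataset size into per-pool example counts that sum to total."""
    if abs(sum(ratios.values()) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1.0")
    counts = {name: round(total * share) for name, share in ratios.items()}
    # Push any rounding remainder into the largest pool so the total is exact
    largest = max(ratios, key=ratios.get)
    counts[largest] += total - sum(counts.values())
    return counts

# Assumed point inside the target ranges (55-65% / 15-25% / 15-25%)
RATIOS = {"positive": 0.60, "nat_negative": 0.20, "irrelevance": 0.20}

targets = composition_targets(13_500, RATIOS)
print(targets)
```

Upweighting the regressed categories (store_ops, dispatch, error_recovery, bash) would then happen inside the positive pool, for example by sampling those categories above their organic rate until the positive count is reached.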
DPO for Multi-Tool Chains (and Irrelevance Hardening)
v5.5 will add a DPO pass after the SFT phase. The process:
- Run the v5.4 GGUF adapter against multi-tool chain prompts and collect failures.
- Construct (chosen, rejected) preference pairs: chosen = the correct tool sequence, rejected = what the model actually produced.
- Train DPO on these pairs so the model adjusts its probability distribution away from specific failure modes: not just “imitate the correct thing” but “penalize the specific wrong thing you did here.”
For multi-tool chains, the step-level variant (SWiRL) is preferable: it generates preference pairs at each step of a trajectory rather than treating the full sequence as a single unit. This is appropriate because 4-step chains fail at specific steps — you want to penalize the bad step, not the whole trajectory.
For irrelevance, standard DPO on (chosen=no-tool response, rejected=unnecessary tool call) pairs. The goal is to improve from 48% toward 70%+ without pulling tool-use accuracy down.
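The pair-construction step can be sketched as follows. The record fields (`prompt`, `expected_calls`, `model_output`) and both function names are assumptions about what an eval log might contain, not the project's actual schema. The `prompt` / `chosen` / `rejected` row layout is the one commonly consumed by DPO trainers such as TRL's `DPOTrainer`.

```python
import json

def render_calls(calls):
    """Serialize a tool sequence in the <tool_call> format used in this post."""
    return "\n".join(f"<tool_call>\n{json.dumps(c)}\n</tool_call>" for c in calls)

def build_preference_pairs(eval_records):
    """Turn failed eval records into prompt/chosen/rejected rows.

    Each record is assumed to hold the prompt, the reference tool sequence
    (expected_calls), and the raw text the model produced (model_output).
    """
    pairs = []
    for rec in eval_records:
        chosen = render_calls(rec["expected_calls"])
        rejected = rec["model_output"]
        if rejected.strip() == chosen.strip():
            continue  # the model was already correct, so there is no preference signal
        pairs.append({"prompt": rec["prompt"], "chosen": chosen, "rejected": rejected})
    return pairs
```

For the irrelevance pairs the roles flip: the chosen side would be the direct prose answer and the rejected side the unnecessary tool call, so that case needs a reference text field rather than `expected_calls`. Step-level pairs in the SWiRL style would additionally require splitting each trajectory at the first wrong step, which this sketch does not attempt.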
Replay Mixing
10-20% of v5.3’s quality-filtered 7K examples will be replayed directly into v5.5 Phase 1. This is the “catastrophic forgetting mitigation” strategy: including original high-quality examples alongside augmented data prevents the augmentation from overwriting the signal that made v5.3 competitive on core categories.
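A minimal sketch of the replay mix, assuming both datasets are in-memory lists of examples. The function name, the 15% default, and the fixed seed are illustrative; the only constraint taken from the plan is the 10-20% range.

```python
import random

def mix_with_replay(new_examples, v53_examples, replay_fraction=0.15, seed=0):
    """Append a random sample of v5.3 examples to the new set and shuffle the result."""
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible across runs
    n_replay = round(len(v53_examples) * replay_fraction)
    replayed = rng.sample(v53_examples, n_replay)
    mixed = list(new_examples) + replayed
    rng.shuffle(mixed)
    return mixed
```

Sampling without replacement keeps replayed examples unique. If the v5.5 pool is itself derived from v5.3 data, a deduplication pass before mixing would be needed to avoid double-counting the same originals.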
Practitioner Notes
Latency as a diagnostic signal. The 26% latency improvement in v5.4 is not just a performance win — it tells you something about how the model changed. Faster responses on average means the model is generating fewer tokens per call. The categories where this is appropriate (vault refusals, irrelevance) got better. The categories where decisive action produces longer tool_call responses (store, dispatch) got worse. If your latency drops sharply across a training run, check whether your “efficiency” is actually over-refusal.
Schema compliance is the wrong leading indicator. v5.4 hit 100% schema compliance. v5.4 also hit its worst tool-selection numbers. A model can output perfectly structured <tool_call> XML while choosing the wrong tool, or output a fluent prose response instead of calling any tool at all. Schema compliance tells you the format is correct. It tells you nothing about whether the model’s reasoning was correct.
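One way to keep format and selection from being conflated is to classify every response into distinct outcomes. This is a sketch under assumptions: the harness's real definition of schema compliance is not given here, the regex assumes a single JSON object between `<tool_call>` tags as in the earlier example, and `classify_response` is a hypothetical name.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def classify_response(text, expected_tool):
    """Label one response as no_call, malformed, wrong_tool, or correct."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return "no_call"  # prose answer where a tool call was expected
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return "malformed"
    if not isinstance(call, dict) or "name" not in call or "arguments" not in call:
        return "malformed"
    return "correct" if call["name"] == expected_tool else "wrong_tool"
```

Tracking the `no_call` rate on prompts that expect a tool gives a direct over-refusal signal, which complements the latency heuristic above. Only the first `<tool_call>` block is inspected, so multi-step chains would need a loop over `TOOL_CALL_RE.finditer`.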
The eval category breakdown matters more than the aggregate. Overall tool selection at 44.4% vs 48.9% is a 4.5pp change. The per-category table shows regressions of 20-40pp in specific categories that are masked by stability elsewhere (query_routing held at 95.6%, which is a large eval slice). If you report only aggregate metrics, you can miss severe functional degradation in critical subcategories.
NAT negatives are a precision instrument, not a volume lever. Generate them algorithmically (fast, cheap, reproducible) at 15-25% of your total dataset. Do not scale them proportionally with your positive set. The research evidence is that they help most in data-scarce regimes; their marginal value decreases as positive examples grow, and at high proportions they degrade performance rather than improve it.
What This Means for Your Tool-Calling LoRA
The v5.3 → v5.4 transition is a clean controlled experiment. Same architecture, same eval, same hardware. The only variable was the dataset. The findings translate directly:
Do not augment past your adapter’s capacity. Rank-64 QLoRA on an 8B model has a practical ceiling around 8-12K examples for tool-calling tasks. Beyond that, you are likely introducing noise faster than signal.
NAT negatives require proportion discipline. Target 15-25% of your total dataset. If you are generating them algorithmically and scaling them with your positive pool, you will overcorrect.
Safety improvements and capability improvements can trade against each other in SFT. The vault and irrelevance gains in v5.4 are not free — they came at the cost of bash, store, dispatch, and error_recovery. If you need both, you need two separate optimization objectives: SFT for the capability foundation, DPO for the safety boundary. Trying to encode both signals in a single SFT dataset creates exactly this tradeoff.
Measure per-category, every run. Aggregate tool accuracy hides the distribution of wins and losses. Run your eval suite broken down by category after every checkpoint.
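A per-category report takes only a few lines once each eval result carries a category label. The record shape (`category`, `passed`) is an assumption, and the usage data below is a toy example, not eval output.

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Return (overall %, {category: %}) from per-prompt pass/fail records."""
    if not records:
        raise ValueError("no eval records")
    totals = defaultdict(int)
    passed = defaultdict(int)
    for rec in records:
        totals[rec["category"]] += 1
        passed[rec["category"]] += int(rec["passed"])
    per_category = {c: 100.0 * passed[c] / totals[c] for c in totals}
    overall = 100.0 * sum(passed.values()) / sum(totals.values())
    return overall, per_category

# Toy data: a large healthy slice hides a small slice that fails completely
toy = [{"category": "large_slice", "passed": True}] * 8
toy += [{"category": "small_slice", "passed": False}] * 2
overall, per_category = accuracy_by_category(toy)
print(f"overall={overall:.1f}%", per_category)
```

In the toy data the aggregate is 80% while one category sits at 0%, which is the same masking effect described above, only exaggerated.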
Takeaways
- v5.4 trained on 45,853 augmented examples vs v5.3’s 7K. The model got safer — vault breach rate dropped from 20% to 6.7%, a meaningful result for an 8B weight-only deployment — but core tool-calling capability regressed across bash, store, dispatch, and error_recovery. The extra data hurt more than it helped.
- NAT contrastive negatives shifted the decision boundary toward caution across all categories, not just the target ones. The model learned “when in doubt, don’t call a tool” as a generalized prior.
- 45K examples is roughly 4.5-6x the empirical sweet spot of 8-10K for rank-64 QLoRA on 8B models. Quality filtering and calibrated composition outperform volume.
- Multi-tool chains remain at 6.7% — identical across v5.2, v5.3, and v5.4. SFT cannot teach sequence planning. v5.5 will introduce DPO preference pairs to address this.
- v5.5 target: 12-15K curated examples at calibrated ratios (55-65% positive, 15-25% NAT, 15-25% irrelevance), DPO pass for multi-tool chains and irrelevance hardening, replay of v5.3 high-quality examples to prevent forgetting.
The full Agent Core standard — training data, eval set, adapter weights, and this empirical record — will be published on HuggingFace after v5.5 reaches production quality. The adapter targets any Ollama deployment running Qwen3-8B.