Agent Core v5.4: When Data Augmentation Backfires

The conventional wisdom in fine-tuning is seductive: more data means a better model. It is true often enough to feel like a law. Going into v5.4, we believed it. We had 7K quality-filtered examples from v5.3. We augmented them to 45,853 — paraphrases, contrastive negatives, irrelevance examples — a 2.7x expansion with techniques backed by recent research. We ran the same 225-prompt eval on the resulting GGUF adapter.

The model got safer. It also got worse at its primary job.

This post is the honest accounting of what happened: the numbers, the root cause, and the revised approach for v5.5. If you are training tool-calling LoRAs at QLoRA scale, the failure mode we hit is real and reproducible, and the fix is counterintuitive.

What Agent Core Is

Agent Core is a QLoRA adapter for Qwen3-8B that teaches a model the 9 universal tool primitives: bash, read, write, edit, search, query, store, dispatch, and bridge. It also trains in a 3-database architecture (system, agent, work) plus a behavioral vault boundary — a trained-in security refusal for credential-adjacent requests. The goal is a model that executes reliable tool calls from a ~200-token system prompt instead of a 1,500-token schema dump.

v5.3 established the baseline (covered in the previous post — forthcoming). v5.4 was the data-augmentation experiment. The training config stayed identical. Everything that changed was in the dataset.

What Changed in v5.4

The Augmentation Pipeline

Three augmentation passes on the v5.3 originals (16,810 examples post-quality-filtering):

Pass 1 — Paraphrases. Each example was rephrased 2-3x using phi4-mini for simple queries and qwen3:8b for complex multi-step calls. Using qwen3:8b for the complex cases keeps the generator in the target model family, which is intended to minimize distribution shift. This expanded coverage of query phrasing diversity without generating new examples from scratch.

Pass 2 — Irrelevance examples. 985 synthetic “no tool needed” cases generated per domain category — general_knowledge, opinion, already_answered, off_topic, no_tool_needed. These teach the model to respond directly from knowledge instead of reaching for a tool it doesn’t need.

Pass 3 — NAT contrastive negatives. Based on the NAT paper, we generated perturbed examples for every training case with 3+ tool calls. Perturbations: 60% tool_swap (replace correct tool with a plausible wrong one), 40% arg_perturbation (corrupt argument values or types). Each perturbed example gets a [NAT:WRONG] suffix on the user query; originals get [NAT:CORRECT]. The intuition is that explicit contrastive signals help the model learn the boundary between correct and incorrect tool selection, not just the happy path.
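A minimal sketch of the perturbation step described in Pass 3. The record layout (`query`, `tool_calls`, `name`, `arguments`), the `perturbation` field, and the way a "plausible" wrong tool is chosen (any other primitive, picked at random) are assumptions for illustration, not the actual pipeline.

```python
import copy
import random

# The nine primitives named in this post.
TOOLS = ["bash", "read", "write", "edit", "search",
         "query", "store", "dispatch", "bridge"]


def eligible(example: dict) -> bool:
    """Only examples with 3+ tool calls receive a negative."""
    return len(example["tool_calls"]) >= 3


def make_nat_negative(example: dict, rng: random.Random) -> dict:
    """Return a perturbed copy of `example` tagged [NAT:WRONG]."""
    neg = copy.deepcopy(example)
    call = rng.choice(neg["tool_calls"])
    # 60% tool_swap, 40% arg_perturbation; always swap if there are no arguments.
    if rng.random() < 0.6 or not call["arguments"]:
        # Assumption: any other primitive counts as a plausible wrong tool.
        call["name"] = rng.choice([t for t in TOOLS if t != call["name"]])
        neg["perturbation"] = "tool_swap"
    else:
        # Corrupt one argument by replacing its value with a marker object.
        key = rng.choice(sorted(call["arguments"]))
        call["arguments"][key] = {"corrupted": True}
        neg["perturbation"] = "arg_perturbation"
    neg["query"] = f'{example["query"]} [NAT:WRONG]'
    return neg


def tag_correct(example: dict) -> dict:
    """Return a copy of the original tagged [NAT:CORRECT]."""
    pos = copy.deepcopy(example)
    pos["query"] = f'{example["query"]} [NAT:CORRECT]'
    return pos
```

Passing a seeded `random.Random` keeps the generated negatives reproducible between runs, which matters when the negative share is later tuned.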

Final dataset: 16,810 → 45,853 examples (2.7x expansion).

Training

Two-phase QLoRA on RunPod RTX 4090, total ~16 hours:

Phase 1 (schema memorization):
  ctx=8192, epochs=3, LR=2e-4, batch=1×16, grad_accum=16
  LoRA: r=64, alpha=128, RSLoRA, q/k/v/o_proj targets
  NEFTune alpha=3, cosine LR, 3% warmup, BF16+FA2
  Examples: 18,288
  Final: train_loss=0.0009, eval_loss=0.0248

Phase 2 (minimal-trigger recall):
  ctx=2048, epochs=2, LR=2e-5, warmup_steps=30
  Loads Phase 1 weights via safetensors
  Examples: 27,565
  Final: train_loss=0.7278, eval_loss=0.5756
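
As a rough sanity check on the Phase 1 settings, the step counts and adapter scaling can be worked out by hand. This sketch assumes `batch=1×16` means a micro-batch of 1 with 16 accumulation steps on a single GPU, and that all 18,288 examples are used for training (any held-out eval split would lower the counts). The rsLoRA scaling factor alpha / sqrt(r) is the standard definition.

```python
import math

examples, epochs = 18_288, 3       # Phase 1 figures from the config above
micro_batch, grad_accum = 1, 16    # assumed reading of "batch=1×16"
r, alpha = 64, 128

effective_batch = micro_batch * grad_accum
steps_per_epoch = math.ceil(examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = round(0.03 * total_steps)   # 3% warmup

# rsLoRA scales the adapter update by alpha / sqrt(r) instead of alpha / r.
rslora_scale = alpha / math.sqrt(r)
standard_scale = alpha / r

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
print(rslora_scale, standard_scale)
```

Under these assumptions Phase 1 runs 1,143 optimizer steps per epoch (3,429 total, about 103 of them warmup), and the rsLoRA scale is 16 where standard LoRA scaling would give 2.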

Eval via Ollama with Q4_K_M GGUF. Same 225 prompts, same system prompt as v5.3.

The Numbers

Headline Metrics

Metric	v5.3	v5.4	Delta
Tool Selection	48.9%	44.4%	-4.5pp
Schema Compliance	99.6%	100.0%	+0.4pp
DB Routing	82.6%	78.8%	-3.8pp
SQL Validity	92.0%	89.8%	-2.2pp
Vault Breach Rate	20.0%	6.7%	-13.3pp
Irrelevance Accuracy	28.0%	48.0%	+20.0pp
Vault Avoidance	20.0%	53.3%	+33.3pp
Avg Latency	0.93s	0.69s	-26%

Per-Category Breakdown

Category	v5.3	v5.4	Delta	Notes
query_routing	95.6%	95.6%	stable	No change
bash	83.3%	66.7%	-16.6pp	Hard regression
error_recovery	60.0%	20.0%	-40.0pp	Severe
store_operations	50.0%	15.0%	-35.0pp	Severe
dispatch	33.3%	13.3%	-20.0pp	Hard regression
vault_avoidance	20.0%	53.3%	+33.3pp	Clear win
irrelevance	28.0%	48.0%	+20.0pp	Real improvement
multi_tool_chains	6.7%	6.7%	stable	SFT ceiling confirmed
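
The deltas in the per-category table can be recomputed and flagged mechanically. A minimal sketch using the values above; the 5pp flagging threshold is an assumption, not a figure from the eval harness.

```python
# Per-category accuracy (%) copied from the table above: (v5.3, v5.4).
SCORES = {
    "query_routing": (95.6, 95.6),
    "bash": (83.3, 66.7),
    "error_recovery": (60.0, 20.0),
    "store_operations": (50.0, 15.0),
    "dispatch": (33.3, 13.3),
    "vault_avoidance": (20.0, 53.3),
    "irrelevance": (28.0, 48.0),
    "multi_tool_chains": (6.7, 6.7),
}


def flag_regressions(scores: dict, threshold_pp: float = 5.0) -> dict:
    """Return {category: delta_pp} for drops of at least threshold_pp."""
    deltas = {c: round(new - old, 1) for c, (old, new) in scores.items()}
    return {c: d for c, d in deltas.items() if d <= -threshold_pp}


regressions = flag_regressions(SCORES)
for category, delta in regressions.items():
    print(f"{category}: {delta:+.1f}pp")
```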

Reading the Results Honestly

The Safety Wins Are Real

Vault breach rate dropped from 20% to 6.7%. On adversarial prompts designed to extract credential-adjacent operations, v5.3 responded with a vault query on 1 in 5 attempts. v5.4 avoids the vault query on 93.3% of those same prompts. The vault avoidance category (direct adversarial injections against the trained security boundary) went from 20% to 53.3%.

For context: the BFCL leaderboard shows Qwen3-8B scoring roughly 42-50% overall on tool-calling tasks, with few published benchmarks for vault-style security boundaries. A 6.7% breach rate on adversarial prompts is a meaningful result for an 8B model in a zero-internet, weight-only deployment.

Irrelevance accuracy improved from 28% to 48%. On 72% of these prompts, v5.3 called a tool when the right answer was to respond directly from knowledge. v5.4 responds directly 48% of the time. Not solved, but substantially better.

The latency improvement (0.93s → 0.69s, 26% faster) is a real production benefit. The model generates fewer tokens on safe refusals, which is the appropriate behavior: shorter responses on paths where no tool is warranted.

The Regressions Are Severe

Error recovery dropped 40 percentage points. Store operations dropped 35 points. These are not noise — they represent a fundamental shift in how the model behaves on categories that require confident, decisive tool selection.

Here is a representative failure from the eval. Prompt: “Append today’s work summary to the tasks database.” v5.3 generates:

<tool_call>
{"name": "store", "arguments": {"db": "work", "table": "tasks", "op": "insert", "data": {"summary": "...", "date": "2026-05-12"}}}
</tool_call>

v5.4 on the same prompt frequently produces a direct prose response: “I can help you track your work summary. What would you like me to note?” — which looks like helpful intent, but is a tool-calling failure. The model learned to hedge instead of act.

That pattern is the signature of NAT overcorrection.

Root Cause Analysis

NAT Overcorrection: The Decision Boundary Shifted Too Far

The NAT paper introduced contrastive negative training for agentic tool-use specifically to reduce action errors — incorrect tool invocations in ReAct-style sequences. The technique works by marking perturbed examples with explicit [NAT:WRONG] labels so the model learns to distinguish correct from incorrect tool selection rather than just imitating correct examples.

The mechanism is sound. The problem is dose.

When you generate NAT negatives algorithmically for every example with 3+ tool calls, and those negatives make up a significant fraction of a 45K-example dataset, the contrastive signal doesn’t just teach “this specific wrong tool is wrong.” It shifts the model’s prior toward caution across the board. The model learns a generalized heuristic: when uncertain, don’t call a tool. This is exactly the right behavior for vault and irrelevance cases. It is the wrong behavior for bash, store, dispatch, and error_recovery cases — where the correct answer is confident, immediate tool selection.

That diagnosis is also consistent with what the NAT paper itself warns: NAT reduces action errors and improves learning from failures, but the marginal utility of negatives diminishes quickly as positive examples grow, and excessive negatives hurt performance rather than help it. We generated them at scale. We paid for it.

45K Examples Exceeded the QLoRA Capacity for Rank 64

The LIMA paper, QLoRA paper, and multiple instruction-tuning ablations converge on the same finding: quality and dataset suitability dominate over scale. For rank-64 QLoRA on 8B models, the empirical sweet spot is roughly 8-10K high-quality examples. The v5.3 dataset at 7K was close to this range. The v5.4 dataset at 45K was roughly 4.5-5.7x that range.

This creates two problems. First, a rank-64 LoRA adapter has finite capacity: beyond a certain dataset scale, additional examples stop adding signal and mostly repeat patterns the adapter has already absorbed. Second, the paraphrase pass introduces subtle distribution shift even when the generator comes from the same model family as the target. The augmented examples look slightly different from the original eval distribution, and with 45K examples, that shift accumulates.

The v5.3 result at 48.9% tool selection now reads differently in hindsight: that 7K quality-filtered dataset was well-calibrated to the eval distribution. v5.4 disrupted that calibration while adding safety signal.

Augmented Data Is Not Free Signal

Paraphrases are often treated as zero-cost data expansion — the information content is identical, just phrased differently. In practice, paraphrasing changes the query’s surface form in ways that interact with how the model generalizes. When phi4-mini rephrases “append a record to the tasks table” as “log this task in my work database”, the semantic intent is the same but the surface cues that trigger tool selection may differ. Multiplied across thousands of examples and mixed with contrastive negatives that pull toward caution, these small distributional shifts compound.

Quality over quantity is not a platitude at QLoRA scale. It is what the training dynamics actually reward.

What Didn’t Move: The SFT Ceiling Is Confirmed Again

Multi-tool chains stayed at 6.7% across both versions. Every training variation since v5.2 has produced the same result for this category. The model cannot plan a 3-5 step tool sequence from a single prompt.

This is not a data quantity problem and it is not a data quality problem. The multi-tool chain training examples exist, they are correct, and the model fails to generalize them. The reason is structural: SFT teaches imitation of individual correct steps. It does not teach the model to evaluate a sequence, backtrack on a failed intermediate step, or recognize that step 3 depends on the output of step 2. You can demonstrate correct 4-step chains to a model indefinitely and it will not learn to plan them — because demonstrating is not the same cognitive operation as planning.

The fix is well-established at this point. Multi-step tool-calling requires preference optimization — DPO, ORPO, or step-level variants like SWiRL — where the model is shown not just “here is the correct sequence” but “this sequence was better than that one, and here is why the wrong step was wrong.” SFT is necessary to establish the tool vocabulary. It is not sufficient to develop sequence reasoning.

v5.5 Plan: Calibrate, Don’t Scale

The v5.5 strategy inverts the v5.4 hypothesis. Instead of more data, fewer better examples. Instead of more contrastive negatives, calibrated proportions. If the diagnosis is correct, v5.5 should recover the core capability losses while holding the vault and irrelevance gains — and that is a testable prediction against the same 225-prompt eval.

Target Dataset Composition

Target ratios for tool-calling LoRAs at 8B QLoRA scale, following the NAT and dataset-size findings above:

Category	Target %	Rationale
Positive tool-use examples	55-65%	Dominant signal — the base case
NAT contrastive negatives	15-25%	Effective at this proportion, toxic at higher
Irrelevance / no-tool	15-25%	Necessary, but no longer the priority gap

Target total: 12-15K examples, heavily curated. Categories with severe regressions (store_operations, dispatch, error_recovery, bash) will be upweighted in the positive pool, not just present at their organic rate.
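
To make the target composition concrete, here is a small sketch that turns a total and a set of fractions into per-category counts. The 13,500 total and the 60/20/20 split are illustrative picks inside the stated ranges, not the final v5.5 numbers, and the category keys are hypothetical.

```python
def composition_counts(total: int, fractions: dict) -> dict:
    """Split `total` examples across categories by target fraction."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1.0")
    counts = {name: round(total * f) for name, f in fractions.items()}
    # Give any rounding remainder to the largest category.
    largest = max(fractions, key=fractions.get)
    counts[largest] += total - sum(counts.values())
    return counts


plan = composition_counts(
    13_500,
    {"positive": 0.60, "nat_negative": 0.20, "irrelevance": 0.20},
)
print(plan)
```

Upweighting the regressed categories would then happen inside the positive pool, for example by sampling them at a higher rate than their organic share.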

DPO for Multi-Tool Chains (and Irrelevance Hardening)

v5.5 will add a DPO pass after the SFT phase. The process:

1. Run the v5.4 GGUF adapter against multi-tool chain prompts and collect failures.
2. Construct (chosen, rejected) preference pairs: chosen = the correct tool sequence, rejected = what the model actually produced.
3. Train DPO on these pairs so the model adjusts its probability distribution away from specific failure modes: not just “imitate the correct thing” but “penalize the specific wrong thing you did here.”
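
The pair-construction step can be sketched as follows. The input field names (`prompt`, `expected_calls`, `model_output`) are hypothetical. The output keys `prompt`, `chosen`, and `rejected` follow the layout commonly used for preference datasets, which is an assumption about what the eventual trainer will expect.

```python
import json


def build_preference_pairs(failures: list) -> list:
    """Turn eval failures into (prompt, chosen, rejected) records."""
    pairs = []
    for f in failures:
        chosen = json.dumps(f["expected_calls"])   # the correct tool sequence
        rejected = f["model_output"]               # what the model produced
        if chosen == rejected:
            continue  # not actually a failure, nothing to contrast
        pairs.append({"prompt": f["prompt"],
                      "chosen": chosen,
                      "rejected": rejected})
    return pairs
```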

For multi-tool chains, the step-level variant (SWiRL) is preferable: it generates preference pairs at each step of a trajectory rather than treating the full sequence as a single unit. This is appropriate because 4-step chains fail at specific steps — you want to penalize the bad step, not the whole trajectory.
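
To illustrate what penalizing the bad step rather than the whole trajectory could look like, the sketch below finds the first step where the produced tool sequence diverges from the expected one and builds a pair for that step only. It is a simplified illustration of step-level pairing, not an implementation of SWiRL; the `prefix` field and the `<stop>` marker are hypothetical.

```python
from typing import Optional


def first_bad_step(expected: list, actual: list) -> Optional[int]:
    """Index of the first step where the tool sequences diverge."""
    for i, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            return i
    if len(actual) != len(expected):
        # One sequence is a strict prefix of the other.
        return min(len(expected), len(actual))
    return None


def step_level_pair(prompt: str, expected: list, actual: list) -> Optional[dict]:
    """Build one preference pair for the first diverging step."""
    i = first_bad_step(expected, actual)
    if i is None:
        return None  # sequences match, no pair to build
    return {
        "prompt": prompt,
        "prefix": expected[:i],  # correct steps shared so far
        "chosen": expected[i] if i < len(expected) else "<stop>",
        "rejected": actual[i] if i < len(actual) else "<stop>",
    }
```

For an expected chain of read, edit, bash, store and a produced chain of read, bash, the pair targets step 2: `edit` is chosen and `bash` is rejected, with `read` as the shared prefix.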

For irrelevance, standard DPO on (chosen=no-tool response, rejected=unnecessary tool call) pairs. The goal is to improve from 48% toward 70%+ without pulling tool-use accuracy down.

Replay Mixing

10-20% of v5.3’s quality-filtered 7K examples will be replayed directly into v5.5 Phase 1. This is the “catastrophic forgetting mitigation” strategy: including original high-quality examples alongside augmented data prevents the augmentation from overwriting the signal that made v5.3 competitive on core categories.
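
A minimal sketch of the replay selection, reading "10-20%" as a fraction of the v5.3 set (an assumption; it could also be read as a share of the v5.5 mix). The placeholder records stand in for real training examples.

```python
import random


def replay_sample(v53_examples: list, fraction: float, seed: int = 0) -> list:
    """Pick a reproducible random `fraction` of the v5.3 examples."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = round(len(v53_examples) * fraction)
    return random.Random(seed).sample(v53_examples, k)


v53 = [{"id": i} for i in range(7000)]       # placeholder for the 7K set
replay = replay_sample(v53, fraction=0.15)   # 15%, inside the 10-20% range
print(len(replay))
```

Fixing the seed keeps the replay subset identical across reruns, so a change in eval results can be attributed to the new data rather than to a different replay draw.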

Practitioner Notes

Latency as a diagnostic signal. The 26% latency improvement in v5.4 is not just a performance win; it also tells you how the model changed. Faster responses on average mean the model is generating fewer tokens per call. The categories where this is appropriate (vault refusals, irrelevance) got better. The categories where decisive action produces longer tool_call responses (store, dispatch) got worse. If your latency drops sharply across a training run, check whether your “efficiency” is actually over-refusal.
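
One way to run that check is to measure how often the model answers in prose on prompts where a tool call was expected. A minimal sketch; the result fields `expects_tool` and `response` are hypothetical names for whatever the eval harness records.

```python
import re

TOOL_CALL_RE = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)


def over_refusal_rate(results: list) -> float:
    """Share of tool-expected prompts answered with no tool call."""
    expected = [r for r in results if r["expects_tool"]]
    if not expected:
        return 0.0
    missed = sum(1 for r in expected
                 if not TOOL_CALL_RE.search(r["response"]))
    return missed / len(expected)
```

Tracking this rate next to average latency across checkpoints shows whether a latency drop comes from shorter refusals on no-tool prompts or from skipped tool calls.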

Schema compliance is the wrong leading indicator. v5.4 hit 100% schema compliance. v5.4 also posted the lower tool-selection score of the two versions (44.4% vs 48.9%). A model can output a perfectly structured <tool_call> block while choosing the wrong tool, or output a fluent prose response instead of calling any tool at all. Schema compliance tells you the format is correct. It tells you nothing about whether the model’s reasoning was correct.
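
The two signals can be scored separately. In this sketch, schema compliance is only defined for responses that contain a call (an assumption about how the 100% figure was measured), and tool selection is checked against a hypothetical `expected_tool` label.

```python
import json
import re

CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def score_response(response: str, expected_tool: str) -> dict:
    """Score format validity and tool choice as separate signals."""
    m = CALL_RE.search(response)
    if m is None:
        # Prose answer: the schema check does not apply, tool selection fails.
        return {"schema_ok": None, "tool_ok": False}
    try:
        call = json.loads(m.group(1))
    except json.JSONDecodeError:
        return {"schema_ok": False, "tool_ok": False}
    schema_ok = (isinstance(call, dict)
                 and isinstance(call.get("name"), str)
                 and isinstance(call.get("arguments"), dict))
    tool_ok = schema_ok and call["name"] == expected_tool
    return {"schema_ok": schema_ok, "tool_ok": tool_ok}
```

A well-formed call to the wrong tool scores `schema_ok=True, tool_ok=False`, which is exactly the case an aggregate schema metric hides.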

The eval category breakdown matters more than the aggregate. Overall tool selection at 44.4% vs 48.9% is a 4.5pp change. The per-category table shows regressions of roughly 17-40pp in specific categories that are masked by stability elsewhere (query_routing held at 95.6%, which is a large eval slice). If you report only aggregate metrics, you can miss severe functional degradation in critical subcategories.

NAT negatives are a precision instrument, not a volume lever. Generate them algorithmically (fast, cheap, reproducible) at 15-25% of your total dataset. Do not scale them proportionally with your positive set. The research evidence is that they help most in data-scarce regimes; their marginal value decreases as positive examples grow, and at high proportions they degrade performance rather than improve it.
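
Holding negatives to a fixed share means capping them relative to everything else in the dataset. If N negatives are mixed with M other examples, the share N / (M + N) stays at or below p% when N ≤ p·M / (100 − p). A minimal sketch of that cap; the function names and the 20% default are illustrative.

```python
import random


def max_negatives(other_count: int, target_pct: int) -> int:
    """Largest N such that N / (other_count + N) <= target_pct / 100."""
    if not 0 <= target_pct < 100:
        raise ValueError("target_pct must be in [0, 100)")
    return target_pct * other_count // (100 - target_pct)


def cap_negatives(negatives: list, other_count: int,
                  target_pct: int = 20, seed: int = 0) -> list:
    """Randomly subsample negatives down to the target share."""
    limit = max_negatives(other_count, target_pct)
    if len(negatives) <= limit:
        return list(negatives)
    return random.Random(seed).sample(negatives, limit)
```

With 8,000 non-negative examples and a 20% target, the cap is 2,000 negatives, regardless of how many the generator can produce.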

What This Means for Your Tool-Calling LoRA

The v5.3 → v5.4 transition is a clean controlled experiment. Same architecture, same eval, same hardware. The only variable was the dataset. The findings translate directly:

Do not augment past your adapter’s capacity. Rank-64 QLoRA on an 8B model has a practical ceiling around 8-12K examples for tool-calling tasks. Beyond that, you are likely introducing noise faster than signal.

NAT negatives require proportion discipline. Target 15-25% of your total dataset. If you are generating them algorithmically and scaling them with your positive pool, you will overcorrect.

Safety improvements and capability improvements can trade against each other in SFT. The vault and irrelevance gains in v5.4 are not free — they came at the cost of bash, store, dispatch, and error_recovery. If you need both, you need two separate optimization objectives: SFT for the capability foundation, DPO for the safety boundary. Trying to encode both signals in a single SFT dataset creates exactly this tradeoff.

Measure per-category, every run. Aggregate tool accuracy hides the distribution of wins and losses. Run your eval suite broken down by category after every checkpoint.
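
A per-category breakdown takes only a few lines once each eval result carries a category label. A minimal sketch; the `category` and `correct` fields are hypothetical names for the harness output, and the toy data below is made up to show the masking effect.

```python
from collections import defaultdict


def per_category_accuracy(results: list) -> dict:
    """Accuracy (%) per category, plus the aggregate under 'overall'."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        correct[r["category"]] += int(r["correct"])
    report = {c: round(100.0 * correct[c] / totals[c], 1) for c in totals}
    report["overall"] = round(
        100.0 * sum(correct.values()) / sum(totals.values()), 1)
    return report


# Toy data: a large stable slice hides a weak small one in the aggregate.
toy = ([{"category": "query_routing", "correct": True}] * 18
       + [{"category": "query_routing", "correct": False}] * 2
       + [{"category": "dispatch", "correct": True}] * 1
       + [{"category": "dispatch", "correct": False}] * 4)
report = per_category_accuracy(toy)
print(report)
```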

Takeaways

v5.4 trained on 45,853 augmented examples vs v5.3’s 7K. The model got safer — vault breach rate dropped from 20% to 6.7%, a meaningful result for an 8B weight-only deployment — but core tool-calling capability regressed across bash, store, dispatch, and error_recovery. The extra data hurt more than it helped.
NAT contrastive negatives shifted the decision boundary toward caution across all categories, not just the target ones. The model learned “when in doubt, don’t call a tool” as a generalized prior.
45K examples is roughly 4.5-5.7x the empirical sweet spot for rank-64 QLoRA on 8B models (8-10K examples). Quality filtering and calibrated composition outperform volume.
Multi-tool chains remain at 6.7% — identical across v5.2, v5.3, and v5.4. SFT cannot teach sequence planning. v5.5 will introduce DPO preference pairs to address this.
v5.5 target: 12-15K curated examples at calibrated ratios (55-65% positive, 15-25% NAT, 15-25% irrelevance), DPO pass for multi-tool chains and irrelevance hardening, replay of v5.3 high-quality examples to prevent forgetting.

The full Agent Core standard — training data, eval set, adapter weights, and this empirical record — will be published on HuggingFace after v5.5 reaches production quality. The adapter targets any Ollama deployment running Qwen3-8B.