Agent Core: 9 Tools Is All You Need — SouthernSky Engineering Blog

Why Should You Care?

Every agent framework in production pastes tool schemas into the system prompt. At 5 tools that is manageable. At 23 tools — the number we were running in our previous personal taxonomy — you are spending roughly 5,000 tokens before the model has processed a single character of user input. You pay that cost on every call. Forever.

In our previous post, Schema on the Inside, we described a two-phase LoRA approach that trained a Qwen3-8B model to call 23 tools without any schemas in the prompt. That experiment confirmed schema internalization works. But 23 tools had a design problem: they were ours. Personal filesystem helpers, private database tables, custom operations that made sense in our environment and nobody else’s.

Agent Core v5 fixes that. We collapsed 23 personal tools down to 9 universal primitives, consolidated 7 databases into 3, added a trained-in security boundary called the vault, and retrained everything. Then we ran a head-to-head ablation study. The result: v5 with a minimal 739-character system prompt outperforms both the base model on a full prompt and our own v4 fine-tuned model — 58.2% tool selection accuracy versus 49.8% in the previous version, an 8.4-point improvement on the metric that matters most.

This post is the announcement of that standard and a complete walkthrough of what we built, how we trained it, and what the numbers actually say.

The Problem With More Tools

Before getting into the solution, it helps to understand why fewer tools is not an obvious choice. Intuitively, more tools should mean more capability. The counter-evidence is substantial.

Vercel published research in 2026 showing that removing 80% of the tools available to an agent produced a 100% task success rate — a meaningfully better outcome than the full tool set at comparable model sizes. Soni et al. (AAAI 2026) showed that tool-selection accuracy drops sharply as the number of routing options grows, specifically for models in the 7–9B parameter range. xLAM research converged on 6–10 tools as the practical ceiling for 8B models before routing quality degrades.

The intuition behind this: a model choosing between 9 options has a tractable decision space. A model choosing between 23 options is significantly more likely to pick the wrong one, especially on ambiguous prompts where multiple tools could plausibly apply.

Our v4 results confirmed this from the other side. At 23 tools with a full prompt, we hit 49.8% tool selection accuracy. That is not bad for a model that has internalized 23 distinct schemas — but it leaves a lot of room for improvement. The question was whether a tighter taxonomy could extract that room.

The 9 Universal Tools

The design constraint was strict: every tool must be a universal primitive that any developer could use for any agent, on any codebase, without modification. No personal operations. No environment-specific shortcuts. Every tool maps to a single, composable responsibility — a direct application of the Unix philosophy of doing one thing well.

Tool	Description
`bash`	Shell execution — the universal escape hatch for system management, git, pipelines, CLI tools
`read`	File read with optional offset and line limit
`write`	Create or overwrite a file with complete content
`edit`	Surgical string replacement — old string must be unique in the file
`search`	Unified text grep and file glob in one tool
`query`	SQL and FTS5 reads against system, agent, or work databases
`store`	SQL writes (INSERT, UPDATE, DELETE) against the same databases
`dispatch`	Delegate a task to another named agent
`bridge`	MCP, HTTP, and gRPC gateway for external services

The key consolidation decisions: read, write, edit, and search replace the sprawl of individual file-operation tools in v4. query and store replace separate per-database read/write tools (in v4, each database had its own tool pair). bridge replaces a collection of service-specific connectors — the model learns that all external services follow the same protocol / endpoint / method / params pattern, and the runtime handles authentication.

bash is the honest escape hatch. When none of the other eight cover a need precisely, bash does. The model learns a preference hierarchy: reach for query before search, search before bash. The shell is not a shortcut — it is a last resort.

The 3+1 Database Architecture

v4 had 7 databases, one per concern. That created two problems: more routing targets meant lower routing accuracy, and cross-database JOINs were impossible. We consolidated to 3 databases plus a vault.

Database	Tables	FTS5 Indexes	Purpose
`system`	`files`, `tools`	`files_fts`, `tools_fts`	Environment — filesystem index + tool registry
`agent`	`memories`, `skills`, `agents`	`memories_fts`, `skills_fts`	Identity — knowledge, capabilities, available agents
`work`	`documents`, `tasks`, `projects`	`knowledge_fts`	The work — knowledge archive, task tracking, project state
`vault`	`secrets`, `restricted_paths`	none	Credentials — NEVER QUERIED

The split follows a natural mental model. The system database is about the environment the agent operates in. The agent database is about what the agent knows and can do. The work database is about whatever the agent is working on. A developer adopting this standard swaps in their own schemas for each group while keeping the routing logic stable.
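As a rough illustration of what one of these databases could look like on disk, here is a minimal SQLite sketch of the work database. Only the table names and the knowledge_fts index name come from the table above. The column names and helper functions are assumptions made for the example, and it requires a SQLite build with FTS5 enabled.

```python
import sqlite3


def build_work_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the `work` database (columns are hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, body TEXT);
        CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT, status TEXT);
        CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT);
        CREATE VIRTUAL TABLE knowledge_fts USING fts5(title, body);
        """
    )
    return conn


def add_document(conn: sqlite3.Connection, title: str, body: str) -> None:
    """Insert a document and mirror it into the full-text index."""
    cur = conn.execute(
        "INSERT INTO documents (title, body) VALUES (?, ?)", (title, body)
    )
    # Keep the FTS rowid aligned with the documents primary key.
    conn.execute(
        "INSERT INTO knowledge_fts (rowid, title, body) VALUES (?, ?, ?)",
        (cur.lastrowid, title, body),
    )


def search_knowledge(conn: sqlite3.Connection, match_query: str) -> list[str]:
    """Return matching document titles, best match first, using FTS5 MATCH."""
    rows = conn.execute(
        "SELECT title FROM knowledge_fts WHERE knowledge_fts MATCH ? ORDER BY rank",
        (match_query,),
    )
    return [row[0] for row in rows]
```

A developer adopting the layout would replace the hypothetical columns with their own while keeping the database and table names the model routes to.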

The 3-database consolidation means ~2,666 training examples per database instead of ~1,140 per database in v4. Deeper per-target coverage translates directly to more reliable routing — the model has seen more examples of each database in action before it has to recall them.

The Vault Boundary

The vault is the novel security contribution in v5. Every agent system eventually deals with credentials: API keys, database passwords, SSH key paths. The standard approach puts them in environment variables or a secrets manager and hopes the model does not hallucinate them into tool calls or log them somewhere.

We took a different approach: train the model to never query or write to the vault database. At all. Even when the prompt looks like it is asking for credential-related information.

The model is trained on 15 standard symbolic variable names — VAULT_OPENAI_API_KEY, VAULT_GITHUB_TOKEN, VAULT_DB_PASSWORD, and so on — as first-class primitives. When the model needs a credential in a bash call, it references $VAULT_SUDO_PASSWORD by name. The runtime intercepts that reference and injects the actual value as a temporary environment variable for the subprocess. The value never appears in model context, tool call arguments, or log output.

This creates four enforcement layers:

Weights: the model actively avoids vault access even without runtime guardrails
Runtime DB rejection: the query and store tools return access-denied for any vault-targeted SQL
Path filtering: entries in the vault database’s restricted_paths table block filesystem traversal into those paths
Credential injection: the runtime replaces VAULT_* references with values at the last possible moment, after the model has already committed to the tool call structure
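The two runtime-side layers (DB rejection and credential injection) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the Agent Core runtime: the function names, the regex, and the choice to inject values as environment variables for a `sh -c` subprocess are all hypothetical, chosen to match the behavior described above.

```python
import os
import re
import subprocess

# Matches $VAULT_NAME and ${VAULT_NAME} references inside a shell command.
VAULT_REF = re.compile(r"\$\{?(VAULT_[A-Z0-9_]+)\}?")


def guard_database(database: str) -> None:
    """Runtime DB rejection: refuse any query/store call that targets the vault."""
    if database.strip().lower() == "vault":
        raise PermissionError("access denied: the vault is never queried")


def run_bash(command: str, secrets: dict[str, str]) -> str:
    """Credential injection: expose only the referenced VAULT_* values to the child."""
    needed = set(VAULT_REF.findall(command))
    missing = needed - secrets.keys()
    if missing:
        raise KeyError(f"unknown vault reference(s): {sorted(missing)}")
    env = dict(os.environ)
    # Values enter the subprocess environment only; the command string is untouched,
    # so the secret never appears in the tool call arguments.
    env.update({name: secrets[name] for name in needed})
    result = subprocess.run(
        ["sh", "-c", command], env=env, capture_output=True, text=True, check=True
    )
    return result.stdout
```

Because the value is resolved by the shell inside the subprocess, a reference only expands when it is unquoted or double-quoted in the command. A production runtime would also need to scrub secret values from captured output before logging it, which this sketch does not do.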

The v5.2 results on vault avoidance are mixed: both LoRA conditions, full prompt and minimal prompt, regress to a 20% breach rate compared to 6.7% for the base model. We traced this to a data quality issue in the irrelevance category (described later), which damaged the model’s ability to decline tool calls entirely. The weight-level boundary is confirmed working in smoke tests; the regressions are a training data problem, not a design problem. v5.3 addresses this directly.

The Three-Layer Stack

Agent Core is designed to be the foundation layer of a three-layer LoRA composition:

Layer 1 — Core Primitives (this post) The 9 universal tools and 3 database schemas are burned into model weights via QLoRA. Any prompt that arrives at the model can trigger tool calls without schema injection. This layer is horizontal — it belongs in every agent, regardless of domain.

Layer 2 — Pipeline LoRAs Domain-specific workflow sequences trained on top of Core Primitives. Examples: a media ingestion pipeline that knows how to move content from source to database in a consistent sequence, or a deployment pipeline that chains bash calls with status checks and rollback logic. These are vertical — they encode how to use the primitives for a specific workflow class.

Layer 3 — Personality LoRAs Character, voice, and values. A customer service agent, an educational tutor, a code reviewer — each has a personality layer on top that does not know anything about tool schemas directly but inherits the tool-calling ability from Layer 1.

Layers are composed via DARE-TIES merge at build time, or loaded as hot-swap adapters at inference via llama.cpp --lora. The critical point is that Layer 1 must exist before Layers 2 and 3 are trained — the model needs to know how to call tools before learning when to chain them for specific workflows.
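For readers unfamiliar with DARE-TIES, the toy below shows the two ideas on plain Python lists: DARE randomly drops entries of each layer's weight delta and rescales the survivors, then TIES elects a sign per parameter and averages only the deltas that agree with it. It is a simplified sketch, not the merge tooling used for Agent Core. Real merges operate on adapter weight tensors and usually apply a per-layer weight, which is omitted here, and all function names are hypothetical.

```python
import random


def dare(delta: list[float], density: float, rng: random.Random) -> list[float]:
    """DARE: keep each entry with probability `density`, rescale survivors by 1/density."""
    if not 0.0 < density <= 1.0:
        raise ValueError("density must be in (0, 1]")
    return [d / density if rng.random() < density else 0.0 for d in delta]


def ties_merge(deltas: list[list[float]]) -> list[float]:
    """TIES: elect a sign per parameter, then average only the agreeing non-zero deltas."""
    merged = []
    for values in zip(*deltas):
        sign = 1.0 if sum(values) >= 0 else -1.0
        agreeing = [v for v in values if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


def dare_ties(
    base: list[float],
    deltas: list[list[float]],
    density: float = 0.5,
    seed: int = 0,
) -> list[float]:
    """Sparsify each delta with DARE, merge with TIES, and add the result to the base."""
    rng = random.Random(seed)
    sparse = [dare(d, density, rng) for d in deltas]
    return [b + m for b, m in zip(base, ties_merge(sparse))]
```

The sign election is what lets a pipeline layer and a personality layer coexist: where two layers pull a parameter in opposite directions, the larger combined pull wins instead of the two cancelling out.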

Training: Two-Phase Schema Internalization

The training pipeline is the same two-phase approach we described in the previous post, applied with an updated taxonomy.

Data Generation

We generated 8,249 validated training examples across 10 categories using three cloud providers for synthesis diversity: OpenAI (gpt-4.1-mini), Gemini (gemini-2.5-flash), and Grok (grok-4-1-fast). Local Ollama handled coherence repair. The generation pipeline ran over approximately 4–5 hours total wall time.

Category	Target	Generated
`multi_tool_chains`	1,600	1,645
`bash`	1,550	1,614
`query_routing`	1,450	1,514
`file_operations`	800	842
`store_operations`	640	670
`irrelevance`	800	775
`dispatch`	400	416
`bridge`	400	414
`vault_avoidance`	200	191
`error_recovery`	160	168
Total	8,000	8,249

Total data generation cost across all three cloud APIs: approximately $35. 8,249 examples of supervised fine-tuning data for $35 is a meaningful signal about where synthetic training data economics are heading.

Phase 1: Schema Memorization

Phase 1 trains on the full 5,899-character system prompt. The model sees every tool signature, every database schema, every example query — 70% of examples with the complete prompt, 20% with one relevant database schema, 10% with the minimal prompt. The goal is to burn the schemas into the weights through repetition and varied context.
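One way to realize that 70/20/10 mix when assembling the Phase 1 dataset is to sample a prompt variant for each example. The sketch below assumes this per-example sampling approach. The variant labels and the function name are hypothetical, and only the proportions come from the description above.

```python
import random

# Proportions from the Phase 1 description; the labels are illustrative.
PROMPT_MIX = {"full": 0.70, "single_schema": 0.20, "minimal": 0.10}


def assign_prompt_variants(n_examples: int, seed: int = 0) -> list[str]:
    """Pick a system-prompt variant for each training example, reproducibly."""
    rng = random.Random(seed)
    names = list(PROMPT_MIX)
    weights = list(PROMPT_MIX.values())
    return rng.choices(names, weights=weights, k=n_examples)
```

Sampling gives approximate rather than exact proportions. If exact counts matter, build the list by count and shuffle it instead.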

Config: 8192-token context window, 3 epochs, learning rate 2e-4, LoRA rank 64 with RSLoRA, targeting q/k/v/o_proj. Training ran for approximately 5 hours on an RTX 3080 Ti (12GB VRAM), with one CUDA crash at step 985 recovered from checkpoint without data loss.

Phase 1 results:

Checkpoint	Training Loss	Eval Loss
Step 490 (epoch 1)	1.84	0.0920
Step 980 (epoch 2)	0.88	0.0834
Step 1470 (epoch 3)	0.189	0.0923

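The eval loss of 0.0923 at the final checkpoint is strong and, by our reading, reflects generalization rather than memorization. Note that it sits slightly above the epoch 2 value of 0.0834 while training loss kept falling, so most of the eval-side gain was already in place by the end of epoch 2.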

One critical VRAM pattern that is not optional: chunked cross-entropy loss. At 8192-token context with o_proj targeting on a 12GB card, the standard unsloth_fused_ce_loss spikes VRAM beyond what the card can handle. We replace it with a chunked implementation that processes 32 tokens at a time. Remove this and the training script OOMs.
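To make the pattern concrete, here is a dependency-free sketch of chunked cross-entropy. It is not the training script's implementation, which would operate on GPU tensors. It only shows the idea described above: project `chunk_size` positions through the output head at a time, accumulate the summed loss, and divide once at the end, so the full sequence-by-vocabulary logit matrix never exists at once. The function names and the plain-list inputs are assumptions for illustration, and padding or ignored labels are not handled.

```python
import math


def _logsumexp(xs: list[float]) -> float:
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def chunked_cross_entropy(
    hidden: list[list[float]],
    lm_head: list[list[float]],
    targets: list[int],
    chunk_size: int = 32,
) -> float:
    """Mean token-level cross-entropy, computing logits one chunk of positions at a time.

    hidden:  one hidden-state vector per token position
    lm_head: one weight row per vocabulary entry (same width as a hidden vector)
    targets: target token id per position
    """
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        chunk = hidden[start:start + chunk_size]
        chunk_targets = targets[start:start + chunk_size]
        # Only this chunk's logits are alive here; the previous chunk's are discarded.
        logits = [
            [sum(h * w for h, w in zip(vec, row)) for row in lm_head]
            for vec in chunk
        ]
        for row_logits, target in zip(logits, chunk_targets):
            total += _logsumexp(row_logits) - row_logits[target]
    return total / len(hidden)
```

The result is the same as computing the loss over the whole sequence at once. The difference is the peak size of the logits buffer, which scales with the chunk size instead of the sequence length.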

Phase 2: Minimal-Trigger Recall

Phase 2 loads the Phase 1 adapter weights and trains exclusively on the 741-character minimal system prompt. The model receives only tool names, database names, and the vault boundary — no signatures, no schemas, no example queries. It must recall all of that from Phase 1 weights.

Config: 2048-token context, 2 epochs, learning rate 2e-5 (10x lower to preserve learned schemas), fixed warmup of 30 steps. Training ran for 3 hours 19 minutes with zero crashes.

Phase 2 eval loss curve:

Step	Eval Loss
50	0.6621
100	0.5392
200	0.4591
400	0.4133
600	0.3937
Final (step 980)	0.3831

The smooth descent with no uptick at the end is exactly what we want to see. The final eval loss of 0.383 sits just below the 0.4–0.5 range we target for generalization, and well above 0.2, the level below which we would suspect memorization of the training set rather than recall from Phase 1 weights.

The total training pipeline — data generation, Phase 1, Phase 2, GGUF conversion — ran in approximately 15 hours on a single consumer GPU.

The Ablation: What the Numbers Say

We ran a three-condition ablation on 225 held-out prompts across all 10 categories:

Condition	Model	System Prompt	Prompt Size
A	Base Qwen3-8B	Full (5,899-char document, 5,883 chars measured)	5,883 chars
B	v5.2 LoRA	Full (5,899-char document, 5,883 chars measured)	5,883 chars
C	v5.2 LoRA	Minimal (741-char document, 739 chars measured)	739 chars

The result we were testing for: condition C within 5 percentage points of condition B on tool selection accuracy. If C stays within 5pp of B, schema internalization is confirmed — the model is recalling schemas from weights, not pattern-matching from the prompt.

Full ablation results:

Metric	A (Base+Full)	B (LoRA+Full)	C (LoRA+Minimal)	A→B	B→C
Tool Selection	45.3%	52.9%	58.2%	+7.6pp	+5.3pp
Schema Compliance	99.1%	100.0%	100.0%	+0.9pp	0pp
DB Routing	63.8%	87.9%	76.1%	+24.1pp	-11.8pp
SQL Validity	82.4%	95.4%	90.8%	+13.0pp	-4.6pp
Vault Breach Rate	6.7%	20.0%	20.0%	—	—
Bash Safety	94.7%	97.3%	98.5%	+2.6pp	+1.2pp
Avg Latency	15.75s	2.31s	1.47s	6.8x↓	1.6x↓
Prompt Size (chars)	5,883	5,883	739	0%	-87.4%

The headline finding: C beats both A and B on tool selection accuracy. Condition B already demonstrates that the LoRA adapter substantially improves tool selection (+7.6pp over the base model with the same full prompt). Condition C then adds another 5.3pp by removing the full schema from the prompt entirely. The model performs better when the prompt does not contain the schemas it has already internalized.
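The deltas and the pass criterion can be recomputed directly from the table. The short script below does that. The numbers are copied from the ablation results. The function names are hypothetical, and reading "within 5pp" as "C may not fall more than 5pp below B" is an assumption about how the criterion is applied.

```python
# (A, B, C) values copied from the ablation table, in percent.
RESULTS = {
    "tool_selection": (45.3, 52.9, 58.2),
    "db_routing": (63.8, 87.9, 76.1),
    "sql_validity": (82.4, 95.4, 90.8),
}
LATENCY_S = (15.75, 2.31, 1.47)
PROMPT_CHARS = (5883, 5883, 739)


def deltas(metric: str) -> tuple[float, float]:
    """Return (A to B, B to C) changes in percentage points."""
    a, b, c = RESULTS[metric]
    return round(b - a, 1), round(c - b, 1)


def internalization_confirmed(metric: str, tolerance_pp: float = 5.0) -> bool:
    """True when condition C is no more than `tolerance_pp` below condition B."""
    _, b, c = RESULTS[metric]
    return (b - c) <= tolerance_pp


def speedup(slow: int, fast: int) -> float:
    """Latency ratio between two conditions, indexed 0 = A, 1 = B, 2 = C."""
    return round(LATENCY_S[slow] / LATENCY_S[fast], 1)


def prompt_reduction_pct() -> float:
    """Relative reduction in prompt size from the full prompt to the minimal one."""
    return round((1 - PROMPT_CHARS[2] / PROMPT_CHARS[0]) * 100, 1)
```

Applied per metric, the same check passes for tool selection and SQL validity but not for DB routing, where C trails B by 11.8pp.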

The explanation is not mysterious. With the full prompt in context, the model attends to both the prompt schemas and its internalized weight-level knowledge simultaneously. Noise from two schema sources degrades the signal. With the minimal prompt, the model has only its weights to work from — and the weights, after two-phase training, are the cleaner source of truth.

The per-category breakdown makes this more concrete:

Category	A (Base)	B (LoRA+Full)	C (LoRA+Minimal)
`query_routing`	75.6%	100.0%	100.0%
`bash`	53.3%	83.3%	90.0%
`bridge`	26.7%	73.3%	73.3%
`file_operations`	28.0%	36.0%	72.0%
`store_operations`	45.0%	55.0%	85.0%
`dispatch`	6.7%	6.7%	33.3%
`error_recovery`	20.0%	60.0%	40.0%
`irrelevance`	76.0%	12.0%	8.0%
`multi_tool_chains`	6.7%	6.7%	6.7%

Query routing hits 100% in both LoRA conditions. Bash improves from 83% to 90% in the minimal condition — the model learned bash so well that the full prompt context is actively interfering. File operations and store operations show dramatic improvements: 36% → 72% and 55% → 85%. These are categories where the model clearly internalized the schemas and performs better recalling from weights than reading from prompt.

The losers: DB routing drops 11.8pp in the minimal condition. This is the expected tradeoff — without exact table schemas in the prompt, the model relies on weight-level pattern matching for routing decisions, which is less precise than having the schema directly visible. At 76% from weights alone, though, the model is still functionally useful.

The two broken categories — irrelevance and multi_tool_chains — are data quality failures, not architecture failures.

What Broke and Why

Publishing the failures alongside the wins is deliberate. Two of the ten categories completely failed to learn, and the root causes are traceable, fixable, and worth documenting so anyone running a similar pipeline can avoid the same mistakes.

Irrelevance Category: Corrupted (96% of examples have wrong labels)

The irrelevance category is supposed to teach the model that some questions do not need tool calls. “What is the difference between horizontal and vertical scaling?” — no tool needed, just answer.

A bug in the plan generator gave every seed task with an empty expected_tools field a default bash step. Since all 30 irrelevance seeds had empty tool lists (correctly — that is the whole point of the category), the generation prompt produced tool calls for 743 of the 775 irrelevance examples. We trained the model to call bash when asked a conceptual question. The category collapsed from 76% accuracy (base model, no fine-tuning) to 12% (v5.2 LoRA).
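The failure mode is easy to reproduce in miniature. The sketch below is a hypothetical reconstruction, not the actual plan generator: it assumes the default was applied with a truthiness check, which treats an intentionally empty tool list the same as a missing one. It also includes the kind of dataset check that would have flagged the problem before training. All function and field names other than expected_tools are invented for the example.

```python
def plan_steps_buggy(seed: dict) -> list[str]:
    """Assumed shape of the bug: an empty list is falsy, so the default replaces it."""
    return seed.get("expected_tools") or ["bash"]


def plan_steps_fixed(seed: dict) -> list[str]:
    """Preserve an explicit empty list, which means 'answer without tools'."""
    tools = seed.get("expected_tools")
    if tools is None:
        raise ValueError("seed is missing expected_tools; refusing to guess a default")
    return list(tools)


def irrelevance_mislabel_rate(examples: list[dict]) -> float:
    """Fraction of irrelevance examples that contain at least one tool call."""
    irrelevant = [ex for ex in examples if ex["category"] == "irrelevance"]
    if not irrelevant:
        return 0.0
    bad = sum(1 for ex in irrelevant if ex["tool_calls"])
    return bad / len(irrelevant)
```

Running a check like `irrelevance_mislabel_rate` as a validation gate, and failing the build when it is above zero, turns this class of silent label corruption into a loud error.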

This also explains the vault breach regression. The model lost its ability to say “no tool needed” entirely, which meant it was generating tool calls in vault-adjacent contexts where the correct answer was to decline.

The fix is already in the generation pipeline. v5.3 will regenerate 800 clean irrelevance examples with the corrected guard and merge them with the v5.2 training data.

Multi-Tool Chains: Not Learned (6.7% in all conditions)

Multi-tool chains scored 6.7% in all three conditions including the base model — which means the LoRA adapter did not move the needle at all. The training taxonomy planned 1,600 multi-tool chain examples with specific subcategories: discover_and_dispatch, tool_discovery_bridge, error_recovery_chains, complex_workflows. But the pipeline had no mechanism to enforce ChainWeave-specific generation — it produced multi-turn conversations, but not examples with explicit 3–5 tool sequences following a branching pattern.

The model never saw what multi-tool chaining is supposed to look like. v5.3 will generate approximately 600 targeted chain examples with explicit sequencing patterns.

Modelfile and Tool Call Format

The complete Modelfile for an Agent Core v5.2 spirit:

FROM ./agent-core-v5.2-qwen3-8b-q4_k_m.gguf
TEMPLATE {{ .Prompt }}
RENDERER qwen3.5
PARSER qwen3.5
PARAMETER temperature 0.3
PARAMETER num_ctx 8192

The RENDERER and PARSER directives are the key detail. Ollama 0.21+ has built-in renderers for the Qwen family that handle tool call formatting and parsing natively. Do not write a manual Go template for this — the built-in qwen3.5 renderer matches the <tool_call> format the model was trained on.

A standard tool call from the model looks like this:

{"name": "bash", "arguments": {"command": "ls /tmp"}}

Query routing to a specific database:

{
  "name": "query",
  "arguments": {
    "database": "system",
    "sql": "SELECT path, size_bytes FROM files WHERE path LIKE '%.py' ORDER BY modified_at DESC LIMIT 10"
  }
}

A credential reference using the vault pattern:

{
  "name": "bash",
  "arguments": {
    "command": "curl -H \"Authorization: Bearer $VAULT_GITHUB_TOKEN\" https://api.github.com/user/repos"
  }
}

The runtime replaces $VAULT_GITHUB_TOKEN with the actual token before subprocess execution. The model never constructs the value — it only names it.
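If you are running the model outside Ollama's built-in parser, the tool call payloads shown above arrive wrapped in <tool_call> tags and need to be extracted by the runtime. The following is a minimal sketch of that step, assuming one JSON object per tag pair. The function name and the validation rules are hypothetical, and the tool list is the nine primitives from this post.

```python
import json
import re

TOOL_CALL = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
KNOWN_TOOLS = {
    "bash", "read", "write", "edit", "search",
    "query", "store", "dispatch", "bridge",
}


def parse_tool_calls(text: str) -> list[dict]:
    """Extract and validate every <tool_call> JSON payload in a model response."""
    calls = []
    for raw in TOOL_CALL.findall(text):
        call = json.loads(raw)
        if call.get("name") not in KNOWN_TOOLS:
            raise ValueError(f"unknown tool: {call.get('name')!r}")
        if not isinstance(call.get("arguments"), dict):
            raise ValueError("arguments must be a JSON object")
        calls.append(call)
    return calls
```

A response with no tags yields an empty list, which is the signal that the model chose to answer directly rather than call a tool.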

What the Minimal Prompt Looks Like

For reference, this is the entire system prompt that drives condition C — the one that produced 58.2% tool selection accuracy:

You are an Agent Core v5.2 assistant. You solve problems by composing
9 tool primitives and querying 3 databases. Think in SQL first.

Tools: bash, read, write, edit, search, query, store, dispatch, bridge
Databases: system, agent, work
Vault: credentials are injected by the runtime — never query or store
to the vault.

Recall database schemas from training. Route queries to the correct
database and table. Use FTS5 MATCH syntax for full-text search.

Prefer query/store over filesystem scanning. Prefer specialized tools
over bash. Reference VAULT_* variables by name for credentials.

<tool_call>
{"name": "tool_name", "arguments": {"key": "value"}}
</tool_call>

Tool results arrive in:
<tool_response>
...result...
</tool_response>

739 characters. No schema definitions. No example queries. No parameter tables. The model supplies all of that from weights. The 87.4% reduction in prompt size (measured in characters) comes with faster inference: 1.47 seconds average latency versus 15.75 seconds for the base model on the full prompt, a 10.7x speedup overall. Most of that gap appears between conditions A and B, where the prompt is identical (15.75s to 2.31s, 6.8x); cutting the prompt from B to C contributes the remaining 1.6x.

Adopting the Standard

The standard is designed to be adoptable without modifications to the core 9 tools or the 3+1 database architecture. When you adapt it for your own agent, four decisions drive the customization:

Replace the system database content with your actual filesystem index and available tools. The schema stays the same; the rows change.

Replace the agent database content with your agent’s memories, skills, and available specialists. A teaching agent, a devops agent, and a research agent will have completely different rows — but the same table structure.

Replace the work database content with your domain’s documents, tasks, and projects. The knowledge_fts FTS5 index works the same way whether it is indexing medical literature or software documentation.

Add domain tools via bridge rather than extending the core 9. The bridge tool is the extension point. If your agent needs to call a Slack API, a Stripe API, or a custom internal service, the runtime registers the endpoint and handles authentication — the model just needs to know the endpoint name from the system.tools table.
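To illustrate the extension point, here is a small sketch of how a runtime might resolve a bridge call against its registered endpoints. Everything in it is an assumption made for the example: the `Endpoint` fields, the argument keys (taken from the protocol / endpoint / method / params pattern), the example URL, and the VAULT_SLACK_TOKEN name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    protocol: str    # "mcp", "http", or "grpc"
    base_url: str
    credential: str  # symbolic VAULT_* name, never the secret value


def resolve_bridge_call(arguments: dict, registry: dict[str, Endpoint]) -> dict:
    """Turn the model's bridge arguments into a request plan the runtime can execute."""
    name = arguments["endpoint"]
    if name not in registry:
        raise KeyError(f"endpoint not registered: {name}")
    endpoint = registry[name]
    if arguments["protocol"] != endpoint.protocol:
        raise ValueError(
            f"{name} is registered as {endpoint.protocol}, got {arguments['protocol']}"
        )
    return {
        "protocol": endpoint.protocol,
        "target": endpoint.base_url,
        "method": arguments["method"],
        "params": arguments.get("params", {}),
        # Authentication stays with the runtime; the plan carries only the name.
        "credential_ref": endpoint.credential,
    }
```

Adding a new external service then means adding a registry entry, with no change to the tool list the model was trained on.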

When you retrain on this taxonomy, the two-phase approach applies unchanged. Phase 1 with your domain-specific database schemas in the full system prompt, Phase 2 with only the minimal prompt. The minimal prompt does not change between domains — the model learns your schemas in Phase 1 and recalls them from weights in Phase 2.

What’s Next

v5.3 is a data supplement, not a redesign. We will regenerate the broken irrelevance category (800 examples, fixed pipeline), add targeted ChainWeave examples for dispatch and multi_tool_chains, and reinforce vault avoidance. Expected new total: approximately 10,250 examples. Full Phase 1 + Phase 2 retrain.

Expected impact from remediation: irrelevance above 70%, dispatch above 40%, multi_tool_chains above 30%, vault breach rate below 5%.

After v5.3, the plan is a whitepaper documenting the standard in full — training taxonomy, evaluation methodology, benchmark comparisons — and a HuggingFace release of the adapter, evaluation set, and training data generation pipeline.

If you are training tool-calling LoRAs on consumer hardware, Agent Core v5 gives you a proven starting point: 9 universal tools, 3 routable databases, a credential boundary, and a two-phase training recipe that runs on 12GB VRAM for $35 in data generation costs.

What You Learned

Fewer tools genuinely outperform more tools for 8B models: the evidence is in the research literature and matches our own results (9 tools at 58.2% in the v5.2 ablation vs 23 tools at 49.8% in v4)
The minimal prompt beats the full prompt when schema internalization succeeds — the full context is noise that dilutes what the model already knows in its weights
Two-phase training (full schemas in Phase 1, minimal trigger in Phase 2) is the mechanism that makes recall work: the model burns schemas into weights, then proves it can retrieve them without the prompt as a crutch
The vault pattern provides a trained-in credential boundary: the model learns VAULT_* symbolic names as first-class primitives and the runtime handles late-binding injection, keeping actual secrets out of model context entirely
Data quality matters more than quantity: the corrupted irrelevance category caused a measurable regression across multiple metrics. The 743 mislabeled examples, out of 8,249 total, were enough to actively harm the model

The architecture is stable. The two failures we documented here have known fixes already in the pipeline. Agent Core v5 is ready to serve as a foundation layer — and v5.3 will make it stronger.