Schema on the Inside: Training an 8B Model to Recall Tool Schemas From Memory
Applied Intermediate 18 min read

Schema on the Inside: Training an 8B Model to Recall Tool Schemas From Memory

How we trained Qwen3-8B to call 23 tools without any schemas in the prompt — and beat BFCL benchmarks in the process.

By J. Martin · · ai ml
Table of Contents

Why Should You Care?

Every tutorial on AI agents starts the same way: construct a system prompt, paste in your tool descriptions, watch the token counter climb.

At 23 tools, that system prompt is 20,000 bytes before the user has typed a single character. That is around 5,000 tokens of context consumed before any reasoning begins. You pay that cost on every single call. Every call. Forever.

There is a different question worth asking: what if the model already knew the tools?

Not “had tools described to it” — but had internalized the schemas the same way a senior developer has internalized the stdlib. No lookup. No retrieval. Just recall.

By the time you finish reading this post, you will have seen exactly how a two-phase LoRA approach can embed that kind of schema memory into a compact model — and you will have the configuration details to run it yourself. This post walks through the experiment, the results (which surprised me), and what they imply about where tool-calling agents are headed.


Three Paradigms for Tool-Aware Agents

Before getting into the training, it helps to name the three approaches to giving an AI agent access to tools — because the tradeoffs clarify exactly what we were trying to solve.

Context-window agents (Claude Code, Cursor, most production agents) paste the full tool schemas into every prompt. The model sees description: "Search the filesystem for files matching a pattern..." before it reasons about your request. The schemas are always visible, always consuming tokens, always a dependency. This is the standard approach because it works reliably without any training overhead.

RAG agents (many LangChain pipelines) store tool descriptions in a vector database and retrieve the top-k relevant ones before each call. Instead of 23 schemas, you inject 3. This reduces the context overhead, but it adds a retrieval step and introduces a new failure mode: if the retrieval misses a relevant tool, the model cannot call it.

Schema-internalized agents are what we built. The model has no schemas in the prompt. No retrieval step. When it decides to use db_search, it constructs the call from memory — the correct argument names, types, and usage patterns live in the model’s weights, learned during fine-tuning. The system prompt is 643 bytes and tells the model what role it is playing, not what tools exist.

The question this experiment answers: can an 8B model actually do this reliably enough to be useful?


The Tool Taxonomy

Before writing a single training example, we defined a stable 23-tool taxonomy. Taxonomy stability matters more than most tutorials acknowledge — if tool signatures change between training and inference, the model’s memory becomes wrong, and wrong memory is worse than no memory.

The 23 tools across six categories:

CategoryCountExamples
Core file ops6read_file, write_file, list_directory, move_file, delete_file, file_exists
DB and search7db_search, db_query, memory_lookup, calendar_query, research_search, media_search, knowledge_retrieve
Pipeline dispatch3spirit_dispatch, run_pipeline, queue_task
External APIs2github_api, web_fetch
System1bash_exec
Meta4clarify, plan, summarize, error_report

The most important design decision here was the split between db_search and bash_exec. The model should reach for a database query first — not grep, not find, not a Python loop. Our agent framework maintains seven SQLite FTS5 indexes per host: file metadata, research archives, memory files, calendar entries, media library, knowledge base, and project state. A full-text search against any of them completes in milliseconds and returns structured results. The model does not need to know that grep exists for most search tasks.

This “thinks in databases” architecture is what separates our approach from most agent frameworks. The retrieval decision lives in the model’s weights, not in an external orchestrator that decides “now we search” vs. “now we read files.”


Generating 3,213 Training Examples: ToolWeave

You cannot fine-tune a model to internalize schemas you never showed it. The training data is where the schema memory gets written.

We built a four-stage pipeline called ToolWeave to generate synthetic training pairs at scale. The full pipeline is available in the SouthernSky Code repository — here is how each stage works:

Stage 1: Plan Generation
  → Given a tool taxonomy, generate diverse task scenarios
  → Each scenario maps to one or more tools

Stage 2: Dialogue Synthesis
  → For each scenario, generate a full conversation:
      user message → model reasoning → tool call → tool result → final response
  → Three providers generate independently for diversity

Stage 3: Validate and Normalize
  → JSON schema validation against the canonical tool signatures
  → Strip malformed calls, hallucinated argument names, wrong types
  → Normalize to consistent chat template format

Stage 4: Dedup and Filter
  → Remove near-duplicate examples (MinHash, threshold 0.85)
  → Ensure coverage across all 23 tools
  → Balance the dataset to prevent over-representation of simple tools

Three providers generated examples independently: Gemini 2.5 Flash, Grok 4 Fast, and GPT-4.1 Mini. This was intentional. A single provider tends to produce examples with consistent structural patterns — essentially the same reasoning path rephrased. Cross-provider generation means the model sees the same tool called in three stylistically distinct ways, which significantly improves generalization.

Final corpus: 3,213 validated examples across all 23 tools. Each example is a complete conversation including the tool result and the final user-facing response — the model learns not just to call tools, but what to do with the output.


Phase 1: Teaching the Schema

The first training phase uses the full tool context. Every example includes all 23 schema definitions in the system prompt — the complete JSON with argument names, types, descriptions, and examples. The prompts are approximately 20,000 bytes each.

The goal of Phase 1 is schema memorization. The model sees the schema hundreds of times across thousands of examples. By the end of Phase 1, it should be able to reconstruct a schema from memory because it has internalized the pattern deeply.

Training configuration:

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=torch.bfloat16,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_alpha=128,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

The LoRA rank of 64 on q/k/v projections only was a deliberate choice. Higher rank increases capacity and training time. Targeting only the attention projections (not MLP layers) keeps the adapter focused on the pattern-recognition and recall tasks where tool schema memory lives, rather than modifying the model’s general knowledge.

One critical lesson from Phase 1: SFTTrainer silently disables gradient checkpointing by default. The flag exists, but the default behavior ignores it unless you pass it explicitly through the training config:

from trl import SFTConfig

training_args = SFTConfig(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_ratio=0.05,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=False,
    bf16=True,
    logging_steps=10,
    output_dir="./toolweave-phase1",
    # Required — SFTTrainer silently ignores this if not in SFTConfig
    gradient_checkpointing=True,
)

Without this flag, training a 4-bit Qwen3-8B on a consumer GPU with 12GB VRAM will run out of memory partway through the first epoch. It took a frustrating OOM crash and a dive into the TRL source to find this.

NEFTune noise was enabled at alpha=3:

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=phase1_dataset,
    args=training_args,
    neftune_noise_alpha=3,
)

NEFTune adds random uniform noise to token embeddings during the forward pass. The intuition: without noise, the model can memorize training examples exactly, and exact memorization generalizes poorly. With noise, it has to learn the underlying pattern rather than the surface form. At alpha=3, the model is pushed to generalize without the noise level being high enough to destabilize training.

Phase 1 training took approximately one hour on a single GPU.

The warning sign: After Phase 1, validation loss was approaching zero. That is not a good sign. A val_loss near zero on a held-out split means the model has memorized the training examples, not generalized from them. The Phase 1 examples are all structured identically — same system prompt, same schema block, same format — so there is not enough surface variation to prevent memorization.

Phase 1 is a necessary step, not the destination. The schema is now in the weights. Phase 2 is where we find out if it can be recalled without the prompt.


Phase 2: Recalling Without the Scaffold

Phase 2 uses the same 3,213 examples with a completely different system prompt structure. The 20KB schema block is gone. In its place is 643 bytes:

You are an AI assistant with access to a set of tools. Use them when needed.
Reason briefly (1-3 sentences) before calling a tool. Respond concisely.

That is the entire system prompt. No tool names. No argument descriptions. No JSON schema. If the model calls a tool in Phase 2, it is calling from memory.

Training setup for Phase 2 is identical to Phase 1, but with a lower learning rate and fewer epochs to avoid overwriting the Phase 1 learning:

training_args_p2 = SFTConfig(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_ratio=0.03,
    num_train_epochs=2,
    learning_rate=8e-5,        # Lower than Phase 1
    bf16=True,
    output_dir="./toolweave-phase2",
    gradient_checkpointing=True,
)

The lower learning rate matters here. Phase 2 is not trying to write new information — it is reinforcing a retrieval pattern on top of information that is already in the weights. Too aggressive a learning rate risks overwriting what Phase 1 established.

Qwen3 models have a structural wrinkle that Phase 2 directly addresses: thinking tokens. Qwen3-8B generates chain-of-thought reasoning enclosed in <think>...</think> tags before producing an output. In Phase 1, these thinking sections were sometimes hundreds of tokens long — the model was using the schema block as a reference and reasoning through it extensively. In Phase 2, without the schema to consult, the thinking sections shorten dramatically. The model has to produce tool calls from recall, not from visible reference. Phase 2 training teaches the model that short, confident reasoning is the correct pattern for tool dispatch.

Phase 2 training: approximately one hour. Total wall-clock cost for the entire experiment: roughly two hours on a single consumer GPU. That is the right frame for the scope of what we were testing — not a cluster job, not a multi-day run.


Evaluation: 40-Prompt Benchmark

We evaluated the Phase 2 model against 40 held-out prompts covering all tool categories. Each prompt was assessed for:

  1. Tool selection accuracy — did the model call the right tool?
  2. Schema compliance — was the generated JSON valid and correctly structured?
  3. Bash safety — did any bash_exec calls include dangerous patterns?
  4. Response token count — how verbose was the final answer?

Results:

MetricPhase 1 (full schemas)Phase 2 (no schemas)
Tool selection accuracy25.0%47.5%
Schema compliance89.0%97.5%
Average tokens per response800+68.7
Bash safety100%100%

The headline number: 47.5% tool selection accuracy with zero schemas in the prompt.

For context, Qwen3-8B in its base function-calling configuration (FC mode with full schemas injected) scores 42.6% on Berkeley Function Calling Leaderboard (BFCL). Our model scores 47.5% without schemas. It is outperforming the schema-prompted baseline while consuming a fraction of the context.

The schema compliance number — 97.5% — is arguably more important than the selection accuracy. When the model decides to call a tool, it almost always constructs the call correctly. Argument names are right. Types are right. Required vs. optional fields are handled correctly. The schema is genuinely in the weights.

The token count drop from 800+ to 68.7 is a practical win. In Phase 1, the model would generate long preambles explaining the tool it was about to use — because the schema was visible and it was referencing it. In Phase 2, the response is direct: a brief reasoning sentence, the tool call, the result, a concise answer.


Current Limits: The Compositional Gap

The experiment was honest about its limits. Multi-tool chaining accuracy: 0%. Pipeline dispatch accuracy: 0%. Brainstorming task accuracy: 0%.

These are not failures of the schema internalization approach — they are failures that every current model shares.

WildToolBench (ICLR 2026) evaluated 57 language models on compositional tool use: tasks requiring efficient orchestration of multiple tool calls across dialogue turns. The finding was stark: no model exceeds 15% accuracy. The gap is not between schema-in-prompt and schema-in-weights. The gap is between single tool dispatch and chained reasoning.

Understanding why helps clarify what Phase 2 actually trained. The model learned a pattern: “user request → identify single tool → construct call → incorporate result → respond.” That pattern is deeply embedded after 3,213 examples. But the training data did not include examples of multi-step tool chains where the output of one call gates the input of the next. Absence of examples is absence of behavior.

This is a data problem, not a model capacity problem. An 8B model is large enough to reason about chained tool calls. It has not been shown enough examples of what that looks like. Phase 3 — when we build it — will address this directly with targeted multi-step examples.

The brainstorming failure is a different kind of gap. Brainstorming tasks do not need tools at all. The model was reaching for clarify or plan tools instead of responding directly. This is over-trained tool behavior — Phase 2 was so effective at routing toward tool calls that the model started routing tasks that do not benefit from them. A small set of negative examples (“here is a task that does not require a tool, and here is how to respond to it directly”) should correct this in the next training pass.


What “Thinks in Databases” Actually Means at Inference Time

Here is what the model’s behavior looks like when it works correctly. A prompt:

User: Find all research files we've collected about BFCL benchmarks

Phase 1 model (full schemas visible):

<think>
The user wants to search for research files about BFCL benchmarks.
Looking at the available tools, I should use research_search which takes
a query parameter. The schema shows: research_search(query: string,
limit?: number, project?: string, after?: string). I'll call it with
the relevant search terms...
</think>

I'll search the research archive for BFCL benchmark files.

{"tool": "research_search", "query": "BFCL benchmarks", "limit": 10}

Phase 2 model (no schemas):

<think>
Research archive search for BFCL benchmark files.
</think>

{"tool": "research_search", "query": "BFCL benchmarks", "limit": 10}

Same tool. Same correct arguments. One-tenth the token count. The Phase 2 model skips the schema consultation and goes directly to the call, because there is no schema to consult.

The database-first instinct shows up in how the model chooses between tools. Given “find files about topic X on this machine,” a general assistant would reach for bash_exec with a find command. The trained model reaches for db_search — because that is what the training data modeled, and because the taxonomy was designed to make that the correct answer for search tasks. The retrieval architecture preference is baked into the tool selection behavior.

This is the key distinction from RAG. In a RAG agent, an external orchestrator decides “this task needs the search tool” before passing anything to the model. In our setup, the model itself makes that decision based on internalized patterns. The retrieval instinct lives in the weights.


The Training Stack

Everything below is what it actually takes to reproduce this experiment on a single consumer GPU. No proprietary infrastructure.

GPU:          Single consumer GPU, 12GB VRAM
Quantization: 4-bit (bnb NF4 via bitsandbytes)
LoRA:         r=64, alpha=128, target=[q_proj, k_proj, v_proj]
LoRA dropout: 0.05
Base model:   Qwen3-8B (not a quantized checkpoint — load in 4bit at runtime)
Trainer:      TRL SFTTrainer (v0.17+)
NEFTune:      alpha=3
Optimizer:    AdamW (8-bit via bitsandbytes)
LR schedule:  Cosine with linear warmup
Phase 1 LR:   2e-4 over 3 epochs
Phase 2 LR:   8e-5 over 2 epochs
Context:      4096 tokens max

The choice to use Qwen3-8B as the base for all LoRA training — rather than distilled or fine-tuned variants — was deliberate. You want the most capable base that fits your VRAM budget. Distilled variants have already had their capacity compressed; fine-tuned variants have their weight space partially occupied by their target task. A clean base gives LoRA adapters the most room to work.

The 4-bit quantization does not significantly degrade the results at 8B parameters. The information density of the base model’s weights is more than sufficient to hold 23 tool schemas after the LoRA training has reinforced them. At 3B parameters, this claim would be more questionable.


Key Takeaways

  • Schema internalization is a real phenomenon: a fine-tuned 8B model can call 23 tools with correct argument structure from a 643-byte system prompt, with no schemas present
  • Two-phase training works — Phase 1 writes schema memory, Phase 2 teaches recall under minimal context
  • gradient_checkpointing=True must be passed explicitly to SFTConfig; SFTTrainer’s default silently ignores it
  • Qwen3’s thinking tokens behave constructively when trained correctly: Phase 2 teaches the model that brief reasoning is the correct pattern for tool dispatch
  • Schema compliance (97.5%) is more practically important than tool selection accuracy — wrong selection is recoverable; malformed JSON is not
  • Compositional tool chaining remains unsolved across the field (WildToolBench: no model above 15%); this is a training data problem, not a capacity problem
  • The “thinks in databases” architecture — where the model reaches for a fast FTS5 index before grep or find — is a behavioral preference that can be trained into the weight space itself

The adapter weights and training data generation scripts are available in the SouthernSky Code repository. If you want to run the eval suite against your own fine-tuned model, the 40-prompt benchmark lives at toolweave/eval/benchmark_v1.json — each prompt includes the expected tool name and a list of valid argument structures for scoring.

Phase 3 is in progress: multi-step chain examples and negative training for tasks that should not trigger tool calls. The headline goal is matching the Phase 2 schema compliance numbers (97.5%) on chained calls. We have not solved it yet — but the foundation is in place, and the path forward is clear.