Building an AI Dungeon Master That Lives Inside Minecraft

Why Should You Care?

Most LLM integrations bolt a chatbot onto the side of a game. The user types a question, gets a text response, copies a command, pastes it somewhere. That is a chatbot with a Minecraft skin, and it is not an interesting engineering problem.

Player2 is a different problem: an entity that inhabits the game, reads its world state every second, decides what tools to invoke, and executes server-level commands with the same permissions as an operator — all in response to natural conversation in the game chat. No UI. No copy-paste. The world changes because a language model decided it should.

The interesting engineering is not “plug Grok into a chat box.” It is the full stack required to make that sentence true: extracting game state from a live JVM, bridging it to a Python daemon over WebSocket, wiring a tool-calling LLM loop across 28 callable functions, building hybrid RAG over 87K game-knowledge QA pairs, and persisting conversation memory across sessions. That is what this post covers.

The Architecture

Two processes. One bridge.

┌──────────────────────────────────────────────────────┐
│  Minecraft 1.21.1 (Fabric + Carpet)                  │
│  ┌──────────────────────┐  ┌──────────────────────┐  │
│  │  Player2 Mod (Java)  │  │  Carpet Fake Player  │  │
│  │  - GameStateExtractor│  │  "Player2" entity    │  │
│  │  - DaemonConnection  │──│  - ActionPack ctrl   │  │
│  │  - /p2 commands      │  │  - Full player body  │  │
│  │  - FollowBehavior    │  └──────────────────────┘  │
│  └──────────┬───────────┘                            │
│             │ WebSocket (localhost:7677)             │
└─────────────┼────────────────────────────────────────┘
              │
┌─────────────┴────────────────────────────────────────┐
│  Player2 Daemon (Python 3.12)                        │
│  ┌─────────────┐  ┌──────────────────────────────┐   │
│  │  WS Server  │  │  LLM Client                  │   │
│  │  - Routes   │  │  - Grok API (primary)        │   │
│  │    messages │  │  - Ollama fallback           │   │
│  │  - Sends    │  │  - Tool-calling loop (×3 max)│   │
│  │    actions  │  └──────────────────────────────┘   │
│  └─────────────┘  ┌──────────────────────────────┐   │
│  ┌─────────────┐  │  Tools (28 functions)        │   │
│  │  Memory     │  │  - run_command, chat         │   │
│  │  - SQLite   │  │  - lookup_*, wiki_lookup     │   │
│  │  - Sessions │  │  - we_fill/sphere/cylinder   │   │
│  │  - Quests   │  │  - paste_schematic           │   │
│  │  - Profiles │  │  - navigate_to, give_kit     │   │
│  └─────────────┘  └──────────────────────────────┘   │
└──────────────────────────────────────────────────────┘

The Carpet mod provides the fake player entity — a fully simulated player that the server treats as a real client. The Player2 Fabric mod rides on top: it extracts game state, routes chat messages, and executes actions the daemon sends back. The daemon is where all the intelligence lives.

Game State Extraction

Every 20 ticks (once per second), GameStateExtractor.java serializes the world’s current reality into a JSON snapshot and sends it to the daemon over the WebSocket connection:

// GameStateExtractor.java — 20-tick polling
public String extractIfReady(ServerPlayerEntity humanPlayer) {
    tickCounter++;
    if (tickCounter < TICK_INTERVAL) return null;
    tickCounter = 0;
    return extract(humanPlayer).toString();
}

The snapshot captures position, health, hunger, XP level, dimension, current biome (queried from the server’s registry, not hardcoded), time of day, weather state, hotbar contents, and up to 15 entities within a 32-block radius:

// Nearby entity scan: 15 entities max, hostile health included
Box scanBox = player.getBoundingBox().expand(ENTITY_SCAN_RADIUS);
List<Entity> nearby = world.getOtherEntities(player, scanBox);
int count = 0;  // number of entities serialized so far
for (Entity entity : nearby) {
    if (count >= 15) break;
    // skip the bot entity itself
    if (entity instanceof ServerPlayerEntity spe
        && spe.getName().getString().equals(PlayerTwoEntity.BOT_NAME)) continue;
    // ...serialize type, coords, distance, health
    count++;
}

On the Python side, state.py deserializes this into typed dataclasses and provides a summary() method that collapses it to readable text for the LLM context window:

Player steve at (142.0, 64.0, -88.0)
Health: 8.0/20.0, Hunger: 11/20
Biome: minecraft:dark_forest, Dimension: minecraft:overworld
Time: 14230 (night)
Hotbar: diamond_sword x1, cooked_beef x32, torch x64
Nearby: zombie(4.2m), zombie(7.1m), spider(11.3m)

That summary drops into the system prompt on every message. The dungeon master always knows exactly where you are and what is trying to kill you — which means its responses are grounded in the current moment of the game, not a static description of Minecraft in general.
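As a rough illustration of the state.py step, the sketch below parses a snapshot into dataclasses and renders a text summary. The class names and JSON field names (`name`, `pos`, `nearby`, and so on) are assumptions made for the example; the post does not show the real schema.

```python
import json
from dataclasses import dataclass, field


@dataclass
class NearbyEntity:
    type: str
    distance: float


@dataclass
class PlayerState:
    name: str
    pos: tuple
    health: float
    hunger: int
    biome: str
    dimension: str
    hotbar: list = field(default_factory=list)
    nearby: list = field(default_factory=list)

    @classmethod
    def from_json(cls, raw):
        # Hypothetical wire format: one flat JSON object per snapshot.
        data = json.loads(raw)
        return cls(
            name=data["name"],
            pos=tuple(data["pos"]),
            health=data["health"],
            hunger=data["hunger"],
            biome=data["biome"],
            dimension=data["dimension"],
            hotbar=list(data.get("hotbar", [])),
            nearby=[NearbyEntity(**e) for e in data.get("nearby", [])],
        )

    def summary(self):
        # Collapse the typed state into a few lines of prompt-friendly text.
        x, y, z = self.pos
        nearby = ", ".join(f"{e.type}({e.distance:.1f}m)" for e in self.nearby) or "none"
        return "\n".join([
            f"Player {self.name} at ({x:.1f}, {y:.1f}, {z:.1f})",
            f"Health: {self.health:.1f}/20.0, Hunger: {self.hunger}/20",
            f"Biome: {self.biome}, Dimension: {self.dimension}",
            f"Hotbar: {', '.join(self.hotbar) or 'empty'}",
            f"Nearby: {nearby}",
        ])
```

Keeping the parse step and the render step separate means the prompt format can change without touching the wire format.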

The Tool-Calling Loop

With game state flowing in, the daemon’s LLMClient sends all 28 tool definitions with every request and lets the model decide what to call. The implementation uses a three-pass loop to handle cases where the model needs to look something up before it can act:

# llm_client.py — three-pass tool loop
async def think(self, game_state_summary, player_message, recent_chat, ...):
    # ...build conversation, append user message...

    all_actions = []
    for _ in range(3):
        response = await self._call_grok()  # or _call_ollama()
        actions, needs_followup = self._process_tool_calls(response, ...)
        all_actions.extend(actions)

        if not needs_followup:
            break
    
    return all_actions

The needs_followup flag is the key design insight. Knowledge tools — lookup_item, lookup_recipe, wiki_lookup, and others — don’t produce game actions. They return text to inject back into the conversation. When the LLM calls one of them, _process_tool_calls appends the result and explicitly nudges the model to act on what it just learned:

if has_knowledge_call:
    self.conversation.append({
        "role": "assistant",
        "content": message.get("content"),
        "tool_calls": tool_calls,
    })
    self.conversation.extend(tool_results)
    self.conversation.append({
        "role": "user",
        "content": "Now use the information above to fulfill the player's request. "
                   "Remember: you MUST call run_command to actually affect the game world "
                   "— a chat message alone does nothing.",
    })
    return actions, True  # needs_followup=True, loop continues

That nudge was non-negotiable. Without it, the model consistently completed its knowledge lookup and responded conversationally — but never called run_command. The game state would not change. Players would get an encyclopedic response about diamond sword enchantments while nothing happened in the world. Adding the explicit reminder that text alone does nothing dramatically improved action follow-through.

The underlying design principle is the hard line between KNOWLEDGE_TOOLS (read-only, return text, re-enter the LLM) and action tools (run_command, we_fill, paste_schematic, which produce WebSocket messages that go directly to the game). Once that line is drawn, the loop safety model is clear: knowledge lookups can chain across passes, up to the three-pass cap, while action calls are collected without re-entering the model. Because think returns the collected actions only after the loop ends, nothing writes to the game world until the model has finished its lookups.
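The contract can be exercised end to end with a stubbed model. The sketch below is synchronous for brevity and uses a simplified tool-call shape (`{"name": ..., "args": ...}`). `run_tool_loop`, `fake_model`, and `fake_lookup` are hypothetical names, not the daemon's actual functions.

```python
KNOWLEDGE_TOOLS = {"lookup_item", "lookup_recipe", "wiki_lookup"}  # subset named in the post
NUDGE = "Now use the information above to fulfill the player's request."


def run_tool_loop(call_model, run_knowledge_tool, conversation, max_passes=3):
    """Collect action calls, re-entering the model after each knowledge lookup."""
    actions = []
    for _ in range(max_passes):
        tool_calls = call_model(conversation)
        lookups = [c for c in tool_calls if c["name"] in KNOWLEDGE_TOOLS]
        actions.extend(c for c in tool_calls if c["name"] not in KNOWLEDGE_TOOLS)
        if not lookups:
            break  # nothing left to learn, so stop re-entering the model
        for call in lookups:
            conversation.append({"role": "tool", "content": run_knowledge_tool(call)})
        conversation.append({"role": "user", "content": NUDGE})
    return actions


def fake_model(conversation):
    # Stub model: look the item up first, then act once a tool result is present.
    if any(m["role"] == "tool" for m in conversation):
        return [{"name": "run_command",
                 "args": {"command": "give steve minecraft:diamond_sword 1"}}]
    return [{"name": "lookup_item", "args": {"query": "diamond sword"}}]


def fake_lookup(call):
    # Stub knowledge tool: returns text only, never touches the world.
    return "diamond_sword: stack size 1"


conversation = [{"role": "user", "content": "give me a diamond sword"}]
actions = run_tool_loop(fake_model, fake_lookup, conversation)
```

In this run the first pass yields only a lookup, the nudge is appended, and the second pass yields the single run_command action that the caller would forward to the game.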

Knowledge Infrastructure: Two Layers

Layer 1 — PrismarineJS Structured Data

The first knowledge layer is a set of SQLite tables built from the PrismarineJS minecraft-data package — the same authoritative game data used by most Minecraft bots. It covers every item (ID, display name, stack size), crafting recipe, enchantment, entity type, and biome. Lookups use fuzzy string matching so “diamond pick” finds diamond_pickaxe without requiring exact IDs.
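The post does not say which fuzzy matcher the lookups use. As one plausible approach, the sketch below uses the standard library's `difflib`; the four-item list stands in for the real item table, and `fuzzy_item_id` is a hypothetical helper name.

```python
import difflib

# Tiny stand-in for the item table; the real data comes from minecraft-data.
ITEM_IDS = ["diamond_pickaxe", "diamond_sword", "iron_pickaxe", "golden_apple"]


def fuzzy_item_id(query, item_ids=ITEM_IDS, cutoff=0.6):
    """Return the closest item ID for a loose player phrase, or None."""
    normalized = query.strip().lower().replace(" ", "_")
    matches = difflib.get_close_matches(normalized, item_ids, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

The cutoff trades recall for safety: a lower value finds more matches but raises the chance of handing the model the wrong ID.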

This layer handles the precision work: before constructing a /give command with enchantments, the dungeon master calls lookup_item and lookup_enchantment to verify the exact IDs and max levels rather than hallucinating them from training data. Minecraft 1.21 uses a component syntax for enchanted items that differs substantially from the older NBT format, and getting this wrong silently produces non-functional commands.
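To show the difference in syntax, here is a minimal sketch of a command builder. It assumes the `enchantments={levels:{...}}` component form accepted by 1.21.1. `build_give_command` and its `max_levels` parameter are hypothetical, standing in for whatever the daemon does with the lookup results.

```python
def build_give_command(player, item_id, enchantments=None, max_levels=None, count=1):
    """Build a /give command string in item-component syntax.

    enchantments: e.g. {"minecraft:sharpness": 5}
    max_levels:   optional caps, e.g. taken from a lookup_enchantment result
    """
    # Pre-1.20.5 NBT form, for contrast:
    #   give steve diamond_sword{Enchantments:[{id:"minecraft:sharpness",lvl:5}]}
    components = ""
    if enchantments:
        for ench, level in enchantments.items():
            limit = (max_levels or {}).get(ench)
            if limit is not None and level > limit:
                raise ValueError(f"{ench} level {level} exceeds max {limit}")
        levels = ",".join(f'"{ench}":{level}' for ench, level in enchantments.items())
        components = "[enchantments={levels:{" + levels + "}}]"
    return f"give {player} {item_id}{components} {count}"
```

Rejecting an out-of-range level before the command is sent keeps the failure in the daemon, where it can be fed back to the model, rather than in the game.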

Layer 2 — Wiki RAG (87K QA Pairs)

The second layer handles the “how” and “why” questions: mob spawning rules, redstone logic, farming mechanics, dimension strategy. It is built on the lparkourer10/minecraft-wiki dataset from HuggingFace — 87K question-answer pairs extracted from the official Minecraft Wiki.

The retrieval is hybrid. Two independent searches run for each query, SQLite FTS5 with BM25 scoring (keyword precision) and Qdrant vector search with nomic-embed-text embeddings (semantic recall), and their ranked results merge via Reciprocal Rank Fusion:

# wiki_rag.py — hybrid RRF merge
def search(query, limit=5, category=None):
    k = 60  # RRF constant

    fts_results  = search_fts(query, limit=limit * 2, category=category)
    vec_results  = search_qdrant(query, limit=limit * 2, category=category)

    scores = {}
    entries = {}

    for rank, r in enumerate(fts_results):
        key = r['question'][:80]
        scores[key] = scores.get(key, 0) + 1.0 / (k + rank + 1)
        entries[key] = r

    for rank, r in enumerate(vec_results):
        key = r['question'][:80]
        scores[key] = scores.get(key, 0) + 1.0 / (k + rank + 1)
        if key not in entries:
            entries[key] = r

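    # Sort keys by fused score; a separate loop name avoids shadowing the RRF constant k.
    ranked_keys = sorted(scores, key=lambda x: scores[x], reverse=True)[:limit]
    return [entries[key] for key in ranked_keys]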

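RRF handles the common case where FTS5 and Qdrant disagree. Each result earns 1/(k + rank) from every list it appears in, with rank counted from 1, so a result that ranks highly in both lists gets roughly double the score of one that appears in only one. The k=60 constant flattens the curve: a first-place hit earns 1/61 and a tenth-place hit 1/70, so no single top rank dominates and lower-ranked results still contribute.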
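A small worked case makes the effect concrete. The document keys below are made up, and `rrf_scores` is a stripped-down version of the merge that works on keys only.

```python
def rrf_scores(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of document keys."""
    scores = {}
    for ranked in rankings:
        for rank, key in enumerate(ranked):
            scores[key] = scores.get(key, 0.0) + 1.0 / (k + rank + 1)
    return scores


# Hypothetical result lists: "zombie_spawn" tops neither list but appears in both.
fts = ["creeper_drops", "zombie_spawn", "iron_golem"]
vec = ["villager_trades", "skeleton_ai", "zombie_spawn"]

scores = rrf_scores([fts, vec])
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here zombie_spawn scores 1/62 + 1/63, about 0.032, and ranks first, ahead of either list's top hit at 1/61, about 0.016. Agreement between the two retrievers outweighs a single first place.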

Results are filtered by category when the query provides one (redstone, mobs, biomes, potions, enchanting, tools, armor, farming, trading, dimensions, tutorials, general). Category inference runs at index time from URL patterns — /w/Redstone_circuit goes to redstone, /w/Zombie goes to mobs.
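Index-time inference of this kind can be as simple as keyword rules over the page slug. The rule table below is a hypothetical fragment for illustration; the real index presumably covers all twelve categories with its own patterns.

```python
# Hypothetical keyword rules; the first matching category wins.
CATEGORY_RULES = [
    ("redstone", ("redstone", "repeater", "comparator", "piston")),
    ("mobs", ("zombie", "creeper", "skeleton", "spider")),
    ("biomes", ("biome", "forest", "desert")),
]


def infer_category(url, default="general"):
    """Map a wiki URL path such as /w/Redstone_circuit to a category name."""
    slug = url.rsplit("/", 1)[-1].lower()
    for category, keywords in CATEGORY_RULES:
        if any(word in slug for word in keywords):
            return category
    return default
```

Anything the rules do not recognize falls through to general, so an incomplete table affects filtering but does not drop pages from the index.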

Physical Behaviors: Follow and Protect

The FollowBehavior component ticks every 5 server ticks (4x/second) and runs two independent behaviors. Follow uses a three-zone distance model:

// FollowBehavior.java
if (distance > TELEPORT_DISTANCE) {          // > 12 blocks
    // teleport behind the player
} else if (distance > COMFORTABLE_DISTANCE) { // 4-12 blocks
    // Carpet: look at player, move forward, sprint
} else {
    // Carpet: stop
}

Walk mode issues three sequential Carpet ActionPack commands: look at the target position, move forward, sprint. Teleport mode calculates a target position behind and to the side of the player’s look vector so the entity appears to arrive naturally rather than snapping to the player’s feet.
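The geometry of the teleport target can be sketched in a few lines. The mod itself is Java; this Python version only illustrates the math. It assumes Minecraft's yaw convention (0 degrees faces +Z, 90 degrees faces -X), and the `back` and `side` distances are placeholder values, not the mod's.

```python
import math


def teleport_target(px, py, pz, yaw_degrees, back=2.0, side=1.0):
    """Return a position `back` blocks behind and `side` blocks beside a player."""
    yaw = math.radians(yaw_degrees)
    # Horizontal look direction for the given yaw (pitch ignored).
    look_x, look_z = -math.sin(yaw), math.cos(yaw)
    # Perpendicular to the look direction in the horizontal plane.
    side_x, side_z = -look_z, look_x
    return (
        px - look_x * back + side_x * side,
        py,
        pz - look_z * back + side_z * side,
    )
```

For a player at the origin facing +Z, this places the entity at roughly (-1, y, -2): two blocks behind and one block off to the side.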

Protect scans an 8-block radius around the human player for HostileEntity instances, finds the closest, and engages — but only after a line-of-sight raycast:

private boolean hasLineOfSight(ServerWorld world, ServerPlayerEntity bot, Entity target) {
    Vec3d eyePos   = new Vec3d(bot.getX(), bot.getEyeY(), bot.getZ());
    Vec3d targetPos = new Vec3d(target.getX(), target.getEyeY(), target.getZ());
    RaycastContext ctx = new RaycastContext(
        eyePos, targetPos,
        RaycastContext.ShapeType.COLLIDER,
        RaycastContext.FluidHandling.NONE,
        bot
    );
    HitResult result = world.raycast(ctx);
    return result.getType() == HitResult.Type.MISS;
}

The line-of-sight check matters more than it looks. Without it, the entity would issue attack commands against mobs on the other side of a wall — commands that connect with nothing, waste ticks, and look obviously wrong to any player watching. Spatial honesty is the same principle the tool-calling loop enforces at the LLM layer: only act on what you can actually reach.

World Manipulation: WorldEdit Bridge

The WorldEdit integration exposes six operations as daemon tools: we_fill, we_replace, we_walls, we_sphere, we_cylinder, and we_paste. These bypass Minecraft’s 32,768-block vanilla /fill limit and allow the LLM to construct geometry by reasoning about coordinates in the game-state context.
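Whether a region fits the vanilla limit is plain arithmetic on its inclusive corners. The routing rule in `choose_fill_tool` is a hypothetical illustration; the post does not say how the daemon or the model picks between the two tools.

```python
VANILLA_FILL_LIMIT = 32_768  # block cap for vanilla /fill cited in the post


def region_volume(x1, y1, z1, x2, y2, z2):
    """Block count of the inclusive cuboid between two corners."""
    return (abs(x2 - x1) + 1) * (abs(y2 - y1) + 1) * (abs(z2 - z1) + 1)


def choose_fill_tool(x1, y1, z1, x2, y2, z2):
    # Hypothetical rule: use WorldEdit only when vanilla /fill cannot do the job.
    if region_volume(x1, y1, z1, x2, y2, z2) <= VANILLA_FILL_LIMIT:
        return "run_command"
    return "we_fill"
```

A 32×32×32 cube is exactly 32,768 blocks and still fits. Extending one side to 33 gives 33,792 blocks and requires we_fill.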

The paste_schematic tool is the most used. A catalog of 111 curated .schem files is indexed by name, tags, category, and dimensions. When a player asks to “build a house,” the LLM calls list_schematics first with a search tag, finds the best match by ID, then calls paste_schematic. Placement takes ground level from the player’s current Y coordinate and offsets the paste origin 5 blocks along both X and Z:

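# tools.py: auto-positioning for schematic paste
# (needs `import math`; floor, not int(), so negative coordinates map to the correct block)
if game_state:
    pos = game_state.get_player_pos(player_name)
    if pos:
        px, py, pz = math.floor(pos[0]) + 5, math.floor(pos[1]), math.floor(pos[2]) + 5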

The catalog spans six categories — residential, decorative, functional, defense, infrastructure, mega — with builds ranging from a 5×5 campfire ring to a 60×40 medieval castle. The LLM can search by keyword so “pagoda” finds the Japanese pagoda, “castle” surfaces the stone castle, and “farm” returns the auto-farms.
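A keyword search over such a catalog might look like the sketch below. The three entries are modeled on builds mentioned above, but their IDs, tags, and categories are invented for the example, and this `list_schematics` is an assumption about the behavior rather than the daemon's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schematic:
    id: str
    name: str
    category: str
    tags: tuple


# Hypothetical entries; the real catalog holds 111 curated files.
CATALOG = [
    Schematic("campfire_ring", "Campfire Ring", "decorative", ("campfire", "small")),
    Schematic("japanese_pagoda", "Japanese Pagoda", "decorative", ("pagoda", "tower")),
    Schematic("stone_castle", "Stone Castle", "defense", ("castle", "medieval")),
]


def list_schematics(query, catalog=CATALOG, limit=5):
    """Rank entries by how many query words appear in their name, category, or tags."""
    words = query.lower().split()
    scored = []
    for entry in catalog:
        text = " ".join((entry.name.lower(), entry.category, *entry.tags))
        hits = sum(word in text for word in words)
        if hits:
            scored.append((hits, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:limit]]
```

Returning a short ranked list rather than a single hit leaves the final choice to the model, which can weigh the player's wording against names and sizes.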

Memory and Session Persistence

SQLite backs four tables: conversations, sessions, player_profiles, and quests. Sessions start on WebSocket connect and close gracefully on disconnect. If the session holds at least four messages, the disconnect triggers a session summary call to the LLM before shutdown:

# server.py — session summary on disconnect
async def _end_session(self):
    last_messages = self.memory.get_session_history(self.session_id)
    if len(last_messages) >= 4:
        summary = await self.llm.summarize_session(last_messages)
        self.memory.end_session(self.session_id, summary)

On the next startup, those summaries are injected as a system message before the first player interaction, giving the dungeon master continuity across sessions without replaying raw conversation history. The injection is limited to the last three session summaries — enough context without bloating the prompt.
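The startup side of that flow can be sketched with the standard `sqlite3` module. The column names (`summary`, `ended_at`) and both function names are assumptions; only the sessions table and the three-summary limit come from the description above.

```python
import sqlite3


def load_recent_summaries(conn, limit=3):
    """Return the newest session summaries, oldest first, skipping empty ones."""
    rows = conn.execute(
        "SELECT summary FROM sessions "
        "WHERE summary IS NOT NULL ORDER BY ended_at DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [row[0] for row in reversed(rows)]


def summaries_as_system_message(summaries):
    """Wrap summaries in a single system message, or return None if there are none."""
    if not summaries:
        return None
    body = "\n".join(f"- {s}" for s in summaries)
    return {"role": "system", "content": "Previous sessions:\n" + body}
```

Selecting newest-first and then reversing gives the model the summaries in chronological order, which reads more naturally as a story so far.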

The quest system persists active objectives between sessions. On reconnect, the server greets with any open quests: “You have active quests: Find Ancient City, Build a Beacon. What’s the plan?” — the dungeon master picking up mid-story rather than starting cold.

Running It

# Start the daemon — set your API key first
$ export GROK_API_KEY="xai-..."
$ cd /path/to/player2
$ source venv/bin/activate
$ python -u -m daemon

[p2] Using Grok API (grok-3-mini-fast)
[p2] Loaded 3 previous session summaries
[p2] Server listening on localhost:7677

If you don’t have a Grok key, the daemon falls back to Ollama automatically — any model with reliable tool-calling works. grok-3-mini-fast is the current recommendation for cloud: fast, cheap, and consistently follows the action-tool contract. For local inference, qwen3:8b handles the tool loop well on hardware with 8GB+ VRAM.
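The fallback amounts to a check on the environment at startup. The sketch below assumes the decision hinges only on whether GROK_API_KEY is set; `pick_backend` and the returned dictionary shape are hypothetical.

```python
import os


def pick_backend(env=None):
    """Choose the LLM backend: Grok when a key is set, otherwise local Ollama."""
    env = os.environ if env is None else env
    if env.get("GROK_API_KEY"):
        return {"backend": "grok", "model": "grok-3-mini-fast"}
    # Assumed default; any Ollama model with reliable tool calling would do.
    return {"backend": "ollama", "model": "qwen3:8b"}
```

Passing the environment in as a parameter keeps the function easy to test without touching real credentials.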

Then in Minecraft with Fabric 1.21.1 + Carpet + Player2 mod installed:

/p2 spawn       # spawn the entity
/p2 follow      # toggle follow (default: on)
/p2 protect     # toggle protect (default: on)

Any chat message goes to the dungeon master. No command prefix required. It reads everything.

What You Learned

A Carpet fake-player entity combined with a Fabric mod is the right primitive for an in-game LLM companion — it carries full player permissions and a physical body without requiring a real client connection
Tool-calling loops need explicit nudges when chaining knowledge lookups into action calls; without them, the model treats a successful lookup as task completion and responds conversationally without ever touching the game world
The hard line between read-only knowledge tools and write-capable action tools is what makes the loop safe; once that contract is enforced, you can let the model re-enter for more lookups, up to the pass cap, before it acts
Hybrid FTS5 + vector search with Reciprocal Rank Fusion covers both exact-keyword and paraphrased queries over game knowledge, and it rewards the results that both retrievers agree on
Line-of-sight raycasting is not optional for behavior AI — spatial honesty is as important at the physics layer as at the LLM layer; behaviors that ignore what the entity can actually reach look wrong and break trust immediately
Session summarization on disconnect is cheaper than injecting full history and gives the LLM enough context to maintain narrative continuity across restarts without prompt bloat

The architecture here — game state extraction, typed tool bifurcation, hybrid RAG, and summarized memory — is not specific to Minecraft. Any environment that can expose state over a socket and accept commands in return is a candidate for the same pattern. That is the more interesting thing Player2 demonstrated: a replicable blueprint for embedding a deliberate, memory-bearing LLM agent into any interactive system.