Building Mneme · Part 6

Code Creator - The Quest for Shippable Code

~10 min read · by Dave Wheeler (aka Weller Davis)

I wanted Mneme to write code I'd trust to run. Not "technically correct but requires 30 minutes of cleanup" code. Not "works in the demo, breaks in production" code. Real, shippable code. That meant rethinking everything about how AI generates software, starting with the uncomfortable truth that most AI code fails validation the moment you try to run it.

The breakthrough came from an unlikely place: watching my daughters play a snake game. Not a tutorial example or proof of concept, but an actual working game that Code Creator built from scratch, complete with collision detection, score persistence, and progressive speed increases. They didn't care about the architecture. They just wanted to beat each other's high scores.

That's when I knew Code Creator was working.

TL;DR - Code Creator implements the plan → execute → validate → fix loop that works across all of Mneme's creators. The breakthrough: a Compact Context DSL that cut token usage per task from 50,000 to roughly 20,000 (a 60% reduction) while improving output quality. Plus discovering that validation needs to check for functional duplication, not just syntax. The result is working apps that pass the ultimate test: people actually want to use them.

The Goal: Shippable, Not Just Syntactic

Early versions of Code Creator could generate code that looked correct. Proper syntax, reasonable structure, even decent naming conventions. But when you actually ran it? Undefined variables. Missing imports. Functions that referenced other functions that didn't exist yet. And my personal favorite: three different implementations of the same logic scattered across the codebase because the model forgot it already wrote that function.

I needed Code Creator to produce code you could run immediately after generation. No manual fixing, no "just change this one thing," no debugging sessions. The bar was simple: If it doesn't work when you click Run, it's not done.

The Pattern: Why Plan → Execute → Validate → Fix?

By the time I started building Code Creator, I'd already implemented the same pattern across e-books, tutorials, images, and music. Every creator followed the same flow:

Universal Creator Pattern
1. Plan: Break the big goal into manageable chunks
2. Execute: Generate one chunk at a time
3. Validate: Check quality before moving on
4. Fix: Auto-correct issues or escalate to human

The trick was breaking the work down into manageable chunks while maintaining continuity, consistency, and quality.
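As a sketch of that skeleton (my own illustration, not Mneme's actual interface), every creator plugs its four steps into the same loop:

// Illustrative skeleton; creator.plan/execute/validate/fix are hypothetical hooks
async function runCreator(request, creator) {
    const results = [];
    for (const chunk of await creator.plan(request)) {        // 1. Plan
        let output = await creator.execute(chunk);             // 2. Execute
        const issues = await creator.validate(output);         // 3. Validate
        if (issues.length > 0) {
            output = await creator.fix(chunk, output, issues); // 4. Fix (or escalate to a human)
        }
        results.push(output);
    }
    return results;
}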

For code, this meant a DevPlan of atomic tasks, a compact context for each one, multi-layer validation before moving on, and targeted fix tasks when something fails.

The pattern made sense. The challenge was execution.

The Snake Game: A Concrete Example

Let me show you what Code Creator produces by walking through an actual project. I asked it to "build a snake game with score tracking and increasing difficulty."

The DevPlan

Code Creator broke this into atomic tasks:

DevPlan: Snake Game
├─ T001  Create project structure (HTML, CSS, JS files)          [completed]
├─ T002  Implement canvas setup and game state                   [completed]
├─ T003  Add snake movement and collision detection              [completed]
├─ T004  Implement pill generation and score tracking            [completed]
├─ T005  Add speed progression and localStorage high scores      [completed]
└─ T006  Polish UI (game over screen, pause/resume)              [completed]
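Each task in the plan carries just enough metadata to be implemented and validated on its own. A minimal sketch of what a task record might look like (field names are my own guess, not Mneme's actual schema):

// Hypothetical task record
const task = {
    id: 'T003',
    description: 'Add snake movement and collision detection',
    files: ['src/game.js'],    // files the task is expected to touch
    status: 'pending',         // pending → in_progress → completed | failed
    attempts: 0,               // fix retries used so far (capped at 3)
};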

The Generated Code

Here's a snippet from the actual script.js that Code Creator produced (cleaned for readability):

// Game state
let snake = [];
let pill = {};
let direction = 'right';
let score = 0;
let highScore = localStorage.getItem('snakeHighScore') || 0;
let gameSpeed = INITIAL_SPEED;

// Update game state
function update() {
    direction = nextDirection;
    const head = {x: snake[0].x, y: snake[0].y};

    // Calculate new head position
    switch (direction) {
        case 'up': head.y -= 1; break;
        case 'down': head.y += 1; break;
        case 'left': head.x -= 1; break;
        case 'right': head.x += 1; break;
    }

    // Check collision with walls
    if (head.x < 0 || head.x >= TILE_COUNT ||
        head.y < 0 || head.y >= TILE_COUNT) {
        gameOver();
        return;
    }

    // Check collision with self
    for (let segment of snake) {
        if (segment.x === head.x && segment.y === head.y) {
            gameOver();
            return;
        }
    }

    // Add new head, check for pill
    snake.unshift(head);
    if (head.x === pill.x && head.y === pill.y) {
        score += 10;
        generatePill();
        // Increase speed
        if (gameSpeed > MIN_SPEED) {
            gameSpeed -= 2;
            clearInterval(gameInterval);
            gameInterval = setInterval(gameLoop, gameSpeed);
        }
    } else {
        snake.pop();  // Remove tail
    }
}

This isn't cherry-picked. This is the code Code Creator wrote. Collision detection works. Score tracking persists to localStorage. Speed increases with each pill. The game is actually playable.
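The snippet does lean on a few pieces defined elsewhere in script.js: the constants, the render loop, generatePill(), and gameOver(). If you want to run it standalone, here's a minimal reconstruction of those pieces. This part is my sketch, not Code Creator's output, and the values are guesses:

// Reconstructed supporting code (constants referenced by the game state above)
const TILE_COUNT = 20;
const INITIAL_SPEED = 150;   // ms per game tick
const MIN_SPEED = 50;

let nextDirection = 'right'; // buffered input, applied once per tick in update()
let gameInterval;

function generatePill() {
    pill = {
        x: Math.floor(Math.random() * TILE_COUNT),
        y: Math.floor(Math.random() * TILE_COUNT),
    };
}

function gameOver() {
    clearInterval(gameInterval);
    if (score > highScore) {
        highScore = score;
        localStorage.setItem('snakeHighScore', highScore);
    }
}

function gameLoop() {
    update();
    draw();   // canvas rendering, assumed to live in render.js
}

// Start the game
snake = [{x: 10, y: 10}];
generatePill();
gameInterval = setInterval(gameLoop, gameSpeed);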

The Validation Moment

When I opened the game in a browser and played it, everything worked. Then I showed it to my daughters. They immediately started competing for high scores, laughing when the snake got too fast to control. One of them asked, "Can you make it so we can see each other's scores?"

That question was validation. Not "Does it compile?" but "Can I use this?"

The Compact Context DSL Breakthrough

Early versions of Code Creator had a problem: I was burning 50,000 tokens per task by sending entire files to the code model. For a simple function addition, the model would receive hundreds of lines of irrelevant context. The context size alone was painful, and the output quality was worse: models got distracted by unrelated code and suggested unnecessary refactoring.

Then I realized: the model doesn't need to see every line, just the interfaces, the task context, and the exact section it's editing.

What Changed

Before: Full File Context
  • Send entire files (500-2000 lines)
  • 50,000 tokens per task
  • Model gets distracted by irrelevant code
  • Suggests unnecessary refactoring
  • Slow, expensive, lower quality
After: Compact Context DSL
  • Send only: interfaces, task context, edit section
  • ~20,000 tokens per task (60% reduction)
  • Model focuses on relevant context
  • Targeted changes only
  • Faster, cheaper, better quality

Example: Compact Context Format

[TREE]
/src
  game.js       (Main game logic)
  render.js     (Canvas drawing)
  utils.js      (Helper functions)
[/TREE]

[TASK:T003]
Implement snake movement and collision detection
[/TASK]

[IFACE:game.js]
- snake: Array<{x, y}>
- direction: string
- update(): void
- checkCollision(x, y): boolean
[/IFACE]

[EDIT:src/game.js:115-145]
// Only the function being modified, not the entire file
function update() {
    // ... existing implementation
}
[/EDIT]

[INSTRUCTION]
Add collision detection for walls and self-collision.
Return true if collision detected, false otherwise.
[/INSTRUCTION]

This focused context cut token usage by 60% and improved output quality. Less is more, but only if you choose the right "less."
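To make the assembly step concrete, here's a rough sketch of how such a context might be built from a task record; it's my illustration of the idea, not Mneme's actual builder (project.readLines, task.edit, and similar names are assumed shapes):

// Hypothetical context builder for the Compact Context DSL
function buildCompactContext(task, project) {
    // File tree with one-line summaries
    const tree = project.files
        .map(f => `  ${f.path}  (${f.summary})`)
        .join('\n');

    // Public interfaces only for the files this task touches
    const ifaces = task.files
        .map(path => `[IFACE:${path}]\n${project.interfaces[path]}\n[/IFACE]`)
        .join('\n\n');

    // Only the section being edited, never the whole file
    const edit =
        `[EDIT:${task.edit.path}:${task.edit.start}-${task.edit.end}]\n` +
        `${project.readLines(task.edit.path, task.edit.start, task.edit.end)}\n[/EDIT]`;

    return [
        `[TREE]\n${tree}\n[/TREE]`,
        `[TASK:${task.id}]\n${task.description}\n[/TASK]`,
        ifaces,
        edit,
        `[INSTRUCTION]\n${task.instruction}\n[/INSTRUCTION]`,
    ].join('\n\n');
}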

The Functional Duplication Bug

One morning I was reviewing generated code and noticed something odd: three different functions for validating user input, each with slightly different names but identical logic.

function validateInput(data) { /* ... */ }
function checkInputValidity(data) { /* ... */ }
function verifyUserInput(data) { /* ... */ }

All three did the same thing. The model had forgotten it already implemented this functionality and kept generating new versions with different names. My validation pipeline was checking syntax, imports, and undefined symbols, but it wasn't catching functional duplication.

The Fix: Semantic Analysis

I added a new validation step that analyzes function behavior, not just signatures.
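A lightweight approximation of that idea, normalizing each function body and flagging exact behavioral twins, looks something like this (a sketch only; extractFunctions is a hypothetical parser helper, and the real pipeline may go deeper than text normalization):

// Sketch: flag functions whose normalized bodies are identical
function findDuplicateFunctions(source) {
    const normalize = body => body
        .replace(/\/\/.*$/gm, '')   // strip line comments
        .replace(/\s+/g, ' ')       // collapse whitespace
        .trim();

    const seen = new Map();         // normalized body → first function name
    const duplicates = [];

    for (const fn of extractFunctions(source)) {   // hypothetical: yields {name, body}
        const key = normalize(fn.body);
        if (seen.has(key)) {
            duplicates.push({ duplicate: fn.name, original: seen.get(key) });
        } else {
            seen.set(key, fn.name);
        }
    }
    return duplicates;  // e.g. [{duplicate: 'checkInputValidity', original: 'validateInput'}]
}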

This caught cases where the model would implement the same logic multiple times under different names. The validation failure would trigger a fix task: "Consolidate duplicate validation functions into a single reusable utility."

Lesson learned - Syntax checking isn't enough. You need semantic validation to catch logical duplication, not just syntactic errors.

The Incremental Loop: Plan → Execute → Validate → Fix

Here's how Code Creator actually works, step by step:

1. Plan: DevPlan Generation

The analysis tier breaks the request into a DevPlan of atomic tasks, each small enough to implement and validate on its own, and presents it for approval.

2. Execute: Task Implementation

For each task, Code Creator builds a compact context (interfaces plus the exact section being edited), sends it to the code model, and applies the returned file operations to the filesystem.

3. Validate: Multi-Layer Quality Gates

Every change passes through syntax checks, import resolution, undefined-symbol detection, functional-duplication analysis, and any available tests before the task is marked completed.

4. Fix: Targeted Repair Tasks

When a gate fails, Code Creator generates a focused fix task describing the issue and retries, up to three attempts, before escalating to the user.

Visual Flow
Request: "Build snake game"
   ↓
DevPlan (6 atomic tasks) → User approves
   ↓
For each task:
   1. Build compact context (interfaces + edit section)
   2. Generate code changes
   3. Apply to filesystem
   4. Validate (syntax, imports, duplication, tests)
   5. If pass: mark completed, move to next task
   6. If fail: generate fix task, retry (max 3 attempts)
   ↓
All tasks completed → Project ready
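In code, that per-task loop might look roughly like this; it is a sketch reusing the hypothetical helpers from earlier (buildCompactContext, plus codeLLM, applyToFilesystem, validate, and makeFixTask), not Mneme's actual orchestrator:

// Illustrative per-task loop for Code Creator
const MAX_FIX_ATTEMPTS = 3;

async function runDevPlan(devPlan, project) {
    for (const task of devPlan.tasks) {
        // 1-2. Build compact context, generate structured file operations, apply them
        let changes = await codeLLM.generate(buildCompactContext(task, project));
        applyToFilesystem(project, changes);

        // 3. Validate: syntax, imports, duplication, tests
        let report = validate(project, task);

        // 4. Fix: targeted repair tasks, up to three attempts
        let attempts = 0;
        while (!report.ok && attempts < MAX_FIX_ATTEMPTS) {
            const fixTask = makeFixTask(task, report.issues);
            changes = await codeLLM.generate(buildCompactContext(fixTask, project));
            applyToFilesystem(project, changes);
            report = validate(project, task);
            attempts += 1;
        }

        task.status = report.ok ? 'completed' : 'failed';
    }
}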

Two-Tier LLM Design: Fast Analysis, Focused Generation

Code Creator uses two LLMs with distinct roles:

Tier 1: Analysis (local_llm)
Personas: Casey (Coder), Priya (Architect)
  • Parse user intent
  • Survey codebase structure
  • Generate DevPlan (atomic tasks)
  • Build compact context for each task
  • Fast, cheap, local inference
Tier 2: Code Generation (code_llm)
Specialized code model
  • Receive focused context + task
  • Generate structured file operations
  • Low temperature for determinism
  • Return only what changes
  • Optimized for code quality

This separation keeps planning fast and cheap (the local LLM handles analysis) and generation quality high (the specialized code model works only from focused context).
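As a rough configuration sketch (model names, parameters, and callLLM are placeholders, not Mneme's actual settings):

// Illustrative two-tier setup
const tiers = {
    analysis: {                 // Tier 1: fast local inference
        model: 'local-llm',
        jobs: ['parse intent', 'survey codebase', 'generate DevPlan', 'build compact context'],
    },
    codegen: {                  // Tier 2: specialized code model
        model: 'code-llm',
        temperature: 0.1,       // low temperature for deterministic, targeted edits
        jobs: ['generate structured file operations for a single task'],
    },
};

async function handleTask(task, project) {
    const context = await callLLM(tiers.analysis, { action: 'build_context', task, project });
    return callLLM(tiers.codegen, context);   // returns only the changes
}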

The Honest Truth: I Haven't Used It for Real Work Yet

Here's the part where I'm supposed to tell you about all the production apps I've shipped with Code Creator. But I can't, because I haven't used it for real work yet.

Not because it doesn't work. The snake game proves it works. But because I'm a perfectionist, and Code Creator isn't quite there yet for the kind of complex, production-grade software I build professionally.

What's it good for right now?

Small, self-contained projects: games, simple web apps, quick prototypes, the kind of things my daughters build.

What's it not quite ready for?

Large, production-grade systems: the complex software I build professionally, where reliability, integration, and long-lived architecture matter.

But it's getting there. Every week, the validation catches more issues. Every update to the Compact Context DSL improves focus. Every persona training iteration makes the plans smarter. The gap between "works for snake games" and "ships production software" is narrowing.

Real validation - My daughters use Code Creator to build their web projects. They don't care about the architecture. They just want apps that work. That's the bar.

Lessons: What Makes Code Shippable

1. Token Efficiency Improves Quality

Cutting context from 50,000 to 20,000 tokens wasn't just about cost; it made the output better. Focused context means focused changes. The model stops suggesting unnecessary refactoring and just solves the task at hand.

2. Validation Must Be Semantic, Not Just Syntactic

Catching functional duplication required understanding what code does, not just whether it parses. Syntax checking is table stakes. Real quality gates need semantic analysis.

3. Atomic Tasks Compound

Breaking "build a snake game" into 6 focused tasks meant each one could be validated independently. When task 3 failed, tasks 1-2 were still good. No monolithic rewrites-just targeted fixes.

4. User Enjoyment Is the Real Test

My daughters playing the snake game validated Code Creator more than any unit test could. If people want to use what it builds, it's working. If they don't, it's not, regardless of test coverage.

5. Honesty About Limitations Builds Trust

Code Creator works for small projects. It's not ready for production systems. Saying that out loud doesn't diminish what it can do; it clarifies where the value is today and where it's headed tomorrow.

What's Next

Code Creator is evolving. Current priorities:
  • Broader validation coverage, so more issues are caught before a human ever sees them
  • Continued refinement of the Compact Context DSL to keep the model focused
  • Smarter DevPlans through ongoing persona training iterations
  • Closing the gap between "works for snake games" and "ships production software"

The goal remains the same: code you'd trust to run. Not "technically correct," but actually shippable.


Key Takeaways

For AI teams building code generators:
  • Spend your token budget on focused context, not full files; the right "less" improves quality
  • Validate semantics (functional duplication, behavior), not just syntax and imports
  • Break work into atomic tasks so failures stay local and fixes stay targeted
  • Measure success by whether people actually use the output, not by test coverage alone

For teams evaluating local vs. cloud:
  • A fast local model is enough for planning, codebase surveys, and context building
  • Reserve the specialized code model for generation, and send it only the focused context it needs
  • The two-tier split keeps costs down without sacrificing code quality


© Dave Wheeler · wellerdavis.com · Built one error message at a time.