Building Mneme · Part 7

System Patterns - Drowning in Duplication, Saved by Structure

~10 min read · by Dave Wheeler (aka Weller Davis)

Three months into building Mneme, I had a crisis. E-book Creator worked. Image Creator worked. Music Creator... mostly worked. But they shared almost no code. Every new module meant reimplementing retry logic, checkpoint recovery, progress updates, and artifact storage. I was building five systems, not one platform.

That's when I started extracting patterns-not because I love abstraction, but because I was drowning in duplication.

This is the story of four patterns that saved the project: MCP (standardized tool calling), Unified Scheduler (one heartbeat for all creators), Filesystem Discipline (predictable artifact organization), and Checkpoint Recovery (resilience through bounded retries). Plus the moment I knew they were working-when Quiz Creator took only hours to build.

TL;DR - Mneme's patterns emerged from pain, not planning. MCP standardized tool calling (Pandoc integration went from 3 days to 20 minutes). Unified Scheduler fixed a 3am GPU memory collision. Filesystem discipline made finding artifacts trivial. Checkpoint recovery saved a 4-hour e-book from 90% crash. The payoff: adding Quiz Creator took hours, not weeks.

Pattern 1: MCP - From Crude Hacks to Standard Protocol

Before MCP: The Tool Calling Wilderness

Prior to November 2024, I was building tool calling into my AI products using custom, brittle approaches. At work, I had a deep research tool with crude but somewhat reliable function calling-JSON schema validation, manual parsing, error-prone integration with external services.

Every tool was a special case. Adding Pandoc support to E-book Creator took three days of writing custom adapters, handling edge cases, and debugging why the LLM's function calls didn't match my schema.

MCP Changes Everything

When Anthropic launched the Model Context Protocol (MCP) in November 2024, I immediately recognized what it meant: a standard handshake between LLMs and tools. No more custom adapters for every integration. No more JSON schema mismatches. Just a clean protocol that any MCP server could implement.

The First Integrations

I started with filesystem access-the most fundamental tool. Mneme needed to read, write, edit, and search files within sandboxed project directories. With MCP's filesystem_server, that became a standard capability across all modules.

Next came web search for research in Lesson and E-book Creator. Before MCP, I had custom web scraping with fragile HTML parsing. With MCP's fetch_server (deterministic web fetch + HTML→markdown conversion), research became reliable.

The Before/After Moment

When I needed to add Pandoc support to Blog Creator, I braced for another three-day integration marathon. Instead, it took 20 minutes:

Before MCP (E-book Creator)
  • Day 1: Write custom Pandoc wrapper
  • Day 2: Debug schema mismatches
  • Day 3: Handle error cases and format variations
  • Result: Pandoc works for e-books only
After MCP (Blog Creator)
  • Minute 1-5: Add pandoc_server to MCP config
  • Minute 6-15: Update persona prompt to use "convert" tool
  • Minute 16-20: Test markdown → HTML conversion
  • Result: Pandoc works everywhere, reliably

MCP Tools in Mneme

Today, Mneme uses MCP for all tool integrations: filesystem access, web fetch, and document conversion all speak the same protocol.

Why MCP matters - Tool isolation means the same persona logic works whether generating an EPUB, PNG, or FLAC. No module-specific plumbing. No special cases. Just standard tool calls.
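To make the isolation idea concrete, here's a minimal sketch of name-based tool dispatch in Python. Everything here (`ToolRegistry`, the toy `convert` function) is illustrative shorthand, not Mneme's actual API or the MCP wire protocol; the point is that persona logic addresses tools by name and never learns which server answers.

```python
from typing import Any, Callable, Dict

class ToolRegistry:
    """Hypothetical sketch of name-based tool dispatch (not Mneme's real API)."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> Any:
        # Persona logic calls tools by name; the backing server is invisible.
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**kwargs)

# Any module can register the same "convert" capability; callers never change.
registry = ToolRegistry()
registry.register("convert", lambda text, to: f"<{to}>{text}</{to}>")
html = registry.call("convert", text="# Hi", to="html")
```

The same `call("convert", ...)` line works whether the output is an EPUB, a PNG, or a FLAC, which is exactly the "no module-specific plumbing" property described above.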

Pattern 2: Unified Scheduler - The 3AM Crisis

Before: Six Cron Jobs, One Disaster

Early Mneme had six cron jobs running independently-one for each creator module. Each job would wake up, check its MongoDB collection for pending work, and dispatch tasks. Simple. Decentralized. And completely unsustainable.

At 3am one night, two jobs collided: Image Creator and Music Creator both fired up simultaneously, each loading their models into GPU memory. The RTX 4080's 16GB VRAM wasn't enough for both. Both workflows crashed. Both projects were marked as failed and needed manual intervention.

I woke up to error notifications and spent the next hour debugging why two unrelated systems had taken each other down.

The Unified Scheduler Emerges

That 3am debugging session led to a simple insight: one heartbeat, not six. Instead of one cron job per module, Mneme needed a single scheduler that understood concurrency, resource constraints, and priority across all creators.

Unified Scheduler Architecture
UnifiedProjectScheduler (wakes every N minutes)
  │
  ├─ EbookProjectHandler.find_pending_work()    → [WorkItem...]
  ├─ ImageProjectHandler.find_pending_work()    → [WorkItem...]
  ├─ MusicProjectHandler.find_pending_work()    → [WorkItem...]
  ├─ CodeProjectHandler.find_pending_work()     → [WorkItem...]
  └─ QuizProjectHandler.find_pending_work()     → [WorkItem...]
       ↓
  ProjectDispatcher.dispatch(items, respect_concurrency=True)
       ↓
  {Module}Orchestrator.process_workflow()
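In code, the handler contract is small. The class and method names in this Python sketch follow the diagram above; the implementation details are my guesses, not Mneme's code. Each module implements one discovery method, and a single tick aggregates work, with a per-module cap standing in for real resource limits like VRAM.

```python
from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class WorkItem:
    module: str
    project_id: str

class ProjectHandler(Protocol):
    """The one method every creator module must implement."""
    def find_pending_work(self) -> List[WorkItem]: ...

class UnifiedProjectScheduler:
    def __init__(self, handlers: List[ProjectHandler], max_per_module: int = 1):
        self.handlers = handlers
        self.max_per_module = max_per_module

    def tick(self) -> List[WorkItem]:
        """One heartbeat: gather pending work from every handler, capped per module."""
        items: List[WorkItem] = []
        for handler in self.handlers:
            # The cap is a stand-in for real constraints (VRAM, CPU).
            items.extend(handler.find_pending_work()[: self.max_per_module])
        return items

# Fake handlers to show the shape; real ones would query MongoDB.
class FakeEbookHandler:
    def find_pending_work(self) -> List[WorkItem]:
        return [WorkItem("ebook", "aaa"), WorkItem("ebook", "bbb")]

class FakeImageHandler:
    def find_pending_work(self) -> List[WorkItem]:
        return [WorkItem("image", "ccc")]

scheduler = UnifiedProjectScheduler([FakeEbookHandler(), FakeImageHandler()])
batch = scheduler.tick()  # at most one item per module this tick
```

Adding a new creator means writing one more class with a `find_pending_work` method and passing it to the scheduler's constructor; nothing else changes.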

What Changed

Before (Six Cron Jobs)
  • Each module polls independently
  • No coordination between jobs
  • Resource collisions crash workflows
  • Adding new module = new cron job
After (Unified Scheduler)
  • Single heartbeat discovers all pending work
  • Respects concurrency limits per module
  • Prevents resource collisions (VRAM, CPU)
  • Adding module = implement handler interface

Benefits Beyond Stability

The Unified Scheduler didn't just prevent crashes-it made the system predictable: one heartbeat means one place to reason about concurrency limits, priorities, and resource constraints across every creator.

Pattern 3: Filesystem Discipline - IDs, Not Chaos

Design Goal from Day One

I like to keep things organized. From the start, I knew Mneme needed a predictable filesystem structure: one directory per project, consistent subdirectories for different artifact types, no mixing module outputs.

The structure I settled on uses project IDs as directory names. Is it great for humans? No-691bd6ce65d90a698ea95ace doesn't tell me what's inside. But it works perfectly for Mneme, and I solved the human problem with a simple UX fix.

Standard Structure

~/mneme_data/{module}s/{project_id}/
├── research/              # Initial research and planning
├── planning/              # Chapter structure, section outlines
├── drafts/                # Raw generated content
├── content/               # Validated, compiled content
├── chapters/              # Final chapter markdown
├── manuscript/            # Assembled full manuscript
├── artifacts/             # EPUB, PDF, cover images, ZIP packages
├── audit_trail/           # Validator output, decisions, timestamped logs
├── checkpoint.json        # Recovery metadata
└── meta.json              # Project metadata
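The layout above can be materialized with a few lines of Python. The helper name `project_dir` is mine, not Mneme's; the directory names come straight from the tree.

```python
from pathlib import Path

# Subdirectory names from the standard structure above.
SUBDIRS = ("research", "planning", "drafts", "content",
           "chapters", "manuscript", "artifacts", "audit_trail")

def project_dir(module: str, project_id: str,
                base: Path = Path.home() / "mneme_data") -> Path:
    """Return (and create) the standard layout for one project."""
    root = base / f"{module}s" / project_id
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```

Because the path is a pure function of `(module, project_id)`, any part of the system can derive it independently, with no shared state and no lookups.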

The UX Fix: Show the ID

The solution to "IDs aren't human-readable" was simple: display the project ID in every panel of Mneme's web interface. Now when I need to find a project's artifacts, I don't have to query MongoDB for the ID-I just look at the UI, copy the ID, and navigate directly to the directory.

Small UX decisions like this make the difference between "theoretically organized" and "actually usable."

Why This Structure Matters

The payoff is derivability: given only a project ID, any module, tool, or human can construct the exact path to every artifact-no database lookup, no guessing, no special cases per creator.

Pattern 4: Checkpoint Recovery - The 90% Crash

The Facepalm Moment

Early in Mneme's development, I kicked off a 4-hour e-book generation before bed. The local LLM pipeline was still a bit unreliable-occasional timeout errors, sporadic model crashes. I figured, "What's the worst that could happen?"

At 90% completion-28 sections generated, validation complete, artifacts being assembled-the LLM crashed with a cryptic CUDA error. Four hours of work, gone. I woke up to a failed project with no way to resume.

Facepalm. But at least it was running locally-no cloud costs wasted!

Checkpoint Discipline

That failure taught me checkpoint discipline. Every long-running workflow now checkpoints at phase boundaries:

Checkpoint Structure
{
  "project_id": "691bd6ce65d90a698ea95ace",
  "checkpoint_timestamp": "2025-11-18T04:16:07Z",
  "checkpoint_metadata": {
    "last_completed_section": {
      "chapter_index": 7,
      "section_index": 4,
      "section_title": "Preparing for Tomorrow's Threats",
      "timestamp": "2025-11-18T04:16:07Z",
      "failed": false
    },
    "progress": {
      "completed_sections": 28,
      "total_sections": 28,
      "percentage": 100.0
    },
    "current_phase": "researching",
    "resume_point": {
      "status": "researching",
      "can_resume": true
    }
  }
}
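A checkpoint is only trustworthy if a crash mid-write can't corrupt it. One standard approach-a sketch under my own assumptions, not necessarily Mneme's implementation-is to write to a temp file and atomically rename it into place, so a crash leaves either the old checkpoint or the new one, never half of either.

```python
import json
import os
from pathlib import Path
from typing import Any, Dict

def save_checkpoint(path: Path, data: Dict[str, Any]) -> None:
    """Atomic checkpoint write: temp file first, then rename into place."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(data, indent=2))
    os.replace(tmp, path)  # atomic on the same filesystem

def load_checkpoint(path: Path) -> Dict[str, Any]:
    return json.loads(path.read_text())
```

`os.replace` is the key call: it guarantees the destination is swapped in one step, which is what makes resuming after a hard crash safe.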

Bounded Retries

Checkpoints enable smart recovery, but they're not magic. Some failures are transient (network timeout), others are permanent (malformed input). Mneme uses bounded retries with exponential backoff: each phase gets a limited number of attempts, with a growing delay between them.

This prevents infinite loops while giving transient issues time to resolve (network recovery, GPU memory cleanup, temporary service outages).
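A minimal sketch of that policy in Python (the attempt limit and delays here are illustrative defaults, not Mneme's documented values):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(fn: Callable[[], T], max_attempts: int = 3,
                 base_delay: float = 1.0) -> T:
    """Retry a flaky operation with exponential backoff; give up after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # permanent failure: surface it, don't loop forever
            time.sleep(base_delay * 2 ** (attempt - 1))  # e.g. 1s, 2s, 4s, ...
    raise AssertionError("unreachable")
```

A transient timeout succeeds on attempt two or three; a genuinely broken input exhausts its attempts and fails loudly instead of spinning forever.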

Recovery Wins

Since implementing checkpoint recovery, a crash mid-workflow costs minutes instead of hours: the orchestrator resumes from the last completed phase rather than starting over.

Concrete Example: Smart Home Setup Guide E-book

Let me show you how all four patterns work together in a real workflow. When a user creates an e-book project, here's what happens:

1. Scheduler Discovers Work

UnifiedScheduler wakes up (every 5 minutes)
  → EbookProjectHandler.find_pending_work()
    → Finds project 691bd6ce65d90a698ea95ace in "planning" status
    → Dispatches to EbookOrchestrator

2. MCP Tools in Action

Phase: Research
  → fetch_server: Retrieve web articles on smart home security
  → filesystem_server: Save research to research/ directory
  → Checkpoint: research phase complete

Phase: Planning
  → LLM: Generate chapter structure (8 chapters, 28 sections)
  → filesystem_server: Save planning to planning/ directory
  → Checkpoint: planning phase complete

Phase: Drafting
  → For each section (1-28):
    → LLM: Generate section content
    → filesystem_server: Save to drafts/{chapter}_{section}.md
    → Checkpoint: section N complete
    → If failure: Retry from last checkpoint (max 3 attempts)

Phase: Validation
  → Read all drafts, validate quality/coherence
  → filesystem_server: Save audit logs to audit_trail/
  → Checkpoint: validation complete

Phase: Assembly
  → Compile chapters into manuscript
  → pandoc_server: Convert markdown → EPUB
  → pandoc_server: Convert markdown → PDF
  → filesystem_server: Save to artifacts/
  → Checkpoint: assembly complete

3. Filesystem Structure After Completion

~/mneme_data/ebooks/691bd6ce65d90a698ea95ace/
├── artifacts/
│   ├── Smart_Home_Setup_Guide.epub        (536 KB)
│   ├── Smart_Home_Setup_Guide.pdf         (647 KB)
│   ├── Smart_Home_Setup_Guide.md          (111 KB)
│   ├── Smart_Home_Setup_Guide_cover.jpg   (484 KB)
│   └── Smart_Home_Setup_Guide_package.zip (1.2 MB)
├── audit_trail/
│   ├── 20251118_021920_e-book_planning.json
│   ├── 20251118_022003_chapter_section_planning.json
│   └── ... (28+ timestamped validation logs)
├── checkpoint.json                         (28/28 sections, 100%)
└── ... (research/, planning/, drafts/, chapters/, manuscript/)

4. Recovery Scenario

If the workflow crashes at section 15, the orchestrator reloads checkpoint.json, sees sections 1-14 safely on disk, and resumes drafting at section 15-retrying up to 3 times before marking the project failed.
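The resume computation is pure arithmetic over the checkpoint. This sketch reads the same fields as the checkpoint.json shown earlier; the helper itself (`resume_point`) is hypothetical.

```python
from typing import Dict, Tuple

def resume_point(checkpoint: Dict, total_sections: int) -> Tuple[int, bool]:
    """Return (next_section_number, finished) derived from checkpoint metadata."""
    meta = checkpoint["checkpoint_metadata"]
    done = meta["progress"]["completed_sections"]
    if done >= total_sections:
        return total_sections, True
    # Sections 1..done are safely on disk; restart at the next one.
    return done + 1, False

# Crash at section 15: sections 1-14 were checkpointed before the failure.
ckpt = {"checkpoint_metadata": {"progress": {"completed_sections": 14}}}
```

Because every completed section was written to `drafts/` and recorded before moving on, the only work lost is the one section that was in flight.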

The Payoff: Quiz Creator in Hours, Not Weeks

After extracting these four patterns, I wanted to test if they actually worked. Could I add a new creator module quickly, or was I just building elaborate scaffolding that would collapse under real use?

I decided to build Quiz Creator-a module for generating multiple-choice quizzes for my daughters. Subject-based questions, difficulty progression, immediate feedback. Nothing revolutionary, but a good test of the patterns.

With MCP, Unified Scheduler, Filesystem Discipline, and Checkpoint Recovery already in place, Quiz Creator took a few hours to build:

  • No retry logic to write - inherited from the base orchestrator
  • No checkpoint system to build - already there
  • No artifact storage to implement - the filesystem pattern handled it
  • No scheduler integration - just implement the handler interface

That's when I knew the patterns were working. Not when I finished documenting them, but when I could ship a new module in hours instead of weeks.

Validation through velocity - The best proof that patterns work isn't clean architecture diagrams. It's building Quiz Creator in a few hours instead of drowning in duplication for the sixth time.

Pattern 5: Observability - See It, Fix It, Trust It

Long-running jobs feel alive because they are alive. Mneme streams WebSocket events throughout every workflow:

Real-time Event Stream
ebook_progress    | phase: researching    | "Gathering sources..."
ebook_progress    | phase: drafting       | section 5/28
image_progress    | step: 12/20           | preview: base64_png
code_todo_update  | task: T002            | in_progress → completed
music_phase       | stage: B              | "Synthesis 73%..."
metrics_update    | tokens: 15,420        | role: research_llm, ms: 3,200

This real-time feedback transforms "wait and hope" into "watch and understand." When something fails, the event stream shows exactly where and why. When something succeeds, you can see the full audit trail leading to the result.

Lessons: What Works, What I'd Change

What I'd Keep

What I'd Change

Why This Matters to AI Engineering Teams


© Dave Wheeler · wellerdavis.com · Built one error message at a time.