Three months into building Mneme, I had a crisis. E-book Creator worked. Image Creator worked. Music Creator... mostly worked. But they shared almost no code. Every new module meant reimplementing retry logic, checkpoint recovery, progress updates, and artifact storage. I was building five systems, not one platform.
That's when I started extracting patterns-not because I love abstraction, but because I was drowning in duplication.
This is the story of four patterns that saved the project: MCP (standardized tool calling), Unified Scheduler (one heartbeat for all creators), Filesystem Discipline (predictable artifact organization), and Checkpoint Recovery (resilience through bounded retries). Plus the moment I knew they were working-when Quiz Creator took only hours to build.
Pattern 1: MCP - From Crude Hacks to Standard Protocol
Before MCP: The Tool Calling Wilderness
Prior to November 2024, I was building tool calling into my AI products with custom, brittle approaches. At work, I had a deep research tool whose function calling was crude but mostly worked: JSON schema validation, manual parsing, and error-prone integrations with external services.
Every tool was a special case. Adding Pandoc support to E-book Creator took three days of writing custom adapters, handling edge cases, and debugging why the LLM's function calls didn't match my schema.
MCP Changes Everything
When Anthropic launched the Model Context Protocol (MCP) in November 2024, I immediately recognized what it meant: a standard handshake between LLMs and tools. No more custom adapters for every integration. No more JSON schema mismatches. Just a clean protocol that any MCP server could implement.
The First Integrations
I started with filesystem access-the most fundamental tool. Mneme needed to read, write, edit, and search files within sandboxed project directories. With MCP's filesystem_server, that became a standard capability across all modules.
Next came web search for research in Lesson and E-book Creator. Before MCP, I had custom web scraping with fragile HTML parsing. With MCP's fetch_server (deterministic web fetch + HTML→markdown conversion), research became reliable.
The Before/After Moment
When I needed to add Pandoc support to Blog Creator, I braced for another three-day integration marathon. Instead, it took 20 minutes.

Before, with custom adapters:
- Day 1: Write custom Pandoc wrapper
- Day 2: Debug schema mismatches
- Day 3: Handle error cases and format variations
- Result: Pandoc works for e-books only

After, with MCP:
- Minutes 1-5: Add pandoc_server to MCP config
- Minutes 6-15: Update persona prompt to use "convert" tool
- Minutes 16-20: Test markdown → HTML conversion
- Result: Pandoc works everywhere, reliably
MCP Tools in Mneme
Today, Mneme uses MCP for all tool integrations:
- filesystem_server: Read/write/edit/search within sandboxed project directories
- fetch_server: Web research with deterministic HTML→markdown conversion
- puppeteer_server: Headless browser for navigation, screenshots, form filling
- pandoc_server: Format conversion (markdown → EPUB/PDF/HTML)
- comfyui_client: Image and music generation workflows
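To make the wiring concrete, here is a minimal sketch of what such a tool-server registry might look like in Python. The server names mirror the list above, but the launch commands and capability tags are illustrative assumptions, not Mneme's actual configuration:

```python
# Hypothetical MCP server registry: maps server names to launch commands
# and the capabilities they expose. A subset of the servers listed above;
# commands and capability tags are illustrative, not Mneme's real config.
MCP_SERVERS = {
    "filesystem_server": {
        "command": ["mcp-filesystem", "--root", "~/mneme_data"],
        "capabilities": ["read", "write", "edit", "search"],
    },
    "fetch_server": {
        "command": ["mcp-fetch"],
        "capabilities": ["fetch", "html_to_markdown"],
    },
    "pandoc_server": {
        "command": ["mcp-pandoc"],
        "capabilities": ["convert"],
    },
}

def servers_with(capability: str) -> list[str]:
    """Return the names of registered servers offering a capability."""
    return [name for name, cfg in MCP_SERVERS.items()
            if capability in cfg["capabilities"]]
```

The point of a registry like this is the before/after story above in miniature: adding a converter becomes a config entry, not an adapter.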
Pattern 2: Unified Scheduler - The 3AM Crisis
Before: Six Cron Jobs, One Disaster
Early Mneme had six cron jobs running independently-one for each creator module. Each job would wake up, check its MongoDB collection for pending work, and dispatch tasks. Simple. Decentralized. And completely unsustainable.
At 3am one night, two jobs collided: Image Creator and Music Creator both fired up simultaneously, each loading their models into GPU memory. The RTX 4080's 16GB VRAM wasn't enough for both. Both workflows crashed. Both projects were marked as failed and needed manual intervention.
I woke up to error notifications and spent the next hour debugging why two unrelated systems had taken each other down.
The Unified Scheduler Emerges
That 3am debugging session led to a simple insight: one heartbeat, not six. Instead of one independent cron job per module, Mneme needed a single scheduler that understood concurrency, resource constraints, and priority across all creators.
UnifiedProjectScheduler (wakes every N minutes)
│
├─ EbookProjectHandler.find_pending_work() → [WorkItem...]
├─ ImageProjectHandler.find_pending_work() → [WorkItem...]
├─ MusicProjectHandler.find_pending_work() → [WorkItem...]
├─ CodeProjectHandler.find_pending_work() → [WorkItem...]
└─ QuizProjectHandler.find_pending_work() → [WorkItem...]
↓
ProjectDispatcher.dispatch(items, respect_concurrency=True)
↓
{Module}Orchestrator.process_workflow()
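The shape of that loop can be sketched in a few lines of Python. Only find_pending_work comes from the diagram above; the class names, priority field, and dispatch logic are my reconstruction, not Mneme's actual code:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class WorkItem:
    project_id: str
    module: str        # e.g. "ebook", "image", "music"
    priority: int = 0  # lower value = more urgent (illustrative)

class ProjectHandler(Protocol):
    """What each creator module must implement to join the scheduler."""
    module: str
    max_concurrent: int
    def find_pending_work(self) -> list[WorkItem]: ...

def dispatch(handlers: list[ProjectHandler], running: dict[str, int]) -> list[WorkItem]:
    """One scheduler tick: gather pending work from every handler,
    dispatching only as many items as each module's concurrency allows."""
    dispatched = []
    for h in handlers:
        slots = h.max_concurrent - running.get(h.module, 0)
        pending = sorted(h.find_pending_work(), key=lambda w: w.priority)
        for item in pending[:max(slots, 0)]:
            dispatched.append(item)
            running[h.module] = running.get(h.module, 0) + 1
    return dispatched
```

Because the concurrency check lives in one place, a second GPU-hungry module can never sneak past it the way the two 3am cron jobs did.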
What Changed

Before, with six cron jobs:
- Each module polls independently
- No coordination between jobs
- Resource collisions crash workflows
- Adding a new module = new cron job

After, with the Unified Scheduler:
- Single heartbeat discovers all pending work
- Respects concurrency limits per module
- Prevents resource collisions (VRAM, CPU)
- Adding a module = implement the handler interface
Benefits Beyond Stability
The Unified Scheduler didn't just prevent crashes-it made the system predictable:
- Global sleep cycles: Energy-saving quiet hours respected across all modules
- Priority queues: User-initiated projects jump ahead of automated background work
- Back-pressure: When one module is overloaded, others continue normally
- Observability: Single dashboard showing all active/pending work across creators
Pattern 3: Filesystem Discipline - IDs, Not Chaos
Design Goal from Day One
I like to keep things organized. From the start, I knew Mneme needed a predictable filesystem structure: one directory per project, consistent subdirectories for different artifact types, no mixing module outputs.
The structure I settled on uses project IDs as directory names. Is it great for humans? No-691bd6ce65d90a698ea95ace doesn't tell me what's inside. But it works perfectly for Mneme, and I solved the human problem with a simple UX fix.
Standard Structure
~/mneme_data/{module}s/{project_id}/
├── research/ # Initial research and planning
├── planning/ # Chapter structure, section outlines
├── drafts/ # Raw generated content
├── content/ # Validated, compiled content
├── chapters/ # Final chapter markdown
├── manuscript/ # Assembled full manuscript
├── artifacts/ # EPUB, PDF, cover images, ZIP packages
├── audit_trail/ # Validator output, decisions, timestamped logs
├── checkpoint.json # Recovery metadata
└── meta.json # Project metadata
The UX Fix: Show the ID
The solution to "IDs aren't human-readable" was simple: display the project ID in every panel of Mneme's web interface. Now when I need to find a project's artifacts, I don't have to query MongoDB for the ID-I just look at the UI, copy the ID, and navigate directly to the directory.
Small UX decisions like this make the difference between "theoretically organized" and "actually usable."
Why This Structure Matters
- Predictability: Publishing scripts know exactly where artifacts live
- Auditability: Every decision logged with timestamp in audit_trail/
- Recovery: Checkpoint state stored at project root, easy to inspect
- Shareability: Entire project is self-contained, can be zipped and moved
Pattern 4: Checkpoint Recovery - The 90% Crash
The Facepalm Moment
Early in Mneme's development, I kicked off a 4-hour e-book generation before bed. The local LLM pipeline was still a bit unreliable-occasional timeout errors, sporadic model crashes. I figured, "What's the worst that could happen?"
At 90% completion-28 sections generated, validation complete, artifacts being assembled-the LLM crashed with a cryptic CUDA error. Four hours of work, gone. I woke up to a failed project with no way to resume.
Facepalm. But at least it was running locally-no cloud costs wasted!
Checkpoint Discipline
That failure taught me checkpoint discipline. Every long-running workflow now checkpoints at phase boundaries:
{
  "project_id": "691bd6ce65d90a698ea95ace",
  "checkpoint_timestamp": "2025-11-18T04:16:07Z",
  "checkpoint_metadata": {
    "last_completed_section": {
      "chapter_index": 7,
      "section_index": 4,
      "section_title": "Preparing for Tomorrow's Threats",
      "timestamp": "2025-11-18T04:16:07Z",
      "failed": false
    },
    "progress": {
      "completed_sections": 28,
      "total_sections": 28,
      "percentage": 100.0
    },
    "current_phase": "researching",
    "resume_point": {
      "status": "researching",
      "can_resume": true
    }
  }
}
Bounded Retries
Checkpoints enable smart recovery, but they're not magic. Some failures are transient (network timeout), others are permanent (malformed input). Mneme uses bounded retries with exponential backoff:
- First failure: Retry immediately from checkpoint
- Second failure: Wait 5 minutes, retry
- Third failure: Wait 15 minutes, retry
- Fourth failure: Mark as failed, escalate to human
This prevents infinite loops while giving transient issues time to resolve (network recovery, GPU memory cleanup, temporary service outages).
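In code, the schedule above reduces to a small delay table. The decorator-free shape below is my sketch, not Mneme's actual retry machinery; the delays come straight from the list above:

```python
import time

# Delay before each attempt: the initial try, an immediate retry,
# then 5-minute and 15-minute waits. Four attempts total, then escalate.
RETRY_DELAYS = [0, 0, 5 * 60, 15 * 60]

def run_with_bounded_retries(step, sleep=time.sleep):
    """Run `step` (a zero-argument callable that resumes from the last
    checkpoint) according to RETRY_DELAYS. Raises RuntimeError once the
    schedule is exhausted so a human can take over."""
    last_error = None
    for delay in RETRY_DELAYS:
        sleep(delay)
        try:
            return step()
        except Exception as exc:  # could be transient (timeout) or permanent
            last_error = exc
    raise RuntimeError("retries exhausted; escalate to human review") from last_error
```

Injecting `sleep` keeps the backoff testable; in production the default time.sleep applies the real waits.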
Recovery Wins
Since implementing checkpoint recovery:
- Zero 4-hour e-books lost to late-stage crashes
- Image generation workflows resume after ComfyUI restarts
- Code Creator picks up from last completed task after validation failure
- Music Creator recovers from YuE's two-stage pipeline failures
Concrete Example: Smart Home Setup Guide E-book
Let me show you how all four patterns work together in a real workflow. When a user creates an e-book project, here's what happens:
1. Scheduler Discovers Work
UnifiedScheduler wakes up (every 5 minutes)
→ EbookProjectHandler.find_pending_work()
→ Finds project 691bd6ce65d90a698ea95ace in "planning" status
→ Dispatches to EbookOrchestrator
2. MCP Tools in Action
Phase: Research
→ fetch_server: Retrieve web articles on smart home security
→ filesystem_server: Save research to research/ directory
→ Checkpoint: research phase complete
Phase: Planning
→ LLM: Generate chapter structure (8 chapters, 28 sections)
→ filesystem_server: Save planning to planning/ directory
→ Checkpoint: planning phase complete
Phase: Drafting
→ For each section (1-28):
→ LLM: Generate section content
→ filesystem_server: Save to drafts/{chapter}_{section}.md
→ Checkpoint: section N complete
→ If failure: Retry from last checkpoint (max 3 attempts)
Phase: Validation
→ Read all drafts, validate quality/coherence
→ filesystem_server: Save audit logs to audit_trail/
→ Checkpoint: validation complete
Phase: Assembly
→ Compile chapters into manuscript
→ pandoc_server: Convert markdown → EPUB
→ pandoc_server: Convert markdown → PDF
→ filesystem_server: Save to artifacts/
→ Checkpoint: assembly complete
3. Filesystem Structure After Completion
~/mneme_data/ebooks/691bd6ce65d90a698ea95ace/
├── artifacts/
│   ├── Smart_Home_Setup_Guide.epub (536 KB)
│   ├── Smart_Home_Setup_Guide.pdf (647 KB)
│   ├── Smart_Home_Setup_Guide.md (111 KB)
│   ├── Smart_Home_Setup_Guide_cover.jpg (484 KB)
│   └── Smart_Home_Setup_Guide_package.zip (1.2 MB)
├── audit_trail/
│   ├── 20251118_021920_e-book_planning.json
│   ├── 20251118_022003_chapter_section_planning.json
│   └── ... (28+ timestamped validation logs)
├── checkpoint.json (28/28 sections, 100%)
└── ... (research/, planning/, drafts/, chapters/, manuscript/)
4. Recovery Scenario
If the workflow crashes at section 15:
- Checkpoint shows last completed: section 14
- UnifiedScheduler finds project in "auto_resuming" status
- EbookOrchestrator loads checkpoint, resumes from section 15
- Sections 1-14 already exist, no regeneration needed
- Bounded retry: max 3 attempts to complete section 15
- If success: continue to section 16, normal flow
- If 3 failures: mark as failed, human review required
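The resume decision in this scenario can be read straight off checkpoint.json. A sketch using the field names from the checkpoint example earlier in this post (the function itself is illustrative):

```python
from typing import Optional

def resume_section(checkpoint: dict) -> Optional[int]:
    """Return the next 1-based section to draft, or None if the project
    is either finished or marked as not resumable."""
    meta = checkpoint["checkpoint_metadata"]
    if not meta["resume_point"]["can_resume"]:
        return None
    done = meta["progress"]["completed_sections"]
    total = meta["progress"]["total_sections"]
    return done + 1 if done < total else None
```

For the crash-at-section-15 scenario above, a checkpoint showing 14 completed sections yields 15; a finished or non-resumable project yields None.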
The Payoff: Quiz Creator in Hours, Not Weeks
After extracting these four patterns, I wanted to test if they actually worked. Could I add a new creator module quickly, or was I just building elaborate scaffolding that would collapse under real use?
I decided to build Quiz Creator-a module for generating multiple-choice quizzes for my daughters. Subject-based questions, difficulty progression, immediate feedback. Nothing revolutionary, but a good test of the patterns.
With MCP, Unified Scheduler, Filesystem Discipline, and Checkpoint Recovery already in place, Quiz Creator took a few hours to build:
- Hour 1: Define quiz schema, create MongoDB collection
- Hour 2: Implement QuizProjectHandler (find_pending_work interface)
- Hour 3: Write persona prompts for question generation
- Hour 4: Test end-to-end workflow, generate first quiz
No retry logic to write-inherited from base orchestrator. No checkpoint system to build-already there. No artifact storage to implement-filesystem pattern handled it. No scheduler integration-just implemented the handler interface.
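The heart of that Hour 2 step is small. Here is a sketch of what the handler interface demands of a new module, assuming a pymongo-style collection; the status values and field names are illustrative, not Mneme's actual schema:

```python
class QuizProjectHandler:
    """Scheduler handler for the quiz module. All a new creator must
    provide: a module name, a concurrency limit, and find_pending_work()."""
    module = "quiz"
    max_concurrent = 2  # illustrative limit

    def __init__(self, collection):
        # `collection` is anything with .find(query) -> iterable of dicts,
        # e.g. a pymongo Collection.
        self.collection = collection

    def find_pending_work(self):
        """Return the IDs of quiz projects waiting to be processed."""
        pending = self.collection.find(
            {"status": {"$in": ["planning", "auto_resuming"]}})
        return [doc["_id"] for doc in pending]
```

Everything else, retries, checkpoints, artifact paths, scheduling, comes from the shared patterns, which is why the whole module fit in an afternoon.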
That's when I knew the patterns were working. Not when I finished documenting them, but when I could ship a new module in hours instead of weeks.
Pattern 5: Observability - See It, Fix It, Trust It
Long-running jobs feel alive because they are alive. Mneme streams WebSocket events throughout every workflow:
ebook_progress   | phase: researching | "Gathering sources..."
ebook_progress   | phase: drafting    | section 5/28
image_progress   | step: 12/20        | preview: base64_png
code_todo_update | task: T002         | in_progress → completed
music_phase      | stage: B           | "Synthesis 73%..."
metrics_update   | tokens: 15,420     | role: research_llm, ms: 3,200
This real-time feedback transforms "wait and hope" into "watch and understand." When something fails, the event stream shows exactly where and why. When something succeeds, you can see the full audit trail leading to the result.
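The envelope behind a stream like this can be tiny. A sketch with field names inferred from the examples above; the actual broadcast plumbing (WebSocket send) is omitted:

```python
import json
import time

def make_event(event_type: str, **fields) -> str:
    """Serialize one progress event for the WebSocket stream: a type,
    a timestamp, and whatever module-specific fields the emitter adds."""
    envelope = {"type": event_type, "ts": time.time(), **fields}
    return json.dumps(envelope)
```

Keeping the envelope flat and JSON-serializable is what makes a single dashboard across all creators possible: every module speaks the same minimal schema.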
Lessons: What Works, What I'd Change
What I'd Keep
- MCP abstraction: Tool standardization pays dividends every time you add a new integration
- Unified Scheduler: Single heartbeat prevents resource collisions and simplifies reasoning
- Filesystem discipline: Predictable structure makes everything from debugging to publishing trivial
- Bounded retries + checkpoints: Turns intermittent failures into normal operations
What I'd Change
- Event catalog: Add semantic event names with contracts (currently events are somewhat ad-hoc)
- Test-first validation: Expand pre-generation checks to reduce downstream retries
- Dependency tracking: Better understanding of which projects depend on shared resources (LoRA models, checkpoints)
Why This Matters to AI Engineering Teams
- Patterns reduce cognitive load: Teams reuse the same spine for new modules instead of reinventing orchestration
- Tool isolation (MCP) enables swapping: Replace Pandoc with a different converter without touching orchestrators
- Recovery-first thinking normalizes failure: Intermittent LLM errors become checkpoint-and-retry, not incident response
- Filesystem + metadata discipline accelerates audits: Find any artifact from any project in seconds, not queries
- Velocity validates architecture: Quiz Creator in hours proved the patterns worked better than any design review