Three months into building Mneme, I had a crisis. E-book Creator worked. Image Creator worked. Music Creator... mostly worked. But they shared almost no code. Every new module meant reimplementing retry logic, checkpoint recovery, progress updates, and artifact storage. I was building five systems, not one platform.
That's when I started extracting patterns-not because I love abstraction, but because I was drowning in duplication.
This is the story of four patterns that saved the project: MCP (standardized tool calling), Unified Scheduler (one heartbeat for all creators), Filesystem Discipline (predictable artifact organization), and Checkpoint Recovery (resilience through bounded retries). Plus the moment I knew they were working-when Quiz Creator took only hours to build.
Pattern 1: MCP - From Crude Hacks to Standard Protocol
Before MCP: The Tool Calling Wilderness
Prior to November 2024, I was building tool calling into my AI products with custom, brittle approaches. At work, I had a deep research tool whose function calling was crude but mostly worked: JSON schema validation, manual parsing, and error-prone integrations with external services.
Every tool was a special case. Adding Pandoc support to E-book Creator took three days of writing custom adapters, handling edge cases, and debugging why the LLM's function calls didn't match my schema.
MCP Changes Everything
When Anthropic launched the Model Context Protocol (MCP) in November 2024, I immediately recognized what it meant: a standard handshake between LLMs and tools. No more custom adapters for every integration. No more JSON schema mismatches. Just a clean protocol that any MCP server could implement.
The First Integrations
I started with filesystem access-the most fundamental tool. Mneme needed to read, write, edit, and search files within sandboxed project directories. With MCP's filesystem_server, that became a standard capability across all modules.
Next came web search for research in Lesson and E-book Creator. Before MCP, I had custom web scraping with fragile HTML parsing. With MCP's fetch_server (deterministic web fetch + HTML→markdown conversion), research became reliable.
The Before/After Moment
When I needed to add Pandoc support to Blog Creator, I braced for another three-day integration marathon. Instead, it took 20 minutes.

Before, with custom adapters:
- Day 1: Write custom Pandoc wrapper
- Day 2: Debug schema mismatches
- Day 3: Handle error cases and format variations
- Result: Pandoc works for e-books only

After, with MCP:
- Minutes 1-5: Add pandoc_server to MCP config
- Minutes 6-15: Update persona prompt to use "convert" tool
- Minutes 16-20: Test markdown → HTML conversion
- Result: Pandoc works everywhere, reliably
MCP Tools in Mneme
Today, Mneme uses MCP for all tool integrations:
- filesystem_server: Read/write/edit/search within sandboxed project directories
- fetch_server: Web research with deterministic HTML→markdown conversion
- puppeteer_server: Headless browser for navigation, screenshots, form filling
- pandoc_server: Format conversion (markdown → EPUB/PDF/HTML)
- comfyui_client: Image and music generation workflows
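To make the wiring concrete, here is a minimal sketch of what such a tool-server registry might look like in Python. The server names mirror the list above, but the launch commands and capability tags are illustrative assumptions, not Mneme's actual configuration:

```python
# Hypothetical MCP server registry: maps server names to launch commands
# and the capabilities they expose. A subset of the servers listed above;
# commands and capability tags are illustrative, not Mneme's real config.
MCP_SERVERS = {
    "filesystem_server": {
        "command": ["mcp-filesystem", "--root", "~/mneme_data"],
        "capabilities": ["read", "write", "edit", "search"],
    },
    "fetch_server": {
        "command": ["mcp-fetch"],
        "capabilities": ["fetch", "html_to_markdown"],
    },
    "pandoc_server": {
        "command": ["mcp-pandoc"],
        "capabilities": ["convert"],
    },
}

def servers_with(capability: str) -> list[str]:
    """Return the names of registered servers offering a capability."""
    return [name for name, cfg in MCP_SERVERS.items()
            if capability in cfg["capabilities"]]
```

The point of a registry like this is the before/after story above in miniature: adding a converter becomes a config entry, not an adapter.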
Pattern 2: Unified Scheduler - The 3AM Crisis
Before: Six Cron Jobs, One Disaster
Early Mneme had six cron jobs running independently-one for each creator module. Each job would wake up, check its MongoDB collection for pending work, and dispatch tasks. Simple. Decentralized. And completely unsustainable.
At 3am one night, two jobs collided: Image Creator and Music Creator both fired up simultaneously, each loading their models into GPU memory. The RTX 4080's 16GB VRAM wasn't enough for both. Both workflows crashed. Both projects were marked as failed and needed manual intervention.
I woke up to error notifications and spent the next hour debugging why two unrelated systems had taken each other down.
The Unified Scheduler Emerges
That 3am debugging session led to a simple insight: one heartbeat, not six. Instead of one independent cron job per module, Mneme needed a single scheduler that understood concurrency, resource constraints, and priority across all creators.
UnifiedProjectScheduler (wakes every N minutes)
│
├─ EbookProjectHandler.find_pending_work() → [WorkItem...]
├─ ImageProjectHandler.find_pending_work() → [WorkItem...]
├─ MusicProjectHandler.find_pending_work() → [WorkItem...]
├─ CodeProjectHandler.find_pending_work() → [WorkItem...]
└─ QuizProjectHandler.find_pending_work() → [WorkItem...]
↓
ProjectDispatcher.dispatch(items, respect_concurrency=True)
↓
{Module}Orchestrator.process_workflow()
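The shape of that loop can be sketched in a few lines of Python. Only find_pending_work comes from the diagram above; the class names, priority field, and dispatch logic are my reconstruction, not Mneme's actual code:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class WorkItem:
    project_id: str
    module: str        # e.g. "ebook", "image", "music"
    priority: int = 0  # lower value = more urgent (illustrative)

class ProjectHandler(Protocol):
    """What each creator module must implement to join the scheduler."""
    module: str
    max_concurrent: int
    def find_pending_work(self) -> list[WorkItem]: ...

def dispatch(handlers: list[ProjectHandler], running: dict[str, int]) -> list[WorkItem]:
    """One scheduler tick: gather pending work from every handler,
    dispatching only as many items as each module's concurrency allows."""
    dispatched = []
    for h in handlers:
        slots = h.max_concurrent - running.get(h.module, 0)
        pending = sorted(h.find_pending_work(), key=lambda w: w.priority)
        for item in pending[:max(slots, 0)]:
            dispatched.append(item)
            running[h.module] = running.get(h.module, 0) + 1
    return dispatched
```

Because the concurrency check lives in one place, a second GPU-hungry module can never sneak past it the way the two 3am cron jobs did.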
What Changed

Before, with six cron jobs:
- Each module polls independently
- No coordination between jobs
- Resource collisions crash workflows
- Adding a new module = new cron job

After, with the Unified Scheduler:
- Single heartbeat discovers all pending work
- Respects concurrency limits per module
- Prevents resource collisions (VRAM, CPU)
- Adding a module = implement the handler interface
Benefits Beyond Stability
The Unified Scheduler didn't just prevent crashes-it made the system predictable:
- Global sleep cycles: Energy-saving quiet hours respected across all modules
- Priority queues: User-initiated projects jump ahead of automated background work
- Back-pressure: When one module is overloaded, others continue normally
- Observability: Single dashboard showing all active/pending work across creators
Pattern 3: Filesystem Discipline - IDs, Not Chaos
Design Goal from Day One
I like to keep things organized. From the start, I knew Mneme needed a predictable filesystem structure: one directory per project, consistent subdirectories for different artifact types, no mixing module outputs.
The structure I settled on uses project IDs as directory names. Is it great for humans? No-691bd6ce65d90a698ea95ace doesn't tell me what's inside. But it works perfectly for Mneme, and I solved the human problem with a simple UX fix.
Standard Structure
~/mneme_data/{module}s/{project_id}/
├── research/ # Initial research and planning
├── planning/ # Chapter structure, section outlines
├── drafts/ # Raw generated content
├── content/ # Validated, compiled content
├── chapters/ # Final chapter markdown
├── manuscript/ # Assembled full manuscript
├── artifacts/ # EPUB, PDF, cover images, ZIP packages
├── audit_trail/ # Validator output, decisions, timestamped logs
├── checkpoint.json # Recovery metadata
└── meta.json # Project metadata
The UX Fix: Show the ID
The solution to "IDs aren't human-readable" was simple: display the project ID in every panel of Mneme's web interface. Now when I need to find a project's artifacts, I don't have to query MongoDB for the ID-I just look at the UI, copy the ID, and navigate directly to the directory.
Small UX decisions like this make the difference between "theoretically organized" and "actually usable."
Why This Structure Matters
- Predictability: Publishing scripts know exactly where artifacts live
- Auditability: Every decision logged with timestamp in audit_trail/
- Recovery: Checkpoint state stored at project root, easy to inspect
- Shareability: Entire project is self-contained, can be zipped and moved
Pattern 4: Checkpoint Recovery - The 90% Crash
The Facepalm Moment
Early in Mneme's development, I kicked off a 4-hour e-book generation before bed. The local LLM pipeline was still a bit unreliable-occasional timeout errors, sporadic model crashes. I figured, "What's the worst that could happen?"
At 90% completion-28 sections generated, validation complete, artifacts being assembled-the LLM crashed with a cryptic CUDA error. Four hours of work, gone. I woke up to a failed project with no way to resume.
Facepalm. But at least it was running locally-no cloud costs wasted!
Checkpoint Discipline
That failure taught me checkpoint discipline. Every long-running workflow now checkpoints at phase boundaries:
{
  "project_id": "691bd6ce65d90a698ea95ace",
  "checkpoint_timestamp": "2025-11-18T04:16:07Z",
  "checkpoint_metadata": {
    "last_completed_section": {
      "chapter_index": 7,
      "section_index": 4,
      "section_title": "Preparing for Tomorrow's Threats",
      "timestamp": "2025-11-18T04:16:07Z",
      "failed": false
    },
    "progress": {
      "completed_sections": 28,
      "total_sections": 28,
      "percentage": 100.0
    },
    "current_phase": "researching",
    "resume_point": {
      "status": "researching",
      "can_resume": true
    }
  }
}
Bounded Retries
Checkpoints enable smart recovery, but they're not magic. Some failures are transient (network timeout), others are permanent (malformed input). Mneme uses bounded retries with exponential backoff:
- First failure: Retry immediately from checkpoint
- Second failure: Wait 5 minutes, retry
- Third failure: Wait 15 minutes, retry
- Fourth failure: Mark as failed, escalate to human
This prevents infinite loops while giving transient issues time to resolve (network recovery, GPU memory cleanup, temporary service outages).
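In code, the schedule above reduces to a small delay table. The decorator-free shape below is my sketch, not Mneme's actual retry machinery; the delays come straight from the list above:

```python
import time

# Delay before each attempt: the initial try, an immediate retry,
# then 5-minute and 15-minute waits. Four attempts total, then escalate.
RETRY_DELAYS = [0, 0, 5 * 60, 15 * 60]

def run_with_bounded_retries(step, sleep=time.sleep):
    """Run `step` (a zero-argument callable that resumes from the last
    checkpoint) according to RETRY_DELAYS. Raises RuntimeError once the
    schedule is exhausted so a human can take over."""
    last_error = None
    for delay in RETRY_DELAYS:
        sleep(delay)
        try:
            return step()
        except Exception as exc:  # could be transient (timeout) or permanent
            last_error = exc
    raise RuntimeError("retries exhausted; escalate to human review") from last_error
```

Injecting `sleep` keeps the backoff testable; in production the default time.sleep applies the real waits.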
Recovery Wins
Since implementing checkpoint recovery:
- Zero 4-hour e-books lost to late-stage crashes
- Image generation workflows resume after ComfyUI restarts
- Code Creator picks up from last completed task after validation failure
- Music Creator recovers from YuE's two-stage pipeline failures
Concrete Example: Smart Home Setup Guide E-book
Let me show you how all four patterns work together in a real workflow. When a user creates an e-book project, here's what happens:
1. Scheduler Discovers Work
UnifiedScheduler wakes up (every 5 minutes)
→ EbookProjectHandler.find_pending_work()
→ Finds project 691bd6ce65d90a698ea95ace in "planning" status
→ Dispatches to EbookOrchestrator
2. MCP Tools in Action
Phase: Research
→ fetch_server: Retrieve web articles on smart home security
→ filesystem_server: Save research to research/ directory
→ Checkpoint: research phase complete
Phase: Planning
→ LLM: Generate chapter structure (8 chapters, 28 sections)
→ filesystem_server: Save planning to planning/ directory
→ Checkpoint: planning phase complete
Phase: Drafting
→ For each section (1-28):
→ LLM: Generate section content
→ filesystem_server: Save to drafts/{chapter}_{section}.md
→ Checkpoint: section N complete
→ If failure: Retry from last checkpoint (max 3 attempts)
Phase: Validation
→ Read all drafts, validate quality/coherence
→ filesystem_server: Save audit logs to audit_trail/
→ Checkpoint: validation complete
Phase: Assembly
→ Compile chapters into manuscript
→ pandoc_server: Convert markdown → EPUB
→ pandoc_server: Convert markdown → PDF
→ filesystem_server: Save to artifacts/
→ Checkpoint: assembly complete
3. Filesystem Structure After Completion
~/mneme_data/ebooks/691bd6ce65d90a698ea95ace/
├── artifacts/
│   ├── Smart_Home_Setup_Guide.epub (536 KB)
│   ├── Smart_Home_Setup_Guide.pdf (647 KB)
│   ├── Smart_Home_Setup_Guide.md (111 KB)
│   ├── Smart_Home_Setup_Guide_cover.jpg (484 KB)
│   └── Smart_Home_Setup_Guide_package.zip (1.2 MB)
├── audit_trail/
│   ├── 20251118_021920_e-book_planning.json
│   ├── 20251118_022003_chapter_section_planning.json
│   └── ... (28+ timestamped validation logs)
├── checkpoint.json (28/28 sections, 100%)
└── ... (research/, planning/, drafts/, chapters/, manuscript/)
4. Recovery Scenario
If the workflow crashes at section 15:
- Checkpoint shows last completed: section 14
- UnifiedScheduler finds project in "auto_resuming" status
- EbookOrchestrator loads checkpoint, resumes from section 15
- Sections 1-14 already exist, no regeneration needed
- Bounded retry: max 3 attempts to complete section 15
- If success: continue to section 16, normal flow
- If 3 failures: mark as failed, human review required
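The resume decision in this scenario can be read straight off checkpoint.json. A sketch using the field names from the checkpoint example earlier in this post (the function itself is illustrative):

```python
from typing import Optional

def resume_section(checkpoint: dict) -> Optional[int]:
    """Return the next 1-based section to draft, or None if the project
    is either finished or marked as not resumable."""
    meta = checkpoint["checkpoint_metadata"]
    if not meta["resume_point"]["can_resume"]:
        return None
    done = meta["progress"]["completed_sections"]
    total = meta["progress"]["total_sections"]
    return done + 1 if done < total else None
```

For the crash-at-section-15 scenario above, a checkpoint showing 14 completed sections yields 15; a finished or non-resumable project yields None.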
The Payoff: Quiz Creator in Hours, Not Weeks
After extracting these four patterns, I wanted to test if they actually worked. Could I add a new creator module quickly, or was I just building elaborate scaffolding that would collapse under real use?
I decided to build Quiz Creator-a module for generating multiple-choice quizzes for my daughters. Subject-based questions, difficulty progression, immediate feedback. Nothing revolutionary, but a good test of the patterns.
With MCP, Unified Scheduler, Filesystem Discipline, and Checkpoint Recovery already in place, Quiz Creator took a few hours to build:
- Hour 1: Define quiz schema, create MongoDB collection
- Hour 2: Implement QuizProjectHandler (find_pending_work interface)
- Hour 3: Write persona prompts for question generation
- Hour 4: Test end-to-end workflow, generate first quiz
No retry logic to write-inherited from base orchestrator. No checkpoint system to build-already there. No artifact storage to implement-filesystem pattern handled it. No scheduler integration-just implemented the handler interface.
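The heart of that Hour 2 step is small. Here is a sketch of what the handler interface demands of a new module, assuming a pymongo-style collection; the status values and field names are illustrative, not Mneme's actual schema:

```python
class QuizProjectHandler:
    """Scheduler handler for the quiz module. All a new creator must
    provide: a module name, a concurrency limit, and find_pending_work()."""
    module = "quiz"
    max_concurrent = 2  # illustrative limit

    def __init__(self, collection):
        # `collection` is anything with .find(query) -> iterable of dicts,
        # e.g. a pymongo Collection.
        self.collection = collection

    def find_pending_work(self):
        """Return the IDs of quiz projects waiting to be processed."""
        pending = self.collection.find(
            {"status": {"$in": ["planning", "auto_resuming"]}})
        return [doc["_id"] for doc in pending]
```

Everything else, retries, checkpoints, artifact paths, scheduling, comes from the shared patterns, which is why the whole module fit in an afternoon.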
That's when I knew the patterns were working. Not when I finished documenting them, but when I could ship a new module in hours instead of weeks.
Pattern 5: Observability - See It, Fix It, Trust It
Long-running jobs feel alive because they are alive. Mneme streams WebSocket events throughout every workflow:
ebook_progress   | phase: researching | "Gathering sources..."
ebook_progress   | phase: drafting    | section 5/28
image_progress   | step: 12/20        | preview: base64_png
code_todo_update | task: T002         | in_progress → completed
music_phase      | stage: B           | "Synthesis 73%..."
metrics_update   | tokens: 15,420     | role: research_llm, ms: 3,200
This real-time feedback transforms "wait and hope" into "watch and understand." When something fails, the event stream shows exactly where and why. When something succeeds, you can see the full audit trail leading to the result.
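The envelope behind a stream like this can be tiny. A sketch with field names inferred from the examples above; the actual broadcast plumbing (WebSocket send) is omitted:

```python
import json
import time

def make_event(event_type: str, **fields) -> str:
    """Serialize one progress event for the WebSocket stream: a type,
    a timestamp, and whatever module-specific fields the emitter adds."""
    envelope = {"type": event_type, "ts": time.time(), **fields}
    return json.dumps(envelope)
```

Keeping the envelope flat and JSON-serializable is what makes a single dashboard across all creators possible: every module speaks the same minimal schema.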
Lessons: What Works, What I'd Change
What I'd Keep
- MCP abstraction: Tool standardization pays dividends every time you add a new integration
- Unified Scheduler: Single heartbeat prevents resource collisions and simplifies reasoning
- Filesystem discipline: Predictable structure makes everything from debugging to publishing trivial
- Bounded retries + checkpoints: Turns intermittent failures into normal operations
What I'd Change
- Event catalog: Add semantic event names with contracts (currently events are somewhat ad-hoc)
- Test-first validation: Expand pre-generation checks to reduce downstream retries
- Dependency tracking: Better understanding of which projects depend on shared resources (LoRA models, checkpoints)
Why This Matters to AI Engineering Teams
- Patterns reduce cognitive load: Teams reuse the same spine for new modules instead of reinventing orchestration
- Tool isolation (MCP) enables swapping: Replace Pandoc with a different converter without touching orchestrators
- Recovery-first thinking normalizes failure: Intermittent LLM errors become checkpoint-and-retry, not incident response
- Filesystem + metadata discipline accelerates audits: Find any artifact from any project in seconds, not queries
- Velocity validates architecture: Quiz Creator in hours proved the patterns worked better than any design review