I wanted Mneme to generate music with vocals: 30-60 second tracks for tutorials and audiobooks. After five weeks of wrestling with YuE's CUDA dependencies, checkpoint wrangling, and Windows compatibility battles, I finally produced a 14-second clip. It worked. The architecture proved viable. And then I sat there, listening to muddy vocals and looking at 18GB of checkpoints, and realized: This isn't the answer.
Was Music Creator a failure? No. I got it working. But it was certainly Mneme's least capable module-and paradoxically, one of the most valuable learning experiences. The real win wasn't music at all: it was Sound Creator, the short-form sound effects module that emerged from Music Creator's infrastructure and became an instant family favorite.
Chapter 1: YuE - The 14-Second Breakthrough
YuE (乐) promised semantic token generation followed by audio synthesis: a two-stage architecture that could turn lyrics into music with vocals on a single RTX 4080. The model was sophisticated, the theory sound, the implementation... complex.
The CUDA Hell
Getting YuE running was five weeks of dependency archaeology:
- FlashAttention2: No reliable Windows wheels-runtime errors every attempt. Solution: force SDPA (Scaled Dot-Product Attention) everywhere.
- PyTorch pinning: Installing unrelated packages pulled CPU-only PyTorch, breaking CUDA silently. Solution: pin exact versions (2.6.0+cu124), verify wheel metadata before every install.
- bitsandbytes int8: Inference failures on Windows. Solution: run fp16 for stability, accept the VRAM cost.
- Checkpoint chaos: 18GB across multiple files, some from HuggingFace, some from research repos. Missing files caused cryptic 'NoneType' object has no attribute 'seek' errors. Solution: manifest with SHA256 hashes, preflight validation, refuse to run without the complete set (sketched just below).
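Here's a minimal sketch of that preflight check: a JSON manifest mapping each checkpoint file to its expected SHA256, verified before any job is queued. The file names and layout are illustrative, not YuE's actual checkpoint structure.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so 18GB checkpoints never need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def validate_checkpoints(manifest_path: Path, checkpoint_dir: Path) -> None:
    """Refuse to run unless every expected checkpoint exists and matches its hash."""
    manifest = json.loads(manifest_path.read_text())  # {"stage1.safetensors": "<sha256>", ...}
    problems = []
    for name, expected in manifest.items():
        path = checkpoint_dir / name
        if not path.exists():
            problems.append(f"missing: {name}")
        elif sha256_of(path) != expected:
            problems.append(f"hash mismatch: {name}")
    if problems:
        raise RuntimeError("Checkpoint preflight failed:\n  " + "\n  ".join(problems))

# validate_checkpoints(Path("manifest.json"), Path("models/yue"))
```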
The stack that finally worked (a startup check for it is sketched just below):
- NVIDIA driver: latest compatible with CUDA
- CUDA toolkit: cu124
- PyTorch/Torchvision/Torchaudio: 2.6.0+cu124 / 0.21.0+cu124 / 2.6.0+cu124
- Attention: SDPA (not FlashAttention2)
- Precision: fp16 across both stages
- Checkpoints: 18GB total, validated by manifest
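And the companion check for the version pins: assert at startup that the PyTorch you imported is the CUDA build you pinned, before a 20-minute generation silently falls back to CPU. The version strings are the ones listed above; the check itself is a sketch, not Mneme's actual code.

```python
import torch

EXPECTED_TORCH = "2.6.0+cu124"   # pinned wheel, per the stack above
EXPECTED_CUDA = "12.4"           # CUDA runtime bundled with the cu124 wheels

def assert_pinned_stack() -> None:
    """Fail fast if an unrelated install swapped in a CPU-only or mismatched PyTorch."""
    if torch.__version__ != EXPECTED_TORCH:
        raise RuntimeError(f"torch {torch.__version__} != pinned {EXPECTED_TORCH}")
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA unavailable: a CPU-only wheel was probably pulled in")
    if torch.version.cuda != EXPECTED_CUDA:
        raise RuntimeError(f"CUDA runtime {torch.version.cuda} != pinned {EXPECTED_CUDA}")

assert_pinned_stack()
```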
The Moment It Worked
After pinning the stack, validating all checkpoints, and building a resumable ComfyUI workflow with artifact caching between stages, I clicked "Generate." Twenty minutes later, the progress bar completed. I played the audio.
14 seconds. Vocals. Music. It worked.
For about an hour, I felt like a genius. Then I listened again. The vocals were muddy. The instrumentation was vague. And I knew: this wasn't something I'd publish.
What the YuE experiment established:
- End-to-end viability on an RTX 4080 (16GB), but at operational cost
- Two-stage architecture with cached intermediates: resilient but complex (sketched below)
- Reproducibility via seeds and version pinning, when nothing broke
- 14-second maximum duration before VRAM exhaustion: not publishable
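The cached-intermediate pattern is roughly this: key each stage's output on the inputs that determine it, and skip the stage if the artifact already exists, so a stage-2 crash doesn't throw away the stage-1 run. The stage functions here are placeholders for the actual YuE model calls, and the cache layout is illustrative.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("artifacts")  # illustrative location

def stage1_semantic_tokens(lyrics: str, style: str, seed: int) -> bytes:
    """Placeholder for the YuE stage-1 call (lyrics/style -> semantic tokens)."""
    raise NotImplementedError

def stage2_synthesize(tokens_path: Path) -> bytes:
    """Placeholder for the YuE stage-2 call (semantic tokens -> audio)."""
    raise NotImplementedError

def cache_key(**inputs) -> str:
    """Stable key derived from everything that determines the output (including seed)."""
    blob = json.dumps(inputs, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]

def run_stage(name: str, key: str, produce) -> Path:
    """Return the cached artifact if present; otherwise run the stage and cache it."""
    out = CACHE_DIR / f"{name}-{key}.bin"
    if out.exists():
        return out  # resume: the expensive stage already ran
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    out.write_bytes(produce())
    return out

def generate(lyrics: str, style: str, seed: int) -> Path:
    key = cache_key(lyrics=lyrics, style=style, seed=seed)
    tokens = run_stage("stage1-tokens", key, lambda: stage1_semantic_tokens(lyrics, style, seed))
    return run_stage("stage2-audio", key, lambda: stage2_synthesize(tokens))
```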
Chapter 2: Stable Audio Open - Better, But Not Enough
I couldn't ship muddy 14-second clips. The YuE experiment succeeded at one thing: teaching me where the ceiling was. Now I needed a better solution.
I evaluated alternatives: Suno and Udio had great quality but were API-only with usage costs. MusicGen was simpler than YuE but still multi-stage with quality trade-offs. Then I found Stable Audio Open: single-stage, 47 seconds per clip, designed for ComfyUI, Apache 2.0 license.
Migration in Two Days
The infrastructure I'd built for YuE-prompt generation, project organization, WebSocket progress, artifact storage-worked perfectly for Stable Audio Open. I swapped the ComfyUI workflow, replaced 18GB of YuE checkpoints with a 1.5GB Stable Audio model, and ran a test.
47 seconds. Clear vocals. Actual instrumentation. Three times longer than YuE, and noticeably better quality.
The gains over YuE:
- Duration: 14s → 47s (3.4× longer)
- Quality: muddy → clear vocals
- Checkpoints: 18GB → 1.5GB
- CUDA pinning: required → standard PyTorch
- Reliability: two-stage failures → single-stage stability

What the migration reinforced:
- Operational simplicity matters as much as capability
- Quality improvements compound user value
- Infrastructure reuse accelerates iteration
- But 47 seconds still wasn't publishable for tutorials
Stable Audio Open was better, but it still wasn't the answer. Users wanted 30-60 second tracks that sounded professional. Segment stitching might extend duration, but quality remained the bottleneck. Music Creator worked-but it was still the least capable module in Mneme.
Chapter 3: Sound Creator - The Unexpected Win
While evaluating what to do about music, I noticed something in the Stable Audio documentation: it excelled at short-form audio. Not 30-60 second songs, but 1-5 second sound effects.
I thought about tutorials, e-books, and audiobooks. They didn't just need background music-they needed sound effects. Notification chimes. Transition swooshes. Section dividers. Ambient textures. The kind of audio that's expensive to license but trivial to describe.
I had 95% of the infrastructure already built from Music Creator. What if I pivoted?
Sound Creator: Built in 3 Days
Instead of five weeks (YuE) or two days (Stable Audio migration), Sound Creator took three days because the hard work was already done:
- Reused prompt generation pipeline (genre, mood, instrumentation → sound description)
- Reused ComfyUI integration (job queue, WebSocket progress, artifact storage)
- Reused project organization (sound libraries organized by type/mood)
- Added new feature: duration parameter (1-5 seconds, user-configurable)
- Added new feature: batch generation for sound effect libraries (see the sketch after this list)
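A rough sketch of how the reused prompt pipeline and the two new features fit together. The dataclass fields and the clamp range come straight from the description above; the names and structure are illustrative rather than Mneme's actual code, and the resulting payloads would be handed to the existing ComfyUI job queue.

```python
from dataclasses import dataclass

@dataclass
class SoundRequest:
    description: str      # e.g. "cheerful notification chime, bell-like"
    mood: str             # reuses Music Creator's mood vocabulary
    duration_s: float     # new: user-configurable duration

def build_prompt(req: SoundRequest) -> dict:
    """Turn a request into the prompt/parameter payload the workflow expects."""
    duration = min(max(req.duration_s, 1.0), 5.0)  # new duration parameter, clamped to 1-5s
    return {"prompt": f"{req.description}, {req.mood}", "seconds": duration}

def generate_library(requests: list[SoundRequest]) -> list[dict]:
    """New: batch generation -- queue an entire sound-effect library in one pass."""
    return [build_prompt(r) for r in requests]

# Example batch, matching the samples below:
library = generate_library([
    SoundRequest("cheerful notification chime, bell-like", "bright", 2),
    SoundRequest("deep cinematic whoosh", "rising tension", 3),
    SoundRequest("gentle rain ambiance", "peaceful", 5),
])
```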
Input: "Cheerful notification chime, bell-like, 2 seconds"
Output: Clean 2-second audio clip, perfect for UI transitions
Input: "Deep cinematic whoosh, rising tension, 3 seconds"
Output: Professional sound effect for tutorial section dividers
Input: "Gentle rain ambiance, peaceful, 5 seconds"
Output: Background texture for audiobook chapters
The Moment I Knew It Was Different
I showed Sound Creator to my daughters. With Music Creator and Stable Audio, they'd politely listened to 14-47 second clips and said "That's cool, Dad." With Sound Creator, they immediately started generating sounds: spaceship engines, cartoon boings, magical sparkles, explosion effects. They weren't being polite-they were playing.
One of them generated a perfect "level up" chime and asked if she could use it in a video project. That's when I knew: Sound Creator wasn't a consolation prize for Music Creator's limitations. It was better.
The Full Journey: What Worked, What Didn't, What Mattered
YuE:
- Duration: 14 seconds max
- Quality: Muddy vocals, vague instrumentation
- Development: 5 weeks of CUDA hell
- Checkpoints: 18GB to maintain
- User reaction: "Technically impressive"
- Verdict: Proof of concept, not shippable

Stable Audio Open:
- Duration: 47 seconds
- Quality: Clear vocals, real instrumentation
- Development: 2 days (infrastructure reuse)
- Checkpoints: 1.5GB, standard PyTorch
- User reaction: "Much better"
- Verdict: Better, but still not publishable

Sound Creator:
- Duration: 1-5 seconds (perfect for the use case)
- Quality: Professional-grade sound effects
- Development: 3 days (95% infrastructure reuse)
- Checkpoints: Same 1.5GB Stable Audio model
- User reaction: "Can I use this in my project?"
- Verdict: Instant family favorite, actually useful
Lessons: When Success Looks Different Than Planned
Music Creator taught me lessons I couldn't have learned any other way:
1. Engineering Success ≠ Product Success
YuE worked. I proved the two-stage architecture was viable on consumer hardware. But proving viability isn't the same as delivering value. The 14-second breakthrough was an important milestone-and a clear signal to pivot.
2. Infrastructure Compounds
Five weeks on YuE felt like a loss when I realized it wasn't shippable. But that infrastructure-ComfyUI integration, prompt generation, project organization, artifact storage-enabled the Stable Audio migration in 2 days and Sound Creator in 3 days. The investment wasn't wasted; it was foundational.
3. Users Define "Better"
I chased 30-60 second music clips because that's what I thought tutorials needed. My daughters showed me that 2-second sound effects were more useful, more fun, and more immediately valuable. Product direction isn't always obvious from technical capabilities.
4. Constraints Reveal Opportunities
Music Creator's limitations-duration ceiling, quality issues, operational complexity-forced me to ask: "What could this infrastructure do well?" The answer was sound effects, not songs. The constraint became the insight.
5. Know When to Pivot
I could have spent another five weeks trying to extend YuE to 30 seconds, or optimizing Stable Audio for longer clips. Instead, I asked: "What problem can I solve today with what I've built?" Sound Creator was the answer.
Technical Notes: What Carried Forward
For teams building similar systems, here's what transferred from YuE → Stable Audio → Sound Creator:
- ComfyUI patterns: Job queue, WebSocket progress, artifact storage-worked across all three implementations (see the sketch after this list)
- Prompt engineering: Style tags, mood descriptors, instrumentation hints-adapted easily from music to sound effects
- Project organization: Sound libraries as collections, organized by type/mood/duration-reused Music Creator's structure
- Validation pipeline: Duration checks, quality gates, file format verification-95% unchanged
- Observability: Timings, VRAM tracking, error taxonomy-invaluable for debugging all three systems
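To make the first bullet concrete, here's roughly what the queue-and-watch pattern looks like against a local ComfyUI instance, assuming its standard /prompt and /ws endpoints and the third-party requests and websocket-client packages. It's a sketch of the pattern, not Mneme's actual integration.

```python
import json
import uuid

import requests                            # pip install requests
from websocket import create_connection    # pip install websocket-client

SERVER = "127.0.0.1:8188"  # assumed local ComfyUI instance on its default port

def run_workflow(workflow: dict) -> str:
    """Queue a workflow graph, stream progress, and return the prompt id when done."""
    client_id = uuid.uuid4().hex
    ws = create_connection(f"ws://{SERVER}/ws?clientId={client_id}")  # connect before queueing
    resp = requests.post(
        f"http://{SERVER}/prompt",
        json={"prompt": workflow, "client_id": client_id},
        timeout=30,
    )
    prompt_id = resp.json()["prompt_id"]

    while True:
        raw = ws.recv()
        if isinstance(raw, bytes):       # binary preview frames; not needed for audio jobs
            continue
        msg = json.loads(raw)
        if msg.get("type") == "progress":
            data = msg["data"]
            print(f"progress: {data['value']}/{data['max']}")
        elif msg.get("type") == "executing":
            data = msg["data"]
            # node == None for our prompt id means the graph finished executing
            if data.get("node") is None and data.get("prompt_id") == prompt_id:
                break
    ws.close()
    return prompt_id  # outputs can then be fetched from /history/<prompt_id>
```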
Code Reuse by Numbers
- YuE implementation: 5 weeks, ~3,500 lines (including ComfyUI nodes)
- Stable Audio migration: 2 days, ~400 lines changed
- Sound Creator: 3 days, ~600 lines new (UI + batch features)
- Infrastructure carried forward: ~95%
What's Next
Sound Creator is shipping. Music Creator remains on hold-not abandoned, but waiting for the right technology to emerge. When a model arrives that can deliver 30-60 seconds of publishable-quality music on consumer hardware with reasonable operational overhead, the infrastructure is ready.
In the meantime, Sound Creator delivers immediate value:
- Tutorial section dividers and transitions
- Audiobook chapter markers and ambiance
- UI notification sounds for Mneme itself
- My daughters' video projects
Sometimes the best outcome isn't what you planned-it's what you learned along the way.
Key Takeaways for AI Teams
From the technical journey:
- Pin your dependencies ruthlessly: CUDA, PyTorch, model checkpoints-bleeding-edge models demand version discipline
- Validate assets preflight: 18GB checkpoint failures at runtime are expensive; catch them before the job runs
- Design for recovery: Stage separation, cached intermediates, resumable jobs-save iteration time when things break
- Prefer operational simplicity: SDPA over FlashAttention2, fp16 over int8-stability beats theoretical optimality
From the product journey:
- Proof-of-concept ≠ shippable: Making it work once doesn't mean you should ship it
- Listen to user reactions: "Technically impressive" vs "Can I use this?" tells you everything
- Infrastructure compounds: Five weeks on YuE enabled 2-day and 3-day pivots later
- Constraints reveal opportunities: What your system does well matters more than what you wanted it to do
- Know when to pivot: The goal is value, not validating your original plan