I wanted Mneme to generate music with vocals: 30-60 second tracks for tutorials and audiobooks. After five weeks of wrestling with YuE's CUDA dependencies, checkpoint wrangling, and Windows compatibility battles, I finally produced a 14-second clip. It worked. The architecture proved viable. And then I sat there, listening to muddy vocals and looking at 18GB of checkpoints, and realized: This isn't the answer.
Was Music Creator a failure? No. I got it working. But it was certainly Mneme's least capable module-and paradoxically, one of the most valuable learning experiences. The real win wasn't music at all: it was Sound Creator, the short-form sound effects module that emerged from Music Creator's infrastructure and became an instant family favorite.
Chapter 1: YuE - The 14-Second Breakthrough
YuE (乐) promised semantic token generation followed by audio synthesis: a two-stage architecture that could turn lyrics into music with vocals on a single RTX 4080. The model was sophisticated, the theory sound, the implementation... complex.
The CUDA Hell
Getting YuE running was five weeks of dependency archaeology:
- FlashAttention2: No reliable Windows wheels-runtime errors every attempt. Solution: force SDPA (Scaled Dot-Product Attention) everywhere.
- PyTorch pinning: Installing unrelated packages pulled CPU-only PyTorch, breaking CUDA silently. Solution: pin exact versions (2.6.0+cu124), verify wheel metadata before every install.
- bitsandbytes int8: Inference failures on Windows. Solution: run fp16 for stability, accept the VRAM cost.
- Checkpoint chaos: 18GB across multiple files, some from HuggingFace, some from research repos. Missing files caused cryptic 'NoneType' object has no attribute 'seek' errors. Solution: manifest with SHA256 hashes, preflight validation, refuse to run without the complete set (sketched just below).
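Here's a minimal sketch of that preflight check: a JSON manifest mapping each checkpoint file to its expected SHA256, verified before any job is queued. The file names and layout are illustrative, not YuE's actual checkpoint structure.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so 18GB checkpoints never need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def validate_checkpoints(manifest_path: Path, checkpoint_dir: Path) -> None:
    """Refuse to run unless every expected checkpoint exists and matches its hash."""
    manifest = json.loads(manifest_path.read_text())  # {"stage1.safetensors": "<sha256>", ...}
    problems = []
    for name, expected in manifest.items():
        path = checkpoint_dir / name
        if not path.exists():
            problems.append(f"missing: {name}")
        elif sha256_of(path) != expected:
            problems.append(f"hash mismatch: {name}")
    if problems:
        raise RuntimeError("Checkpoint preflight failed:\n  " + "\n  ".join(problems))

# validate_checkpoints(Path("manifest.json"), Path("models/yue"))
```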
The stack that finally worked (a startup check for it is sketched just below):
- NVIDIA driver: latest compatible with CUDA
- CUDA toolkit: cu124
- PyTorch/Torchvision/Torchaudio: 2.6.0+cu124 / 0.21.0+cu124 / 2.6.0+cu124
- Attention: SDPA (not FlashAttention2)
- Precision: fp16 across both stages
- Checkpoints: 18GB total, validated by manifest
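And the companion check for the version pins: assert at startup that the PyTorch you imported is the CUDA build you pinned, before a 20-minute generation silently falls back to CPU. The version strings are the ones listed above; the check itself is a sketch, not Mneme's actual code.

```python
import torch

EXPECTED_TORCH = "2.6.0+cu124"   # pinned wheel, per the stack above
EXPECTED_CUDA = "12.4"           # CUDA runtime bundled with the cu124 wheels

def assert_pinned_stack() -> None:
    """Fail fast if an unrelated install swapped in a CPU-only or mismatched PyTorch."""
    if torch.__version__ != EXPECTED_TORCH:
        raise RuntimeError(f"torch {torch.__version__} != pinned {EXPECTED_TORCH}")
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA unavailable: a CPU-only wheel was probably pulled in")
    if torch.version.cuda != EXPECTED_CUDA:
        raise RuntimeError(f"CUDA runtime {torch.version.cuda} != pinned {EXPECTED_CUDA}")

assert_pinned_stack()
```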
The Moment It Worked
After pinning the stack, validating all checkpoints, and building a resumable ComfyUI workflow with artifact caching between stages, I clicked "Generate." Twenty minutes later, the progress bar completed. I played the audio.
14 seconds. Vocals. Music. It worked.
For about an hour, I felt like a genius. Then I listened again. The vocals were muddy. The instrumentation was vague. And I knew: this wasn't something I'd publish.
What the YuE experiment established:
- End-to-end viability on an RTX 4080 (16GB), but at operational cost
- Two-stage architecture with cached intermediates: resilient but complex (sketched below)
- Reproducibility via seeds and version pinning, when nothing broke
- 14-second maximum duration before VRAM exhaustion: not publishable
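The cached-intermediate pattern is roughly this: key each stage's output on the inputs that determine it, and skip the stage if the artifact already exists, so a stage-2 crash doesn't throw away the stage-1 run. The stage functions here are placeholders for the actual YuE model calls, and the cache layout is illustrative.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("artifacts")  # illustrative location

def stage1_semantic_tokens(lyrics: str, style: str, seed: int) -> bytes:
    """Placeholder for the YuE stage-1 call (lyrics/style -> semantic tokens)."""
    raise NotImplementedError

def stage2_synthesize(tokens_path: Path) -> bytes:
    """Placeholder for the YuE stage-2 call (semantic tokens -> audio)."""
    raise NotImplementedError

def cache_key(**inputs) -> str:
    """Stable key derived from everything that determines the output (including seed)."""
    blob = json.dumps(inputs, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]

def run_stage(name: str, key: str, produce) -> Path:
    """Return the cached artifact if present; otherwise run the stage and cache it."""
    out = CACHE_DIR / f"{name}-{key}.bin"
    if out.exists():
        return out  # resume: the expensive stage already ran
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    out.write_bytes(produce())
    return out

def generate(lyrics: str, style: str, seed: int) -> Path:
    key = cache_key(lyrics=lyrics, style=style, seed=seed)
    tokens = run_stage("stage1-tokens", key, lambda: stage1_semantic_tokens(lyrics, style, seed))
    return run_stage("stage2-audio", key, lambda: stage2_synthesize(tokens))
```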
Chapter 2: Stable Audio Open - Better, But Not Enough
I couldn't ship muddy 14-second clips. The YuE experiment succeeded at one thing: teaching me where the ceiling was. Now I needed a better solution.
I evaluated alternatives: Suno and Udio had great quality but were API-only with usage costs. MusicGen was simpler than YuE but still multi-stage with quality trade-offs. Then I found Stable Audio Open: single-stage, 47 seconds per clip, designed for ComfyUI, Apache 2.0 license.
Migration in Two Days
The infrastructure I'd built for YuE-prompt generation, project organization, WebSocket progress, artifact storage-worked perfectly for Stable Audio Open. I swapped the ComfyUI workflow, replaced 18GB of YuE checkpoints with a 1.5GB Stable Audio model, and ran a test.
47 seconds. Clear vocals. Actual instrumentation. Three times longer than YuE, and noticeably better quality.
The gains over YuE:
- Duration: 14s → 47s (3.4× longer)
- Quality: muddy → clear vocals
- Checkpoints: 18GB → 1.5GB
- CUDA pinning: required → standard PyTorch
- Reliability: two-stage failures → single-stage stability

What the migration reinforced:
- Operational simplicity matters as much as capability
- Quality improvements compound user value
- Infrastructure reuse accelerates iteration
- But 47 seconds still wasn't publishable for tutorials
Stable Audio Open was better, but it still wasn't the answer. Users wanted 30-60 second tracks that sounded professional. Segment stitching might extend duration, but quality remained the bottleneck. Music Creator worked-but it was still the least capable module in Mneme.
Chapter 3: Sound Creator - The Unexpected Win
While evaluating what to do about music, I noticed something in the Stable Audio documentation: it excelled at short-form audio. Not 30-60 second songs, but 1-5 second sound effects.
I thought about tutorials, e-books, and audiobooks. They didn't just need background music-they needed sound effects. Notification chimes. Transition swooshes. Section dividers. Ambient textures. The kind of audio that's expensive to license but trivial to describe.
I had 95% of the infrastructure already built from Music Creator. What if I pivoted?
Sound Creator: Built in 3 Days
Instead of five weeks (YuE) or two days (Stable Audio migration), Sound Creator took three days because the hard work was already done:
- Reused prompt generation pipeline (genre, mood, instrumentation → sound description)
- Reused ComfyUI integration (job queue, WebSocket progress, artifact storage)
- Reused project organization (sound libraries organized by type/mood)
- Added new feature: duration parameter (1-5 seconds, user-configurable)
- Added new feature: batch generation for sound effect libraries (see the sketch after this list)
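A rough sketch of how the reused prompt pipeline and the two new features fit together. The dataclass fields and the clamp range come straight from the description above; the names and structure are illustrative rather than Mneme's actual code, and the resulting payloads would be handed to the existing ComfyUI job queue.

```python
from dataclasses import dataclass

@dataclass
class SoundRequest:
    description: str      # e.g. "cheerful notification chime, bell-like"
    mood: str             # reuses Music Creator's mood vocabulary
    duration_s: float     # new: user-configurable duration

def build_prompt(req: SoundRequest) -> dict:
    """Turn a request into the prompt/parameter payload the workflow expects."""
    duration = min(max(req.duration_s, 1.0), 5.0)  # new duration parameter, clamped to 1-5s
    return {"prompt": f"{req.description}, {req.mood}", "seconds": duration}

def generate_library(requests: list[SoundRequest]) -> list[dict]:
    """New: batch generation -- queue an entire sound-effect library in one pass."""
    return [build_prompt(r) for r in requests]

# Example batch, matching the samples below:
library = generate_library([
    SoundRequest("cheerful notification chime, bell-like", "bright", 2),
    SoundRequest("deep cinematic whoosh", "rising tension", 3),
    SoundRequest("gentle rain ambiance", "peaceful", 5),
])
```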
Input: "Cheerful notification chime, bell-like, 2 seconds"
Output: Clean 2-second audio clip, perfect for UI transitions
Input: "Deep cinematic whoosh, rising tension, 3 seconds"
Output: Professional sound effect for tutorial section dividers
Input: "Gentle rain ambiance, peaceful, 5 seconds"
Output: Background texture for audiobook chapters
The Moment I Knew It Was Different
I showed Sound Creator to my daughters. With Music Creator and Stable Audio, they'd politely listened to 14-47 second clips and said "That's cool, Dad." With Sound Creator, they immediately started generating sounds: spaceship engines, cartoon boings, magical sparkles, explosion effects. They weren't being polite-they were playing.
One of them generated a perfect "level up" chime and asked if she could use it in a video project. That's when I knew: Sound Creator wasn't a consolation prize for Music Creator's limitations. It was better.
The Full Journey: What Worked, What Didn't, What Mattered
YuE:
- Duration: 14 seconds max
- Quality: Muddy vocals, vague instrumentation
- Development: 5 weeks of CUDA hell
- Checkpoints: 18GB to maintain
- User reaction: "Technically impressive"
- Verdict: Proof of concept, not shippable

Stable Audio Open:
- Duration: 47 seconds
- Quality: Clear vocals, real instrumentation
- Development: 2 days (infrastructure reuse)
- Checkpoints: 1.5GB, standard PyTorch
- User reaction: "Much better"
- Verdict: Better, but still not publishable

Sound Creator:
- Duration: 1-5 seconds (perfect for the use case)
- Quality: Professional-grade sound effects
- Development: 3 days (95% infrastructure reuse)
- Checkpoints: Same 1.5GB Stable Audio model
- User reaction: "Can I use this in my project?"
- Verdict: Instant family favorite, actually useful
Lessons: When Success Looks Different Than Planned
Music Creator taught me lessons I couldn't have learned any other way:
1. Engineering Success ≠ Product Success
YuE worked. I proved the two-stage architecture was viable on consumer hardware. But proving viability isn't the same as delivering value. The 14-second breakthrough was an important milestone-and a clear signal to pivot.
2. Infrastructure Compounds
Five weeks on YuE felt like a loss when I realized it wasn't shippable. But that infrastructure-ComfyUI integration, prompt generation, project organization, artifact storage-enabled the Stable Audio migration in 2 days and Sound Creator in 3 days. The investment wasn't wasted; it was foundational.
3. Users Define "Better"
I chased 30-60 second music clips because that's what I thought tutorials needed. My daughters showed me that 2-second sound effects were more useful, more fun, and more immediately valuable. Product direction isn't always obvious from technical capabilities.
4. Constraints Reveal Opportunities
Music Creator's limitations-duration ceiling, quality issues, operational complexity-forced me to ask: "What could this infrastructure do well?" The answer was sound effects, not songs. The constraint became the insight.
5. Know When to Pivot
I could have spent another five weeks trying to extend YuE to 30 seconds, or optimizing Stable Audio for longer clips. Instead, I asked: "What problem can I solve today with what I've built?" Sound Creator was the answer.
Technical Notes: What Carried Forward
For teams building similar systems, here's what transferred from YuE → Stable Audio → Sound Creator:
- ComfyUI patterns: Job queue, WebSocket progress, artifact storage-worked across all three implementations (see the sketch after this list)
- Prompt engineering: Style tags, mood descriptors, instrumentation hints-adapted easily from music to sound effects
- Project organization: Sound libraries as collections, organized by type/mood/duration-reused Music Creator's structure
- Validation pipeline: Duration checks, quality gates, file format verification-95% unchanged
- Observability: Timings, VRAM tracking, error taxonomy-invaluable for debugging all three systems
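To make the first bullet concrete, here's roughly what the queue-and-watch pattern looks like against a local ComfyUI instance, assuming its standard /prompt and /ws endpoints and the third-party requests and websocket-client packages. It's a sketch of the pattern, not Mneme's actual integration.

```python
import json
import uuid

import requests                            # pip install requests
from websocket import create_connection    # pip install websocket-client

SERVER = "127.0.0.1:8188"  # assumed local ComfyUI instance on its default port

def run_workflow(workflow: dict) -> str:
    """Queue a workflow graph, stream progress, and return the prompt id when done."""
    client_id = uuid.uuid4().hex
    ws = create_connection(f"ws://{SERVER}/ws?clientId={client_id}")  # connect before queueing
    resp = requests.post(
        f"http://{SERVER}/prompt",
        json={"prompt": workflow, "client_id": client_id},
        timeout=30,
    )
    prompt_id = resp.json()["prompt_id"]

    while True:
        raw = ws.recv()
        if isinstance(raw, bytes):       # binary preview frames; not needed for audio jobs
            continue
        msg = json.loads(raw)
        if msg.get("type") == "progress":
            data = msg["data"]
            print(f"progress: {data['value']}/{data['max']}")
        elif msg.get("type") == "executing":
            data = msg["data"]
            # node == None for our prompt id means the graph finished executing
            if data.get("node") is None and data.get("prompt_id") == prompt_id:
                break
    ws.close()
    return prompt_id  # outputs can then be fetched from /history/<prompt_id>
```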
Code Reuse by Numbers
- YuE implementation: 5 weeks, ~3,500 lines (including ComfyUI nodes)
- Stable Audio migration: 2 days, ~400 lines changed
- Sound Creator: 3 days, ~600 lines new (UI + batch features)
- Infrastructure carried forward: ~95%
What's Next
Sound Creator is shipping. Music Creator remains on hold-not abandoned, but waiting for the right technology to emerge. When a model arrives that can deliver 30-60 seconds of publishable-quality music on consumer hardware with reasonable operational overhead, the infrastructure is ready.
In the meantime, Sound Creator delivers immediate value:
- Tutorial section dividers and transitions
- Audiobook chapter markers and ambiance
- UI notification sounds for Mneme itself
- My daughters' video projects
Sometimes the best outcome isn't what you planned-it's what you learned along the way.
Key Takeaways for AI Teams
From the technical journey:
- Pin your dependencies ruthlessly: CUDA, PyTorch, model checkpoints-bleeding-edge models demand version discipline
- Validate assets preflight: 18GB checkpoint failures at runtime are expensive; catch them before the job runs
- Design for recovery: Stage separation, cached intermediates, resumable jobs-save iteration time when things break
- Prefer operational simplicity: SDPA over FlashAttention2, fp16 over int8-stability beats theoretical optimality
From the product journey:
- Proof-of-concept ≠ shippable: Making it work once doesn't mean you should ship it
- Listen to user reactions: "Technically impressive" vs "Can I use this?" tells you everything
- Infrastructure compounds: Five weeks on YuE enabled 2-day and 3-day pivots later
- Constraints reveal opportunities: What your system does well matters more than what you wanted it to do
- Know when to pivot: The goal is value, not validating your original plan