Building Mneme, Part 3: Weller Davis - Voice Cloning, Audiobooks, and Learning to Publish

After E-book Creator raised the quality bar, the next obvious step was audio. Could Mneme turn any long-form manuscript into a clean, listenable audiobook-locally, with a consistent voice? That question led to three outcomes: a local voice cloning pipeline using ChatterboxTTS, the creation of Weller Davis as a publishing persona that helped me overcome my introvert's resistance to self-promotion, and an honest lesson about when to wait for quality instead of shipping incomplete work.

TL;DR - I built a local TTS + voice cloning pipeline with ChatterboxTTS (requiring just 15 seconds of audio to clone a voice). The technical win unlocked a pragmatic publishing loop where Weller Davis became my professional identity for testing AI-generated content in public. The hardest lesson: knowing when quality is "good enough" vs. when you need to wait-we're holding audiobooks until SSML support adds the expression they deserve.

Why Audio, Why Local, and Why Voice Cloning

Format parity: If e-books are good enough to publish, an audiobook version multiplies reach and reuse. Text + audio serves different consumption modes.
Local-first constraints: Privacy, cost control, and running on my own hardware-consistent with Mneme's philosophy from Part 1.
Consistency: A single "author voice" improves brand coherence across a growing catalog. Every e-book narrated by the same voice builds recognition.
Extensibility: Giving multiple personas their own voices opens up possibilities (podcasts with multi-voice conversations, tutorial narration, character voices for comics).

But there was another reason, more personal: I needed to test if Mneme's content had real value, not in a vacuum but in the real world where people decide if something is worth their time.

The Birth of Weller Davis (and Why Hiding Helped)

I'm an introvert. The idea of building a personal brand made me uncomfortable. Putting my face on content, self-promoting on LinkedIn, asking people to read my work-all of it felt like performative nonsense I wanted no part of.

But I had a problem: I couldn't measure Mneme's value without publishing something. "Does this e-book teach anyone anything?" isn't a question you can answer in private. I needed readers, feedback, and signal-but I wasn't seeking personal publicity.

Enter Weller Davis-a professional author persona backed by real work. Not a pseudonym to hide behind, but a professional identity I could stand behind. The content was AI-assisted, but I curated, validated, and published it. Weller Davis became the brand; I became the builder.

Publishing Footprint
• Website: wellerdavis.com (canonical home for blogs, e-books, technical content)
• LinkedIn: Weller Davis page (professional presence, limited automation)
• Distribution: KDP for e-book publication, blog content for traffic
• Voice: "Virtual Me" audiobook narrator (ChatterboxTTS voice clone)

This separation gave me freedom to experiment. If a blog post flopped, that was Weller Davis learning-not me failing publicly. If an e-book got traction, I could observe what worked without tying my personal identity to every piece.

It also helped with confidence. Publishing content to LinkedIn is stressful in the beginning. Every post feels like shouting into the void, wondering if you're saying something valuable or just adding to the noise. Posting as Weller Davis gave me psychological distance. Do it for a while, and you get more confident. Weller helped me build that muscle.

Voice Cloning: The Technical Reality

To make audiobooks consistent, I needed a narrator voice. Not a generic TTS robot-something recognizable, human, mine. That meant voice cloning.

ChatterboxTTS: 15 Seconds to Clone a Voice

I chose ChatterboxTTS v0.1.4, a Python package for local text-to-speech with voice cloning. The appeal: it runs entirely local (MPS on Mac, CUDA on PC, or CPU fallback), requires minimal audio samples, and gives you full control over the voice model.

The process was surprisingly simple:

Voice Cloning Workflow
1. Record audio sample: 15 seconds of clean speech (we used a bit more to be safe)
2. Train voice profile: ChatterboxTTS extracts voice characteristics from the sample
3. Generate speech: Pass text + voice profile → get audio in your cloned voice
4. Control parameters: `exaggeration` (expressiveness) and `cfg_weight` (guidance strength)

The hardest part? Finding a decent microphone. I dug through drawers until I found an old USB mic from a forgotten Skype setup. Cleaned it off, tested it, good enough. That's the beauty of local-first: you don't need studio-quality equipment-just "clean enough for the model to learn from."

The First Listen: Uncanny and Humbling

The first time I heard "Virtual Me" read back one of my e-book chapters, it was uncanny-and a little uncomfortable. The cadence was mine. The slight pauses where I'd naturally breathe. The way I'd emphasize certain words. It was me, but not me.

But it also revealed something I hadn't anticipated: listening to your own voice read your AI-generated content is humbling. Every awkward sentence structure becomes obvious. Every paragraph that meanders gets exposed. When you read silently, your brain smooths over rough spots. When you hear it in your own voice, there's nowhere to hide.

I revised three chapters after that first listen. Not because the content was wrong, but because it didn't sound right when spoken aloud. That feedback loop-write, generate audio, listen, revise-became a quality gate I hadn't planned for.

The Audiobook Pipeline: ChatterboxTTS + Resumable Generation

The end-to-end flow handles real-world weirdness: punctuation, headings, code blocks, long chapters. It needed to be resumable-generating a 40-minute audiobook can fail halfway through, and I didn't want to start over.

1) Manuscript Preparation
• Clean chapter/section boundaries
• Escape code blocks / inline math
• Remove figure placeholders or convert to narrated captions
• Normalize punctuation for TTS cadence

2) Voice Model
• ChatterboxTTS voice clone from 15-sec sample
• Voice profile stored in MongoDB
• Parameters: exaggeration (expressiveness), cfg_weight (guidance)
• Supports multiple personas/voices

3) Section-Based Generation
• Chunk text into 500-char sections
• Generate each section independently
• Cache generated audio (content-addressable)
• Resume from last completed section on failure

4) Assembly & Post-Processing
• Concatenate section audio files
• Normalize loudness (RMS leveling)
• Optional fade-in/out for chapters
• Export MP3 with metadata

Resumable Generation: Why It Matters

Generating a full audiobook takes time-20 to 40 minutes for a typical e-book on my M4 Max. If generation fails at 80%, I don't want to restart from scratch. Mneme's section-based approach saves each 500-char chunk as it's generated. If the process crashes, it resumes from the last completed section.

This also enables content-addressable caching. If the same sentence appears in multiple e-books ("In this chapter, we'll explore..."), Mneme reuses the cached audio. Over time, common phrases become instant-no regeneration needed.

The Quality Challenge: When Good Enough Isn't

ChatterboxTTS produced clean, intelligible audio. The voice clone was recognizable. Pronunciation was mostly correct (with a few hiccups on technical terms). For a local, privacy-preserving TTS system, it was impressive.

But it lacked expression.

ChatterboxTTS has built-in `exaggeration` controls to adjust expressiveness, but it's a single knob-not fine-grained enough for narrative flow. Reading a technical paragraph requires a different tone than reading an example or a cautionary warning. That's where SSML (Speech Synthesis Markup Language) comes in-it lets you annotate text with pacing cues, emphasis, pauses, and tone shifts.

Example: SSML for Expression

<speak>
  This is a <emphasis level="strong">critical</emphasis> concept.
  <break time="500ms"/>
  Let me repeat: <prosody rate="slow">critical</prosody>.
</speak>

We built the infrastructure for SSML generation-Mneme's SSML service can annotate e-book text with appropriate markup-but ChatterboxTTS doesn't support SSML input yet. Adding that support is on the TODO list, but it's non-trivial work.

So we made a decision: wait.

The Decision to Wait

We could have published audiobooks without SSML. The quality was fine-good enough that people could listen and learn. But "fine" isn't the bar I wanted to clear. If I'm asking someone to spend 2 hours listening to an audiobook, it should be engaging, not just correct.

This decision echoes the YuE → Stable Audio migration from Part 5: sometimes the engineering works, but the product isn't ready. Shipping incomplete work damages trust more than waiting for quality.

We're holding audiobooks until SSML support lands in ChatterboxTTS (or we implement it ourselves). In the meantime, the pipeline is ready, the voice clones work, and we've validated the technical approach. When expression catches up to accuracy, we'll ship.

First Listeners: My Girls and the Blog Posts

Even without publishing audiobooks publicly, we tested the pipeline at home. I generated audio versions of blog posts and let my daughters listen to "their voices" reading Weller Davis content. (I'd cloned their voices too-15 seconds each, same process.)

They thought it was hilarious. Hearing themselves narrate a blog post about API design or continual learning felt like magic to them. E-books were too long to hold their attention, but short blog posts? Perfect.

That reaction validated the idea even if we weren't ready to ship the product. Voice cloning works. The pipeline works. The engagement is real. We're just waiting for the last piece-expression-to make it worth publishing.

The Publishing Loop: Real Readers, Real Feedback

To measure Mneme's value, I published e-books and blogs under Weller Davis and promoted them via LinkedIn. The workflow was manual (LinkedIn API limits made automation impractical), but that was fine-manual posting forced me to engage, not just broadcast.

What I Learned About Publishing AI-Generated Content

Transparency matters: I don't hide that Mneme assists with content generation. The value is in curation, validation, and quality-not pretending it's 100% human-written.
Narrow beats broad: Posts targeting intermediate practitioners (specific, actionable advice) performed better than generic "intro to X" content.
Checklists and pitfalls win: Readers engage more with concrete, steal-able advice than abstract principles.
Consistency builds trust: Publishing regularly (even imperfectly) signals commitment. Weller Davis became recognizable through steady output.
Confidence is a muscle: The first few posts felt like shouting into the void. By post 20, I stopped caring about vanity metrics and focused on whether the content helped someone.

LinkedIn: Stressful at First, Easier Over Time

Publishing under Weller Davis helped me build confidence I wouldn't have had posting as myself. Early on, every post felt high-stakes: "Is this good enough? Will people think it's spam? Am I just adding to the noise?"

But here's the thing about putting work out there: you get used to it. The first post is terrifying. The tenth post is routine. By the twentieth, you've internalized that some posts land and some don't-and that's fine. The metric that matters isn't likes; it's whether someone learned something useful.

Weller Davis gave me permission to fail without it feeling personal. That separation was key.

A Note on Ethics and Disclosure

Weller Davis is a consistent publishing identity across Mneme-generated content-not a deception, but a commitment to transparency and quality. Behind every published piece is rigorous human curation: I spend substantial time reading, validating, fact-checking, and refining before publication.

AI generates the draft; I ensure it meets editorial standards and provides genuine value. This hybrid approach acknowledges AI's role while maintaining accountability for quality. The byline signals: "This was AI-assisted, but it was vetted by a real human who stands behind it."

That's the contract with readers. Weller Davis isn't a mask-it's a brand that represents quality, consistency, and honesty about process.

Integrating with Mneme's Creator Pattern

Audiobook generation is an artifact stage on top of E-book Creator. The orchestration follows Mneme's unified pattern (Topic → Project → Workflow → Validation → Artifacts):

E-book Project (completed)
  → Voice Profile Selection (persona or custom clone)
  → SSML Annotation Service (adds expression markup)
  → TTS Service (ChatterboxTTS with voice clone)
    → Section-Based Generation (500-char chunks, cached)
    → Resume on Failure (from last completed section)
  → Audio Assembly (concatenate sections, normalize)
  → Post-Process (loudness, fades, metadata)
  → Packaging (MP3 + chapter map + cover art)
  → Publishing (when SSML support is ready)

Results and Current Status

What Works

Voice cloning with 15-sec samples (ChatterboxTTS)
Resumable section-based generation
Content-addressable caching for reuse
Local inference on M4 Max (MPS) and RTX 4080 (CUDA)
Multiple voice profiles (Weller, kids' voices, future personas)

What's Pending

SSML support in ChatterboxTTS for expression
Public audiobook release (waiting for SSML)
Longer-form audiobooks (currently tested to ~40 min)
Multi-voice narration (personas in dialogue)

Publishing Metrics (E-books + Blogs)

Engagement: Meaningful interactions on LinkedIn posts (quality over quantity).
Revenue: Small but real sales from published e-books via KDP (trickle, not flood).
Feedback: Readers preferred actionable chapters with checklists and pitfalls (validates editorial contract from Part 2).
Confidence: Built publishing muscle through Weller Davis-no longer intimidated by putting work out there.

What's Next for Audio

Add SSML support to ChatterboxTTS (or implement a wrapper that handles it)
Publish first audiobooks when expression quality meets the bar
Multi-voice podcasts: Personas in conversation (casey + priya discussing architecture, etc.)
Per-chapter samples: Web-embedded waveform previews (client-side, no server CPU)
Voice style transfer: Alternate emotional tones for different content types (calm for meditations, energetic for tutorials)

Lessons: When to Ship, When to Wait

The audiobook pipeline taught me something I didn't expect: knowing when quality is "good enough" vs. when you need to wait is a skill.

E-books with editorial contracts? Ship when they clear the checklist (examples, pitfalls, actionable advice).
Audiobooks without expression? Wait. Monotone narration for 2 hours isn't engaging, no matter how accurate.
Publishing as Weller Davis? Ship early and often. Confidence comes from repetition, not perfection.

The technical wins-voice cloning, resumable generation, caching-proved the concept. The product decision-waiting for SSML-honored the reader's experience. Both matter.