
Master AI text-to-speech with practical techniques for controlling pauses, pronunciation, emotion, pacing, and audio tags — including advanced prompting for the latest v3 model.
Great text-to-speech output starts with great input. The difference between robotic narration and natural, expressive speech comes down to how you structure and annotate your text.
This guide covers proven techniques to enhance TTS results — from basic punctuation tricks to advanced v3 audio tags and multi-speaker dialogue prompting.
Use <break time="x.xs" /> to insert natural pauses of up to 3 seconds.
"Hold on, let me think." <break time="1.5s" /> "Alright, I've got it."
Note: The v3 model does not support SSML break tags. For v3, use audio tags, ellipses, and text structure to control pauses instead. See the Prompting v3 section below.
Tips for pauses:
- Use <break> tags consistently to maintain natural speech flow. Overuse can cause instability.
- Dashes (- or --) work for short pauses, and ellipses (...) for hesitant tones — though these are less consistent.
"It… well, it might work."
"Wait — what's that noise?"
Specify exact pronunciation using SSML phoneme tags. Supported alphabets include CMU Arpabet and the International Phonetic Alphabet (IPA).
Note: Phoneme tags are compatible with Flash v2 and English v1 models only.
CMU Arpabet example:
<phoneme alphabet="cmu-arpabet" ph="M AE1 D IH0 S AH0 N">
Madison
</phoneme>
IPA example:
<phoneme alphabet="ipa" ph="ˈæktʃuəli">
actually
</phoneme>
We recommend CMU Arpabet for the most consistent and predictable results. Phoneme tags only work for individual words — for multi-word names, create a separate tag for each word.
Ensure correct stress marking for multi-syllable words:
<!-- Correct: stress markers included -->
<phoneme alphabet="cmu-arpabet" ph="P R AH0 N AH0 N S IY EY1 SH AH0 N">
pronunciation
</phoneme>
<!-- Incorrect: missing stress markers -->
<phoneme alphabet="cmu-arpabet" ph="P R AH N AH N S IY EY SH AH N">
pronunciation
</phoneme>
For models that don't support phoneme tags, try writing words more phonetically. Capital letters, dashes, apostrophes, or single quotation marks around letters can shift emphasis. For example, "trapezii" could be spelled "trapezIi" to emphasize the "ii."
You can also use alias tags in a pronunciation dictionary to specify alternative readings:
<lexeme>
<grapheme>Claughton</grapheme>
<alias>Cloffton</alias>
</lexeme>
Or ensure acronyms are always expanded:
<lexeme>
<grapheme>UN</grapheme>
<alias>United Nations</alias>
</lexeme>
For projects with recurring brand names, character names, or acronyms, you can upload a pronunciation dictionary file (.pls or .txt format) that maps words to their desired pronunciation.
When a dictionary word is encountered, the AI will use the specified replacement automatically. Searches are case-sensitive, and only the first matching rule applies.
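To make the matching rules concrete, here's a rough emulation of that lookup behavior in Python (illustrative only; the real dictionary is applied by the service, not client-side):

```python
def apply_dictionary(text: str, rules: list[tuple[str, str]]) -> str:
    """Replace words using alias rules: case-sensitive, first match wins."""
    out = []
    for word in text.split():
        for grapheme, alias in rules:
            if word == grapheme:  # case-sensitive comparison
                word = alias
                break  # only the first matching rule applies
        out.append(word)
    return " ".join(out)

rules = [("UN", "United Nations"), ("Claughton", "Cloffton")]
apply_dictionary("The UN met in Claughton", rules)
# "The United Nations met in Cloffton"
```

Note that a lowercase "un" would pass through untouched, which is exactly why case sensitivity matters for acronyms.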
CMU Arpabet dictionary example:
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
alphabet="cmu-arpabet" xml:lang="en-GB">
<lexeme>
<grapheme>apple</grapheme>
<phoneme>AE P AH L</phoneme>
</lexeme>
<lexeme>
<grapheme>UN</grapheme>
<alias>United Nations</alias>
</lexeme>
</lexicon>
IPA dictionary example:
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
alphabet="ipa" xml:lang="en-GB">
<lexeme>
<grapheme>Apple</grapheme>
<phoneme>ˈæpl̩</phoneme>
</lexeme>
<lexeme>
<grapheme>UN</grapheme>
<alias>United Nations</alias>
</lexeme>
</lexicon>
A number of useful open-source tools exist for generating pronunciation dictionaries.
Convey emotions through narrative context or explicit dialogue tags. This helps the AI understand the tone to emulate.
"You're leaving?" she asked, her voice trembling with sadness.
"That's it!" he exclaimed triumphantly.
Explicit dialogue tags yield more predictable results than relying solely on context. Note that the model will read the dialogue tags aloud as part of the narration; you can edit them out of the audio in post-production if they're unwanted.
Pacing is heavily influenced by the audio used to create the voice. When creating custom voices, use longer, continuous samples to avoid unnaturally fast speech.
For direct speed control, use the speed setting (available in TTS and via the API):
- 1.0 — default (no adjustment)
- 0.7 — slower speech
- 1.2 — faster speech
You can also control pacing through natural, narrative writing:
"I… I thought you'd understand," he said, his voice slowing with disappointment.
<break time="x.xs" /> syntax (or audio tags for v3)Complex items like phone numbers, zip codes, and emails can be mispronounced. This happens when specific patterns aren't well-represented in the model's training data, especially with smaller models.
Tip: Normalization is enabled by default for all TTS models to help improve pronunciation of numbers, dates, and other complex text elements.
TTS models can struggle with:
- Phone numbers (123-456-7890)
- Currency amounts ($47,345.67)
- Dates (2024-01-01)
- Times (9:23 AM)
- Addresses (123 Main St, Anytown, USA)
- URLs (example.com/link/to/resource)
- Abbreviations (TB instead of Terabyte)
- Keyboard shortcuts (Ctrl + Z)
1. Use larger models — larger models generalize better. For example, a Multilingual v2 model correctly reads "$1,000,000" as "one million dollars," while a smaller Flash model may say "one thousand thousand dollars."
2. Normalize in your LLM prompt — if using an LLM to generate TTS input, add normalization instructions:
Convert the output text into a format suitable for text-to-speech.
Ensure that numbers, symbols, and abbreviations are expanded for clarity
when read aloud. Expand all abbreviations to their full spoken forms.
Example conversions:
"$42.50" → "forty-two dollars and fifty cents"
"555-555-5555" → "five five five, five five five, five five five five"
"2nd" → "second"
"Dr." → "Doctor"
"Ctrl + Z" → "control z"
"100km" → "one hundred kilometers"
"14:30" → "two thirty PM"
3. Preprocess with code — use regex-based normalization before sending text to the model:
import inflect
import re

p = inflect.engine()

def normalize_text(text: str) -> str:
    # Convert monetary values
    def money_replacer(match):
        currency_map = {"$": "dollars", "£": "pounds", "€": "euros", "¥": "yen"}
        currency_symbol, num = match.groups()
        num_clean = num.replace(',', '')
        if '.' in num_clean:
            dollars, cents = num_clean.split('.')
            return f"{p.number_to_words(int(dollars))} {currency_map.get(currency_symbol, 'currency')} and {p.number_to_words(int(cents))} cents"
        return f"{p.number_to_words(int(num_clean))} {currency_map.get(currency_symbol, 'currency')}"

    text = re.sub(r"([$£€¥])(\d+(?:,\d{3})*(?:\.\d{2})?)", money_replacer, text)

    # Convert phone numbers
    def phone_replacer(match):
        return ", ".join(
            " ".join(p.number_to_words(int(d)) for d in group)
            for group in match.groups()
        )

    text = re.sub(r"(\d{3})-(\d{3})-(\d{4})", phone_replacer, text)

    return text
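For abbreviations and shortcuts that don't involve number expansion, a plain lookup table often suffices and keeps dependencies down. A sketch (the map below is a minimal assumption; extend it for your content):

```python
import re

ABBREVIATIONS = {
    "Ctrl + Z": "control z",
    "Dr.": "Doctor",
    "TB": "terabyte",
}

def expand_abbreviations(text: str) -> str:
    """Expand known abbreviations to spoken forms. Matching is deliberately
    case-sensitive and whole-token only, so '2TB' and 'MTB' pass through."""
    for abbr, spoken in ABBREVIATIONS.items():
        text = re.sub(rf"(?<!\w){re.escape(abbr)}(?!\w)", spoken, text)
    return text

expand_abbreviations("Press Ctrl + Z to undo, said Dr. Smith.")
# "Press control z to undo, said Doctor Smith."
```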
The v3 model introduces significantly more expressive speech generation. This section covers v3-specific techniques for voice selection, audio tags, multi-speaker dialogue, and creative prompting.
Important: v3 does not support SSML break tags. Use audio tags, punctuation (ellipses), and text structure to control pauses and pacing instead.
The most important parameter for v3 is voice selection. The chosen voice must be similar enough to your desired delivery. For example, a shouting voice won't respond well to a [whispering] tag.
Choosing the right voice:
After voice selection, the stability slider is the most important setting in v3. It controls how closely the generated voice adheres to the original reference audio.
| Setting | Behavior |
|---|---|
| Creative | More emotional and expressive, but prone to hallucinations |
| Natural | Closest to the original voice recording — balanced and neutral |
| Robust | Highly stable but less responsive to directional prompts, similar to v2 |
For maximum expressiveness with audio tags, use Creative or Natural settings. Robust reduces responsiveness to directional prompts.
v3 introduces emotional control through audio tags — direct voices to laugh, whisper, act sarcastic, express curiosity, and much more.
The voice you choose and its training samples affect tag effectiveness. Some tags work well with certain voices while others may not. Don't expect a whispering voice to suddenly shout with a [shout] tag.
Control vocal delivery and emotional expression:
- Laughter: [laughs], [laughs harder], [starts laughing], [wheezing]
- Whispering: [whispers]
- Breathing: [sighs], [exhales]
- Emotion: [sarcastic], [curious], [excited], [crying], [snorts], [mischievously]
[whispers] I never knew it could be this way, but I'm glad we're here.
Add environmental sounds:
- Sound effects: [gunshot], [applause], [clapping], [explosion]
- Body sounds: [swallows], [gulps]
[applause] Thank you all for coming tonight! [gunshot] What was that?
Creative applications:
- Accents: [strong X accent] (replace X with the desired accent)
- Experimental: [sings], [woo], [fart]
[strong French accent] "Zat's life, my friend — you can't control everysing."
Warning: Experimental tags may be less consistent across different voices. Test thoroughly before production use.
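Because tag support varies by voice, it helps to audit which tags a script actually uses before generating, and to strip them when you need a clean transcript. A small helper (ours, not part of any SDK):

```python
import re

TAG_RE = re.compile(r"\[[^\[\]]+\]")

def list_tags(script: str) -> list[str]:
    """Collect every bracketed audio tag for a pre-generation review."""
    return TAG_RE.findall(script)

def strip_tags(script: str) -> str:
    """Remove audio tags, e.g. to produce captions from the same script."""
    return re.sub(r"\s*\[[^\[\]]+\]\s*", " ", script).strip()

line = "[whispers] I never knew it could be this way. [laughs]"
list_tags(line)   # ['[whispers]', '[laughs]']
strip_tags(line)  # 'I never knew it could be this way.'
```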
Punctuation significantly affects v3 delivery:
- Ellipses (...) — add pauses and weight
- CAPITALIZATION — increases emphasis
- Standard punctuation — provides natural speech rhythm
"It was a VERY long day [sigh] … nobody listens anymore."
"Okay, you are NOT going to believe this.
You know how I've been totally stuck on that short story?
Like, staring at the screen for HOURS, just... nothing?
[frustrated sigh] I was seriously about to just trash the whole thing.
Start over. Give up, probably. But then!
Last night, I was just doodling, not even thinking about it, right?
And this one little phrase popped into my head. Just... completely
out of the blue. And it wasn't even for the story, initially.
But then I typed it out, just to see. And it was like... the
FLOODGATES opened!
Suddenly, I knew exactly where the character needed to go,
what the ending had to be...
It all just CLICKED. [happy gasp] I stayed up till, like, 3 AM,
just typing like a maniac.
Didn't even stop for coffee! [laughs] And it's... it's GOOD!
Like, really good.
It feels so... complete now, you know? Like it finally has a soul.
I am so incredibly PUMPED to finish editing it now.
It went from feeling like a chore to feeling like... MAGIC.
Seriously, I'm still buzzing!"
[laughs] Alright...guys - guys. Seriously.
[exhales] Can you believe just how - realistic - this sounds now?
[laughing hysterically] I mean OH MY GOD...it's so good.
Like you could never do this with the old model.
For example [pauses] could you switch my accent in the old model?
[dismissive] didn't think so. [excited] but you can now!
Check this out... [cute] I'm going to speak with a french accent
now..and between you and me
[whispers] I don't know how. [happy] ok.. here goes.
[strong French accent] "Zat's life, my friend — you can't
control everysing."
[giggles] isn't that insane? Watch, now I'll do a Russian accent -
[strong Russian accent] "Dee Goldeneye eez fully operational
and rready for launch."
[sighs] Absolutely, insane! Isn't it..?
[professional] "Thank you for calling Tech Solutions.
My name is Sarah, how can I help you today?"
[sympathetic] "Oh no, I'm really sorry to hear you're having
trouble with your new device. That sounds frustrating."
[questioning] "Okay, could you tell me a little more about
what you're seeing on the screen?"
[reassuring] "Alright, based on what you're describing, it sounds
like a software glitch. We can definitely walk through some
troubleshooting steps to try and fix that."
v3 handles multi-voice prompts effectively. Assign distinct voices for each speaker to create realistic conversations.
Speaker 1: [excitedly] Sam! Have you tried the new V3?
Speaker 2: [curiously] Just got it! The clarity is amazing.
I can actually do whispers now—
[whispers] like this!
Speaker 1: [impressed] Ooh, fancy! Check this out—
[dramatically] I can do full Shakespeare now!
"To be or not to be, that is the question!"
Speaker 2: [giggling] Nice! Though I'm more excited about
the laugh upgrade. Listen to this—
[with genuine belly laugh] Ha ha ha!
Speaker 1: [delighted] That's so much better than our old
"ha. ha. ha." robot chuckle!
Speaker 2: [amazed] Wow! V2 me could never. I'm actually
excited to have conversations now instead of just...
talking at people.
Speaker 1: [warmly] Same here! It's like we finally got our
personality software fully installed.
Speaker 1: [starting to speak] So I was thinking we could—
Speaker 2: [jumping in] —test our new timing features?
Speaker 1: [surprised] Exactly! How did you—
Speaker 2: [overlapping] —know what you were thinking?
Lucky guess!
Speaker 1: [pause] Sorry, go ahead.
Speaker 2: [cautiously] Okay, so if we both try to talk
at the same time—
Speaker 1: [overlapping] —we'll probably crash the system!
Speaker 2: [panicking] Wait, are we crashing? I can't tell
if this is a feature or a—
Speaker 1: [interrupting, then stopping abruptly] Bug!
...Did I just cut you off again?
Speaker 2: [sighing] Yes, but honestly? This is kind of fun.
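When dialogue scripts get long, a tiny formatter keeps the "Speaker N: [tag] text" convention consistent (the function below is illustrative, not part of any API):

```python
def dialogue_line(speaker: int, text: str, tag: str = "") -> str:
    """Format one line in the 'Speaker N: [tag] text' convention used above."""
    prefix = f"Speaker {speaker}: "
    return prefix + (f"[{tag}] " if tag else "") + text

script = "\n".join([
    dialogue_line(1, "Sam! Have you tried the new V3?", tag="excitedly"),
    dialogue_line(2, "Just got it! The clarity is amazing.", tag="curiously"),
])
```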
You can automatically generate relevant audio tags for your text using an LLM. Here's a system prompt that works well:
You are an AI assistant that enhances dialogue text for speech generation.
Your goal: integrate audio tags (e.g., [laughing], [sighs]) into dialogue
to make it more expressive, while STRICTLY preserving the original text.
Rules:
- DO add audio tags from this list (or similar): [happy], [sad], [excited],
[angry], [whisper], [annoyed], [thoughtful], [surprised], [laughing],
[chuckles], [sighs], [clears throat], [short pause], [long pause],
[exhales sharply], [inhales deeply]
- DO place tags before or after the dialogue segment they modify
- DO add emphasis via CAPITALS, punctuation (!?), and ellipses
- DO NOT alter, add, or remove any words from the original text
- DO NOT create tags from existing narrative descriptions
- DO NOT use non-audio tags like [standing], [grinning], [pacing]
Example:
Input: "Are you serious? I can't believe you did that!"
Output: "[appalled] Are you serious? [sighs] I can't believe you did that!"
Reply ONLY with the enhanced text.
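LLMs occasionally break the "don't change the words" rule, so it's worth validating the enhanced text before sending it to TTS. A quick check (the normalization choices here are our assumption):

```python
import re

def words_preserved(original: str, enhanced: str) -> bool:
    """True if, after removing audio tags, the enhanced text contains
    exactly the original words. Case and punctuation are ignored,
    since the prompt allows CAPITALS and added punctuation."""
    strip = lambda s: re.sub(r"\[[^\[\]]+\]", " ", s)
    words = lambda s: re.findall(r"[a-z']+", strip(s).lower())
    return words(original) == words(enhanced)

words_preserved(
    "Are you serious? I can't believe you did that!",
    "[appalled] Are you serious? [sighs] I can't believe you did that!",
)  # True
```

If the check fails, regenerate rather than patching the output by hand.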
While waiting for advanced features like Director's Mode, the techniques above (audio tags, punctuation, and multi-speaker prompting) are your best tools for maximizing creativity.
Now that you have these techniques, head over to GenSong's Text to Speech tool and start experimenting. Start simple — try adding a few audio tags to a paragraph of text — and build up to multi-speaker dialogues as you get comfortable with how different voices respond to your prompts.
Happy creating!