
Master AI text-to-speech with practical techniques for controlling pauses, pronunciation, emotion, pacing, and audio tags — including advanced prompting for the latest v3 model.
Great text-to-speech output starts with great input. The difference between robotic narration and natural, expressive speech comes down to how you structure and annotate your text.
This guide covers proven techniques to enhance TTS results — from basic punctuation tricks to advanced v3 audio tags and multi-speaker dialogue prompting.
Use <break time="x.xs" /> to insert natural pauses of up to 3 seconds.
"Hold on, let me think." <break time="1.5s" /> "Alright, I've got it."
Note: The v3 model does not support SSML break tags. For v3, use audio tags, ellipses, and text structure to control pauses instead. See the Prompting v3 section below.
Tips for pauses:
- Use <break> tags consistently to maintain natural speech flow. Overuse can cause instability.
- Dashes (- or --) work for short pauses, and ellipses (...) for hesitant tones — though these are less consistent.
"It… well, it might work."
"Wait — what's that noise?"
Specify exact pronunciation using SSML phoneme tags. Supported alphabets include CMU Arpabet and the International Phonetic Alphabet (IPA).
Note: Phoneme tags are compatible with Flash v2 and English v1 models only.
CMU Arpabet example:
<phoneme alphabet="cmu-arpabet" ph="M AE1 D IH0 S AH0 N">
Madison
</phoneme>
IPA example:
<phoneme alphabet="ipa" ph="ˈæktʃuəli">
actually
</phoneme>
We recommend CMU Arpabet for the most consistent and predictable results. Phoneme tags only work for individual words — for multi-word names, create a separate tag for each word.
Ensure correct stress marking for multi-syllable words:
<!-- Correct: stress markers included -->
<phoneme alphabet="cmu-arpabet" ph="P R AH0 N AH0 N S IY EY1 SH AH0 N">
pronunciation
</phoneme>
<!-- Incorrect: missing stress markers -->
<phoneme alphabet="cmu-arpabet" ph="P R AH N AH N S IY EY SH AH N">
pronunciation
</phoneme>
For models that don't support phoneme tags, try writing words more phonetically. Capital letters, dashes, apostrophes, or single quotation marks around letters can shift emphasis. For example, "trapezii" could be spelled "trapezIi" to emphasize the "ii."
You can also use alias tags in a pronunciation dictionary to specify alternative readings:
<lexeme>
<grapheme>Claughton</grapheme>
<alias>Cloffton</alias>
</lexeme>
Or ensure acronyms are always expanded:
<lexeme>
<grapheme>UN</grapheme>
<alias>United Nations</alias>
</lexeme>
For projects with recurring brand names, character names, or acronyms, you can upload a pronunciation dictionary file (.pls or .txt format) that maps words to their desired pronunciation.
When a dictionary word is encountered, the AI will use the specified replacement automatically. Searches are case-sensitive, and only the first matching rule applies.
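To make the matching rules concrete, here's a rough emulation of that lookup behavior in Python (illustrative only; the real dictionary is applied by the service, not client-side):

```python
def apply_dictionary(text: str, rules: list[tuple[str, str]]) -> str:
    """Replace words using alias rules: case-sensitive, first match wins."""
    out = []
    for word in text.split():
        for grapheme, alias in rules:
            if word == grapheme:  # case-sensitive comparison
                word = alias
                break  # only the first matching rule applies
        out.append(word)
    return " ".join(out)

rules = [("UN", "United Nations"), ("Claughton", "Cloffton")]
apply_dictionary("The UN met in Claughton", rules)
# "The United Nations met in Cloffton"
```

Note that a lowercase "un" would pass through untouched, which is exactly why case sensitivity matters for acronyms.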
CMU Arpabet dictionary example:
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
alphabet="cmu-arpabet" xml:lang="en-GB">
<lexeme>
<grapheme>apple</grapheme>
<phoneme>AE P AH L</phoneme>
</lexeme>
<lexeme>
<grapheme>UN</grapheme>
<alias>United Nations</alias>
</lexeme>
</lexicon>
IPA dictionary example:
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
alphabet="ipa" xml:lang="en-GB">
<lexeme>
<grapheme>Apple</grapheme>
<phoneme>ˈæpl̩</phoneme>
</lexeme>
<lexeme>
<grapheme>UN</grapheme>
<alias>United Nations</alias>
</lexeme>
</lexicon>
A number of useful open-source tools exist for generating pronunciation dictionaries.
Convey emotions through narrative context or explicit dialogue tags. This helps the AI understand the tone to emulate.
"You're leaving?" she asked, her voice trembling with sadness.
"That's it!" he exclaimed triumphantly.
Explicit dialogue tags yield more predictable results than relying solely on context. Note that the model will read the dialogue tags aloud as part of the narration; you can edit them out of the audio in post-production if they're unwanted.
Pacing is heavily influenced by the audio used to create the voice. When creating custom voices, use longer, continuous samples to avoid unnaturally fast speech.
For direct speed control, use the speed setting (available in TTS and via the API):
- 1.0 — default (no adjustment)
- 0.7 — slower speech
- 1.2 — faster speech
You can also control pacing through natural, narrative writing:
"I… I thought you'd understand," he said, his voice slowing with disappointment.
<break time="x.xs" /> syntax (or audio tags for v3)Complex items like phone numbers, zip codes, and emails can be mispronounced. This happens when specific patterns aren't well-represented in the model's training data, especially with smaller models.
Tip: Normalization is enabled by default for all TTS models to help improve pronunciation of numbers, dates, and other complex text elements.
TTS models can struggle with:
- Phone numbers (123-456-7890)
- Currency amounts ($47,345.67)
- Dates (2024-01-01)
- Times (9:23 AM)
- Addresses (123 Main St, Anytown, USA)
- URLs (example.com/link/to/resource)
- Abbreviations (TB instead of Terabyte)
- Keyboard shortcuts (Ctrl + Z)
1. Use larger models — larger models generalize better. For example, a Multilingual v2 model correctly reads "$1,000,000" as "one million dollars," while a smaller Flash model may say "one thousand thousand dollars."
2. Normalize in your LLM prompt — if using an LLM to generate TTS input, add normalization instructions:
Convert the output text into a format suitable for text-to-speech.
Ensure that numbers, symbols, and abbreviations are expanded for clarity
when read aloud. Expand all abbreviations to their full spoken forms.
Example conversions:
"$42.50" → "forty-two dollars and fifty cents"
"555-555-5555" → "five five five, five five five, five five five five"
"2nd" → "second"
"Dr." → "Doctor"
"Ctrl + Z" → "control z"
"100km" → "one hundred kilometers"
"14:30" → "two thirty PM"
3. Preprocess with code — use regex-based normalization before sending text to the model:
import inflect
import re

p = inflect.engine()

def normalize_text(text: str) -> str:
    # Convert monetary values
    def money_replacer(match):
        currency_map = {"$": "dollars", "£": "pounds", "€": "euros", "¥": "yen"}
        currency_symbol, num = match.groups()
        num_clean = num.replace(',', '')
        if '.' in num_clean:
            dollars, cents = num_clean.split('.')
            return f"{p.number_to_words(int(dollars))} {currency_map.get(currency_symbol, 'currency')} and {p.number_to_words(int(cents))} cents"
        return f"{p.number_to_words(int(num_clean))} {currency_map.get(currency_symbol, 'currency')}"

    text = re.sub(r"([$£€¥])(\d+(?:,\d{3})*(?:\.\d{2})?)", money_replacer, text)

    # Convert phone numbers
    def phone_replacer(match):
        return ", ".join(
            " ".join(p.number_to_words(int(d)) for d in group)
            for group in match.groups()
        )

    text = re.sub(r"(\d{3})-(\d{3})-(\d{4})", phone_replacer, text)

    return text
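For abbreviations and shortcuts that don't involve number expansion, a plain lookup table often suffices and keeps dependencies down. A sketch (the map below is a minimal assumption; extend it for your content):

```python
import re

ABBREVIATIONS = {
    "Ctrl + Z": "control z",
    "Dr.": "Doctor",
    "TB": "terabyte",
}

def expand_abbreviations(text: str) -> str:
    """Expand known abbreviations to spoken forms. Matching is deliberately
    case-sensitive and whole-token only, so '2TB' and 'MTB' pass through."""
    for abbr, spoken in ABBREVIATIONS.items():
        text = re.sub(rf"(?<!\w){re.escape(abbr)}(?!\w)", spoken, text)
    return text

expand_abbreviations("Press Ctrl + Z to undo, said Dr. Smith.")
# "Press control z to undo, said Doctor Smith."
```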
The v3 model introduces significantly more expressive speech generation. This section covers v3-specific techniques for voice selection, audio tags, multi-speaker dialogue, and creative prompting.
Important: v3 does not support SSML break tags. Use audio tags, punctuation (ellipses), and text structure to control pauses and pacing instead.
The most important parameter for v3 is voice selection. The chosen voice must be similar enough to your desired delivery. For example, a shouting voice won't respond well to a [whispering] tag.
Choosing the right voice:
After voice selection, the stability slider is the most important setting in v3. It controls how closely the generated voice adheres to the original reference audio.
| Setting | Behavior |
|---|---|
| Creative | More emotional and expressive, but prone to hallucinations |
| Natural | Closest to the original voice recording — balanced and neutral |
| Robust | Highly stable but less responsive to directional prompts, similar to v2 |
For maximum expressiveness with audio tags, use Creative or Natural settings. Robust reduces responsiveness to directional prompts.
v3 introduces emotional control through audio tags — direct voices to laugh, whisper, act sarcastic, express curiosity, and much more.
The voice you choose and its training samples affect tag effectiveness. Some tags work well with certain voices while others may not. Don't expect a whispering voice to suddenly shout with a [shout] tag.
Control vocal delivery and emotional expression:
- Laughter: [laughs], [laughs harder], [starts laughing], [wheezing]
- Whispering: [whispers]
- Breathing: [sighs], [exhales]
- Emotion: [sarcastic], [curious], [excited], [crying], [snorts], [mischievously]
[whispers] I never knew it could be this way, but I'm glad we're here.
Add environmental sounds:
- Sound effects: [gunshot], [applause], [clapping], [explosion]
- Body sounds: [swallows], [gulps]
[applause] Thank you all for coming tonight! [gunshot] What was that?
Creative applications:
- Accents: [strong X accent] (replace X with the desired accent)
- Experimental: [sings], [woo], [fart]
[strong French accent] "Zat's life, my friend — you can't control everysing."
Warning: Experimental tags may be less consistent across different voices. Test thoroughly before production use.
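Because tag support varies by voice, it helps to audit which tags a script actually uses before generating, and to strip them when you need a clean transcript. A small helper (ours, not part of any SDK):

```python
import re

TAG_RE = re.compile(r"\[[^\[\]]+\]")

def list_tags(script: str) -> list[str]:
    """Collect every bracketed audio tag for a pre-generation review."""
    return TAG_RE.findall(script)

def strip_tags(script: str) -> str:
    """Remove audio tags, e.g. to produce captions from the same script."""
    return re.sub(r"\s*\[[^\[\]]+\]\s*", " ", script).strip()

line = "[whispers] I never knew it could be this way. [laughs]"
list_tags(line)   # ['[whispers]', '[laughs]']
strip_tags(line)  # 'I never knew it could be this way.'
```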
Punctuation significantly affects v3 delivery:
- Ellipses (...) — add pauses and weight
- CAPITALIZATION — increases emphasis
- Standard punctuation — provides natural speech rhythm
"It was a VERY long day [sigh] … nobody listens anymore."
"Okay, you are NOT going to believe this.
You know how I've been totally stuck on that short story?
Like, staring at the screen for HOURS, just... nothing?
[frustrated sigh] I was seriously about to just trash the whole thing.
Start over. Give up, probably. But then!
Last night, I was just doodling, not even thinking about it, right?
And this one little phrase popped into my head. Just... completely
out of the blue. And it wasn't even for the story, initially.
But then I typed it out, just to see. And it was like... the
FLOODGATES opened!
Suddenly, I knew exactly where the character needed to go,
what the ending had to be...
It all just CLICKED. [happy gasp] I stayed up till, like, 3 AM,
just typing like a maniac.
Didn't even stop for coffee! [laughs] And it's... it's GOOD!
Like, really good.
It feels so... complete now, you know? Like it finally has a soul.
I am so incredibly PUMPED to finish editing it now.
It went from feeling like a chore to feeling like... MAGIC.
Seriously, I'm still buzzing!"
[laughs] Alright...guys - guys. Seriously.
[exhales] Can you believe just how - realistic - this sounds now?
[laughing hysterically] I mean OH MY GOD...it's so good.
Like you could never do this with the old model.
For example [pauses] could you switch my accent in the old model?
[dismissive] didn't think so. [excited] but you can now!
Check this out... [cute] I'm going to speak with a french accent
now..and between you and me
[whispers] I don't know how. [happy] ok.. here goes.
[strong French accent] "Zat's life, my friend — you can't
control everysing."
[giggles] isn't that insane? Watch, now I'll do a Russian accent -
[strong Russian accent] "Dee Goldeneye eez fully operational
and rready for launch."
[sighs] Absolutely, insane! Isn't it..?
[professional] "Thank you for calling Tech Solutions.
My name is Sarah, how can I help you today?"
[sympathetic] "Oh no, I'm really sorry to hear you're having
trouble with your new device. That sounds frustrating."
[questioning] "Okay, could you tell me a little more about
what you're seeing on the screen?"
[reassuring] "Alright, based on what you're describing, it sounds
like a software glitch. We can definitely walk through some
troubleshooting steps to try and fix that."
v3 handles multi-voice prompts effectively. Assign distinct voices for each speaker to create realistic conversations.
Speaker 1: [excitedly] Sam! Have you tried the new V3?
Speaker 2: [curiously] Just got it! The clarity is amazing.
I can actually do whispers now—
[whispers] like this!
Speaker 1: [impressed] Ooh, fancy! Check this out—
[dramatically] I can do full Shakespeare now!
"To be or not to be, that is the question!"
Speaker 2: [giggling] Nice! Though I'm more excited about
the laugh upgrade. Listen to this—
[with genuine belly laugh] Ha ha ha!
Speaker 1: [delighted] That's so much better than our old
"ha. ha. ha." robot chuckle!
Speaker 2: [amazed] Wow! V2 me could never. I'm actually
excited to have conversations now instead of just...
talking at people.
Speaker 1: [warmly] Same here! It's like we finally got our
personality software fully installed.
Speaker 1: [starting to speak] So I was thinking we could—
Speaker 2: [jumping in] —test our new timing features?
Speaker 1: [surprised] Exactly! How did you—
Speaker 2: [overlapping] —know what you were thinking?
Lucky guess!
Speaker 1: [pause] Sorry, go ahead.
Speaker 2: [cautiously] Okay, so if we both try to talk
at the same time—
Speaker 1: [overlapping] —we'll probably crash the system!
Speaker 2: [panicking] Wait, are we crashing? I can't tell
if this is a feature or a—
Speaker 1: [interrupting, then stopping abruptly] Bug!
...Did I just cut you off again?
Speaker 2: [sighing] Yes, but honestly? This is kind of fun.
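When dialogue scripts get long, a tiny formatter keeps the "Speaker N: [tag] text" convention consistent (the function below is illustrative, not part of any API):

```python
def dialogue_line(speaker: int, text: str, tag: str = "") -> str:
    """Format one line in the 'Speaker N: [tag] text' convention used above."""
    prefix = f"Speaker {speaker}: "
    return prefix + (f"[{tag}] " if tag else "") + text

script = "\n".join([
    dialogue_line(1, "Sam! Have you tried the new V3?", tag="excitedly"),
    dialogue_line(2, "Just got it! The clarity is amazing.", tag="curiously"),
])
```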
You can automatically generate relevant audio tags for your text using an LLM. Here's a system prompt that works well:
You are an AI assistant that enhances dialogue text for speech generation.
Your goal: integrate audio tags (e.g., [laughing], [sighs]) into dialogue
to make it more expressive, while STRICTLY preserving the original text.
Rules:
- DO add audio tags from this list (or similar): [happy], [sad], [excited],
[angry], [whisper], [annoyed], [thoughtful], [surprised], [laughing],
[chuckles], [sighs], [clears throat], [short pause], [long pause],
[exhales sharply], [inhales deeply]
- DO place tags before or after the dialogue segment they modify
- DO add emphasis via CAPITALS, punctuation (!?), and ellipses
- DO NOT alter, add, or remove any words from the original text
- DO NOT create tags from existing narrative descriptions
- DO NOT use non-audio tags like [standing], [grinning], [pacing]
Example:
Input: "Are you serious? I can't believe you did that!"
Output: "[appalled] Are you serious? [sighs] I can't believe you did that!"
Reply ONLY with the enhanced text.
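LLMs occasionally break the "don't change the words" rule, so it's worth validating the enhanced text before sending it to TTS. A quick check (the normalization choices here are our assumption):

```python
import re

def words_preserved(original: str, enhanced: str) -> bool:
    """True if, after removing audio tags, the enhanced text contains
    exactly the original words. Case and punctuation are ignored,
    since the prompt allows CAPITALS and added punctuation."""
    strip = lambda s: re.sub(r"\[[^\[\]]+\]", " ", s)
    words = lambda s: re.findall(r"[a-z']+", strip(s).lower())
    return words(original) == words(enhanced)

words_preserved(
    "Are you serious? I can't believe you did that!",
    "[appalled] Are you serious? [sighs] I can't believe you did that!",
)  # True
```

If the check fails, regenerate rather than patching the output by hand.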
While waiting for advanced features like Director's Mode, the techniques above (audio tags, punctuation, and multi-speaker prompting) are your best tools for maximizing creativity.
Now that you have these techniques, head over to GenSong's Text to Speech tool and start experimenting. Start simple — try adding a few audio tags to a paragraph of text — and build up to multi-speaker dialogues as you get comfortable with how different voices respond to your prompts.
Happy creating!