Single speaker
Xavier: [calm] Welcome to the AI studio, where photos come to life with AI Avatar Lip Sync. [excited] Upload an image and an audio file, then watch your avatar speak naturally.
Multi-speaker dialogue
Juniper: [excitedly] Hey James! Have you tried the new ElevenLabs V3?
James: [curiously] Yeah, just got it! The emotion is so amazing. I can actually do whispers now— [whispering] like this!
Text to Speech That Acts the Script, Not Just Reads It
This text to speech tool is an AI voice generator built for performance, not playback: write a script, assign each line to one of 113 voices, and direct the delivery with audio tags like [whispers], [excited], or [interrupting]. It runs ElevenLabs' v3 dialogue engine, the expressive model generation that is now generally available, and speaks 75 languages with automatic language detection. If you want a page read aloud, a reader app does that; if you want a scene performed, this is the booth. The director's manual is below.
A Reader Reads. A Performer Delivers.
Two kinds of tool share the name "text to speech." Pick the right species first.
Text reader apps
Built for consumption: they read articles, PDFs, and screens aloud in a steady, neutral voice — accessibility and listen-while-commuting tools.
Great for intake. Not built to produce content.
A voice performance engine — this page
Built for production: scripted lines, cast voices, emotional direction, multi-speaker scenes — output you publish, not output you follow along to.
If the audio is the product, you are in the right place.
Reviewers draw the same line inside ElevenLabs' own lineup: the older v2 line remains the steadier pick for flat narration, while v3 — the engine here — is consistently rated stronger wherever emotion, dialogue, and delivery matter.
Audio Tags: Stage Directions for Voices
Square-bracket cues the engine performs — ElevenLabs' own docs group them into four jobs.
Emotional shifts
Set or flip the feeling mid-line; the read follows the bracket.
[excited] [annoyed] [sarcastic] [flustered] [sighs]
Rhythm and pacing
Control tempo and hesitation the way punctuation never quite can.
[fast-paced] [hesitates] [pause] [drawn out]
Turn-taking and interruptions
The dialogue-native group: speakers cut in, overlap, and trade lines like a real conversation.
[interrupting] [overlapping] [cuts in]
Identity and character
Push a voice into a role without changing the voice itself.
[childlike tone] [deep voice] [pirate voice] [robotic tone]
Even sound effects ride in brackets — official examples run from [laughs] to [gunshot] and [explosion]. Use them like seasoning: one or two per passage, placed immediately before the words they direct.
The Most Important Setting on the Page
Reviewers keep reaching the same verdict: the stability mode decides how much acting you get — and how much risk.
Creative
Maximum expressiveness and the strongest response to audio tags — with a documented tendency to improvise, occasionally beyond the script.
Character work, drama, anything where flat delivery is the failure mode. Review each take.
Natural
The default and the balance point: close to the original voice, reliable tag response, few surprises.
Podcasts, explainers, most production work — start here.
Robust
Maximum consistency, minimum drama: steady output that holds across long passages but largely ignores directional tags.
Long neutral narration where uniformity beats expression.
Working rule: direct in Creative or Natural; endure in Robust. Tags need headroom to act.
Writing for More Than One Voice
Multi-speaker output is line-based: each line carries its own text and its own voice.
One line, one speaker.
The editor assigns a voice per line — alternate lines to build an exchange, and give each recurring character a fixed voice for the whole script.
Budget the 5,000 characters.
The cap covers all lines combined. A two-voice scene splits the budget — trim stage chatter that a single bracket can express instead.
Stage interruptions with tags, not dashes.
[interrupting] and [overlapping] at the start of a line cue the engine to collide turns naturally — the dialogue behavior punctuation alone cannot trigger.
Read it aloud once before generating.
If a human stumbles on the line, the model inherits the stumble. Awkward scripts make awkward audio in any voice.
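For anyone drafting outside the editor, the shared budget can be checked before pasting. This is a minimal sketch, not the studio's own counter: the `budget_report` helper and the (voice, text) pair layout are hypothetical, and it assumes every character counts toward the cap, bracketed tags included, which this page does not confirm.

```python
CHAR_LIMIT = 5000  # total cap across all lines, as stated on this page


def budget_report(lines, limit=CHAR_LIMIT):
    """Return (used, remaining, per_voice) for a list of (voice, text) pairs."""
    per_voice = {}
    for voice, text in lines:
        # Assumption: every character counts, including bracketed tags.
        per_voice[voice] = per_voice.get(voice, 0) + len(text)
    used = sum(per_voice.values())
    return used, limit - used, per_voice


script = [
    ("Juniper", "[excitedly] Hey James! Have you tried the new ElevenLabs V3?"),
    ("James", "[curiously] Yeah, just got it! The emotion is so amazing."),
]

used, remaining, per_voice = budget_report(script)
print(f"used {used}, remaining {remaining}")
for voice, count in per_voice.items():
    print(f"  {voice}: {count} characters")
```

A negative `remaining` means the scene is over budget; the per-voice split shows which character is eating it.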
Casting From 113 Voices Without Auditioning All of Them
Every voice has an instant preview. The shortcut is knowing what to listen for.
- Cast by role, not by vibe: narrator, host, character — shortlist three per role and preview each with your actual opening line.
- Contrast pairs win in dialogue: two similar voices blur together; pick distinct registers so listeners always know who is speaking.
- Match voice to language: accents shift between languages on the same voice — preview in the language you will publish.
- Lock the cast before tuning tags: changing a voice resets your sense of timing. Decide who speaks, then direct how.
Four Productions This Studio Handles
Each card pairs the brief with the direction that makes it work.
A two-host podcast, no studio
The brief: A weekly show with banter, not alternating monologues.
The direction: Two contrasting voices, Natural mode, [overlapping] on reactions and [laughs] where it genuinely lands.
What returns: A conversational episode that sounds produced, ready for the feed.
Producer note: Write the banter loose — interruption tags do the chemistry work scripts usually fake.
Audiobook chapters with a full cast
The brief: Narration plus distinct character voices, chapter by chapter.
The direction: A Robust narrator for continuity; Creative character lines with one emotion tag per scene.
What returns: A multi-voice chapter that holds attention without a recording booth.
Producer note: Generate chapter by chapter under the character budget, reusing the same cast every time.
A thirty-second spot in five takes
The brief: Ad copy that needs energy, a beat of doubt, and a confident close.
The direction: One charismatic voice, Creative mode, [excited] open, [pause] before the offer.
What returns: Broadcast-paced delivery you can A/B against alternate reads in minutes.
Producer note: Spell out numbers and symbols — 'twenty percent off' reads cleaner than '20% off.'
The voice track for a talking avatar
The brief: A presenter video needs its voiceover first.
The direction: One steady voice, Natural mode, minimal tags — lip-sync prefers clean, even delivery.
What returns: A dialogue-engine voice track that drops straight into the AI Avatar tool on this site.
Producer note: Keep it dry: heavy emotion tags and effects fight the lip-sync downstream.
Where Expressive TTS Pushes Back
Five behaviors that surprise first-time directors — and the adjustment for each.
Creative mode sometimes improvises beyond the script.
Direction: That is the documented trade for expressiveness. Audition important lines, keep Creative for character moments, and let Natural carry the spine of the piece.
A tag gets read literally or silently skipped.
Direction: Three checks in order: the mode (Robust dampens tags, so switch to Natural or Creative), the placement (brackets directly before the target words), the density (one or two per passage; stacked tags compete).
Long projects hit the 5,000-character ceiling.
Direction: Chapter the script, keep voice assignments and mode identical across renders, and join the files in an editor — consistency holds because the cast never changed.
Numbers, symbols, and abbreviations read unpredictably.
Direction: Write them out: "doctor" not "Dr.", "twenty twenty-six" when you want the year spoken that way. The script is the pronunciation contract.
Smaller languages carry stronger accents on some voices.
Direction: Preview candidates in the target language before committing — voice character travels, but accent quality varies voice by voice across the 75 options.
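The chaptering advice for long projects can be sketched as a small pre-processing step. This is an illustration under stated assumptions, not a feature of the studio: `chapter_script` is a hypothetical helper, lines are (voice, text) pairs, and the 5,000-character cap is assumed to count every character in the text. It never splits a line, so each speaker turn stays whole.

```python
def chapter_script(lines, limit=5000):
    """Group (voice, text) lines into chunks whose combined text fits the limit."""
    chunks, current, used = [], [], 0
    for voice, text in lines:
        if len(text) > limit:
            # A single oversized line needs a human decision on where to cut.
            raise ValueError(f"One line exceeds {limit} characters; split it by hand.")
        if current and used + len(text) > limit:
            # Close the current chunk and start a new one with this line.
            chunks.append(current)
            current, used = [], 0
        current.append((voice, text))
        used += len(text)
    if current:
        chunks.append(current)
    return chunks


demo = [("Narrator", "a" * 3000), ("Hero", "b" * 2500), ("Narrator", "c" * 1000)]
chunks = chapter_script(demo)
print([sum(len(text) for _, text in chunk) for chunk in chunks])  # [3000, 3500]
```

Because each pair keeps its voice name, the cast carries over unchanged from one render to the next, which is what keeps the joined files consistent.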
The Direction Playbook
Pulled from ElevenLabs' best-practices guidance, then checked against production use.
Punctuation is pacing
Commas breathe, periods stop, ellipses trail, em dashes cut. The engine reads punctuation as timing — rewrite the rhythm before reaching for another tag.
Tags direct what follows
Place the bracket immediately before the words it governs, inside the right line. A [whispers] at line start whispers the line; buried mid-sentence, it whispers only the tail.
The same line, directed
Flat
"Welcome back to the show. Today we have some really exciting news about the project."
Directed
"[excited] Welcome back to the show! [pause] Today… we finally get to talk about the project."
Same words, two performances. The directed version commits to an emotion up front, buys a beat of suspense with a tag and an ellipsis, and lets punctuation finish the acting.
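The placement and density guidance on this page can be turned into a rough script check. This is a sketch only: `lint_tags` is a hypothetical helper, the default of two tags per line is borrowed from the "one or two per passage" advice, and a trailing tag may still be a deliberate sound effect, so treat the warnings as prompts to review rather than errors.

```python
import re

# Matches a square-bracket cue such as [excited] or [fast-paced].
TAG = re.compile(r"\[[^\[\]]+\]")


def lint_tags(line, max_tags=2):
    """Return a list of warnings for one script line."""
    warnings = []
    tags = TAG.findall(line)
    if len(tags) > max_tags:
        warnings.append(f"{len(tags)} tags in one line; aim for one or two")
    if re.search(r"\]\s*\[", line):
        # Back-to-back brackets with no words between them.
        warnings.append("stacked tags compete; keep one per phrase")
    if re.search(r"\]\s*$", line):
        # Nothing follows the last bracket, so it has no words to direct.
        warnings.append("tag at end of line has no words after it")
    return warnings


directed = "[excited] Welcome back to the show! [pause] Today… we finally get to talk about the project."
crowded = "[excited] [fast-paced] [laughs] Big news [pause]"

print(lint_tags(directed))
for warning in lint_tags(crowded):
    print("-", warning)
```

The directed line from the example above passes cleanly, while the crowded line trips all three checks.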
This Studio, a Recording Booth, or a Reader App?
Three ways to turn words into audio.
This studio
Scripted, performed audio — dialogue, directed narration, character voices — produced at writing speed in 75 languages.
A recording booth
A specific human performance, legal reads with sign-off, or a brand voice contractually tied to a person.
A reader app
Consuming text aloud — articles, PDFs, screens. Listening tools, not production tools.
How the Text to Speech Studio Works
Script, cast, direct — the booth sits at the top of this page.
Write the script in lines
One speaker per line, up to 5,000 characters in total. Mark the emotional beats you already hear in your head.
Cast and preview the voices
Assign a voice per line from the 113-voice library — preview with your real opening line, not the sample text.
Direct, generate, retake
Drop in audio tags, pick the stability mode, and generate. Retake single lines by adjusting their tags instead of re-rolling the whole scene.
Text to Speech: The Director Questions
Performance, casting, and consistency — answered from official docs and production use.
The Voice Is Step One
Give it a face, cut it into footage, or build the scene around it.
Your Script Already Knows How It Wants to Sound
Cast the voices, drop the tags, pick the mode — and this text to speech studio performs it back in any of 75 languages. Dialogue-ready, at the top of this page.