Text to Speech That Acts the Script, Not Just Reads It

This text to speech tool is an AI voice generator built for performance, not playback: write a script, assign each line to one of 113 voices, and direct the delivery with audio tags like [whispers], [excited], or [interrupting]. It runs ElevenLabs' v3 dialogue engine, the expressive model generation that is now generally available, and speaks 75 languages with automatic detection. If you want a page read aloud, a reader app does that; if you want a scene performed, this is the booth. The director's manual is below.

Multi-Speaker Dialogue

Audio Tags Control

113 AI Voices

75 Languages

Free Online

Fast Generation

A Reader Reads. A Performer Delivers.

Two kinds of tool share the name "text to speech." Pick the right species first.

Text reader apps

Built for consumption: they read articles, PDFs, and screens aloud in a steady, neutral voice — accessibility and listen-while-commuting tools.

Great for intake. Not built to produce content.

A voice performance engine — this page

Built for production: scripted lines, cast voices, emotional direction, multi-speaker scenes — output you publish, not output you follow along to.

If the audio is the product, you are in the right place.

Reviewers draw the same line inside ElevenLabs' own lineup: the older v2 line remains the steadier pick for flat narration, while v3 — the engine here — is consistently rated stronger wherever emotion, dialogue, and delivery matter.

Audio Tags: Stage Directions for Voices

Square-bracket cues the engine performs — ElevenLabs' own docs group them into four jobs.

Emotional shifts

Set or flip the feeling mid-line; the read follows the bracket.

[excited] [annoyed] [sarcastic] [flustered] [sighs]

Rhythm and pacing

Control tempo and hesitation the way punctuation never quite can.

[fast-paced] [hesitates] [pause] [drawn out]

Turn-taking and interruptions

The dialogue-native group: speakers cut in, overlap, and trade lines like a real conversation.

[interrupting] [overlapping] [cuts in]

Identity and character

Push a voice into a role without changing the voice itself.

[childlike tone] [deep voice] [pirate voice] [robotic tone]

Even sound effects ride in brackets — official examples run from [laughs] to [gunshot] and [explosion]. Use them like seasoning: one or two per passage, placed immediately before the words they direct.
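The "one or two per passage" guideline can be checked before generating. The snippet below is a minimal sketch, not part of the studio: `count_tags` and `over_seasoned` are hypothetical helper names, and it assumes a tag is any text inside a single pair of square brackets.

```python
import re

# Assumption: a tag is any run of text inside one pair of square brackets,
# such as [excited] or [drawn out].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def count_tags(passage: str) -> int:
    """Return how many bracketed cues appear in the passage."""
    return len(TAG_PATTERN.findall(passage))


def over_seasoned(passages: list[str], max_tags: int = 2) -> list[int]:
    """Return the indexes of passages that carry more than max_tags cues."""
    return [i for i, p in enumerate(passages) if count_tags(p) > max_tags]


script = [
    "[excited] Welcome back to the show!",
    "[sighs] [annoyed] [sarcastic] Oh, great. Another meeting.",
]
print(over_seasoned(script))  # prints [1]
```

The second passage stacks three cues, so it is the one flagged for trimming.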

The Most Important Setting on the Page

Reviewers keep reaching the same verdict: the stability mode decides how much acting you get — and how much risk.

Creative

Maximum expressiveness and the strongest response to audio tags — with a documented tendency to improvise, occasionally beyond the script.

Character work, drama, anything where flat delivery is the failure mode. Review each take.

Natural

The default and the balance point: close to the original voice, reliable tag response, few surprises.

Podcasts, explainers, most production work — start here.

Robust

Maximum consistency, minimum drama: steady output that holds across long passages but largely ignores directional tags.

Long neutral narration where uniformity beats expression.

Working rule: direct in Creative or Natural; endure in Robust. Tags need headroom to act.

Writing for More Than One Voice

Multi-speaker output is line-based: each line carries its own text and its own voice.

One line, one speaker.

The editor assigns a voice per line — alternate lines to build an exchange, and give each recurring character a fixed voice for the whole script.

Budget the 5,000 characters.

The cap covers all lines combined. A two-voice scene splits the budget — trim stage chatter that a single bracket can express instead.

Stage interruptions with tags, not dashes.

[interrupting] and [overlapping] at the start of a line cue the engine to collide turns naturally — the dialogue behavior punctuation alone cannot trigger.

Read it aloud once before generating.

If a human stumbles on the line, the model inherits the stumble. Awkward scripts make awkward audio in any voice.
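The shared budget is easy to track while drafting. The sketch below models a script as one entry per line and totals the characters. `ScriptLine`, `chars_used`, `remaining_budget`, and `chars_by_voice` are hypothetical names, and it assumes every character of every line counts toward the cap, tags included; the studio's exact counting rule is not stated here.

```python
from dataclasses import dataclass

CHAR_LIMIT = 5000  # combined cap across all lines, as stated on this page


@dataclass
class ScriptLine:
    voice: str  # the voice assigned to this line
    text: str   # the line itself, including any bracketed tags


def chars_used(lines: list[ScriptLine]) -> int:
    # Assumption: tags and punctuation count the same as spoken words.
    return sum(len(line.text) for line in lines)


def remaining_budget(lines: list[ScriptLine], limit: int = CHAR_LIMIT) -> int:
    """Characters still available before the combined cap is reached."""
    return limit - chars_used(lines)


def chars_by_voice(lines: list[ScriptLine]) -> dict[str, int]:
    """Show how the shared budget is split across the cast."""
    totals: dict[str, int] = {}
    for line in lines:
        totals[line.voice] = totals.get(line.voice, 0) + len(line.text)
    return totals


scene = [
    ScriptLine("Juniper", "[excited] Hey James!"),
    ScriptLine("James", "[interrupting] Yeah, just got it!"),
]
print(remaining_budget(scene), chars_by_voice(scene))
```

A per-voice breakdown makes it obvious when one character is eating the scene's budget.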

Casting From 113 Voices Without Auditioning All of Them

Every voice has an instant preview. The shortcut is knowing what to listen for.

Cast by role, not by vibe: narrator, host, character — shortlist three per role and preview each with your actual opening line.
Contrast pairs win in dialogue: two similar voices blur together; pick distinct registers so listeners always know who is speaking.
Match voice to language: accents shift between languages on the same voice — preview in the language you will publish.
Lock the cast before tuning tags: changing a voice resets your sense of timing. Decide who speaks, then direct how.

Four Productions This Studio Handles

Each card pairs the brief with the direction that makes it work.

A two-host podcast, no studio

The brief: A weekly show with banter, not alternating monologues.

The direction: Two contrasting voices, Natural mode, [overlapping] on reactions and [laughs] where it genuinely lands.

What returns: A conversational episode that sounds produced, ready for the feed.

Producer note: Write the banter loose — interruption tags do the chemistry work scripts usually fake.

Audiobook chapters with a full cast

The brief: Narration plus distinct character voices, chapter by chapter.

The direction: A Robust narrator for continuity; Creative character lines with one emotion tag per scene.

What returns: A multi-voice chapter that holds attention without a recording booth.

Producer note: Generate chapter by chapter under the character budget, reusing the same cast every time.

A thirty-second spot in five takes

The brief: Ad copy that needs energy, a beat of doubt, and a confident close.

The direction: One charismatic voice, Creative mode, [excited] open, [pause] before the offer.

What returns: Broadcast-paced delivery you can A/B against alternate reads in minutes.

Producer note: Spell out numbers and symbols — 'twenty percent off' reads cleaner than '20% off.'

The voice track for a talking avatar

The brief: A presenter video needs its voiceover first.

The direction: One steady voice, Natural mode, minimal tags — lip-sync prefers clean, even delivery.

What returns: A dialogue-engine voice track that drops straight into the AI Avatar tool on this site.

Producer note: Keep it dry: heavy emotion tags and effects fight the lip-sync downstream.

Where Expressive TTS Pushes Back

Five behaviors that surprise first-time directors — and the adjustment for each.

Creative mode sometimes improvises beyond the script.

Direction: That is the documented trade for expressiveness. Audition important lines, keep Creative for character moments, and let Natural carry the spine of the piece.

A tag gets read literally or silently skipped.

Direction: Three checks in order: the mode (Robust dampens tags, so switch to Natural or Creative), the placement (brackets directly before the target words), the density (one or two per passage; stacked tags compete).

Long projects hit the 5,000-character ceiling.

Direction: Chapter the script, keep voice assignments and mode identical across renders, and join the files in an editor — consistency holds because the cast never changed.
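Chaptering can be done mechanically. The sketch below greedily packs whole lines into chapters that each stay within the ceiling, never splitting a line. `split_into_chapters` is a hypothetical helper, and it assumes only the line text counts toward the limit.

```python
def split_into_chapters(lines: list[str], limit: int = 5000) -> list[list[str]]:
    """Pack whole lines, in order, into chapters whose combined length stays within limit."""
    chapters: list[list[str]] = []
    current: list[str] = []
    used = 0
    for line in lines:
        if len(line) > limit:
            # A single oversized line cannot fit in any chapter; fix it by hand.
            raise ValueError("a single line exceeds the limit; shorten or split it")
        if current and used + len(line) > limit:
            # The next line would overflow, so close the current chapter.
            chapters.append(current)
            current, used = [], 0
        current.append(line)
        used += len(line)
    if current:
        chapters.append(current)
    return chapters


long_script = ["a" * 3000, "b" * 3000, "c" * 1000]
print([len(chapter) for chapter in split_into_chapters(long_script)])  # prints [1, 2]
```

Render each chapter with the same voice assignments and mode, then join the files in order.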

Numbers, symbols, and abbreviations read unpredictably.

Direction: Write them out: "doctor" not "Dr.", "twenty twenty-six" when you want the year spoken that way. The script is the pronunciation contract.
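A quick scan can point out the tokens worth writing out before a render. The sketch below only flags candidates and leaves the rewrite to the writer, since the spoken form is a creative choice. `flag_risky_tokens` is a hypothetical helper, and the symbol and abbreviation lists are illustrative assumptions, not an exhaustive set.

```python
import re

# Illustrative patterns only; extend the lists for your own scripts.
RISKY_PATTERNS = {
    "digits": re.compile(r"\d+(?:[.,]\d+)*"),
    "symbol": re.compile(r"[%$&@#+=]"),
    "abbreviation": re.compile(r"\b(?:Dr|Mr|Mrs|Ms|St|vs|etc)\."),
}


def flag_risky_tokens(text: str) -> list[tuple[str, str]]:
    """Return (kind, token) pairs for text the engine may read in several ways."""
    found = []
    for kind, pattern in RISKY_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((kind, match.group()))
    return found


print(flag_risky_tokens("Dr. Lee says 20% off in 2026."))
# prints [('digits', '20'), ('digits', '2026'), ('symbol', '%'), ('abbreviation', 'Dr.')]
```

An empty result means the line already spells out its intent.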

Smaller languages carry stronger accents on some voices.

Direction: Preview candidates in the target language before committing — voice character travels, but accent quality varies voice by voice across the 75 options.

The Direction Playbook

Pulled from ElevenLabs' best-practices guidance, then checked against production use.

Punctuation is pacing

Commas breathe, periods stop, ellipses trail, em dashes cut. The engine reads punctuation as timing — rewrite the rhythm before reaching for another tag.

Tags direct what follows

Place the bracket immediately before the words it governs, inside the right line. A [whispers] at line start whispers the line; buried mid-sentence, it whispers only the tail.

The same line, directed

Flat

"Welcome back to the show. Today we have some really exciting news about the project."

Directed

"[excited] Welcome back to the show! [pause] Today… we finally get to talk about the project."

Same words, two performances. The directed version commits to an emotion up front, buys a beat of suspense with a tag and an ellipsis, and lets punctuation finish the acting.
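To see which words each tag directs, a line can be split at its brackets. The sketch below is an illustration of the placement rule, not the engine's actual parser: `directed_segments` is a hypothetical helper, and it assumes a tag governs the text that follows it up to the next tag.

```python
import re
from typing import Optional

TAG_SPLIT = re.compile(r"\[([^\[\]]+)\]")


def directed_segments(line: str) -> list[tuple[Optional[str], str]]:
    """Split a line into (tag, text) pairs; untagged leading text gets tag None."""
    # re.split with one capture group returns: text, tag, text, tag, text, ...
    parts = TAG_SPLIT.split(line)
    segments: list[tuple[Optional[str], str]] = []
    lead = parts[0].strip()
    if lead:
        segments.append((None, lead))
    for i in range(1, len(parts), 2):
        segments.append((parts[i], parts[i + 1].strip()))
    return segments


directed = "[excited] Welcome back to the show! [pause] Today… we finally get to talk about the project."
for tag, text in directed_segments(directed):
    print(tag, "->", text)
```

A tag buried mid-sentence shows up with only the tail of the line beside it, which is a quick way to spot a misplaced bracket.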

This Studio, a Recording Booth, or a Reader App?

Three ways to turn words into audio.

This studio

Scripted, performed audio — dialogue, directed narration, character voices — produced at writing speed in 75 languages.

A recording booth

A specific human performance, legal reads with sign-off, or a brand voice contractually tied to a person.

A reader app

Consuming text aloud — articles, PDFs, screens. Listening tools, not production tools.

How the Text to Speech Studio Works

Script, cast, direct — the booth sits at the top of this page.

Write the script in lines

One speaker per line, up to 5,000 characters in total. Mark the emotional beats you already hear in your head.

Cast and preview the voices

Assign a voice per line from the 113-voice library — preview with your real opening line, not the sample text.

Direct, generate, retake

Drop in audio tags, pick the stability mode, and generate. Retake single lines by adjusting their tags instead of re-rolling the whole scene.

Text to Speech: The Director Questions

Performance, casting, and consistency — answered from official docs and production use.

Why does my speech sound flat, and how do I make it expressive?

Three levers, in order: switch the stability mode out of Robust (Natural and Creative respond to direction); add one audio tag before the line that needs feeling, such as [excited], [sighs], or [whispers]; and rewrite the punctuation for rhythm, since ellipses trail and dashes cut. Flat output is almost never the voice; it is an undirected script. One bracket and one mode change usually transform the read.

Creative, Natural, or Robust: which stability mode should I pick?

Natural is the working default: close to the voice, responsive to tags, few surprises. Step up to Creative when the line must act (drama, characters, comedic timing) and accept that it occasionally improvises; review each take. Reserve Robust for long, deliberately flat narration where consistency beats expression and it does not matter that tags are largely ignored. Direct in Creative or Natural; endure in Robust.

Can two voices interrupt each other like a real conversation?

Yes. That is a designed feature of the dialogue engine, not a trick. Start the incoming line with [interrupting] or [overlapping] and the engine collides the turns with natural timing; [cuts in] lands a harder break. ElevenLabs' own dialogue documentation showcases exactly this spontaneous turn-taking. Keep the interrupted line written past the cut point so there is something to talk over.

Do audio tags work in every language?

Broadly yes: tags describe the performance, not the vocabulary, so [whispers] whispers in Spanish as readily as in English across the 75 supported languages. Practical caveats: emotional nuance lands strongest in widely trained languages, and accent quality varies by voice, so preview your cast in the target language before production. Auto-detection handles mixed-language scripts line by line.

Why did the voice ignore my audio tag?

Run the three checks in order: mode first (Robust deliberately dampens directional tags, so move to Natural or Creative), placement second (the bracket governs the words immediately after it, inside that line), density third (stacked tags compete; keep one or two per passage). If it still misses, rephrase with a synonym: [thrilled] sometimes lands where [excited] slid past.

How do I keep one narrator consistent across long content?

Fix the inputs and the output follows: same voice, same stability mode, same tag style on every chapter render. Split the script into chapters under the 5,000-character ceiling, generate sequentially, and join the files in an editor. Listeners hear one narrator because nothing about the narrator changed. Robust mode adds a further layer of uniformity for long neutral reads.

What's the difference between this and a text reader app?

Direction of travel. A reader app turns existing text (articles, PDFs, screens) into audio for you to consume, in a steady utility voice. This tool turns scripts into produced audio for an audience: cast voices, emotional direction, multi-speaker scenes, retakes. If the audio is the product rather than the convenience, you want the performance engine.

How do I write a script for multiple speakers?

Line by line: one speaker per line, each line assigned its own voice, and the exchange builds itself. Keep one voice per character across the whole script, alternate lines for conversation, open reaction lines with turn-taking tags like [overlapping], and budget the 5,000 characters across the cast. Read it aloud once before generating; where you stumble, the model will too.

Can it do accents and character voices?

Yes, two ways that stack. Identity tags such as [pirate voice], [childlike tone], [robotic tone], and [deep voice] push a voice into a role, while the 113-voice library carries genuinely different registers and regional colors to cast from. Cast the closest natural voice first, then nudge with one identity tag; a tag-only transformation on a mismatched voice sounds like costume, not character.

How do I voice an AI avatar with this?

Generate here, animate next door: write the script, cast one steady voice, render in Natural mode with minimal tags, and download. Then open the AI Avatar tool on this site and attach the file as the audio input alongside a portrait. That chain is the site's production pipeline: script to voice to talking presenter without recording anything. Keep avatar tracks dry; lip-sync reads clean delivery best.

Why do numbers and abbreviations come out wrong?

Because "2026", "Dr.", and "20%" each have several legitimate spoken forms, and the engine picks one. Take control by spelling intent: "twenty twenty-six", "doctor", "twenty percent off". The same goes for acronyms: write "N. A. S. A." when you want letters; "NASA" risks a word-read. Production scripts treat the text as the pronunciation contract, and the engine honors it.

ElevenLabs v2 vs v3: why does this tool run v3?

Different engines for different jobs, and this page picked the performer. Community consensus holds that the older v2 line stays steadier for flat long-form narration, while v3, now generally available, is the stronger engine wherever expression matters: audio tags, emotional range, and true multi-speaker dialogue are v3 capabilities. A studio built around directed multi-voice production runs v3, and its Robust mode covers most of the steady-narration ground v2 owned.

The Voice Is Step One

Give it a face, cut it into footage, or build the scene around it.

AI Avatar Generator

AI Video Generator

AI Video Editor

Your Script Already Knows How It Wants to Sound

Cast the voices, drop the tags, pick the mode — and this text to speech studio performs it back in any of 75 languages. Dialogue-ready, at the top of this page.
