Familiar | AI voices for your Foundry NPCs

What it does

Voice turns the AI's words into speech. When a character talks or you ask for narration, Familiar sends the line to the text-to-speech provider you picked and plays the audio back in the chat, right there in Foundry. The table hears the scene instead of reading it off the screen.

You set two voices. A DM or narrator voice carries description and the beats between people; a character voice carries the people themselves. The AI chooses between them line by line, so a stretch of narration and the innkeeper answering it come out in the right voice without you switching anything.

Two voices: a DM or narrator voice and a separate character voice, picked per line by the AI.
Several characters in one exchange play back in order, gapless, as a single take.
Speaks the narration and the action too, so the parts between the quoted lines are heard, not skipped.
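
One common way to get back-to-back playback without gaps in a browser is to put every clip on a shared clock and start each one at the moment the previous one ends (the Web Audio API's AudioBufferSourceNode.start(when) accepts such a start time). The sketch below only computes those start times. The function name is hypothetical, and whether Familiar schedules audio this way is an assumption.

```typescript
// Hypothetical helper, not Familiar's source: given clip durations in seconds,
// return the start time of each clip so that every clip begins exactly when
// the previous one ends.
function scheduleGapless(durations: number[], startAt = 0): number[] {
  const starts: number[] = [];
  let t = startAt;
  for (const d of durations) {
    starts.push(t); // this clip starts where the last one finished
    t += d;         // the next clip starts after this one's full length
  }
  return starts;
}
```

Each returned value would be handed to start(when) for the matching clip, so a narrator line and two character lines play as one continuous take.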

Your key stays in your browser

Voice is bring-your-own-key. You paste your own provider key into Familiar's voice settings, and the audio is made browser-direct: your Foundry tab calls the provider, decodes the reply, and plays it. The key lives in your browser. It never reaches a Familiar server, because there is no Familiar server in the audio path.

So there is no extra subscription to us and no second bill from us. You pay your provider for what you generate, the same way the rest of Familiar handles keys, and nothing routes through anyone else on the way.

Nothing sits between you and the provider. The key you paste stays on your machine, and the audio is generated and played in your own browser.
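
To make "browser-direct" concrete, here is a minimal sketch of the kind of request a browser tab could send straight to a provider. It uses OpenAI's public speech endpoint as the example. The function name and the SpeechRequest type are hypothetical, and this is an illustration rather than Familiar's actual code.

```typescript
// Hypothetical sketch, not Familiar's source: the shape of a request a browser
// tab could send straight to a TTS provider, with no server in between.
interface SpeechRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds a request for OpenAI's speech endpoint. The key appears only in the
// Authorization header of a call addressed to the provider itself.
function buildOpenAiSpeechRequest(apiKey: string, text: string, voice: string): SpeechRequest {
  return {
    url: "https://api.openai.com/v1/audio/speech",
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "gpt-4o-mini-tts", input: text, voice }),
  };
}
```

In a browser, a request like this would be passed to fetch, and the audio in the response decoded and played in the same tab. The only host in the exchange is the provider's.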

Set it up

Four steps from nothing to a talking NPC. You do this once.

Open voice settings
From Familiar's chat window in Foundry, open the voice settings.
Paste a provider key
Pick one of the three providers and paste your API key. The key is per-user and stays in your browser.
Choose a voice and test
Choose a voice, optionally pick a model, and hit Test to hear it. Tune it until the voice sounds right.
Add a character voice (optional)
Set a second voice in the Character slot if you want narration and the cast to sound different. Leave it blank and every line uses your first voice.

Pick a provider

Three providers, all bring-your-own-key, all browser-direct. They trade cost against richness, so pick by how your night runs and what you want to spend. The figures below assume a typical session, around 15,000 characters of speech, and you pay the provider directly.

ElevenLabs has the largest voice library and a fast multilingual default. Pick the eleven_v3 model on a voice for audio tags and singing.
Cartesia has sub-100ms latency and covers 42 languages, including Dutch, on the sonic-3 model. Its voices are identified by UUID rather than by name, so the in-app dropdown loads them after you paste your key.
OpenAI reuses the OpenAI key you may already have. Models run from gpt-4o-mini-tts, whose delivery you can steer, through tts-1 to tts-1-hd for the highest quality, with named voices like alloy and nova.

A simple split: ElevenLabs for a narration-heavy night, Cartesia when cost matters, OpenAI as the cheap default if you already hold an OpenAI key.

The three voice providers Familiar supports, what each is best for, and roughly what one session costs.
Provider	Best for	Per session
ElevenLabs	Narration quality (recommended)	~$4.50
Cartesia	Cheapest and fastest	~$0.20
OpenAI	Cheap default, reuses your key	~$0.30

BYOK: you pay the provider, not us, and costs scale with how much speech you generate. ElevenLabs Creator is a monthly plan (around $22 a month, with the per-session figure as overage on top); Cartesia and OpenAI bill purely per use.
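
If your sessions run longer or shorter than 15,000 characters, you can scale the table's figures. The sketch below back-derives a rate per 1,000 characters from each per-session number (for example, $4.50 / 15 = $0.30) and multiplies it out. These rates are implied by the figures in this guide, not quoted provider prices, and the names are hypothetical.

```typescript
// Per-1,000-character rates implied by the table above (session cost / 15).
// These are back-derived estimates, not quoted provider prices.
const impliedRatePer1k: Record<string, number> = {
  ElevenLabs: 4.5 / 15,
  Cartesia: 0.2 / 15,
  OpenAI: 0.3 / 15,
};

// Estimated cost in dollars for a session of `characters` characters,
// rounded to the nearest cent.
function estimateSessionCost(provider: string, characters: number): number {
  const rate = impliedRatePer1k[provider];
  if (rate === undefined) throw new Error(`Unknown provider: ${provider}`);
  return Math.round((characters / 1000) * rate * 100) / 100;
}
```

A double-length night of 30,000 characters, for instance, would come out at roughly twice each figure in the table. The ElevenLabs monthly plan fee is separate and is not included in this estimate.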

Give each NPC its own voice

Past the two slots, you can pin a voice to one character. Assign a voice to an NPC once and the AI uses it automatically every time that NPC speaks, so the harbourmaster sounds like the harbourmaster from one session to the next.
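
Put together, the narrator voice, the optional Character slot, and a pinned voice form a simple lookup order. The sketch below is one way to picture it. The type names, field names, and the assumption that a pinned voice takes priority over the Character slot are illustrative, not Familiar's actual settings schema.

```typescript
// Hypothetical sketch of voice selection; the names and the lookup order are
// assumptions for illustration, not Familiar's actual settings schema.
interface VoiceSettings {
  narratorVoice: string;           // the first (DM or narrator) voice
  characterVoice?: string;         // optional Character slot; may be blank
  pinned: Record<string, string>;  // NPC name -> pinned voice id
}

interface SpokenLine {
  kind: "narration" | "character";
  speaker?: string;                // set for character lines
}

function resolveVoice(line: SpokenLine, s: VoiceSettings): string {
  if (line.kind === "narration") return s.narratorVoice;
  // Assumed here: a voice pinned to this NPC wins over the shared Character slot.
  const pin = line.speaker ? s.pinned[line.speaker] : undefined;
  if (pin) return pin;
  // A blank Character slot falls back to the first voice.
  return s.characterVoice || s.narratorVoice;
}
```

With only a narrator voice set and the harbourmaster pinned, narration and every unpinned NPC use the first voice, while the harbourmaster always gets the pinned one.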

This is the audible half of writing a character down. A written anchor keeps an NPC consistent on the page; a pinned voice keeps them consistent in the ear. The companion guide on writing characters covers the anchor side.


Emotion and singing

An optional emotion hint colours how a line is delivered. Tag it and the voice leans into the cue, from a whisper to a shout; leave it off and the line is read plainly.

Provider support differs. Cartesia and OpenAI take an emotion hint on any voice. ElevenLabs renders it only on the eleven_v3 model, the same model that unlocks singing, and skips it on the faster default.

Singing is the special case. Set the emotion to singing and the line is sung rather than spoken. It works on an ElevenLabs eleven_v3 voice only, so put that model on the voice you want to carry a song.

Emotion hints: angry, sad, happy, excited, calm, scared, whispering, shouting, nervous, tired, and singing.

On ElevenLabs, switch the voice to the eleven_v3 model for emotion or singing; the faster default skips both. Cartesia and OpenAI take an emotion hint as it is.
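
The support rules above fit in a few lines. This is a sketch under the rules as this guide states them. The function and type names are hypothetical, and the provider identifiers are lowercase labels chosen for the example.

```typescript
type Provider = "elevenlabs" | "cartesia" | "openai";

// The emotion hints listed in this guide.
const EMOTIONS = [
  "angry", "sad", "happy", "excited", "calm", "scared",
  "whispering", "shouting", "nervous", "tired", "singing",
] as const;
type Emotion = (typeof EMOTIONS)[number];

// Returns true when the hint would be rendered, per the rules described above.
function emotionApplies(provider: Provider, model: string, emotion: Emotion): boolean {
  // Singing works on an ElevenLabs eleven_v3 voice only.
  if (emotion === "singing") return provider === "elevenlabs" && model === "eleven_v3";
  // ElevenLabs renders the other hints only on eleven_v3 as well.
  if (provider === "elevenlabs") return model === "eleven_v3";
  // Cartesia and OpenAI take an emotion hint on any voice.
  return true;
}
```

A check like this explains the most common surprise: a whispering hint on an ElevenLabs voice that is still on the faster default model is read plainly.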

Play it for the whole table

By default the voice plays in one place: your browser, the GM's. Turn broadcast on and the audio also plays on every connected player's client, so the whole table hears it at once.

It is off by default and only the GM can switch it on, so nothing reaches your players until you choose to. Leave it off for a local read-aloud, turn it on when you want the room to share the moment.

If a voice will not play

One thing trips people up after switching providers: a voice has to belong to the provider it is set on. An OpenAI voice name like alloy will not play on Cartesia, which identifies its voices by UUID rather than by name.

The dropdown is the fix. After you paste a provider key, choose a voice from the in-app list, which loads that provider's real voices. Picking from the dropdown rather than carrying a name over from another provider keeps the voice and the provider matched.
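
If you want to see why a carried-over name fails, a shape check is enough. The sketch below tests whether a voice id has the standard UUID layout that Cartesia voices use. It is a hypothetical helper that checks the format only and cannot tell you whether the voice actually exists on your account.

```typescript
// Matches the canonical 8-4-4-4-12 hexadecimal UUID layout.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Hypothetical pre-flight check: does this voice id at least look like a
// Cartesia voice? A name such as "alloy" fails; a UUID passes.
function looksLikeCartesiaVoice(voiceId: string): boolean {
  return UUID_RE.test(voiceId.trim());
}
```

An OpenAI name like alloy fails this check immediately, which is the mismatch the dropdown protects you from.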

What it does not do

Voice is speech and singing. A few neighbouring things are deliberately out of scope, so you know where the edge is:

No voice cloning.
No conversational or speech-to-speech mode.
No instrumental or background music.

Voice one NPC and play

Pick a character who matters to your next session, set a provider and a voice, and let the AI speak the next time the party talks to them. You stay in the scene while it carries the lines. Questions about voices, providers, or your first session are welcome in the Discord.

Looking for session transcription instead, capturing what your table says aloud and saving it as a searchable record? That is a separate feature, with its own guide on the way.
