Building with Voice AI
Pat Nadolny (@pat_nadolny)
Over the past few months we've been building voice AI at the stealth startup I work at (stay tuned, coming soon!). One of our initial design partners built some of the most influential pre-AI voice tech in the industry, and they're now leaning on us to implement the next wave of AI-powered voice software to help them scale their business.
This post is part 1 in a series where I'll cover the current state of voice AI, the challenges we've run into in production, what our stack looks like and why we chose it, how to choose providers and models, and some tips for fine-tuning your assistant.
State of Voice AI
The days of Alexa and Siri are gone. There are impressive voice AI demos all over X showing that the tech has caught up to our imagination: you can build really smart, low-latency voice AI products today. A few demos that stood out to me:
- Happy robot AI agents for logistics https://www.linkedin.com/posts/pablorpalafox_rapidfire-f3-supplychain-activity-7265809296020377601-LTVI?utm_source=social_share_send&utm_medium=member_desktop_web&rcm=ACoAAA_fxccBQCkPZQ1TTQ9Aa_gdL6-C82KWeS4
- A big launch for Cartesia Sonic 3 https://x.com/krandiash/status/1983202316397453676?s=46&t=BPNHCkcq8BpDInSAuNgAdQ
- ChatGPT voice
In my experience only about half the people who pick up our calls are actually aware that they’re talking to AI. Very cool, slightly scary.
On the other hand, I also get at least one spam-bot voicemail a day that's super low quality. So there's a range, and building quality products with this tech is not as easy as it seems. Getting something spun up over a weekend is one thing; getting it working reliably enough to put into production in front of real customers is another.
How it works
At a high level, a voice AI assistant is a pipeline with five main stages:
1. Audio input - User speaks
2. Transcription (STT) - Speech-to-text service converts audio to text
3. LLM reasoning - Model receives text, interprets intent, and generates response text
4. Text-to-speech (TTS) - Voice synthesis converts response text to audio
5. Audio output - User hears the response
Then it loops. The user responds, and the cycle repeats.
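To make the loop concrete, here's a minimal sketch of one turn in Python. The three stage functions are placeholders for whichever STT, LLM, and TTS providers you wire in (nothing here is a real SDK call), and a production pipeline would stream audio and tokens rather than processing complete chunks.

```python
# Minimal sketch of one conversational turn. The three stage functions are
# placeholders, not a real provider SDK; swap in your STT, LLM, and TTS calls.

def transcribe_audio(audio_chunk: bytes) -> str:
    """Placeholder STT: send audio to a speech-to-text service."""
    raise NotImplementedError

def generate_reply(history: list[dict], user_text: str) -> str:
    """Placeholder LLM call: interpret intent and draft a response."""
    raise NotImplementedError

def synthesize_speech(text: str) -> bytes:
    """Placeholder TTS: convert response text back to audio."""
    raise NotImplementedError

def run_turn(history: list[dict], audio_chunk: bytes) -> bytes:
    user_text = transcribe_audio(audio_chunk)        # stage 2: transcription
    reply_text = generate_reply(history, user_text)  # stage 3: LLM reasoning
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": reply_text})
    return synthesize_speech(reply_text)             # stage 4: text-to-speech
```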
There are a lot of moving parts and a lot of knobs you can turn to tune this experience. You can use different transcribers to get better-quality text, smarter LLMs for better answers or faster LLMs for lower latency, layer on turn detection; the list goes on.
It's a beautiful system of tradeoffs that need to be balanced to provide a good experience, and the right balance varies across industries (sales, support, etc.) and use cases.
Tradeoffs
Latency
This is one of the most critical pieces to tune and get right. Pausing for longer than a second or two sounds awkward and robotic. The most important part, usually above response quality, is that the agent sounds natural, and that means sub-second latency. All five steps above need to complete in under a second to keep the conversation flowing and the response sounding natural. If one step lags, the delay propagates downstream.
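For a rough sense of how tight that is, here's an illustrative budget; the numbers are assumptions I've made up for the example, not measurements, but they show how little room each stage gets.

```python
# Illustrative per-stage latency budget (assumed numbers, not measurements).
LATENCY_BUDGET_MS = {
    "turn detection / endpointing": 200,
    "transcription (STT)": 150,
    "LLM time-to-first-token": 350,
    "TTS time-to-first-audio": 150,
    "network and playback": 100,
}

total_ms = sum(LATENCY_BUDGET_MS.values())
print(f"total budget: {total_ms} ms")  # 950 ms, barely under one second
```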
Transcription: Garbage in = Garbage Out
I've found that the worst call recordings all come down to poor-quality transcriptions. Sometimes this is because of bad audio, the user talking quietly or in a loud place, an accent that causes words to be heard incorrectly, or just a low-quality transcriber. Back to the five steps above: if transcription in step 2 is bad, then the next three steps will be bad too; again, issues propagate.
Voice Models: Speed vs Correctness
Similar to ChatGPT or Cursor, we're always trading off which models to use in which situations. Usually it's about quality vs. cost, but for voice I find the tradeoff is more about speed vs. correctness. Cost is important, but speed is a higher priority in order to keep the conversation flowing. Lightweight models give snappy responses but sometimes lack quality. Finding the middle ground that works for your use case is key.
Challenges
Start/Stop Speaking and Interruptions
Snappy, well-timed responses make the conversation flow naturally, but it takes a lot of tuning to get there. Responses that are too snappy cause the agent to barge in while the user is still talking; not snappy enough sounds delayed and robotic. I've found that when a question is asked, many users take a moment of silence to think before responding, or say "ummm…yeah," which can trigger the assistant into thinking it's time to respond. Then, as the user's audio keeps coming through, the agent quickly stops because it's been interrupted, leading to a choppy experience. On the other hand, a slow start feels unnatural and robotic if there's always a two-second pause after you're done speaking. Striking a balance is key.
Similarly, knowing when to stop is important. If a user makes any amount of noise, the agent could think they're trying to speak and pause, and then the conversation has to be pulled out of the assistant with repeated "please continue" or "go ahead" prompts to get it talking again. Should it wait for the user to say several words before stopping, or stop after hearing the user speak for x seconds? All of these behaviors need to be tuned.
End-of-turn (EOT) detection is another aspect to be aware of. There are many different approaches to detecting when it's the agent's turn to start speaking: some use the transcript, some use the audio, and some use both together. A poor EOT signal exacerbates the issues described above. It's hard to give snappy responses if we don't know it's our turn to speak, and hard to avoid interruptions if we get an EOT event while the user is still finishing their thought.
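As a toy illustration of the tradeoff, here's a sketch of a purely silence-based end-of-turn detector; the threshold is an assumption, and real systems combine VAD with the transcript or a dedicated turn-detection model rather than relying on silence alone.

```python
import time

class EndOfTurnDetector:
    """Toy silence-based EOT: a short threshold is snappy but barges in on
    pauses; a long threshold avoids interruptions but feels robotic."""

    def __init__(self, silence_threshold_s: float = 0.7):
        self.silence_threshold_s = silence_threshold_s
        self.last_speech_time = None

    def on_vad_frame(self, user_is_speaking: bool) -> bool:
        """Call on every VAD frame; returns True when it's the agent's turn."""
        now = time.monotonic()
        if user_is_speaking:
            self.last_speech_time = now
            return False
        if self.last_speech_time is None:
            return False  # the user hasn't spoken yet
        return (now - self.last_speech_time) >= self.silence_threshold_s
```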
Voicemail Detection and Automated Messaging
Personally, I get a lot of spam-bot voicemails, and I've noticed that a lot of them leave an awkward 15-20s silence before speaking. This turns out to be a very difficult problem to solve. There are many good voicemail detection providers (Google, Twilio, etc.), but the edge cases are vast. What happens if someone picks up while the agent is leaving a message? What if you need to press 1 to leave a message? Many of the providers try to listen for a beep or some other indicator that the greeting is over and recording has started, but it's very difficult to pinpoint, so detection is usually delayed or might not happen at all. In this case it's common to set a max wait time so the voicemail script starts after x seconds if the beep isn't heard. Obviously this is a crude approach, but in many cases it's the best we can do.
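Here's a sketch of that crude fallback; the max wait, the `beep_detected` event, and the `play_message` callback are all assumptions for the sake of the example.

```python
import threading

def schedule_voicemail_script(play_message, beep_detected: threading.Event,
                              max_wait_s: float = 8.0) -> None:
    """Play the voicemail script once a beep is detected, or after max_wait_s
    if no beep is ever heard (the crude timeout fallback)."""
    def _worker():
        beep_detected.wait(timeout=max_wait_s)  # returns early if the beep fires
        play_message()
    threading.Thread(target=_worker, daemon=True).start()
```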
As spam bots become more common, so do defenses against them. Many businesses have a layer of automation to filter out bots with interactive voice response (IVR). I've started to see a lot of automation that asks you to speak your name and purpose of calling, then makes you wait while a person decides whether to pick up. Agents can attempt to navigate these, but similar to voicemail, it's difficult to know when to start speaking.
Testing and Evals
Developing AI-powered software already comes with the new challenges of non-deterministic logic, and voice AI adds even more layers to deal with.
For me this is the biggest challenge. Voice testing is fundamentally harder than testing text-based AI systems. With voice you need to consider:
1. Did it understand the user correctly? (transcription accuracy)
2. Did it respond appropriately? (LLM correctness)
3. Did it sound natural? (TTS quality)
4. Did the timing feel right? (latency, interruption handling)
I'd say step 2 is really the only piece with a common pattern for automated testing. The rest are left to judgment: monitoring live calls and manually experimenting with configurations. In my experience, most people are evaluating these steps manually. At this point I've called myself hundreds of times trying to evaluate tweaks to my configurations.
Transcription accuracy in step 1 is, in my opinion, mostly out of your control beyond selecting a transcriber. We're building applications that leverage transcription services, so our job is to evaluate the leading providers, do some experimentation, and make a selection. I don't think this is a worthwhile step to invest heavy automated testing into, but you do need to constantly take a pulse check on quality by listening to calls and making a judgment. In the long run I'd love to get to the next level by collecting a suite of audio clips from real calls that had transcription issues and using those for automated A/B evaluations of transcriber services…but for now the cost-benefit isn't worth it.
Step 2 is the traditional evals that are starting to become common practice. You can use online and offline evaluations (see the Langfuse docs https://langfuse.com/docs/evaluation/overview#online--offline-evaluation for a good overview) to isolate the LLM input/output and assess the quality of answers. I'd say this is the step that's best supported right now. Invest heavily in this step! It's table stakes for getting voice AI into production and working reliably.
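As a sketch of what that looks like for the LLM step in isolation, here's a tiny offline eval harness. The test cases, `call_llm`, and `call_judge` are placeholders I made up for illustration; in practice you'd run this through an eval tool like the ones covered in the Langfuse docs rather than a hand-rolled loop.

```python
# Toy offline eval for step 2 (LLM correctness), isolated from audio entirely.
CASES = [
    {"input": "I'd like to reschedule my appointment to Friday.",
     "expectation": "Offers available Friday slots and confirms the change."},
    {"input": "What are your business hours?",
     "expectation": "States the configured business hours accurately."},
]

def call_llm(user_text: str) -> str:
    """Placeholder: your assistant's LLM step, given the transcribed text."""
    raise NotImplementedError

def call_judge(user_text: str, reply: str, expectation: str) -> bool:
    """Placeholder LLM-as-judge: does the reply satisfy the expectation?"""
    raise NotImplementedError

def run_offline_eval() -> float:
    passed = sum(
        call_judge(case["input"], call_llm(case["input"]), case["expectation"])
        for case in CASES
    )
    return passed / len(CASES)
```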
Natural-sounding audio in step 3 is, like step 1, very difficult to test. It's mostly a manual process that requires your judgment. Some of these features are in your control and can be prompted or tuned; for example, Cartesia Sonic 3 has syntax for including [laughter] or wrapping text like "Oh wow" in an <emotion value="excited"> tag. In general, though, I've found this to be something that's pretty good out of the box, and it's more about experimenting with different voice models to see how they fit your use case. Do you want a man or a woman? Soft-spoken or assertive? Do you want to clone your own voice? The innovation happening right now is around perfecting pacing, intonation, natural pauses and imperfections (ums and ahs), emotion, and laughter.
Finally, step 4 (timing, latency, start/stop, interruption handling, etc.) is difficult to automate. The best way to start is to manually experiment, push the limits, and see how it feels. Try interrupting at different times or leaving long pauses while you think between answers to see if it interrupts you. You can configure what happens on an interruption, how much the user needs to say to cause an interruption (does an "um" cause the agent to stop?), how long it waits to respond, and what happens if there's silence. All of these are tunable and can be tweaked to get the AI behavior just right. I've found that LLM-as-judge online evaluations of real calls are useful for detecting behavior outside your expected bounds. For example, you can feed the full call log to an LLM and have it evaluate latency and interruptions based on the event log, which does a much better job than relying only on the call transcript.
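Here's roughly what that looks like as a sketch; the event schema, the thresholds in the judge prompt, and the `judge` call are assumptions, and the point is only that the judge sees timestamps and events rather than just words.

```python
# Toy online check: ask an LLM judge to flag timing/interruption problems
# from the call's event log rather than its transcript alone.
JUDGE_PROMPT = """You are reviewing a voice-agent call event log.
Flag any of the following:
- response latency over 1.5 seconds after the user's end of turn
- the agent starting to speak while the user was still talking
- the agent stopping mid-sentence because of background noise
Return JSON: {"flags": [...], "ok": true or false}."""

def judge(prompt: str, event_log: list[dict]) -> dict:
    """Placeholder LLM call that returns the judge's verdict as a dict."""
    raise NotImplementedError

def evaluate_call(event_log: list[dict]) -> dict:
    # Event entries might look like {"t_ms": 1234, "type": "user_speech_start"}.
    return judge(JUDGE_PROMPT, event_log)
```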
Selecting a Provider
There's a range of options when choosing a voice provider. To start, it's worth mentioning that I'm not considering low-level frameworks like LiveKit or directly integrating OpenAI + Twilio here; we're looking for a fully managed voice agent platform with batteries included. All of the options below let you spin up a basic functioning agent in just a few hours. With that said, some of the popular options are:
- Bland AI: fastest and easiest, but more limited in conversational depth and custom logic.
- Retell AI: middle ground; strong voice-agent capabilities, good for operations with some engineering.
- Vapi: deepest customization, ideal if you're building a voice product and need full control.
All are fully featured and have a great UX for building voice agents, but they sit at different levels of abstraction. Vapi (short for Voice API) is built for developers, with a focus on the API and more configuration options, while Retell and Bland are higher-level abstractions that pre-select sane defaults and try not to overwhelm users with configuration. Retell and Bland are also focused on workflow-based designers, while Vapi is fully prompt-based (although it has a beta workflow feature too). I've heard Bland has good edge-case handling mechanisms, while Vapi's prompt-based instructions leave a bit more judgment to the LLM, which can be more flexible and dynamic but also less predictable. I haven't noticed significant differences in pricing between these options. There are some nuances around buying phone numbers and call credits vs. providing your own, but those weren't material differences from what I've seen.
Why We Chose Vapi
We ended up going with Vapi for a variety of reasons, and so far it's been a good decision. I'll go into more detail in my next post, but some of the reasons we chose Vapi:
- We liked that it was batteries included
- Targeting engineers building products - strikes the right balance of abstractions and technical control
- Prompt-based instructions offer flexibility that matched our product well
- The range of options to experiment with is huge and growing fast
- The team is shipping features super fast
- The support and Discord community seemed lively
I'll admit that the number of configurations to experiment with in Vapi is overwhelming, but at the same time having more control is an advantage. The support and Discord are lively but also overwhelming; I don't usually get answers from humans, but many of the things I've requested or flagged as bugs get fixed extremely fast even without a direct reply. They're listening to their users, and they're shipping code super fast.