A deep dive into how voice agents are evolving: from classic speech-to-text and text-to-speech pipelines to full speech-to-speech systems, edge streaming infrastructure, and real-time actions during voice flows. We compare platforms (ElevenLabs, Vapi, LiveKit, Pipecat) and their trade-offs in latency, privacy, and compliance, and look at how businesses are getting ROI from deploying these agents.

Figure 1: From Conversations To Conversions
The rise of real-time voice AI agents is transforming how intelligent systems interact with humans, from call-center automation and virtual assistants to voice-first UIs. At the heart of this transformation is a race between emerging platforms like ElevenLabs, LiveKit, Pipecat, and Vapi, each tackling a different layer of the voice agent stack.
Adopting generative voice agents isn't just a technical upgrade; it's essential for staying competitive and meeting business goals. According to Gartner-backed reporting in The Wall Street Journal, venture funding into voice-AI startups surged from $315 million in 2022 to $2.1 billion by 2024, and Gartner predicts 75% of new contact centers will adopt generative AI by 2028 [1].
Meanwhile, analyst firms forecast the total conversational-AI market expanding from around $13 billion in 2024 to nearly $50 billion by 2030 [2]. User-facing voice agents span customer service, healthcare, banking, and retail, with ROI often measured through labor savings, better resolution rates, and round-the-clock availability.
Implementers like Synthflow AI claim sub-400 ms response times and serve over 1,000 enterprise clients in healthcare, finance, and education, which is evidence of booming business demand [3]. In healthcare, Cencora's Eva, an Infinitus AI agent, now completes as many calls as over 100 full-time staff and processes requests four times as fast as before [4]. Their tests even showed that Eva caught and addressed inconsistent information that came up during verification calls.
According to a Cencora case study [5], the adoption of AI-powered prioritized outreach produced:
- A 24% increase in dormant patients who started treatment.
- An 8% increase in re-initiation of therapy among patients with a high non-adherence rate.
- A 25% reduction in nurse call interventions, enabling the reallocation of nurses to other necessary interventions.
1. Key Business Metrics & Value Levers
a. AHT (Average Handle Time)
- What it is: The average time someone spends on a call, including talking, being on hold, and post-call wrap-up.
- Why it matters: Quicker calls mean faster support and lower staffing costs.
- What voice AI does: Automating call routing, note-taking, and summaries helped shorten each call by around 12% in a real-world trial at AGIA, saving them $87,000 a year while also reducing hold time by 23% [6].
b. FCR (First-Call Resolution Rate)
- What it is: The percentage of calls where customers get their issue fixed on the first call, with no callbacks needed.
- Why it matters: Resolving calls right away improves customer satisfaction and avoids extra follow-up costs.
- What voice AI does: Smart assistants guide the caller step by step and check answers in real time, boosting "first touch" success by up to 25% in real implementations [7].
c. Agent Capacity
- What it is: How many calls each human agent or team member can handle in an hour.
- Why it matters: Support teams become more efficient when paired with automation.
- What voice AI does: Companies deploying voice agents report using 40–50% fewer agents while handling 20–30% more calls. In effect, each person handles far more by working smarter, not harder [7].
d. Operational Cost Savings
- What it includes: All call-center costs: agent salaries, telecom expenses, and the cost of handling callbacks.
- What voice AI does: By automating routine follow-ups and repetitive questions, businesses can cut total call-center costs by 50% or more. Some case studies even report up to 70% savings within a month of launching a voice agent solution [8].
Voice agents can handle low-value or repetitive calls 24/7 while humans focus on exceptions, increasing quality per unit of cost.
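To make these levers concrete, here is a back-of-envelope sketch of how AHT reduction and call deflection translate into annual savings. Every input below is an illustrative assumption, not a figure from the case studies above.

```python
# Back-of-envelope ROI model for a voice agent deployment.
# All inputs are illustrative assumptions.

calls_per_month = 50_000          # inbound call volume
avg_handle_time_min = 6.0         # AHT before automation, in minutes
aht_reduction = 0.12              # e.g., ~12% AHT cut from routing/summaries
deflection_rate = 0.30            # share of calls fully handled by the agent
cost_per_human_min = 0.85         # fully loaded human cost per talk-minute
cost_per_agent_min = 0.10         # assumed per-minute price of the voice agent

human_minutes_before = calls_per_month * avg_handle_time_min
# Deflected calls cost agent minutes; remaining human calls get shorter.
deflected_minutes = calls_per_month * deflection_rate * avg_handle_time_min
human_minutes_after = (calls_per_month * (1 - deflection_rate)
                       * avg_handle_time_min * (1 - aht_reduction))

cost_before = human_minutes_before * cost_per_human_min
cost_after = (human_minutes_after * cost_per_human_min
              + deflected_minutes * cost_per_agent_min)

annual_savings = (cost_before - cost_after) * 12
print(f"Monthly cost before: ${cost_before:,.0f}")
print(f"Monthly cost after:  ${cost_after:,.0f}")
print(f"Estimated annual savings: ${annual_savings:,.0f}")
```

With these assumed inputs the model lands in the same ballpark as the 50%+ savings reported above; the point is that deflection rate, not AHT alone, dominates the outcome.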
2. From Pipeline to Unified Speech-to-Speech: What's Changing
a. Traditional Pipeline Architecture
- User speaks → STT (Speech-to-Text) → LLM (Large Language Model) → TTS (Text-to-Speech) → Voice output
- Components are modular, with streaming improvements, but execution is still sequential
- A reliable way to convert an existing LLM-based application into a voice agent
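A minimal sketch of that sequential loop, in Python. The three helpers are hypothetical stand-ins for whichever STT, LLM, and TTS providers you wire in; real deployments stream partial results between stages rather than waiting for each to finish.

```python
async def transcribe(audio: bytes) -> str:
    """Hypothetical STT call (swap in your speech-to-text provider)."""
    ...

async def generate_reply(transcript: str) -> str:
    """Hypothetical LLM call producing the agent's next utterance."""
    ...

async def synthesize(text: str) -> bytes:
    """Hypothetical TTS call returning playable audio."""
    ...

async def handle_turn(user_audio: bytes) -> bytes:
    # Each stage blocks on the previous one; this sequencing is
    # exactly where pipeline latency accumulates.
    transcript = await transcribe(user_audio)
    reply_text = await generate_reply(transcript)
    return await synthesize(reply_text)
```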
b. Next-gen Speech-to-Speech Architecture
- Models ingest raw audio and generate speech directly, bypassing discrete STT (Speech-to-Text) and TTS (Text-to-Speech) steps
- Benefits: lower latency, fewer error-propagation stages, more natural turn-taking
While still in research and early deployment, direct speech-to-speech unlocks richer interruptions, emotional cadence, and pacing that pipeline models struggle with. It also lowers end-point latency and optimizes the compute footprint, potentially reducing infrastructure costs.
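By contrast, a speech-to-speech loop streams audio frames into and out of a single model session, with no intermediate text. The `S2SSession` interface below is an assumption for illustration, since vendor APIs differ.

```python
import asyncio
from typing import AsyncIterator, Protocol

class S2SSession(Protocol):
    """Hypothetical speech-to-speech session: raw audio in, raw audio out."""
    async def send_audio(self, frame: bytes) -> None: ...
    def responses(self) -> AsyncIterator[bytes]: ...

async def converse(mic: AsyncIterator[bytes], session: S2SSession, play) -> None:
    async def uplink() -> None:
        # Stream microphone frames as they arrive -- no STT stage.
        async for frame in mic:
            await session.send_audio(frame)

    async def downlink() -> None:
        # Play model audio the moment frames arrive -- no TTS stage,
        # which is where the latency and turn-taking wins come from.
        async for frame in session.responses():
            play(frame)

    # Up- and downlink run concurrently, so the user can interrupt mid-reply.
    await asyncio.gather(uplink(), downlink())
```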
3. Transport in the Voice AI Agent Stack: WebRTC vs. WebSocket
Choosing the right media transport layer is critical for agent UX, cost, and performance. Here's how WebRTC compares with WebSocket in this context.
a. WebRTC
Typically reduces end-to-end latency well below 150 ms (and as low as 20 ms in optimized settings), thanks to UDP (User Datagram Protocol) transport, jitter buffering, and built-in media optimizations like forward error correction and echo cancellation.
b. WebSocket
May still deliver "fast" audio (~200–400 ms), but it suffers from TCP's head-of-line blocking, packet reordering issues, and buffering delays when packets are lost.
In voice-first AI agents, WebRTC delivers real-time audio that feels closer to natural conversation, especially when paired with frameworks like LiveKit or Pipecat. WebSocket transports may work in demo or prototype scenarios, but often fall short in production under real-world network variability.
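For a quick prototype over WebSocket, the widely used `websockets` library is enough to push audio frames upstream and handle whatever comes back; the endpoint URL and frame format below are assumptions. Note this gives you none of WebRTC's jitter buffering or loss recovery, which is why production agents usually go through an SDK such as LiveKit instead.

```python
import asyncio
import websockets  # pip install websockets

AGENT_URL = "wss://agent.example.com/audio"  # hypothetical endpoint

def play(frame: bytes) -> None:
    """Hypothetical playback hook; swap in your audio output device."""

async def stream_audio(frames: list[bytes]) -> None:
    """Send 20 ms PCM frames upstream and play agent replies as they arrive."""
    async with websockets.connect(AGENT_URL) as ws:
        async def uplink() -> None:
            for frame in frames:           # e.g., 640-byte 16 kHz mono chunks
                await ws.send(frame)       # binary WebSocket message
                await asyncio.sleep(0.02)  # pace at real time
        async def downlink() -> None:
            async for reply in ws:         # agent audio comes back as binary
                play(reply)
        await asyncio.gather(uplink(), downlink())
```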
4. Edge Streaming Infrastructure: Bringing Compute Closer to the User
For real-time Voice AI agents, latency is everything. The difference between a fast, natural-sounding voice agent and one that feels sluggish often comes down to how close your compute and media processing are to the end user.
Edge streaming refers to the practice of capturing, processing, and streaming voice data on infrastructure that's geographically close to users, rather than relying entirely on centralized cloud data centers.
Instead of routing all audio through a distant AWS or GCP region, voice AI agents can run parts of the speech pipeline (media handling, inference, etc.) on regional or even on-device nodes.
This architecture drastically reduces:
- Latency (often below 150 ms)
- Bandwidth and network hops
- Privacy risks, since sensitive data doesn't leave the region
Example Use Case: Voice Healthcare Agent
Imagine a medical refill agent deployed via phone in the U.S.:
- The audio is streamed via WebRTC to an edge node located in the same region.
- A speech-to-speech model runs on an edge GPU, generating live voice responses.
- No transcription or call metadata leaves the region, satisfying HIPAA and reducing risk.
Round-trip latency stays below 250 ms, even with human interruptions.
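A simple way to get this locality is to probe each candidate edge region at session start and route to the lowest round-trip time. The hostnames below are placeholders; real deployments often get this from anycast DNS or the platform's SDK instead.

```python
import socket
import time

# Hypothetical regional edge endpoints.
EDGE_NODES = {
    "us-east": "edge-us-east.example.com",
    "us-west": "edge-us-west.example.com",
    "eu-west": "edge-eu-west.example.com",
}

def probe(host: str, port: int = 443, timeout: float = 1.0) -> float:
    """Rough RTT estimate via TCP connect time (returns seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return float("inf")  # unreachable node loses the race

def nearest_edge() -> str:
    """Pick the region whose endpoint answered fastest."""
    return min(EDGE_NODES, key=lambda region: probe(EDGE_NODES[region]))
```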
5. Tools Integration: Performing Real-World Actions in Real-Time Voice Flows
Modern voice agents don't just talk, they act.
Whether it's scheduling appointments, sending SMS reminders, looking up EHR records, or logging CRM events, real-time voice AI systems must go beyond language to interact with external systems seamlessly during conversation.
In voice agents, especially those using speech-to-speech (S2S) models, this translates to function calling or tool use in response to live audio input, without interrupting flow or requiring explicit textual confirmation.
Agents can be configured with secure function bridges (e.g., via AWS Lambda, REST APIs, gRPC, or pub/sub queues) to execute tasks dynamically based on the real-time context of the conversation.
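A common implementation pattern is a registry of tool handlers keyed by name: when the model emits a structured function call mid-conversation, the bridge dispatches it and feeds the result back without breaking the audio stream. The tool names, schemas, and handlers below are illustrative, not any particular vendor's API.

```python
import json
from typing import Callable

# Registry of tools the agent may invoke mid-conversation (illustrative).
TOOLS: dict[str, Callable[..., dict]] = {}

def tool(name: str):
    """Decorator registering a handler under a tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("schedule_appointment")
def schedule_appointment(patient_id: str, slot: str) -> dict:
    # In production this would hit a calendar/EHR API behind auth.
    return {"status": "booked", "patient_id": patient_id, "slot": slot}

@tool("send_sms")
def send_sms(phone: str, body: str) -> dict:
    # Placeholder for an SMS gateway call (e.g., via a queue or REST API).
    return {"status": "queued", "to": phone}

def dispatch(call_json: str) -> dict:
    """Execute a model-emitted call like
    {"name": "send_sms", "arguments": {"phone": "...", "body": "..."}}."""
    call = json.loads(call_json)
    handler = TOOLS.get(call["name"])
    if handler is None:
        return {"error": f"unknown tool {call['name']}"}
    return handler(**call["arguments"])
```

The same dispatch shape works whether the function call arrives over a text channel or as a structured event inside a live S2S audio session.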
6. Platform Comparison: ElevenLabs, Vapi, LiveKit, and Pipecat
ElevenLabs
Once just a TTS provider, ElevenLabs now offers streaming, real-time voice agents with ultra-natural voices and voice cloning. Its strength is voice quality and fast integration via API, ideal for character-driven or branding-sensitive deployments (e.g., narrated agents, dubbing workflows).
Vapi
A "full-service" agent stack: you write the business logic (often async handlers), and Vapi wires up telephony, backpressure, and STT/LLM/TTS orchestration. Strong for call workflows like appointment booking or outbound sales, as human escalation is built in.
LiveKit
This is infrastructure for real-time audio/video agents. By controlling media transport (WebRTC, edge optimization), LiveKit enables sub-200 ms latency streaming and multiplexed voice logic. The business upside is direct control over quality and on-premise deployments.
Pipecat
Focused on low-latency, interruptible, streaming conversations. It integrates tightly with services like Whisper and GPT and emphasizes on-device or hybrid deployment. Business use cases include privacy-sensitive verticals (finance, healthcare) where compliance and offline operation matter.
Conclusion: Why There's No Single Winner, Only Fit
The voice-agent landscape is evolving rapidly. Businesses must match their use case to the platform's maturity stage:
- Want bright, brand-rich agents fast? ElevenLabs and Vapi let you deploy quickly and start realizing ROI within 4–8 weeks.
- Need control, compliance, or huge call volumes? Building on LiveKit or Pipecat could deliver stronger economics at scale.
Thinking three years out? Your platform should gracefully transition from modular pipelines to end-to-end speech-to-speech. The shift isn't just technical; it affects latency, cost structure, and user experience.
Ultimately, this race is about building agents that feel human, cost a fraction of human labor, and scale like software. Choosing between ElevenLabs, LiveKit, Pipecat, Vapi, and others isn't just about features; it's about strategy: how your business will scale, automate, and differentiate in a voice-first world.
References
- [1] https://www.wsj.com/articles/ai-voice-agents-are-ready-to-take-your-call-a62cf03b
- [2] https://www.forbes.com/councils/forbestechcouncil/2025/01/14/conversational-ai-trends-for-2025-and-beyond/
- [3] https://www.businessinsider.com/synthflow-ai-pitch-deck-funding-voice-2025-6
- [4] https://www.businessinsider.com/voice-ai-healthcare-admin-loneliness-companionship-2025-6
- [5] https://emerj.com/artificial-intelligence-at-cencora/
- [6] https://www.retellai.com/blog/ai-voice-agent-roi-enterprise-communications
- [7] https://www.gnani.ai/resources/blogs/voice-ai-roi-measuring-more-than-just-aht-reduction/
- [8] https://zudu.ai/blog/voice-ai-roi-how-businesses-reduced-call-center-costs-by-70