
Inside Amazon Nova Sonic - The Event-Driven API Behind Real-Time Voice AI

A deep technical exploration of Amazon Nova Sonic’s speech foundation model: how it unifies ASR, LLM, and TTS into a single, event-driven, bidirectional stream API for real-time voice interactions. Understand its session and event flows (session init, audio/text/tool content, completion events), how it handles “barge-in” and adapts tone and expressivity in speech generation, and the implications for latency, context preservation, and building more natural voice experiences.

Figure 1: Amazon Nova Sonic

Traditional approaches to building voice-enabled applications involve complex orchestration of multiple models, such as:

  1. Speech Recognition (ASR) to convert speech to text
  2. Large Language Models (LLMs) to understand and generate responses
  3. Text-To-Speech (TTS) to convert text back to audio

This fragmented approach not only increases development complexity but also fails to preserve crucial acoustic context and nuances like tone, prosody, and speaking style that are essential for natural conversations.
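
For contrast, the cascaded approach can be summarized in a few lines. The functions below are placeholder stubs, not real SDK calls; the point is that every hop is a separate model and only plain text crosses the boundaries between them, so tone and prosody are lost along the way.

```python
# Placeholder stubs standing in for three separately hosted models; in a real
# cascaded stack each of these would be a network call to an ASR, LLM, or TTS service.
def transcribe(audio: bytes) -> str:
    return "what's the weather like today"   # ASR: speech -> text (prosody is discarded here)

def generate_reply(text: str) -> str:
    return "It looks sunny all afternoon."   # LLM: text -> text

def synthesize(text: str) -> bytes:
    return b"\x00\x01\x02\x03"               # TTS: text -> speech (fake PCM bytes)

def handle_turn(audio_in: bytes) -> bytes:
    """One conversational turn through the cascaded pipeline: three sequential
    model calls, each adding latency, with only text passed between stages."""
    return synthesize(generate_reply(transcribe(audio_in)))
```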

A speech foundation model unifies speech understanding and speech generation in a single model, enabling more human-like voice conversations in AI applications. This unification lets the model adapt the generated voice response to the acoustic context (e.g., tone, style) of the spoken input, resulting in more natural dialogue.

The Amazon Nova Sonic model provides real-time, conversational interactions through bidirectional audio streaming. It implements an event-driven architecture through the bidirectional stream API.

This event-driven system enables real-time, low-latency, multi-turn conversations. Key capabilities include:

The bidirectional stream API consists of these three main components:

The overall flow of events is as follows:

  1. Input audio capture format: a 16 kHz sample rate with a mono channel is preferred.

  2. Events that must be sent, in this order, to initialize the session and start audio streaming (a minimal payload sketch follows this sub-list):

Figure 2: Amazon Nova Sonic Input Events Flow

a. Session Start Event

b. Prompt Start Event

c. Audio/Text/Tool Content Start Event

d. Audio/Text/Tool Content Event

e. Content End Event
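
To make the ordering concrete, here is a minimal sketch of the input event sequence as Python dictionaries, including the base64 encoding of the captured 16 kHz mono PCM. The exact field names (inferenceConfiguration, promptName, contentName, audioInputConfiguration, and so on) are illustrative and should be checked against the current Amazon Nova Sonic event schema; the ordering, however, follows the steps above.

```python
import base64
import json
import uuid

prompt_name = str(uuid.uuid4())
content_name = str(uuid.uuid4())

# a. Session start: top-level inference settings (field names are illustrative).
session_start = {"event": {"sessionStart": {
    "inferenceConfiguration": {"maxTokens": 1024, "topP": 0.9, "temperature": 0.7}}}}

# b. Prompt start: opens the prompt that all following content belongs to.
prompt_start = {"event": {"promptStart": {"promptName": prompt_name}}}

# c. Content start: announces an audio content block for the user's speech.
content_start = {"event": {"contentStart": {
    "promptName": prompt_name,
    "contentName": content_name,
    "type": "AUDIO",
    "role": "USER",
    "audioInputConfiguration": {
        "mediaType": "audio/lpcm", "sampleRateHertz": 16000, "channelCount": 1}}}}

# d. Audio input: base64-encoded 16 kHz mono PCM chunks, sent repeatedly while audio is captured.
def audio_input_event(pcm_chunk: bytes) -> dict:
    return {"event": {"audioInput": {
        "promptName": prompt_name,
        "contentName": content_name,
        "content": base64.b64encode(pcm_chunk).decode("utf-8")}}}

# e. Content end: closes the audio content block for this user turn.
content_end = {"event": {"contentEnd": {
    "promptName": prompt_name, "contentName": content_name}}}

# Each event is serialized to JSON and written to the bidirectional request stream in order.
for evt in (session_start, prompt_start, content_start, audio_input_event(b"\x00" * 640), content_end):
    print(json.dumps(evt))
```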

  3. When the Amazon Nova Sonic model responds, it also follows a structured event sequence (see the handler sketch after this sub-list):

Figure 3: Amazon Nova Sonic Output Events Flow

a. Completion Start Event

b. Content Start Event

c. Text Output Event (ASR Transcripts)

d. Tool Use Event (Tool Handling)

e. Audio Output Event (Audio Response)

f. Content End Event

g. Completion End Event
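
On the receiving side, the client can dispatch on the event type it finds in each response message. The sketch below assumes each message is a JSON object with a single key under "event"; the event and field names are illustrative and should be verified against the API documentation. audioOutput carries base64-encoded audio that is decoded and queued for playback, textOutput carries transcripts, and toolUse signals that the application should run a tool and return the result on the input stream.

```python
import base64
import json

def handle_response_message(raw: str, playback_queue: list, transcripts: list) -> None:
    """Dispatch one JSON response message from the stream (a sketch; event and
    field names are illustrative)."""
    event = json.loads(raw).get("event", {})

    if "completionStart" in event:
        pass  # a. the model has started responding to the current prompt
    elif "contentStart" in event:
        pass  # b. a new content block (text, tool use, or audio) is about to stream
    elif "textOutput" in event:
        # c. ASR transcript of the user's speech, or the text of the model's reply
        transcripts.append(event["textOutput"].get("content", ""))
    elif "toolUse" in event:
        # d. the model asks the application to run a tool; the result goes back
        #    to the model as a tool content block on the input stream
        print(f"tool requested: {event['toolUse'].get('toolName')}")
    elif "audioOutput" in event:
        # e. base64-encoded audio of the model's spoken response
        playback_queue.append(base64.b64decode(event["audioOutput"]["content"]))
    elif "contentEnd" in event:
        pass  # f. the current content block is finished
    elif "completionEnd" in event:
        pass  # g. the model has finished this response turn
```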

  4. Events for ending the session (a minimal sketch follows this sub-list):

a. Prompt End Event

b. Session End Event
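
Closing out mirrors the opening sequence: the prompt is ended first, then the session. A minimal sketch, with illustrative field names:

```python
import json

# The same identifier that was used in the promptStart event earlier in the session.
prompt_name = "reuse-the-promptStart-identifier-here"

# a. Prompt end: closes the prompt opened by promptStart.
prompt_end = {"event": {"promptEnd": {"promptName": prompt_name}}}

# b. Session end: signals that the bidirectional stream can be torn down.
session_end = {"event": {"sessionEnd": {}}}

for evt in (prompt_end, session_end):
    print(json.dumps(evt))
```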

  5. Barge-in detection: when the user starts speaking while the model is still responding, Nova Sonic detects the interruption and stops the current response; the client should discard any audio output it has already queued for playback (see the sketch after this list).
  6. Usage metrics: the stream also reports usage information for the conversation.
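
The client-side half of barge-in is worth spelling out: if the interrupted response's audio chunks stay in the playback buffer, the old answer keeps playing over the new one. The sketch below shows one common pattern for flushing that buffer; the is_barge_in predicate is an assumption about how the interruption is signaled and should be confirmed against the actual Nova Sonic event payloads.

```python
import queue

playback_queue: "queue.Queue[bytes]" = queue.Queue()

def is_barge_in(event: dict) -> bool:
    """Hypothetical predicate: how the interruption is surfaced (e.g. a flag on a
    text or content-end event) should be verified against the Nova Sonic docs."""
    return bool(event.get("interrupted", False))

def on_event(event: dict) -> None:
    if is_barge_in(event):
        # Stop playing the old answer: drop every audio chunk already queued,
        # so the model's new response is the next thing the user hears.
        while not playback_queue.empty():
            try:
                playback_queue.get_nowait()
            except queue.Empty:
                break
```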

By understanding the base64 encoding of both audioInput (user speech) and audioOutput (model response), as well as the structured sequence of events with their defined roles and states, we can implement a robust, low-latency streaming voice application using Amazon Nova Sonic.

Prompt Engineering for Speech Foundation Models:

Speech foundation models require a different prompting approach than standard text-based models: content should be optimized for auditory comprehension rather than reading comprehension.

The system prompt steers the model’s output style and lexical choice. It can’t be used to change speech attributes such as accent or pitch; the model decides those speech characteristics based on the context of the conversation.
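
As an illustration of optimizing for the ear rather than the eye, a system prompt for a voice agent might look like the sketch below. The wording is an example, not a recommended template.

```python
# Example system prompt written for auditory comprehension: short sentences,
# no markdown or bullet lists, numbers and abbreviations spelled out.
SYSTEM_PROMPT = (
    "You are a friendly voice assistant for a travel agency. "
    "Keep replies to one or two short sentences. "
    "Do not use lists, headings, or special formatting, since your words will be spoken aloud. "
    "Spell out numbers, dates, and abbreviations the way a person would say them."
)
```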


