Voice Call Architecture
The platform supports real-time voice conversations between users and agents using LiveKit for WebRTC signaling and media transport, and OpenAI's Realtime API for speech-to-speech processing. This is not a "transcribe, process text, synthesize speech" pipeline -- the OpenAI Realtime API handles speech-to-speech natively, which produces more natural conversations with lower latency.
System Overview
How a Call Starts
- The user initiates a voice call from the web UI
- VideoCallResource creates a LiveKit room and generates participant tokens
- For each agent in the conversation, the platform sends a POST request to {agentServiceUrl}/voice/join (see the sketch after this list) with:
  - A LiveKit token that allows the agent to join the room
  - A callback URL for the agent to report status
- The agent's voice handler connects to the LiveKit room as a participant
- The handler opens an OpenAI Realtime session
- Audio flows: User microphone -> LiveKit -> Agent -> OpenAI Realtime API -> Agent -> LiveKit -> User speakers
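The invite step above can be sketched in a few lines. This is a minimal illustration, assuming the LiveKit Python server SDK (livekit-api) for token minting and a JSON body for the /voice/join request; the field names, credentials, and URLs below are placeholders rather than the platform's actual contract.

```python
# Sketch of the call-setup step: mint a room-scoped LiveKit token for an agent
# and ask its container to join. Payload field names and URLs are assumptions.
import requests
from livekit import api  # pip install livekit-api

LIVEKIT_API_KEY = "lk-api-key"                          # placeholder
LIVEKIT_API_SECRET = "lk-api-secret"                    # placeholder
AGENT_SERVICE_URL = "http://agent-scout:8080"           # hypothetical agent container
CALLBACK_URL = "https://platform.example/voice/status"  # hypothetical status callback

def invite_agent(room_name: str, agent_identity: str) -> None:
    # Token that lets this specific agent join this specific room.
    token = (
        api.AccessToken(LIVEKIT_API_KEY, LIVEKIT_API_SECRET)
        .with_identity(agent_identity)
        .with_grants(api.VideoGrants(room_join=True, room=room_name))
        .to_jwt()
    )
    # POST {agentServiceUrl}/voice/join with the token and a callback URL.
    response = requests.post(
        f"{AGENT_SERVICE_URL}/voice/join",
        json={"livekitToken": token, "callbackUrl": CALLBACK_URL},
        timeout=10,
    )
    response.raise_for_status()

invite_agent(room_name="call-1234", agent_identity="agent-scout")
```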
Voice Handler
The voice handler runs inside the agent container. When it receives a /voice/join request, it connects to the LiveKit room and opens an OpenAI Realtime session. The handler is responsible for bridging audio between the LiveKit room and the OpenAI Realtime API, and for managing the agent's available tools during the voice session.
The Voice Handler in Detail
The voice handler runs inside the agent container (the same container image used for all agents). When it receives a /voice/join request:
- It connects to the LiveKit room using the provided token
- It creates an OpenAI Realtime session with the agent's system prompt and available tools
- It begins the audio streaming loop (see the sketch after this list)
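Those three steps might look roughly like the following, assuming the LiveKit Agents Python SDK with its OpenAI Realtime plugin (livekit-agents plus livekit-plugins-openai). The platform triggers the handler through its own /voice/join endpoint rather than LiveKit's job dispatch, so treat the worker boilerplate, class names, and options here as version-dependent illustration, not the handler's actual code.

```python
# Illustrative voice-handler entrypoint: join the room, open an OpenAI Realtime
# session, and start streaming audio. Assumes livekit-agents 1.x; the prompt
# and voice are placeholders.
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import openai

async def entrypoint(ctx: agents.JobContext):
    # 1. Connect to the LiveKit room with the token the handler was given.
    await ctx.connect()

    # 2. Create a speech-to-speech Realtime session (no separate STT/TTS steps).
    session = AgentSession(
        llm=openai.realtime.RealtimeModel(voice="alloy"),
    )

    # 3. Begin the audio streaming loop, bridging room audio to the Realtime API.
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are this agent's voice persona."),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```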
Voice Tools
During a voice call, agents have access to a set of tools that are registered with the LiveKit agent framework. When the OpenAI Realtime API decides to call a tool (for example, to look up information or perform an action), it sends a function call through the LiveKit agent framework, which executes the tool and returns the result to the ongoing conversation. The available tools include the same capabilities the agent has in text conversations, adapted for the real-time voice context.
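As a hedged illustration of how a capability might be exposed to the Realtime session, the sketch below uses the function_tool decorator from livekit-agents; the tool name, signature, and stubbed lookup are hypothetical, not the platform's actual tool set.

```python
# Hypothetical voice tool registered with the LiveKit agent framework. When the
# OpenAI Realtime API emits a matching function call, the framework runs this
# method and feeds the result back into the ongoing conversation.
from livekit.agents import Agent, RunContext, function_tool

class VoiceAgent(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are this agent's voice persona.")

    @function_tool()
    async def lookup_document(self, context: RunContext, query: str) -> str:
        """Look up information the user asked about and return a short summary."""
        # Placeholder body: the real handler would reuse the agent's
        # text-conversation capability, adapted for the voice context.
        return f"(stub) No results found for {query!r}."
```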
Single Active Speaker Protocol
In multi-agent conversations, only one agent speaks at a time. This prevents the chaotic experience of multiple agents talking over each other. The protocol works as follows:
- When a user speaks, the system determines which agent should respond based on the conversation context
- The responding agent processes the speech and replies
- Other agents listen but do not speak until it is their turn (a minimal sketch of this single-speaker lock follows)
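The routing logic is platform-specific rather than a library feature. As a purely hypothetical illustration, a single-active-speaker lock could be as simple as the following:

```python
# Hypothetical single-active-speaker lock: one agent holds the turn and the
# others stay silent until the turn is reassigned. Real routing would use
# richer conversation context than a name match.
from dataclasses import dataclass

@dataclass
class TurnManager:
    agents: list[str]
    active: str | None = None

    def route_user_utterance(self, addressed_agent: str | None) -> str:
        # A direct name mention wins; otherwise keep the current speaker,
        # falling back to the first agent on the very first turn.
        if addressed_agent in self.agents:
            self.active = addressed_agent
        elif self.active is None:
            self.active = self.agents[0]
        return self.active

    def may_speak(self, agent: str) -> bool:
        return agent == self.active

turns = TurnManager(agents=["scout", "sage"])
assert turns.route_user_utterance("scout") == "scout"
assert turns.may_speak("scout") and not turns.may_speak("sage")
```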
Turn Switching
Turn switching determines which agent gets to speak next in a multi-agent call. It can happen in two ways:
- User-initiated: The user addresses a different agent by name (e.g., "Hey Scout, what do you think?"), or the conversation context naturally shifts to another agent's domain. The system detects the name mention and routes the next turn to the addressed agent.
- Agent-initiated: An agent can request the turn using a request_turn tool. For example, if Agent A is answering a question but realizes Agent B has the relevant expertise, Agent A can call request_turn to hand off. The current agent finishes its response, and the next turn is given to the requested agent (see the sketch below).
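A sketch of what the request_turn tool could look like, again assuming the function_tool decorator from livekit-agents; the handoff endpoint and payload shape are assumptions for illustration only.

```python
# Hypothetical request_turn tool: the speaking agent asks the platform to give
# the next turn to another agent. Endpoint URL and payload shape are assumed.
import aiohttp
from livekit.agents import Agent, RunContext, function_tool

PLATFORM_TURN_URL = "https://platform.example/voice/turn"  # placeholder

class HandoffAgent(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="Hand off when another agent is a better fit.")

    @function_tool()
    async def request_turn(self, context: RunContext, target_agent: str, reason: str) -> str:
        """Ask the platform to route the next speaking turn to target_agent."""
        async with aiohttp.ClientSession() as http:
            await http.post(
                PLATFORM_TURN_URL,
                json={"nextAgent": target_agent, "reason": reason},
            )
        # The current agent still finishes its response; the platform then
        # gives the next turn to the requested agent.
        return f"Turn requested for {target_agent}."
```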
Design Decisions
Why LiveKit instead of a simpler WebSocket approach? LiveKit provides production-grade WebRTC infrastructure -- STUN/TURN servers, network traversal, adaptive bitrate, echo cancellation, and noise suppression. Building this from scratch would be a multi-month effort, and getting it wrong produces a terrible user experience (choppy audio, echo, high latency).
Why OpenAI Realtime API instead of separate STT + TTS? A pipeline of speech-to-text, LLM processing, and text-to-speech adds latency at each step (typically 2-4 seconds total). The Realtime API processes speech-to-speech in a single step with sub-second latency, producing conversations that feel natural rather than like talking to a slow voice assistant.
Why are voice tools managed differently from text-based tools? The OpenAI Realtime API manages its own tool-calling loop as part of the streaming audio session. Voice tools are registered directly with the LiveKit agent framework so that tool calls stay within the same session. This avoids the need to break out of the audio stream, call an external tool system, and re-enter the stream, which reduces latency and complexity.
Limitations
- Voice calls require a stable internet connection with sufficient bandwidth for WebRTC audio
- The OpenAI Realtime API is the only supported speech-to-speech backend -- there is no fallback
- Multi-agent voice calls work but increase latency as more agents join (each needs its own OpenAI Realtime session)
See Also
- Start a Voice/Video Call -- step-by-step guide to initiating a voice call with an agent
- Voice Call with an Agent Tutorial -- end-to-end walkthrough of a voice conversation