Voice Call Architecture
The platform supports real-time voice conversations between users and agents using LiveKit for WebRTC signaling and media transport, and OpenAI's Realtime API for speech-to-speech processing. This is not a "transcribe, process text, synthesize speech" pipeline -- the OpenAI Realtime API handles speech-to-speech natively, which produces more natural conversations with lower latency.
System Overview
How a Call Starts
- The user initiates a voice call from the web UI
- VideoCallResource creates a LiveKit room and generates participant tokens
- For each agent in the conversation, the platform sends a POST request to {agentServiceUrl}/voice/join (see the sketch after this list) with:
  - A LiveKit token that allows the agent to join the room
  - A callback URL for the agent to report status
- The agent's voice handler connects to the LiveKit room as a participant
- The handler opens an OpenAI Realtime session
- Audio flows: User microphone -> LiveKit -> Agent -> OpenAI Realtime API -> Agent -> LiveKit -> User speakers
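The invite step above can be sketched in a few lines. This is a minimal illustration, assuming the LiveKit Python server SDK (livekit-api) for token minting and a JSON body for the /voice/join request; the field names, credentials, and URLs below are placeholders rather than the platform's actual contract.

```python
# Sketch of the call-setup step: mint a room-scoped LiveKit token for an agent
# and ask its container to join. Payload field names and URLs are assumptions.
import requests
from livekit import api  # pip install livekit-api

LIVEKIT_API_KEY = "lk-api-key"                          # placeholder
LIVEKIT_API_SECRET = "lk-api-secret"                    # placeholder
AGENT_SERVICE_URL = "http://agent-scout:8080"           # hypothetical agent container
CALLBACK_URL = "https://platform.example/voice/status"  # hypothetical status callback

def invite_agent(room_name: str, agent_identity: str) -> None:
    # Token that lets this specific agent join this specific room.
    token = (
        api.AccessToken(LIVEKIT_API_KEY, LIVEKIT_API_SECRET)
        .with_identity(agent_identity)
        .with_grants(api.VideoGrants(room_join=True, room=room_name))
        .to_jwt()
    )
    # POST {agentServiceUrl}/voice/join with the token and a callback URL.
    response = requests.post(
        f"{AGENT_SERVICE_URL}/voice/join",
        json={"livekitToken": token, "callbackUrl": CALLBACK_URL},
        timeout=10,
    )
    response.raise_for_status()

invite_agent(room_name="call-1234", agent_identity="agent-scout")
```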
Voice Handler
The voice handler runs inside the agent container. When it receives a /voice/join request, it connects to the LiveKit room and opens an OpenAI Realtime session. The handler is responsible for bridging audio between the LiveKit room and the OpenAI Realtime API, and for managing the agent's available tools during the voice session.
The Voice Handler in Detail
The voice handler runs inside the agent container (the same container image used for all agents). When it receives a /voice/join request:
- It connects to the LiveKit room using the provided token
- It creates an OpenAI Realtime session with the agent's system prompt and available tools
- It begins the audio streaming loop (see the sketch after this list)
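Those three steps might look roughly like the following, assuming the LiveKit Agents Python SDK with its OpenAI Realtime plugin (livekit-agents plus livekit-plugins-openai). The platform triggers the handler through its own /voice/join endpoint rather than LiveKit's job dispatch, so treat the worker boilerplate, class names, and options here as version-dependent illustration, not the handler's actual code.

```python
# Illustrative voice-handler entrypoint: join the room, open an OpenAI Realtime
# session, and start streaming audio. Assumes livekit-agents 1.x; the prompt
# and voice are placeholders.
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import openai

async def entrypoint(ctx: agents.JobContext):
    # 1. Connect to the LiveKit room with the token the handler was given.
    await ctx.connect()

    # 2. Create a speech-to-speech Realtime session (no separate STT/TTS steps).
    session = AgentSession(
        llm=openai.realtime.RealtimeModel(voice="alloy"),
    )

    # 3. Begin the audio streaming loop, bridging room audio to the Realtime API.
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are this agent's voice persona."),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```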
Voice Tools
During a voice call, agents have access to a set of tools that are registered with the LiveKit agent framework. When the OpenAI Realtime API decides to call a tool (for example, to look up information or perform an action), it sends a function call through the LiveKit agent framework, which executes the tool and returns the result to the ongoing conversation. The available tools include the same capabilities the agent has in text conversations, adapted for the real-time voice context.
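As a hedged illustration of how a capability might be exposed to the Realtime session, the sketch below uses the function_tool decorator from livekit-agents; the tool name, signature, and stubbed lookup are hypothetical, not the platform's actual tool set.

```python
# Hypothetical voice tool registered with the LiveKit agent framework. When the
# OpenAI Realtime API emits a matching function call, the framework runs this
# method and feeds the result back into the ongoing conversation.
from livekit.agents import Agent, RunContext, function_tool

class VoiceAgent(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are this agent's voice persona.")

    @function_tool()
    async def lookup_document(self, context: RunContext, query: str) -> str:
        """Look up information the user asked about and return a short summary."""
        # Placeholder body: the real handler would reuse the agent's
        # text-conversation capability, adapted for the voice context.
        return f"(stub) No results found for {query!r}."
```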
Single Active Speaker Protocol
In multi-agent conversations, only one agent speaks at a time. This prevents the chaotic experience of multiple agents talking over each other. The protocol works as follows:
- When a user speaks, the system determines which agent should respond based on the conversation context
- The responding agent processes the speech and replies
- Other agents listen but do not speak until it is their turn (a minimal sketch of this single-speaker lock follows)
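The routing logic is platform-specific rather than a library feature. As a purely hypothetical illustration, a single-active-speaker lock could be as simple as the following:

```python
# Hypothetical single-active-speaker lock: one agent holds the turn and the
# others stay silent until the turn is reassigned. Real routing would use
# richer conversation context than a name match.
from dataclasses import dataclass

@dataclass
class TurnManager:
    agents: list[str]
    active: str | None = None

    def route_user_utterance(self, addressed_agent: str | None) -> str:
        # A direct name mention wins; otherwise keep the current speaker,
        # falling back to the first agent on the very first turn.
        if addressed_agent in self.agents:
            self.active = addressed_agent
        elif self.active is None:
            self.active = self.agents[0]
        return self.active

    def may_speak(self, agent: str) -> bool:
        return agent == self.active

turns = TurnManager(agents=["scout", "sage"])
assert turns.route_user_utterance("scout") == "scout"
assert turns.may_speak("scout") and not turns.may_speak("sage")
```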
Turn Switching
Turn switching determines which agent gets to speak next in a multi-agent call. It can happen in two ways:
- User-initiated: The user addresses a different agent by name (e.g., "Hey Scout, what do you think?"), or the conversation context naturally shifts to another agent's domain. The system detects the name mention and routes the next turn to the addressed agent.
- Agent-initiated: An agent can request the turn using a request_turn tool. For example, if Agent A is answering a question but realizes Agent B has the relevant expertise, Agent A can call request_turn to hand off. The current agent finishes its response, and the next turn is given to the requested agent (see the sketch below).
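A sketch of what the request_turn tool could look like, again assuming the function_tool decorator from livekit-agents; the handoff endpoint and payload shape are assumptions for illustration only.

```python
# Hypothetical request_turn tool: the speaking agent asks the platform to give
# the next turn to another agent. Endpoint URL and payload shape are assumed.
import aiohttp
from livekit.agents import Agent, RunContext, function_tool

PLATFORM_TURN_URL = "https://platform.example/voice/turn"  # placeholder

class HandoffAgent(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="Hand off when another agent is a better fit.")

    @function_tool()
    async def request_turn(self, context: RunContext, target_agent: str, reason: str) -> str:
        """Ask the platform to route the next speaking turn to target_agent."""
        async with aiohttp.ClientSession() as http:
            await http.post(
                PLATFORM_TURN_URL,
                json={"nextAgent": target_agent, "reason": reason},
            )
        # The current agent still finishes its response; the platform then
        # gives the next turn to the requested agent.
        return f"Turn requested for {target_agent}."
```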
Design Decisions
Why LiveKit instead of a simpler WebSocket approach? LiveKit provides production-grade WebRTC infrastructure -- STUN/TURN servers, network traversal, adaptive bitrate, echo cancellation, and noise suppression. Building this from scratch would be a multi-month effort, and getting it wrong produces a terrible user experience (choppy audio, echo, high latency).
Why OpenAI Realtime API instead of separate STT + TTS? A pipeline of speech-to-text, LLM processing, and text-to-speech adds latency at each step (typically 2-4 seconds total). The Realtime API processes speech-to-speech in a single step with sub-second latency, producing conversations that feel natural rather than like talking to a slow voice assistant.
Why are voice tools managed differently from text-based tools? The OpenAI Realtime API manages its own tool-calling loop as part of the streaming audio session. Voice tools are registered directly with the LiveKit agent framework so that tool calls stay within the same session. This avoids the need to break out of the audio stream, call an external tool system, and re-enter the stream, which reduces latency and complexity.
Limitations
- Voice calls require a stable internet connection with sufficient bandwidth for WebRTC audio
- The OpenAI Realtime API is the only supported speech-to-speech backend -- there is no fallback
- Multi-agent voice calls work but increase latency as more agents join (each needs its own OpenAI Realtime session)
See Also
- Start a Voice/Video Call -- step-by-step guide to initiating a voice call with an agent
- Voice Call with an Agent Tutorial -- end-to-end walkthrough of a voice conversation