TL;DR:

  • Build a real-time voice application using WebRTC and connect it with the RealtimeAgent, including a demo implementation.
  • Optimized for Real-Time Interactions: Experience seamless voice communication with minimal latency and enhanced reliability.

Realtime Voice Applications with WebRTC

In our previous blog post, we introduced the WebSocketAudioAdapter, a simple way to stream real-time audio using WebSockets. While effective, WebSockets can face challenges with quality and reliability in high-latency scenarios or under variable network conditions. Enter WebRTC.

Today, we’re excited to showcase the integration of the OpenAI Realtime API with WebRTC, leveraging WebRTC’s peer-to-peer communication capabilities to provide a robust, low-latency, high-quality audio streaming experience directly from the browser.

Why WebRTC?

WebRTC (Web Real-Time Communication) is a powerful technology for enabling direct peer-to-peer communication between browsers and servers. It was built with real-time audio, video, and data transfer in mind, making it an ideal choice for real-time voice applications. Here are some key benefits:

1. Low Latency

WebRTC’s peer-to-peer design minimizes latency, ensuring natural, fluid conversations.

2. Adaptive Quality

WebRTC dynamically adjusts audio quality based on network conditions, maintaining a seamless user experience even in suboptimal environments.

3. Secure by Design

With encryption (DTLS and SRTP) baked into its architecture, WebRTC ensures secure communication between peers.

4. Widely Supported

WebRTC is supported by all major modern browsers, making it highly accessible for end users.

How It Works

This example demonstrates how to use WebRTC to establish low-latency, real-time interactions with the OpenAI Realtime API from a web browser. Here’s how it works:

  1. Request an Ephemeral API Key

    • The browser requests a short-lived API key from your server.
    • It connects to your backend via WebSocket to exchange configuration details, such as the ephemeral key and model information.
    • This WebSocket handles the signaling needed to bootstrap the WebRTC session.
  2. Generate an Ephemeral API Key

    • Your backend generates an ephemeral key via the OpenAI REST API and returns it to the browser. These keys expire after one minute to enhance security. (A minimal sketch of this step follows the list below.)
  3. Initialize the WebRTC Connection

    • Audio Streaming: The browser captures microphone input and streams it to OpenAI while playing audio responses via an <audio> element.
    • DataChannel: A DataChannel is established to send and receive events (e.g., function calls).
    • Session Handshake: The browser creates an SDP offer, sends it to OpenAI with the ephemeral key, and sets the remote SDP answer to finalize the connection.
    • The audio stream and events flow in real time, enabling interactive, low-latency conversations.
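
In the demo, the RealtimeAgent manages the ephemeral key exchange for you, but it helps to see the shape of step 2. Below is a minimal, hypothetical backend sketch, assuming httpx is installed: the /ephemeral-key route name is our own invention, the model name is illustrative, and the POST mirrors OpenAI’s documented /v1/realtime/sessions call.

import os

import httpx
from fastapi import FastAPI

app = FastAPI()

@app.get("/ephemeral-key")
async def ephemeral_key() -> dict:
    """Mint a short-lived Realtime API key for the browser (illustrative)."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "https://api.openai.com/v1/realtime/sessions",
            headers={
                "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
                "Content-Type": "application/json",
            },
            json={"model": "gpt-4o-realtime-preview", "voice": "alloy"},
        )
    resp.raise_for_status()
    # client_secret.value is the ephemeral key the browser attaches to its
    # SDP offer; it expires after about one minute.
    return resp.json()["client_secret"]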

Example: Build a Voice-Enabled Language Translator

Let’s walk through a practical example of using WebRTC to create a voice-enabled language translator.

You can find the full example in the realtime-agent-over-webrtc repository on GitHub.

1. Clone the Repository

Start by cloning the example project from GitHub:

git clone https://github.com/ag2ai/realtime-agent-over-webrtc.git
cd realtime-agent-over-webrtc

2. Set Up Environment Variables

Create an OAI_CONFIG_LIST file based on the provided OAI_CONFIG_LIST_sample:

cp OAI_CONFIG_LIST_sample OAI_CONFIG_LIST

In the OAI_CONFIG_LIST file, update the api_key with your OpenAI API key.
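
The file is a JSON list of model configurations. A minimal example might look like the following, where the model name is illustrative and the key is a placeholder:

[
  {
    "model": "gpt-4o-realtime-preview",
    "api_key": "sk-proj-..."
  }
]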

Supported key format

Currently, WebRTC can be used only with API keys that begin with:

sk-proj

Other keys may result in an internal server error (500) on the OpenAI server. For more details, see this issue.

(Optional) Create and Use a Virtual Environment

To avoid cluttering your global Python environment:

python3 -m venv env
source env/bin/activate

3. Install Dependencies

Install the required Python packages:

pip install -r requirements.txt

4. Start the Server

Run the application with Uvicorn:

uvicorn realtime_over_webrtc.main:app --port 5050

When the server starts, you should see:

INFO:     Started server process [12345]
INFO:     Uvicorn running on http://0.0.0.0:5050 (Press CTRL+C to quit)

5. Open the Application

Navigate to http://localhost:5050/start-chat in your browser. The application will request microphone permissions to enable real-time voice interaction.

6. Start Speaking

To get started, simply speak into your microphone and ask a question. For example, you can say:

“What’s the weather like in Rome?”

This initial question will activate the agent, and it will respond, showcasing its ability to understand and interact with you in real time.

Code review

WebRTC connection

A lot of the WebRTC connection logic happens in website_files/static/WebRTC.js, so let’s take a look at the code there first.

WebSocket Initialization

The WebSocket is responsible for exchanging initialization data and signaling messages.

let pc = null; // RTCPeerConnection, created in openRTC below
let dc = null; // DataChannel for exchanging events with OpenAI

const ws = new WebSocket(webSocketUrl); // webSocketUrl points at the backend /session endpoint

ws.onmessage = async event => {
    const message = JSON.parse(event.data);
    console.info("Received Message from AG2 backend", message);
    if (message.type === "ag2.init") {
        await openRTC(message.config); // Starts the WebRTC connection
        return;
    }
    if (dc) {
        dc.send(JSON.stringify(message)); // Sends data via DataChannel
    } else {
        console.log("DC not ready yet", message);
    }
};

WebRTC Setup

This block configures the WebRTC connection, adds audio tracks, and initializes the DataChannel.

async function openRTC(data) {
    const EPHEMERAL_KEY = data.client_secret.value;

    // Create the peer connection for this session
    pc = new RTCPeerConnection();

    // Set up to play remote audio
    const audioEl = document.createElement("audio");
    audioEl.autoplay = true;
    pc.ontrack = e => audioEl.srcObject = e.streams[0];

    // Add microphone input as local audio track
    const ms = await navigator.mediaDevices.getUserMedia({ audio: true });
    pc.addTrack(ms.getTracks()[0]);

    // Create a DataChannel
    dc = pc.createDataChannel("oai-events");
    dc.addEventListener("message", e => {
        const message = JSON.parse(e.data);
        if (message.type.includes("function")) {
            ws.send(e.data); // Forward function messages to WebSocket
        }
    });

    // Create and send an SDP offer
    const offer = await pc.createOffer();
    await pc.setLocalDescription(offer);

    // Send the offer to OpenAI
    const baseUrl = "https://api.openai.com/v1/realtime";
    const sdpResponse = await fetch(`${baseUrl}?model=${data.model}`, {
        method: "POST",
        body: offer.sdp,
        headers: {
            Authorization: `Bearer ${EPHEMERAL_KEY}`,
            "Content-Type": "application/sdp"
        },
    });

    // Set the remote SDP answer
    const answer = { type: "answer", sdp: await sdpResponse.text() };
    await pc.setRemoteDescription(answer);
    console.log("Connected to OpenAI WebRTC");
}

Server implementation

This server implementation uses FastAPI to set up the WebSocket signaling and pages for the WebRTC interaction, allowing clients to communicate with a chatbot powered by OpenAI’s Realtime API. The server provides endpoints for a simple chat interface and real-time audio communication.
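
The snippets below omit imports for brevity. For reference, a plausible set for this walkthrough is shown here; the RealtimeAgent import path is an assumption and may differ between AG2 versions.

from logging import getLogger
from pathlib import Path
from typing import Annotated

from fastapi import FastAPI, Request, WebSocket
from fastapi.responses import HTMLResponse, JSONResponse
from fastapi.staticfiles import StaticFiles
from fastapi.templating import Jinja2Templates

from autogen.agentchat.realtime_agent import RealtimeAgent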

Create an app using FastAPI

First, initialize a FastAPI app instance to handle HTTP requests and WebSocket connections.

app = FastAPI()

This creates an app instance that will be used to manage both regular HTTP requests and real-time WebSocket interactions.

Define the root endpoint for status

Next, define a root endpoint to verify that the server is running.

@app.get("/", response_class=JSONResponse)
async def index_page():
    return {"message": "WebRTC AG2 Server is running!"}

When accessed, this endpoint responds with a simple status message indicating that the WebRTC server is up and running.
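
As a quick smoke test, you can query the endpoint with curl once the server from step 4 is running:

curl http://localhost:5050/

The expected response is {"message": "WebRTC AG2 Server is running!"}.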

Set up static files and templates

Mount a directory for static files (e.g., CSS, JavaScript) and configure templates for rendering HTML.

website_files_path = Path(__file__).parent / "website_files"

app.mount(
    "/static", StaticFiles(directory=website_files_path / "static"), name="static"
)

templates = Jinja2Templates(directory=website_files_path / "templates")

This ensures that static assets (like styling or scripts) can be served and that HTML templates can be rendered for dynamic responses.

Serve the chat interface page

Create an endpoint to serve the HTML page for the chat interface.

@app.get("/start-chat/", response_class=HTMLResponse)
async def start_chat(request: Request):
    """Endpoint to return the HTML page for audio chat."""
    port = request.url.port
    return templates.TemplateResponse("chat.html", {"request": request, "port": port})

This endpoint serves the chat.html page and passes the port number to the template, which the page uses to open its WebSocket connection.

Handle WebSocket connections for media streaming

Set up a WebSocket endpoint to handle real-time interactions, including receiving audio streams and responding with OpenAI’s model output.

@app.websocket("/session")
async def handle_media_stream(websocket: WebSocket):
    """Handle WebSocket connections providing audio stream and OpenAI."""
    await websocket.accept()

    logger = getLogger("uvicorn.error")

    realtime_agent = RealtimeAgent(
        name="Weather Bot",
        system_message="Hello there! I am an AI voice assistant powered by Autogen and the OpenAI Realtime API. You can ask me about weather, jokes, or anything you can imagine. Start by saying 'How can I help you'?",
        llm_config=realtime_llm_config,
        websocket=websocket,
        logger=logger,
    )

This WebSocket endpoint establishes a connection and creates a RealtimeAgent that will manage interactions with OpenAI’s Realtime API. It also includes logging for monitoring the process.
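
One piece not shown above is realtime_llm_config. A minimal sketch of how it might be assembled from the OAI_CONFIG_LIST follows; the filter_dict model name is an assumption, so match it to your own config.

import autogen

config_list = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={"model": ["gpt-4o-realtime-preview"]},
)

realtime_llm_config = {
    "timeout": 600,
    "config_list": config_list,
    "temperature": 0.8,
}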

Register and implement real-time functions

Define custom real-time functions that can be called from the client side, such as fetching weather data.

    @realtime_agent.register_realtime_function(
        name="get_weather", description="Get the current weather"
    )
    def get_weather(location: Annotated[str, "city"]) -> str:
        logger.info(f"Checking the weather: {location}")
        return (
            "The weather is cloudy." if location == "Rome" else "The weather is sunny."
        )

Here, a weather-related function is registered with the RealtimeAgent. It responds with a simple weather message based on the input city.

Run the RealtimeAgent

Finally, run the RealtimeAgent to start handling the WebSocket interactions.

    await realtime_agent.run()

This starts the agent’s event loop, which listens for incoming messages and responds accordingly.

Conclusion

This new integration of the OpenAI Realtime API with WebRTC unlocks the full potential of WebRTC for real-time voice applications. With its low latency, adaptive quality, and secure communication, it’s the perfect tool for building interactive, voice-enabled applications.

Try it today and take your voice applications to the next level!