OpenAI’s Realtime API: A Game Changer for Voice Assistants

OpenAI just launched the Realtime API, which lets businesses build voice assistant services on par with ChatGPT’s Advanced Voice Mode. If you haven’t tried Advanced Voice Mode yet, it’s available to all paid ChatGPT users, and it’s really impressive. It is definitely worth checking out.

This new Realtime API makes it easier to build your own voice assistant systems, which should make faster, more natural voice interactions widespread.

The Rise of Voice Assistants

About six or seven years ago, I developed a voice assistant, but the experience wasn’t great. There were two main issues:

  1. The AI’s limited ability to understand requests (which has since improved significantly with large language models), and
  2. Latency.

Building a voice assistant used to involve three steps:

  1. Converting the user’s speech to text (Speech-to-Text, STT),
  2. Using AI to process the text and generate a response,
  3. Converting that response back to speech (Text-to-Speech, TTS) and playing it.
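
To make the latency problem concrete, here is a minimal Python sketch of the three steps run one after another. The stage functions (`speech_to_text`, `generate_reply`, `text_to_speech`) are hypothetical stand-ins, not real services, and the sleep durations are made-up numbers that only simulate processing time. The point is simply that the user hears nothing until all three stages have finished, so the delays add up.

```python
import time
from typing import Callable

# Hypothetical stand-ins for real STT, language-model, and TTS services.
# The sleep durations are invented for illustration only.
def speech_to_text(audio: bytes) -> str:
    time.sleep(0.05)  # simulated transcription delay
    return "what is the weather today"

def generate_reply(text: str) -> str:
    time.sleep(0.05)  # simulated model delay
    return f"You asked: {text}"

def text_to_speech(text: str) -> bytes:
    time.sleep(0.05)  # simulated synthesis delay
    return text.encode("utf-8")

def timed(stage: Callable, value):
    """Run one stage and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = stage(value)
    return result, time.perf_counter() - start

def run_pipeline(audio: bytes) -> tuple[bytes, dict[str, float]]:
    """Run STT -> AI -> TTS sequentially and record each stage's delay."""
    timings: dict[str, float] = {}
    text, timings["stt"] = timed(speech_to_text, audio)
    reply, timings["ai"] = timed(generate_reply, text)
    speech, timings["tts"] = timed(text_to_speech, reply)
    # The user waits for the sum of all three stages before hearing anything.
    timings["total"] = timings["stt"] + timings["ai"] + timings["tts"]
    return speech, timings

if __name__ == "__main__":
    _, timings = run_pipeline(b"fake-audio-bytes")
    for stage, seconds in timings.items():
        print(f"{stage}: {seconds * 1000:.0f} ms")
```

In a real system each stage would also add network round trips, which is why the total delay was so noticeable in conversation.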

Because these steps ran one after another, their delays added up, and the resulting latency often led to poor user experiences, preventing voice assistants from becoming truly popular. But with the Realtime API, I believe voice assistants are entering a new era. Businesses can now offer real-time voice interactions, making fully voice-driven interfaces a reality. At the very least, this will shift us toward voice-first, visually assisted interfaces, which will greatly enhance user experiences (similar to Meta’s Orion project).

Additionally, the Realtime API supports function calling, meaning voice assistants can do more than just answer questions: they can perform specific tasks. For example, users can place food orders or book hotels via voice, and the assistant will handle the entire process automatically.
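
As a rough illustration of what function calling means on the application side, here is a minimal Python sketch. It assumes the model hands your code a function name plus JSON-encoded arguments; the handler names (`place_food_order`, `book_hotel`) and their parameters are hypothetical, and this is not the Realtime API's actual event format, which the official guide describes.

```python
import json

# Hypothetical business functions the assistant is allowed to call.
def place_food_order(item: str, quantity: int) -> dict:
    return {"status": "confirmed", "item": item, "quantity": quantity}

def book_hotel(city: str, nights: int) -> dict:
    return {"status": "booked", "city": city, "nights": nights}

# Registry mapping the names the model may request to local handlers.
TOOLS = {
    "place_food_order": place_food_order,
    "book_hotel": book_hotel,
}

def handle_function_call(name: str, arguments_json: str) -> str:
    """Run the requested function and return its result as a JSON string."""
    handler = TOOLS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown function: {name}"})
    arguments = json.loads(arguments_json)
    return json.dumps(handler(**arguments))

if __name__ == "__main__":
    # Assumed shape of a request derived from the user's spoken order.
    print(handle_function_call("place_food_order", '{"item": "pizza", "quantity": 2}'))
```

The result string would then be sent back to the model so it can confirm the order to the user by voice. Production code would also validate the arguments before calling the handler.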

Costs

Currently, processing audio with this API costs $0.06 per minute of input and $0.24 per minute of output. For instance, a 3-minute phone call in which the customer talks for 1 minute and the AI responds for 2 minutes would cost about $0.54 (1 × $0.06 + 2 × $0.24). While this may seem expensive now, AI usage costs have tended to fall by roughly half each year, so this technology should soon be much more affordable. Moreover, with the rapid development of on-device AI, much of the processing may shift to user devices, reducing or even eliminating this cost for businesses.
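
The example call can be worked through with the per-minute rates quoted above, which gives a slightly more precise figure. The short Python sketch below does the arithmetic and also shows what the same call would cost if prices really did halve every year. That halving rate is an assumption taken from the trend described here, not a published price schedule, and the function names are my own.

```python
from decimal import Decimal

# Per-minute audio rates quoted in this post (US dollars).
INPUT_RATE = Decimal("0.06")
OUTPUT_RATE = Decimal("0.24")

def call_cost(input_minutes: int, output_minutes: int) -> Decimal:
    """Cost of one call: input minutes plus output minutes at their own rates."""
    return INPUT_RATE * input_minutes + OUTPUT_RATE * output_minutes

def projected_cost(cost: Decimal, years: int) -> Decimal:
    """Cost after `years`, ASSUMING prices halve every year."""
    return cost / (2 ** years)

if __name__ == "__main__":
    cost = call_cost(input_minutes=1, output_minutes=2)
    print(f"Today: ${cost}")  # 0.06 + 0.48 = 0.54
    print(f"In 2 years (if halving holds): ${projected_cost(cost, 2)}")
```

At this rate, 1,000 such calls would cost about $540 today.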

Real-World Use Cases

OpenAI mentioned Healthify, an Indian health tech company, which is already using the Realtime API to provide real-time voice assistant services.
Another example is the language learning app Speak, which uses the Realtime API to power its role-play feature, letting students hold role-playing conversations with AI to make language learning more effective.
(Speak completed a $20M funding round a few months ago, with OpenAI among the investors.)

The potential of the Realtime API extends to various applications, including:

  • Customer service: Automating voice-based customer inquiries and orders to save time and increase efficiency.
  • Internal process automation: Helping businesses streamline internal workflows like meeting scheduling or report generation, boosting productivity.
  • Smart commerce: Enhancing the customer shopping experience through voice interactions, from product recommendations to fully automated order processing.

In short, tasks that previously required human involvement can now be handled much more efficiently by AI. Business leaders should consider how this shift in interaction modes could affect their products and operations.

References

  1. Introducing the Realtime API
  2. Realtime API Guide
  3. Case study of Healthify
  4. Speak Hits $500M Valuation, Expands Rapidly Across Markets