In my experience building the infra for this, Twilio's Media Streams and PSTN had higher latency than a simple SIP setup that interfaces with WebRTC, which you can connect to your backend. I'm curious how this changes with the OpenAI Realtime API, since my approach was chaining GPT-4o with ElevenLabs on the backend.
Speech-to-speech model from OpenAI integrated with Twilio's Voice channel, demonstrating AI-powered live call translation between two parties speaking different languages
Only one way to find out :) Give the open source starter app a whirl and let us know what you think! https://github.com/twilio-samples/live-translation-openai-re...