From OpenAI: new audio models for voice interaction and simultaneous translation

OpenAI has unveiled three new real-time voice models aimed at developers working on voice assistants, instant translation, and speech-to-text applications directly through its APIs.

The new lineup consists of GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. The company says the models deliver more natural voice interactions, support live translation, and convert speech to text with low latency.

GPT-Realtime-2 is the most prominent of the three. It is designed to manage live voice conversations, with the ability to analyze requests, invoke tools, handle corrections, and continue the dialogue naturally.

OpenAI added several new features to the model, including the ability to speak short introductory phrases such as “Let me check this” before carrying out a task, as well as support for calling several tools in parallel, so that the user stays informed of what is happening.
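The pattern described here, a short spoken preamble followed by several tool calls running at once, can be illustrated with a minimal Python sketch. It does not use OpenAI's API: the tool functions `check_calendar` and `check_weather`, the `handle_turn` helper, and the `spoken` list standing in for audio output are all hypothetical names invented for this illustration.

```python
import asyncio


# Hypothetical tools; a real application would call its own services here.
async def check_calendar(date: str) -> str:
    await asyncio.sleep(0.01)  # simulate network latency
    return f"calendar free on {date}"


async def check_weather(city: str) -> str:
    await asyncio.sleep(0.01)  # simulate network latency
    return f"weather ok in {city}"


async def handle_turn(spoken: list[str]) -> list[str]:
    # Say a short preamble first so the user is not left in silence.
    spoken.append("Let me check this")
    # Run both tools concurrently; gather returns results in call order.
    results = await asyncio.gather(
        check_calendar("Friday"),
        check_weather("Cairo"),
    )
    # Report the combined outcome once every tool has finished.
    spoken.append("; ".join(results))
    return list(results)


def run_demo() -> tuple[list[str], list[str]]:
    spoken: list[str] = []  # stand-in for what the assistant would say aloud
    results = asyncio.run(handle_turn(spoken))
    return spoken, results
```

Calling `run_demo()` returns the phrases the assistant would speak, in order, together with the raw tool results. How the actual model schedules its preambles and tool calls is not described in the announcement, so this is only one plausible shape.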

The company has also improved the model's error handling: it now responds more gracefully when a problem occurs instead of stopping silently. In addition, the context window has been expanded from 32,000 tokens to 128,000 tokens.

OpenAI says the new model has a better grasp of specialized terms, scientific names, and medical vocabulary, and that its tone of voice can be adjusted to suit the situation. Developers can also choose among several levels of reasoning effort.

The GPT-Realtime-Translate model targets multilingual voice translation in real time. It supports translation from more than 70 input languages into 13 output languages. The company says the model preserves meaning while keeping pace with the speaker, even with local dialects or specialized terminology.

GPT-Realtime-Whisper, for its part, is dedicated to low-latency speech-to-text. It can transcribe speech while the speaker is still talking, which makes it suitable for simultaneous translation, meeting recordings, lectures, and similar uses.

OpenAI has made all three models available via the Realtime API. Pricing for GPT-Realtime-2 starts at $32 per million audio input tokens and $64 per million audio output tokens, while GPT-Realtime-Translate costs about $0.034 per minute and GPT-Realtime-Whisper about $0.017 per minute.
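To make these figures concrete, here is a small Python sketch that estimates costs from the prices quoted in this article. The function names and the sample token and minute counts are illustrative assumptions, and the quoted prices are described as starting or approximate rates, so actual bills may differ.

```python
from decimal import Decimal

# Prices in USD as quoted in the article; treat them as approximate.
REALTIME_2_INPUT_PER_MILLION = Decimal("32")
REALTIME_2_OUTPUT_PER_MILLION = Decimal("64")
TRANSLATE_PER_MINUTE = Decimal("0.034")
WHISPER_PER_MINUTE = Decimal("0.017")

ONE_MILLION = Decimal(1_000_000)


def realtime2_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate GPT-Realtime-2 audio cost from token counts."""
    input_cost = Decimal(input_tokens) * REALTIME_2_INPUT_PER_MILLION
    output_cost = Decimal(output_tokens) * REALTIME_2_OUTPUT_PER_MILLION
    return (input_cost + output_cost) / ONE_MILLION


def per_minute_cost(minutes: int, rate_per_minute: Decimal) -> Decimal:
    """Estimate cost for a model billed per minute of audio."""
    return Decimal(minutes) * rate_per_minute


if __name__ == "__main__":
    # Hypothetical usage volumes, chosen only for illustration.
    print(realtime2_cost(500_000, 250_000))
    print(per_minute_cost(60, TRANSLATE_PER_MINUTE))
    print(per_minute_cost(60, WHISPER_PER_MINUTE))
```

At these rates, an hour of audio would cost roughly $2.04 with GPT-Realtime-Translate and $1.02 with GPT-Realtime-Whisper, while 500,000 input tokens plus 250,000 output tokens on GPT-Realtime-2 would come to $32.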

The company says developers can try out the new models in the Playground, and that it will continue working to improve the voice experience in ChatGPT for everyday users. (aitnews)