OpenAI’s AI model automatically recognizes speech and translates it into English

A pink waveform on a blue background, poetically suggesting audio.

Benj Edwards / Ars Technica

On Wednesday, OpenAI released a new open-source AI model called Whisper that recognizes and translates audio at a level approaching human recognition ability. It can transcribe interviews, podcasts, conversations, and more.

OpenAI trained Whisper on 680,000 hours of audio data and corresponding transcriptions in about 10 languages collected from the web. According to OpenAI, this open collection approach has led to “improved robustness to accents, background noise, and technical language.” Whisper can also detect the spoken language and translate the audio into English.

OpenAI describes Whisper as an encoder-decoder transformer, a type of neural network that can use context gleaned from input data to learn associations that can then be translated into model output. OpenAI presents this overview of how Whisper works:

Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
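For readers who want to trace that pipeline in code, the Whisper repository on GitHub walks through a lower-level Python example along these lines (a minimal sketch based on the project’s README; “audio.mp3” is a placeholder filename, and running it requires the whisper package and ffmpeg to be installed):

```python
import whisper

# Load a pretrained checkpoint; "base" is small and fast, "large" is most accurate
model = whisper.load_model("base")

# Load the audio and pad or trim it to the 30-second window the model expects
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Convert the waveform into the log-Mel spectrogram the encoder consumes
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Language identification, one of the tasks steered by the special tokens
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode the spectrogram into text (fp16=False avoids a warning on CPU)
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)
```

The pad_or_trim step reflects the 30-second window described in OpenAI’s overview, and detect_language exercises the language-identification task handled by the special tokens.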

In open-sourcing Whisper, OpenAI hopes to introduce a new foundation model that others can build on in the future to improve speech processing and accessibility tools. OpenAI has a significant track record on this front. In January 2021, OpenAI released CLIP, an open-source computer vision model that arguably ignited the recent era of fast-paced image synthesis technologies such as DALL-E 2 and Stable Diffusion.

At Ars Technica, we tested Whisper from code available on GitHub, and we fed it several samples, including a podcast episode and a particularly difficult-to-understand section of audio taken from a phone interview. Although it took a while to run on a standard Intel desktop CPU (the technology doesn’t work in real time yet), Whisper did a good job of transcribing the audio into text through its Python demonstration program, far better than some audio transcription services we’ve tried in the past.
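For reference, the high-level Python API that the demo builds on is only a few lines, per the project’s README (a minimal sketch; “podcast.mp3” is a placeholder filename):

```python
import whisper

# Model sizes include "tiny", "base", "small", "medium", and "large";
# larger checkpoints are slower on a CPU but noticeably more accurate
model = whisper.load_model("base")

# transcribe() chunks long audio internally, so a full podcast episode works
result = model.transcribe("podcast.mp3")
print(result["text"])
```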

Example console output from OpenAI’s Whisper demo program while transcribing a podcast.

Benj Edwards / Ars Technica

With the proper setup, Whisper could easily be used to transcribe interviews and podcasts, and potentially to translate podcasts produced in languages other than English into English, on your own machine, for free. It’s a powerful combination that could potentially disrupt the transcription industry.
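Based on the project’s documentation, switching from transcription to translation appears to be a one-argument change (a sketch; “podcast_fr.mp3” is a hypothetical non-English episode, and “medium” is one of the larger multilingual checkpoints):

```python
import whisper

model = whisper.load_model("medium")

# task="translate" asks the model for English output regardless of the
# source language; "podcast_fr.mp3" stands in for a non-English episode
result = model.transcribe("podcast_fr.mp3", task="translate")
print(result["text"])
```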

As with nearly every major new AI model these days, Whisper brings both benefits and the potential for misuse. On Whisper’s model card (under the “Broader Implications” section), OpenAI warns that Whisper could be used to automate surveillance or identify individual speakers in a conversation, but the company hopes it will be used “primarily for beneficial purposes.”
