Speech recognition remains a difficult problem in AI and machine learning. In a step toward solving it, OpenAI today open-sourced Whisper, an automatic speech recognition system that the company says enables “robust” transcription in multiple languages as well as translation from those languages into English.
Countless organizations have developed high-performance speech recognition systems, which sit at the heart of software and services from tech giants like Google, Amazon and Meta. But what makes Whisper different, according to OpenAI, is that it was trained on 680,000 hours of multilingual, “multitask” data collected from the web, resulting in improved recognition of unique accents, background noise and technical jargon.
“The main intended users of [the Whisper] models are AI researchers studying the robustness, generalization, capabilities, biases and constraints of the current model. However, Whisper is also potentially quite useful as an automatic speech recognition solution for developers, especially for English speech recognition,” OpenAI wrote in the GitHub repository for Whisper, from which several versions of the system can be downloaded. “[The models] show strong ASR results in about 10 languages. They may exhibit additional capabilities … if fine-tuned on certain tasks like voice activity detection, speaker classification or speaker diarization, but have not been robustly evaluated in these areas.”
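For developers curious what that looks like in practice, the models in the GitHub repository expose a small Python API. The sketch below assumes the `openai-whisper` package is installed (`pip install openai-whisper`) and that a local audio file exists; the file name and the `format_timestamp` helper are illustrative, not part of Whisper itself.

```python
def format_timestamp(seconds: float) -> str:
    """Render a segment offset as MM:SS for a simple transcript printout."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def transcribe_file(path: str, model_name: str = "base") -> None:
    """Transcribe an audio file and print time-stamped segments."""
    # The openai-whisper package; model sizes range from "tiny" to "large",
    # trading accuracy for speed and memory.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    for segment in result["segments"]:
        print(f"[{format_timestamp(segment['start'])}] {segment['text'].strip()}")


# Usage (requires a real audio file):
# transcribe_file("audio.mp3")
```

`transcribe` also returns the full text under `result["text"]`, and takes a `language` argument for callers who want to skip Whisper's automatic language detection.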
Whisper has its limitations, particularly in the area of text prediction. Because the system was trained on a large amount of “noisy” data, OpenAI warns that Whisper might include words in its transcriptions that weren’t actually spoken, possibly because it is both trying to predict the next word in the audio and trying to transcribe the audio itself. Additionally, Whisper does not perform equally well across languages, suffering from a higher error rate for speakers of languages that are not well represented in the training data.
That last bit is nothing new in the world of speech recognition, unfortunately. Biases have long plagued even the best systems, with a 2020 Stanford study finding that systems from Amazon, Apple, Google, IBM and Microsoft made far fewer errors (around 35% fewer) with white users than with Black users.
Despite this, OpenAI sees Whisper’s transcription capabilities being used to improve existing accessibility tools.
“While Whisper models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build applications on top of them that allow for near-real-time speech recognition and translation,” the company continues on GitHub. “The real value of beneficial applications built on top of Whisper models suggests that the disparate performance of these models may have real economic implications … [W]hile we hope the technology will be used primarily for beneficial purposes, making automatic speech recognition technology more accessible could enable more actors to build capable surveillance technologies or scale up existing surveillance efforts, as the speed and accuracy allow for affordable automatic transcription and translation of large volumes of audio communication.”
Whisper’s release is not necessarily indicative of OpenAI’s future plans. While the company is increasingly focused on commercial efforts like DALL-E 2 and GPT-3, it continues to pursue several purely theoretical lines of research, including AI systems that learn by watching videos.