OpenAI’s Whisper is revolutionary but (slightly) flawed
Speech recognition in machine learning has always been one of the most difficult tasks to perfect. The first speech recognition software was developed in the 1950s and we’ve come a long way since then.
Recently, OpenAI took a leap into the domain by introducing Whisper. The company says it “approaches the robustness and accuracy of English speech recognition at a human level” and can automatically recognize and transcribe other languages such as Spanish, Italian and Japanese, as well as translate them into English.
There’s no question that Whisper performs better than the commercial ASR (automatic speech recognition) systems behind assistants such as Alexa, Siri, and Google Assistant. OpenAI, the company that usually does not live up to its name, decided to open-source this model. The digital experience will radically change for many people, but is the model revolutionary?
Here’s what you need to know
As with almost every major new AI model these days, Whisper comes with benefits and potential risks. Under the “Broader Implications” section of Whisper’s model card, OpenAI warns that it could be used to automate surveillance or identify individual speakers in a conversation, but the company hopes it will be used “primarily for useful purposes.”
Conversations have also surfaced on the internet about the challenges faced by the first users of this revolutionary transformer model. As an aside, OpenAI’s researchers chose the original transformer architecture because they wanted to prove that high-quality ASR is possible when enough weakly supervised data is available.
The biggest challenge is that your laptop may not be as powerful as the machines used by professional transcription services. For example, Mitchell Clark fed the audio of a 24-minute interview into Whisper running on an M1 MacBook Pro, and it took almost an hour to transcribe the file. By contrast, Otter completed the transcription within eight minutes.
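For readers who want to try this themselves, here is a minimal sketch using the open-source openai-whisper Python package; the file name and model size are placeholders, and smaller checkpoints trade accuracy for speed on laptop hardware.

```python
# Minimal sketch: transcribe a local file with the open-source `openai-whisper`
# package (pip install -U openai-whisper; ffmpeg must also be installed).
# The file name is a placeholder.
import whisper

# "tiny" and "base" run much faster on a laptop than "large", at lower accuracy.
model = whisper.load_model("small")
result = model.transcribe("interview_24min.mp3")
print(result["text"])
```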
Second, installing Whisper isn’t exactly user-friendly for everyone. Journalist Peter Sterne teamed up with GitHub developer Christina Warren to solve the problem by creating a “free, secure, and easy-to-use transcription app for journalists” based on Whisper’s ML model. Sterne said he decided the program, called Stage Whisper, should exist after running some interviews through Whisper and finding it was “the best transcription he’d ever used, excluding human transcribers.”
Another red flag is that the predicted timestamps often land on whole-second (integer) boundaries. Users have noted that these are often less accurate; blurring the predicted timestamp distribution may help, but no conclusive study has been done yet. The timestamp decoding heuristics are a bit naive and could be improved, along with support for word-level timestamps.
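For illustration, the segment-level timestamps in question are exposed directly in the transcription result of the open-source package; a rough sketch, with a placeholder file name:

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("interview.mp3")

# Each segment carries start/end times in seconds. In practice these often land
# on or near whole-second boundaries, which is the inaccuracy users report.
for seg in result["segments"]:
    print(f"[{seg['start']:7.2f} -> {seg['end']:7.2f}] {seg['text'].strip()}")
```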
A peculiar failure case?
Whisper has also been described as having ‘peculiar failure cases’. The reason for this is that the model’s recognition quality sometimes falls short in unexpected ways.
(credits: https://docs.google.com/spreadsheets/d/1xdaK-RJZ2ftMKBME45aAeEmMHSJSxb3wW8-GzT1whgg/edit?usp=sharing)
When testing Whisper, Talon, and Nemo on exactly the same test sets with the same text normalization, all of the major models performed well on general dictation. However, Whisper was painfully slow compared to the other models tested. In GPU tests, the largest Talon 1B model and the Nemo xlarge (600M) model achieve much higher throughput than any Whisper model, including Whisper tiny (39M).
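As a rough illustration of what “the same test sets with the same text normalization” means in practice, one can normalize both the reference and each model’s hypothesis with Whisper’s own English normalizer before scoring word error rate. The transcripts below are placeholders, and the jiwer package is an assumption for computing WER.

```python
import jiwer
from whisper.normalizers import EnglishTextNormalizer

normalize = EnglishTextNormalizer()

# Placeholder strings: in a real benchmark these come from the test set and
# from each model's decoder output.
reference = "ground truth transcript of the test utterance"
hypotheses = {
    "whisper": "whisper transcript of the test utterance",
    "talon":   "talon transcript of the test utterance",
    "nemo":    "nemo transcript of the test utterance",
}

for name, hyp in hypotheses.items():
    wer = jiwer.wer(normalize(reference), normalize(hyp))
    print(f"{name}: WER = {wer:.2%}")
```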
Whisper is very good at producing coherent-sounding output, even when it is completely wrong about what was actually said. When analyzing some worst-case outputs, neither the Talon nor the Nemo models produced failures anything like this. Most of Talon’s mistakes on this test set were compound-word splits.
An analysis of the paper found that, at least for Indian languages, translations are generally better, while transcripts suffer from catastrophic failures.
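For context, the open-source package exposes transcription and English translation through the same call; a minimal sketch with a placeholder Hindi audio file:

```python
import whisper

model = whisper.load_model("medium")

# task="transcribe" returns text in the source language,
# task="translate" returns English text from the same audio.
transcript = model.transcribe("hindi_clip.mp3", task="transcribe")
translation = model.transcribe("hindi_clip.mp3", task="translate")
print(transcript["text"])
print(translation["text"])
```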
In conclusion, Whisper is a very neat set of models and capabilities, especially for multilingual and translation use cases. It will probably also be a great tool for guiding the training of other models. However, given the observed failures, users should not put Whisper into production without a second model to check its output.
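One simple way to implement that check is sketched below: flag clips where Whisper and an independent model disagree heavily, and route them to human review. The threshold, the second-model transcript, and the use of jiwer are assumptions, not an established recipe.

```python
import jiwer

def needs_review(whisper_text: str, other_text: str, threshold: float = 0.3) -> bool:
    """Flag a clip when the two transcripts differ by more than `threshold` WER."""
    return jiwer.wer(other_text.lower(), whisper_text.lower()) > threshold

if needs_review("a perfectly coherent but wrong sentence", "what was actually said"):
    print("Transcripts diverge - send this clip for manual review.")
```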