Today I tried a Python package called whisper to transcribe a 30-minute German podcast from SPIEGEL.
This package is developed by OpenAI, the same company that developed the famous ChatGPT.
To use it, I downloaded the audio, read an article on whisper (Transcribe audio files with OpenAI’s Whisper | Towards Data Science), installed a few packages, copied the lines of code from the article and launched the script. The script downloads the neural network to your computer and runs it offline on your own hardware. It took just 1.5 hours to transcribe the 30-minute audio on my laptop with an Intel Core i5-1235U, and I am very satisfied with the result. What sets this package apart from other free speech-to-text converters is that it produces punctuation.
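For anyone curious what such a script roughly looks like, here is a minimal sketch. It is not the exact code from the article. It assumes the `openai-whisper` package and FFmpeg are installed; the file name `podcast.mp3`, the model size `medium`, and the helpers `format_timestamp` and `segments_to_lines` are placeholders made up for illustration.

```python
def format_timestamp(seconds):
    """Turn a number of seconds into an mm:ss string."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def segments_to_lines(result):
    """Format the segments of a whisper result dict as timestamped lines."""
    return [
        f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}"
        for seg in result["segments"]
    ]


def transcribe(path, model_name="medium", language="de"):
    """Transcribe an audio file with the openai-whisper package."""
    # Imported here so the helpers above also work without whisper installed.
    import whisper

    # The model is downloaded on first use and then runs locally.
    model = whisper.load_model(model_name)
    # fp16=False avoids the half-precision warning on CPU-only machines.
    return model.transcribe(path, language=language, fp16=False)


# Usage (placeholder file name):
#   result = transcribe("podcast.mp3")
#   print(result["text"])
#   print("\n".join(segments_to_lines(result)))
```

Passing the language explicitly is optional; without it, whisper tries to detect the language on its own.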
Below is an excerpt from the beginning of the text I got:
Unser Gehirn hat keine Speicherfunktion für eine Momentaufnahme. Den Moment erleben wir wirklich nur, wenn wir uns darauf konzentrieren. Sonst ist er aus unserem Leben, aus unserer Biografie ein für allemal verschwunden. Und das ist ja tragisch. Ideen für ein besseres Leben haben wir alle. Aber wie setzen wir sie im Alltag um? In diesem Podcast treffen wir jede Woche Menschen, die uns verraten, wie es klappen kann. Willkommen zu smarter leben. Ich bin Lenne Kafka und diesmal spreche ich mit Volker. Hallo, ich bin Volker Kitz. Ich habe ein Buch über Konzentration geschrieben und bin dafür in ein Schweigeseminar in den Himalaya gereist. Wissen Sie noch, was ich eben gerade im Intro erzählt habe? Oder was da für Musik drunter lief?
SUMMARY OF THE BELOW DISCUSSION
I’ve tried it out a couple of times and indeed it works great. If someone doesn’t want to set it up themselves, they can use the following link (which uses Whisper behind the scenes):
You do have to wait in a queue as a lot of people are using it, but it’s not been too bad in my experience.
Thanks. I wish there was just a simple .exe file I could install this with… sigh!
I’m the developer of freesubtitles.ai, glad you like it! The number of people in the queue has fluctuated quite a bit. Right now I’m paying for the servers out of pocket so that people can transcribe content for free. I’m planning to offer some paid features that should hopefully bring in enough revenue for me to scale the project and decrease the queue time. The goal is for people using the public queue to wait no more than 30 minutes (hopefully less). So far people have already transcribed thousands of hours for free (the numbers on the site aren’t accurate since I had to move to a new server), so I would consider it a success!
I don’t really know about the original whisper, but since we discussed whisper.cpp previously, here is an idea of how this might work. It has to be said, though, that I don’t know Windows and don’t currently have access to such a machine (I’m winging it). So you will have to figure some things out on your own, but you can always ask Google or ChatGPT (e.g. “how do I run FFmpeg on Windows?”).
You should be able to just download a Windows executable from the release page on GitHub (Releases · ggerganov/whisper.cpp · GitHub), look under Assets for something like: ‘whisper-blas-bin-x64.zip’
Once unzipped, you should see ‘main.exe’, which is the actual transcription software. Of course you also need a model; the easiest way might be to download one from here: ggerganov/whisper.cpp at main
I mainly use the medium model for Chinese, which reaches good enough accuracy while running at moderate speed.
The (input) audio has to be in a compatible format as well. I use FFmpeg for the conversion, but there are certainly other ways. You can grab a current Windows binary from here: Releases · BtbN/FFmpeg-Builds · GitHub. Look for something like ‘ffmpeg-master-latest-win64-gpl.zip’ or LGPL, but nothing containing “shared” or “Linux”.
You wouldn’t really install this program like a regular Windows program. Instead, you can drag-and-drop the exe file into a terminal window (command line, not PowerShell), type ‘-i’, drag-and-drop the audio file, and then add the following: ‘-ar 16000 -ac 1 -c:a pcm_s16le output.wav’
The end result should look something like this:
C:\Users\Path\to\ffmpeg.exe -i C:\Users\Path\to\input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
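If you would rather not type the command by hand each time, the same conversion can be sketched as a small Python wrapper. This is only an illustration, assuming Python is installed; the function names `build_ffmpeg_command` and `convert` are made up, and the paths are placeholders you would replace with your own.

```python
import subprocess


def build_ffmpeg_command(ffmpeg, source, target="output.wav"):
    """Return the argument list for converting audio to 16 kHz mono 16-bit WAV."""
    return [
        ffmpeg,
        "-i", source,         # input file
        "-ar", "16000",       # sample rate: 16 kHz
        "-ac", "1",           # one audio channel (mono)
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        target,
    ]


def convert(ffmpeg, source, target="output.wav"):
    """Run FFmpeg; raises CalledProcessError if FFmpeg reports an error."""
    subprocess.run(build_ffmpeg_command(ffmpeg, source, target), check=True)


# Usage (placeholder paths):
#   convert(r"C:\Users\Path\to\ffmpeg.exe", r"C:\Users\Path\to\input.mp3")
```

Passing the arguments as a list avoids quoting problems when a path contains spaces.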
Now everything should be in place to run whisper; again, by either dropping the ‘main.exe’ into the terminal window or by typing the path directly. You will have to do the same with the model file.
Here is how it looks for me:
./main -m models/ggml-medium.bin -f /Users/bamboo/test.wav -l zh -osrt
./main would become something like: C:\Users\Path\to\main.exe
Similarly for the model file: C:\Users\Path\to\ggml-medium.bin
And the audio file.
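The whisper.cpp call above can be wrapped the same way. Again, this is just a sketch under assumptions: the function names `build_whisper_command` and `run_whisper` are invented for illustration, the paths are placeholders, and only the flags from my command line above are used.

```python
import subprocess


def build_whisper_command(main, model, wav, language="zh", srt=True):
    """Return the argument list for a whisper.cpp run."""
    cmd = [
        main,
        "-m", model,     # path to the ggml model file
        "-f", wav,       # 16 kHz mono WAV file to transcribe
        "-l", language,  # spoken language, e.g. "zh" or "de"
    ]
    if srt:
        cmd.append("-osrt")  # also write the result as an .srt subtitle file
    return cmd


def run_whisper(main, model, wav, language="zh"):
    """Run whisper.cpp; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_whisper_command(main, model, wav, language), check=True)


# Usage (placeholder paths):
#   run_whisper("./main", "models/ggml-medium.bin", "/Users/bamboo/test.wav")
```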
I don’t know if this makes any sense. If you are looking for a simpler way, I’m sure many people have already built applications around whisper, that are one-click installable. Maybe look around a bit.
Hi Andrey, whisper is really great, but the original PyTorch implementation doesn’t seem to be particularly optimized for use on CPU. You have mentioned your processor, but not your GPU, which is in fact the most important factor.
Because I don’t have a suitable GPU, I use a re-implementation of the original whisper called ‘whisper.cpp’ (GitHub - ggerganov/whisper.cpp: Port of OpenAI's Whisper model in C/C++). It is optimized for use on CPUs and handily outperforms the original on my system (Apple Mac mini M1). The medium model transcribes here at about 3x real-time speed (i.e. 60m of audio take 20m) and the large one at about real-time speed.
Maybe give that a try.
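To make the speed comparison concrete, here is the arithmetic as a tiny sketch. The function names are made up; the numbers are the ones mentioned in this thread (60 minutes of audio in 20 minutes here, 30 minutes of audio in 1.5 hours on your laptop).

```python
def speed_factor(audio_minutes, processing_minutes):
    """Minutes of audio transcribed per minute of computing time."""
    return audio_minutes / processing_minutes


def estimated_processing_minutes(audio_minutes, factor):
    """Expected computing time for a recording at a given speed factor."""
    return audio_minutes / factor


# 60 minutes of audio in 20 minutes: 3 times faster than real time.
mac_mini_factor = speed_factor(60, 20)

# 30 minutes of audio in 90 minutes: 3 times slower than real time.
laptop_factor = speed_factor(30, 90)
```

A factor above 1 means faster than real time, a factor below 1 means slower.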
About a year ago my colleagues and I tested some ways to transcribe text, and the results were really poor then. At my university we have hundreds of people waiting for good methods to transcribe various types of voice recordings, from noisy old field recordings to good-quality modern studio-recorded interviews. I got quite excited when I read about Whisper and I will run some tests soon. Just now I tested it with freesubtitles.ai. Thanks for that, this is really brilliant!
It’s cool to see how fast technology in this field moves forward. At the moment there are still quite a few errors even in very clear studio recordings with good pronunciation (like actors and speakers), but compared to recent years you can really see the improvement.
Thank you very much for your generosity!
Thanks so much. Stupid me is using ChatGPT all the time now and I never thought of asking it for help with the install…
My laptop has only Intel HD Graphics, which is why I mentioned only the CPU.
Thank you for the link. I will use it from now on since it is faster and optimized.
Thank you for pointing out that there is an optimized version of whisper.
Hey, no problem. Steve has been my language learning guru for a decade now, and using his advice I was able to learn Spanish and German. In Serbian, however, it was hard to find content with subtitles (which is my preferred way to get input for A2-C1), so this app was largely built for myself. Once it was done I put it up so others could use it as well. Glad it’s come full circle and that people can use it to help them. I wouldn’t be here without Steve for sure!
I have updated my post to include all of your findings. Thank you everyone
Thanks a lot for providing this. It is really fantastic. There may be a bug though: today I tried to transcribe a Chinese interview podcast. The first minute was an intro in English and the rest was 40min or so of Mandarin. The transcription however was all in English. So I had to cut out that first minute of English in order to make it transcribe in Mandarin.
Did you set the language to Chinese? If you used Auto Detect, it bases the language only on the first 30 seconds.
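To illustrate why the English intro matters, here is a small sketch. It uses the open-source `openai-whisper` Python package, not the code behind the site, and assumes the package behaves as described above (language detection on the first 30 seconds of 16 kHz audio). The helper names and the model size are made up for illustration.

```python
SAMPLE_RATE = 16000     # whisper works on audio resampled to 16 kHz
DETECTION_SECONDS = 30  # only this much audio is used to guess the language


def detection_window_samples(sample_rate=SAMPLE_RATE, seconds=DETECTION_SECONDS):
    """Number of audio samples that language detection gets to see."""
    return sample_rate * seconds


def intro_fills_detection_window(intro_seconds, window_seconds=DETECTION_SECONDS):
    """True if an intro in another language covers the whole detection window."""
    return intro_seconds >= window_seconds


def transcribe_with_language(path, language=None, model_name="medium"):
    """language=None lets whisper guess; language='zh' skips the guessing."""
    # Imported here so the helpers above also work without whisper installed.
    import whisper

    model = whisper.load_model(model_name)
    return model.transcribe(path, language=language)


# A one-minute English intro covers the entire 30-second window, so auto
# detection never hears any Mandarin:
#   intro_fills_detection_window(60)
# Forcing the language avoids the problem (placeholder file name):
#   transcribe_with_language("interview.mp3", language="zh")
```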
Is your site down? I uploaded some audios but they are stuck in “processing” (100% uploaded)… (!?)
My current experience with it in Chinese is mixed. Yes, >90% is correct, but the mistakes [totally wrong character] are somewhat annoying as you have to either ignore them or manually correct them. With more difficult texts and new words I am sometimes at a loss what was actually meant. I would have thought ChatGPT/whisper would proof-read its transcript for meaning (!?). It should detect those wrong characters as those non-words do not make any sense…
Super interesting! Does it run on macOS?
The original whisper works but is rather slow on macOS, because the underlying PyTorch doesn’t take advantage of MPS or the Accelerate framework. I assume the situation will improve in the future; see this for more information: General MPS op coverage tracking issue · Issue #77764 · pytorch/pytorch · GitHub
I myself run whisper.cpp (GitHub - ggerganov/whisper.cpp: Port of OpenAI's Whisper model in C/C++) on a Mac and have no problems. Compilation instructions are in the readme. Happy transcribing!
Sorry to hear that. Among the supported languages Chinese is just at the threshold of being usable and the results certainly cannot be compared to what can be achieved in English, Spanish or even German.
When you say 90%, that is actually a great outcome; even OpenAI gives a number of around 85% afaik, and that is on pristine test data. As you say, the vast majority of mistakes are substitutions, but insertions or omissions would be even worse. I myself find the accuracy to be fine for my purposes, although I don’t use LingQ for reading transcripts anymore because the inaccuracies, combined with LingQ’s way of handling Chinese, create a bit of a mess. When I did read transcripts here, I used to disable highlighting to reduce distractions and be able to focus on the text.
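If you want to put a number on your own transcripts, character accuracy can be sketched as an edit distance that counts substitutions, insertions and omissions against a hand-corrected reference. This is a generic textbook calculation with made-up function names; I don’t know exactly which metric OpenAI’s figure is based on.

```python
def edit_distance(reference, hypothesis):
    """Smallest number of substitutions, insertions and omissions
    needed to turn the reference text into the transcript."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # omission: character missing in transcript
                current[j - 1] + 1,      # insertion: extra character in transcript
                previous[j - 1] + cost,  # substitution (or match if cost is 0)
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """1 minus the character error rate, relative to the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1 - edit_distance(reference, hypothesis) / len(reference)


# One wrong character out of ten gives 90% accuracy:
#   character_accuracy("abcdefghij", "abcxefghij")
```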
I never tried objectively difficult content (just stuff I find difficult to understand), but one suggestion would be to try the large model if you haven’t already. It is really slow and memory intensive, but definitely more accurate.
Regarding proof-reading, I don’t think we should ascribe anything resembling intelligence to Whisper. It just makes predictions; basically, like a weatherman, it is sometimes right and sometimes wrong. The audio is split into 30-second chunks that are processed one after the other in sequence, so Whisper couldn’t go back and correct itself even if it had any understanding of language. It doesn’t really know what word “makes sense” in a given context. Certainly, my knowledge of machine learning is cursory at best, but I don’t think I’m mischaracterizing its capabilities here.
That being said, I have seen a few instances of emergent capabilities in regards to translations, which is rather fascinating.
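The chunk-by-chunk processing I described can be pictured with a deliberately simplified sketch. The function names are made up and `transcribe_chunk` is a stand-in for the model; the real implementation is more involved (for example, it does not cut at fixed 30-second boundaries), so treat this only as a mental model.

```python
def chunk_bounds(duration_seconds, chunk_seconds=30):
    """Split a recording into consecutive (start, end) windows in seconds."""
    bounds = []
    start = 0
    while start < duration_seconds:
        end = min(start + chunk_seconds, duration_seconds)
        bounds.append((start, end))
        start = end
    return bounds


def transcribe_in_sequence(duration_seconds, transcribe_chunk):
    """Process the chunks strictly in order; earlier text is never revisited."""
    pieces = []
    for start, end in chunk_bounds(duration_seconds):
        pieces.append(transcribe_chunk(start, end))
    return " ".join(pieces)


# A 70-second recording is handled as three windows, the last one shorter:
#   chunk_bounds(70)
```

Once a chunk has been turned into text, nothing in this loop looks at it again, which is why there is no proof-reading pass over the whole transcript.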