This is how to generate perfect transcripts for audio podcasts for free

Great reply as always :slight_smile:

Yeah, I am not sure if it is 90% or 85%; I just estimated that number.
I noticed it especially when reading transcripts of medical podcasts: the medical terms are the ones most likely to be wrong. This is a bummer, because the whole point is to learn those :wink: I will try the large model…
You said you “don’t use LingQ for reading transcripts anymore”. Is that just for transcripts, or in general? If the latter, are you advanced enough to no longer need LingQ, or have you found a better solution?

I currently use the Pleco app (available on iOS and Android) to read transcripts. One advantage is the built-in dictionaries (not user-provided definitions), which also work offline, so I can read without needing Wi-Fi. The interface is rather clunky, but it currently works for me, and I’ll continue to experiment. Another alternative would be to open a text file in a web browser and use a pop-up dictionary. LingQ often creates a mess and gets in the way, and my word count is already inflated, so I only import the occasional YouTube video and books. (Languages other than Chinese or Japanese are obviously not affected.)
Regarding Whisper, the creators are silent on the source of their training data, but it is reasonably certain that it consisted mainly of YouTube videos with subtitles. I suspect that there isn’t much medical content on there, especially in Chinese. That would mean that the model is just not very familiar with specialist terms.
Here are some random ideas (that I haven’t tested):

Thanks. I read most of my Chinese on the PC, so Pleco is not really an option, and I also enjoy being able to mark strings of words as LingQs. That is really more important to me than 1-2 characters. I do use a mouse-over pop-up dictionary on top of LingQ and find it essential, for the reasons you mentioned.

For anyone finding this thread that doesn’t know what Python or C++ are, but wants to give this a try on their own, this video gives step-by-step instructions for how to install and use Whisper.

Thanks for this. I had a bunch of shows that I had no subtitles for, and this worked brilliantly. I used ffmpeg to rip the audio track into a WAV:

ffmpeg -i episode.mp4 episode.wav
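If you have a whole folder of episodes, that ffmpeg invocation is easy to script. A minimal sketch (assuming ffmpeg is on your PATH and the videos sit in the current directory):

```python
import pathlib
import subprocess

def ffmpeg_cmd(video: pathlib.Path) -> list[str]:
    # Equivalent to: ffmpeg -i episode.mp4 episode.wav
    return ["ffmpeg", "-i", str(video), str(video.with_suffix(".wav"))]

# Rip the audio track of every .mp4 in the current directory.
for mp4 in sorted(pathlib.Path(".").glob("*.mp4")):
    subprocess.run(ffmpeg_cmd(mp4), check=True)
```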

Then I used whisper to convert them into SRTs. It took about 20 minutes an episode with my GPU and the results were impressive.
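For reference, Whisper’s Python API returns segments with start/end times in seconds, so turning a transcription into an SRT is only a few lines. A sketch (the segment shape matches what `model.transcribe(...)` returns; the sample segments here are made up):

```python
def fmt_ts(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 75.5 -> 00:01:15,500."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render Whisper-style segments (dicts with start/end/text) as SRT."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{fmt_ts(seg['start'])} --> {fmt_ts(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Hypothetical segments, shaped like whisper's output:
segments = [{"start": 0.0, "end": 2.4, "text": " Hello."},
            {"start": 2.4, "end": 5.0, "text": " Welcome back."}]
print(to_srt(segments))
```

(In practice the command-line tool writes the SRT for you; this is only useful if you call the model from your own script.)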

Hello. Thank you from me, too. I installed Python today on my Windows 11 box and used the “base” multilingual model to produce some Chinese and Russian text files from audio files. I started with Chinese mp3s (from Voice of America) and some Russian mp3s. It works! I can tolerate some errors in the output text. Maybe I’ll be able to use LingQ to read Старик Хоттабыч, a famous children’s book. I have a hardback copy of Старик Хоттабыч, so errors in the computer text file would be easy to overcome.

I open the python output file this way:
f = open("c:/somewhere/" + args.output_file, "w", encoding="utf-8")

My Russian output has punctuation, but my Chinese output does not. Seeing that I had several thousand Chinese characters in my output and no line breaks, I threw in a call to Python’s textwrap.wrap(), which makes things nicer. I get maybe 80-85% correct characters in Chinese with the “base” model and these sound files. I’ll experiment with the “medium” model, eight times slower, some day soon. I don’t own hardware that can hold the “large” model in VRAM.
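For anyone who hits the same punctuation-free wall of characters, the textwrap call looks like this (the sample text is made up; real input would be the transcript string, and the width is arbitrary):

```python
import textwrap

# Stand-in for a raw Whisper transcript: a long run of Chinese
# characters with no punctuation and no line breaks.
text = "这是一段没有标点也没有换行的连续中文输出" * 3

# With no whitespace to break on, wrap() hard-splits the run into
# width-sized chunks (break_long_words=True is the default).
lines = textwrap.wrap(text, width=20)
print("\n".join(lines))
```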

All of these items (LingQ, YouTube, Whisper, ffmpeg, etc.) are part of a giant toolbox, and it is fun to see what can be done.

If you want to do this on a Mac, MacWhisper is an excellent implementation that is really easy to install (you just download it like any other app). There is a paid version, but I found that I got nearly perfect transcriptions on the most accurate free model. Highly recommended.

I had trouble getting Whisper to work on my Linux machine, and the Windows computer I use does not have a compatible GPU, so it was very slow. I found this description of how to run Whisper in a free virtual machine through Google Colab, and it’s worked extremely well. It’s very easy to set up and run.

Good news! OpenAI announced their Whisper API:

Example using curl (the endpoint is the one from their announcement):

curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F model="whisper-1" \
  -F file="@/path/to/file/openai.mp3"

Source: Introducing ChatGPT and Whisper APIs

Documentation: OpenAI Platform
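If you would rather call the endpoint from Python without extra dependencies, the multipart body can be built by hand with the standard library. A sketch only: the form fields mirror the curl flags above, and the file name and bytes here are placeholders.

```python
import os
import uuid

API_URL = "https://api.openai.com/v1/audio/transcriptions"

def build_multipart(model: str, file_name: str, file_bytes: bytes):
    """Build a multipart/form-data body with the same fields as the
    curl example (-F model=..., -F file=@...)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; '
        f'name="file"; filename="{os.path.basename(file_name)}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    tail = f"\r\n--{boundary}--\r\n"
    body = head.encode("utf-8") + file_bytes + tail.encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"

# To actually send it (needs `import urllib.request` and a real
# OPENAI_API_KEY in the environment):
# body, ctype = build_multipart("whisper-1", "openai.mp3",
#                               open("openai.mp3", "rb").read())
# req = urllib.request.Request(API_URL, data=body, headers={
#     "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
#     "Content-Type": ctype,
# })
# print(urllib.request.urlopen(req).read().decode())
```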

I don’t know the language options, but this site is really good for transcribing my Spanish material.

It’s free: just upload video or audio to it. No account required, and no trial limitations as of yet. I just had it transcribe a 1 hour 11 minute audio file. It transcribes really well, though I haven’t compared it to Whisper. I’m not in the mood to go through a Whisper installation process yet, so this is a good quick option for anyone.

Does anyone know if it can output SRT format?

Yes, for the original Whisper:
--output_format srt
[Edit] Technically, the original Whisper already writes out all formats by default when used from the command line.
For whisper.cpp:


This is fantastic. I’m no longer dependent on the freesubtitles site (it’s currently overrun with 90+ queued requests, so basically unusable). Now I can run it myself :slight_smile:

By the way, there is a new tool with a different approach: AssemblyAI published Conformer-1. It does basically the same thing, transcribing speech from YouTube and uploaded audio files, but it has a convenient UI available on their website: AssemblyAI | Playground
It is also free to use, although I haven’t found out what the limitations of the playground version are. I also think it returns the transcription faster.

However, I didn’t find a way to generate timestamps with it; Whisper can do that.

In fact, here is a whole list of such tools:

I tried it with Hindi and found Whisper much better. Conformer-1 gave the result in the Latin alphabet instead of Devanagari and made more mistakes.

MacWhisper is fantastic, thanks for posting it.


Hi, Whisper has been integrated into LingQ and is now an option when importing a lesson:

Thank you for the heads-up! The button is located at the bottom of the page (off the screen on my PC); if you aren’t specifically looking for it, you might never scroll down far enough to see it!

I just tried it by uploading the next chapter of the third Harry Potter book that I hadn’t gotten to yet (chapter 9, 50 minutes), and it worked perfectly! Even better: since the text is in Iberian Spanish and the audio is in Latin American Spanish, doing it this way means the text now matches the audio exactly. As an added bonus, it created the whole chapter in one piece, without splitting it up into 3 or 4 parts. It took about 15-20 minutes for the whole audio to be processed.

I’m sure the Authors Guild wouldn’t be pleased, but as long as the text isn’t shared, I think it’s perfectly legal.

A couple of downsides that are obvious in hindsight: there are no quotation marks or paragraph breaks, just one sentence after another. Obviously, Whisper cannot tell the difference between narration and speech, and equally obviously, it cannot tell when sentences form a cohesive paragraph.

But this is only a minor negative that applies to those of us who enjoy reading the text in page mode.