Best Way to Generate Subtitles?

Thank you. Where can I find this software/app? Does it work in Farsi?

Thank you!

I tried getting this to work on my Mac to no avail :frowning: Not sure what I did wrong.

Live Transcribe & Notification - The best app I have ever found.
Perfect for transcribing news or programs where the speech is clear and there is not much background noise. Also effective for movies but, of course, less so.

Thanks for the hint! I wasn't aware of Whisper, so I just tried it out.

The result is pretty good, not really worse than Amazon's, but I only tested one minute of audio, because this software is slow: one minute of audio took 20 minutes to process. That would work out to something like 30 hours for a standard 90-minute podcast. Unfortunately this makes Whisper unusable for me. For the record, I ran this on a MacBook Pro 16 from 2019 (Intel) and used the sample code from the website.

Maybe someone else can try and share their numbers. I don't know a thing about machine learning, so I probably did something wrong. If there are any performance tips, feel free to share them. Thanks in advance!
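The extrapolation above is easy to check: at 20 minutes of processing per minute of audio, a 90-minute podcast would indeed come out to about 30 hours:

```shell
# 90 minutes of audio x 20 minutes of processing per audio minute,
# converted to hours
awk 'BEGIN { printf "%.0f hours\n", 90 * 20 / 60 }'
# prints 30 hours
```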

Also worth noting is that Whisper doesn't seem to be set up for long-form content: Stops working after long gap with no speech? · openai/whisper · Discussion #29 · GitHub
So it looks like I'll still have to feed "Big Tech", for now.

@Cinderela
The instructions are here: GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision
and Persian is supported. Please feel free to share your results; I would be interested.

Yes, Whisper does take forever on CPU. If you run it on a GPU, however, it is much faster. Using the small model I could process a 30-minute audio clip in approximately 12 minutes.

You can see approximate speeds on their website, I believe: on a GPU, the large, most accurate model should approach real time, while the fastest model should be around 32x real time.

For French I found some errors using the small model, so I exclusively use the large model. I have been using it for TV episodes, which are 21 minutes long. Each one takes approximately 18 minutes to process, and I simply queue 10 or more of them up before I go to sleep.

If you don't have a strong GPU, you may consider using Google Colab. It's a free resource using Google's hardware. I'm not too clear on the limitations, but for Whisper at least I tested it and didn't run into any over the course of about an hour. I imagine it would work for a 90-minute podcast once in a while. Make sure you switch the runtime environment to GPU.

As for long-form content, I did see some discussion about it, but I haven't personally encountered any problems with it, so I can't say much about that.

The only thing about Whisper that annoys me a bit is that the timestamps of the generated subtitles can be somewhat off and need manual adjustment. And of course, as noted on their website, some languages perform wildly differently than others.

OMG bro! I've been looking for something like this for months! I'm so happy. I'm learning Bulgarian and finding transcriptions is like the hardest thing ever. Thank you! :grin:

To make text from audio, you need BlackHole and Microsoft Word. I hope you will now share more of your sources (text and audio) on LingQ :smiley: Here is a video on these tools: Simple automatic transcriptions using Microsoft Word and the Blackhole audio driver. - YouTube

I don't know if anyone else has been looking for a more performant alternative to the original Python implementation, but I found this gem: GitHub - ggerganov/whisper.cpp

It's a reimplementation in C/C++, optimized for CPU using SIMD, threads, etc. Overall I'm very impressed with the results. Occasionally it fails or gets hung up on one sentence, which then repeats endlessly, but this only happened on 2 out of the 23 files I tried.
The accuracy is actually quite impressive. For Chinese it is on par with Amazon, I'd say. I have also tried Romanian, which probably hasn't had much training data; Google Cloud, for example, struggles with it quite a bit. Whisper gets better results, although nowhere near the accuracy it achieves in Chinese. But I'm not looking for perfection anyway. Here are some non-scientific performance numbers. All tests were done with the medium model on a Mac Mini M1 using 4 threads (adding more doesn't improve the results), as of commit b4a3875.
Input length in seconds / time to output in seconds:
1539 / 533
2389 / 697
1755 / 587
1810 / 531
1731 / 553
1733 / 540
1491 / 494
1976 / 659
2181 / 651
2282 / 733
Total: 18887 / 5978
So, roughly 3x real time: 3 hours of input take about 1 hour to process. The above results were with Chinese, but I couldn't see a difference with Romanian.
The software also supports vtt and srt subtitle output. Please note that LingQ will ignore the first subtitle line after importing; I have already reported this bug. Just use txt for the time being.
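As a quick sanity check on the totals above, the throughput ratio can be computed directly from the two sums in the table:

```shell
# Total input seconds divided by total processing seconds (18887 / 5978)
awk 'BEGIN { printf "%.2f\n", 18887 / 5978 }'
# prints 3.16, i.e. roughly 3x real time
```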

I didn't try the Python implementation, but I decided to give this a try, and it seems to work well for Japanese. Granted, I only tested it on a couple of files, and the speaker was speaking very clearly, so take it as it is.
What model did you try? I went for the base model, since it's what they suggest in the repo, but I'm not sure how it fares in terms of quality. As I understand it, with the bigger ones (medium, large) you trade computation for more accurate subtitles?
Thanks for the share btw, very interesting.

My tests were with the medium model. Maybe I'll experiment with other sizes later, but currently I'm satisfied with the accuracy and speed, especially for Chinese. Romanian is already a bit flaky; I'm not sure I could go lower.
The original paper (https://cdn.openai.com/papers/whisper.pdf) includes "word error rate" (WER) data, starting on page 23, which can serve as an indicator of accuracy across the various languages. This might help you decide which model to choose.
Interestingly, Japanese seems to achieve significantly better results than Chinese, even though the latter had quite a bit more training data (see Figure 3 on page 7). Curious.

By the way, one big thing I forgot to mention: Whisper supports punctuation, which is not a given even for paid services.

Guys, I'm so bad at this. How do you even begin to use the program?!

Well, I only tried it on macOS; on Linux it's probably even easier. I don't really know, but compiling on Windows is probably a world of pain.
[Edit: actually, you can just download a Windows executable from the automated build system on GitHub - look under Actions → Artifacts]
Please note, all commands are provided in good conscience, but I assume no liability if you screw up your system :slight_smile:

Prerequisites:

  • C/C++ compiler
  • git (version control, preinstalled)
  • FFmpeg (for converting the audio files)
    macOS:
    If you have Xcode installed, you are good; otherwise open the terminal and run:
    xcode-select --install

You can check your compiler:
clang -v

If you have https://brew.sh/ installed, type the following to install FFmpeg:
brew install ffmpeg

(MacPorts should work as well)

Linux:
Potentially everything is in place already; on Ubuntu you may need:
sudo apt install g++
and
sudo apt install ffmpeg

Cloning the repo and compiling:
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make

Pre-processing the input file (change input.mp3 to your input file):
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le input.wav

Model download:
For example, the medium model:
bash ./download-ggml-model.sh medium

Example command:
This is the command I used for my tests so far (medium model, Chinese language, srt subtitle file output):
./main -m models/ggml-medium.bin -l zh -pc -pp -osrt -f input.wav

Choose the language:
-l zh selects Chinese; substitute the code for your language (e.g. -l ro for Romanian).

Output format:
-otxt, -ovtt and -osrt produce plain text, vtt and srt files respectively.

Beam search:
-bs 5 runs beam search with 5 beams, which can improve accuracy at the cost of speed (run ./main -h to check the flag in your build).

Initial prompt:
The --initial_prompt option gives the model a hint about what came before the first audio sample; it can help with special vocabulary or difficult-to-spell words. For example (the prompt text here is made up):
./main -m models/ggml-medium.bin -l en --initial_prompt "A conversation about LingQ and language learning." -osrt -f input.wav

Other options:
The -pc option colors the output according to how confident the model is.
-pp prints occasional progress messages, e.g. 5% done.


Batch:
You can also do batch processing; here are some ideas:
for f in *.mp3; do ffmpeg -i "${f}" -ar 16000 -ac 1 -c:a pcm_s16le "${f%.mp3}.wav"; done
for f in *.wav; do ./main -m models/ggml-medium.bin -l zh -pc -pp -osrt -f "${f}"; done
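In case the ${f%.mp3} part of the first loop looks cryptic: it strips the .mp3 suffix so the output file gets a .wav extension instead. A quick way to see what it does (the filename is just an example):

```shell
# ${f%.mp3} removes a trailing ".mp3" from the value of f
f="episode01.mp3"
echo "${f%.mp3}.wav"
# prints episode01.wav
```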

Thanks, I'm gonna try it!!

I was trying to sign up for this service, but couldn't get past the initial verification step. I'm not getting a text on my US phone number, and they don't offer an email alternative. :frowning:

You mean iflyrec? I think they only accept Chinese phone numbers, and even then the real problem will be the payment. I don't have an account there, but I doubt their service is somehow better than "Big Tech's". It's probably not worth the hassle. Personally, I have moved on to Whisper and create transcripts myself. But I always found Amazon's service to be adequate and reasonably priced as well. If you are interested, I could send you a sample transcription privately. They only require a credit card, IIRC.

Heyyy, how much did you end up spending on Amazon? And how is the process?

I think I spent about €100 on transcripts. In the free tier you get 60 minutes free each month for one year; if you exceed that, you pay. The pricing depends on currency and region, see here: Amazon Transcribe Pricing – Amazon Web Services (AWS)
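For a rough idea of the cost per episode (assuming the first-tier US rate of about $0.024 per minute; check the pricing page for your region and currency), a 90-minute podcast works out like this:

```shell
# 90 minutes at an assumed $0.024/min (tier-1 US pricing, verify for your region)
awk 'BEGIN { printf "$%.2f\n", 90 * 0.024 }'
# prints $2.16
```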
The process is quite simple: create a storage bucket on S3 and upload your audio files, then add a transcription job. Everything can be done via the website, but it is available via API as well. As for options, I found "speaker identification" to help with accuracy; just enter the number of speakers. Also don't forget to select the subtitle formats. And I've had the language selection reset on me, so when creating multiple jobs, make sure to select the language for each item.
Transcribing with the AWS Management Console - Amazon Transcribe
Getting Started with Amazon Transcribe
