I found this service shared on Hacker News, and thought a few of you might find it useful: https://freesubtitles.ai/
It uses the open-source OpenAI Whisper speech recognition model to generate transcripts WITH punctuation.
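For anyone curious what this looks like under the hood, here is a minimal sketch of the open-source model's Python API (assuming `pip install openai-whisper`; the file name and language are placeholders I made up). It falls back to printing the equivalent CLI call if the package or audio file isn't available:

```python
# Hypothetical minimal Whisper run; "ministory.mp3" is a placeholder file name.
fallback_cmd = "whisper ministory.mp3 --language fr --model medium"
try:
    import whisper  # provided by the openai-whisper package

    model = whisper.load_model("medium")
    result = model.transcribe("ministory.mp3", language="fr")
    print(result["text"])  # punctuated transcript
except Exception:
    # Package or audio file not available here; show the equivalent CLI call.
    print(fallback_cmd)
```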
I tested it with two of the French mini-stories: #1 and #58. For mini-story #1, it was essentially 100% accurate, down to the punctuation. For mini-story #58, it made several errors (“du restaurant” instead of “au restaurant”, “les heurts-annemis” instead of “leurs amis”, and several others).
I would estimate that it is ~95% accurate. That's probably not accurate enough for publicly shared content (at least not without additional curation) but pretty good for personal use.
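For what it's worth, a "~95% accurate" eyeball estimate can be made precise as a word error rate (WER): the word-level edit distance between a reference transcript and the model's output, divided by the reference length. Here is a small stdlib-only sketch (the example sentence is invented, loosely based on the "au restaurant"/"du restaurant" error above):

```python
# Toy WER: edit distance over words (substitutions/insertions/deletions).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four -> 0.25 (i.e. 75% word accuracy).
print(wer("ils vont au restaurant", "ils vont du restaurant"))
```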
The Whisper GitHub page gives a much more technical breakdown of the model: GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision
I’ve just tried this and it successfully output a transcript with punctuation, etc. I’ll take on board what you said about the margin of error, but for LingQ’ers with an advanced understanding of their target language, who are likely to know how to fix said errors, this could be invaluable.
Thanks to this, I’ve resurrected an old YouTube playlist I desperately wanted to add to my collection, but couldn’t due to lack of corrected subtitles. Time will tell if I can turn the Whisper-ised output into a workable LingQ course.
Appreciate the link and your posting it here.
Wow, great! This is exactly what I need.
YouTube videos in Russian usually have no proper subtitles. I just tested an 8-minute podcast and the result is very good. Punctuation is almost 100% right. There are some minor spelling mistakes, but nothing serious. English words are also recognized and written in English (auto-subtitling on YouTube can’t do this).
Thanks for sharing! It would be great if I could import my favourite podcasts into LingQ.
I’m surprised anyone would put this up for free; in my experience, Whisper is rather compute-intensive. But there still seem to be altruists out there, kudos to them.
We discussed Whisper at length in the other thread, but one tip for people who prefer to roll their own: I can heartily recommend whisper.cpp, a more performant re-implementation that is actually usable on CPU alone, in complete contrast to the original, which is based on PyTorch. For example, I can run the medium model on a base-model Mac mini at 3x real-time speed. Compiling should be straightforward; everything you need is in the README: GitHub - ggerganov/whisper.cpp: Port of OpenAI's Whisper model in C/C++
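To give a flavour: once the binary and a ggml model file are built per the README, a whisper.cpp transcription is just a CLI call. The sketch below only builds and prints the command (file names are placeholders; the flags follow my reading of the repo's README), and would run it if the binary happened to exist locally:

```python
import os
import subprocess

# Placeholder paths: the "main" binary and the ggml model file both come
# from building whisper.cpp and downloading a converted model.
cmd = [
    "./main",
    "-m", "models/ggml-medium.bin",  # medium model, converted to ggml format
    "-l", "fr",                      # source language
    "-f", "episode.wav",             # 16 kHz WAV input
    "--output-srt",                  # write an .srt subtitle file
]
if os.path.exists(cmd[0]):
    subprocess.run(cmd, check=True)
else:
    print("would run:", " ".join(cmd))
```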
Here are some observations after transcribing ~150 hours:
In Mandarin Chinese, Whisper is at least as accurate as AWS Transcribe, which I used previously. Whisper has a clear edge in multi-speaker scenarios, especially when people interject or interrupt; this always tripped AWS up (Alexa struggles similarly).
There is an interesting quirk: Whisper chooses seemingly at random between simplified and traditional characters. I cannot detect any pattern in its choices, although it seems to prefer traditional. It’s a bit weird, but fortunately LingQ has both language slots available. Also, I don’t recommend going smaller than the medium model; the smaller ones can easily fail. I have had good experiences transcribing episodes from the following podcasts: 故事FM, 文化有限, 忽左忽右, 螺丝在拧紧.
Here is a graphic that gives you an idea of how Whisper performs across different languages: whisper/language-breakdown.svg at main · openai/whisper · GitHub (lower numbers are better). It can also hint at which model to choose; medium or large are probably a waste for the top-ranked languages.
Bro, I give you a thumbs up. I searched on Google and haven’t found a better one. Team Steve should take your suggestion!
My post didn’t come through?
This seems to be your only post, so I guess not.
Well, I just wanted to say that I am the developer of freesubtitles.ai. I wrote something up about it, but it seems my post was not approved. Nonetheless, the app now has a built-in video player, and I’m working on a feature that will show the subtitles in your target language and your native language simultaneously. This, for me, has been the best way to learn, as it makes all content comprehensible and compelling.
Btw, Steve Kaufmann has been my language-learning guru for years, so this invention of mine is largely inspired by what I’ve learned from him over the years. Anyways, this feature should be shipping for the general public soon. If I can help the LingQ developers integrate it in some way, let me know; I’d love to give back for all the help Steve has given me over the years. Cheers!
Well, my MacBook can’t generate transcriptions very well, so I pay for a GPU server and have a lot of excess compute that I basically give away for free. The server only costs me about $250/month ($0.27/h), and it’s in constant use all day generating lots of transcriptions for people, so in my mind the value-add is there. Maybe someday I’ll figure out a monetization scheme to help the project break even, but for now I’m just enjoying using it myself and letting others use it for free as well.
I was inspired by Steve to pay for a transcription service to practice Serbian; I was paying $100/month for 30 hours/month at SimonSaysAI. But when I found out about Whisper and started using it, the transcriptions were way better, and I could build my own custom user interface the way I liked it. Overall I like using my app a lot more than SimonSays now lol. I gave them feedback on how to improve their UI, but they never took any of it, so now I’m glad I can just build what I want for myself, and it seems others like it as well, so all the better!
Great that you’re chiming in here, and thank you for democratizing speech-to-text by providing such a service. Indeed, it would be great if LingQ could implement something similar. I have written them a message and directed them toward your site and repo, although I won’t hold my breath.
Hey, it’s my pleasure. I’m a big believer in information freedom, and that means information should be free to move between languages; this is my attempt at trivializing language barriers and language learning.
@bdm - Thanks for bringing this up!
@meddit_app - Let us think about how we might incorporate something like this. Fundamentally, do you need to upload an audio file, or can audio or video URLs be used? I see you mention using videos. Also, it’s unclear to me what the Model field refers to.
You can upload either audio or video files. Audio is the only thing that’s processed; if you pass a video file, the code will just ignore the video stream and process the audio from the content.
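For reference, stripping the video stream is a one-liner with ffmpeg. The sketch below (placeholder file names; it only prints the command rather than running it) uses `-vn` to drop the video and resamples to 16 kHz mono, which is the sample rate Whisper expects:

```python
# Build (but don't run) an ffmpeg command that extracts mono 16 kHz audio.
cmd = [
    "ffmpeg",
    "-i", "input.mp4",   # placeholder input video
    "-vn",               # drop the video stream entirely
    "-ac", "1",          # downmix to mono
    "-ar", "16000",      # 16 kHz, Whisper's expected sample rate
    "audio.wav",         # placeholder output file
]
print(" ".join(cmd))
```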
Model refers to the AI model that is trained to do the transcription. The larger models take more time to process and require more RAM, but they have better results, whereas the smaller models process a lot faster with less accuracy. Generally I always try to use the medium or large model, which both give really accurate results.
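The trade-off can be sketched as "pick the largest model that fits your memory budget". The figures below are rough approximations of the memory requirements listed in the Whisper README (check it for exact numbers):

```python
# Approximate (model name, ~GB of memory needed) pairs, smallest to largest.
MODELS = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large", 10)]

def pick_model(available_gb: float) -> str:
    """Return the largest model fitting the budget; fall back to tiny."""
    fitting = [name for name, gb in MODELS if gb <= available_gb]
    return fitting[-1] if fitting else "tiny"

print(pick_model(6))  # a ~6 GB GPU can run up to the medium model
```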
I’m in the process of writing an API for freesubtitles.ai. Or, as @bamboozled mentioned, there is a C port of Whisper that may be performant on CPUs, which I’m going to try out today. If that goes well, it may lower the bar considerably for the cost of running a server to process this stuff. We’ll see; lots of possibilities for sure.
And I think it could help people out quite a lot. I was having trouble learning Serbian because YouTube doesn’t have automatic transcriptions for Serbian, which is mostly how I would practice my other languages. But then I saw Steve mention he used a paid service for transcriptions, and I started to do the same, and it helped me progress a lot. So I think there’s a lot of value in it, and Whisper supports like 98 languages or something, which is quite cool.