Yeah, I am not sure if it is 90% or 85%, I just randomly estimated that number.
I noticed it especially when I read transcripts from medical podcasts. The medical terms are the ones most likely to be wrong. This is a bummer, because the whole point is to learn those. I will try to use the large model...
You said you "don't use LingQ for reading transcripts anymore". Is this just for transcripts, or in general? If in general, are you advanced enough to no longer need LingQ, or have you found a better solution?
I currently use the Pleco app (available on iOS and Android) to read transcripts. Some advantages are the built-in dictionaries (not user-provided definitions), which also work offline, so I can read without needing Wi-Fi. But the interface is rather clunky. It currently works for me, although I'll continue to experiment. Another alternative would be to just open a text file in a web browser and use a pop-up dictionary. LingQ often creates a mess and gets in the way, and my word count is already inflated, so I only import the occasional YouTube video and books (languages other than Chinese or Japanese are obviously not affected).
Regarding Whisper, the creators are silent on the source of their training data, but it is reasonably certain that it consisted mainly of YouTube videos with subtitles. I suspect that there isn't much medical content on there, especially in Chinese. That would mean that the model is just not very familiar with specialist terms.
Here are some random ideas (that I haven't tested):
Create an initial prompt containing some of the special terms; this might help push the model in the right direction (not supported by freesubtitles.ai).
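A minimal sketch of the initial-prompt idea, assuming the open-source openai-whisper Python package is installed. The example terms, the Chinese prefix sentence, the 30-term cap, and the function names are all placeholders made up for illustration. `initial_prompt` is an existing option of `transcribe()`, but whether it actually improves medical vocabulary is untested, as said above.

```python
def build_initial_prompt(terms, prefix="以下是一段医学播客的内容，涉及："):
    """Join specialist terms into one short prompt string for Whisper.

    Whisper only conditions on a limited amount of prompt text, so the
    list is capped (30 terms is an arbitrary choice, not a Whisper limit).
    """
    # Strip whitespace, drop empty entries, and remove duplicates in order.
    unique_terms = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    return prefix + "、".join(unique_terms[:30]) + "。"


def transcribe_with_terms(audio_path, terms, model_name="medium", language="zh"):
    """Transcribe audio_path while nudging the model with a term list."""
    import whisper  # imported lazily so build_initial_prompt works without it

    model = whisper.load_model(model_name)
    result = model.transcribe(
        audio_path,
        language=language,
        initial_prompt=build_initial_prompt(terms),
    )
    return result["text"]


if __name__ == "__main__":
    # Hypothetical example terms; replace them with words from your podcast.
    print(build_initial_prompt(["高血压", "糖尿病", "心肌梗死"]))
```

Calling `transcribe_with_terms("episode.mp3", ["高血压", "糖尿病"])` would then return the transcript text; `episode.mp3` is just a stand-in file name.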
Thanks. I read most of my Chinese on PC, so Pleco is not really an option, and I also enjoy that I can mark strings of words as LingQs. This is really more important for me than 1-2 characters. I do use a mouse-over pop-up dictionary on top of LingQ, and I find it essential for the reasons you mentioned.
For anyone finding this thread who doesn't know what Python or C++ are, but wants to give this a try on their own, this video gives step-by-step instructions for how to install and use Whisper.
Hello. Thank you from me, too. I installed Python today on my Windows 11 box and used the "base" multilingual model to produce some Chinese and Russian text files from audio files. I started with Chinese mp3s from voachinese.com (Voice of America) and Russian mp3s from archive.org. It works! I can tolerate some errors in the output text. Maybe I'll be able to use LingQ to read Старик Хоттабыч, a famous children's book. I have a hardback copy of Старик Хоттабыч, so errors in the computer text file would be easy to overcome.
I open the python output file this way:
f = open("c:\\somewhere\\" + args.output_file, "w", encoding="utf-8")
My Russian output has punctuation, but my Chinese output does not. Seeing that I had several thousand Chinese characters in my output and no newlines, I threw in a call to Python's textwrap.wrap(), and that makes things nicer. I get maybe 80-85% correct characters in Chinese with the "base" model and the voachinese.com sound files. I'll experiment with the "medium" model, eight times slower, some day soon. I don't own hardware that can hold the "large" model in VRAM.
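For anyone who wants to try the same trick, here is a rough sketch of wrapping an unpunctuated transcript and saving it as UTF-8. The function names and the width of 40 are my own choices for illustration, not what the poster above used.

```python
import textwrap
from pathlib import Path


def wrap_transcript(text, width=40):
    """Break a transcript into lines of at most `width` characters.

    Chinese text has no spaces, so textwrap sees one very long word and
    cuts it every `width` characters (break_long_words defaults to True).
    Note that width counts characters, not on-screen columns.
    """
    return "\n".join(textwrap.wrap(text, width=width))


def save_transcript(text, output_path, width=40):
    """Wrap the text and write it as UTF-8 so Chinese and Russian survive."""
    Path(output_path).write_text(
        wrap_transcript(text, width) + "\n", encoding="utf-8"
    )


if __name__ == "__main__":
    sample = "今天天气很好" * 20  # 120 characters, no spaces or punctuation
    print(wrap_transcript(sample, width=40))
```

Passing `encoding="utf-8"` explicitly matters mostly on Windows, where the default text encoding is often not UTF-8.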
All of these items (lingq, youtube, whisper, ffmpeg, etc) are part of a giant toolbox and it is fun to see what can be done.
If you want to do this on a Mac, MacWhisper is an excellent implementation that is really easy to install (you just download it like any other app). There is a paid version, but I found that I got nearly perfect transcriptions on the most accurate free model. Highly recommended.
I had trouble getting Whisper to work on my Linux machine, and the Windows computer I use does not have a compatible GPU, so it was very slow. I found this description of how to run Whisper in a free virtual machine through Google Colab, and it has worked extremely well. It's very easy to set up and run.
It's free; just upload video or audio to it. No account required. No trial limitations as of yet. I just had it transcribe a 1 hour 11 minute audio file. It transcribes really well, but I haven't compared it to Whisper. I'm not in the mood to go through a Whisper installation process yet, so this is a good quick option for anyone.
Yes, for the original whisper:
--output_format srt
[Edit] Technically, the original whisper already outputs all formats by default when used from the command line.
For whisper.cpp:
-osrt
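If you call Whisper from Python instead of the command line, nothing is written to disk for you, so here is a small sketch of building the SRT text yourself. It assumes each segment is a dict with "start", "end" (in seconds) and "text", which is how the openai-whisper package lays out `result["segments"]`; the function names are made up for this example.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Turn a list of segment dicts into the text of an .srt file."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # SRT entries are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hand-written segments standing in for result["segments"].
    demo = [
        {"start": 0.0, "end": 2.5, "text": " 你好"},
        {"start": 2.5, "end": 4.0, "text": " 世界"},
    ]
    print(segments_to_srt(demo))
```

Write the returned string to a file with `encoding="utf-8"` and most players should accept it as a subtitle file.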
By the way, there is a new tool with a different approach: AssemblyAI has published Conformer-1. Basically, it does the same thing, transcribing speech from YouTube and uploaded audio files, but it has a convenient UI available on their website, the AssemblyAI Playground.
It is also free to use, although I didn't find out what the limitations of the playground version are. I also think it returns the transcription faster.
However, I didn't find out how to get timestamps with it. Whisper can do that.
I tried it with Hindi and found freesubtitles.ai much better. Conformer-1 also gave the result in the Latin alphabet instead of Devanagari and made more mistakes.
Thank you for the heads-up! The button is located at the bottom of the page (off the screen on my PC), and if you aren't looking for it and scroll up, you wouldn't even see it!
I just tried it by uploading the next chapter of the 3rd Harry Potter book that I hadn't gotten to yet (ch. 9, at 50 minutes), and it worked perfectly! Even better, since the text is in Iberian Spanish and the audio is in Latin American Spanish, by doing it this way the text now matches the audio perfectly. As an added bonus, it created the whole chapter complete, without splitting it up into 3 or 4 parts. It took about 15-20 minutes for the whole audio to be processed.
I'm sure the author's guild wouldn't be pleased, but as long as the text isn't shared, I think it's perfectly legal.
A couple of downsides that are obvious in hindsight: there are no quotation marks or paragraphs, only one separate sentence after another. Obviously, Whisper cannot tell the difference between narration and speech, and, equally obviously, it cannot tell when sentences form a cohesive paragraph.
But this is only a minor negative that applies to those of us who enjoy reading the text in page mode.