Ever since I started importing long videos, I have had trouble with the text being broken up while the sound is not. LingQ still does this. Combined with the Whisper models that are now available, I have been able to solve it. It is a lot of work, so it is only worth it if the content is really interesting to you. For Whisper I use the website “freesubtitles.ai”, which uses the largest model available and is pretty accurate, but not faultless. Below I describe how I do it, so other people can do the same if the need arises.
- I download the video, which is a .mp4 or a .mkv.
- Then I use an audio tool (Audacity) to break up the sound in a logical place. As I don’t understand the speech yet (or I would probably not import it), I use Audacity to find the pauses.
- To find a pause, I go to somewhere between 15 and 30 minutes into the video and look for a gap in the sound that is long enough. I note the time of each gap, for example 30:10.432, i.e. 30 minutes and 10.432 seconds.
3.a. Be sure that no part is larger than 60 MB. I try to stay well under it, as LingQ will otherwise refuse the upload of the mp3 file.
3.b. I know there are shrinking services out there, but then you run the risk of an automatic break-up by LingQ if the sound is too long. Hence the cut-off at 30 minutes at most.
- Then I use ffmpeg to split off the audio at the required points, like so:
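ffmpeg -to 30:10.4 -i <filename>.mp4 <filename-part-1>.mp3
ffmpeg -ss 30:10.4 -to 1:00:00 -i <filename>.mp4 <filename-part-2>.mp3
ffmpeg -ss 1:00:00 -i <filename>.mp4 <filename-part-3>.mp3
-ss = the starting point (or zero if left off)
-to = the ending point (or end of video if left off)
Note: -t is a duration counted from the starting point, not an ending point, so with -ss 30:10.4 -t 1:00:00 the second part would run on to 1:30:10.4. Placing -to before -i needs a reasonably recent ffmpeg.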
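If you have more than a couple of cut points, typing the commands by hand gets error-prone. Below is a minimal sketch in Python that only builds the ffmpeg command lines from a list of noted pause times. It assumes -ss is used for the start position and -to for the end position of each part; the function name, the file name "lecture.mp4" and the "-part-N" naming are my own placeholders, not anything ffmpeg or LingQ prescribes.

```python
def build_split_commands(source, cut_points, stem):
    """Return one ffmpeg argument list per audio part.

    cut_points are timestamps such as "30:10.4", in ascending order.
    N cut points give N + 1 parts.
    """
    # None marks "start of file" at the front and "end of file" at the back.
    bounds = [None] + list(cut_points) + [None]
    commands = []
    for part, (start, end) in enumerate(zip(bounds, bounds[1:]), start=1):
        cmd = ["ffmpeg"]
        if start is not None:
            cmd += ["-ss", start]  # start position in the input
        if end is not None:
            cmd += ["-to", end]    # end position in the input
        cmd += ["-i", source, f"{stem}-part-{part}.mp3"]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Hypothetical file name and cut points, for illustration only.
    for cmd in build_split_commands("lecture.mp4", ["30:10.4", "1:00:00"], "lecture"):
        print(" ".join(cmd))
```

The script only prints the commands, so you can check them before running anything. Each list could also be handed to Python's subprocess.run to execute it directly.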
- After this I upload the parts to my transcript site and download both the .txt and the .srt.
- Normally I only use the .txt. I collate all lines into one line, so there will be only one paragraph. For me that is convenient. You could leave an empty line between the paragraphs you want.
- Then I import the text only and check that it looks acceptable.
- Subsequently I add audio, open the lesson and wait for the time-stamp regeneration. You get 2 messages, one when it starts and the second after a while, or after you close and open the lesson.
- Then I have to edit the time stamps. They generally need a lot of tweaking. You can do this in sentence mode, or in edit-lesson mode. Only edit-lesson can remove lines though, if you want to glue two lines together.
- The final step, which takes the most time, is fixing the places where the transcriber has left out words, parts of sentences or even whole sentences. I try to correct that if I can. Sometimes the speaker has a bit of a nasal sound, making it harder for the transcriber to get it right, in which case you will have to correct more.
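The collating of the .txt file mentioned above (all lines into one paragraph, with an empty line wherever you want a new paragraph) can be done by hand in an editor, but here is a rough sketch of it in Python. The function name and the sample text are made up for illustration, and I am assuming the transcript is plain text with one fragment per line.

```python
def collate(text):
    """Join transcript lines so that each paragraph becomes a single line.

    Consecutive non-empty lines are joined with a space; one or more
    empty lines start a new paragraph.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            # An empty line closes the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    # Paragraphs are separated by one empty line in the result.
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    # Made-up sample; in practice you would read the downloaded .txt file.
    sample = "first line\nsecond line\n\nnew paragraph\n"
    print(collate(sample))
```

If you leave no empty lines in the transcript, the result is one single paragraph, which is what I normally import.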
Hope this helps someone.