[import] Breaking up a large import into logical pieces

Ever since I started importing large videos, I have had trouble with the text being broken up while the sound is not. LingQ still does this. In combination with the Whisper models that are now available, I have been able to solve this. It is a lot of work, so it is only worth it if the content is really interesting to you. For Whisper I use the website “freesubtitles.ai”, which uses the largest model available and is pretty accurate, but not faultless. Below I describe how I do it, so that other people can do the same if the need arises.

  1. I download the video, which is a .mp4 or a .mkv.
  2. Then I use an audio tool (Audacity) to find logical places to break up the sound. As I don’t understand the speech yet (or I would probably not import it), I use Audacity to find the pauses.
  3. To find a pause, I aim for about 15 to 30 minutes of video and start looking for a gap in the sound that is large enough. I note each gap I pick, for example 30:10.432, i.e. 30 minutes and 10.432 seconds.
    3.a. Be sure not to go over 60 MB per file. I try to stay well under that, as LingQ will otherwise refuse the upload of the mp3 file.
    3.b. I know there are shrinking services out there, but then you run the risk of LingQ automatically breaking up the lesson if the sound is too long. Hence the cut-off at 30 minutes at most.
  4. Then I use ffmpeg to split off the audio at the required points, like so:
ffmpeg -to 30:10.4 -i <filename>.mp4 <filename-part-1>.mp3
ffmpeg -ss 30:10.4 -to 1:00:00 -i <filename>.mp4 <filename-part-2>.mp3
ffmpeg -ss 1:00:00 -i <filename>.mp4 <filename-part-3>.mp3
-ss = the starting point (or zero if left off)
-to = the ending point (or end of video if left off)
Note that -t is not the same as -to: -t sets a duration counted from the starting point, not an ending point.
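With more than a couple of cut points, typing these commands by hand gets error-prone. Here is a minimal shell sketch that prints one ffmpeg command per part from a list of cut points. The function name print_split_commands and the file name talk.mp4 are made up for illustration, and it assumes file names without spaces. It only prints the commands, so you can check them before running anything.

```shell
#!/usr/bin/env bash
# Print one ffmpeg command per part for a list of cut points.
# Usage: print_split_commands INPUT CUT1 [CUT2 ...]
# Hypothetical helper: it only prints the commands, it does not run ffmpeg.
print_split_commands() {
  local input=$1; shift
  local base=${input%.*}   # strip the extension, e.g. talk.mp4 -> talk
  local start="" part=1 cut
  for cut in "$@"; do
    if [ -z "$start" ]; then
      # First part: from the beginning up to the first cut point.
      echo "ffmpeg -to $cut -i $input $base-part-$part.mp3"
    else
      # Middle part: -to is a stop position in the input, not a duration.
      echo "ffmpeg -ss $start -to $cut -i $input $base-part-$part.mp3"
    fi
    start=$cut
    part=$((part + 1))
  done
  # Last part: from the final cut point to the end of the video.
  if [ -n "$start" ]; then
    echo "ffmpeg -ss $start -i $input $base-part-$part.mp3"
  fi
}

print_split_commands talk.mp4 30:10.4 1:00:00
```

For the two cut points in the example call this prints three commands: one for the start up to 30:10.4, one for 30:10.4 to 1:00:00, and one for 1:00:00 to the end.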
  5. After this I upload the audio to my transcription site and download both the .txt and the .srt.
  6. Normally I only use the .txt. I collate all lines into one line, so there will be only one paragraph. For me that is convenient. You could leave an empty line wherever you want a new paragraph.
  7. Then I import the text only and make sure it looks acceptable.
  8. Subsequently I add the audio, open the lesson and wait for the time-stamp regeneration. You get two messages: one when it starts and a second after a while, or after you close and reopen the lesson.
  9. Then I have to edit the time stamps. They generally need a lot of tweaking. You can do this in sentence mode or in edit-lesson mode. Only edit-lesson mode can remove lines, though, which you need if you want to glue two lines together.
  10. The final step, which takes the most time, is fixing the places where the transcription leaves out words, portions of sentences or even whole sentences. I try to correct that where I can. Sometimes the speaker has a bit of a nasal sound, making it harder for the transcription to get it right, in which case you will have to correct more.
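
Collating the .txt lines into one paragraph can also be done on the command line instead of in an editor. This is only a sketch of one way to do it: the function name collate_lines is made up, and it assumes a plain-text transcript with one subtitle line per line. It drops blank lines and joins everything else with single spaces.

```shell
#!/usr/bin/env bash
# Join all non-empty lines of a transcript into a single paragraph.
# Reads the files given as arguments, or standard input if there are none.
collate_lines() {
  awk '
    { sub(/\r$/, "") }                         # drop Windows line endings
    NF { printf "%s%s", sep, $0; sep = " " }   # skip blank lines, join the rest with spaces
    END { print "" }                           # finish with one newline
  ' "$@"
}

# Tiny demonstration with made-up text:
printf 'First line\nsecond line\n\nthird line\n' | collate_lines
```

On a real file it would be used along the lines of `collate_lines transcript.txt > transcript-one-paragraph.txt`, where both file names are placeholders. Since it joins everything, it does not help if you want to keep several paragraphs.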

Hope this helps someone.