
Hey all, I just wanted to share my method for adding private courses to LingQ that include audio and text.
Audiobooks often don’t split their audio files at chapter boundaries. This can be really frustrating if you want to upload to LingQ, as you have to find and divide the text to match each audio section.

This is my method:

  • Listen to an audio file and search your text editor for the first word(s) you hear.

  • Add a section heading in your text editor to mark the start of the audio’s section. Add a symbol at the start of the heading that is not found anywhere else in the text, such as a backtick, like this: `Kapital 14. I’ll explain the importance of this later.

  • Remove text that isn’t in the audiobook. Ebooks tend to include extra text not found in the audiobook and this will prevent LingQ from effectively matching the audio file to the text file.

Ok, I’m sure all of this has been obvious so far, but the magic in my method is that I’ve developed a Python script that’ll split the text on those backticks (or symbol of choice) into individual files. You can even name the files after the audiobook sections.

The last book I uploaded had 90 sections. It took around 30 minutes to split it up by ear and seconds to run the script. The slowest part is uploading each section to LingQ, which took me a few hours.

The script is really simple and if people want it I’m happy to share!
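For anyone who wants a starting point in the meantime, here is a minimal sketch of how such a splitter could look. It is not the actual script from this post: the function names `split_on_marker` and `write_sections`, the numbered file naming, and the choice to drop any text before the first marker are all assumptions made for illustration.

```python
from pathlib import Path


def split_on_marker(text: str, marker: str = "`") -> list[tuple[str, str]]:
    """Split text into (heading, body) pairs at lines that start with marker."""
    sections: list[tuple[str, list[str]]] = []
    for line in text.splitlines():
        if line.startswith(marker):
            # A marked line opens a new section; the marker itself is dropped.
            sections.append((line[len(marker):].strip(), []))
        elif sections:
            sections[-1][1].append(line)
        # Lines before the first marker are ignored (an assumption).
    return [(heading, "\n".join(body).strip()) for heading, body in sections]


def write_sections(sections: list[tuple[str, str]], out_dir: str) -> list[Path]:
    """Write each section to its own numbered .txt file named after its heading."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, (heading, body) in enumerate(sections, start=1):
        # Keep only filename-safe characters from the heading.
        safe = "".join(c if c.isalnum() or c in " -_" else "_" for c in heading).strip()
        path = out / f"{i:03d} {safe}.txt"
        path.write_text(f"{heading}\n\n{body}\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Something like `write_sections(split_on_marker(Path("book.txt").read_text(encoding="utf-8")), "sections")` would then produce one file per marked heading. The zero-padded number keeps the files in book order when sorted by name.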

4 Likes

There should be nothing wrong with sharing the script. I’m sure someone will find it useful. Even better if you choose a more explicit title for your post so people can find it through the search bar in the future.

The slowest part is uploading each section to LingQ, which took me a few hours.

I’d suggest giving this a try: GitHub - daxida/lingq: Lingq scripts for automated editing

It automates the uploading process assuming some file structure. If you have your texts / audios in this tree structure:

example
├── texts
│   ├── text1
│   └── 2text
└── audios
    ├── audio1
    └── audio2

This command from the README:
lingq post el 129139 "example/texts" -a "example/audios" --pairing_strategy zip
will automatically upload them (to the Greek section, el) to the collection (= course) with id 129139, pairing them by zipping (text1 ~ audio1, 2text ~ audio2, etc.).
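To make the zip idea concrete, here is a small sketch of positional pairing in plain Python. It is not the tool's code: `pair_by_zip` is a hypothetical helper, the file lists are given explicitly because how the tool orders files is not described here, and raising an error on mismatched counts is an assumption.

```python
def pair_by_zip(texts: list[str], audios: list[str]) -> list[tuple[str, str]]:
    """Pair the i-th text with the i-th audio, refusing mismatched counts."""
    if len(texts) != len(audios):
        raise ValueError(f"{len(texts)} texts but {len(audios)} audios")
    return list(zip(texts, audios))


pairs = pair_by_zip(["text1", "2text"], ["audio1", "audio2"])
for text, audio in pairs:
    print(f"{text} ~ {audio}")
```

Since the pairing is purely positional, it is worth checking that both folders list their files in the same order before uploading.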

3 Likes

Thanks for the link and advice! I should have guessed someone else would have automated this! Awesome, I’ll definitely look at that.

2 Likes

I did toy with this idea for a couple evenings and I found a way to solve some cases. I will copy some of the docstring here so you can decide if this fits your particular case.

You may find yourself in one of these cases:

(1) Your texts and audios are chapter-structured.
    Then you don't need any of this and you can just post your pairing to LingQ.
(2) Your texts are chapter-structured but your audios are NOT.
    In this case you would want to split with "Preserving chapters" with simply:
    process_files(
        language="ell",
        files_dir="my_input_files",
    )
    This only modifies audios.
(3) Your texts and audios are NOT chapter-structured.
    If you want them to be chapter-structured, this (and most likely no tool out there) does
    not support it. You will have to manually tailor your particular case.
    If instead you want to split it in a number of sections to adapt the length to your needs
    you can by either:
    process_files(
        language="ell",
        files_dir="my_input_files",
        n_sections=10, # At most 10 sections
    )

    Or if you have a number of seconds per section in mind:
    process_files(
        language="ell",
        files_dir="my_input_files",
        cut_at_seconds=30, # At least 30 seconds
    )
    This modifies texts and audios.

By chapter-structured I mean

    example
    ├── texts
    │   ├── chapter1.txt
    │   └── chapter2.txt
    └── audios
        ├── audio1.mp3
        ├── audio2.mp3
        └── audio3.mp3
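To illustrate the `cut_at_seconds` idea from case (3), here is a rough sketch of grouping timed text fragments into sections of at least a given length. It is not the repo's implementation: the `(start, end, text)` fragment format and the `group_by_min_seconds` helper are assumptions, standing in for whatever timings a forced aligner provides.

```python
def group_by_min_seconds(fragments, cut_at_seconds):
    """Group (start, end, text) fragments into sections of at least cut_at_seconds.

    The final section may be shorter if the material runs out.
    """
    sections = []
    current = []
    for start, end, text in fragments:
        current.append((start, end, text))
        # Close the section once it spans the requested minimum duration.
        if end - current[0][0] >= cut_at_seconds:
            sections.append(current)
            current = []
    if current:
        # Leftover fragments form a last, possibly shorter, section.
        sections.append(current)
    return sections
```

Each resulting section gives both the text to write out and the start and end times at which the audio would be cut.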

Caveats:

  • I was only able to test it briefly. It worked very well except for one cut of a ~1h20 audio that was ~10 seconds off (and then it corrected itself later on). Maybe precision issues.
  • aeneas is (sort of) hard to install.
  • No guarantees on aeneas supporting your language.
  • Some parts are still WIP.

The code is in the already linked repo.

2 Likes