Forced alignment of e-books and audiobooks or a simpler alternative?

I’ve been successfully learning from reading e-books on LingQ, but I want to practice listening more, and I think audiobooks would be a great way to read while listening.

The problem is that if I upload an audiobook, the transcription is lacking. There are a lot of misheard words, but what’s even worse is that the transcription often drifts away from the audio (as I already flagged here - Whisper AI issues when transcribing longer files).

Sometimes I’m able to acquire both an e-book and an audiobook of the same content. That would be perfect, but then they are not aligned. I can read the e-book on LingQ and play the audiobook in VLC, but that’s not optimal, since I then have to manage two apps.

I guess there’s no easy solution for this; I tried vibe-coding some forced-alignment scripts, but it didn’t work out for me.

So my proposed solution would be a bit simpler. I can keep track of the spoken sentence myself if LingQ could at least remember where I stopped in the audio file and which page of the e-book I’m on, like how it works when there’s a transcript, just minus the timestamps.

Could that be done somehow?

The problem with the e-book is that it’s broken down into chapters while the audiobook is a single file, so the audio would have to be global to the whole course while the text is not, and LingQ has limitations there as well.

2 Likes

If the book is in chapters, then I assume there are pauses in the audiobook as well?! If so, those should be easily recognizable, and you should be able to split the audio file according to the chapters using software like Audacity.

I used this in the past to split mp3 versions of concerts downloaded from YouTube into the individual songs. You can even tag the split points, and on export the tags will be used to name the files accordingly. For a 1-hour concert, this process usually took me about 5 minutes.

3 Likes

I encounter this problem too; it’s fairly common. Random missing sentences in the transcription and bad syncing. I’ve just learned to live with it and fix the missing parts manually (free dictation practice). I doubt it will be improved or fixed anytime soon; it seems to be at the mercy of 3rd-party AI development.

For this I had to split the audio using tools like Audacity.

Looking at it positively, you get to learn audio editing, and the chore of splitting the audio doubles as listening/reading training, since you have to find the exact spot where the audio matches the text in the book (especially if you have to split in the middle of a chapter).

Other problems:
Some audiobooks have more chapters (audio files) than the e-book, so you may have to merge audio files to match the e-book.

Alternatively, you can split the e-book chapter text to match the audio chapters (I don’t see it as a total waste of time, since I am testing my listening and reading). I prefer to split the chapter text to match the audio so as to keep the word count per lesson low and not hit the word limit.
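For what it’s worth, the merging can also be scripted instead of done in Audacity, at least for WAV files. A minimal Python sketch, stdlib only; the function name is mine, and it assumes all input files share the same sample rate, bit depth, and channel count (mp3s would need ffmpeg or pydub instead):

```python
import wave

def merge_wavs(input_paths, output_path):
    """Concatenate WAV files that all share the same rate/width/channel count."""
    with wave.open(output_path, "wb") as out:
        for i, path in enumerate(input_paths):
            with wave.open(path, "rb") as src:
                if i == 0:
                    out.setparams(src.getparams())  # copy format from the first file
                out.writeframes(src.readframes(src.getnframes()))
```

Call it as `merge_wavs(["ch1.wav", "ch2.wav"], "merged.wav")`; the frame count is fixed up automatically when the output file is closed.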

So if an e-book chapter has more words than the LingQ limit, LingQ splits the chapter text into multiple lessons. In that case, you have to split the audio chapter to match.

If you have a single audio file, you may be able to identify the silent portions in the Audacity waveform as chapter breaks and split accordingly.
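If scrolling through a long waveform gets tedious, the silence hunt can also be scripted. Here’s a rough Python sketch (stdlib only, 16-bit PCM WAV assumed; the thresholds are guesses you’d have to tune per audiobook) that reports stretches of near-silence as candidate chapter breaks:

```python
import struct
import wave

def find_silences(path, min_silence_s=2.0, threshold=500, chunk_s=0.1):
    """Return (start, end) pairs, in seconds, of stretches quieter than threshold."""
    silences = []
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "sketch assumes 16-bit PCM"
        frames_per_chunk = int(w.getframerate() * chunk_s)
        silence_start = None
        pos = 0.0
        while True:
            raw = w.readframes(frames_per_chunk)
            if not raw:
                break
            samples = struct.unpack(f"<{len(raw) // 2}h", raw)
            if max(abs(s) for s in samples) < threshold:
                if silence_start is None:
                    silence_start = pos  # a quiet stretch begins here
            else:
                # quiet stretch ended; keep it if it was long enough
                if silence_start is not None and pos - silence_start >= min_silence_s:
                    silences.append((silence_start, pos))
                silence_start = None
            pos += chunk_s
        if silence_start is not None and pos - silence_start >= min_silence_s:
            silences.append((silence_start, pos))
    return silences
```

Audiobook narration pauses between chapters tend to be much longer than sentence pauses, so a couple of seconds as `min_silence_s` usually separates the two, but that’s exactly the knob you’d tune.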

If it feels like too much work, I just skip it and move on to a different book or podcast, inside or outside LingQ.

2 Likes

Thanks for the suggestions. Maybe splitting chapters based on the visual audio waveform wouldn’t be the worst solution.

As for automation - I found this project, Storyteller Docs | Storyteller, that does some sophisticated forced alignment to produce an EPUB with media overlays.

It’s built for a different purpose and I haven’t tested it yet, but I think I could take that output and generate .srt files from it (LingQ uses .srt for timestamping).
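Assuming the alignment output can be boiled down to a list of (start, end, text) triples in seconds (I haven’t checked Storyteller’s actual output format, so that part is an assumption), writing the .srt itself is the easy part. A small Python sketch, function names mine:

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(cues):
    """Render an iterable of (start_s, end_s, text) cues as SRT text."""
    blocks = []
    for i, (start, end, text) in enumerate(cues, 1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)
```

For example, `to_srt([(0.0, 2.5, "Hello.")])` produces a numbered cue with the `00:00:00,000 --> 00:00:02,500` time line that .srt readers expect.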

Then, to work around LingQ’s 200 MB file upload limit, I’d script chunking of both the audio and the .srt files.
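A rough sketch of the .srt half of that chunking, in Python: split the subtitle file at a cut time and rebase the second half’s timestamps to zero so they line up with the matching audio chunk (the audio itself could be cut at the same offset with ffmpeg’s `-ss`/`-to` options). Function names are mine, and cues straddling the cut are handled crudely by clamping them into the second half:

```python
import re

_TS = re.compile(r"(\d+):(\d+):(\d+),(\d+)")

def ts_to_seconds(ts):
    h, m, s, ms = map(int, _TS.match(ts).groups())
    return h * 3600 + m * 60 + s + ms / 1000

def seconds_to_ts(seconds):
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def split_srt(srt_text, cut_seconds):
    """Return (first_half, second_half); the second half is rebased to start at 0."""
    first, second = [], []
    for block in srt_text.strip().split("\n\n"):
        lines = block.splitlines()
        start_ts, end_ts = (t.strip() for t in lines[1].split("-->"))
        start, end = ts_to_seconds(start_ts), ts_to_seconds(end_ts)
        if end <= cut_seconds:
            first.append((start, end, lines[2:]))
        else:
            # cues past (or straddling) the cut go to the second half, clamped to 0
            second.append((max(0.0, start - cut_seconds), end - cut_seconds, lines[2:]))
    def render(cues):
        return "\n\n".join(
            f"{i}\n{seconds_to_ts(s)} --> {seconds_to_ts(e)}\n" + "\n".join(text)
            for i, (s, e, text) in enumerate(cues, 1)
        )
    return render(first), render(second)
```

Picking the cut point in a silence between cues avoids the straddling case entirely; applying this repeatedly would chop one long file into as many chunks as the upload limit requires.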

I also found another forced-alignment project, GitHub - r4victor/afaligner: 📈 A forced aligner intended for synchronization of narrated text, that could potentially be easier to use, but I’m not sure it’s as accurate as the former.

Anyway, I’ll play around with it at some point and report back if I have something usable. My Python knowledge is weak, so I’m at the mercy of the vibe-code gods on this one.

2 Likes

Thanks, that looks cool; it would be great if it works well.

I suggested to LingQ a while back that they add support for EPUB 3 (with that media-overlay audio feature); I’m not sure if they’re interested in it. Add feature to import Epub3 with synchronized audio timestamp. | Voters | LingQ

Maybe you can make a feature request if that forced-alignment tool works well. I also want audio sync to work perfectly, especially in sentence mode.

For transcription, I’ve tried Subtitle Edit and also Vibe, but they have their own issues and the usual hallucinations.

2 Likes

Yeah, I know what you mean. I tried making my own .srt files by running Whisper locally with some heavier processing parameters, in hopes that it would be more accurate than LingQ’s internal configuration. I still found large gaps in the transcription when I tried using it.

It was quite disappointing, because I spent a good few hours configuring the environment and making sure it runs on my GPU, haha. That led me to the idea of trying out forced alignment though, so maybe something will come out of it.

2 Likes