I have been using LingQ for some years now. When I decide to read a new book, I download the ebook and the audiobook. After that I have to split the audiobook into chapters using Audacity. This is a lot of manual work. Is there now a way for LingQ to do this for me with AI, so I can just upload the audiobook and ebook without the splitting? If this doesn’t exist, are you working on it?
To speed up the process you could use Faster Whisper XXL to transcribe the audiobook. The result is a subtitle file which contains timestamps. The benefit is that you can then search for the last line of each chapter and see at what time it ends. If you ask ChatGPT, it will most likely be able to create a Python script for you that automates the process, using both the ebook and the subtitle file to split the audio file.
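As a rough idea of what such a script could look like, here is a minimal sketch of the search step only. It assumes the subtitle file is a standard SRT file and that the last sentence of the chapter fits inside a single subtitle cue, which will not always be true. The function names and the sample text are made up for illustration.

```python
import re
from difflib import SequenceMatcher

# Matches an SRT timing line such as "00:01:02,500 --> 00:01:05,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)

def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, cue_text) tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            m = TIMING.search(line)
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                cues.append((start, end, " ".join(lines[i + 1:])))
                break
    return cues

def normalize(s):
    """Lowercase and drop punctuation so small transcription differences matter less."""
    return re.sub(r"[^\w\s]", "", s.lower()).strip()

def find_chapter_end(cues, last_sentence):
    """Return (end_seconds, similarity) for the cue most similar to last_sentence."""
    target = normalize(last_sentence)

    def score(cue):
        return SequenceMatcher(None, normalize(cue[2]), target).ratio()

    best = max(cues, key=score)
    return best[1], score(best)

# Made-up sample, only to show the expected input format.
SAMPLE = """1
00:00:01,000 --> 00:00:04,000
It was a cold morning.

2
00:00:04,500 --> 00:00:09,250
And so he closed the door behind him.

3
00:00:12,000 --> 00:00:14,000
Chapter two.
"""

end, similarity = find_chapter_end(parse_srt(SAMPLE), "And so he closed the door behind him")
print(end, similarity)
```

The similarity value is there so you can flag low-confidence matches and check those chapters by hand instead of trusting every cut.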
Thanks for the tip! Won’t that take like ages to do for an audio file that is like 20+ sometimes 30+ hours?
Have you found this to be effective? Seems to me that this could fail at so many levels: punctuation, homophones, unclear pronunciation of the audiobook at the cut, faulty transcription by Whisper etc.
I haven’t used it for audiobooks, but I have used it to transcribe podcasts when the YouTube import issues started. It actually works pretty well, except maybe for words that aren’t part of the standard language so to speak, like nicknames or some slang. Using something like Faster Whisper XXL has the benefit that you can choose the size of the model to be used. So if you are not satisfied with the result, you can always try a larger model. Punctuation wasn’t a problem on my end either.
Well, that depends on your hardware and the size of the language model used. Whisper works pretty well even with smaller (and therefore faster) models on European languages, but also gives good results using the medium model on Korean and probably other Asian languages.
On my end I am using an 8-core 3.4 GHz CPU and 16 GB of RAM under Windows 10.
The transcription speed factor is about 5-6, meaning 5-6 seconds of audio get transcribed per second. For a 30-hour audio file, the transcription would therefore take about 5-6 hours (medium model).
Using the small model the transcription speed doubles. The only difference I can make out on my end is that there is more splitting with the medium model. So words get separated that are written together with the smaller model or there are more sections generated (timestamps). But the text is the same.
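The arithmetic is simply audio length divided by speed factor. A tiny sketch, using the factors from my machine for the medium model and assuming exactly double that for the small model (your numbers will differ):

```python
def transcription_hours(audio_hours, speed_factor):
    """Estimated wall-clock hours needed to transcribe audio_hours of audio."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_hours / speed_factor

# Speed factors are rough observations on one machine, not benchmarks.
for model, factor in [("medium", 5), ("medium", 6), ("small", 10), ("small", 12)]:
    print(f"{model} at {factor}x: {transcription_hours(30, factor):.1f} h")
```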
I would suggest that you take a short excerpt from an audiobook, maybe half an hour or so, and test the results with different models. That way you can judge better, as it really depends on the language. The table below shows the word error rate. For languages marked with an *, it’s the character error rate. WERR: S → M is the relative word error rate reduction when switching from the small to the medium model. For example, in Italian the switch results in 24% fewer misrecognized words.
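In case the WERR column is unclear: it is the drop in error rate divided by the error rate of the small model. A short sketch, where the two input values are hypothetical and not taken from the table:

```python
def relative_wer_reduction(wer_small, wer_medium):
    """Relative error rate reduction when switching from the small to the medium model."""
    if wer_small <= 0:
        raise ValueError("wer_small must be positive")
    return (wer_small - wer_medium) / wer_small

# Hypothetical error rates in percent, chosen only to illustrate the formula.
example = relative_wer_reduction(12.5, 9.5)
print(f"{example:.0%}")  # prints 24%
```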
I use a much less sophisticated method, but it works. I note where I am in my current reading.
For instance:
03 14 50 4 / 15
03 09 18 3 / 15
03 09 18 means 3 hours, 9 minutes and 18 seconds from the beginning of the audio file. 3 / 15 means page 3 of the 15 pages of this lesson.
It’s for reading an audiobook and an ebook together. It requires having LingQ open and an audio player open as well.
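If anyone wants to turn notes in this format into seconds (for example to jump to the right spot in a player), here is a minimal sketch. It assumes every note follows the layout shown above exactly: hours, minutes, seconds, then page / total pages.

```python
import re

# "HH MM SS page / total", for example "03 09 18 3 / 15".
NOTE = re.compile(r"^\s*(\d+)\s+(\d{1,2})\s+(\d{1,2})\s+(\d+)\s*/\s*(\d+)\s*$")

def parse_note(note):
    """Return (offset_seconds, page, total_pages) for one bookmark note."""
    m = NOTE.match(note)
    if not m:
        raise ValueError(f"unrecognized note: {note!r}")
    hours, minutes, seconds, page, total = map(int, m.groups())
    return hours * 3600 + minutes * 60 + seconds, page, total

print(parse_note("03 09 18 3 / 15"))  # (11358, 3, 15)
print(parse_note("03 14 50 4 / 15"))  # (11690, 4, 15)
```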
What you’re asking would be extremely difficult, if not impossible, currently.
I use Epubor to automatically split the Audible audiobook into chapters when I download it.
I agree that if you don’t have the audiobook already split into chapters, what you’re doing would be VERY time intensive. And if I had no choice, I would just listen to the audiobook separately from LingQ. If you listen to it in Audacity, then at least you can separate the audio into chapters the first time through. Then you could upload the chapters into your lessons for future passes through it.
It is possible to use software to add timestamps to a transcript, like in your case the audiobook. (SOURCE: ChatGPT)
Yes, there are several free tools that can perform text-audio alignment using timecoding algorithms. These tools can help you determine the exact timestamps for when each chapter begins, allowing you to split the audio file accordingly.
1. Gentle
- Description: Gentle is a free, open-source tool that uses speech recognition to align spoken audio with a transcript. It generates a file with timestamps (e.g., JSON or SRT format) that you can then use to segment the audio file.
- How it works: It transcribes the audio and compares it with the provided text. The result is a set of timestamps for each word in the text.
- Advantages: Free and can be run locally, so no cloud services are needed, which has privacy benefits.
- Disadvantages: Requires some technical know-how to install and run.
- Link: Gentle GitHub
2. Aeneas
- Description: Aeneas is another open-source tool that synchronizes text and audio. It uses an algorithm to align the transcript with the audio file and generates precise timestamps.
- How it works: Aeneas outputs results in various formats (e.g., SRT, XML), providing timestamps for chapters, sentences, or even individual words.
- Advantages: Very accurate and supports multiple output formats. Can easily be used with Python.
- Disadvantages: Like Gentle, it requires some technical knowledge, especially for installation and configuration.
- Link: Aeneas GitHub
3. Google Colab + Kaldi (or other speech recognition tools)
- Description: If you prefer not to install anything locally, you can use Google Colab to run algorithms like Kaldi (a sophisticated speech recognition toolkit). There are pre-made notebook scripts available that integrate Kaldi or other models to perform text-audio alignment.
- Advantages: No need for local installation. Free to use as long as you stay within Google Colab’s limits for memory and computation.
- Disadvantages: Requires some technical setup, and the results may not be as user-friendly as specialized tools like Gentle or Aeneas.
4. MAUS (Munich Automatic Segmentation System)
- Description: MAUS is a project by the Institute of Phonetics and Speech Processing at the University of Munich, which also provides text-audio alignment. They offer online services as well as downloadable software for performing text-audio alignment.
- How it works: You can upload the audio and text, and the tool automatically aligns them, providing precise timestamps for words and sections.
- Advantages: Free and accessible online without the need for installation.
- Disadvantages: The online version has limited functionality, but the local version requires some technical knowledge.
- Link: MAUS Website
5. WebVTT and Subtitle Tools
- There are also several web-based services that use speech recognition to align text and audio, primarily for generating subtitle files (e.g., WebVTT or SRT formats). These tools are often free for small audio files and provide a timeline for each text segment.
Examples include:
- oTranscribe (for manual alignment)
- YouTube Subtitle Editor (you can upload audio to YouTube, use the automatic captioning, and export the timestamps).
Conclusion:
For free and powerful solutions, Gentle and Aeneas are the best options for automatic text-audio alignment. Both are open-source, and while they require some technical knowledge, they provide precise timestamps that you can use to split your audio file.
If you prefer a solution that requires less setup, you could try MAUS’s online service.
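To get from an alignment result to chapter start times, something like the following could work. This is only a sketch: it assumes the JSON layout that, as far as I know, aeneas writes for its sync maps (a top-level "fragments" list whose entries carry "begin", "end" and "lines"), so check it against your own output file. The heading test and the sample data are made up.

```python
import json

def chapter_starts(sync_map_json, is_heading):
    """Return (start_seconds, text) for every fragment that is_heading accepts."""
    data = json.loads(sync_map_json)
    starts = []
    for fragment in data.get("fragments", []):
        text = " ".join(fragment.get("lines", []))
        if is_heading(text):
            starts.append((float(fragment["begin"]), text))
    return starts

# Made-up sync map in the assumed layout.
SYNC_MAP = json.dumps({
    "fragments": [
        {"id": "f000001", "begin": "0.000", "end": "2.680", "lines": ["Chapter 1"]},
        {"id": "f000002", "begin": "2.680", "end": "9.120", "lines": ["It was a cold morning."]},
        {"id": "f000003", "begin": "9.120", "end": "11.400", "lines": ["Chapter 2"]},
    ]
})

# Hypothetical rule: a heading is any fragment starting with the word "Chapter".
print(chapter_starts(SYNC_MAP, lambda text: text.startswith("Chapter")))
```

The heading rule has to be adapted to the book and language, since not every ebook labels its chapters the same way.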
I’ve found using Whisper AI through the program Subtitle Edit to be the best option for making transcriptions locally.
The idea of my last post is to avoid transcription. It takes a long time for very long audio files and isn’t needed if the transcript is already at hand (in this case in the form of the ebook). The question at hand is what the most efficient automated process is for splitting the audio file according to the chapters of the book, which requires the timestamps at which the new chapters begin.
My idea was that using text-audio alignment might be faster than creating the (ultimately unnecessary) transcript. Once the timestamps are available, the splitting could be done automatically.
I don’t have an audiobook+ebook at hand to test this. Maybe I’ll get a cheap one at the weekend to run a test, as I am curious myself to what degree this task can be automated.
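For the splitting step itself, here is an untested sketch of how it could be automated once the chapter start times are known. It only builds ffmpeg command lines (using the `-ss`, `-to` and `-c copy` options) and does not run them. The file names and times are placeholders, and since the audio is copied without re-encoding, the cut points may be off by a fraction of a second.

```python
def build_split_commands(source, chapter_starts, total_duration,
                         pattern="chapter_{:02d}.mp3"):
    """Return one ffmpeg argument list per chapter; all times are in seconds."""
    starts = sorted(chapter_starts)
    if not starts or starts[-1] >= total_duration:
        raise ValueError("chapter starts must lie before the end of the audio")
    # Each chapter ends where the next one begins; the last ends with the file.
    ends = starts[1:] + [total_duration]
    commands = []
    for number, (start, end) in enumerate(zip(starts, ends), start=1):
        commands.append([
            "ffmpeg", "-i", source,
            "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
            "-c", "copy",  # copy the stream instead of re-encoding
            pattern.format(number),
        ])
    return commands

# Placeholder values for illustration only.
for command in build_split_commands("book.mp3", [0.0, 1800.5, 3650.0], 5400.0):
    print(" ".join(command))
```

Each list could then be passed to `subprocess.run` once you have checked that the printed commands look right.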
Ah, gotcha, I didn’t see that.
Reposting here a script from a discussion we had last week: The slowest part is uploading each section to LingQ, which took me a few hours - #4 by rafarafa The repo with the code is a couple posts above.
I did try Gentle and Kaldi but I couldn’t make them work. I was not aware of MAUS.
Be sure to check that they support your language beforehand: language — aeneas 1.7.3 documentation
I only made it work for Greek, so I can’t really comment on other languages.