Easy Whisper text generation from audio

Mycroft · October 10, 2023, 3:47am

Currently LingQ’s Whisper tool is broken: it splits sentences all over the place, and fixing that by hand takes significant time. So, until LingQ corrects its Whisper text generation tool so that it stops breaking up sentences, I’ve found it easier to use Google’s Colaboratory to generate the text and then upload it and the audio to a new lesson.

This short YouTube video walks you through it step by step, and it’s very easy.

Once you have the text downloaded, paste it into Word (or another word processor). Then do a quick search and replace: first change “^p” to “ ” (paragraph mark to space), then change “. ” (period+space) to “.^p” (period+paragraph mark). That puts each sentence on its own line for the lesson. Copy and paste the text into the new lesson, upload the audio, and then generate the timestamps.
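For anyone who would rather script it than use Word, the two replacements above can be sketched in Python. This is only a minimal sketch: the function name is made up for illustration, and like the Word version it only splits on a period followed by a space, so question marks, exclamation marks and abbreviations are not handled.

```python
import re


def reflow_sentences(text: str) -> str:
    """Join broken lines, then put each sentence on its own line."""
    # Step 1: paragraph mark -> space (collapse every line break into one space).
    joined = re.sub(r"\s*\n\s*", " ", text.strip())
    # Step 2: period + space -> period + line break.
    return joined.replace(". ", ".\n")


if __name__ == "__main__":
    sample = "Hello world. This is\na test. Bye."
    print(reflow_sentences(sample))
```

The result can be pasted into the new lesson the same way as the Word output.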

Usually the title of the lesson you’re creating is not included in the audio, which can mess up the timestamps. So I just make a placeholder title of “1” or a punctuation mark until the timestamps are generated and then afterward, I change it to the title that I want to assign.

ericb100 · October 10, 2023, 3:08pm

It looks like LingQ is probably using the SRT with timestamps; in the video it appears to be broken up in a similar fashion, in the middle of sentences.

Have you tried just using Notepad++ to clean up the LingQ lesson and paste it back in? I think you should be able to fix it all quicker that way than with the above. I’ve done that with Easy German transcripts (which have a similar “problem”) and joined the sentences.

bamboozled · October 10, 2023, 4:09pm

1. To be honest, this process doesn’t make much sense, because the Whisper model is deterministic (under normal circumstances*). That is, it is expected to produce identical results, be it on Google Colab, your PC or the LingQ server.
The problem appears to be purely a formatting issue. If you don’t care about timestamps, you can just reset the lesson formatting by using the “regenerate lesson” button.

2. The video appears to be using the original Whisper implementation by OpenAI, which I consider “research code”, i.e. not production code. For a much faster and more memory-efficient implementation I recommend faster-whisper (GitHub: guillaumekln/faster-whisper, “Faster Whisper transcription with CTranslate2”). In the context of Google Colab, you can expect speeds of up to 5x real time with an Nvidia T4, 11x on a V100 and 14x on an A100.
I’ll attach a minimal example that produces SRT subtitles.

from faster_whisper import WhisperModel
import math

model_size = "large-v2"
audio_file = "input.mp3"
output_file = "output.srt"
audio_language = "en"


def convert_to_hms(seconds: float) -> str:
    hours, remainder = divmod(seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    milliseconds = math.floor((seconds % 1) * 1000)
    output = f"{int(hours):02}:{int(minutes):02}:{int(seconds):02},{milliseconds:03}"
    return output


def convert_seg(segment) -> str:
    # Segment text usually starts with a space; strip it so subtitle lines are clean.
    return f"{convert_to_hms(segment.start)} --> {convert_to_hms(segment.end)}\n{segment.text.strip()}\n\n"


# Run on GPU with FP16 precision
model = WhisperModel(model_size, device="cuda", compute_type="float16")

segments, info = model.transcribe(audio_file, beam_size=5, language=audio_language)

print(
    "Detected language '%s' with probability %f"
    % (info.language, info.language_probability)
)

subtitle_lines = []

for i, segment in enumerate(segments, start=1):
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
    subtitle_lines.append(f"{i}\n{convert_seg(segment)}")

with open(output_file, "w", encoding="utf-8") as f:
    f.writelines(subtitle_lines)

print("Subtitles saved to %s" % output_file)

*When the model struggles to produce results with high enough probability, or encounters repetitions, it activates a temperature fallback, which introduces a random element into the decoding: https://github.com/openai/whisper/blob/0a60fcaa9b86748389a656aa013c416030287d47/whisper/transcribe.py#L147-L185
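To illustrate the footnote, here is a simplified sketch of the fallback idea, not the actual Whisper code. The `DecodeResult` class, the `decode_with_fallback` function and the `decode` callback are hypothetical stand-ins. The temperature schedule and the two thresholds mirror what I understand to be the defaults in the linked file; the real code also has a no-speech check that is left out here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class DecodeResult:
    text: str
    avg_logprob: float        # low value = model was unsure
    compression_ratio: float  # high value = repetitive text


def decode_with_fallback(
    decode: Callable[[float], DecodeResult],  # hypothetical: decodes one chunk at a temperature
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: float = 2.4,
    logprob_threshold: float = -1.0,
) -> DecodeResult:
    if not temperatures:
        raise ValueError("at least one temperature is required")
    result = None
    for temperature in temperatures:
        # Temperature 0 is greedy/beam decoding (deterministic);
        # higher temperatures sample, which is where randomness comes in.
        result = decode(temperature)
        too_repetitive = result.compression_ratio > compression_ratio_threshold
        too_uncertain = result.avg_logprob < logprob_threshold
        if not (too_repetitive or too_uncertain):
            break  # good enough, stop retrying
    return result
```

As long as the first attempt at temperature 0 passes both checks, no sampling happens, which is why results are normally identical from run to run.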

Mycroft · October 11, 2023, 5:16am

You’re correct, guys, that it’s possible to fix it with “regenerate lesson”, and for a short lesson that’s not too much trouble. But for a longer lesson, it’s a pain to find each place where a paragraph mark needs to be exchanged for a space. It’s difficult enough reading a new language; reading it in sentence fragments makes it worse.

bamboozled, thanks for the tip about the faster whisper code. With what instructions do I replace the following in order to use the “faster-whisper” code?

!pip install git+https://github.com/openai/whisper.git
!sudo apt update && sudo apt install ffmpeg

bamboozled · October 11, 2023, 8:11am

!pip install faster-whisper
That should be sufficient; FFmpeg doesn’t have to be installed on the system.
Faster-whisper doesn’t offer a command line tool like OpenAI’s, so it has to be used from code (similar to what I posted above). If you want something similar to OpenAI’s tool, you could check whisper-ctranslate2 (GitHub: Softcatala/whisper-ctranslate2, “Whisper command line client compatible with original OpenAI client based on CTranslate2”), which wraps faster-whisper in a command line tool.
In Colab you could install it like this:
!pip install -U whisper-ctranslate2
And use it something like this, to process all mp3s in the current directory:
!for f in *.mp3; do whisper-ctranslate2 --model large-v2 --language en --output_format srt "${f}"; done
The command line arguments should be very similar to the original OpenAI version.

Anyways, it’s not super important. If the original whisper program works for you then all is good and no need for immediate action. I just felt like mentioning it.

Mycroft · October 11, 2023, 9:39am

Wow! That is MUCH faster! The large-v2 runs over twice as fast as medium under the “research code”!

Thanks bamboozled!

TaosSong · October 11, 2023, 11:50pm

This worked very well. It eliminated 3 of the 4 problems that I tend to see with the LingQ implementation. It took about 3-4 minutes to run on a 22-24 minute audio file (large-v2). The version of Whisper on my laptop (from last fall) gets hung up on intertwined music, creates text loops, and is a little slower than real time.

Thanks to all for the detailed info!

StewartLikesLingQ · October 12, 2023, 12:42pm

Just use www.revoldiv.com/

It’s free, it’s simple, it’s fast. 1 Hour of audio takes about 90 seconds. You can even just search for podcasts through the site. No need to watch any explanation videos.

Mycroft · October 13, 2023, 7:56am

It’s pretty cool, but how does one get the text out of it?

Edit: I think I figured it out. You need to carefully click the right place and drag across the text to select & copy.

Mycroft · November 13, 2023, 9:25am

Hey bamboozled, have you used “insanely-fast-whisper” yet?

https://x.com/reach_vb/status/1723810943329616007?s=61&t=mqH-IwwWkC6Ounv5dnosRQ

Is it possible for me to use it in Colab? TIA

bamboozled · November 13, 2023, 2:27pm

Hi, unfortunately I don’t have much time these days (busy IRL), so I haven’t tried it. But it looks like the author has two notebooks on GitHub, one for faster-whisper and one using the Hugging Face transformers library. Here: https://github.com/Vaibhavs10/insanely-fast-whisper/tree/main/notebooks
The “open in colab” links don’t work for me, so you might have to download the notebook first (“download raw file” on the right side) and upload it to Google Drive.

Looking at the code, the faster-whisper code seems to be very close to mine, so I don’t expect a speedup.
The Hugging Face transformers library implements a different method for “chunking” when transcribing long-form audio. The Whisper model can only work with chunks of 30 seconds of audio. OpenAI’s reference implementation and faster-whisper work sequentially through an audio file longer than 30 s: the results of the previous chunk are fed to the next chunk as a prompt, which provides context and timestamps. This appears to work pretty well, but being sequential means that big GPUs are mostly idle. So Hugging Face came up with a different process, explained here: “Making automatic speech recognition work on large files with Wav2Vec2 in 🤗 Transformers”. They try to provide the necessary context by creating overlap between the chunks. I haven’t tried it, but I assume it will slightly lower the transcription quality. On the upside, the same model (in memory) is able to transcribe independent chunks in parallel. Worth a try.
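To make the overlap idea concrete, here is a minimal sketch that only computes where overlapping chunks would start and end. It is not the transformers implementation; the function name and the 5-second overlap are arbitrary choices for illustration.

```python
from typing import List, Tuple


def chunk_bounds(
    total_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for overlapping chunks."""
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap must be >= 0 and smaller than the chunk length")
    step = chunk_s - overlap_s  # each new chunk starts this far after the previous one
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break  # the last chunk reached the end of the audio
        start += step
    return bounds


if __name__ == "__main__":
    for start, end in chunk_bounds(70.0):
        print(f"{start:5.1f}s -> {end:5.1f}s")
```

Because each chunk is independent of the others, all of them can be sent to the GPU as one batch, which is where the speedup would come from. The overlapping parts then have to be reconciled when the text is stitched back together.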

A tip for Colab is to use Google Drive to save models, transcriptions and audio files.

from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

%cd gdrive/MyDrive/myfolder/
%ls

Mycroft · November 15, 2023, 7:36am

bamboozled, thank you very much for responding. That’s a bit above my competency level, so I think I’ll stick with your faster-whisper implementation. It’s simple and works so well!

I tried using Google Drive for the transcription output, but I couldn’t figure it out. So I have to make sure I download the files before the session times out and deletes them all.

Does the “cd” command redirect output to Google Drive?

bamboozled · November 15, 2023, 8:54am

I created a folder here called testfolder:
https://drive.google.com/drive/my-drive

My colab looks something like this:

Listing content and showing the path are just for additional info and not necessary. cd changes the directory.
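Since cd only changes the working directory, any file written with a relative path ends up in that folder. Here is a minimal plain-Python sketch of that behaviour. A temporary folder stands in for the mounted Drive folder, and the folder and file names are hypothetical.

```python
import os
import tempfile

# Stand-in for something like /content/gdrive/MyDrive/testfolder
# (hypothetical; a temp dir keeps this runnable anywhere).
target_dir = tempfile.mkdtemp()
start_dir = os.getcwd()

os.chdir(target_dir)  # same effect as %cd in a notebook cell

# A relative path now resolves inside target_dir, so the file is written there.
with open("output.srt", "w", encoding="utf-8") as f:
    f.write("1\n00:00:00,000 --> 00:00:01,000\nHello\n\n")

saved_path = os.path.join(target_dir, "output.srt")
print("Saved to", saved_path)

os.chdir(start_dir)  # go back to where we started
```

In Colab, with Drive mounted and the working directory set to a Drive folder, the same thing happens with output_file = "output.srt" in the transcription script.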

Mycroft · November 16, 2023, 5:15am

Fabuloso bamboozled, that worked great!!!