Audio transcription by LingQ

Hi,
The transcription function for audio (like podcasts) is so useful, but let me ask about it.
The output formats created by LingQ (more correctly, by Whisper?) vary a lot. In some cases, each transcribed snippet ends right after the punctuation, which makes it easy for me to understand the texts. In other cases, the break falls in the middle of a sentence, which is personally not practical in reading with reading (on YouTube these cases are very common). Recently, what has come out is the whole text without any paragraphs. I can read it, but it would be better with paragraphs.
Are these transcribed texts created this way unintentionally? Is there any way to adjust the output to make more readable scripts?

2 Likes

When creating the lesson, after the transcription has taken place you can manually add paragraphs. I personally just mark the sentence that begins a new section of the video using >>, as this is faster.

Unfortunately there is no way to merge sentences easily; you basically have to copy and paste the sentences together and delete the ones you don’t need anymore. In the long run it is probably easier if you try to get used to the odd formatting.

What do you mean by reading with reading?

SpeechRecognition packages are available for free, and you can ask ChatGPT to create a Python script for you to do the transcription and to guide you through installing everything needed to run it. I did this when the import issues with YouTube videos began. (It’s not necessary, but I only found that out later :roll_eyes:)
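
For what it’s worth, here is a minimal sketch of what such a script can look like. I’m using the open-source Whisper package here rather than the SpeechRecognition library, and the file name and model size are only placeholders:

# Minimal sketch, not my exact script: "podcast.mp3" and the model size are placeholders,
# and ffmpeg needs to be installed on the system for Whisper to read the audio.
import whisper

model = whisper.load_model("medium")      # smaller models are faster, larger ones more accurate
result = model.transcribe("podcast.mp3")  # placeholder file name

# Each segment has start/end times and the transcribed text for that snippet.
for seg in result["segments"]:
    print(f'{seg["start"]:7.2f}  {seg["text"].strip()}')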

It is possible to let the code format the text depending on pauses in speech, for example, or using simple rules like a new paragraph after a specific number of sentences or whenever certain keywords or phrases are used (see the sketch below). ChatGPT can modify the code according to your needs, although you will probably have to make several tests and modifications before the result suits you.
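
A rough sketch of such a pause-based rule, assuming "segments" is the list Whisper returns above (with "start", "end" and "text" keys); the threshold values are arbitrary and would need tuning:

def to_paragraphs(segments, pause_threshold=1.5, max_sentences=5):
    # Start a new paragraph after a long pause or after enough sentences.
    paragraphs, current, sentence_count = [], [], 0
    prev_end = None
    for seg in segments:
        gap = seg["start"] - prev_end if prev_end is not None else 0.0
        if current and (gap > pause_threshold or sentence_count >= max_sentences):
            paragraphs.append(" ".join(current))
            current, sentence_count = [], 0
        text = seg["text"].strip()
        current.append(text)
        sentence_count += text.count(".") + text.count("?") + text.count("!")
        prev_end = seg["end"]
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)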

You can also hand the text to ChatGPT and it will do the formatting.
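
If you want to automate that step instead of pasting into the chat window, a rough sketch with the OpenAI Python client could look like this (you need an API key, and the file and model names are just examples):

# Sketch only: send the raw transcript to the API and ask for paragraph breaks.
# Assumes the openai package is installed and an OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()
raw_text = open("transcript.txt", encoding="utf-8").read()  # placeholder file name

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name, swap in whatever you have access to
    messages=[{
        "role": "user",
        "content": "Split this transcript into readable paragraphs without changing the wording:\n\n" + raw_text,
    }],
)
print(response.choices[0].message.content)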

1 Like

I found that to be a problem and a hindrance to my understanding and fluent reading of the texts, so I went back to my older way of pulling transcriptions from audio files. I don’t use LingQ’s Whisper model. Instead I log into Google Colab and run WhisperX with a medium language model, and use another model called wav2vec2 to align and fix the timings. It’s very accurate, is free to use and makes appropriate breaks in the language, i.e. where commas and full stops occur. The only things it gets wrong are unusual words like a person’s surname or unusual place names.

It’s not perfect, but I prefer it to what LingQ is currently using, which is probably set to output a standard subtitle length of 60 characters regardless of what is being spoken, and breaks the line because of that rule. The only real problem with my method is that it’s more time consuming.

3 Likes

I use Google Colab also. Here are my prompts. Can you tell me how I can incorporate wav2vec2 into them?

1: from google.colab import drive

drive.mount('/content/gdrive/', force_remount=True)

2: %cd gdrive/MyDrive/C-files/

%ls

3: !pip install -U whisper-ctranslate2

!apt install libcublas11

4: !for f in *.mp3; do whisper-ctranslate2 --model large-v2 --language Spanish --output_format all "${f}"; done

Thanks OS_head.

2 Likes

Sure Mycroft

I install WhisperX first.

! pip install git+https://github.com/m-bain/whisperx.git

When that’s done, it tells me to restart the session. So I comply.

Then I upload my audio files, usually by right-clicking in a blank area of the file pane. You can upload multiple files at once. And just a note on the audio files: I run them through Audacity and export them at 16000 Hz and a low bit rate. This was recommended by the creators of WhisperX, or was it wav2vec2? No matter.

Then I run this command.
!whisperx "Audio_File_Name.mp3" --model medium --language fr --output_dir . --output_format srt --align_model jonatasgrosman/wav2vec2-large-xlsr-53-french --highlight_words False

There’s a whole load of parameters you can add. If you don’t, you get the defaults, which are good enough.

One interesting parameter I’ve played around with is --highlight_words. If you specify True, it will deliver a large srt file with the word currently being spoken underlined. The sub file is huge, though.

Anyway, good luck and let me know how you get on with it.

2 Likes

So this is how I’ve changed my script. You could do something similar for yours.

Check that an Nvidia GPU is available with nvidia-smi.

!nvidia-smi

Install WhisperX

! pip install git+https://github.com/m-bain/whisperx.git

Connect to Google Drive

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

%cd drive/MyDrive/Audio_transcribes/

%ls

Transcribe the audio files to subtitles

!for f in *.mp3; do whisperx --model medium --language fr --output_dir . --output_format srt --task transcribe --align_model jonatasgrosman/wav2vec2-large-xlsr-53-french --highlight_words False "${f}"; done

In the last command you can change --task (transcribe/translate); the default is transcribe. You need to find a wav2vec2 model for Spanish, so a Google search will get that (I get the French one off Hugging Face). If you want word-level time stamps, which is really cool, change --highlight_words to True. I think this is specific to WhisperX though. Oh… and --language fr would become es for Spanish; if left out, it defaults to detecting the language by itself. There’s an example below.
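
For example, a Spanish version of the loop might look like this. I haven’t tested it, so treat jonatasgrosman/wav2vec2-large-xlsr-53-spanish as a guess at the model name that you should verify on Hugging Face first:

!for f in *.mp3; do whisperx --model medium --language es --output_dir . --output_format srt --task transcribe --align_model jonatasgrosman/wav2vec2-large-xlsr-53-spanish --highlight_words False "${f}"; done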

Good luck with it.

2 Likes

Excellent, thanks for the help!!

2 Likes