I’ve noticed that a few of the transcripts I get through auto-generation come with punctuation, while most of them don’t.
Initially I thought it might depend on how fast the speaker speaks, but most of the slow-paced videos don’t have punctuation either.
Is it possible to have the system add punctuation to the transcripts on your end? That would make them a lot more reader-friendly. Thanks.
All punctuation and spacing is determined by Whisper, a weakly supervised AI model. The odd punctuation, which happens in all languages, is largely down to the training data it was built on: subtitles often don’t have great punctuation and have zero correct paragraphing. LingQ cannot change what Whisper produces. What they could do is run the Whisper transcript through ChatGPT to add or fix punctuation, but then they would also need to fix the timestamps. In other words, it’s not an easy problem to solve. But who knows, they may do it. They recently added a ChatGPT Lesson Translation feature to the browser.
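To illustrate why the timestamp part is the tricky bit: here is a minimal sketch of how repunctuated text could be mapped back onto Whisper’s timed segments. It assumes (hypothetically) that the LLM only changes punctuation and casing, never the word count, and the segment dictionaries mimic the shape of Whisper’s `transcribe()` output; it is not LingQ’s actual implementation.

```python
def reflow_punctuation(segments, punctuated_text):
    """Re-attach timestamps to a repunctuated transcript.

    segments: list of {"start": float, "end": float, "text": str},
              the shape of Whisper's segment output.
    punctuated_text: the same words after punctuation/casing were fixed.
    Assumes the word count is unchanged, so tokens map 1:1 by position.
    """
    tokens = punctuated_text.split()
    out, i = [], 0
    for seg in segments:
        n = len(seg["text"].split())  # how many words this segment held
        out.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": " ".join(tokens[i:i + n]),
        })
        i += n
    if i != len(tokens):
        raise ValueError("word counts differ; cannot realign safely")
    return out


# Example: two unpunctuated segments, one repunctuated string.
segments = [
    {"start": 0.0, "end": 2.1, "text": "hello there how are"},
    {"start": 2.1, "end": 3.8, "text": "you today"},
]
fixed = reflow_punctuation(segments, "Hello there, how are you today?")
print(fixed[0]["text"])  # → Hello there, how are
print(fixed[1]["text"])  # → you today?
```

If the LLM merges or splits words, the 1:1 assumption breaks and you would need fuzzy alignment instead, which is exactly why this is not a trivial feature to ship.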
Thanks for the input.
The reason I asked this question is that two of my recently transcribed files are perfectly punctuated (as shown in my screenshots), while most of my transcribed files have no punctuation at all.
So, if Whisper can successfully generate two perfectly punctuated transcriptions, I wonder why it can’t do so consistently.
It’s because Whisper is weakly supervised and was trained on a large amount of subtitles. No one knows what exact patterns Whisper has ‘found’ in that huge amount of data with its 1.55 billion parameters (for the large model), and it’s impossible to tell. All we know is that the patterns it has ‘found’ (merely probabilistically) are not exactly the same as the ones humans use (humans can guess based on meaning, but Whisper does not ‘understand’ anything).