Sentence mode splits sentences in half after whisper auto gen

jamesclarke · June 8, 2023, 7:54am

Apologies if this is a known thing, but when I have LingQ automatically generate subtitles for a podcast mp3, it works perfectly in all but one respect - it inserts line breaks in the text in random places (or what seem like random places) and this makes sentence mode show phrases and fragments of the sentence rather than the whole thing, which can be quite frustrating when trying to translate the whole sentence at once. Is there a way around this that doesn’t involve manually editing each lesson to recombine sentence fragments?

Thanks in advance!

Mycroft · June 8, 2023, 10:03am

Yes, I noticed the same thing.

I imported the “Reseña” for Un Mundo Feliz and it broke nearly every sentence for some reason. It was short and easily corrected manually. The rest of the book, I was able to paste in the text from the ebook itself.

But as a test, I imported via Whisper a smaller 10 minute chapter and it made every sentence a paragraph and split up a few sentences on a comma even though the transcription correctly identified the pause as a comma and even made the beginning of the next “paragraph” a lower case letter.

Strange. I think Whisper is doing its thing, but something is going wrong with importing Whisper’s output. I’m not sure they can get around making each sentence a paragraph, but they should be able to import two segments with a comma as a single sentence/paragraph.

What you might try is to “Regenerate” the lesson, quickly make the changes, and see if that fixes it. You may need to Generate timestamps again, but test first, because the underlying timestamps from Whisper may still work after the adjustments.

jamesclarke · June 8, 2023, 3:44pm

Thanks Mycroft. Okay, so it isn’t just me then.

I’ll do some tests but it would be really great if this could be fixed in the code somehow.

zoran · June 8, 2023, 7:05pm

Thanks for reporting, we will look into it and have it fixed.

jamesclarke · June 9, 2023, 9:09am

Amazing, thanks Zoran.

Mycroft · June 10, 2023, 5:40am

There’s a great demonstration of this effect in the Spanish podcast that nsprung pointed out in his History post today:

The sentence fragments are all over the place, making reading in sentence mode very frustrating.

Sasuem · June 10, 2023, 5:45am

You can use ChatGPT to correct unnecessary line breaks if you want to avoid manual editing. I got that hint from @Constantine and it worked for me.

rickvan_wichen · June 11, 2023, 9:32am

I have the same issue with Bulgarian.

Mycroft · October 10, 2023, 1:57am

Zoran, any update? The sentence breaks are still happening with every Whisper text generation. It seems to cut the sentence after a set number of characters, so every sentence or partial sentence ends up being a paragraph.

zoran · October 10, 2023, 12:15pm

No, no updates here yet, sorry.

TaosSong · October 10, 2023, 7:32pm

I have seen the same behavior with about half of the Whisper-generated texts. The texts have proper punctuation, but the splits seem pretty random.

With letters representing different sentences, I get these splits:

Aaaa aaa aa aaa, Bbbb bb bb bbbb
bbb! Ccc cc cc cc cccc? Ddd dd dd
dd dd, dd dd dd. Eee ee ee ee ee. Ffff
fff fff fff fff. Ggg gg gg gg ggggg ggg
gggg? Hhh hh hhhhh hhh hh hhhh.

Basically, breaking the goodness of Sentence Mode for many podcasts and other speech to text generated content.

ericb100 · October 10, 2023, 8:27pm

You can use Notepad++ to edit the text into full sentences.

Go to “Edit lesson” and click “Regenerate Lesson” (this will give you the text as a whole). Copy that into Notepad++.

Using regular expression mode:

Search for: ([^.!?])\r\n\r\n
Replace with: \1

Then click “Replace All”.

This should give you full sentences with an empty line between each.

If you are on Unix or Mac, your line endings may be different, and you might have to search for ([^.!?])\n\n instead.

Copy the formatted text back into the edit lesson window (if you left it as is after the “Regenerate Lesson” step).

Then click “save” or “view lesson” at the top (can’t remember the wording). It should “save” it. If it doesn’t bring you out of the edit section, click “View Lesson” again. Then click “Edit lesson” again and click the “Generate Timestamps” button. Yes, you could stay in the lesson, but I think sometimes it gets a little hosed up, so going back out and going back to edit lesson seems to work most times.
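For anyone who would rather script the search-and-replace step than do it in Notepad++, here is a minimal Python sketch of the same idea. The function name `rejoin_fragments` is made up for illustration. It assumes the regenerated text separates paragraphs with blank lines, and that the fragments do not already end in a space, so it inserts one space at each join (that part is an assumption, not something the search above does). It accepts both Windows and Unix line endings.

```python
import re

# Same idea as the Notepad++ search: a character that is not . ! or ?
# (and not whitespace), followed by a blank-line paragraph break.
# \r?\n accepts both Windows (\r\n) and Unix (\n) line endings.
BROKEN_BREAK = re.compile(r"([^.!?\s])[ \t]*(?:\r?\n){2,}")


def rejoin_fragments(text: str) -> str:
    """Join a paragraph that lacks sentence-final punctuation to the next one."""
    # Keep the captured character and replace the break with a single space.
    # ASSUMPTION: fragments need a space between them when joined.
    return BROKEN_BREAK.sub(r"\1 ", text)


if __name__ == "__main__":
    sample = "Aaaa aaa, Bbbb bb\r\n\r\nbbb! Ccc cc\r\n\r\ncc cccc?\r\n\r\nDdd dd."
    print(rejoin_fragments(sample))
```

Like the original search, this only looks at the last character before the break, so a sentence that ends with a closing quote or bracket after the period would still be joined to the next one.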

TaosSong · October 12, 2023, 12:07am

Regular expressions could work. Certainly. For a quick fix, I have pasted to ChatGPT:

The following text has line break issues. Please reformat with exactly one sentence per line:
text text text…
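If you would rather not paste the text into a chatbot, the same "one sentence per line" result can be roughly approximated offline. This is only a sketch: the function name `one_sentence_per_line` is invented here, and it assumes a sentence ends at `.`, `!` or `?` followed by whitespace, which will wrongly split abbreviations such as "Dr. Smith".

```python
import re


def one_sentence_per_line(text: str) -> str:
    """Collapse all line breaks, then start a new line after each . ! or ?"""
    # Flatten every run of whitespace (including line breaks) to one space.
    flat = " ".join(text.split())
    # Replace the space that follows sentence-final punctuation with a newline.
    return re.sub(r"(?<=[.!?]) ", "\n", flat)


if __name__ == "__main__":
    broken = "Aaaa aaa, Bbbb bb\nbbb! Ccc cc cc\ncccc? Ddd dd."
    print(one_sentence_per_line(broken))
```

Unlike ChatGPT, this will not repair anything beyond the line breaks, so OCR mistakes or missing punctuation stay as they are.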

The OCR software that came with my book scanner is terrible with sans-serif fonts in Dutch. ChatGPT is pretty good at restoring that too.

Thanks for the tips. This could be world-class software if it “just worked,” but I just re-subscribed for another year, so I am getting value from it as it stands.

Now… …if only the green dot told me what sentence number I am on…