Line Breaks and Commas in Auto-Transcribed Content

I imported some Chinese content from YouTube and from mp3 files.

In Page View, line breaks are nearly always inserted at commas. That isn’t a big deal on its own, but it means Sentence View shows clauses rather than actual sentences, which is a problem.

I can’t tell whether this affects the translations, but it may contribute to the translation frequently falling out of sync with the source material.

Hopefully, it’s a quick/easy fix!

Thanks!

A different version of this: Extremely long sentences could be broken at semi-colons or even commas so they display on one page in sentence view, with the translation.
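
To make the suggestion concrete, here is a minimal Python sketch of that kind of splitting. It is only an illustration, not how LingQ works: the function name `split_long`, the `max_len` threshold, and the choice of marks (ASCII and fullwidth semicolons first, then commas) are all my own assumptions.

```python
import re

def split_long(sentence, max_len=30, marks=(";\uff1b", ",\uff0c")):
    """Break an over-long sentence at semicolons first, then at commas.

    Hypothetical helper: max_len and the mark sets are illustrative choices.
    \uff1b and \uff0c are the fullwidth semicolon and comma used in Chinese.
    """
    sentence = sentence.strip()
    if len(sentence) <= max_len or not marks:
        return [sentence]
    # Split after each mark, keeping the mark attached to the left-hand piece.
    pieces = [p for p in re.split(f"(?<=[{marks[0]}])", sentence) if p]
    # Greedily pack neighbouring pieces together while they still fit.
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_len:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    # Anything still too long gets another pass with the next, weaker mark.
    result = []
    for chunk in chunks:
        result.extend(split_long(chunk, max_len, marks[1:]))
    return result
```

Semicolons are tried first because they usually mark a stronger boundary than commas. A chunk with no marks at all is returned as-is, even if it is longer than `max_len`.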

I’m not sure whether these tools will work for Chinese, but I thought I’d offer them in case they help. I think it mostly depends on whether Chinese uses periods to mark sentences.

First I go to “edit lesson”. Then I click “Regenerate Lesson”. This gives the full text of the lesson. I copy this.

Next, I join the text into essentially one line using the tool at the following link, leaving the default settings:

Join Text – Online Text Tools

Then I create 2 line breaks for each sentence using the tool at the following link. I change the settings to “Add Line Breaks to Sentences” and up the count to “2” for “Add Extra Line Breaks”:

Add Line Breaks to Text – Online Text Tools

This gives me each sentence separated by an extra line. I paste this into the full text of the lesson and then save/view the lesson. This will create a problem with the audio, as it will probably still be using the same breaks as before. To fix this, I use an online YouTube-to-mp3 converter to download the audio, which I then upload to the lesson. Then I “generate timestamps”.

It’s a bit of work, and I’m not sure how the tools I reference will behave with Chinese, so it may not work for you. However, if it does, then as you point out, the meaning can change quite a bit when full sentences are shown rather than piecemeal phrases, which may not be broken in a way that preserves the meaning of the sentence as a whole.
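
For anyone who would rather script the join-then-split steps than use the web tools, here is a rough Python sketch of the same idea. It is not what those tools actually do internally: the function name `resegment` and the `joiner` parameter are made up for this example, and the sentence-end rule (Chinese 。!? anywhere, Western . ! ? only before whitespace or the end of the text) is a simplifying assumption that will still trip over abbreviations like “z.B. ”.

```python
import re

# Chinese sentence enders anywhere; Western ones only before whitespace or the
# end of the text, so a decimal such as 3.50 is not treated as a sentence end.
_SENTENCE_END = re.compile(r"([。!?]+|[.!?]+(?=\s|$))\s*")

def resegment(text, joiner=" "):
    """Join all lines into one, then put a blank line after every sentence.

    Hypothetical helper. Use joiner=" " for languages written with spaces
    (e.g. German) and joiner="" for Chinese, which has no spaces between words.
    """
    # Step 1: join the non-empty lines into a single line.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    joined = joiner.join(lines)
    # Step 2: add two line breaks after each run of sentence-ending punctuation.
    return _SENTENCE_END.sub(r"\1\n\n", joined).strip()
```

For a Chinese lesson you would call `resegment(lesson_text, joiner="")` and paste the result back into the lesson text. The audio still has to be re-timed afterwards, as described above.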

This is an interesting idea, though not something a user should ever have to do!

I just tried this on an mp3 and it works. However, another issue is highlighted: Deselecting “Show spaces between words” removes all spaces in the text, including spaces after commas, rather than only spaces between words.

Yes, I can fix this, too, but…

There are now sentence breaks in the auto-generated translation that do not exist in the original. :smiley:

Agreed, in a sense. Ideally it wouldn’t be this way, but I believe it comes from how Whisper segments its output, as well as how YouTube divides up the time segments. I think that’s why it gets broken up in that fashion. For example, if you import YouTube subtitles, they are split in the middle of sentences all the time. LingQ just imports them as is, and does the same with the auto-transcription from Whisper. So, in a sense, this is a problem with the “source” material.

Oops! Yeah. Like I said, I’m not sure how well these tools work with Chinese. Note that there is a whole plethora of other text tools on those web pages you could try. Maybe you just need to tweak a few things differently than how I’m doing it for German.

The other option is Notepad++, but you have to be strong with regular expressions to fix some of these things. I am, but I’ve turned to the above tools since they make it easier to help others.

Yeah, I’ve been using Notepad++ for a lot of years, and in fact it’s what I used for the above. :slight_smile:

I don’t think a little intelligent post-processing to align sentences rather than sending users to a text editor is too much to expect from a leading language platform.

For anyone interested, this can be fixed by using the correct comma character (not the one on your keyboard). The easiest way is to paste it from another lesson.

, (the ASCII comma on your keyboard, U+002C) is not recognized as a comma by LingQ
, (the fullwidth comma, U+FF0C) is correct
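
If you have a lot of text to fix, the swap can be scripted. This is just a sketch under one assumption: that the comma LingQ expects in Chinese text is the fullwidth comma (U+FF0C) rather than the ASCII comma (U+002C). The name `fix_commas` is made up, and the rule only touches commas that directly follow a Chinese character, so numbers like 1,000 and any English text are left alone.

```python
import re

FULLWIDTH_COMMA = "\uff0c"  # the comma used in Chinese typography

# An ASCII comma (plus any spaces after it) that directly follows a CJK
# ideograph in the basic U+4E00..U+9FFF block.
_COMMA_AFTER_CJK = re.compile(r"(?<=[\u4e00-\u9fff]), *")

def fix_commas(text):
    """Replace ASCII commas after Chinese characters with fullwidth commas."""
    return _COMMA_AFTER_CJK.sub(FULLWIDTH_COMMA, text)
```

The spaces after the comma are dropped as well, since a fullwidth comma already carries its own visual spacing.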
