[request] AI generation: pls drop paragraphs and make real sentences

In importing audio the AI module is completely free to punctuate and insert paragraphs. Of course a snippet of text is not a sentence and a sentence is not a paragraph, so the AI instructions are to make every snippet into a sentence and every “sentence” into a paragraph. This results in a lot of unnecessary pages and the inability of the translator to take the whole context of the sentence into account. As the AI instructions are clearly contrary to the definition of both a sentence and a paragraph, I have the following suggestions:

  1. Do not insert any additional paragraph (barring the first paragraph after the title).
  2. Create real sentences: groups of words that form a complete sentence, ending in a period. That makes each sentence easier to understand.

Can you please create these changes?

I would very much appreciate it.

Terve, Johannes.


Due to the unreliability of the YouTube import I’ve been using Faster Whisper XXL for quite a while now. (I have been mentioning this here and there, so I may sound like an advertiser :smiley:.) It has a sentence argument that splits the lines in the generated subtitle file so that each line corresponds to one sentence, which has been working pretty well for me. Each sentence still starts on a new line, but at least it is easy to see where one sentence ends and the next one starts.

Thank you for your response.

It seems like you are talking about videos with subtitles. I am not. This is about importing audio such that the Lingq AI interprets the text and creates written text. The issue is the way it separates text into units of text called sentences and paragraphs.

I do wonder whether Faster Whisper XXL is capable of interpreting audio and making it into reasonable sentences. However, it probably costs an amount of money on top of the Lingq subscription. Although I am curious to know how it works and how well it works, it is not my intention to use it on top of Lingq. If I would not use Lingq, then I might consider it. Is there a path toward this tool? If so, what are the costs?


Me neither. I was talking about an external tool I use for creating transcripts of audio files, which is based on Whisper AI. It can reliably detect sentence endings in Korean, but I haven’t tested it for other languages yet. It doesn’t create paragraphs, though; something like ChatGPT might be of use there. You can download it for free, just google it.

Thank you. I will look up the tool and see if it is useful for Finnish. BTW I am slowly starting to watch Korean learner podcasts as well and if the tool is useful for Korean, I am very happy about that as well. Hopefully it does some good for Finnish as well. Thank you for the tip!

EDIT: I forgot to mention. I actually don’t care about paragraphs because most AI cannot make paragraphs like humans do. Each paragraph is and should be a coherent logical part of a section or chapter. The coherency makes the paragraph better and I feel at the moment that only humans can do this task properly. That might change in future, but for now, I prefer to leave paragraphs out when transcribing and insert paragraphs when I am reading or analyzing the text. Again, thank you for the tip. I found it in github.


Okay, suffering from old age, I guess. I downloaded and installed Faster Whisper XXL successfully, and it works. However, I completely forgot to note the GitHub repository I got it from (it was far after midnight). Even though I marked it to be watched for releases, I am having a hard time finding it.

Could you please remind me which github repository and library this was?

I am a github member, so once I have the repo, I can re-find it. Sorry for the trouble.


@zoran could you please respond whether this is feasible and for when it would fit the planning? Even though I might find a workaround in a separate module, it is an additional step, above separating the audio, that I can do without. So, a response please?


Your faster whisper xxl tool works like a charm. By this I mean, that the end result is really a set of sentences: very, very good!

The weird thing is that the intermediate srt file contains partial sentences. Even though they are way better than the Lingq AI result, they seem to be limited in length (maybe due to screen size). But once imported into Lingq, they become full sentences.

It looks like the Lingq importer merges the partial sentences into full sentences, which is OK with me. It delivers an end result I can work with.
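For anyone curious what such a merge could look like, here is a minimal Python sketch. It is not Lingq's actual importer, which I have not seen; it only assumes standard srt cues and treats a cue ending in ".", "!", "?" or "…" as the end of a sentence. The names `Cue`, `parse_srt` and `merge_into_sentences` are made up for illustration. The point is that fragments can be joined while the start time of the first fragment and the end time of the last one are kept.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    start: str  # srt timestamp, e.g. "00:00:01,000"
    end: str
    text: str


# One cue: a timestamp line followed by text up to a blank line or end of input.
CUE_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\n(.*?)(?:\n\n|\Z)",
    re.DOTALL,
)

# Assumption: these characters mark the end of a sentence.
SENTENCE_END = (".", "!", "?", "…")


def parse_srt(srt: str) -> list[Cue]:
    """Parse srt text into cues, flattening multi-line cue text to one line."""
    normalized = srt.replace("\r\n", "\n").strip()
    return [Cue(s, e, " ".join(t.split())) for s, e, t in CUE_RE.findall(normalized)]


def merge_into_sentences(cues: list[Cue]) -> list[Cue]:
    """Join consecutive fragments until one ends with sentence punctuation."""
    merged: list[Cue] = []
    pending = None
    for cue in cues:
        if pending is None:
            pending = Cue(cue.start, cue.end, cue.text)
        else:
            pending.end = cue.end  # the sentence ends where its last fragment ends
            pending.text += " " + cue.text
        if pending.text.endswith(SENTENCE_END):
            merged.append(pending)
            pending = None
    if pending is not None:  # trailing fragment without closing punctuation
        merged.append(pending)
    return merged


SAMPLE = """1
00:00:00,000 --> 00:00:02,000
Today I want to talk

2
00:00:02,000 --> 00:00:04,500
about paragraphs.

3
00:00:04,500 --> 00:00:06,000
They are subjective.
"""

for sentence in merge_into_sentences(parse_srt(SAMPLE)):
    print(sentence.start, "-->", sentence.end, sentence.text)
```

Punctuation-based splitting is a simplification: abbreviations or missing punctuation would need extra handling.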

Again, thanks for the tip!


There are different settings for how the length of the individual lines is determined, word count being the standard, iirc. There is also a --sentence setting, which causes each line in the srt file to be a full sentence.
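As a rough sketch of calling it from a script: the only flag taken from this thread is `--sentence`. The executable name `faster-whisper-xxl`, the file name `lesson.mp3` and the helper `build_command` are assumptions for illustration, so check the tool's own help output for the exact name and any other options on your system.

```python
import shlex


def build_command(audio_path, executable="faster-whisper-xxl", extra_args=()):
    """Build the argument list; only --sentence comes from this thread."""
    # Assumption: the executable is on PATH under this name.
    return [executable, audio_path, "--sentence", *extra_args]


cmd = build_command("lesson.mp3")
print(shlex.join(cmd))

# To actually run the transcription, something like:
#   import subprocess
#   subprocess.run(cmd, check=True)
```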

I’ve actually just downloaded the executable :smiley:

Works well enough for me.

That alone makes me wonder how well humans can do. :stuck_out_tongue_winking_eye: As said, I never actually bothered to check whether ai can handle it. I guess it may also depend on the kind of text.

@gbonnema Best I can do at the moment is to forward this to our team. I will let you know if changes like this are in the plans.


Thank you for the response. I really hope the team can come up with a plan or at least let me know if it is a no-go. I appreciate the feedback.


The request is from Nov 2024. I really would like to get rid of all the one-sentence paragraphs. Having a command to delete them all (except the first) or a setting to not generate them at all, would be best.

Can you please specify whether the team is willing to change this?

If all you want to do is remove the paragraph markers, then perhaps you can import the text into a word processor (like MS Word) and replace all the paragraph markers with a space.

@gbonnema Not in plans at the moment. That will be something we will look into in the future.

That approach would destroy timestamps. Besides, why generate paragraphs for every sentence in the first place? Especially, when those sentences are usually only fractions of a sentence?

I know that I wrote this as a request. You should realize, that the choice to insert a paragraph after every line is actually a bug. No school in its right mind would accept a piece written with a fraction of a sentence as a paragraph for every line. Lingq does this habitually, for every text.

This bug may not be in the planning, but it should be. The choice to insert a paragraph after every line is outrageously stupid. The choice of paragraphs is subjective and should not be automatic. Indefinite postponement of repairing this is hard to understand. It needs to be repaired as soon as possible, so that I can enjoy reading in Lingq again, without having to delete hundreds of paragraphs.

Could you please present this as a bug, not a feature?


Really. The lines should be at least sentences, not paragraphs.

Step by step to achieve what you want.

  1. Edit the lesson and click the “Regenerate Lesson” button. (This will not remove the audio.)
  2. Select all the text (Ctrl-a) and cut it (Ctrl-x).
  3. Paste the text into a blank Word document.
  4. Search and replace (Ctrl-h) the sequence of a paragraph marker, a soft return, and another paragraph marker, “^p^l^p” (that’s up-caret p, up-caret el, and up-caret p again), with a space.
  5. Select that text (Ctrl-a), copy it (Ctrl-c), and paste it (Ctrl-v) into the empty box in the lesson.
  6. I recommend moving the first sentence to the title box to make the next step work better. Highlight the first sentence and cut it (Ctrl-x), then click in the Title box, paste (Ctrl-v), and Tab out of the Title box to save it.
  7. Click the link on the left to “Generate timestamps”. This will match the timestamps with the sentences.

Until the techs find the time to provide an option to convert the audio to sentences rather than paragraphs, I think this is a method you could use.
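If you would rather not go through a word processor, the search-and-replace step can be sketched in a few lines of Python. This is only an illustration working on plain text copied out of the lesson: it assumes paragraphs are separated by blank lines, the function name `collapse_paragraphs` is made up, and, like the manual steps, it does not carry timestamps along.

```python
import re


def collapse_paragraphs(text: str, keep_title: bool = True) -> str:
    """Join blank-line-separated paragraphs into one paragraph.

    With keep_title=True the first block stays on its own, so it can be
    moved to the title box as in step 6.
    """
    blocks = [" ".join(b.split()) for b in re.split(r"\n\s*\n", text.strip())]
    blocks = [b for b in blocks if b]
    if keep_title and len(blocks) > 1:
        return blocks[0] + "\n\n" + " ".join(blocks[1:])
    return " ".join(blocks)


lesson = "My title\n\nFirst sentence.\n\nSecond sentence.\n\nThird sentence."
print(collapse_paragraphs(lesson))
```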


It’s also feasible.

Thank you for your extensive tip.

Of course, that would be nice. My problem, though, is that sentences or fractions thereof are declared paragraphs, which they are not. I can make sure I get full sentences myself by using my own whisper and specifying “--sentence” in the command. The issue is that if I have 700 sentences, I get 700 paragraphs, which Lingq only allows me to delete one by one.

About the regeneration of timestamps: I prefer the timestamps that whisper generates (or that are already in the YouTube imports). The regeneration process has given me nothing but problems. I still get, on a regular basis, the “proud” message that it successfully regenerated timestamps on one of my early lessons. I did not request this. It is a sneaky, behind-the-scenes process that does not respect user wishes. I can only avoid this idiotic process by making sure a lesson has timestamps when it is imported.

So re-importing without timestamps causes problems I don’t want.

Having a text where I am the judge of where to put a paragraph is the only proper solution. Barring that, I need a command to mass delete paragraphs.