AI Generated Lessons Tag?

alex1029 · June 27, 2023, 1:04am

So I’ve been going through some new content people have added in Portuguese and I’m finding a lot of mistakes and imperfect lessons that I assume people are adding from the Whisper function. I think it’s nice for private content, but I kind of think this site should only add to public content if it has really been written or analyzed by a native or advanced-level speaker.

If a beginner was going through some of this content they would be confused as to what’s happening in the language. I think at minimum the content should have a tag saying it’s generated, and what I’d prefer to see is to have content like this set to private.
LingQ is already confusing enough, so why add a bunch of bad content on top of it?

Of course I’m excited as well about the new features and I’m using them for my own private stuff, but I wouldn’t share it as if it’s 100% correct. Just my two cents.

bamboozled · June 27, 2023, 5:23am

I think the consensus on Slack was that sharing Whisper generated transcripts is generally allowed, if the content is:

clearly marked as such in the description
and targeted at advanced users

I would personally prefer choice to an outright ban, but if others feel strongly about it I am happy to make amends. My only AI generated courses are in Cantonese, which is a low-resource language; I thought this would be a good way to combat the dearth of content.

Regarding the problematic courses, you could also try to contact whoever shared these transcripts; maybe they themselves are unaware of the quality issues. In an emergency you could also flag the offending courses or even make them private, as you are the librarian for Portuguese.

davideroccato · June 27, 2023, 11:09am

Personally, I’m against lessons generated by AI unless they are very high quality, not like the ones we have. The reason is that I can generate them by myself if I want to and I can choose the voice I prefer. Otherwise, I would expect lessons with real audio that are better quality.

If there is a sort of tag, I’d like to have a choice to filter away that tag from the settings so I cannot see them at all.

bamboozled · June 27, 2023, 11:22am

Sorry for the confusion, I was referring to transcripts generated using the Whisper machine learning model by OpenAI. The audio would be native in that case.

AI generated audio is not allowed to be shared in the libraries, unless something changed very recently @Lotsawa
If you find any of those, it would be best if you could use the report button to let LingQ staff know.

noxialisrex · June 27, 2023, 6:25pm

I assume we also don’t want LLM generated content shared if it was done by something like GPT3.5/4. I.e., not an AI transcription.

If LingQ were to block Whisper transcription, I’d think in the same vein that they should block YouTube auto-generated subtitles, happyscribe, etc. I think in these cases a tagging system should be used, describing how the transcript was made and whether a native speaker reviewed/edited it.

nfera · June 28, 2023, 11:51am

I have shared 50+ hours of Italian audio content with unedited transcripts generated by Whisper. As @bamboozled mentioned, this material is targeted at advanced users (Intermediate 2 and above) and it’s mentioned in the description that the transcript is auto-generated. I share these for the following reasons:

Whisper works great with Italian (depending on the content, I may be able to go many minutes without any error)
I am studying the content anyways, so it’s easy to share (as I select content under the Creative Commons unrestricted licence)
There are definitely some people interested in the content
I’ve seen manually-written transcripts, which are worse than Whisper-generated transcripts (because the transcripts were written first, then the speaker improvises a bit when speaking)
It is only for upper intermediates and advanced users, who should be able to easily see the mistakes while reading while listening (for Whisper-generated transcripts for beginners and lower intermediates, I always edit them first)
Upper intermediates and advanced users need to get used to inaccurate transcripts, as they will be seeing more of them
Over time, other users may edit the transcript, so the errors may slowly reduce.

As you mentioned @alex1029, maybe a tag or a little symbol saying it’s auto-generated (and unedited!) may be easier for users to recognise. At the moment, I just write in the description of the course that it’s 95% accurate and Whisper-generated.

davideroccato · June 28, 2023, 4:39pm

The problem is that if there is no filter, or some sort of strict policy, everyone will do it with whatever criteria they think are useful for them. I don’t even care about those “intermediate” or “advanced” categories on LingQ because I use the percentage of “blue words” or “yellow words” as a reference. There are so many random interpretations of those categories, so how are people supposed to know if they were well intended or not? It is impossible.

I would prefer to have the possibility to filter out anything that is not human spoken. Same thing for anything that has not been personally corrected because the transcript was machine generated.
Those are two different things, but they stem from the same problem: content generated by machines.

Otherwise, be sure that in no time there will be only AI generated or Whisper generated or any other automatically generated content, because it is easier to create.

You might interpret those audios/transcripts in your way, someone else in another way, and so on. If it is good for me then it could be good for someone else, so I post it, and so on. Whisper is good for me, Edge is good for me, this one is good for me as well, I have no problem with this one too.

Before clicking on any lesson on LingQ, there should be an icon or something that shows it is generated by a machine (either the audio or the transcript). When we go to the library, we don’t read any description if we want to click on a random lesson. It should be clear from the start if it is machine generated.

The problem is that this could appear to be convenient for LingQ, because they will have a lot more content than before. The downside is that the quality will decline over time and users will get used to audio/transcription generated by machines because they won’t have an easy way to filter it out.

Imho.

PS: if I want to generate it like that I can do it by myself. Shared content should come from a little bit more effort to give real text+audio content. Or, in the case of machine-generated transcripts, the content should be corrected and perfectly match the audio.
Of course, I might be wrong.

alex1029 · June 28, 2023, 7:03pm

Yes, that’s what I’m looking for. I think the minimum should be saying it’s not perfect. For me, I don’t want to waste my time with it. I’m trying to finish all the content uploaded here in Portuguese. My level is high enough to tell that stuff is wrong. For me it’s not a huge deal, but I think it’s a disservice to new learners, especially when there is higher quality content already posted on here.

Mainly I’m just seeing people add podcasts.

alex1029 · June 28, 2023, 7:09pm

I think this is cool for obscure languages and stuff without a lot of content!

alex1029 · June 28, 2023, 7:10pm

The ones I’ve seen have been like 95% correct, but still I think 5% wrong is a bad look for a professional language service like this.

alex1029 · June 28, 2023, 7:11pm

I agree, and that’s what I’m looking for. I just get annoyed when I start a new lesson here and I get 2 mins in and see that it’s AI generated.

TTom · June 28, 2023, 7:39pm

I agree that some sort of tag so that people can choose would make sense.

As an aside, most of my experience has been with a standalone Whisper app (MacWhisper) rather than the built-in service in LingQ. I don’t know if it’s because I’m learning a different language (German) or using a different model (I tend to use “Large”), but my results have been way better than what people on this thread seem to have experienced. Some of the transcripts that I’ve got back have been astonishingly accurate, even for complex vocabulary and unusual names.

alex1029 · June 28, 2023, 8:43pm

Yeah, like I’m blown away by the transcripts, but still they aren’t perfect. I think it’s an amazing tool but it shouldn’t be public content here.

nfera · June 29, 2023, 8:06am

@davideroccato

There should be no AI spoken audio / TTS in the library! LingQ has a strict policy on this. If you find content where the audio is TTS, report it and it will be removed.

This is exaggerating a bit… In several years, everyone may use AI-generated transcripts, but I don’t see the problem if they are more accurate than human-generated transcripts. Worrying that it’s going to happen today and that in a month there will only be computer-generated content on LingQ is like worrying that your bed will spontaneously combust. It’s just not going to happen.

Asking for a 100% perfection rate is just unrealistic. If you watch movies with subtitles, even on Netflix, you will notice that they are not 100% accurate. Are they useless? Of course not. Does it mean that I should never recommend a movie to someone because the subtitles aren’t 100% accurate? In my opinion, no.

Of course, I would prefer subtitles that are 100% accurate, but there is some degree of practicality that needs to be applied. Finding 100% accurate transcripts of interesting content with great audio, which you can legally share, is not an easy task. Anyone who thinks otherwise is more than welcome to volunteer as a librarian. As you are a native Italian speaker, I’m more than happy for you to edit my 75 hours of Whisper-generated transcripts I shared in the LingQ Library.

nfera · June 29, 2023, 8:12am

A good or bad “look” is a business decision for them, but personally I don’t think it’s that bad. For Italian, I just say it’s 95% accurate, but in reality it’s much better than that. This is why I’m happy to share them. For some other languages, which have a much higher error rate, sharing Whisper-generated content probably wouldn’t be appropriate.

davideroccato · June 29, 2023, 9:17am

I don’t pay Netflix for learning languages but I pay LingQ for learning languages. Subtitles on Netflix are a plus as an additional service. The comparison has nothing to do with it.
They don’t care much about it because the translation comes from the original script instead of from the dubbed version, depending on the budget of the movie. Netflix provides what the movie company provides unless they are the movie makers.

Shared content by users on LingQ is a plus compared to the same content that should be searched, found and shared by the company itself. Users can have many reasons for sharing.

The company should verify the content or have rules on the content shared by users. I prefer to have less audio+text content but be sure it is quality content.

The rest I can upload myself, and it can have any degree of incorrect material I want because it is my private material, not something I would expect as a Premium user.

In the library, there are news articles; by clicking on them I go to the website, import the article and generate the audio.
The same thing can be done in reverse. In the library, there could be audio podcasts; by clicking on them we go to SoundCloud or whatever platform, import the audio and generate the transcript.

At least then I know this is my private content because I imported it, and not content that I find in the library and that should have a higher quality.

Again, you might be ok with 5%, others with 10%, others with 15%. Who’s gonna verify that?

I’m not against robot transcriptions per se if they are verified by someone from the company or by volunteers before being released.

Imho.

nfera · June 30, 2023, 11:18am

LingQ does have rules for content shared by users!

I don’t quite understand why you are saying that it’s okay for ‘externally shared content’ (like news articles) to be of lower quality than ‘shared content’. LingQ has external content for copyright reasons, not as a measure of content quality.