Chinese word recognition (Wrong characters are combined)

Blackandwhite23 · January 28, 2023, 4:40am

I am studying Chinese with LingQ, but I very often have the problem that LingQ recognizes words by combining characters which don’t belong together. How can I tell LingQ that the word is actually shorter or should be combined differently? Since there are no spaces between words in Chinese, the words sometimes need to be cut at different positions.

seanu · January 28, 2023, 3:14pm

I have been having the same problem. When I’m importing a new lesson, I sometimes force the issue by manually inserting spaces between the words before importing the text. I’m not sure that’s sustainable, though, and it’s kind of putting the cart before the horse since it means I’m closely studying the text even before I load it into LingQ.

I get that splitting words in Chinese is harder than it is in languages that use spaces, and I don’t think a little inaccuracy is a big deal when using LingQ alone, because I believe that my brain will sort things out in the long run. But I think that inaccurate splitting is probably a huge negative first impression for anyone who is seeing the app for the first time, and, with new users getting only 20 LingQs before they have to subscribe, there’s not a lot of time to make a good first impression.

Assuming getting the word splits right 99% of the time is infeasible, I wish we could at least have the “select a phrase” feature adapted to allow us to chop the “words” into smaller pieces, too.

I am also relying on the vocabulary export to make sentence cards in Anki, and there the word splitting matters more. Right now I’m dealing with it by reviewing and editing the notes in Anki as part of my flashcard preparation process. That’s fine, and I don’t think adjusting the word splits in LingQ would actually be any less work, but it would be more satisfying. I’d do literally everything in LingQ if I could.

bamboozled · January 28, 2023, 4:58pm

Hi, I’ve been using LingQ for almost two years now learning Chinese (simplified and traditional) and I am convinced that the word splitting issue cannot really be solved, because it is the result of how LingQ works on a fundamental level.

LingQ operates by comparing a list of unique words in a lesson to the user’s database, which contains all the words they know, have LingQed, or have ignored. It then applies the respective colors; this is also how the system knows the number of unknown words displayed in the library. The database is updated dynamically as the user changes the status of individual words. From what I understand this is a day-one decision: LingQ has always worked this way, and I assume it has become part of the LingQ brand. I don’t expect any change in this regard.
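To make the lookup described above concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not LingQ's actual code: the status labels, the function names, and the sample data are all assumptions made up for this example.

```python
# Hypothetical status labels; LingQ's real data model is not described in this thread.
KNOWN, LINGQ, IGNORED, NEW = "known", "lingq", "ignored", "new"


def color_lesson(tokens, user_db):
    """Map each unique token to its status; unseen tokens count as new (blue)."""
    unique = dict.fromkeys(tokens)  # drops repeats, keeps first-seen order
    return {tok: user_db.get(tok, NEW) for tok in unique}


def count_new(tokens, user_db):
    """Number of unique tokens in the lesson that are not in the user's database."""
    return sum(1 for s in color_lesson(tokens, user_db).values() if s == NEW)


# Made-up user database and an already-split lesson.
user_db = {"我": KNOWN, "喜欢": LINGQ, "的": IGNORED}
tokens = ["我", "喜欢", "中文", "的", "书", "我"]
statuses = color_lesson(tokens, user_db)
```

Because the lookup is keyed on whatever the splitter outputs, a wrongly joined pair of characters simply shows up as one more "new" entry, which is where the nonsensical blue words come from.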
The problem is that this process requires splitting a text into words, a relatively simple task in most languages that use spaces to delimit words. In Chinese and Japanese, however, LingQ has to employ a word splitter, or tokenizer. This is part of natural language processing (NLP), and one can find many papers engaging with this (unsolved) problem. To the best of my knowledge LingQ uses a package called ‘jieba’ (GitHub - fxsjy/jieba: 结巴中文分词) to perform this task, which is a bit primitive (~87% accuracy at best [1]), but even if LingQ employed a more sophisticated package that used something like machine learning, it would never be perfect.
The alternative to LingQ’s approach would be a (dumb) dictionary-based solution, similar to how it works on a Kindle, any other e-reader, or a pop-up dictionary as implemented in various browser extensions. This is what I would prefer, especially as I start to engage with literature, so I am currently in the process of switching most of my reading to the Pleco app. This approach is of course inextricably linked to a good, professionally curated dictionary, which goes against LingQ’s crowd-sourced approach. Further advantages are that it can easily tolerate mixed-language content, for example interspersed grammar explanations, and that it would act as a regular ebook or document reader, preserving the table of contents, footnotes, and formatting choices. Additionally, the need to split content into lessons would be obviated.
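A rough sketch of what I mean by a dictionary-based lookup, in Python. This is not how Pleco or any particular pop-up dictionary is implemented; the function name, the `max_len` limit, and the tiny dictionary are assumptions for illustration. The idea is simply to take the longest dictionary entry that starts at the character the reader points at.

```python
def lookup_at(text, pos, dictionary, max_len=4):
    """Return the longest dictionary entry starting at text[pos].

    Falls back to the single character at pos if nothing matches.
    """
    longest = min(max_len, len(text) - pos)
    for length in range(longest, 0, -1):
        candidate = text[pos:pos + length]
        if candidate in dictionary:
            return candidate
    return text[pos]


# Made-up mini dictionary (headword -> gloss).
dictionary = {"图书": "books", "图书馆": "library", "馆": "building"}
text = "我去图书馆"

word = lookup_at(text, 2, dictionary)      # longest entry starting at 图
fallback = lookup_at(text, 0, dictionary)  # 我 is not in the mini dictionary
```

Nothing is split ahead of time here, so there is no stored list of "words" that can be wrong; the lookup only happens for the position the reader actually taps.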
I have, by the way, tried to push for a LingQ browser extension that is connected to the LingQ database, but since it would have to use the latter approach I don’t have high hopes.

But coming back to the word splitter itself: I don’t think it’s terrible, especially for normal conversational language; it only really comes apart when confronted with literary language. Still, even if it reaches 87% accuracy, that leaves about 13% of words incorrectly split.
My main gripe with the system is actually that I am forced to engage with a varying number of nonsensical (blue) words in every lesson I open. At first I tried to ignore those words, but realistically the only sane way to deal with them is to disable highlighting and just turn the page, in effect only engaging with words I want to look up. But this feels like cheating and inflates the known word count.
One of the more enervating issues is how LingQ handles non-Chinese words that are mixed in with Chinese. For whatever reason LingQ lets those ‘join’ with the surrounding characters, creating an unholy English-Chinese combination and ultimately an endless stream of blue words, as the English word will inevitably be surrounded by different characters throughout the text. I find this so infuriating that I have come up with various scripts to pre-process anything I upload to LingQ. Here are some examples:
For private imports I use the following regex in Python: re.sub(r"[^\u4e00-\u9fff0-9\u3000-\u303f ,.;:]", "", text), which gets rid of everything that is not a Chinese character, a number, or a punctuation mark. When I intend to share the content, I try to put spaces around non-Chinese words to prevent the infinite blue word issue: re.sub(r"([a-zA-Z]+)([^a-zA-Z])", r" \1 \2", text)
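Wrapped up as a small runnable script, the two substitutions above could look like the sketch below. The function names and the sample sentences are made up for illustration, and the empty replacement string in the first substitution is an assumption based on the "gets rid of" description.

```python
import re

# Keep CJK ideographs (U+4E00-U+9FFF), ASCII digits, CJK punctuation
# (U+3000-U+303F), spaces, and a few ASCII punctuation marks; match the rest.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff0-9\u3000-\u303f ,.;:]")

# A run of Latin letters followed by exactly one non-Latin character.
LATIN_RUN = re.compile(r"([a-zA-Z]+)([^a-zA-Z])")


def strip_non_chinese(text):
    """For private imports: delete everything outside the kept ranges."""
    return NON_CHINESE.sub("", text)


def pad_latin_words(text):
    """For shared content: put spaces around Latin-letter words."""
    return LATIN_RUN.sub(r" \1 \2", text)


stripped = strip_non_chinese("我用Python学习中文。")
padded = pad_latin_words("我用Python学习中文。")
```

Two limitations worth knowing: full-width punctuation such as ， (U+FF0C) lies outside the kept ranges, so the first pattern removes it, and the second pattern needs a character after the Latin word, so a word at the very end of the text is left unpadded.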

Regarding the question: If you want to correct LingQ’s word splitting, you will have to edit the lesson or sentence. Correcting the character splits before importing doesn’t really work, because you cannot tell LingQ not to run the word splitter. Also, I once thought importing a dictionary as vocabulary could solve this issue - wrong! LingQ applies the word splitter even to vocabulary imports. This is mildly frustrating, since I consider the dictionary normative on the matter of what constitutes a word.
Don’t get me wrong, LingQ is a tremendous resource and I intend to continue to use it, although the experience with “normal” languages is far superior. If you are ready to tackle literature, I recommend switching to software that was designed with Chinese in mind, especially in conjunction with a good monolingual dictionary.
Sorry for the rant. [Also sorry for editing this post 3 hours later; the original writing was “fast and furious”, and consequently a little sloppy. I have taken the liberty of cleaning it up a bit.]
[1] Results tables for “Robust Chinese Word Segmentation with Contextualized Word Representations” | Papers With Code

zoran · January 28, 2023, 10:46pm

We are doing our best to make improvements there constantly, and hopefully with time you will see the difference and improvements in Chinese lessons.

Blackandwhite23 · January 30, 2023, 2:52am

I don’t understand, is it really so difficult to give the user an option to make words shorter? I mean it’s also possible to make them longer… For me that would be a very good solution for that issue before investing thousands of hours to make the algorithm perfect and detect everything on its own.

TofuMeow · January 30, 2023, 5:53am

Like everyone said, that’s the nature of Chinese - it’s kinda annoying but there are some benefits:

LingQ will occasionally pick up words that pop-up dictionaries (even the super nice monolingual ones installed via yomichan) don’t recognize, so it gives you a hint to check baidu - there’s lots of obscure chengyus / vocab (like burial ceremonies of emperors) that i’ve picked up because LingQ gives my brain a hint to do more research
it’s great for parsing character names - nonsense words still get highlighted so you can pick up when new characters / clans are introduced

eventually as you read more you’ll see a pattern in Chinese and know what words are nonsense - easy fix for the lazy is just to turn on the “paging moves blue to known” function, do a quick mouse over with a pop-up dictionary if you aren’t sure it’s a nonsense word yet and deal with it that way

I’ve tried other readers like migaku, pleco, readibu or azure TTS and I’m still a LingQ diehard when it comes to Chinese, so the tradeoffs are worth it - plus it makes your word count go up super fast which is motivating! haha

procion · March 4, 2023, 11:00am

I can’t honestly understand the problem.
Sometimes LingQ gives me complete nonsense combinations, like “運說” (Yun said).
Why can’t it just compare the split words with some authoritative dictionary?
I get that it would still get splits wrong sometimes, but just eliminating nonexistent words by splitting them further would probably be a great improvement.
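Something like the following Python sketch is what I have in mind. It is only an illustration under assumptions: the function name, the three-entry dictionary, and the longest-match strategy are mine, not anything LingQ is known to do. Each token from the splitter is kept if the dictionary contains it; otherwise it is cut into the longest dictionary entries it contains, falling back to single characters.

```python
def resplit(tokens, dictionary):
    """Break tokens that are not dictionary entries into shorter known pieces."""
    result = []
    for token in tokens:
        if token in dictionary or len(token) == 1:
            result.append(token)
            continue
        pos = 0
        while pos < len(token):
            # Try the longest remaining piece first; a single character
            # is always accepted so the loop is guaranteed to advance.
            for end in range(len(token), pos, -1):
                if token[pos:end] in dictionary or end == pos + 1:
                    result.append(token[pos:end])
                    pos = end
                    break
    return result


# Made-up dictionary and splitter output.
dictionary = {"運", "說", "運動"}
fixed = resplit(["運說", "運動"], dictionary)
```

It would not fix splits where both wrong pieces happen to be real words, but it would at least stop combinations that exist in no dictionary from being presented as vocabulary.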

Also (just as a quick fix), maybe if LingQ can’t find a split word in a dictionary, it’d be better not to color it blue? I find myself wasting a lot of time ignoring such weird combinations.

Another improvement would be to give us a tool to set word borders manually. At the moment, when selecting text I can only “stack” one or several words “made up” by LingQ, rather than any number of characters. We can always edit a sentence directly, but this is a rather slow way.

rafaelhmay · June 7, 2023, 12:24pm

There is a way to partially solve, or at least reduce, the wrong word splitting issue in LingQ. Before beginning a lesson, copy and paste the text into Chinese Text Annotation - MandarinSpot (free tool) and compare. After that it is up to you whether to correct the LingQ text or disregard that splitting. Hope this helps.

adisulistiyanto · October 10, 2024, 7:42am

Thank you Zoran. This would be much appreciated especially for such a popular language.