How does LingQ parse Chinese texts into words?

DavidMartin · March 17, 2022, 10:59am

Just wondering how you guys do this, because it’s really good!

Hagowingchun · March 17, 2022, 7:03pm

I don’t even study Chinese, but this sounds super interesting. How is it done?

bamboozled · March 17, 2022, 8:11pm

It is very good indeed, although I feel not all Chinese languages are equal on LingQ… In my experience the word splitting is a little worse in Traditional Chinese than in Simplified Chinese, and rather creative in Cantonese.
As for the algorithm, I suspect it works by means of a dictionary on the backend. Problems arise when a word doesn’t appear in the dictionary. Last year I struggled with hundreds of combinations involving “川普”: the algorithm split the name and joined 川 and 普 with the neighboring words. Other names like 习近平 were recognized in Simplified, but not in Traditional.
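To illustrate the kind of dictionary lookup suspected here, below is a minimal sketch that assumes a greedy "forward maximum matching" strategy. LingQ's actual method is not known; the function name `forward_max_match`, the `max_len` parameter, and the toy dictionary are all made up for this example. It shows how a name missing from the dictionary can get split and partly absorbed into a neighboring word.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation (hypothetical sketch, not LingQ's code)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary (assumption): the name 川普 is missing.
toy_dict = {"普通", "通过", "法案"}

without_name = forward_max_match("川普通过法案", toy_dict)
with_name = forward_max_match("川普通过法案", toy_dict | {"川普"})

print(without_name)  # ['川', '普通', '过', '法案']
print(with_name)     # ['川普', '通过', '法案']
```

Without the entry for 川普, the 普 is swallowed by the dictionary word 普通 and the rest of the sentence is cut in the wrong places. Adding the name to the dictionary fixes the whole line.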
LingQ5 has definitely changed things. Today I imported a book (权力的转移 by 阿尔文•托夫勒) and the algorithm presented “為控制全球而鬥爭” to me as a single LingQ-word, the whole thing. This is quite interesting, but I don’t hate it.

If you are curious how the results differ for the same text, you can check these out (this is the LingQ 4 algorithm):
Traditional:

Simplified:

These are just some of my observations; I’m just guessing, of course. An authoritative answer from Zoran / LingQ staff would certainly be appreciated.

Hagowingchun · March 18, 2022, 4:34am

If it uses a dictionary, wouldn’t it have a zillion errors, since Chinese has so many compound words? Or do Chinese dictionaries have all these combinations?

bamboozled · March 18, 2022, 6:10am

Well, in the end I don’t know. But I would hope that the dictionary contains multi-character words. And I don’t even think the compounds in Chinese are that bad in this regard. Look at:

Or:

Yesterday I forgot to explain “為控制全球而鬥爭”: these are Traditional Chinese characters that, for whatever reason, appear in a Simplified Chinese book (it happens, no big deal). But the fact that this blob becomes one LingQ-word leads me to believe that it isn’t covered by the (SC) dictionary. So instead of splitting it into its components, single- and multi-character words, the algorithm fails and returns its input, which happens to be everything between punctuation marks.
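The guessed failure mode can be sketched roughly as follows. This is only an illustration of the hypothesis above, not LingQ's code: the names `segment_chunk`, `segment_text`, and `PUNCTUATION`, and the tiny Simplified-only dictionary, are all assumptions. The text is cut at punctuation marks, each chunk is segmented against the dictionary, and a chunk that cannot be fully covered is returned unchanged.

```python
import re

# Hypothetical set of punctuation marks used as chunk boundaries.
PUNCTUATION = "，。！？；：、"


def segment_chunk(chunk, dictionary, max_len=4):
    """Segment one chunk; return it whole if any position has no dictionary match."""
    tokens, i = [], 0
    while i < len(chunk):
        for size in range(min(max_len, len(chunk) - i), 0, -1):
            candidate = chunk[i:i + size]
            if candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
        else:
            # Nothing in the dictionary starts here: give up and return the input.
            return [chunk]
    return tokens


def segment_text(text, dictionary):
    """Split on punctuation, then segment every non-empty chunk."""
    chunks = [c for c in re.split(f"[{PUNCTUATION}]", text) if c]
    return [token for c in chunks for token in segment_chunk(c, dictionary)]


# Toy Simplified Chinese dictionary (assumption); it knows 为 but not 為.
sc_dict = {"为", "控制", "全球", "而", "斗争"}

result = segment_text("为控制全球而斗争，為控制全球而鬥爭。", sc_dict)
print(result)
# ['为', '控制', '全球', '而', '斗争', '為控制全球而鬥爭']
```

The Simplified phrase is split into five words, while the Traditional version fails on its first character 為 and comes back as one blob.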