Please implement a better Chinese text segmentation!

Nearly every Chinese text here at LingQ contains segmentation errors.

One example:


“我常” is not ONE Chinese word, but TWO words (“I” + “often”).

Is it really so difficult or so costly to implement robust Chinese text segmentation?

See also:
the Stack Overflow question “Is there any good open-source or freely available Chinese segmentation algorithm available?”
the GitHub project fxsjy/jieba: 结巴中文分词
https://nlp.stanford.edu/software/segmenter.shtml
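
To make the request concrete, here is a rough sketch of one classic dictionary-based approach, greedy forward maximum matching. It is not how LingQ or any of the tools linked above work; the word list `TOY_DICT`, the function name `segment`, and the `max_len` limit are made up for illustration, and a real segmenter would need a far larger dictionary and some handling of ambiguity.

```python
# Toy word list; a real dictionary would be far larger (illustrative assumption).
TOY_DICT = {"我", "常", "常常", "去", "图书馆", "看", "书"}


def segment(text, dictionary=TOY_DICT, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there.
    A single character is always accepted as a fallback.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


print(segment("我常去图书馆看书"))
```

Because “我常” is not in the word list, the sketch splits it into “我” and “常”, while “图书馆” stays together as one word.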

I’m glad to see other people experience the same thing. I was wondering if I was going crazy with the pinyin being wrong. I have seen one word where it just says ? instead of any pinyin.
Regarding the “text segmentation” - I agree, it often lumps arbitrary word combinations together as one “new word.” 一個 and 兩個 are each counted as a single “whole word”, but I would say each is really just two separate words next to each other.
I actually ran into one situation where it broke the text up as AB C DE, but the actual word/phrase was BCD. There was no way for me to make a LingQ out of it, because you can’t separate A from B or D from E.
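
The AB C DE case can be sketched as a boundary check. This assumes, as described above, that a selection can only start and end on token edges; the function names `token_boundaries` and `can_select` are hypothetical and not part of any LingQ API.

```python
def token_boundaries(tokens):
    """Return the character offsets at which a token starts or ends."""
    bounds = {0}
    pos = 0
    for token in tokens:
        pos += len(token)
        bounds.add(pos)
    return bounds


def can_select(tokens, start, end):
    """True if the character span [start, end) lines up with token edges."""
    bounds = token_boundaries(tokens)
    return start in bounds and end in bounds


tokens = ["AB", "C", "DE"]
print(can_select(tokens, 1, 4))  # the span "BCD" cuts through "AB" and "DE"
print(can_select(tokens, 2, 3))  # the span "C" sits exactly on token edges
```

With this segmentation, the span for BCD starts and ends inside a token, so it cannot be selected until the segmentation itself is fixed.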


I find this endlessly annoying, and it has been a big gripe of mine with LingQ over the years. Even when you uncheck “show spaces between characters”, it still shows smaller spaces between them. Why? This is pointless; real text is not like this. It also confuses its own system and creates words that shouldn’t be words, as you mention.

This is a serious problem I also encounter in almost every Chinese text here on LingQ. It is very annoying, especially when it combines a word that I already know with a word that is new to me.

Maybe the system can check with an online dictionary to see if what it is classifying as a “word” is actually one.
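
A minimal sketch of that idea, assuming the dictionary lookup can be modelled as membership in a set of known words. `KNOWN_WORDS` and `flag_suspect_words` are hypothetical names, and the small set stands in for a real (online) dictionary.

```python
# Stand-in for a dictionary lookup; a real system would query an actual dictionary.
KNOWN_WORDS = {"我", "常", "常常", "去", "图书馆"}


def flag_suspect_words(tokens, dictionary=KNOWN_WORDS):
    """Return multi-character tokens that the dictionary does not list."""
    return [t for t in tokens if len(t) > 1 and t not in dictionary]


# "我常" was produced by the segmenter but is not a dictionary word.
print(flag_suspect_words(["我常", "去", "图书馆"]))
```

Flagged tokens could then be split again or shown to the user for correction. Proper names and new vocabulary would also be flagged, so this is only a first filter.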

Whenever I see that characters are not spaced correctly I simply click on ‘Edit’ and fix the problem.
You can even edit lessons that are available to the whole community, not only privately imported ones.
Sure, it’s annoying and we all would like it to work properly, but for the time being that’s the best we can do.

I only use LingQ on iPad, and I can’t edit through the app.

@Yannick This seems to work. Not the best solution but better than nothing. Thanks for sharing this.

As if the tones and characters weren’t intimidating enough, this obstacle makes it all the more scary to learn Mandarin.

Thankfully, I have a few others to attack in the meantime.