Strange spacing in Chinese text

I (and the native speakers that I work with) have noticed that most of the Chinese lessons I have studied on LingQ have strange spacing issues in the text.

I correct these in my own notepad when I discover them (a good mnemonic for ME), but it seems that this could stump some learners if they are not familiar with set phrases.

For example: “日子 过 得 真 快”. Shouldn’t it be “日子 过得 真快”?
I don’t think that 过 and 得 should have a space in between.
Also, when you mouse over “过 得” it only gives you the definition for the word “过”.

… It’slik e readingeng lish liketh is…

It’s a VERY old problem: see this thread … Adding Furigana To Japanese And Pinyin To Chinese - Langu...

and see also this snapshot: http://bit.ly/gNKTCF

@hape

Oh… Thanks for pointing that out. I should have done a search.

I think that when someone new starts learning Chinese on LingQ, this issue should be brought up in a disclaimer or something, so folks won’t learn the wrong meaning in the beginning.

It’s only a minor issue with me (corrections help me remember), but my Chinese friends are VERY critical of this problem when they see what I’m studying.

I must say that I find LingQ quite effective for Chinese and Japanese despite these minor issues.

@steve

I agree totally 1000%

… I have (as a paid subscriber) used many other online resources for learning online… The problem with those… is that I get burned out after passing the basic dialogs / lessons due to lack of diversity (same people reading over and over) … that’s a bummer when you study languages…

This is not a problem at LingQ

The spacing issue is not that big of a deal for me, as my speaking is better than my reading and finding those is like an exercise for me… I was just curious whether others know about the issue and what they (especially beginners) do about it…

Thanks for your understanding. We have a system for many languages. We rely on outside software to deal with the spacing in Asian languages. We may one day find a better solution, but we are not focusing on that right now.

Steve, you don’t need better software.
The system should just respect the space characters that have already been inserted to indicate word boundaries. At the moment the system ignores these spaces and makes its own wrong assumptions. Just have an option to do NO word boundary detection, and it’s done.
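
A minimal Python sketch of what such a "no word boundary detection" option could look like: the spaces already in the text are taken as the only word boundaries. The function name and the punctuation list are assumptions made up for illustration; this is not LingQ's actual code.

```python
import re

# Punctuation that should never be glued to a word (illustrative, not exhaustive).
PUNCTUATION = "，。？！、；：,.?!;:"

def words_from_spaces(text: str) -> list[str]:
    """Return the words exactly as the author spaced them, with no boundary detection."""
    # A word is any run of characters that is neither whitespace nor punctuation.
    pattern = rf"[^\s{re.escape(PUNCTUATION)}]+"
    return re.findall(pattern, text)

print(words_from_spaces("日子 过得 真快"))  # ['日子', '过得', '真快']
```

With this approach 过得 stays one word if the lesson author wrote it that way, and stays two words if the author wrote “过 得”.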

MisterB said:

For example: “日子 过 得 真 快”. Shouldn’t it be “日子 过得 真快”?

In fact the spacing between guo and de is correct. Guo is the verb, de is an adverbial modifier and strictly speaking not part of the verb.

@Hape,
I thought there was already an option to shut off the spacing…

Friedemann, the option only hides the spaces. That option has nothing to do with the word detection algorithm.

I always thought that 过得 was a contraction of 过得去 and should be defined as a set phrase.

I just asked two native speakers that I work with and they disagree with each other about this …haha… Chinese is such a fickle language…

I normally use Wenlin and their splitting algorithm…

This is a good example that we cannot simply divide up the characters blindly. It all depends on the context.

The automatic software can do the initial division, but then it has to be manually edited by someone who has some in-depth knowledge of the language (and this excludes many native speakers).

I have never used splitting software. Experience will ultimately tell you which characters belong together.

Hape, what do you mean by word detection then?

Chinese “Words” consist of 1 to n characters.
The problem is to split a stream of characters into words.

e.g.
服務員,給我一杯水。冷的還是熱的?冷的。

into (with " | " as separation indicator)
服務員,給 | 我 | 一 | 杯 | 水。冷 | 的 | 還是 | 熱 | 的?冷 | 的。

服務員 and 還是 are multi-character words.

The problem is that LingQ does it wrong.
What’s more, LingQ does not obey the spaces in the original text and makes its own false assumptions.

This text (input/edit view):
http://bit.ly/m7TLLB

results in this (lesson view):
http://bit.ly/jSOke5

LingQ does not obey the inserted spaces!
Although I told LingQ to treat 服務員 and 還是 as one word, the system does not recognize these popular words!!!
Even my own annotator, which I programmed some time ago, is able to do this (98-99% correct), and its word splitting is the most primitive kind (only a few lines of PHP code): Täglich Chinesisch | Annotator | taeglich.chinesisch-trainer.de
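
For readers wondering what a primitive word splitter can look like, here is a rough Python sketch of greedy longest-match splitting against a word list. It is not the PHP code of the annotator mentioned above, and the tiny lexicon, the `max_len` limit, and the function name are all assumptions chosen so that the example sentence works. A real word list would have tens of thousands of entries.

```python
# Tiny illustrative lexicon (an assumption for this example only).
LEXICON = {"服務員", "還是", "給", "我", "一", "杯", "水", "冷", "熱", "的"}
PUNCTUATION = set(",。?!，？！")

def segment(text, lexicon=LEXICON, max_len=4):
    """Split text into words by always taking the longest lexicon entry that fits."""
    words = []
    i = 0
    while i < len(text):
        if text[i] in PUNCTUATION or text[i].isspace():
            i += 1  # punctuation and spaces are boundaries, not words
            continue
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # An unknown single character is kept as a one-character word.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

print(" | ".join(segment("服務員,給我一杯水。冷的還是熱的?冷的。")))
```

Greedy matching like this keeps 服務員 and 還是 together, but it cannot settle context-dependent cases such as 过得, which is why manual spaces still need to be respected.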

Interesting…
I wonder how many other languages present this type of challenge.
I think that we would have a similar problem with Persian …

For example:

زمین = Ground
خوردن = To Eat
زمین خوردن = to stumble or fall (not to eat dirt… well… unless you REALLY mean to eat dirt)

and if you space the letters wrong it really gets messy…

ز م ی ن خ و ر د ن
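
A small Python sketch of how a dictionary lookup could prefer the multi-word expression over its parts, using the Persian example above. The glossary contents, the `max_words` limit, and the function name are hypothetical and only meant to illustrate the idea.

```python
# Hypothetical glossary keyed by tuples of space-separated words.
GLOSSARY = {
    ("زمین",): "ground",
    ("خوردن",): "to eat",
    ("زمین", "خوردن"): "to stumble or fall",
}

def lookup(tokens, glossary=GLOSSARY, max_words=3):
    """Look up tokens, preferring the longest expression found in the glossary."""
    results = []
    i = 0
    while i < len(tokens):
        # Try the longest run of words first, then shorter ones.
        for size in range(min(max_words, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + size])
            if key in glossary:
                results.append((" ".join(key), glossary[key]))
                i += size
                break
        else:
            # No entry at all: keep the token with no definition.
            results.append((tokens[i], None))
            i += 1
    return results

print(lookup("زمین خوردن".split()))
```

Here the two-word entry wins, so the learner sees "to stumble or fall" rather than "ground" followed by "to eat".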

@MisterB

That may be one reason that Arabic is not yet available @ LingQ

Another reason may be content availability.

Arabic is not much of a problem for me. The only issue is a few diacritics turn into boxes. I just delete them.

Could we have an option to turn automatic word detection off for a lesson?

I still see some problems with strange words being detected in my imports. Perhaps I could do a better job myself by selecting the words I want to LingQ. It would also be easier to LingQ sentences and expressions then.