Asian Languages Now Working, Library Updated

Keith,
I think that inserting spaces is related to a kind of morphological analysis of sentences, which is indispensable for finding out new words.

For anyone interested in Japanese word spacing (wakachigaki) or grammar, I think the site below would be a useful reference.

Keith, when content is uploaded, it is split by an algorithm that searches for the most likely word boundaries. It is obviously not perfect, but we need to split the text into words for all of our word-counting and LingQing functions to work. Can it be better? Certainly. But for now it is more or less doing the job. We will continue to try to improve this functionality when we have time. The splitting algorithm is not one we wrote ourselves; it was developed in Japan.

Mark, if it is splitting されていない into three words and ありません into three words, then it is more or less not doing the job, and the word counts are still completely unreliable for Japanese. And if it is splitting 宮城県 into two words, then how is it able to find my saved LingQ, where I saved it as a single word? Yet it does find it, and it does highlight it in yellow. I think it goes through my saved words, finds them in the text, and highlights them. After that, it inserts spaces into the displayed text. The original text is not modified; only the displayed text is.
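A minimal sketch of the order of operations described above (this is my own guess at the behavior, not LingQ's actual code): if saved terms are matched against the original, unspaced text before the display text is split, a saved multi-character term like 宮城県 can still be found and highlighted even though the splitter later breaks it into 宮城 and 県.

```python
# Hypothetical sketch: match saved terms against the RAW unspaced text,
# before any word-splitting is applied to the display text.

def find_saved_terms(raw_text, saved_terms):
    """Return sorted (start, end) character spans of saved terms in the raw text."""
    spans = []
    for term in saved_terms:
        start = raw_text.find(term)
        while start != -1:
            spans.append((start, start + len(term)))
            start = raw_text.find(term, start + 1)
    return sorted(spans)

raw = "宮城県は東北地方にあります"
# The saved term is found as one span, regardless of how the
# splitter will later segment the display text.
print(find_saved_terms(raw, ["宮城県"]))  # → [(0, 3)]
```

Highlighting would then be applied to those spans, which would explain why a saved multi-word term still shows up in yellow even when the displayed text is split differently.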

Learners cannot use the dictionary look-up for words that are not split correctly. For the benefit of others reading the discussion, let me use English as an example instead of Chinese characters. The current algorithm takes words like backpack, carport, and refinance and splits them into:
back
pack
car
port
re
finance
Unfortunately, the user will not be able to look up the original words, because the parts need to be together, not spaced apart, in order to be found in the dictionary. New learners won't even be able to recognize that those pieces should form a single word.
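This failure mode can be illustrated with a toy Python segmenter. The mini-dictionary below is entirely made up for the example (it is not LingQ's real dictionary): because it contains the parts but not the compounds, even a longest-match strategy produces exactly the splits listed above.

```python
# Toy illustration: a dictionary that lacks compound words will split
# them into their parts, exactly as described in the discussion.
DICTIONARY = {"back", "pack", "car", "port", "re", "finance"}

def naive_split(word, dictionary=DICTIONARY):
    """Greedy left-to-right longest-match segmentation."""
    parts, i = [], 0
    while i < len(word):
        # Try the longest dictionary match starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in dictionary:
                parts.append(word[i:j])
                i = j
                break
        else:
            parts.append(word[i])  # fall back to a single character
            i += 1
    return parts

for w in ("backpack", "carport", "refinance"):
    print(w, "->", naive_split(w))
# backpack -> ['back', 'pack']
# carport -> ['car', 'port']
# refinance -> ['re', 'finance']
```

Adding the compounds themselves to the dictionary would fix these particular cases, which is why dictionary coverage matters as much as the matching strategy.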

Keith, I noticed that now when I highlight several separated lexemes, they are LingQed as one word (without any spaces inside; yes, the text is shown with spaces, but the LingQ itself has no spaces). But it is rather tricky to convince LingQ to highlight what you actually want… Anyway, I am accustomed to using the Rikaichan plugin, and it is impossible to use it now with LingQ lessons. The only way is to open both the lesson page and the “for print” version, which is inconvenient :(( I do want LingQ to count words in some other way, without inserting spaces…

One of the main principles of LingQ is authentic content, but with spaces added, Japanese and Chinese content is no longer authentic, is it?

Keith, I understand the issue. I study Japanese on the system myself. I’m not sure what you want me to say. It is not perfect but it is pretty darn good. And, the odd word that is split strangely, I just don’t worry about. Sure, I would prefer that it all worked perfectly. But, at the moment, that is the best we can do. The words are split by an algorithm we didn’t write. The system treats your word 宮城県 as a phrase which is why it is highlighted.

肥 沃 is two words? No, it is one word.

Keith,

You are a fluent Japanese speaker, so a no-space approach may work better for you because the word boundaries are obvious to you. However, I can see that for many beginners (and in fact even for myself, whose Japanese is only at a low-intermediate level), a no-space approach could be highly confusing. Just imagine a beginner looking at a simple phrase like:

どうもありがとうございます

Without the spaces, it’s just a string of hiragana. I just flipped through the beginner books I have. The words in them are also separated by spaces. Perhaps the book publishers also realize that no-space text is difficult for beginners. As for myself, until a better algorithm comes along, I can live with the present situation.

Well, as a total beginner I can’t work with unspaced texts because my brain refuses to identify words that are not space delimited. I assume with practice I will no longer have this problem.

I haven’t found problems with dictionary lookup yet, I tend to look up word parts with Babylon a lot at this stage. Google translate seems to be able to cope with spaces in whole phrases or longer words. Often it comes up with a sensible translation, sometimes it comes up with gibberish.

Maybe intermediates would find the option to turn auto-spacing off helpful, maybe advanced students should have authentic unspaced texts. I gave up studying Japanese because of unspaced texts, and started again when the texts became spaced.

I didn’t realise LingQ was doing it automatically, I thought the content providers had manually changed all the lessons :wink:

I agree with Cakura. I have saved separated words several times and gotten LingQs without spaces, just as I’ve wanted.

The electronic dictionary Wenlin (which I think some of you use) has great Hanzi segmentation, and in the case of any ambiguities you can fix them manually (all candidate segmentations are shown until you choose which to keep).

If you want to print a text segmented by Wenlin, you have to edit out any vertical bars “|” as well.

Walkthrough here: http://people.cohums.ohio-state.edu/chan9/conc/wenlin_word-spacing.htm

Cantotango, of course it’s supposed to be just a string of hiragana characters. Japanese normally has Kanji, which breaks up the kana. Since kids’ books don’t have Kanji, they put in spaces. I believe the lessons on LingQ that are only Hiragana have spaces. If you need spaces, those are the lessons to use.

When people speak, do they speak with spaces? No. So when you read just read the hiragana and listen to what you are saying. You will understand if you know the words.

This whole LingQ idea is about using authentic content. We know Steve prefers authentic content. It’s sad if LingQ has to accommodate people who are not serious about what they want to learn.

I talked a long time ago about giving people options and letting the learner decide. Still, I find LingQ is not giving people options. I told you guys you have to code with these things in mind. It’s very difficult to go back later and try to add these extras in. The whole site has been redesigned since then, and still we don’t have any options. No option to turn spacing off. No option to turn highlighting on or off. Still not able to edit forum posts, nor do the forums have correct statistics. Many say 0 threads, yet lots of posts. How can that be? No threads, yet many posts. Japanese and Chinese are still in beta. No commitment to getting them out of beta. They remain a ‘when we have time’ priority. Why? Because they want to reinvent the wheel first.

Japanese and Chinese have been out (in beta) long enough. I think those languages deserve the priority to become as useful as the other languages are. I mean, they should not be second-class members.


Wow, look who’s talking! Let me read this a second time…

Keith, you seem to be so terribly annoyed that people at LingQ are not obeying the orders of Keith the Great. Look, not everything is perfect at LingQ. Still, for only $10 a month, I am getting great value for money. (As far as I can tell, you are still only a free member.) Please give Mark a break. He has to make do with a limited revenue stream. There are tons of things to do, and I don’t blame him if he has to prioritize those tasks. It might cause him more pain than it does you. And as for this issue of space vs. no-space, there are genuine differences of opinion. You might be right, but is there any need to throw such a tantrum?

Keith, I can see that you care about LingQ and really want it to succeed. Still, that doesn’t justify your talking down to everyone who disagrees with you.

Perhaps it will be possible (sometime in the future) to implement the ability to edit word boundaries…
e.g. a sentence was automatically divided this way: だから 、ロシア_の_水道_屋_さん_や_お_医者_さん_は_とても_失礼_だ_と_思い_ます
then somebody who takes the lesson divides it this way: だから 、ロシア_の_水道屋さん_や_お医者さん_は_とても_失礼_だ_と_思います
so it would then be shown this way to other users too (and they would still have the possibility to change the word boundaries).
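A minimal sketch of how such boundary editing might work, using a hypothetical data model of my own (not anything LingQ actually has): keep the original text untouched, store the automatic token list, and record user corrections as merges of adjacent tokens, so spaces remain purely a display option.

```python
# Hypothetical sketch: user corrections stored as merges of adjacent
# auto-generated tokens; the original text is never modified.

def merge_tokens(tokens, merges):
    """Merge runs of tokens. `merges` is a list of (start, end) index
    pairs over the token list, end exclusive, non-overlapping."""
    result, i = [], 0
    spans = sorted(merges)
    while i < len(tokens):
        if spans and spans[0][0] == i:
            start, end = spans.pop(0)
            result.append("".join(tokens[start:end]))
            i = end
        else:
            result.append(tokens[i])
            i += 1
    return result

auto = ["だから", "、", "ロシア", "の", "水道", "屋", "さん", "や",
        "お", "医者", "さん", "は", "とても", "失礼", "だ", "と",
        "思い", "ます"]
# A user merges 水道+屋+さん (tokens 4-6), お+医者+さん (8-10),
# and 思い+ます (16-17):
fixed = merge_tokens(auto, [(4, 7), (8, 11), (16, 18)])
print(" ".join(fixed))
```

Storing corrections this way would also let other users see the improved segmentation while still being able to adjust it further.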
And I really want to have the option to switch off spaces…

About inserting spaces into Japanese sentences

Do you think that the LingQ system should show you which words are new to you and how many words you know?
If your answer to this question is negative, the problem will be solved easily. In my opinion, it is one of the possible options.
The parts of speech (品詞分類) of Japanese are so different from English that finding new words requires different algorithms, or a kind of morphological analysis. I think that perfect automatic translation is not possible at the present level of development of machine translation technology. If that is the case, inserting spaces “automatically” into Japanese sentences cannot be perfect either.

Cantotango, I am not trying to talk down to anyone. But why did they redesign the system without thinking about giving us options? So here we have to debate whether spacing is good or not for Japanese and Chinese, instead of each of us being able to set an option in preferences. Look, there isn’t even a preferences panel. So either I am being ignored or brushed aside.

I may be a free member now, but I have been a paying member in the past. Yes, I paid for a beta system while you pay for a working one. You are able to recommend LingQ to all your friends because they all study English, but how can I recommend LingQ to those who study Japanese or Chinese? If the system does not come out of beta, then the Japanese and Chinese learners will not increase. I would really like to recommend LingQ to everyone.

Cakypa wants to turn off the spaces too. Every serious learner would. LingQ could be great for counting your known words and practicing reading. But it does not count correctly, and reading text with spaces, especially misplaced spaces, is off-putting.

YutakaM, this is not about machine translation. Nothing is being translated.

Obviously, a certain amount of specific coding or processing has to be done for Japanese and Chinese in order to insert spaces into the text. The other languages don’t need it. But it has been accepted that Japanese and Chinese will need this. So why not add the complete coding to give Japanese and Chinese languages the special treatment they need?

A solution will eventually be derived, so let’s get on with it and commit to having the best system we can.

Keith the Great has spoken.

Keith,

I can assure you that we have many issues in front of us which prevent us, with our limited resources, from doing better for Japanese and Chinese at the present time. Your frustration, and that of others, is noted.

I still feel that the Japanese and Chinese are better than they were.

@Keith

You imply that everyone who is learning Chinese or Japanese and does not agree with you is not a serious learner. I disagree with you.

I am a ‘serious learner’ of Chinese and I personally much appreciate the efforts of Steve, Mark, and the rest of the LingQ team to make the recent improvements to the system given the resources available to them, and the fact that, whether you like it or not, LingQ is unique in the functionalities it gives you to learn with, and that deserves a certain amount of respect.

Yes, the parsing of words does not always work correctly, but that’s part of the ‘osäkerhet’ (uncertainty) of the learning experience as Steve describes it, and also part of the discovery of the language.

A new learner of the language will not care about your pedantic fine attention to detail, he or she is just looking to soak up as much of the language as possible.

I just went and used the Japanese. I find it acceptable.

Keith, you have to realize that our revenue stream does not come close to covering our expenses. Our efforts and programming time are now going to be increasingly devoted to things that will attract more members, overall, not to fixing things that annoy power users of specific languages.

We have no choice.

Re: Word splitting algorithm for Chinese

I programmed an annotator on my site (sorry, it’s in German; see Täglich Chinesisch | Annotator | taeglich.chinesisch-trainer.de) that works quite simply against a dictionary:
(1) Take the first 10 characters and search the dictionary for such a word; if found, continue at (1) with the characters after that word.
(2) If not found, take the first 9 characters and search the dictionary for such a word; if found, continue at (1).
(3) If not found, take the first 8 characters and search the dictionary for such a word; if found, continue at (1).
…
(10) If not found, take the first character and search the dictionary for that character; if found, continue at (1).
(11) If nothing was found in the dictionary, proceed as if it had been found in step (10).
Repeat until the end of the text.

This works well in 98–99% of all cases.
LingQ could do Chinese word detection this way with CEDICT.
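The steps above can be sketched directly in Python. The mini-dictionary here is a stand-in for illustration; a real implementation would load the CC-CEDICT headwords instead.

```python
# Greedy longest-match segmentation, following the numbered steps above:
# try a window of up to 10 characters, shrink it until a dictionary hit,
# and fall back to emitting a single character when nothing matches.

DICT = {"我们", "喜欢", "学习", "中文"}  # stand-in mini-dictionary

def segment(text, dictionary, max_len=10):
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                words.append(candidate)  # steps (1)-(10): longest match wins
                i += size
                break
        else:
            words.append(text[i])        # step (11): emit the character anyway
            i += 1
    return words

print(segment("我们喜欢学习中文", DICT))  # → ['我们', '喜欢', '学习', '中文']
```

Because the match is greedy from the left, the quality of the result depends almost entirely on dictionary coverage, which fits the 98–99% figure quoted above for a large dictionary like CC-CEDICT.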

Ah, YOU’RE the man behind Täglich Chinesisch! I’ve been following that blog since somebody (Steve?) posted a link the other day. Thanks. Now, if only there were a site where we could get translations into English…