Parsing in Japanese

I recently started using LingQ in earnest. Overall I think the idea is great, and there are a lot of resources.
I’ve lived in Japan for 10 years, and I consider myself “fluent”, but I still run into words I don’t know every day. So I like the idea of finding and learning those words by reading, for example by copying and pasting news articles from the Net. LingQ automates that entire process, so that’s great.

The problem I have is that LingQ’s parser seems to have difficulty finding the boundaries of words, or even phrases. For example, an article I uploaded recently contains the word 容疑者 (yougisha), or “suspect”, which can be broken down into 容疑 (yougi), “suspicion”, and the suffix 者 (sha), which indicates a person. The parser breaks these apart, whereas obviously it is important to learn the whole compound word.
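To make the compound-word problem concrete, here is a minimal sketch of one possible workaround: after the parser has split the text, re-join adjacent tokens whenever their concatenation is a dictionary headword, trying the longest span first. The function name `merge_by_lexicon`, the `max_span` limit and the tiny lexicon are all made up for illustration; this is not how LingQ or MeCab actually work.

```python
from typing import List, Set


def merge_by_lexicon(tokens: List[str], lexicon: Set[str], max_span: int = 4) -> List[str]:
    """Re-join adjacent tokens whose concatenation is a known headword.

    Hypothetical helper: `lexicon` stands in for a real dictionary's
    headword list, and `max_span` caps how many tokens may be joined.
    """
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so a full compound wins over its parts.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in lexicon:
                merged.append(candidate)
                i += span
                break
        else:
            # No multi-token headword starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Hand-written tokens imitating an over-split parse (not real parser output).
tokens = ["容疑", "者", "を", "逮捕"]
lexicon = {"容疑者"}  # assumed mini lexicon for the example
print(merge_by_lexicon(tokens, lexicon))  # ['容疑者', 'を', '逮捕']
```

The obvious trade-off is that this only helps for compounds that are actually listed as headwords, and it will also merge tokens in the rare cases where the split was the correct reading.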

The same happens even with words like せず (sezu), which means something like “not doing”, where “se” is the stem and “zu” is the inflection. LingQ’s parser breaks it down into the stem and the inflection, which is really hard to understand, rather than leaving it as one word.
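For the stem-plus-inflection case, a rough sketch of another approach is to glue certain token types back onto whatever precedes them, based on the part-of-speech label the parser assigns. Everything below is an assumption for illustration: the tokens and labels are hand-written and only loosely modelled on the kind of tags a morphological analyser produces, and `merge_attached` is a hypothetical helper, not part of any real tool.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, coarse part-of-speech label)

# Assumed labels for tokens that should attach to the preceding token
# instead of being shown to the learner on their own.
ATTACH_TO_PREVIOUS = {"noun-suffix", "auxiliary-verb"}


def merge_attached(tokens: List[Token]) -> List[str]:
    """Glue suffixes and inflectional endings onto the token before them."""
    merged: List[str] = []
    for surface, pos in tokens:
        if merged and pos in ATTACH_TO_PREVIOUS:
            merged[-1] += surface  # extend the previous unit
        else:
            merged.append(surface)  # start a new unit
    return merged


# Hand-written example tokens (not real parser output).
tokens = [
    ("容疑", "noun"), ("者", "noun-suffix"), ("は", "particle"),
    ("反省", "noun"), ("せ", "verb"), ("ず", "auxiliary-verb"),
]
print(merge_attached(tokens))  # ['容疑者', 'は', '反省', 'せず']
```

A rule this blunt would probably over-merge in real text (long chains of auxiliaries, for instance), so it is meant only to show the idea of keeping the stem and its ending together as one unit.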

Since Japanese is written without spaces between words except for special audiences (children or foreigners who are just learning to read), I certainly don’t recommend having users put in those spaces artificially.

Is this a work in progress, or is the current status the final product, so to speak? While overall I think LingQ is really top notch – especially since I want to improve my vocabulary through reading and review – having an improved parser would improve the quality of the entire experience immensely.


You can hide the spaces.

About splitting algorithms


Thanks for directing me to the other threads. Actually I saw Steve’s response to “the French man’s complaint that LingQ is not free” video on YouTube the other day. I must admit that I am not a paying customer yet, but even at $10/month the system does so much more than other Online language learning Websites do, and I think it really is a great way to learn vocabulary.

My comment was not so much about the existence or absence of spaces between words as about the fact that when I set a LingQ, it only gives me part of a compound word. Certainly the entire compound words exist in EDICT (I assume that is the dictionary that is used). You pointed out that LingQ uses MeCab, and my sources tell me that it is currently the best parser, and therefore preferable to older ones like CaboCha and ChaSen.

Unfortunately I don’t have any technical insights that can help you with this improvement of the parsing algorithm. I did notice however that I can manually select a larger phrase from the text than what the parser identifies as a unit. For example 先生が私の遅刻を大目で見てくれた。 I could highlight 大目で見て “oomedemite” (=overlook, but in a conjugated form). That works for me because I know the phrase, but if I didn’t, I probably wouldn’t find it…

In the academic literature on CALL there is a debate between teacher (or in LingQ’s case tutor?) intervention and, as I did with my article, completely free reader selection with automated glossing. As an alternative, if tutors upload and prepare texts, can they manually set the boundaries for words/phrases, or at least modify the ones that are generated automatically? I haven’t tried to be a tutor, so I don’t know if this is already possible. If so, great and sorry for the dumb question.

In any case, thanks again for such a great site, and I only hope it gets better and better.

@karlt - I’m glad you find the site useful. Any machine parsing of Japanese texts has limitations. As you say, MeCab seems to be the best out there. We will continuously look for ways to improve it, but that is not something we are focused on at the moment. I study Japanese on the site and it works very well. Yes, there are some compound words that are broken up, and other strange splits occasionally, but for the most part it works very well. As you say, you can always select the compound word you want after the fact.

I should add that, when we have the time, we would like to improve how the site performs for each language, and each language has its particularities. We are adding Korean and that will probably have its special problems, some of which LingQ does not deal with in a perfect way.

Eventually we hope to have the resources to optimize for each language. Suggestions are welcome but we may not be able to act on them right away.

The idea that tutors could prepare the texts is a good one, but we rely on the spontaneous contribution of content from members in 10 languages, so this may be difficult to do. Eventually we will enable the exchanging of LingQ lists, and therefore if a knowledgeable learner has already created LingQs for an item, this could be imported and applied by another learner. But all this is just in the plans; we don’t know when we will be able to get around to it.
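As a rough illustration of what applying an imported list might look like, the sketch below scans a text for terms another learner has already saved and marks the longest matching term at each position. The list format, the function name `apply_shared_lingqs` and the sample data are all assumptions made for the example, not a description of how the site stores or would exchange LingQs.

```python
from typing import List, Tuple


def apply_shared_lingqs(text: str, shared_terms: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, term) spans for shared terms found in the text.

    Hypothetical helper: at each position the longest matching term wins,
    so a saved compound is preferred over a shorter term inside it.
    """
    # Longest terms first, so the first hit at a position is the longest one.
    terms = sorted(set(shared_terms), key=len, reverse=True)
    spans: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(text):
        for term in terms:
            if term and text.startswith(term, i):
                spans.append((i, i + len(term), term))
                i += len(term)
                break
        else:
            i += 1  # nothing saved starts here; move on one character
    return spans


# Made-up sample data for the example.
text = "容疑者は容疑を否認した。"
shared = ["容疑", "容疑者", "否認"]
print(apply_shared_lingqs(text, shared))
# [(0, 3, '容疑者'), (4, 6, '容疑'), (7, 9, '否認')]
```

Because the matching works on the raw text rather than on the parser's tokens, a compound saved by one learner would be highlighted for the next learner even where the automatic split cuts it in two.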

Thanks again for the great Website. Actually, after I made the post, I tried to see if any of the other Websites like and Reading Tutor do a better job of parsing, but they have the same problems – and they DON’T let you highlight your own word manually. So in retrospect, the way LingQ does it is probably better than the others. The idea of sharing LingQs that Steve mentioned sounds good too – if one (advanced?) user goes through and does the improvement work, the idea of leveraging that to other users intuitively makes sense, and my guess is that the benefits of such functionality would not be limited to Japanese.

I don’t expect it will be any time soon, but if I run into a better mousetrap for parsing Japanese that is optimized for Incidental Vocabulary Learning through Extensive Reading, I’ll let you know!