Japanese tokenization and spacing

I have been studying Japanese using LingQ for a while, and have noticed problems with the way LingQ tries to split Japanese text into “words”. First, it seems that spaces are added to the text between the words. This would be OK except that: (1) it does a pretty terrible job of separating words, and (2) even when I click on the “Show/Hide spacing” link, many of the spaces remain, which breaks the dictionary lookups. Japanese text doesn’t really have spaces, so I would expect that clicking “Show/Hide spacing” would remove all the spaces, allowing me to select an arbitrary string of characters to look up in the dictionaries and create a LingQ.

Tokenizing Japanese text into words is obviously a more difficult problem, but whatever algorithm LingQ is using is not very good. There are a number of free tools for tokenizing Japanese that might perform better. I have personally used MeCab on some projects with good results:

https://code.google.com/p/mecab/

Anyway, I think that fixing problem (2) with the spacing should be relatively straightforward and would lessen a lot of the pain of the poor Japanese tokenization.
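For what it’s worth, a fix for (2) could be as simple as deleting whitespace that sits between two Japanese characters while leaving spaces in embedded Latin text (like “LingQ”) alone. Here is a rough sketch in Python; the character ranges are my own assumption and cover only the common kana, kanji, and punctuation blocks, so a real fix would want a more complete set:

```python
import re

# One Japanese character: hiragana, katakana, the common CJK ideograph
# range, or the ideographic comma/full stop. This list is deliberately
# minimal; extend it for half-width katakana, extensions, etc.
JP = r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\u3001\u3002]"

def strip_japanese_spaces(text):
    """Remove whitespace wedged between two Japanese characters."""
    # Capture the leading Japanese character, require a Japanese character
    # after the whitespace via lookahead, and drop only the whitespace.
    return re.sub(rf"({JP})\s+(?={JP})", r"\1", text)

print(strip_japanese_spaces("日本 語の テキスト"))  # 日本語のテキスト
print(strip_japanese_spaces("LingQ is fun"))        # unchanged
```

Note that `\s` in Python also matches the full-width ideographic space (U+3000), so both kinds of stray space would be cleaned up in one pass.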

Cheers,
Ben

I’m pretty sure that when you click to hide spacing and spaces remain, those spaces are actually typed into the lesson that way. The space needs to be manually deleted. This is one of the things I find annoying about the library, because I seem to run into these random spaces more often than I would like. I can delete them myself for any of my imported lessons, so it isn’t so much of an issue as long as I’m importing.

As for the word splitter, most of the time I don’t find it to be too much of a hindrance. When it is a hindrance, it’s usually because it took a string of kanji that forms two or more words and split it in a way that produces non-words. For example, let’s say there are two compound words of two characters each. They should be split like so: 11 22. Sometimes, for some reason, they get split this way instead: 1 12 2. When that happens, I can’t LingQ either word unless I manually go into the document and add a space between the first and second word.
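The “1 12 2” failure described above is exactly what a naive greedy longest-match segmenter produces when its dictionary is missing the first compound: it falls back to a shorter match, then finds a dictionary word straddling the real word boundary. A minimal sketch, with Latin letters standing in for kanji and an entirely made-up dictionary (this is an illustration of the failure mode, not LingQ’s actual algorithm):

```python
def greedy_segment(text, dictionary):
    """Split text by always taking the longest dictionary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first.
        for length in range(len(text) - i, 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
        else:
            # No dictionary entry matches: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens

# Intended words are "AA" and "BB" (the "11 22" case).
print(greedy_segment("AABB", {"AA", "BB", "A", "AB", "B"}))  # ['AA', 'BB'] — correct
# If "AA" is missing from the dictionary, "AB" gets carved out across
# the word boundary — the "1 12 2" failure.
print(greedy_segment("AABB", {"A", "AB", "B"}))              # ['A', 'AB', 'B']
```

This is why tools like MeCab, which score whole segmentations using a statistical model rather than matching greedily, tend to handle these compound boundaries better.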

@gerdemb - Thank you for the feedback. Our current splitter works for most words. I wouldn’t say it does a terrible job. It isn’t perfect, but no word splitter will be. We do have future plans to enhance the word splitter on the site. I’m not sure if any of the tools above are compatible with our system, but when we get time to devote to this we will explore our options further and figure out what can be done to improve the current splitting algorithm.

@cgreen0038 I get spaces in my imported lessons even though the original text doesn’t have spaces. The spaces remain even after I click “Show/Hide spacing”.

@gerdemb - I have noticed that spaces sometimes get added when I import text. It doesn’t happen all the time to me, and to be honest, I haven’t paid enough attention to notice when it happens. I’ve always been able to go in and remove the spaces, so I think it only happens when I import from specific sites. If your experience is different from that, maybe you’ve found a new issue?

@cgreen0038 I too have noticed that it depends on what site I’m importing from. I don’t think I’ve ever gotten spaces when importing pure text (i.e., not HTML), which makes me suspect it has something to do with the HTML formatting of some sites. Anyway, although not particularly critical, this problem is annoying and doesn’t seem like it would be too difficult to fix.