Japanese Word Breaks

Japanese word breaking is a longstanding problem, largely due to the nature of the language itself. But I wanted to point out some examples that I'm seeing in virtually every lesson nowadays.

I ‘know’ about 53,000 words in Japanese according to my LingQ stats, so I really shouldn’t be coming across many new words. Yet I often have dozens of new words in a lesson, and at least half of them are not words at all, just nonsensical combinations of two or more words (including particles).

I love using LingQ and am pointing these problems out not to be critical but to help improve the usefulness of the system.

Here are some examples, grouped into three themes:

  1. Topic or subject particles being grouped together with the adjoining word. In the example below, は is a topic particle (sort of like saying “as for…”) and 語り手 is a noun meaning a/the narrator. This is a bit like saying “As for [something else] … the narrator…” in English and claiming it’s a single word.

  2. Adding the copula です and its variants like で and である to a word and claiming it’s one word. In the example below, the suggested definitions are shown (I already know these words). 機械的で is like saying “it’s mechanical [and…]” and again claiming it’s one word when it’s actually a phrase. 機械的で should be broken up as 機械的 で [space between]. Likewise with 無愛想で, which should be 無愛想 で.

  3. This is by far the most frustrating for me at the moment, as I come across multiple instances of it in every lesson I import these days. For some reason, LingQ assumes every instance of って to be part of a verb. This is understandable in the sense that verbs in the -te form often follow the って pattern, but in these examples って is acting as a standalone particle, used primarily to quote or cite something. The って here should be treated as a separate word on its own.
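To make the request concrete, here is a minimal sketch (not LingQ's actual code) of the kind of post-processing pass that would fix themes 2 and 3: detaching a trailing particle or copula form that the tokenizer has glued onto the preceding word. The particle list is a small illustrative assumption, not a complete inventory of Japanese particles.

```python
# Hypothetical post-splitting pass for mis-merged Japanese tokens.
# TRAILING_PARTICLES is an illustrative subset, not an exhaustive list.
TRAILING_PARTICLES = ("って", "では", "には", "は", "が", "で", "と", "に", "を")

def resplit(token: str) -> list[str]:
    """Split a mis-merged token like '機械的で' into ['機械的', 'で'].

    Only splits when something remains before the particle, so a
    standalone particle such as 'で' is left untouched.
    """
    # Try longer particles first so 'って' wins over 'て'-like suffixes.
    for p in sorted(TRAILING_PARTICLES, key=len, reverse=True):
        if token.endswith(p) and len(token) > len(p):
            return [token[: -len(p)], p]
    return [token]

# Examples from the post:
print(resplit("機械的で"))  # ['機械的', 'で']
print(resplit("無愛想で"))  # ['無愛想', 'で']
print(resplit("犬って"))    # ['犬', 'って'] (quotative って after a noun)
```

Note the limitation: a naive suffix rule like this would also wrongly split genuine -te form verbs such as 走って, which is exactly why real segmenters rely on dictionary lookup and part-of-speech context rather than surface patterns alone.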


Looks like the same issue as reported previously in this post: Latest update broke previous Japanese parsing
@zoran


@tcclo We pushed a fix for the mentioned issue. I will ask our team to look into it again.


For what it’s worth, this issue may be resolved. I re-split the text on some lessons I had imported weeks ago, and the splitting is very good, with every entry being either a single word or a collocation that appears the same way in an online dictionary. It may never be 100% perfect due to the complexity of the language, but it seems to be at least 99% now.


Yes, I am getting this issue more often now. Also, I used to be able to create new LingQs by highlighting to connect words that were broken apart. Unfortunately, I can’t do that any more.

For me this issue is random. Lately it’s been wrong, and then I have to re-split with AI to fix it, and sometimes I have to do that multiple times to get an accurate new word count. I suggested this a long time ago, but I think it would be perfect if all new lessons were just automatically re-split with AI upon importing, to save time.


I haven’t tried importing anything fresh lately - I have a backlog of unread lessons - but the re-split did wonders. Great idea to automatically embed the AI re-splitting in the import process.