Refining parsing in spaceless languages (like Japanese) with AI

Since Japanese (like some other languages) doesn’t use spaces between words, the parser has to guess where words begin and end when you import a book into LingQ. It often splits them incorrectly, so when you read an imported book you have to edit many sentences to fix the word boundaries (I edit sentences every day!). I usually do this with ChatGPT: when in doubt, I ask it to break the sentence down into words.

@zoran, could you please consider whether it’s possible to implement something similar to the ‘Translate with AI’ function? For example, an option in LingQ to “re-parse” the import: an AI system would review the text and add spaces so that the words in each sentence are separated correctly.
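One safeguard such a “re-parse” step would probably want is a check that the AI only touched the spacing and did not rewrite any characters. Below is a minimal sketch of that check; the function name `onlySpacingChanged` is hypothetical and nothing here reflects how LingQ actually implements it.

```javascript
// Returns true when `segmented` equals `original` once all whitespace is removed,
// i.e. the re-parse only added, removed, or moved spaces and changed no characters.
// Note: \s in JavaScript also matches the ideographic space (U+3000).
function onlySpacingChanged(original, segmented) {
  const stripWhitespace = (s) => s.replace(/\s+/gu, "");
  return stripWhitespace(original) === stripWhitespace(segmented);
}
```

A re-segmented sentence that fails this check could simply be discarded in favor of the original split.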

The language I want to learn with LingQ is Japanese, but other languages that don’t use spaces (like Chinese) will likely have similar issues.

5 Likes

Thanks for your feedback and suggestion, we’ll see what can be done.

2 Likes

Great idea. Improving the parsing of Japanese and Chinese would make LingQ much more useful.

2 Likes

@zoran I believe a new option has been enabled to re-parse a lesson with AI, right? Yesterday I received a notification about this when I entered a new lesson. How does it work? From which menu option can I access this feature to re-parse old lessons? Very happy about this addition, I am looking forward to trying it out 🙂

1 Like

@srdurden

I tested this feature out a bit.

After creating a new lesson you get this popup

The lesson remains available while it is being optimized. This is a good idea, although if any words change, the LingQs already made on them would disappear.

There also seems to be a bug: the popup appears every time you import a new lesson, even if you have the reader setting checked.

There also does not appear to be a way in the UI to re-split an existing lesson with the new “ichimoe” AI technique. Code is below if anyone wants to re-split their old lessons in the meantime.

Code to re-split an existing lesson
// Re-split lesson 27842682 (replace the ID in the URL and referrer with your own lesson's).
await fetch("https://www.lingq.com/api/v3/ja/lessons/27842682/resplit/", {
    "credentials": "include", // send the browser's cookies for lingq.com with the request
    "headers": {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) Gecko/20100101 Firefox/122.0",
        "Accept": "application/json",
        "Accept-Language": "en-US,en;q=0.5",
        "X-LingQ-App": "Web/5.1.51",
        "HTTP_X_CSRFTOKEN": "...RC", // CSRF token, truncated here; use your own
        "Content-Type": "application/json",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin"
    },
    "referrer": "https://www.lingq.com/en/learn/ja/web/reader/27842682",
    "body": "{\"method\":\"ichimoe\"}", // selects the "ichimoe" splitting method
    "method": "POST",
    "mode": "cors"
});

I’m not sure which headers are required. Most likely HTTP_X_CSRFTOKEN, plus the usual LingQ cookies if you run the code from outside LingQ.
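For anyone with several old lessons to re-split, the call can be wrapped in a small helper. This is only a sketch: the endpoint path, the `{"method":"ichimoe"}` body, and the `HTTP_X_CSRFTOKEN` header name are taken from the snippet in this post, while the function names `buildResplitRequest` and `resplitLesson` are hypothetical and the reduced header set is an untested assumption.

```javascript
// Build the URL and fetch options for the resplit call.
// Assumption: only these three headers are needed (not verified against LingQ).
function buildResplitRequest(languageCode, lessonId, csrfToken) {
  const url = `https://www.lingq.com/api/v3/${languageCode}/lessons/${lessonId}/resplit/`;
  const options = {
    method: "POST",
    credentials: "include", // send the LingQ session cookies
    headers: {
      "Accept": "application/json",
      "Content-Type": "application/json",
      "HTTP_X_CSRFTOKEN": csrfToken, // header name copied verbatim from the snippet
    },
    body: JSON.stringify({ method: "ichimoe" }),
  };
  return { url, options };
}

// Hypothetical wrapper: sends the request and throws on a non-2xx response.
async function resplitLesson(languageCode, lessonId, csrfToken) {
  const { url, options } = buildResplitRequest(languageCode, lessonId, csrfToken);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Resplit failed with HTTP ${response.status}`);
  }
  return response.status;
}
```

From the browser console on a logged-in LingQ tab, it would be called as `await resplitLesson("ja", 27842682, token)`, where `token` is your own CSRF token.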

3 Likes

Can someone explain what algorithm or system LingQ uses to segment words in Japanese?
What is used by default for imports, and what is used for the “Japanese Word Split Optimizer”?
At first it was said that machine learning would be used, specifically ChatGPT, and the popup says “Optimize with AI.”
But now I read “ichimoe”, which points to https://ichi.moe/, which in turn appears to use ichiran (GitHub: tshatrov/ichiran, “Linguistic tools for texts in Japanese language”).
That doesn’t seem to be related to machine learning at all; it appears to use a more traditional lattice-based tokenization, like MeCab.
Thanks.
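To illustrate what “lattice-based” means here, below is a toy sketch: every dictionary word found in the sentence becomes an edge in a lattice, and dynamic programming picks the cheapest path through it. The dictionary and its costs are invented for the example, and only per-word costs are used (real tokenizers such as MeCab also score the connections between adjacent words). This is not how ichiran or LingQ actually implement segmentation.

```javascript
// Toy dictionary: surface form -> cost (lower is better). All values are invented.
const TOY_DICTIONARY = new Map([
  ["東京都", 3], ["東京", 2], ["京都", 2], ["東", 4], ["都", 3],
  ["に", 1], ["住む", 2], ["住", 4], ["む", 4],
]);

// Find the lowest-cost segmentation of `text` by dynamic programming over the lattice.
function segment(text, dictionary, unknownCost = 10) {
  const chars = Array.from(text);
  const n = chars.length;
  const best = new Array(n + 1).fill(Infinity); // best[i]: cheapest cost to reach position i
  const back = new Array(n + 1).fill(-1);       // back[i]: start of the last word ending at i
  best[0] = 0;
  for (let end = 1; end <= n; end++) {
    for (let start = 0; start < end; start++) {
      const word = chars.slice(start, end).join("");
      let cost;
      if (dictionary.has(word)) {
        cost = dictionary.get(word);
      } else if (end - start === 1) {
        cost = unknownCost; // fallback: an unknown single character is always allowed
      } else {
        continue; // multi-character strings must be dictionary words
      }
      if (best[start] + cost < best[end]) {
        best[end] = best[start] + cost;
        back[end] = start;
      }
    }
  }
  // Walk the back-pointers from the end of the text to recover the words.
  const words = [];
  for (let end = n; end > 0; end = back[end]) {
    words.unshift(chars.slice(back[end], end).join(""));
  }
  return words;
}

console.log(segment("東京都に住む", TOY_DICTIONARY).join(" ")); // prints "東京都 に 住む"
```

With these costs, the path 東京都 / に / 住む (total 6) beats 東京 / 都 / に / 住む (total 8) and 東 / 京都 / に / 住む (total 9), which is the kind of ambiguity a lattice tokenizer resolves without any machine-learned model at inference time.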

1 Like

This weekend I tried, for the first time, a Japanese lesson with the new AI-optimized word parsing. The parsing has improved significantly: I didn’t have to re-parse any sentences, and reading was much smoother. Super happy with this change!

@zoran are there plans to make this option more accessible? I want to optimize old lessons that I imported and opened in the past, and I believe the option only appears when you enter a lesson for the first time.

1 Like

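日本語では単語を漢字またはカタカナで表すことが多く、確かに単語毎のスペースは使いません。
ただローマ字表記した場合は、スペースを使うことがあります。
この文章もローマ字にしてみましょう。

(Translation: In Japanese, words are often written in kanji or katakana, and indeed spaces between words are not used. However, when text is written in romaji, spaces are sometimes used. Let’s put this text into romaji as well.)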

Nippongo dewa tango wo kanji mataha katakana de arawasu koto ga ooku, tashika ni tangogoto no space (loanwords are used as-is) ha tsukai masen.
Tada rome-ji hyouki shita baai ha, space wo tsukau koto ga arimasu.
Kono bunshou mo rome-ji hyouki ni shite mimashou.

1 Like

@waka_suke Thank you! But for it to work properly, LingQ needs spaces between words. When you read the text, the spaces are not visible.

2 Likes

I’ve found the web plug-in even less reliable for Japanese and the error messages rather unhelpful.

I was also tempted to mock up a better UI for this.

After working with LingQ for a few months to refresh my Japanese, I’m rather disheartened by its support for the language.

1 Like