I’m trying to read through articles and such just to feed the system with words I already know; however, the word lists are cluttered with duplicates of the same word with just a particle attached to the end.
So N+은/는/이/가/을/를/과/와/로/으로/에/에서/에서는/도/에서도/etc. are all currently counted as separate words. This is extremely annoying.
This tool would be perfect for me if the stemming algorithm were a bit smarter.
Someone else brought up verb stems and inflections. I understand that a word shouldn’t be marked known if it appears in a different inflection, since inflected forms aren’t easy to understand just from knowing the infinitive. Particles are different, however. In Japanese, you don’t count 食べ物は, 食べ物に, 食べ物で, 食べ物を, 食べ物も, 食べ物から as different words, so it shouldn’t work that way for Korean either.
Is there an algorithm that can distinguish between particles and characters that change word meanings? I hope that’s possible! But from what I’ve seen so far we’re not there yet.
Examples of word endings that often seem like particles:
－가 家 man, family, specialist, a person noted for
－과 課 ① business division; ② section, lesson, chapter
－과 科 department
－을 added to verb stems to express possibility, future tense, etc.
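To see why these endings are hard, here is a minimal sketch (my own illustration, not LingQ’s implementation) of naive longest-match particle stripping. The particle list and example words are assumptions for demonstration; the point is that the same ending can be a case particle on one noun and part of the word itself on another.

```python
# Common Korean case/topic particles (조사), tried longest first so that
# multi-syllable particles like 에서는 match before 에 or 는.
PARTICLES = sorted(
    ["은", "는", "이", "가", "을", "를", "과", "와", "로", "으로",
     "에", "에서", "에서는", "에서도", "도"],
    key=len, reverse=True,
)

def strip_particle(word: str) -> str:
    """Blindly remove the first matching particle from the end of `word`."""
    for p in PARTICLES:
        if word.endswith(p) and len(word) > len(p):
            return word[: -len(p)]
    return word

print(strip_particle("음식은"))  # 음식 — "food" + topic particle, stripped correctly
print(strip_particle("음악가"))  # 음악 — WRONG: 음악가 ("musician") is one word;
                                 # the 가 here is the suffix 家, not the subject particle
```

This is exactly the ambiguity the endings above create: without more context (or a lexicon), a stripper can’t tell derivational 가/과/을 from the homographic particles.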
Perhaps it’s because I use LingQ in a browser and hit the keyboard shortcut ‘K’ to mark words known, but particle differences don’t annoy me.
We’ll do our best to improve things in the upcoming updates.
@zoran Is there any update on this?
I’d really like to start using LingQ again, but not being able to strip at least the 조사 (particles) from words is a major blocker for me.
I don’t know what the blocker is for the team, but I’m pretty sure Korean word stemming is an (at least partly) solved problem.
Take a look at this for reference:
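For what it’s worth, even without a full morphological analyzer, a lexicon-aware refinement helps with the ambiguous endings discussed above. This is a sketch under my own assumptions (the particle list and the tiny lexicon are made up for illustration), not any particular tool’s algorithm: only strip an ending when the remainder is itself a known word, so whole words like 음악가 (“musician”) stay intact.

```python
# Particles (조사) sorted longest first so 에서도 is tried before 에 or 도.
PARTICLES = sorted(
    ["은", "는", "이", "가", "을", "를", "과", "와", "로", "으로",
     "에", "에서", "에서는", "에서도", "도"],
    key=len, reverse=True,
)

def lemmatize(word: str, lexicon: set) -> str:
    """Return the bare noun only if stripping a particle yields a known word."""
    if word in lexicon:  # already a dictionary form (e.g. 음악가)
        return word
    for p in PARTICLES:
        stem = word[: -len(p)]
        if word.endswith(p) and stem in lexicon:
            return stem
    return word  # unknown word: leave as-is rather than risk over-stripping

lexicon = {"음식", "음악가", "학교"}
print(lemmatize("음식은", lexicon))    # 음식
print(lemmatize("음악가", lexicon))    # 음악가 (kept whole)
print(lemmatize("학교에서도", lexicon))  # 학교
```

In LingQ’s case the user’s known-word list itself could serve as the lexicon; proper morphological analyzers (e.g. the ones bundled with KoNLPy, or MeCab with a Korean dictionary) go further and handle verb conjugation too.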
No updates here yet, sorry.