About conjugations treated as different words

cophnia · October 12, 2014, 9:44pm

I know this isn’t the first time a user has asked this, but I’ll try anyway.
I bought a one-month subscription because I like the overall features of LingQ more than any other piece of language learning software, but the fact that different forms of the same word are treated as different words makes it unusable for me. This is not a criticism: I understand the work that goes into it, and I appreciate it and all the content available almost for free. I’m writing because I thought of a way to add an option that would allow this behaviour without breaking anything.

I’ll talk about Japanese because that is the language I’m studying. I suppose you are using something like MeCab to split the text into words, and that every word is stored in a database table and identified by a unique number, as you can see if you open a lesson and look at the HTML code.

So why not expand that database table with another column containing something like this?

word | id | dictform_id
飲み | 000001 | 001
飲む | 000002 | 001
読み | 000003 | 002
読む | 000004 | 002

with the number in the third column being the same for all the word forms that belong to a single dictionary word?

So when a user makes a new LingQ, he could choose “make a LingQ for every other form of this word”. If this option is selected, the software looks at the dictform_id of the word for which the user is making the new LingQ, searches for all other words with the same dictform_id, and makes a LingQ for those words too.
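A minimal sketch of that lookup, using Python's built-in sqlite3 with an in-memory database. The table and column names mirror the example above; the `lingqs` table and the `create_lingq` helper are hypothetical, since the real LingQ schema is not known.

```python
import sqlite3

# Hypothetical schema mirroring the example table; not LingQ's real layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE words (word TEXT, id INTEGER PRIMARY KEY, dictform_id INTEGER)"
)
conn.executemany(
    "INSERT INTO words VALUES (?, ?, ?)",
    [("飲み", 1, 1), ("飲む", 2, 1), ("読み", 3, 2), ("読む", 4, 2)],
)
# One row per word the user has made a LingQ for (hypothetical table).
conn.execute("CREATE TABLE lingqs (word_id INTEGER PRIMARY KEY)")


def create_lingq(word, all_forms=False):
    """Create a LingQ for `word`; if all_forms is True, also create one for
    every word that shares its dictform_id. Returns the affected words."""
    row = conn.execute(
        "SELECT id, dictform_id FROM words WHERE word = ?", (word,)
    ).fetchone()
    if row is None:
        raise KeyError(word)
    word_id, dictform_id = row
    if all_forms:
        targets = conn.execute(
            "SELECT id, word FROM words WHERE dictform_id = ? ORDER BY id",
            (dictform_id,),
        ).fetchall()
    else:
        targets = [(word_id, word)]
    # INSERT OR IGNORE keeps LingQs that already exist untouched.
    conn.executemany(
        "INSERT OR IGNORE INTO lingqs VALUES (?)", [(i,) for i, _ in targets]
    )
    return [w for _, w in targets]
```

Calling `create_lingq("飲み", all_forms=True)` would create LingQs for both 飲み and 飲む, while the default call only touches the word itself, so existing behaviour stays the same unless the user opts in.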

I think MeCab can give the dictionary form of each word (a kind of “deconjugation”), so starting from all the different forms of the same word you could obtain the dictionary form and tag every word with the same id when the dictionary forms coincide.
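As a rough illustration of that tagging step, the sketch below assigns one id per dictionary form. It assumes the (surface form, dictionary form) pairs have already been obtained from the analyzer; the function name and the input pairs are made up for the example.

```python
def assign_dictform_ids(pairs):
    """Map each surface form to an id shared by all forms that have the
    same dictionary form. `pairs` is an iterable of (surface, base_form)."""
    ids_by_base = {}   # dictionary form -> dictform_id
    result = {}        # surface form -> dictform_id
    for surface, base in pairs:
        # The first time a dictionary form is seen it gets the next free id.
        dictform_id = ids_by_base.setdefault(base, len(ids_by_base) + 1)
        result[surface] = dictform_id
    return result


# Hypothetical analyzer output for the four words in the example table.
pairs = [("飲み", "飲む"), ("飲む", "飲む"), ("読み", "読む"), ("読む", "読む")]
ids = assign_dictform_ids(pairs)
```

One caveat: the same surface form can belong to more than one dictionary word depending on context, and this sketch simply keeps the last id it sees for such a form.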

This is only a suggestion, and I don’t even know if LingQ works like this behind the scenes, but I’m sure this (apparently) simple feature would bring in more subscriptions. Just from talking about it in forums and the like, I know many people who would subscribe but don’t for this reason. This is a shame… Just in my family, my mother and my sister would subscribe too but are reluctant because of this.

I know it is a lot to expect this to be done, but I have great esteem for Mr. Steve and all the staff at LingQ, so I thought I would at least try to suggest it.

NB: This wouldn’t change the way words are counted or any other LingQ feature. It would be just as if you had encountered those words separately. I’m sure that (at least in Japanese) a user who encounters a new word would also tag all the other forms of that word if given the possibility. For now he can only tag them as he happens to encounter them while reading. So maybe he tags 読み as a new word and eventually learns it, and then three months later he encounters a form of that word he has never seen before, which is shown as a new word even though he obviously knows it. That isn’t a problem with a single word, but multiplied across many words, as would happen in a real situation, it defeats the purpose. Or, I repeat, at least for Japanese, where conjugations are so regular that it makes little sense to treat them as words in their own right.

I know the concept of a word is itself blurry, so it could be useful to treat two forms of a word as different words (like “vivere” and “vissuto”, or “leggendo” and “lessi”, in Italian), because knowing one form does not mean you automatically know the others. This could also be true in Japanese, if only the words were split consistently. But “飲みます” and “飲みません” are split in a way that lets you tag the stem “飲み” (so, a single LingQ for both forms), while “飲む” is treated as a different word from “飲み”. Tagging forms as different words would be fine as long as this behaviour were consistent, which it is not in Japanese, for obvious reasons… So you might as well settle things once and for all and let users tag all the forms in one go.

If that is too difficult or impossible to do programmatically, just show a popup listing all the words that contain only that kanji (or only that kanji compound, if it is a compound) and let the user choose which of them to create a LingQ for. It only requires a search feature, which you could do with a regular expression that restricts the results to words with exactly the same kanji (or exactly the same compound), so not a “contains” search. Please just take this into consideration.
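The popup fallback could be sketched roughly like this. It compares only the kanji in each word, so 飲み matches 飲む but not 飲み物. The function names and the vocabulary list are hypothetical, and the character range covers only the basic CJK Unified Ideographs block, which is an assumption that leaves out rarer kanji and the repetition mark 々.

```python
import re

# Basic CJK Unified Ideographs block; an assumption that covers common kanji.
KANJI = re.compile(r"[\u4e00-\u9fff]+")


def kanji_part(word):
    """Return the kanji of `word` in order, with kana stripped out."""
    return "".join(KANJI.findall(word))


def candidates(word, vocabulary):
    """Words from `vocabulary` whose kanji are exactly those of `word`
    (an exact match on the kanji, not a 'contains' search)."""
    target = kanji_part(word)
    if not target:
        return []  # kana-only words have nothing to match on
    return [w for w in vocabulary if w != word and kanji_part(w) == target]


# Hypothetical word list standing in for the words known to the system.
vocabulary = ["飲む", "飲み", "飲み物", "読む", "飲ん"]
popup_words = candidates("飲み", vocabulary)
```

The list would still need a human check before LingQs are created, since unrelated words can share the same kanji, which is why the suggestion above leaves the final choice to the user.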

PS: maybe this is very difficult to do and I’ve underestimated the work needed. I can only imagine how much work you all have, so I hope this doesn’t come across as a lack of respect. On the contrary, it is only an affectionate concern.

nobody · October 13, 2014, 4:32am

Thanks for your feedback! This is something that we would like to do at some point, though it is a bit tricky and we’d have to apply a slightly different method for each language. While we don’t have immediate plans to do this, it is something we will hopefully get to in the future and will keep your feedback in mind when we do get around to looking at this in more detail!

cophnia · October 13, 2014, 9:55am

Alex, thank you very much for your reply! I know you are pretty busy, so I won’t take up any more of your time. I’ll only add that there is a piece of software called “rikaisama”, which maybe you know, that has a “deconjugation” feature. It is open source, and I know other developers look at its code just for that feature (for things like glossing software and the like), so maybe it’s good to know there is something like that out there to look at if needed… just in case.
I wish you good work and thanks again for your interest!

nobody · October 14, 2014, 4:46pm

Awesome, thanks for the additional tips on this. When it comes time to look into this we’ll check into this software too!

cophnia · October 14, 2014, 7:57pm

Man, sorry for replying yet again ._. I’ve tried to do something similar with NMeCab for .NET (C#): I parse a text with MeCab using the “wakati” option just to add spaces between words, and now that I have a text with words separated by spaces, I do a second pass with MeCab using the “lattice” option to obtain the “base form” of each word. So if the text contains something like:

洗います洗わないでしょう

it breaks the words like:

洗い ます 洗わ ない でしょ う

and in the second pass with the “lattice” option I obtain this:

洗い	動詞,自立,*,*,五段・ワ行促音便,連用形,§洗う§,アライ,アライ
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
洗わ	動詞,自立,*,*,五段・ワ行促音便,未然形,§洗う§,アラワ,アラワ
ない	助動詞,*,*,*,特殊・ナイ,基本形,ない,ナイ,ナイ
でしょ	助動詞,*,*,*,特殊・デス,未然形,です,デショ,デショ
う	助動詞,*,*,*,不変化型,基本形,う,ウ,ウ

and I use the field I’ve marked here with § § (the base form) to tell that they are different forms of the same word (洗う).
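The grouping step could look roughly like the sketch below, written in Python rather than C# purely for illustration. It assumes the usual MeCab/IPADIC line layout, that is the surface form, a tab, then comma-separated features with the base form in the seventh field. The sample text is the output above without the § markers, and `group_by_base` is a made-up helper name.

```python
from collections import defaultdict

# Sample analyzer output in the assumed "surface<TAB>features" layout.
SAMPLE = "\n".join([
    "洗い\t動詞,自立,*,*,五段・ワ行促音便,連用形,洗う,アライ,アライ",
    "ます\t助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス",
    "洗わ\t動詞,自立,*,*,五段・ワ行促音便,未然形,洗う,アラワ,アラワ",
    "ない\t助動詞,*,*,*,特殊・ナイ,基本形,ない,ナイ,ナイ",
    "でしょ\t助動詞,*,*,*,特殊・デス,未然形,です,デショ,デショ",
    "う\t助動詞,*,*,*,不変化型,基本形,う,ウ,ウ",
    "EOS",
])


def group_by_base(mecab_output):
    """Group surface forms by base form (seventh feature field)."""
    groups = defaultdict(list)
    for line in mecab_output.splitlines():
        if line == "EOS" or "\t" not in line:
            continue  # skip the end-of-sentence marker and blank lines
        surface, features = line.split("\t", 1)
        fields = features.split(",")
        # Fall back to the surface form when no base form is given.
        base = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        if surface not in groups[base]:
            groups[base].append(surface)
    return dict(groups)


groups = group_by_base(SAMPLE)
```

Here 洗い and 洗わ end up in the same group under 洗う, which is the grouping the shared id would be based on.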

Unfortunately this doesn’t work for some verb forms, i.e. those with okurigana “いた…”, because MeCab splits them the wrong way:

洗いたくない洗いたくなかった洗いたくなくて

with “洗” as the first morpheme instead of “洗い”. So in the second pass with the lattice option, it gives the base form for “洗”, and you end up with some forms of 洗う that appear not to belong to the same word as its other forms.

洗	名詞,一般,*,*,*,*,洗,アライ,アライ
いたく	形容詞,自立,*,*,形容詞・アウオ段,連用テ接続,いたい,イタク,イタク
ない	助動詞,*,*,*,特殊・ナイ,基本形,ない,ナイ,ナイ
洗	名詞,一般,*,*,*,*,洗,アライ,アライ
いたく	形容詞,自立,*,*,形容詞・アウオ段,連用テ接続,いたい,イタク,イタク
なかっ	助動詞,*,*,*,特殊・ナイ,連用タ接続,ない,ナカッ,ナカッ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
洗	名詞,一般,*,*,*,*,洗,アライ,アライ
いたく	形容詞,自立,*,*,形容詞・アウオ段,連用テ接続,いたい,イタク,イタク
なく	助動詞,*,*,*,特殊・ナイ,連用テ接続,ない,ナク,ナク
て	助詞,接続助詞,*,*,*,*,て,テ,テ

I’m writing this just to say that it happens because MeCab apparently prioritizes “いたく” as a possible token. The only solution I’ve found is to remove “いたく” (and the other endings for which the same thing happens) from the ipadic dictionary and then recompile it (“Recompile UTF8 dictionary”).
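For anyone scripting that removal instead of editing the files by hand, here is a rough Python sketch. It assumes the dictionary source is CSV with the surface form in the first column, as in the IPADIC source files. The two sample rows are illustrative only: their numeric id and cost columns are made up, and `drop_surfaces` is a hypothetical helper. Recompiling the dictionary is still a separate step afterwards.

```python
import csv
import io


def drop_surfaces(csv_text, unwanted):
    """Return `csv_text` without the rows whose first column (the surface
    form) is in `unwanted`."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(csv_text)):
        if row and row[0] in unwanted:
            continue  # skip the entry that causes the bad split
        writer.writerow(row)
    return out.getvalue()


# Illustrative rows only; the numeric columns are invented for the example.
SOURCE = (
    "いたく,1,1,1000,形容詞,自立,*,*,形容詞・アウオ段,連用テ接続,いたい,イタク,イタク\n"
    "あかく,1,1,1000,形容詞,自立,*,*,形容詞・アウオ段,連用テ接続,あかい,アカク,アカク\n"
)
cleaned = drop_surfaces(SOURCE, {"いたく"})
```

The trade-off of this workaround is that the analyzer will then no longer recognize a genuine いたく written in hiragana either.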

I know you are all professional programmers. To be honest, I know almost nothing about programming, so I went by trial and error. But since I lost almost a day finding a solution, I thought that if you decide to add this much-discussed option and run into the same problem, knowing from the start that this is a possible solution might save you some time.

(Something similar happens with the potential form, which is treated as a word in its own right, but I’ve found no solution to this.)

PS: don’t feel obliged to reply, I don’t want to take up your precious time. I promise this is the last time I reply, and I’ll be on standby waiting for this feature I’m hoping for so much.

I wish you good work, and sorry to have bothered you again!

nobody · October 14, 2014, 10:23pm

Thanks again! I’m not a programmer myself, but I’m sure this will come in handy when it’s time for the dev team to look into this further. We’ve noted this thread and will see if any of this helps simplify things when we get onto this project