I think grouping verbs together is fine when it involves regular compound verbs, but if the algorithm has been tweaked to make sure those are detected, then I believe it's now acting too liberally. Take a look at these tokens from the last few pages of my current lesson:
見せつけてやれよ
はっきりしていた
引き離されそうになった
ぶつかりそうになりながら
気づかないわけにはいかなかった
And look at this simple example: 巻き返してきた
It's currently tokenized as one word, but at some point you have to distinguish between a compound verb and a larger phrase built on top of one.
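To make the distinction concrete, here is a minimal sketch of the kind of post-processing I have in mind (purely illustrative, not LingQ's actual algorithm): peel a known auxiliary chain off the end of a token so that the lexical compound verb stays intact. The `AUXILIARIES` list and the `split_aux` function are my own invention, just to show the idea.

```python
# Illustrative only: split a token at a known auxiliary/subsidiary-verb
# boundary so the lexical compound verb is kept as its own unit.
# AUXILIARIES is a tiny hand-picked sample, not a real linguistic resource.
AUXILIARIES = ["てきた", "てやれよ", "そうになった", "そうになりながら"]

def split_aux(token: str) -> list[str]:
    """Peel the longest matching auxiliary chain off the end, if any."""
    for aux in sorted(AUXILIARIES, key=len, reverse=True):
        if token.endswith(aux) and len(token) > len(aux):
            return [token[: -len(aux)], aux]
    return [token]

print(split_aux("巻き返してきた"))          # → ['巻き返し', 'てきた']
print(split_aux("引き離されそうになった"))  # → ['引き離され', 'そうになった']
```

This way the dictionary lookup still lands on the compound verb (巻き返す) instead of on a one-off inflected chunk.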
I don't think it really makes sense for the current parser to tokenize patterns such as these as single units either:
noun+する, noun+だ
Both of those have many inflected forms, and if you multiply all of those forms by every possible noun you get a ridiculous number of “words” that don't really make sense as a unit, outside of perhaps some idiomatic expressions.
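As a rough back-of-the-envelope illustration (my own toy numbers, not anything measured from the parser), crossing just a handful of nouns with common する inflections already produces dozens of distinct surface strings, each of which would count as a separate "word":

```python
# Toy illustration of the combinatorial blow-up: 5 nouns × 10 inflections
# of する already yields 50 distinct surface strings, none of which is a
# useful dictionary unit on its own.
nouns = ["勉強", "練習", "説明", "心配", "約束"]
suru_forms = ["する", "した", "して", "しない", "しなかった",
              "される", "された", "すれば", "しよう", "している"]

tokens = {noun + form for noun in nouns for form in suru_forms}
print(len(tokens))  # → 50
```

With a real vocabulary of thousands of nouns, treating every such combination as its own word inflates the lexicon enormously for no benefit.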
Not to mention this can compound with the issue above, as with this token from the same lesson: 叫びっぱなしだ
So much for the verbs.
The other issues are with vocabulary and idiomatic-phrase detection. I could be wrong, but I remember the previous parser doing a much better job here. With the current model I frequently have to re-group things because they get split apart.
A simple example: the book I'm reading has many occurrences of the word 吸魂鬼, which used to be detected just fine. Now it sometimes appears as 吸魂+鬼 or 吸+魂鬼, both of which produce meaningless words (吸魂 and 魂鬼).
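One conceivable mitigation (a sketch under my own assumptions; `merge_known` and `KNOWN_WORDS` are hypothetical and not part of LingQ) would be to re-merge adjacent tokens whenever their concatenation matches a word the user has already saved:

```python
# Sketch: greedily re-merge adjacent tokens when their concatenation is a
# word already known to the user, so a bad split like 吸魂+鬼 is repaired.
KNOWN_WORDS = {"吸魂鬼"}  # e.g. the user's saved vocabulary (hypothetical)

def merge_known(tokens: list[str]) -> list[str]:
    out, i = [], 0
    while i < len(tokens):
        merged = False
        # Try the longest run of tokens starting at position i first.
        for j in range(len(tokens), i + 1, -1):
            candidate = "".join(tokens[i:j])
            if candidate in KNOWN_WORDS:
                out.append(candidate)
                i = j
                merged = True
                break
        if not merged:
            out.append(tokens[i])
            i += 1
    return out

print(merge_known(["吸魂", "鬼", "は"]))  # → ['吸魂鬼', 'は']
```

This wouldn't fix the underlying segmentation, but it would at least stop already-saved vocabulary from being broken up in new lessons.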
There is one more issue: I remember the previous model being able to parse works written with older spelling or unconventional grammar, at least to a better degree. I'm currently reading a modern book, which is where I've taken these examples from, but when I try to read older works the quality of the parsing seems to worsen considerably, something I felt wasn't an issue before.
Maybe someone else who is currently reading such works on LingQ can share their experience.
To summarize: since the re-introduction of the feature I haven't seen any full clause or phrase tokenized as a single unit, so some of the issues do seem to be fixed. However, the parsing of verbs and particles remains overzealous (which, as I said before, also hurts the quality of the suggestions in the vocabulary window), and vocabulary detection is still less accurate than it used to be.
Thank you for looking into this; hopefully other people can also come forward to describe the issues they're seeing.
Link to my current lesson: Login - LingQ