Is it possible to see new words as a percentage of total words?

njb44 · January 25, 2022, 7:31pm

When I look at new material to read, LingQ presents the number of new words as a percentage of the total unique words. While this is interesting and can be helpful, I realized that when material frequently introduces new tokens that are rarely reused, this tends to inflate the estimated difficulty. With the unique-word percentage alone, it’s ambiguous whether the, say, 20% of unknown words each appear once (and make up only, say, 5% of the total words) or appear many times each (and make up, say, 50% of the total words), as may be the case if you’re still learning the top-frequency words like “and”, “or”, etc.

On the surface this seems straightforward to calculate from a tokenized text: sum(new_tokens_appearance_count)/total_word_count. That said, I have no idea whether the application has convenient access to the number of times each token appears when it summarizes the new word count, so it could well involve major changes to implement (i.e. no worries if this is a high-effort ask that isn’t worth doing, assuming it’s not already available).
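To make the difference between the two measures concrete, here is a minimal Python sketch. It assumes the text is already tokenized and that a set of known words is available; the names `new_word_percentages`, `tokens`, and `known_words` are illustrative and say nothing about how LingQ works internally.

```python
from collections import Counter


def new_word_percentages(tokens, known_words):
    """Return (new words as % of unique words, new words as % of total words).

    `tokens` is a list of word tokens and `known_words` is a set of words
    the reader already knows. Both are hypothetical inputs for illustration.
    """
    counts = Counter(tokens)
    if not counts:
        return 0.0, 0.0

    # Appearance count for every token that is not yet known.
    new_counts = {word: n for word, n in counts.items() if word not in known_words}

    # Share of distinct words that are new (the unique-word measure).
    unique_pct = 100 * len(new_counts) / len(counts)
    # Share of all word occurrences that are new:
    # sum(new_tokens_appearance_count) / total_word_count.
    total_pct = 100 * sum(new_counts.values()) / sum(counts.values())
    return unique_pct, total_pct


tokens = "the cat and the dog and the bird saw the axolotl".split()
known = {"the", "and", "cat", "dog", "saw"}
unique_pct, total_pct = new_word_percentages(tokens, known)
print(f"{unique_pct:.1f}% of unique words, {total_pct:.1f}% of total words")
# prints "28.6% of unique words, 18.2% of total words"
```

In this toy sentence the two unknown words each appear once, so they are 2 of 7 unique words but only 2 of 11 total words.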

As a user, I would personally find this feature useful: I’ve found myself opening red/“difficult” reads that turn out to be anywhere from very light to very heavy on how often unknown words actually appear, leaving me unsure what the experience will be until I’m well underway. As an example, I’m starting to read through a ~1 million word series in Japanese that I did some quick word analysis on. Of the roughly 42k unique tokens, 12k are used only once, and nearly half are used fewer than 5 times. I haven’t done any analysis of when I’d likely stop seeing new lessons marked as “difficult” at this rate of new-term introduction, but with up to 50% of terms being rare (at least within the series), I suspect the difficulty rating will remain artificially high even well into the series, when most of the words (by frequency of occurrence) are familiar.
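A quick analysis like the one described could be sketched as follows in Python. This is only an illustration under the assumption that the series is available as a flat list of tokens; `rare_token_share` and the toy data are hypothetical and not taken from the actual series.

```python
from collections import Counter


def rare_token_share(tokens, threshold):
    """Fraction of unique tokens that appear fewer than `threshold` times."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    rare = sum(1 for n in counts.values() if n < threshold)
    return rare / len(counts)


# Toy data: "a" x6, "b" x5, "c" x2, "d" x1, "e" x1 (5 unique tokens).
sample = ["a"] * 6 + ["b"] * 5 + ["c"] * 2 + ["d", "e"]
used_once = rare_token_share(sample, 2)       # tokens appearing exactly once
used_under_five = rare_token_share(sample, 5)  # tokens appearing fewer than 5 times
print(used_once, used_under_five)
# prints "0.4 0.6"
```

A threshold of 2 counts tokens used exactly once, and a threshold of 5 counts tokens used fewer than 5 times, which are the two cut-offs mentioned above.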

Again, I have no idea how much work this feature might take, and I wouldn’t suggest removing the percentage of unique words, as many people probably rely on it. But as a supplement, I think it could be useful to many.

t_harangi · January 26, 2022, 6:58pm

I myself have pushed for this in the past, and there were a few discussions about it in the forums, but it has never been implemented. For an overall comprehension index, it would be more useful to do it the way you suggest.