The percentage of unknown words indicator uses the wrong metric in my humble opinion

t_harangi · November 12, 2019, 11:59pm

Okay, so this may seem nitpicky, but hear me out: I just noticed that the percentage of unknown words displayed on the lesson and course info – which could be a very useful feature – may be using the wrong metric or implementation. It seems to be calculating the % unknown words out of all the unique words present in a course, and not out of the total words present.

Meaning, if a book has, say, 100,000 total words, made out of 15,000 unique words, and one doesn’t know 1,500 of those words, the % indicator will say 10% Unknown.

Though this may seem right at first glance, when the 1,500 unknown words are calculated against the 100K total, that makes it 1.5% unknown, which I think is the metric actually accepted when it comes to reading comprehension.

The rule of thumb for unassisted reading is that once the percentage of unknown words falls below 2%, the reader can deduce the still unfamiliar words from context. However, this rule uses the unknown vs. total words metric: at 2%, one out of every 50 total words read would be unknown, and at the 1.5% in the example above, one out of every 66.7 total words read would be unknown.
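To make the arithmetic above concrete, here is a minimal Python sketch of the two ways of computing the percentage. The function and parameter names are made up for illustration, and it assumes, as the book example implicitly does, that each of the 1,500 unknown words occurs exactly once.

```python
def unknown_percentages(total_words, unique_words, unknown_unique, unknown_occurrences):
    """Return (% unknown out of unique words, % unknown out of total words)."""
    by_unique = 100 * unknown_unique / unique_words
    by_total = 100 * unknown_occurrences / total_words
    return by_unique, by_total


# Book example from the post: 100,000 total words, 15,000 unique, 1,500 unknown.
# Assumption: every unknown word appears only once, so 1,500 unknown occurrences.
by_unique, by_total = unknown_percentages(100_000, 15_000, 1_500, 1_500)

print(by_unique)        # 10.0 -> what the indicator appears to show
print(by_total)         # 1.5  -> the total-word metric
print(100 / by_total)   # roughly one unknown word per 66.7 words read
```

If the unknown words repeat, the second figure goes up while the first stays the same, so the two numbers can drift apart in either direction depending on the text.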

I think the percentage of unknown words indicator here should be measuring our progress towards reaching that 2% goal, but with its current implementation it doesn’t seem to be a useful indication of that, since it doesn’t follow the accepted metric of reading comprehension.

zoran · November 13, 2019, 5:19am

We had a similar thread before where we discussed this, and our conclusion was that the current metric actually makes more sense. I’ll try to find that thread and re-read it to refresh my memory.

Car2017 · November 13, 2019, 9:59am

Speaking of percentages, could it be that the calculations aren’t right? I’ve had many courses where the percentage of the course as a whole is way higher than for any of the lessons. How can a course have say 25% new words when each lesson only has about 10-20%?

ivargas · November 13, 2019, 10:12am

I totally disagree with you. I prefer to know how many words I don’t know out of the unique words. For example, take a text with 1000 words, of which 50 are unique, and I don’t know 25 of them, i.e. 50%. Using the unique words, I can see that the text is way too difficult for me, but by your metric it would be something like 2.5%. That would lead me to think it’s an easy text for me when, in reality, it isn’t. So, I disagree with you. I need to know how many words I literally don’t know in a text.

t_harangi · November 13, 2019, 7:04pm

I hear you, but the problem is that this metric really only benefits you in the very early stages, when it’s actually less useful since everything is difficult to read at that point.

And the higher your vocab gets, the less useful this measure becomes as an indication, because reading comprehension is not measured by that metric in the relevant studies and references. And most importantly, it is not experienced by that metric.

I’d recommend everyone to check out:

Hu and Nation (2000) https://nflrc.hawaii.edu/rfl/PastIssues/rfl131hsuehchao.pdf

Laufer (2010) https://files.eric.ed.gov/fulltext/EJ887873.pdf

In these studies, and every other academic study that you will find, reading comprehension is found to correlate with the percentage of known words out of the total word count of the text, not the unique word count of the text.

vallarta · November 14, 2019, 12:21am

The high-frequency words, which are more likely to be known, are repeated throughout the course, but a word that appears multiple times is counted only once. So as lessons are added, the number of known words grows more slowly than the number of less frequent words…

Word frequency tends to follow Zipf’s law, which postulates that the probability of seeing a word is inversely proportional to its rank in frequency. The word that is ranked 1,000 in frequency will appear roughly 10 times more frequently than the word ranked 10,000 in frequency.

The more common words show up in more lessons. The rare words show up in fewer lessons… At the lesson level, there are fewer unknown words per known word than at the course level, because the known words are counted as one word regardless of the number of times they appear, at both the lesson and course level.
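As a quick check of the "roughly 10 times" figure, here is a small sketch of the idealized Zipf relationship, where relative frequency is taken as 1 / rank. The function name and the `exponent` parameter are illustrative assumptions; real corpora only follow this approximately.

```python
def zipf_relative_frequency(rank, exponent=1.0):
    """Idealized Zipf weight: frequency is proportional to 1 / rank**exponent."""
    return 1.0 / rank ** exponent


# Compare the word ranked 1,000 with the word ranked 10,000.
ratio = zipf_relative_frequency(1_000) / zipf_relative_frequency(10_000)
print(ratio)  # about 10
```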
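Here is a toy illustration of that effect with the unique-word metric. The lessons and made-up words ("blick", "dax", "wug") are invented for the example, and this is only a guess at how the indicator might count, not the site's actual code.

```python
def pct_unknown_unique(words, known):
    """Percentage of unknown words out of the UNIQUE words in a text."""
    unique = set(words)
    unknown = unique - known
    return 100 * len(unknown) / len(unique)


known = {"the", "cat", "sat", "on", "a", "mat"}

# Each lesson repeats the same common words and adds one rare, unknown word.
lessons = [
    "the cat sat on the mat a blick".split(),
    "a cat sat on the mat the dax".split(),
    "the cat on a mat sat the wug".split(),
]

per_lesson = [pct_unknown_unique(lesson, known) for lesson in lessons]
course_words = [word for lesson in lessons for word in lesson]
course = pct_unknown_unique(course_words, known)

print(per_lesson)  # each lesson: 1 unknown out of 7 unique words, about 14.3%
print(course)      # whole course: 3 unknown out of 9 unique words, about 33.3%
```

The common words are shared across lessons and collapse into one entry each at the course level, while the rare words pile up, so the course figure ends up higher than any single lesson.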

miscology · November 14, 2019, 9:15am

That’s reflected in the color scheme where <12% unknown is green, therefore at your reading level. If it was actually 12% of the total words then it would be a very difficult text, looking up one or two words every sentence.

rafaelnajera · November 14, 2019, 1:22pm

Completely agree with you here. I’m at about 28000 words in German right now, and at this point it is very hard to predict whether a text will be hard or easy to read just from the percentage of unique words known. I now simply ignore that statistic and start to read; I know, however, that if there are just too many blue or yellow words on each page, it’s going to take some work and I may not be able to understand very well.

Coincidentally, just last night I came across a lecture on YouTube by Alexander Arguelles (a towering figure in the polyglot community) on reading as an efficient tool for language acquisition. Most of what he says actually reinforces the idea that LingQ is a pretty good system. He also says that the actual threshold for being able to acquire vocabulary from context only, no dictionary, no LingQs, is about 98% known words (out of total words, not unique words); below that, you run the risk of believing that you understand when in fact you don’t.

peterB1 · November 14, 2019, 1:52pm

The metric suggested by t_harangi is indeed better. The main problem with the current metric is that it is not “additive”. Consequently, the metric for the course is typically larger than that of any of the lessons, which is certainly not helpful. Also, the percentages will tend to be higher for longer lessons, not because they are more difficult, but purely because they are longer. The metric suggested by t_harangi would give percentages that are the weighted averages of the smaller components of any given text (or of the lessons of the whole course).
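The weighted-average property of the total-word metric can be checked with a short sketch. The two lessons below are invented, and the function name is only illustrative.

```python
def pct_unknown_total(words, known):
    """Percentage of unknown word OCCURRENCES out of all words in a text."""
    unknown_occurrences = sum(1 for word in words if word not in known)
    return 100 * unknown_occurrences / len(words)


known = {"the", "cat", "sat", "on", "a", "mat", "and"}

lessons = [
    "the cat sat on the mat blick".split(),                            # 7 words, 1 unknown
    "the cat sat on a mat and the cat sat on the mat dax wug".split(), # 15 words, 2 unknown
]

per_lesson = [pct_unknown_total(lesson, known) for lesson in lessons]
course_words = [word for lesson in lessons for word in lesson]
course = pct_unknown_total(course_words, known)

# Average of the lesson percentages, weighted by lesson length.
weighted = sum(len(l) * p for l, p in zip(lessons, per_lesson)) / len(course_words)

print(per_lesson, course, weighted)  # course and weighted agree, about 13.6%
```

Because of this, the course figure always lies between the easiest and the hardest lesson, which is not guaranteed with the unique-word metric.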

This can also be demonstrated on a single sentence. Let’s assume that the only word I do not understand in “I wanted to buy apples, but they decided to purchase pears.” is “pears”. If I then change the sentence to “I wanted to buy apples, but they decided to buy pears.”, the current metric shows it as more difficult, which does not seem to make much sense.
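The two sentences can be run through both metrics in a few lines. The tokenizer here is a deliberately crude assumption (lowercase, letters and apostrophes only), and the function names are illustrative.

```python
import re


def tokenize(sentence):
    """Very rough tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())


def pct_unknown(sentence, unknown):
    """Return (% unknown of unique words, % unknown of total words)."""
    words = tokenize(sentence)
    unique = set(words)
    by_unique = 100 * len(unique & unknown) / len(unique)
    by_total = 100 * sum(word in unknown for word in words) / len(words)
    return by_unique, by_total


unknown = {"pears"}
with_purchase = pct_unknown(
    "I wanted to buy apples, but they decided to purchase pears.", unknown)
with_buy = pct_unknown(
    "I wanted to buy apples, but they decided to buy pears.", unknown)

print(with_purchase)  # unique: 1 of 10 words = 10%,    total: 1 of 11 = about 9.1%
print(with_buy)       # unique: 1 of 9 words = about 11.1%, total: 1 of 11 = about 9.1%
```

Swapping "purchase" for the already-used "buy" removes one unique word, so the unique-word percentage rises even though the sentence got simpler; the total-word percentage does not move.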

It also seems that it would not be any more difficult for the system software to calculate the metric suggested by t_harangi.

peterB1 · November 14, 2019, 1:55pm

Hi ivargas,
Your preference is just a matter of recalibration. With the metric suggested by t_harangi, you would just need to learn that 2.5% is somewhat difficult and 1% is easy. I justify above/below why the metric suggested by t_harangi is indeed better.

nobody · November 16, 2019, 11:52pm

Here is a table from an Alexander Arguelles video that you can use to calculate your approximate vocabulary range. For example, if you know 98% of the total words in a native text, then you have a 6,000 word vocabulary. The table assumes that you are using native materials, and it is based on the total number of words.

80% (1,000 words)
87% (2,000 words)
90% (3,000 words)
92% (4,000 words)
97% (5,000 words)
98% (6,000 words)

98.5% (8,000 words)
98.86% (10,000 words)
99.11% (12,000 words)
99.25% (14,000 words)

I have done most of my Italian reading on Facebook, so I started using this table and I usually have a 92-96% word comprehension level, which indicates that I have a 4,000 word vocabulary.
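For anyone who wants to do the lookup programmatically, here is a small sketch using the rows of the table above. The function name is hypothetical, and the rule of picking the highest row whose percentage you have reached (with no interpolation between rows) is my own assumption about how to read the table.

```python
from bisect import bisect_right

# (coverage of total words in %, approximate vocabulary size), copied from the table.
COVERAGE_TABLE = [
    (80.0, 1_000), (87.0, 2_000), (90.0, 3_000), (92.0, 4_000),
    (97.0, 5_000), (98.0, 6_000), (98.5, 8_000), (98.86, 10_000),
    (99.11, 12_000), (99.25, 14_000),
]


def estimate_vocabulary(coverage_pct):
    """Return the vocabulary size of the highest table row reached, or None below 80%."""
    thresholds = [coverage for coverage, _ in COVERAGE_TABLE]
    index = bisect_right(thresholds, coverage_pct)
    if index == 0:
        return None  # below the first row of the table
    return COVERAGE_TABLE[index - 1][1]


print(estimate_vocabulary(92))  # 4000
print(estimate_vocabulary(96))  # 4000 (the 5,000-word row starts at 97%)
print(estimate_vocabulary(98))  # 6000
```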