‘Importance’, how is it calculated?

Wronski · March 6, 2018, 8:11pm

I have noticed that you can sort vocabulary by importance. Out of curiosity, where do these ratings come from? Is it based on an analysis of the things I have read, or does it come from a pre-existing frequency list?

I have used it a bit to help me decide which words to add to Anki, but it seems like the vast majority of words fall into the lowest importance category, and therefore aren’t sorted. For example, in French, I have around 3000 LingQs (not known), but only around 250 of these have an importance rating of two or more. This brings me back to my first question: will this change as I read more? If not, i.e. if it’s based on a pre-existing frequency list, then how come so few words fall into the higher categories? Is the data not good enough to support a more granular rating system?

Miznia · March 6, 2018, 11:31pm

I think it’s a pre-existing frequency list. I think the ranking is intended for beginners so they can tell which words they might really want to focus on. In your case, your ratio of French LingQs to known words is pretty low. You haven’t needed to LingQ most of the words you’ve encountered. So the importance ranking probably won’t help you that much.

A more refined/granular ranking would be interesting, even if the effects were only visible when sorting your vocab.

Wronski · March 8, 2018, 2:47pm

I downloaded a huge Russian frequency list (based on film subtitles) and used it to rank some of my LingQs from the lowest importance category. The results seem reasonably useful: you could probably call the top quarter of the list middle-frequency and the rest low-frequency. However, there are quite a few words that appear to be placed ‘wrongly’ based on my expectations.

I realised I should have given them a rating based on the relative frequency (which is also included in the list) rather than just sorting them; that might have given more useful information.
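For anyone who wants to try the same thing, here is a rough sketch of rating by relative frequency instead of sorting. It is only my guess at a sensible scheme, not how LingQ does it: the function names, the four bands, and the cut-offs (one band per order of magnitude of occurrences per million words) are all assumptions, and the numbers in the sample list are made up.

```python
import math

def importance_band(per_million, max_band=4):
    """Map a relative frequency (occurrences per million words) to a band.

    Assumed scheme: one band per order of magnitude.
    below 1 -> 1, 1 to <10 -> 2, 10 to <100 -> 3, 100 and up -> 4.
    """
    if per_million <= 0:
        return 1
    band = 2 + math.floor(math.log10(per_million))
    # Clamp so very rare and very common words stay inside 1..max_band.
    return max(1, min(max_band, band))

def rate_words(words, freq_per_million):
    """Give each word a band; words missing from the list get the lowest band."""
    return {w: importance_band(freq_per_million.get(w, 0.0)) for w in words}

# Hypothetical frequency list with invented values (occurrences per million).
sample_list = {"word_a": 250.0, "word_b": 12.5, "word_c": 3.0, "word_d": 0.4}
my_lingqs = ["word_a", "word_b", "word_c", "word_d", "word_e"]
ratings = rate_words(my_lingqs, sample_list)
```

Unlike a plain sort, this keeps the size of the gap between words: two words that sit next to each other in the sorted list can still land in different bands if one is ten times more frequent than the other.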

gregf · March 12, 2018, 9:12am

I don’t think the frequency calculation comes from a predefined word list, if only because I’ve noticed that morphologically similar words can have different frequencies in the LingQ system. That would suggest that the frequencies are calculated on an individual basis, and not according to a master list…

My guess is that they have an algorithm that calculates the frequency of a word given the corpus of texts that have been uploaded into the system. Thus the larger the corpus (English, French, Spanish), the more accurate the frequencies.
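To make the guess concrete, this is roughly the kind of calculation I have in mind. It is purely an illustration: I have no idea what LingQ actually runs, and the function name, the simple regex tokenizer, and the tiny sample corpus are all my own assumptions.

```python
import re
from collections import Counter

def corpus_frequencies(texts):
    """Count each surface form across all texts and return occurrences per million."""
    counts = Counter()
    for text in texts:
        # Naive tokenizer (an assumption): lowercase, then split on non-word characters.
        counts.update(re.findall(r"\w+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count * 1_000_000 / total for word, count in counts.items()}

# Tiny invented corpus standing in for the uploaded lessons.
lessons = ["je parle", "nous parlons", "je parle encore"]
freqs = corpus_frequencies(lessons)
```

Because every surface form is counted separately, "parle" and "parlons" end up with different frequencies here, which matches what I see with morphologically similar words. With only a handful of texts the numbers are very noisy, which is why a bigger corpus should give better estimates.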

But I’d love to hear from someone at LingQ about this…

Miznia · March 12, 2018, 10:51am

I think it’s a predefined list using frequencies from some corpus, but done just once. I don’t think somebody consciously created a list making sure that morphologically similar words got the same values. That would be a pretty thankless task I think.

Basing the counts dynamically on the LingQ corpus would be impressive but I don’t think it would make sense to do that… Unless they couldn’t find a larger corpus or some other authority on this.