Language coverage stats (e.g. 97% of spoken language covered, etc.)

It would be nice to have a “% of language covered” statistic (for spoken language, say, or for newspaper/internet articles) based on which words we know. The database used to judge the importance of words could be any one of the frequency lists that already exist. Plus, I’m sure LingQ already uses one, seeing as it shows the importance of each word in the vocabulary section.

https://en.m.wiktionary.org/wiki/Wiktionary:Frequency_lists


Do you mean providing a personal statistic of how much of a target language the learner has marked as “known”? That would be interesting to have, though it’s easy to see lots of problems in defining how much of the language to measure against. I also wonder whether it could prove demotivating? Yay, I know 0.3% of my new language!

Yeah that’s what I had in mind. The link I provided has, for a lot of languages, words pulled from movie subtitles. So I think that would be a good indication of spoken language. Of course it wouldn’t necessarily match up perfectly with literary language, but it would be a good indicator of where someone stands.

PS: the statistic would be weighted by frequency, so even a couple of prepositions would get you to something like 50% (which is meaningless), but it would be more meaningful between 90% and 99.99%. Just out of curiosity I loaded the top 50k words for Russian into LingQ: I know 19829 of them (23046 are blue, 7125 are yellow). So 19829 of the roughly 30k known words I’ve reached on LingQ are on that list.
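
To make the frequency weighting concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not how LingQ computes anything: the function name `coverage`, the toy word counts, and the assumption that the frequency list is available as a word-to-count mapping are all made up for the example.

```python
from typing import Dict, Set


def coverage(freq: Dict[str, int], known: Set[str]) -> float:
    """Share of all word occurrences in the corpus that are known words.

    freq  -- hypothetical frequency list: word -> number of occurrences
    known -- words the learner has marked as known
    """
    total = sum(freq.values())
    if total == 0:
        return 0.0  # empty list: nothing to cover
    covered = sum(count for word, count in freq.items() if word in known)
    return covered / total


# Toy counts (invented for illustration, not real subtitle data).
freq = {"и": 500, "в": 300, "не": 150, "кот": 40, "уф": 10}
known = {"и", "в", "кот"}

print(f"{coverage(freq, known):.1%}")  # prints 84.0%
```

In this toy list, two very common words alone account for 80% of occurrences, which is why the low end of the scale says little and the interesting range is near the top.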

PPS: those top 50k include everything encountered in subtitles, from proper nouns to sounds, e.g. “уф”. So maybe a few hundred are complete junk.