Are the LingQ proficiency levels really that accurate for highly inflected languages like Russian or Czech?

This proficiency chart tells you how many words you need for each proficiency level in different languages. I feel, based on my experience learning Czech, that 1800 words (by the LingQ system) would be an accurate depiction of an A2 level Czech vocabulary.

To explain why, I present two basic words from English and Czech, a noun and a verb respectively:
school/škola and “to cook”/vařit

school has “school” and “schools”

škola has škola, škole, školy, školu, školou, škol, školám, školách, školami.

to cook has “cook”,“cooking”,“cooked”

vařit has vařit, vařím, vaříš, vaří, vaříme, vaříte, vařil, vařilo, vařila, vařili

And that’s not even counting perfective forms! As you can see, measured by the LingQ system of counting words, Czech has a vastly larger vocabulary than English, yet it requires only 5000 more words at the highest level of proficiency? I personally believe the word count requirement for Czech or Russian should be somewhere closer to 1.5x to 2.5x that of English.

Personally, I agree that the numbers correlating to some kind of proficiency seem way too low for inflected languages, based on my own experience in Russian and in several non-inflected languages. However, I don’t think that the number of case endings/declensions in Russian and the various imperfective and perfective verb conjugations (for example) are the main concern.

Once one has a basic knowledge of all the cases in Russian which should be somewhere at LingQ’s B2 level, then I would find it odd that intermediate students would make lingQs for regular nouns, adjectives and pronouns in each and every one of their case variations. Certainly I don’t and would only do so for an irregular form of a noun/adjective. I suppose I could, but if I understood the word in the given context (that is, in a specific case), what’s the point? To merely inflate the number of lingQs and then in turn “known” words? While some may do this, I personally care about what I understand (1) in a given context and (2) whether I know the word/phrase well enough to use it independently in conversation. The number of LingQs says nothing about the second which is my primary goal.

Similarly, for verbs, while in Russian there are imperfective and perfective forms for most verbs, in the Romance languages there are more verb tenses and six conjugations for each – that is, many more verb forms than exist in Russian. Again, I can’t imagine anyone making LingQs of regular verb forms in every conjugated form once one is at the intermediate level.

Yet Russian still has considerable challenges relating to grammar. It doesn’t merely have cases applied to nouns, but additionally to adjectives and pronouns, whose endings do not match in spelling the way adjectives agree with the nouns they modify in the Romance languages. Moreover, in Russian, different verbs and prepositions take different cases, with correspondingly different meanings. Proper names and numbers are also declined. Which number is used affects the declension of the nouns that follow (e.g., 2, 3, 4 + noun is different from 5, 6, 7, 8, 9 + noun). Thus, the difficulty is not merely that the endings of nouns, adjectives and pronouns are spelled differently, but knowing when they change (some prepositions can be used with more than one case, and this changes the meaning).

All this makes Russian grammar vastly more difficult for non-Slavic speakers. Learning the infinitive of a regular verb or the nominative case of a noun and adjective is easy. However, one cannot understand text or speech or say anything comprehensible without knowing which form of a noun or verb is appropriate in a given context.

Thus, for me personally, the “advanced” number of words I know on LingQ is not an accurate assessment of my ability to use the language. My skills in (1) reading, (2) writing, (3) listening, and (4) speaking are all different with my best ability being in reading and my worst, speaking. Right now, I am specifically working on improving my listening and speaking but that won’t be reflected in the number of words “known” on LingQ.

BTW, the US State Department’s assessment of words needed for proficiency in various languages counts “word families,” not individual words. Thus, all declensions of a single noun count as a single word, as do all conjugations of a single verb. Verb forms of nouns are also a single word, as are related adjectives and adverbs. How one counts this practically, I have no idea.

At best, the LingQ number probably correlates more to one’s reading comprehension and is motivational since one can see it increase. Yet to what extent one’s reading knowledge is applicable to other language skills remains dependent on the efforts of each learner.

I don’t feel strongly about this one way or the other. But you didn’t count the verbs’ participles, which I know in Russian to have all the declined forms of adjectives. All told, I think I counted somewhere around 80(?) forms derived through grammar from one regular verb.

But isn’t it true that once you learn various nouns and verbs with all their inflections, it becomes easier and easier to deduce inflections for new words you come across? As you advance in the language you don’t have to meet all the forms of a verb to be able to “know” all the forms of that verb, thanks to the patterns you have already become accustomed to.

The chart is very arbitrary, looks quite random and, in my opinion, doesn’t really represent anything.

First of all, what words? Any random words?

Here are some statistics from the Russian National Corpus: the 100 most frequent lemmas cover 37% of the words in texts, 1 000 lemmas – 60%, 2 000 lemmas – 69%, 10 000 – 85%, 50 000 – 93% of the words in texts.

So, roughly speaking, 1 000 most frequent lemmas are somewhere around A2 aiming at B1, and 2000-2500 lemmas are somewhere around B1 aiming at B2.

The problem though is that counting words outside the range of the most frequent 2000-3000 lemmas is, in practical terms, quite useless, since language materials you work with at this level become wider and more register-dependent, so the number of words you have to learn in order to make an additional 1% coverage gain increases sharply. For example, you can learn a lot of words from movies, but they may not necessarily be helpful in reading/watching media, in understanding academic speech, literature, blogs and so on.
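
To make the diminishing returns concrete, here is a minimal Python sketch that uses only the corpus figures quoted above. The function name `lemmas_per_point` is made up for illustration, and the result is an average over each interval, not an exact curve.

```python
# Coverage figures quoted above from the Russian National Corpus:
# (number of most frequent lemmas, percentage of running words covered)
coverage = [(100, 37), (1_000, 60), (2_000, 69), (10_000, 85), (50_000, 93)]

def lemmas_per_point(points):
    """For each interval between two data points, return how many extra
    lemmas buy one extra percentage point of coverage (interval average)."""
    result = []
    for (n0, c0), (n1, c1) in zip(points, points[1:]):
        result.append((n0, n1, (n1 - n0) / (c1 - c0)))
    return result

for start, end, cost in lemmas_per_point(coverage):
    print(f"{start:>6} -> {end:>6} lemmas: about {cost:.0f} lemmas per 1% of coverage")
```

By these figures, one extra percentage point costs roughly 39 lemmas between 100 and 1 000 lemmas, but about 5 000 lemmas between 10 000 and 50 000.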

I don’t think about this too much either but the table seems about right to me. I’ve read several books in Polish now and lots and lots of other material I’ve imported. I’m getting near 20,000 and feel I’m nudging the Advanced level. I’ve also thought about the inflected nature of Polish and how this might artificially increase the known word count. But of course, just because one verb might have dozens of conjugated forms, doesn’t mean that they all come up with equal frequency.

If you dump a lot of material into lingQ for each of those languages, you can compare the word counts. I think that’s what they did, and then they simplified it by putting all languages into one of 3 different groups. So I don’t see how those ratios can be “wrong” for you, unless the material you’re using is quite different from what they used. If you want to do a spot check, you could upload an article in Czech and its translation in English and look at the word count ratio.
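
If you’d rather try that spot check outside LingQ, something like the following Python sketch would do. It is only an approximation: it counts distinct lowercase word forms, which is my assumption about how the LingQ count behaves, and the function names and sample sentences are made up for illustration.

```python
import re

def distinct_forms(text):
    """Return the set of distinct lowercase word forms in a text.
    Assumption: this only approximates how LingQ counts words."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    return set(re.findall(r"[^\W\d_]+", text.lower()))

def form_ratio(text_a, text_b):
    """Ratio of distinct word forms in text_a to those in text_b."""
    return len(distinct_forms(text_a)) / len(distinct_forms(text_b))

# Toy sentences made up for illustration; use a full article and its
# translation for anything meaningful.
czech = "Škola je velká. Jdu do školy. Ve škole vařím. Mám rád školu."
english = "The school is big. I go to school. At school I cook. I like school."

print(len(distinct_forms(czech)), len(distinct_forms(english)))
print(form_ratio(czech, english))
```

A sample this small says nothing about the real ratio; the point is only that the same noun shows up as four forms in Czech and one in English.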

On the other hand, I don’t think the different levels correspond to the CEFR. They don’t make that claim, as far as I know, but I mention it because I sometimes hear people bring it up.

In my own experience of learning Russian I have suspected that Lingq was significantly overestimating my level until I recently bought a Russian graded reader book aimed at A2/B1 and found it is indeed quite a bit below my level on Lingq (B2 approaching C1). Honestly, I was pleasantly surprised how easy I am finding it. So maybe the numbers are not that far off, for me anyway.

It does surprise me that there is not that much difference in the numbers between the language groups.

I just commented about this. Where in lingQ does it tell you your CEFR level, or are you just assuming lingQ intermediate 2 = B2, etc?

They are just milestones. I don’t think you can come up with definite numbers of known words that really reflect your “level”.
There are a couple of factors that IMO are worth considering when discussing the relationship between vocabulary and proficiency, in general and in the case of “highly inflected” languages that I feel are often overlooked:

  1. Number of known words is a wonderful predictor of a low level in a language. Its predictive power declines sharply as the number increases. E.g. if you can only understand, say, 100 words in the language, of course your level is very low. When you go to 1000 you’re most certainly improving. However, if one person understands 30.000 word forms and another 31.000, you have absolutely no clue which one has a higher level. They may vary spectacularly and it’s possible that the first one has a good working knowledge and the second one is barely able to function in it. Even 30.000 vs 40.000 doesn’t mean much, or 100.000 vs 130.000. So, focus on getting to an ok-ish level (examples follow) and then mostly forget about absolute numbers and try to improve your vocabulary as you work on your listening comprehension and active use.
  2. The number of known words (as per Lingq count) does increase with inflection, but much more slowly than you might think based on how many word forms there are. That is why counting the forms doesn’t give you an accurate picture of how many words you “need”. Why is that? Because, if you learn words through context (reading and listening to real material), you are not likely to come across all forms of a given word at once. Some forms are far more common than others. An example from English: you are likely to find “parents” very often; however, it may take very long for you to find “parent” used in the singular. You do “know” the word because you can deduce it from your knowledge of the plural form plus your understanding of the morphology. In the meantime, “parent” doesn’t increase your official word count, which results in an underestimation of your vocabulary. In the same way, you’ll meet “beauty” soon enough but it may take very long to find “beauties”; the same goes for “need” and “needless”, or you may encounter “dare” months before you come across “dared”. In English this effect is tiny, but it gets larger as languages get more inflected. For example, even now I still come across “blue” words that are less common forms of very easy words: it was years until I met “большому” or “многих”.
    The end result is that there’s not such a big difference between less and more inflected languages as you may think. Steve has said that you can function relatively well in most languages if you know about 30.000 word forms and my experience seems to corroborate that claim.

As a rule of thumb, my own suggestions about relevant vocabulary milestones in languages I know are like this.
As per Prof. Arguelles, you usually need about 5000 “word families” (WF) to understand everyday conversations by uneducated speakers (about B1 level), 10.000 to understand well-educated ones (which IMO would also help you understand a lot of media content, about B2 level) and 20.000 to read high-level literature unassisted. You won’t understand everything but you may manage to keep up at each level. We are assuming that you have acquired the vocabulary in meaningful contexts, so you have a good grasp of how the language works and are used to both written and spoken material.
Those numbers seem, in my experience, to work ok across several languages. Now, how many “word forms”, as per Lingq count, correspond to such numbers? It depends on the language, of course, my own estimation is as follows:
English and similar languages have about a 1.5 ratio between word forms and word families. Taking into account what I explained before, the number of known words in Lingq (“lingq words”, LW) that you need is a bit less, but not much, so we can use that figure. So you need about 7.500 LW for the 5.000-WF level (B1-ish), 15.000 LW for the 10.000-WF level (B2-ish) and 25.000 LW for the 20.000-WF level.

For Romance languages I think that a safe estimation would be to double the English one: 30.000 for the 10.000 wf-level (B2-ish), etc. This estimation seems to coincide with what learners of those languages on Lingq usually report.

For Slavic languages (Russian, in particular) I would say that you must double again the previous figures to come up with good predictors : 30.000 LW (about) for B1-ish level (typically I think you experience a sharp increase in level a bit earlier, a bit over the 25.000 mark, but YMMV). about 60.000 for the B2-ish level (again, typically maybe a bit earlier, well over 50.000) and, probably, you’d need well over 80.000 to read difficult literature unassisted (I haven’t reached that level yet, so I’m not so sure).
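
The rule of thumb above (“double the English figures, then double again”) can be written out as a small Python sketch. The baseline pairs 7.500 LW with the 5.000-WF (B1-ish) milestone, 15.000 with 10.000 WF and 25.000 with 20.000 WF; the dictionary and function names are made up for illustration, and the outputs are rough estimates, not measurements.

```python
# English-like baseline in Lingq words (LW), keyed by the word-family (WF)
# milestone it is meant to correspond to. These are the post's estimates.
english_lw = {5_000: 7_500, 10_000: 15_000, 20_000: 25_000}

# Multipliers follow the "double, then double again" rule of thumb.
multipliers = {"English-like": 1, "Romance": 2, "Slavic": 4}

def lw_estimates(baseline, factor):
    """Scale every baseline LW figure by a language-group factor."""
    return {wf: lw * factor for wf, lw in baseline.items()}

for group, factor in multipliers.items():
    print(group, lw_estimates(english_lw, factor))
```

For the Slavic group this gives 30.000, 60.000 and 100.000 LW, which is in line with the “well over 80.000” guess for difficult literature.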

Disclaimers:

  1. For LW I’m assuming passive vocabulary that can be understood with ease, subtracting proper nouns (such as names of people or places).
  2. When I mention B1/B2 levels, etc., I mean that the given number of words gives you a good basis for functioning at that level. The mere knowledge of the words is, of course, not enough to reach that level: you need to work on listening comprehension, conversation, etc.

" For Slavic languages (Russian, in particular) I would say that you must double again the previous figures to come up with good predictors : 30.000 LW (about) for B1-ish level (typically I think you experience a sharp increase in level a bit earlier, a bit over the 25.000 mark, but YMMV). about 60.000 for the B2-ish level (again, typically maybe a bit earlier, well over 50.000) and, probably, you’d need well over 80.000 to read difficult literature unassisted (I haven’t reached that level yet, so I’m not so sure). "

For Polish, it sounds correct to me. At this time I find it more useful to look at the ratio of LingQs to words read. For example, if I read 2000 words and end up with 100 yellow words, then I am reading at the 5 % level. That is still very hard. I think that 2.5 % would be the real beginning of unassisted reading. Currently, I am reading Dan Brown’s Inferno at the 3.5 % level. It is still too hard. I think that I will need another 20-30 K words to read without assistance. I guess that puts me at the low C1 reading level.
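
The percentage I mean is just yellow words divided by words read. A tiny Python sketch (the function name `yellow_rate` is made up for illustration):

```python
def yellow_rate(yellow_words, words_read):
    """Percentage of the words read that ended up as yellow words."""
    if words_read <= 0:
        raise ValueError("words_read must be positive")
    return 100 * yellow_words / words_read

print(yellow_rate(100, 2000))  # 5.0
```

So for a 2000-word text, the 2.5 % mark would mean 50 yellow words.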

By the way, I noticed that reading material that is too easy becomes boring. Reading became enjoyable once I passed a known word count of 50000. Of course, I mean news, magazines, books, etc.

Yes that’s my assumption, I probably should have said my level on Lingq is Intermediate 2 not B2. Although I think I read somewhere that they can be considered as being roughly equivalent. It could only ever be a very rough estimation and I guess the reason they don’t make any claims is because it really depends on the criteria people use for marking words as known. So one person’s Intermediate 2 is not the same as another person’s Intermediate 2. Whereas the CEFR levels are, I assume, designed to be the same for everyone who is assessed against them.
