Defining the 30,000-word benchmark

"It is mentioned that a vocabulary size of 30,000 words equates to an Advanced 2 level. I am aware that this count includes inflected forms like plurals and verb conjugations. But does it also include proper nouns (names of people, cities, countries) and written numbers (like ‘seventeen’)?

If the 30,000-word benchmark assumes these are included, then ignoring them seems illogical. However, to ensure this target is accurate, are there specific categories of words you would recommend I mark as ‘ignored’?"

How does Steve Kaufmann do it?


I think there’s no way to avoid that issue completely. According to my analysis, for English, the number of real unknown words is under 60% of the displayed number.


Wow, an amazing analysis, but a little complex for me because of my English level. I will review it more closely, thank you.

Did you use “AntWordProfiler” for the lemmatization?

You wrote that about 40,000 words come down to 24,000 words. Am I right? If so, 20,000 words is more than enough to read Harry Potter, Dan Brown books, etc.

By the way, I certainly ignore proper names, but I’m not sure about place names and written numbers like seven, fifteenth, second, 15th, etc.

How do you do this?

I used a Python library to clean the raw data.
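The library is not named in the post, so the following is only a rough sketch of what such a cleaning pass could look like, using nothing beyond the Python standard library. The stop lists, the `clean_known_words` name and the crude suffix rule are all assumptions made for illustration; a real lemmatizer would do a much better job.

```python
import re

# Hypothetical stop lists; a real run would use much larger ones.
NUMBER_WORDS = {"one", "two", "seven", "seventeen", "fifteenth", "second"}
PROPER_NOUNS = {"london", "harry", "germany"}


def crude_lemma(word):
    """Very rough stand-in for a lemmatizer: strips a few common English endings."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_known_words(raw_words):
    """Drop numbers, listed names and non-alphabetic entries, then merge inflected forms."""
    lemmas = set()
    for raw in raw_words:
        word = raw.strip().lower()
        if not re.fullmatch(r"[a-z]+", word):
            continue  # skips digits and ordinals such as "15th"
        if word in NUMBER_WORDS or word in PROPER_NOUNS:
            continue
        lemmas.add(crude_lemma(word))
    return lemmas


raw = ["walk", "walks", "walked", "walking", "London", "seventeen", "15th", "book", "books"]
cleaned = clean_known_words(raw)
print(len(raw), "->", len(cleaned), sorted(cleaned))
```

On this toy list, nine raw entries collapse to two lemmas, which is the same kind of shrinkage described above, only exaggerated.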

This is what I’ve noticed after using LingQ daily for two years:
The number of known words doesn’t accurately reflect the actual number of words you know.
As the accumulated vocabulary increases, the proportion of newly added words that are people’s names, place names, and typos seems to increase.
So, don’t focus on the number itself, just on the rise of the number.
For me, the most useful statistics are Words of Reading and LingQs Created, since other stats can be distorted by outliers or errors.


P.S. If you want to export all of your known words from LingQ, try this add-on:

https://forum.lingq.com/t/lingq-addon-ai-tools-custom-layouts-for-an-enhanced-learning-experience/


I appreciate that. I will certainly use this add-on once I reach a high number of known words.

It doesn’t. The Advanced 2 level (like the other levels) at LingQ uses quite different thresholds, depending on the language.


Personally, I do not get too caught up with it. Sometimes I add names, numbers and places to known, sometimes I don’t. It honestly depends on what mood I am in haha.

There is a joke I saw a few months ago that made me laugh. It said, “I know one hundred thousand words in German: eins, zwei, drei…”


I agree, I don’t care about it anymore either. Often, I don’t even convert yellow words anymore. But I believe that every word matters, because we need to be able to recognize names and any other type of word. We even need to recognize that a surname is a surname and not a word with a meaning, for example.

If calculation is a “problem”, one can just mentally consider a percentage of error. For example, you can see 30k displayed, but with a 10% error, it becomes 27k. Problem solved, and a lot less headache!
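As a small illustration of that mental correction, here is a Python sketch. The function name `adjusted_count` and the 10% figure are just the example from this post, not anything LingQ itself computes.

```python
def adjusted_count(displayed, error_rate):
    """Return the displayed known-word count reduced by an assumed error share."""
    if not 0 <= error_rate <= 1:
        raise ValueError("error_rate must be between 0 and 1")
    return round(displayed * (1 - error_rate))


print(adjusted_count(30_000, 0.10))  # 27000
```

Pick whatever error rate matches how many names, numbers and typos you tend to mark as known.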

Plus, I wouldn’t rely on Advanced 2 being close to reality for all languages and all situations. We did the comparisons in the forum a while ago, but I don’t remember where.


I just read an article that said 8,000 to 15,000 words should get you to C2 (Advanced 2 on LingQ) level in most languages. Some languages would need a far lower word count, depending on how you define a “word”.

Country, city and people names should count. You can’t read the news or watch a historical documentary, or even read a geography/history textbook for elementary schoolers, if you don’t know country and city names.

In most languages I have studied, names are also normal words (like being named Eagle, Bear or Stream-Branch). Also, you sometimes run into problems if you write an Email to someone and don’t know if they’re a man or a woman because you haven’t learned about their name in the language.

I’ve watched a few interviews with Steve and he doesn’t seem, to me, to be the type of person to waste time by over-analyzing his language study. His method is really just “tons of comprehensible input”. In my own study I just hit a word as learned or not, I don’t bother with levels in between or with hitting the ignore button.

You can learn thousands of words, and still know “all the wrong” words. I learned an Asian language in university – very different from my native English both in vocabulary and grammar – and continued learning through TV and novels (all fiction). I took an international language exam and achieved C1. But when I went to live in the country, I found that there were a lot of everyday words and phrases I didn’t know (“sponge, mop, bleach, pension plan, I’m sorry for your loss….”), and my language level felt much lower in reality because I had learned all the wrong words for my new situation. The opposite is true too. While I was there, I met foreigners who wrote and spoke at an amazingly natural level in the language, far better than me, but they admitted they couldn’t understand fiction!


Actually, when I say ‘advanced level with 30,000 words,’ I don’t mean being fluent in reading or fully understanding what you read. What I mean is purely vocabulary matching. For example, the Oxford or New-GSL frequency lists always give match/coverage rates. So the fact that 3,000 words cover 90-92% of any text doesn’t mean you will understand 90% of it.

So 30,000 words in LingQ is roughly equal to 10,000-12,000 lemmas. That probably gives 99% coverage of the TV show Friends and 98% coverage of Harry Potter, apart from the wizarding and fantasy vocabulary.
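To make the idea of a match/coverage rate concrete, here is a minimal Python sketch. It only measures what share of the running words in a text appear in a known-word list; the tiny word list and sample sentence are made up for illustration, and the percentages quoted above are not derived from it.

```python
import re


def coverage(text, known_words):
    """Share of running words (tokens) in text that appear in known_words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in known_words)
    return hits / len(tokens)


known = {"the", "boy", "lived", "in", "a", "house"}
sample = "The boy lived in a cupboard in the house."
print(f"{coverage(sample, known):.0%}")  # 8 of 9 tokens are known
```

This is pure matching, so a high figure says nothing about whether the sentence was actually understood.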

And I think we can understand nouns whether they are plural or singular, and verbs in both their base form and their conjugated forms, at least most of them. So the threshold for understanding may be even lower than the target numbers.

But of course everyone’s aims, expectations and standards are different, and my goals are actually well below yours.

I think I will know whether I’m right or wrong about a year from now. For now I’m reading Oxford Bookworms level 4 (intermediate).

You need to at least specify the language or group of languages. As a general statement, that does not seem to be accurate.


Except that creating LingQs messes up the % of known words statistic, so we don’t know how much of an article we are likely to understand. Bad statistics make LingQs a bad idea.


I asked Gemini 3.0 (Google’s latest LLM) this question a few minutes ago and it gave me a very interesting and useful answer:

This is one of the most common points of confusion for LingQ users. The short answer is that LingQ levels (Beginner 1 – Advanced 2) do not map 1:1 to CEFR levels (A1 – C2).

In fact, LingQ’s “Advanced 2” is widely considered by the community to be roughly equivalent to a solid B2 (Upper Intermediate) in reading comprehension, rather than a C2 mastery.

The main reason for this discrepancy is how “words” are counted.

1. The “Word Count” Inflation: Forms vs. Families

Standard CEFR estimates usually count Word Families (e.g., run, runs, ran, running = 1 word). LingQ counts Word Forms (e.g., run, runs, ran, running = 4 known words).

Because of this, your “Known Words” number on LingQ will always be significantly higher than standard academic estimates, especially for highly inflected languages.

  • English/Mandarin: Low inflation (few variations per word).

  • Spanish/French/German: Medium inflation (conjugations, genders).

  • Russian/Slavic: High inflation (cases, declensions create dozens of forms for one word).
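The difference between the two counting methods can be sketched in a few lines of Python. The `FAMILY_OF` mapping is a hypothetical hand-made mini-lexicon, not data from LingQ or from any CEFR source.

```python
# Hypothetical mini-lexicon mapping each word form to the head of its family.
FAMILY_OF = {
    "run": "run", "runs": "run", "ran": "run", "running": "run",
    "book": "book", "books": "book",
}


def count_forms_and_families(known_forms):
    """Return (number of distinct forms, number of distinct families)."""
    forms = set(known_forms)
    # Forms missing from the lexicon are counted as their own family.
    families = {FAMILY_OF.get(form, form) for form in forms}
    return len(forms), len(families)


print(count_forms_and_families(["run", "runs", "ran", "running", "book", "books"]))  # (6, 2)
```

Six known forms shrink to two families here; in a heavily inflected language the gap would be wider.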


2. Rough Conversion: LingQ Words to CEFR

Since LingQ updated its thresholds around 2023, the community and internal data suggest the following approximate targets for a B2 (Fluent Reading) level.

Note: These are estimates for Reading Comprehension only.

| Language type | LingQ “Advanced 2” target | Approx. CEFR level |
| --- | --- | --- |
| English | ~15,000 - 20,000 words | B2 / C1 |
| Romance (Spanish, French) | ~30,000 - 45,000 words | B2 |
| Germanic (German, Swedish) | ~40,000 - 50,000 words | B2 |
| Slavic (Russian, Czech) | ~60,000 - 80,000+ words | B2 |
| Japanese | ~40,000+ words | B2 |

Key Takeaway: If you are learning Spanish, you are not “Advanced” (C1/C2) when you hit 15,000 words. You are likely just breaking out of the intermediate plateau.


3. LingQ Levels vs. CEFR Reality

If we try to map LingQ’s internal stages (Beginner 1 to Advanced 2) to CEFR, the consensus “Rule of Thumb” looks like this:

  • LingQ Beginner 2 ≈ CEFR A1 (Survival)

  • LingQ Intermediate 1 ≈ CEFR A2 (Basic conversation)

  • LingQ Intermediate 2 ≈ CEFR B1 (Independent user, can consume clear media)

  • LingQ Advanced 1 ≈ CEFR B1+ / B2 (Approaching fluency)

  • LingQ Advanced 2 ≈ CEFR B2 (Solid literacy; “Potential Fluency”)

Why does it stop at B2?

Steve Kaufmann (LingQ founder) often describes the goal of LingQ as “Potential Fluency.” By the time you reach Advanced 2, you have the vocabulary to be fluent (B2/C1), but you likely lack the activation (speaking/writing) to pass a C-level exam unless you have been talking extensively outside the app.

4. Better Metrics to Watch

Since “Known Words” can be gamified (and sometimes inaccurate if you mark names/places as known), many advanced users track these two metrics instead to gauge CEFR progress:

  1. Words Read:

    • 1 Million Words Read: Usually gets you to a solid B2 reading level in Romance languages.

    • 2+ Million Words Read: Needed for difficult languages like Russian or Japanese to reach similar comfort.

  2. Unknown Words Per Page:

    • When you can read a standard book page with only 3–5% unknown words (roughly 5–10 blue words per page), you are solidly B2/C1 in Reading.
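The unknown-words-per-page check above is simple arithmetic, sketched here in Python. The 200-word page length and the 5% limit are assumptions chosen to match the figures quoted above; adjust both to your own reading material.

```python
def unknown_percent(blue_words, page_words):
    """Percentage of words on a page that are still unknown (blue)."""
    if page_words <= 0:
        raise ValueError("page_words must be positive")
    return 100 * blue_words / page_words


def within_target(blue_words, page_words, limit_percent=5.0):
    """True when the unknown share is at or below the chosen limit."""
    return unknown_percent(blue_words, page_words) <= limit_percent


print(unknown_percent(8, 200))  # 4.0
print(within_target(8, 200))    # True
print(within_target(20, 200))   # False
```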

******************

I’m too lazy to bother “ignoring” proper names and locations so they get included but, since I know that, it doesn’t really matter. I’ll know if I’m really a B2 next May or June when I take the test.

I’ve used a bunch of different tools. Outside of regular conversations with my French tutor, LingQ is the most useful tool for truly getting a handle on a language. I have found it to be extremely helpful.


As far as I understand from his videos, all inflected forms are included, as well as basic numbers written out as words (like one, two, etc., not 1, 2). I have not heard him talk about proper nouns, but I would infer that the main names, countries and cities could be included.

That answer is basically what we came up with. I used to consider Advanced 2 to be more or less B2. But I think LingQ was conservative, and for a few languages, I would have added 20% more to their figures.

It’s not that you are lazy; you are smart. You don’t waste time on things that don’t matter. It’s normal at the beginning, but after a while, priorities change. There are advanced people here that don’t even bother to click on blue words or yellow words unless it’s really important. When you start to have tens of thousands of known words in one language, things change a bit.


I would argue that the number of known words equivalent to an FSI level is LingQ’s figure multiplied by 2. Since our statistics take in an uncertain mix of common and uncommon words, the count can misrepresent the number of words that are genuinely in use. I found that even for one of the hardest languages, Mandarin Chinese, LingQ’s known-word figure for Advanced 2 is certainly not accurate, and about double that amount would be needed to match the standard more closely.

Just an opinion.

They updated some of the numbers a while ago, maybe last year? But they didn’t update all languages. I didn’t follow the later news, so I’m not sure whether they completed the entire spectrum or not. What they had before was too low. The updated numbers were still low anyway, but better!

If I’m not wrong, someone said they had advanced 3, and advanced 4 too, but I might be wrong.

That was a thoughtful reply. Thank you.

And regarding advanced users not even bothering clicking on the blue links: I get it. I’ve struggled with prepositions and conjunctions for years but have begun to realize the only way for them to really click is to see them used. I stop now for a second to absorb the usage, nod to myself and go “okay, I get it” and then move on.

Steve K is right. Volume taken in is more important than staring at a vocabulary list. Words seen in context stick.


A related question. Can anyone give examples of working with their collection of known words (in a database sense)? Without this use, marking a word as ‘known’ is a lot like marking a word as ‘ignore’.