Vocabulary coverage ratios

There’s often talk of how many thousand words are enough to cover a given percentage of an average text. A book I was reading had statistics for a few languages.

1000 words covers 80.5%
2000 words covers 86.6%
5000 words covers 93.5%

1000 words covers 83.5%
2000 words covers 89.4%
5000 words covers 96.0%

1000 words covers 81.0%
2000 words covers 86.6%
5000 words covers 92.5%

Chinese (Mandarin)
1000 words covers 73.0%
2000 words covers 82.2%
5000 words covers 91.64%

1000 words covers 60.5%
2000 words covers 70.0%
5000 words covers 81.7%

1000 words covers 73.9%
2000 words covers 81.2%
5000 words covers 89.3%

1000 words covers 67.46%
2000 words covers 80.0%
5000 words covers 92.0%

1000 words covers 69.2%
2000 words covers 75.52%
5000 words covers 83.13%

The table in the book I was reading is reproduced in this document (Japanese, pdf):
There’s quite a bit of variety there. This was a Japanese book, and the Japanese statistics seem to show that you need a large vocabulary for everyday usage. The book states that you need 10,000 words to cover 91.7%, which is lower than the figures for English, French and Spanish. I am always a bit doubtful of these figures, because when I’ve seen statistics like this for Japanese they are usually printed along with some text about the unique richness of Japanese vocabulary and a few examples of how inferior English is, which usually just demonstrate the writer’s poor grasp of English.

The book actually leaves out the stats for Russian and German, which I presume didn’t fit the point they were trying to make; I took those from the linked PDF file. Japanese does have a very rich and varied vocabulary, but I am surprised that it uses a larger number of words than other major languages, and probably even more surprised that German is comparable.

Anyway, this is quite interesting for language learners. Numbers aside, how have you found different languages to be? The levels for the avatar on LingQ differ between languages, but they do not correspond with these statistics. Have you generally found that you can understand a similar amount in different languages when you know a certain number of words? Or does it differ wildly? Have you seen any other statistics that contradict these? Or agree with them?

You know, I’d find it interesting to see similar stats but for spoken language.

It seems like the most common 200 words comprise about 80% of the commentary in German Youtube videos. But then again perhaps it’s just my taste in videos that’s the problem . . .
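If anyone wants to check a figure like that against their own material, here is a rough sketch of how a coverage ratio can be measured. Everything in it is my own assumption rather than the book's method: the function name `coverage` is made up, a "word" is simply a run of letters or digits after lowercasing, and every inflected form counts as its own word. It will not work as-is for languages written without spaces, such as Japanese or Chinese, which need a proper word segmenter first.

```python
from collections import Counter
import re


def coverage(text: str, top_n: int) -> float:
    """Return the share of running words covered by the top_n most frequent word forms.

    Assumption: a "word" is any run of letters/digits, lowercased.
    Inflected forms are NOT merged, so "go" and "goes" count separately.
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Add up the occurrences of the top_n most frequent word forms.
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)


# Tiny made-up sample; a real test needs a transcript of many thousands of words.
sample = "the cat sat on the mat and the dog sat on the cat"
for n in (1, 2, 5):
    print(f"top {n} words cover {coverage(sample, n):.1%} of the sample")
```

On a short text the percentages come out far too high, because there are only a handful of different words to begin with, so this is only meaningful on a large transcript or corpus.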

Also I’m surprised at the difference between Spanish and French.

Spanish is fairly well known for supposedly having an extremely limited vocabulary, though I’ve never heard such a comment about French.

I’ve always believed that Spanish has a smaller vocabulary than French. Interesting.

It’s strange, but for Italian I rarely meet texts with less than 10% unknown words, even though I know more than 12,000 Italian words.
Looking at the statistics for Romance languages (French, Spanish), I would expect to see similar percentages for Italian. But…

Here are my statistics for Steve’s book in Italian (the percentage of unknown words is given in parentheses for each chapter):
Is this book not “average” enough? :wink:

Don’t forget the names and forms of words that you already know. I am down to 10-15% in Russian from Echo Moskvi.

What an interesting looking PDF document! Pity my Japanese sucks :frowning:

I’ll try OCR’ing it and importing it as a (private) LingQ lesson.

I’d heard before that French had a smaller vocabulary.
However, I imagine that a lot of this difference is down to what you define as a word. I gather in German there’s a tendency to agglutinate words, so it is easy to create a new ‘word’ without adding any new lexical units. In English ‘teacup’ is one word, but ‘coffee cup’ is not. I’d be surprised if any of these languages had any significant difference in the number of commonly used expressions of meaning. I wonder if the large number of words required for Russian is due to the many case endings and verb forms?

I don’t think Japanese has any more variety of expression than English. In that article they mention the large number of ways of saying ‘person’, yet in more general terms Japanese has very few adjectives compared to most languages.

Steve, and other polyglots, with your experience, have you found huge differences in the number of words needed?

@roan: Yes, I’ve read that French has a smaller overall vocabulary, but that the average number of words in regular everyday use is larger than in English. I would think it depends on what kind of register in the language we are looking at.

I don’t know if the document says how they arrived at those numbers (it’s in Japanese : ) ). As has been raised, it’s very hard to define what a word actually is, and to decide when a prefix or a suffix is producing a grammatical inflection and when it is producing a “new” word. I’ve seen that you can do some maths on things like the total number of verb forms that are technically possible in agglutinative languages:

“Taking into account all the possible prefixes and suffixes gives the 5070 (338 x 15) possible forms of each Hungarian verb.”

Which is of course really interesting (for me at least), but does this mean that Hungarian or Japanese really has millions of distinct words? And even granting that, I don’t think it means they are necessarily more fine-tuned or expressive, just that they encode information in a different way.

@maths - “Yes, I’ve read that French has a smaller overall vocabulary, but that the average number of words in regular everyday use is larger than in English.” - That might help to explain why I can’t understand French films to save my life! lol

Not the high-art complex plot lines then : )