How many words/characters do you need to be fluent in Chinese?

This week, Mandarin learner and language app developer Karl Baker published a blog post offering a comprehensive breakdown of what you can expect to be able to do at each level of known Chinese words/characters. I find his guide much more helpful and realistic than CEFR or HSK. Check out the blog post below and let me know your thoughts :slight_smile:

2 Likes

Interesting article / blog but:

“5,000 Words (2,000 Characters)
You should be able to read slightly more advanced texts such as modern novels.”

That’s wildly optimistic. I’ve found modern novels difficult in their own way since they involve a lot of internet / local slang terms or they intentionally “misspell” words to get around the auto web censors.

I found 3,500 hanzi was the level I needed to not get hit in the face constantly with unknown hanzi when immersing with native material. (I cannot confidently define what exactly a “word” is in Mandarin, so I have no idea how large my vocabulary is.)

I strongly believe in immersing in native material whatever the level, and sticking whatever you want to read into LingQ as long as you have the patience to get through it. I’ve just thrown out the whole “words /hanzi known” goals in exchange for # of words read / hours of listening.

4 Likes

Also, after browsing through his blog, I would give some advice to people wanting to start reading outside of beginner materials / graded readers. I think a trap is to use translated literature like The Lion, the Witch and the Wardrobe, The Little Prince or Harry Potter because the reader is already familiar with the plot, and I personally don’t recommend it. I browsed through them after reading a lot, and I find the style quite awkward; they also contain lots of foreign names / loanwords which don’t crop up frequently in native content.

Instead, just import >100 stories (yes, that many…never underestimate how much you have to read haha) from gushi365.com. They have sections broken down by age level, so you can pick stuff at varying difficulty. You’ll get a better feel for how the language works (with onomatopoeia and what plants / animals native children are familiar with). The stories are really cute and fun, and I’ve actually been meaning to go back and read more of them.

4 Likes

An interesting topic, thanks for bringing this up. Many learners will probably agree that vocabulary size is one of the best indicators of one’s language proficiency.
The levels Mr. Baker has come up with seem familiar. For example, researchers estimate that educated native speakers of English know about 20,000 word families (Goulden, Nation and Read, 1990; Zechmeister, Chronis, Cull, D’Anna and Healy, 1995). Generally linguists understand words as word families. The research seems to indicate that for adequate comprehension a text coverage of 98% is needed, that is 1 unknown word in 50. See:

(I believe others have suggested a lower 90%-95%)
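
To make those coverage figures a bit more concrete, here is a minimal sketch that converts a coverage percentage into "1 unknown word in N" and into unknown words per page. The function names and the 300 words per page figure are my own assumptions for illustration; they don't come from the research cited above.

```python
def words_between_unknowns(coverage_percent: float) -> float:
    """Return N such that, on average, 1 running word in N is unknown."""
    if not 0 <= coverage_percent < 100:
        raise ValueError("coverage_percent must be in the range [0, 100)")
    return 100 / (100 - coverage_percent)


def unknown_words_per_page(coverage_percent: float, words_per_page: int = 300) -> float:
    """Expected unknown words on one page (300 words per page is an assumed figure)."""
    if not 0 <= coverage_percent <= 100:
        raise ValueError("coverage_percent must be in the range [0, 100]")
    return words_per_page * (100 - coverage_percent) / 100


for coverage in (90, 95, 98):
    one_in = words_between_unknowns(coverage)
    per_page = unknown_words_per_page(coverage)
    print(f"{coverage}% coverage: 1 unknown word in {one_in:g}, about {per_page:g} per page")
```

Under that page-length assumption, the gap between 90% and 98% is the difference between roughly 30 and 6 lookups per page, which may be why the two thresholds feel so different in practice.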

Measuring your vocabulary can help you gauge your general level in a language. But, more practically, one can use it to find appropriate reading material. This is especially important for unaided reading, i.e. reading without a dictionary, as opposed to reading on LingQ.

You can test yourself here (in English): https://my.vocabularysize.com
In fact I just did: Imgur: The magic of the Internet

I have some questions though:

  • Can English word families be equated with Chinese root words (as Mr. Baker seems to do)?
  • Do scientific studies on the necessary vocabulary size for adequate reading in Chinese exist?
  • How do various sources, e.g. news articles, technical texts and novels, differ in this respect?
  • How large is a native speaker’s vocabulary size?
  • The HSK frequency list seems to cover only 5000 items (intermediate level?). Do we have more comprehensive lists?
  • How many (root)words does a typical Chinese novel have?
  • Does a vocabulary size test exist?
  • Do I understand correctly that the “Mandarin Vocabulary Builder” is one of those tests? I don’t have an Android device, so I can’t try the app.
1 Like

I don’t really know the answer to this, but to enjoy the majority of content, LingQ’s Advanced 2 word count of 30,250 is a good amount to reach for. Anything more than that is useful as well. I still believe that 24,000 words is not enough.

5000 characters is very good

This post makes me feel quite uncomfortable! :smiley: I’ve been studying Chinese for half a year now and my word count goes up really slowly. I’m sure it will increase faster in the future, but right now a word count of 30,000 words seems pretty far away :smiley:

3 Likes

Just keep at it! It’s going to take a lot (a lot) of reading, but eventually it clicks. I remember when I first started, just getting to 1000 words took a horrendous amount of mental effort, but now I’m surprised at how quickly my LingQ count goes up (please note that LingQs aren’t directly correlated to vocab size, as the software picks up a lot of names / nonsense combinations).

A vocab size of 25,000 - 30,000 seems accurate, though, for reading most literature comfortably at the mythical 98-99% comprehension level. I put one of my books into Chinese Text Analyzer after reading the blog post above, and it has a unique word count of 25k. Granted, it’s a wuxia / cultivation webnovel, so it uses a lot of fantasy / TCM / historical / slang terms, but it doesn’t have many modern / business / tourist / loan words.
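
If you want to see what a unique word count and a coverage figure actually measure, here is a rough sketch. It is not how Chinese Text Analyzer works internally (I don't know that); it just assumes the text has already been split into words, and the tiny token list and known-word set are made up for illustration.

```python
from collections import Counter


def coverage_report(tokens, known_words):
    """Summarise a pre-segmented text against a set of known words."""
    counts = Counter(tokens)
    running_words = sum(counts.values())
    # Running words that are known: each occurrence counts, not each unique word.
    known_running = sum(n for word, n in counts.items() if word in known_words)
    coverage = 100.0 * known_running / running_words if running_words else 0.0
    return {
        "unique_words": len(counts),
        "running_words": running_words,
        "coverage_percent": coverage,
    }


# Hypothetical, hand-segmented sample: "I like reading wuxia novels, I also like watching films"
tokens = ["我", "喜欢", "看", "武侠", "小说", "我", "也", "喜欢", "看", "电影"]
known = {"我", "喜欢", "看", "也", "电影"}

report = coverage_report(tokens, known)
print(report)
```

Note that coverage is counted over running words while the 25k figure is a count of unique words, so a book can have a huge unique word count and still be mostly readable if the rare words each appear only once or twice.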

(I’m hoping that if I read the equivalent of 10k-20k pages in LingQ, I’ll be able to read comfortably. It sounds like a lot, but I’m trying to be realistic about how long it takes. But I’m monolingual, so for others it might be quicker.)

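Chinese is a vast language (the chengyu!!) and anyone who says it is easy…不知天高地厚 (literally “doesn’t know how high the sky is or how deep the earth is”, i.e. has no idea what they’re getting into)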

2 Likes

Hi all,

I’m the original author of this article. It’s good to see that you found it interesting and I enjoyed reading some of your feedback.

A few answers to some of your questions:

  • A good online Mandarin vocabulary size test exists here: Online Chinese Vocabulary Size Test
  • There’s also a good test for how many characters you know here: https://hanzitest.herokuapp.com/
  • My free Mandarin flashcard app is available to download here: https://play.google.com/store/apps/details?id=spaced.repetition.mandarin.chinese.learning.vocabulary.builder
    There isn’t an iOS version yet, but I am working on one.
  • Can English word families be equated with Chinese root words?
    Roughly, I think so, yes. Because there is no conjugation in Chinese, almost all words can be considered root words.
  • The HSK frequency list seems to cover only 5000 items (intermediate level?). Do we have more comprehensive lists?
    When I was building my flashcard app, I did some word frequency analysis in Mandarin and came up with another 5 packs of words that are not in the HSK but are just as common. I named these the “HSK Extra” packs. The app currently has HSK Extra levels 1 to 5. The HSK Extra 1 pack contains the 150 most common words that are not in the standard HSK. HSK Extra 2 contains the next 150 most common words not in the standard HSK. HSK Extra 3 has 300 words, HSK Extra 4 has 600, and HSK Extra 5 has 1300. I’m still working on further levels of the HSK Extra packs, but I’ve noticed an enormous benefit from studying them daily with flashcards.

The other questions may take me a bit longer to answer. I remember coming across the answers before, but I will need to look up the exact sources again, so I’ll get back to you.
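
For anyone adding up the HSK Extra pack sizes listed above, here is a quick sketch of the running totals. The pack sizes are the ones given in this post; the 5000 figure for the standard HSK list is taken from the question being answered and is treated here as an assumption, and the names are just for illustration.

```python
from itertools import accumulate

# Pack sizes as stated in the post (level -> number of words).
HSK_EXTRA_PACK_SIZES = {1: 150, 2: 150, 3: 300, 4: 600, 5: 1300}

# Assumed size of the standard HSK list, taken from the question above.
ASSUMED_HSK_SIZE = 5000


def cumulative_pack_totals(pack_sizes):
    """Return the running total of words after finishing each pack level."""
    levels = sorted(pack_sizes)
    totals = accumulate(pack_sizes[level] for level in levels)
    return dict(zip(levels, totals))


totals = cumulative_pack_totals(HSK_EXTRA_PACK_SIZES)
for level, total in totals.items():
    print(f"After HSK Extra {level}: {total} extra words, "
          f"{ASSUMED_HSK_SIZE + total} including the standard HSK list")
```

So finishing all five packs adds 2500 words on top of the standard list, assuming no overlap between the packs and the HSK itself.
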
4 Likes

Thanks Karl. In response to some points raised below, I think while the LingQ counter is useful for measuring your progress on LingQ, it’s not at all useful for measuring your vocabulary size. This is because a) LingQ is bad at deciding what counts as a word and what doesn’t and b) everyone has a different standard for when they mark a word as ‘known’. A more objective measure is to use the tools you mention.

That’s important because otherwise you might think the vocabulary levels described in the article can be mapped onto LingQ and wonder why you’re still struggling to read novels even though your counter is over 5000. But if we’re talking about 5000 root words, e.g. HSK6 proficiency, then many modern novels are within reach. There will still be unknown words/characters on every page, but few enough that it’s not painstaking.

2 Likes

Yes, LingQ is bad at deciding what counts as a word and what doesn’t, sometimes suggesting random combinations of characters as “words”. A good solution is to mark those as ignored (click the trash icon), which prevents your known word count from being artificially inflated.

Using this method, my LingQ known word count has always been within the range that would be expected for my level of comprehension.

It would be nice if LingQ segmented the words properly in the first place, but it’s admittedly a hard problem to solve for computers and I can deal with it.
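
To show why segmentation is hard, here is a toy sketch of forward maximum matching, a simple textbook dictionary-based approach. I have no idea what LingQ actually uses, so this is purely an illustration; the function name and the tiny lexicons are made up.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy segmentation: at each position take the longest lexicon match.

    Falls back to a single character when nothing in the lexicon matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


sentence = "研究生命起源"  # "to research the origin of life"

# With 研究生 ("graduate student") in the lexicon, the greedy match goes wrong.
greedy = forward_max_match(sentence, {"研究", "研究生", "生命", "起源"})
print(greedy)

# Without it, the same algorithm finds the intended split.
intended = forward_max_match(sentence, {"研究", "生命", "起源"})
print(intended)
```

The first call splits off 研究生 and leaves a stray 命, which is exactly the kind of random combination you end up having to trash by hand.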

1 Like