# How many words do you need to read? Paul Nation was wrong

I know everyone loves to quote Paul Nation’s How much input do you need to learn the most frequent 9,000 words? paper. I just reread it and I believe there’s some confusion about it. His recommendation that 11 million words of reading are required to learn the most common 9,000 word families is a number he essentially pulled out of thin air. Honestly, I don’t know how he arrived at that conclusion.

In his analysis, achieving an average of at least 12 repetitions per word family up to the 9th 1,000 word family level (note: not all word families are actually encountered) took only 3 million words. This is Table 1 in the paper. It is a “running,” as he says (i.e. cumulative), total. Table 2 clearly shows that with 3 million words read you will encounter 8,219 of the 9,000 word families, with word families at the 2nd 1,000 level encountered 171 times each on average.

Why exactly he jumps from a cumulative total to separate values needed for each particular word family level is beyond me. He talks about seeing it as “a set of stage steps.” His initial analysis shows that you need only 3 million words, yet later in the paper he recommends reading 11 million words. That’s almost a fourfold jump! From memory, one of his previous papers estimates that a native speaker learns about 1,000 word families per year on average, so I guess this is why he decided to do it this way. But it is a big leap from his initial analysis, so it should really be taken with a grain of salt.

I agree with him that “there are some serious problems with these crude calculations,” but I still have some qualms of my own.

The reading material he used in his initial (Table 1) analysis is simply unrealistic (it includes 19th-century literature). I imagine he chose these books because they were freely available online. This unrealistic reading material compounds into several problems:

1. He isn’t clear about the order in which he added his 25 novels, though he may hint at it. If he added them in alphabetical order, the first novel would be the 19th-century novel Adam Bede. As this book contains many words above the 2nd 1,000 word family level, it seems like a waste of a first book to read… This means that by reading more realistic beginner material, such as the LingQ mini-stories, you will reach the 2nd 1,000 word family level much faster than his analysis suggests.

2. This analysis starts counting exposures to the 9th 1,000 word family level from the very beginning, when you start reading Adam Bede. Unrealistic much? This means that the mid-frequency levels would take longer to acquire, because in reality you don’t start acquiring them from page 1. (This brings an interesting thought to mind. Perhaps a metric could be created to assess the value of a book in terms of frequency words? Instead of just ‘unknown words’, an interesting metric for choosing material could be ‘unknown mid-frequency words’. Then you could choose material with more mid-frequency words.)
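For what it’s worth, that ‘unknown mid-frequency words’ idea is easy to sketch in code. Everything here is hypothetical (the function name, the toy ranks, the known-word set are my own illustration, not anything from the paper); it just assumes you have a frequency-ranked wordlist and a set of words you already know:

```python
# Sketch of a hypothetical "unknown mid-frequency words" metric.
# Assumes a frequency-ranked wordlist (rank 1 = most common) and a
# set of words the learner already knows.
import re

def mid_frequency_score(text, word_ranks, known_words,
                        low=2000, high=9000):
    """Count unique unknown words whose frequency rank falls in the
    mid-frequency range (by default the 3rd to 9th 1,000-word bands)."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return sum(1 for w in tokens
               if low < word_ranks.get(w, 0) <= high
               and w not in known_words)

# Toy example with made-up ranks:
ranks = {"the": 1, "cat": 800, "pensive": 4500, "arduous": 6200}
known = {"the", "cat"}
print(mid_frequency_score("The pensive cat faced an arduous climb.",
                          ranks, known))  # counts "pensive" and "arduous" -> 2
```

A higher score would mean the book gives you more exposure to exactly the words that are otherwise hard to meet often enough.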

TL;DR All in all, his analysis that results in 3 million words of reading uses unrealistic material, and his recommendation of 11 million words of reading comes out of thin air. It’s a great idea for an analysis, for sure, but he could have done it in a better way. Due to his “crude calculations” you really can’t derive much from the exact numbers, apart from the rough sense that you’re looking at millions of words read. Because of this, I personally wouldn’t use the exact numbers derived from this paper as your goal. It’s probably more realistic to aim for goals recommended by experienced language learners, such as noxialisrex, PeterBormann, and several others. If anything, aim for 1 million words read as one of your earlier goals and you’ll have a solid foundation in the language (probably around lower intermediate).

5 Likes

Spicy title!

If you consider things like Accelerated Reader in the US, we actually do have metrics that try to assess how difficult a book might be. One and a half to two years ago, when I was first picking things to read, I consulted https://www.arbookfind.com to get a sense of how hard the text was in English before starting. (I have a friend who uses a similar site for books in Japanese. It shows the unique kanji in a book, compares them to the list of kanji he knows, and reports how many are new and their relative frequency in the book.)

I would encourage us not to have a mindset of targeting “the most common 4,000 word families” or reading 1,000,000 words, reaching B1, and then being done.

The most common word families will come naturally through reading and listening. We see them so often that there is literally no other option. It is the lower-frequency words that benefit from deliberate practice, because we see them far less often.

1,000,000 words read is a nice round number, and we like round numbers. It is a great goal. But why stop there? If you are enjoying the process, enjoying the new culture you are gaining access to, enjoying reading, then keep reading.

9 Likes

I guess the question the paper was really trying to answer was “How much reading do we have to do to acquire enough vocabulary to read most native material unassisted?” (because his previous research indicated that knowing the 9,000 most frequent word families suffices to reach 98% coverage of the running words across a wide range of contexts). In that case, a million words read won’t really get you there. It would likely be closer to your estimates.

People are just looking for goals, as goals are highly motivating. Some of us know that the goalposts will move as soon as we get close to them, but they are motivating nevertheless. Ideally, we’d not worry about the numbers at all, even though LingQ provides them for exactly this reason. At the start, people really want to know when they can get past the learning part and go straight to the enjoying-the-content part. And that’s essentially the number this paper was trying to pin down.

3 Likes

“This means that by reading more realistic beginner material, such as the LingQ mini-stories, you will reach the 2nd 1,000 word family level much faster than his analysis suggests.”
and
“2) This analysis starts counting the exposures of the 9th 1000 word family level from the very beginning, when you start reading Adam Bede. Unrealistic much?”

I think that’s why he decided to use the running count as the more relevant metric. No one can start from scratch, read 3 million words of adult-level novels, and end up with 9,000 word families, not even close. Language learning is most definitely cumulative: you learn a bit, then use that little bit to understand a bit more, wash, rinse, repeat. He assumed no more than 2% new words at any one time (which seems kind of low; I think you can cope with more). So he’s not saying 3 million is enough, just that in theory 3 million words gets you a lot of exposure and is a reasonable amount.

I agree about the corpus: classic novels and fairy tales are terrible. They tend to be tough, outdated, and not particularly relevant to modern usage of the language. But his idea of using software to replace rare words with more common synonyms is pretty brilliant, if it actually works and doesn’t produce weird-sounding texts.

5 Likes

Just an interesting coincidence:
Last year, I read 10,425 pages of French in 11 months, 30 novels in total. Using Nation’s figure of 300 words per page, that’s 3.1 million words, basically the same as his 3 million word figure (although if you google “average words per page” you find a figure of 500 words, so who knows). I can read roughly 30 pages an hour depending on the book, so that’s about 350 hours over 11 months. But it took me years to get to the point of being able to do that.
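For anyone who wants to check the arithmetic, the figures come straight from the post above (the 500 words/page alternative is only the googled guess):

```python
# Back-of-the-envelope check of the reading figures from the post.
pages = 10_425
words_low = pages * 300    # Nation's 300 words/page estimate
words_high = pages * 500   # the googled 500 words/page estimate
hours = pages / 30         # at roughly 30 pages per hour

print(f"{words_low:,} to {words_high:,} words, ~{hours:.0f} hours")
```

So the total is somewhere between about 3.1 and 5.2 million words, and roughly 350 hours of reading time.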

You would likely be able to read 10 million words in the same amount of time as 3 million words if those 10 million words were properly graded upwards in difficulty as you go. That’s assuming you get anywhere at all by starting with an adult-level book, which most people won’t; it’s just far too inefficient.

3 Likes

“I agree about the corpus: classic novels and fairy tales are terrible. They tend to be tough, outdated, and not particularly relevant to modern usage of the language. But his idea of using software to replace rare words with more common synonyms is pretty brilliant, if it actually works and doesn’t produce weird-sounding texts.”

I just mean: if he says a mid-frequency reader “can be produced in a few hours,” why the bloody hell didn’t he just turn those books into mid-frequency readers at varying levels? That would have made his analysis much more useful than a not-very-practical theoretical piece. Then he wouldn’t have had to make all those kind of dodgy theoretical jumps.

1 Like

Do you mean make the mid-frequency readers and use them for his analysis? That would be a quite different analysis.

1 Like

Like I said, we like round numbers :). (Well I do at least.)

Set goals that challenge us, celebrate when we pass them, … and then move the goalposts.

2 Likes

He did; they are on his website for free. But sadly only in English, not in any other languages.

1 Like