Extensive reading for a 9,000-word vocabulary: evidence from corpus modeling

nfera · July 18, 2023, 5:01pm

Extensive reading for a 9,000-word vocabulary: evidence from corpus modeling

Published October 2022

Abstract

This paper contributes to a research program within extensive reading (ER) and Reading in a Foreign Language using corpora to simulate ER input to develop vocabulary through incidental learning to 9,000 words. This helps researchers/teachers evaluate ER. If corpora indicate no ‘pathway’ from smaller to larger vocabulary sizes through authentic ER input, with vocabulary recurrence rates sufficient for incidental learning, then graded readers or other pedagogy appear essential. Studies offer different conclusions due to modeling issues. This study replicates previous research on a larger corpus of general fiction, with improved modeling. For every vocabulary size, a substantial amount of comprehensible fiction is found, with enough repetition of vocabulary from subsequent levels that pathways from smaller to larger vocabulary sizes are possible without graded readers. Prior estimates of approximately 3 years to acquire 9,000 words at 1 hour a day are underestimates, with modeling indicating 2 hours a day would be required.

https://www.researchgate.net/publication/364350617_Extensive_reading_for_a_9000-word_vocabulary_evidence_from_corpus_modeling

TL;DR This modelling expands on Paul Nation’s 2014 paper “How much input do you need to learn the most frequent 9,000 words?” using a much better method. If you add up all the numbers in Table 3, starting from a vocabulary of 2,000 word families (i.e. around a B1/Intermediate 1 level), you are looking at about 21M words of extensive reading to reach 12x encounters of the majority of the top 9,000 word families in English (i.e. you would have roughly the vocabulary of a C2 in English by the time you get there). At an hour per day of reading at 150 wpm, this is about 6.3 years.

EDIT: Adding up all the hours in Table 3, you are looking at ~2,300 hours of reading at 150 wpm. If you increase your reading speed to 200 wpm (e.g. reading while listening to an audiobook at 1.5x), you are looking at ~1,700 hours. From B1 to ~C2 reading/vocabulary in English, this is pretty good, right?
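For anyone who wants to plug in their own reading speed, here is a rough Python sketch of the arithmetic above. It assumes the round 21M-word total from the TL;DR rather than the exact Table 3 figures, and the function names are made up for illustration, so the results only match the quoted numbers to within rounding.

```python
# Rough reading-time arithmetic; 21M words is the approximate total quoted above,
# not the exact sum from Table 3 of the paper.
TOTAL_WORDS = 21_000_000


def reading_hours(total_words: int, wpm: float) -> float:
    """Hours needed to read total_words at a steady words-per-minute rate."""
    return total_words / wpm / 60


def years_needed(hours: float, hours_per_day: float = 1.0) -> float:
    """Calendar years needed when reading hours_per_day every single day."""
    return hours / hours_per_day / 365


if __name__ == "__main__":
    for wpm in (150, 200):
        hours = reading_hours(TOTAL_WORDS, wpm)
        print(f"{wpm} wpm: {hours:,.0f} hours, "
              f"{years_needed(hours):.1f} years at 1 hour/day")
```

With the round 21M figure this comes out a little above 2,300 hours at 150 wpm and a little above 1,700 hours at 200 wpm, which is in line with the Table 3 sums quoted above.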

bbbblinq · July 18, 2023, 9:08pm

I started LingQ about 600 days ago with about 2,000 known words and now have 8,700 known words according to LingQ stats.

asad100101 · July 18, 2023, 9:50pm

Thank you so much for sharing this interesting article. I definitely need to up my reading game. I have only read 7 million words so far. ChatGPT has done the math for me: reading 4 hours daily at 200 wpm, it would take 292 days to finish the remaining 14 million words.
This is something I can sustain over a long period of time. Let’s see how it goes.
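The 292-day figure is easy to double-check without ChatGPT. A minimal sketch, assuming the numbers from this post (14 million words left, 200 wpm, 4 hours a day); the function name is just illustrative:

```python
import math


def days_to_finish(words_left: int, wpm: float, hours_per_day: float) -> int:
    """Whole days needed to read words_left at wpm for hours_per_day each day."""
    words_per_day = wpm * 60 * hours_per_day
    return math.ceil(words_left / words_per_day)


if __name__ == "__main__":
    # Numbers taken from the post above.
    print(days_to_finish(14_000_000, wpm=200, hours_per_day=4))
```

Changing `hours_per_day` shows how quickly the timeline stretches: at 2 hours a day the same 14 million words take about twice as long.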

nfera · July 19, 2023, 8:18am

@bbbblinq

Comparing your LingQ stats with this majority of the 9,000 most frequent word families is like comparing apples and oranges.

Lemmas and word families are two different metrics. I’m not sure what the average ratio is between them for English, but as an example it’s estimated that the average 20-year-old native American English speaker knows 42,000 lemmas, but only 11,100 word families. [1]
Between languages, the number of lemmas required per level changes a lot. Just look at the new LingQ levels. They changed required Known Words because there are significant differences between the languages. The difference between English and Korean is over 2x, whereas the difference between English and Greek is ~1.5x.
The number of word families differs between languages too at their corresponding levels, but not nearly as much as the number of lemmas.
As a general rule, most people probably record lemmas on LingQ, but not exclusively. I, for instance, have included a few proper nouns (e.g. I lingQed the word “Parigi”, which means “Paris” in English, and it has gone to Known) and alternate spellings (e.g. I have marked as Known both “cuore”, meaning heart, and “cuor”, which is a common literary spelling of it). These are probably rather minor in the grand scheme of things though.
On LingQ, you haven’t marked all the words you know as Known. There are many words you know but just haven’t encountered yet while reading, so you haven’t marked them as Known. Really your known number of lemmas is higher than that represented by your LingQ stats, especially if you do lots of listening and/or study away from LingQ. E.g. I read paper books, then add the words read to my stats, but I have never marked any Known Words from those books.
The way the modelling was carried out doesn’t mean it only encountered 9,000 word families a minimum of 12x. In fact, the bare minimum was encountering 800 x 7 = 5,600 word families, starting with a base vocabulary of 2,000 word families. This is because the model had to reach 80% (800) of the high-frequency word families a minimum of 12x at that particular 1000-level before the 1000-level/difficulty of the material was increased. The modelling ended after 800 of the word families at the 8,000 - 9,000 level were encountered 12x. If you look at Table 3, you will see that you end up encountering a lot of other words at least 12x in the process of focusing on one particular level. So by the time the modelling ended, even though the author doesn’t provide the data in their paper, many more word families would have been encountered 12x, such as those at the 10,000-level, the 11,000-level, and beyond. The exact amount, I don’t know, but, as a guess, you are probably looking at a vocabulary size similar to that of the native American English speaker.
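To make the level-up rule concrete, here is a toy Python sketch of how I understand it from the description above. This is not the author’s actual code: the constants come from the numbers in this post, and the function names and toy data are invented for illustration.

```python
from collections import Counter

ENCOUNTERS_NEEDED = 12      # minimum encounters per word family
SHARE_NEEDED = 0.80         # share of a 1000-level that must hit that minimum
FAMILIES_PER_LEVEL = 1000   # word families in each 1000-level


def level_complete(encounters: Counter, level_families: list[str]) -> bool:
    """True once enough families at this level have been met 12+ times."""
    learned = sum(1 for fam in level_families
                  if encounters[fam] >= ENCOUNTERS_NEEDED)
    return learned >= SHARE_NEEDED * len(level_families)


def minimum_families(first_level: int = 3, last_level: int = 9) -> int:
    """Bare-minimum families met 12x across levels first_level..last_level."""
    levels = last_level - first_level + 1
    return round(levels * SHARE_NEEDED * FAMILIES_PER_LEVEL)


if __name__ == "__main__":
    # Toy level with 10 families, 8 of them already met 12 times.
    toy_level = [f"family{i}" for i in range(10)]
    seen = Counter({fam: 12 for fam in toy_level[:8]})
    print(level_complete(seen, toy_level))   # 8 of 10 meets the 80% bar
    print(minimum_families())                # 7 levels x 800 families
```

The second function is just the 800 x 7 = 5,600 calculation, counting the seven 1000-levels from the 3rd to the 9th on top of the 2,000-family base.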

TL;DR You can’t compare your LingQ stats and the results of this paper. If you are looking for a Known Word number on LingQ to aim for in Greek, go for the LingQ Advanced 2 level of 52k.

[1] https://www.frontiersin.org/articles/10.3389/fpsyg.2016.01116/full