Word count in different languages

The CC formula follows directly from Heaps' law, which I've written above as Eqn (1):

V_lang = V_Ref_lang * f^beta_lang
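
Just to make Eqn (1) concrete, here is a minimal sketch of it in Python; the parameter values are made-up placeholders for illustration, not measured LingQ numbers.

```python
# Heaps' law, Eqn (1): V_lang = V_Ref_lang * f**beta_lang
# The values below are purely illustrative placeholders.
V_REF = 7000   # hypothetical vocabulary of the entire Reference text
BETA = 0.8     # hypothetical Heaps' exponent for the language

def vocabulary_at(f, v_ref=V_REF, beta=BETA):
    """Unique-word count expected after reading a fraction f of the Reference."""
    return v_ref * f ** beta

print(round(vocabulary_at(0.5)))   # expected vocabulary after half of the Reference
```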

Now, how could the two parameters, V_Ref_lang and beta_lang, be ‘measured’ for an arbitrary language?
(To keep the formulas shorter, I'll omit the subscript _lang for a while.)

Measure V1 and V2, the vocabulary counts (unknown words on an empty LingQ account), at two fractions f1 and f2 of the Reference (for every language). Eqn (1) gives:

ln(V1) = ln(V_Ref) + beta * ln(f1)
ln(V2) = ln(V_Ref) + beta * ln(f2)

Solving these equations for V_Ref and beta gives:

beta = ln(V2/V1) / ln(f2/f1) (2)

ln(V_Ref) = ln(V1) - (ln f1) * [ln(V2/V1) / ln(f2/f1)] (3)

(I used Eqn (2) in the Master and Margarita example above, approximating f2/f1 by the ratio of chapter counts, 32/23.)
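
For anyone who wants to play with Eqns (2) and (3), here is a minimal sketch in Python; the two measurement points are made-up numbers, not actual LingQ counts.

```python
import math

def heaps_parameters(f1, V1, f2, V2):
    """Fit beta and V_Ref from two vocabulary counts V1 and V2
    taken at fractions f1 and f2 of the Reference (Eqns (2) and (3))."""
    beta = math.log(V2 / V1) / math.log(f2 / f1)       # Eqn (2)
    ln_v_ref = math.log(V1) - math.log(f1) * beta      # Eqn (3)
    return beta, math.exp(ln_v_ref)

# Made-up counts: 2600 unique words at 25% of the Reference, 5200 at 75%.
beta, v_ref = heaps_parameters(0.25, 2600, 0.75, 5200)
print(f"beta = {beta:.2f}, V_Ref = {v_ref:.0f}")
```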

It is convenient to choose f2 = 1, that is, the entire Reference. It gives:

V2 = V_Ref - you’d measure this directly (2)
beta = - ln f1 * ln(V_Ref/V1) (3)

There is still freedom in choosing where to take the first measurement. Here it is better to look at the actual data taken at a sequence of small f. The fact is, Heaps' law, being a statistical law, only becomes accurate for a fairly large text. (You can easily check, using some example, that it does not hold with only, say, 3, 5, 10, 20, or 100 words of text.) For "War and Peace" I'd say one could safely use Heaps' law starting from f = 0.1. Since Steve's book is smaller, I don't know.

On the other hand, throwing out the data at small f without looking at it might also be unwise. That is exactly the range of counts where most beginners are. You at LingQ have already played with it and must have better intuition.

I had anticipated that it would only be possible to catch the difference in the Heaps' parameters between Russian and English, as the most pronounced one, and that the other languages would be barely distinguishable because of statistical uncertainty. However, if the three sets of your numbers already reflect the data, then I am fairly happy! They are quite distinguishable!

I know I write with errors, confusion, and omitted words. Feel free to ask questions; answering them will teach me English. Also feel free to close this thread; you are not obliged to implement this :wink:

Sorry to barge in so late, at the stage of scary formulas, but wouldn't it be easier to get the number of words an average "fluent" person in each language needs to know, and then derive the beginner, intermediate, etc. levels from there by means of

total_word_count / N_of_levels

Of course, one would need to get the total_word_count from somewhere.

Yuri, you are the best!

That seems a fully reasonable way to derive the proficiency levels, if you could practically get, from somewhere, the total_word_count that an average "fluent" person needs to know in every language. To complicate things a bit (though not in a fundamental way), LingQ operates on all distinguishable words rather than word families or lemmas; it is the latter that are usually meant in lists of words "a fluent person needs to know".

However, the narrow problem in focus, at least the one Steve started this thread with, is different: a user has a word count of 2864 in French. What would her equivalent (I use the term "matched") word count in Russian have been, given matched vocabulary acquisition, so to speak?
OK, 2864 is intermediate for French. OK, you have conventionally assigned the range between 2000 and 5000 to the Russian intermediate level as well. Does that help you find a proper match to the French 2864?
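
Here is one way I would read "matched", sketched in Python: assume the learner has effectively read the same fraction f of the Reference in both languages, invert Eqn (1) for French to find that f, and then evaluate Eqn (1) for Russian at the same f. Apart from the 2864 quoted above, every number below is a placeholder Heaps' parameter, not a value measured in this thread.

```python
def matched_count(v_known, v_ref_from, beta_from, v_ref_to, beta_to):
    """Invert Eqn (1) for the first language to get the fraction f of the
    Reference read, then evaluate Eqn (1) for the second language at that f."""
    f = (v_known / v_ref_from) ** (1 / beta_from)
    return v_ref_to * f ** beta_to

# 2864 is the French count from the thread; the Heaps' parameters are placeholders.
print(round(matched_count(2864, v_ref_from=5000, beta_from=0.75,
                          v_ref_to=6900, beta_to=0.85)))
```
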
Many tests have arbitrarily set levels. When I immigrated to Canada, I needed an IELTS or something like it. The manual claimed that it takes about 3 months of full-day preparation to advance a level. I doubt that the officials who made that claim ever tested it.

However, Heaps' law seems to give a solid basis for 'scientifically' predicting how much reading (or LingQing) it would take to accumulate the next portion of unique vocabulary. In short, if you have just reached vocabulary V by reading through a mass T of text, you will need to read another T*1.25 to bring your vocabulary up to 2V (provided the text is at the same level). Much more exact predictions are possible. Did we know that before Mr. Heaps?

Correction: you will need to read another T*1.378 to bring your vocabulary up to 2V (provided the text is at the same level and beta_lang = 0.8).
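
The arithmetic behind the corrected factor, as I understand it: if the mass of text read plays the role of the fraction f in Eqn (1), then V grows as T^beta, so doubling V takes T * 2^(1/beta) of text in total, i.e. another T * (2^(1/beta) - 1). A tiny Python check:

```python
def extra_reading_to_double(T, beta):
    """Additional text mass needed to go from vocabulary V to 2V,
    assuming V grows as T**beta (Heaps' law)."""
    return T * (2 ** (1 / beta) - 1)

print(round(extra_reading_to_double(1.0, 0.8), 3))   # 1.378 for beta = 0.8
```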

Sorry to interrupt –

Beginner 1 up to 2,500 known words
Beginner 2 up to 5,000 known words
Intermediate 1 up to 10,000 known words
Intermediate 2 up to 15,000 known words
Advanced 1 up to 20,000 known words
Advanced 2 up to 25,000 known words and beyond

Are the above really the approximate word goals? If so, I have a lot of words to learn to match my estimated competency.

The numbers may be even higher for German. I suggest you step up your activity index.

But then you may know a lot more words than is indicated at LingQ.

Yes, I think there are many words I simply have yet to stumble across. Very interesting (and motivating) to have a metric to work towards.

We’re considering changing the current numbers for English as well, as they don’t accurately reflect our levels in the Library or realistic targets. We’re thinking of something like this:
Beginner 1 - 500
Beginner 2 - 1500
Intermediate 1 - 5000
Intermediate 2 - 10000
Advanced 1 - 17000
Advanced 2 - 25000

We would then just multiply these by the appropriate coefficient to get the targets for the other languages.
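
If it helps to see the "multiply by a coefficient" step spelled out, here is a minimal sketch in Python; the 1.4 coefficient below is an invented placeholder, not a LingQ value.

```python
# The proposed English targets from above.
english_targets = {
    "Beginner 1": 500,
    "Beginner 2": 1500,
    "Intermediate 1": 5000,
    "Intermediate 2": 10000,
    "Advanced 1": 17000,
    "Advanced 2": 25000,
}

def scaled_targets(coefficient):
    """Known-word targets for a language with the given coefficient."""
    return {level: round(count * coefficient) for level, count in english_targets.items()}

print(scaled_targets(1.4))   # 1.4 is a made-up example coefficient
```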

Alex and Mark are much gentler and kinder than I am, but I defer.

Ilya, the basic coefficients were taken from the number of unique words in Steve’s book across the languages. The results showed that the increase was not proportional, as we had expected, but rather continued to increase exponentially at each successive segment (10 chapters).

Thanks Alex,

And you seem to have the full Heaps' description for these unique words, and a way to 'measure' the parameters of that description in any language. In particular, you can predict how the coefficients will increase further, beyond where you have determined them, that is, where they are for learners who accumulate a vocabulary of the size bigger than the Steve's book.

One correction for a hypothetical reader of my formulas.

beta = - ln(V_Ref/V1) / ln(f1) (instead of the wrong - ln f1 * ln(V_Ref/V1), marked as the last (3))

Here beta is the Heaps' exponent for a language; V_Ref is the vocabulary count in the entire Reference, that is, the whole of Steve's book in that language (for example, V_Ref = 6894); f1 is the dimensionless fraction of Steve's book at which you take the first vocabulary count (for example, f1 = 0.3); and V1 is the value of that first vocabulary count (for example, 1035).
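
As a quick check, here is the corrected formula applied to those example values in a small Python sketch; the three numbers are just the ones quoted above.

```python
import math

# Corrected formula for f2 = 1 (the entire Reference): beta = -ln(V_Ref/V1) / ln(f1)
v_ref, f1, v1 = 6894, 0.3, 1035   # example values quoted above
beta = -math.log(v_ref / v1) / math.log(f1)
print(f"beta = {beta:.2f}")
```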

“where they are for learners who accumulate a vocabulary of the size bigger than the Steve’s book.” => “where they will be for a learner who will have accumulated a vocabulary of the size bigger than in the Steve’s book.”

Now the question arose over at HTLAL, and member ellasevia found these numbers (according to the CEFR scale):
"ENGLISH
A1: <1500
A2: 1500-2500
B1: 2750-3250
B2: 3250-3750
C1: 3750-4500
C2: 4500-5000

FRENCH
A1: 1160
A2: 1650
B1: 2422
B2: 2630
C1: 3212
C2: 3525

GREEK
A1: 1486
A2: 2237
B1: 3288
B2: 3956
[no numbers were given for C1 or C2 in Greek]"

(Source: CEFR, Numbers and Percentages? (Immersion, Schools & Certificates) Language Learning Forum )

I’m quite sure that the numbers above don’t include inflected words.

I suspect these are word families, not the way we count at LingQ.