The CC formula follows directly from Heaps’ law, which I wrote above as Eqn. 1:
V_lang = V_Ref_lang * f^beta_lang
Now, how could the two parameters, V_Ref_lang and beta_lang, be ‘measured’ for an arbitrary language?
(To keep the formulas short, I’ll omit the subscript _lang for now.)
Measure V1 and V2, the vocabulary counts (unknown words on an empty LingQ account), at two fractions f1 and f2 of the Reference, and do this for every language. Eqn. 1 gives:
ln(V1) = ln(V_Ref) + beta * ln(f1)
ln(V2) = ln(V_Ref) + beta * ln(f2)
Solving these two equations for beta and V_Ref gives:
beta = ln(V2/V1) / ln(f2/f1) (2)
ln(V_Ref) = ln(V1) - ln(f1) * [ln(V2/V1) / ln(f2/f1)] (3)
(I used Eqn. 2 in the Master and Margarita example above, approximating f2/f1 by the ratio of the chapter numbers, 32/23.)
It is convenient to choose f2 = 1, that is, the entire Reference. Then ln(f2) = 0, and Eqns. 2 and 3 simplify to:
V_Ref = V2 (you’d measure this directly) (4)
beta = - ln(V_Ref/V1) / ln(f1) (5)
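Here is a minimal Python sketch of the two-point estimate, only to make the arithmetic concrete. It uses beta = ln(V2/V1) / ln(f2/f1) and ln(V_Ref) = ln(V1) - beta * ln(f1); with f2 = 1 this reduces to V_Ref = V2 and beta = ln(V_Ref/V1) / ln(1/f1). The function names are my own invention, and the numbers in the example are made up purely for illustration; they are not LingQ data.

```python
import math


def heaps_params(f1, v1, f2, v2):
    """Two-point estimate of (beta, V_Ref) for V = V_Ref * f**beta.

    f1, f2: fractions of the Reference text (0 < f1 < f2 <= 1).
    v1, v2: vocabulary counts measured at those fractions.
    """
    if not (0 < f1 < f2 <= 1):
        raise ValueError("need 0 < f1 < f2 <= 1")
    if v1 <= 0 or v2 <= 0:
        raise ValueError("vocabulary counts must be positive")
    beta = math.log(v2 / v1) / math.log(f2 / f1)
    v_ref = math.exp(math.log(v1) - beta * math.log(f1))
    return beta, v_ref


def predicted_vocab(f, beta, v_ref):
    """Vocabulary count predicted by Heaps' law at fraction f."""
    return v_ref * f ** beta


# Made-up example: 5000 unique words at a quarter of the Reference,
# 10000 unique words for the whole Reference (f2 = 1).
beta, v_ref = heaps_params(0.25, 5000, 1.0, 10000)
print(beta, v_ref)
```

With these made-up inputs beta comes out at about 0.5, and V_Ref equals the count measured on the whole Reference, as expected when f2 = 1.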
There is still freedom in where to take the first measurement. Here it is better to look at actual data taken at a sequence of small f. The fact is that Heaps’ law, being a statistical law, holds only for a fairly large text. (You can easily check on an example that it fails for a text of only, say, 3, 5, 10, 20 or 100 words.) For “War and Peace” I’d say one could safely use Heaps’ law starting from f = 0.1. Steve’s book is shorter, so there I don’t know.
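One possible way to look at such a sequence, sketched in Python under my own assumptions: compute the exponent between each pair of consecutive measurements. Where Heaps’ law holds, these local values should stay roughly constant; where they drift, f is probably still too small. The function name is hypothetical, and this is only a rough check, not a statistical test.

```python
import math


def local_betas(fractions, counts):
    """Heaps exponent estimated between consecutive (f, V) measurements.

    fractions: fractions of the Reference text, each in (0, 1].
    counts:    vocabulary counts measured at those fractions.
    Returns one beta per consecutive pair, ordered by increasing f.
    """
    pairs = sorted(zip(fractions, counts))
    betas = []
    for (f_a, v_a), (f_b, v_b) in zip(pairs, pairs[1:]):
        # Slope of ln(V) against ln(f) between two neighbouring points.
        betas.append(math.log(v_b / v_a) / math.log(f_b / f_a))
    return betas
```

You would feed it the measured counts at, say, f = 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1 and see from which f onward the values settle.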
On the other hand, throwing out the data at small f without looking at them might not be wise either. That is exactly the range of counts where most beginners are. You at LingQ have already played with it and must have a better intuition.
I had anticipated that it might only be possible to catch the difference in the Heaps parameters between Russian and English, as the most pronounced one, and that the other languages would be barely distinguishable because of the statistical uncertainty. However, if your three sets of numbers already reflect the data, then I am fairly happy! They are quite distinguishable!
I know I write with errors, confusion and omitted words. Feel free to ask questions. Answering them will teach me English. Feel free also to close this topic: you are not obliged to implement this.