Alex, your numbers seem very reasonable to me! Though please clarify:
"We split it up into three levels: Beginner, Intermediate and Advanced.
[Q: what do you split into three levels, Steve’s book or the ratios of the counts that you are applying for the three levels?]
Here are the rounded numbers that we got:
English = 1, 1, 1
Spanish, French, Portuguese, Italian, Japanese, Swedish = 1.15, 1.25, 1.3
German, Chinese = 1.35, 1.45, 1.5
Russian = 1.7, 2.1, 2.2
Korean = 1.8, 2.2, 2.5 "
Q: Are these “measured” numbers (or do they reflect someone’s, maybe my, intuition)? How did you get them? Like Paramecium or differently?
I like these 3 sets of numbers. They agree with what one would expect from Heaps’ law. For every language except the one conventionally chosen as the reference (English, in your case), the ratios of the counts are expected, in theory, to change slowly with the number of counts (that is, with user experience, or vocabulary size).
The difference is that with Heaps’ law, once its parameters are known, you don’t have to define user experience levels arbitrarily at all. Heaps’ law automatically changes the ratios of the counts as the counts accumulate.
If the English counts are used as the ‘normalized’ counts, as in your case, and the user’s current English count is V_eng, then the matching count in any other language is to be computed as:
V_lang = V_lang_Ref * (V_eng / V_eng_Ref) ^ (beta_lang / beta_eng)     [Count Conversion (CC)]
In the CC formula, V_lang_Ref and V_eng_Ref are the vocabulary (LingQ) counts for the entire Reference document (let it be Steve’s book) in the ‘converted’ language and in English, respectively; the betas are the Heaps’ exponents for the two languages. These parameters must be measured. Later in the day I’ll describe again how these parameters are to be measured.
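To make the CC formula concrete, here is a minimal Python sketch. The function name `convert_count` and all the numeric values below are made up purely for illustration; they are not measured Heaps’ parameters for any language, and the real ones would have to come from the Reference document.

```python
def convert_count(v_eng, v_eng_ref, v_lang_ref, beta_eng, beta_lang):
    """Count Conversion (CC): map an English count to the matching
    count in another language, following the formula above."""
    if v_eng <= 0 or v_eng_ref <= 0 or beta_eng <= 0:
        raise ValueError("counts and beta_eng must be positive")
    return v_lang_ref * (v_eng / v_eng_ref) ** (beta_lang / beta_eng)


# Placeholder parameters (assumptions, NOT measured values):
V_ENG_REF = 10_000   # English count for the whole Reference document
V_LANG_REF = 20_000  # count for the same document in the other language
BETA_ENG = 0.50      # Heaps' exponent for English
BETA_LANG = 0.60     # Heaps' exponent for the other language

if __name__ == "__main__":
    # The ratio V_lang / V_eng is not fixed: it drifts as counts accumulate.
    for v_eng in (1_000, 5_000, 10_000):
        v_lang = convert_count(v_eng, V_ENG_REF, V_LANG_REF, BETA_ENG, BETA_LANG)
        print(v_eng, round(v_lang), round(v_lang / v_eng, 2))
```

With these placeholder values beta_lang is larger than beta_eng, so the ratio of the counts grows with the English count, which is the same direction as the Beginner, Intermediate, Advanced numbers quoted above. At V_eng = V_eng_Ref the formula returns V_lang_Ref by construction.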
“…it’s maybe a little above and beyond what we were looking to go with” - I agree, and I understand it is not what attracts users. It’s fully up to LingQ.
However, this does not repel users either. And it seems to describe how these vocabulary counts turn out to work in real life. And LingQ may be the first in the world to come up with a language-independent vocabulary characterization, and with a determination of the Heaps’ parameters for a wide set of languages. (Actually, I haven’t researched much, just a bit more than I have cited. Still, everybody seems to cite the same two Russians when it comes to the actual dependence of the Heaps’ parameters on the language.)