Word count in different languages

The Hips’ law (if some academic soul is still interested) is in a big blue font.

Now, just for the sake of illustration, let

beta_Russian = 0.55
beta_English = 0.45

Given the same number of words N in Russian and in English texts, the ratio of the unique words will scale with N as:

V_russian / V_english = (K_russian / K_english) * N^0.1

At very large N (“War and Peace”?) the most important factor is N^0.1. It predicts a slow increase in the ratio of the unique words (= vocabulary = LingQ count).

(In this analysis I have neglected that N in the original Russian and in the English translation of War and Peace are not the same. Taking that into account would not change the conclusion: the ratio of LingQ counts for different languages should be expected to slowly diverge with the number N of words in the compared texts.)
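For anyone who prefers numbers to formulae, here is a tiny Python sketch of that divergence. The K values in it are invented (and taken equal, so that only the N^0.1 factor matters); treat it as an illustration, not a measurement:

```python
# The K values below are invented just to have something to plug in; with
# equal K's the ratio is simply N**0.1, the only factor that matters here.
K_RUSSIAN, BETA_RUSSIAN = 10.0, 0.55
K_ENGLISH, BETA_ENGLISH = 10.0, 0.45

def unique_words(k: float, beta: float, n: int) -> float:
    """Heaps' law: V = K * N**beta."""
    return k * n ** beta

for n in (1_000, 10_000, 100_000, 500_000):  # text lengths in words
    ratio = unique_words(K_RUSSIAN, BETA_RUSSIAN, n) / unique_words(K_ENGLISH, BETA_ENGLISH, n)
    print(f"N = {n:>7,}   V_rus / V_eng = {ratio:.2f}")
# the ratio grows slowly, from about 2.0 at N = 1,000 to about 3.7 at N = 500,000
```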

The above was my first guess, based on just the second link I looked through today, so I wouldn’t bet on it :-)

I am afraid that we will be a little less scientific in our determination of the differences, unless, Ilya, you can come up with a scientifically verifiable solution for all of our languages. :))

Well, the prediction holds for any pair of LingQ’s languages. The ratio of LingQ counts for any pair of languages should slowly diverge with N, the number of words in the compared texts. Though the “divergence power” beta1 - beta2 will be different for each pair.

The most scientifically proven solution I see is not to tell this story to users :wink: Just take some coefficients that make sense, e.g. as you or Paramecium have suggested. I’d like to offer an appropriate joke here, but my memory is getting loose…

If somebody indeed wants to come up with something “more scientific”, one can try to “rank” all of LingQ’s languages in order, using the Hip’s law.

Again, the law predicts:

LingQ_count = K * N^beta,

where N is the total number of words in the analyzed text, or a variety of texts, and K and beta are specific to a given language.

At small N, I believe, the picture may be confused by the variation in the coefficient K (as it varies more wildly than beta and thus may turn out to be text-sensitive even within the same language). However, at large N, the factor N^beta will dominate the picture.

And therefore, the language having the biggest beta may be considered as having the “highest count capacity” or “highest unique vocabulary capacity”. So one can try to rank all of LingQ’s languages (those represented with big enough corpora) in terms of beta.

One can measure beta for various languages by drawing log plots of the counts versus N. The internet link I’ve given above teaches how. Perhaps things like that have already been done. Paramecium, you seem to be already able to roughly estimate beta_english and beta_russian, substituting the number of chapters for N.
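In case it helps, here is a rough Python sketch of such a measurement. The counts in it are placeholders, not real data; the point is only that Heaps’ law, V = K * N^beta, becomes a straight line in log-log coordinates, so beta is just the slope of a linear fit:

```python
import numpy as np

# Placeholder measurements: cumulative text length in words, and the unique
# words (a LingQ-style count on an empty account) seen by each point.
n_words = np.array([5_000, 20_000, 80_000, 320_000])
v_unique = np.array([2_100, 5_400, 13_800, 35_000])

# Heaps' law V = K * N**beta is linear in log-log coordinates:
# log V = log K + beta * log N, so beta is the slope of the fitted line.
beta, log_k = np.polyfit(np.log(n_words), np.log(v_unique), 1)
print(f"estimated beta = {beta:.2f}, K = {np.exp(log_k):.1f}")
```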

Erratum
If someone googles it, they should search for Heaps’ law rather than Hip’s law.

Other mistakes were also noticed :wink:

Two Russians covered their hips under heaps of paper and indeed measured the beta coefficients:

http://www.springerlink.com/content/3pfylb1vw03q7950/

They reported beta_Russian = 0.84 and beta_English = 0.79 (beta is denoted as h in their short paper). Though the Russians were apparently funded by Mexico, they were too lazy to deal with Spanish. They condescendingly mentioned that Spanish, according to their unpublished data, lies between Russian and English. Their work hasn’t, however, been noticed to attract users to LingQ.

I always knew that I needed more paper when I did handwriting in either German or Russian as opposed to my paper consumption when I write in English. So now you have found a way to explain it…

Will the measurements of our hips increase from sitting and studying all these heaps of minutely measured methods you are offering us?

Sanne, definitely not, until we are funded by Mexico. And then we could dig our hips in the beach.

On the other hand, these Zipf-like laws are indeed curious. I’ll cite Helen from this thread:

" I lingQ all forms of a word as I meet them, and after 2 years am still coming across different inflections of words I met in “Who is She?” right at the beginning. In other words, the more learned LingQs and known words I achieve, the higher the proportion of variants I have encountered."

And how many times has Steve expressed a similar surprise?

The family of Zipf’s laws allows one to “explain” what has surprised them (and, in fact, allows one to see that they do not always formulate their surprise correctly. Hi Helen :slight_smile: )

Heaps’ law initially looks destructive to Steve’s original idea (as I understand it): to come up with some language-independent characteristic of a learner’s vocabulary level. Namely, Heaps’ law says: don’t just compare two numbers, such as the LingQ counts in two languages, because the numbers will diverge with the learner’s experience.

Below is an attempt to come up with a constructive idea. It suggests a language-independent characterization of the vocabulary level. However, it does so in a somewhat ugly way, in terms of “Vocabulary as a Fraction of a Reference”. Maybe somebody will come up with something less ugly. Still, my approach accommodates Heaps’ law in the simplest possible way, throwing non-significant details out.

I borrow from N the idea of a big piece of literature whose unabridged translations into all the LingQ languages must be available. For the purpose, “War and Peace” seems to me better than “Harry Potter”, mostly because versions of Harry Potter, as I once heard from Helen, may feature non-professional, abridged or “hacked” translations.

  1. Rationale.

To visualize this approach, imagine two or more learners who started learning different languages from scratch at LingQ. They started from the beginning of War and Peace, each in his or her target language, and were saving (LingQing) all distinguishable words they met along their reading. Now each has reached, say, 1/3 of War and Peace in his or her language. Though their LingQ counts are different, as we know, their vocabulary levels can be considered matched. Indeed, they have gone through exactly the same dialogues and descriptions of the same things and concepts.

Therefore, I am trying to match the size, V_lang, of the user’s current vocabulary count (the known words + the saved LingQs) with a universal, language-independent Fraction f of the Reference. (The actual user may or may not have read the Reference – it doesn’t matter.)

With this approach, you’d say to a learner of French – your vocabulary is currently at somewhere around 1/3 of the Reference. And you’d say to a learner of German – your vocabulary is (coincidentally) also about 1/3 of our Reference, so you two are at matched levels. And you’d say to a user of Spanish – you are advanced, your vocabulary is already near 0.9 of our Reference. And when the user asks – what is your Reference? – you’d say – War and Peace; and, if your current Spanish is at 0.9, then your vocabulary is like that of someone who has covered 0.9 of War and Peace in Spanish. Note that such things as 1.2 of the Reference also make full sense.

  2. The idea:
    TO BE CONTINUED

“I borrow from N the idea of a big piece of literature” should read: “I borrow from Paramecium and Jamie the idea of a big piece of literature”.

CONTINUED

  2. The idea:

Construct a dimensionless, language-independent function v(f) of the vocabulary accumulated by a reader of the Reference, as a function of the fraction f of the document having been read.

The function v is normalized to 1. Namely, v changes from 0 to 1 as f changes from 0 to 1. Such a function would be called a scaling function in many areas.

The implementation is much easier than the Rationale:

v(0) = 0 – no vocabulary accumulated if nothing is read
v(1) = 1 – the entire vocabulary is accumulated if the entire Reference is read

v(f) = f^beta_lang for 0 < f < 1     (1)

The third line is Heaps’ law. It automatically satisfies the first and second lines at f = 0 and f = 1. [f^beta means f to the power of beta.]

Now, as we have seen, beta depends on the language, but not that sensitively. In the simplest and immediately language-independent approximation, let us assume beta = 0.8 – for EVERY language.
(beta = 0.8 is not close to the value 0.5 reported in my first link above, but it is close to the beta_russian and beta_english found by the two Russians from my second link in the above posts.)

  3. Algorithm (the order is important)

3.1. Compute, just once, the size of the unique vocabulary (e.g. the LingQ count of new words on an empty account) for the entire War and Peace in each of LingQ’s languages. Denote these values as:

Vref_lang – the size of the unique vocabulary of the Reference in a given language.

3.2. Normalize the current user’s vocabulary in a language, V_lang (the number of LingQs + known words), as follows:

v = V_lang / Vref_lang, (0 < v < 1)

Find the fraction f from Eqn (1) by substituting beta = 0.8 into it. It gives:

f = v^(1/beta) = v^1.25.

3.3. Say that the user is currently at the Fraction f of the Reference. Done!
3.4. Award Ilya with exp(Vref_lang) LingQ points.
3.5. Check the algorithm.

Correction to step 3.2 of the algorithm. The limitation 0 < v < 1 is unnecessary. As I said above, a fraction f > 1 makes full sense – it means someone’s vocabulary is richer than that of War and Peace (or another chosen Reference). Therefore, v can be > 1.
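To make steps 3.1–3.3 concrete, here is a minimal Python sketch. The Vref values in it are invented placeholders; the real ones would come from step 3.1, i.e. from running the whole of War and Peace through an empty account:

```python
# The Vref values are invented placeholders; the real ones come from step 3.1,
# i.e. from LingQing the entire War and Peace on an empty account.
VREF = {
    "english": 17_000,
    "russian": 32_000,
}
BETA = 0.8  # the common Heaps' exponent assumed for every language

def reference_fraction(v_lang: float, language: str) -> float:
    """Steps 3.2-3.3: normalize the user's count and invert Eqn (1), f = v**(1/beta)."""
    v = v_lang / VREF[language]   # per the correction above, v may exceed 1
    return v ** (1.0 / BETA)      # 1 / 0.8 = 1.25

# e.g. a Russian learner with 9,500 known words + LingQs
print(f"the user is at {reference_fraction(9_500, 'russian'):.2f} of the Reference")
```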

Wow, Ilya, you’ve certainly done some work on this !
I will need to go to Albert Heijn for some beers later and try to digest what you’ve come up with.

Jamie, are you still at the beers? I need your response, folks!

Maybe I wasn’t good at explaining what this approach achieves. It’s a bit late, so what if I just repeat what it does, but re-collect all the hows and the formulae later?

(1) First, Steve, it solves the problem you have raised. Given the vocabulary count in one language, we can turn it into an equivalent count in any other of LingQ’s languages:

For example, if a user’s vocabulary count is 2856 in Spanish (V_span = 2856), this is readily transformed into the equivalent vocabulary count for English, V_engl = 1995, or for Russian, V_russ = 3292, etc. (The numbers are for the sake of illustration. By vocabulary count I mean the number of all distinguishable words. In LingQ terms, the user vocabulary count = the number of known words + the number of yellow LingQs. Please correct me if you mean something different.)

(2) Second, this transformation of the counts is done, as far as I can see, conceptually right. It accommodates Heaps’ law and automatically describes such effects as the divergence in the ratio of LingQ counts for different languages (as was observed, e.g., by Paramecium).

(3) This approach introduces the concept of a Reference vocabulary. Namely, the current size of the user’s vocabulary in a language is compared to the size which would have been accumulated by reading a fraction of the Reference, from the beginning, in the same language. And here we make a subtle assumption. We say that the different vocabulary sizes, in the different languages of the Reference, are matched, given the same fraction of the Reference having been read in every language.

I first thought of this Reference vocabulary as an ugly auxiliary thing, needed only to ground the transformation of counts between languages. But now I retract that. It is not ugly. The Reference vocabulary is a “tangible” thing.

If you think about it, the Reference vocabulary is more tangible than the absolute values of the LingQ counts. We are just accustomed to the LingQ counts. However, they count separately such things as “read”, “reads”, “reader”, which at least at an advanced level (only?) are perceived by the brain together. If you think about it, you’d agree that the absolute values of such numbers as V_engl = 1995, before or after the transformation between the languages, are fairly conventional.

(4) With this approach, you’d say to a learner of French – your vocabulary is currently at 1/3 of the Reference. And you’d say to a learner of German – your vocabulary is (coincidentally) also at 1/3 of the Reference, so you two are at matched levels. And you’d say to a user of Spanish – you are advanced, your vocabulary is already at 0.9 of the Reference. And when the user asks – and what is the Reference? – you’d say – War and Peace! And, if your current Spanish is at 0.9, then your vocabulary is of the same size as that of someone who has covered 0.9 of War and Peace in Spanish from scratch. (Actually, you need not even specify that War and Peace was read in Spanish, because we agreed that everyone who has read 0.9 of War and Peace has the matched – the same “tangible” – vocabulary, no matter the language.) Note that such things as 1.2 of the Reference make full sense – they mean that the size of someone’s vocabulary is 1.2 times bigger than the Reference vocabulary.

I have a lot of trouble grasping what you’re saying, Ilya :stuck_out_tongue: I’m sure it’s all good stuff – the perfect formula, if you will – but it’s maybe a little above and beyond what we were looking to go with. Plus, it helps if we can understand it so it’s easier to explain as well :slight_smile:

We counted the number of new words in Steve’s book (which has been translated into around 9 of the 11 languages we offer) and came up with some preliminary numbers that we think might work.

We split it up into three levels: Beginner, Intermediate and Advanced. Here are the rounded numbers that we got:
English = 1, 1, 1
Spanish, French, Portuguese, Italian, Japanese, Swedish = 1.15, 1.25, 1.3
German, Chinese = 1.35, 1.45, 1.5
Russian = 1.7, 2.1, 2.2
Korean = 1.8, 2.2, 2.5

In effect we would take the current Known Words levels for English (or perhaps alter them as well), then use those numbers as a base for each calculation. 20,000 words for English would then equate to 44,000 in Russian, supposing this number was for “Advanced 1”. Please let us know what you think.
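To make the arithmetic concrete, the scheme is just a per-language, per-level multiplier applied to the English thresholds. A minimal Python sketch, with only a few of the languages copied in from the list above:

```python
# Only a few languages are copied in; the tuples are the proposed
# (Beginner, Intermediate, Advanced) multipliers relative to English.
MULTIPLIERS = {
    "english": (1.0, 1.0, 1.0),
    "spanish": (1.15, 1.25, 1.3),
    "german":  (1.35, 1.45, 1.5),
    "russian": (1.7, 2.1, 2.2),
    "korean":  (1.8, 2.2, 2.5),
}
ADVANCED = 2  # index of the Advanced column

def equivalent_threshold(english_threshold: int, language: str, level: int) -> int:
    """Scale an English Known Words threshold into another language."""
    return round(english_threshold * MULTIPLIERS[language][level])

print(equivalent_threshold(20_000, "russian", ADVANCED))  # -> 44000
```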

I also want to note that the values of the exponents in Heaps’ law that follow from your approximate measurements are already in fair agreement with those of the two Russians.

You mentioned that for the Russian version of Master and Margarita, the vocabulary counts for chapters 32 and 23 were respectively 23363 and 18651, while for the English version, the two numbers were respectively 10451 and 8793. This gives:

beta_russian = ln(23363/18651) / ln(32/23) = 0.87
beta_english = ln(10451/8793) / ln(32/23) = 0.82

which is very close to the quantitative result of the Russians, with the same qualitative effect: beta_russian > beta_english. Beta is the exponent in Heaps’ law.

Also note that the Russians did not deal with matched (translated) versions in Russian and English. They used absolutely unconnected texts in the two languages. Heaps’ law holds for any texts in natural languages. However, I suspect that if we dealt with translated versions in two or more languages, the statistical errors and fluctuations would be smaller, due to the similarity of the content, and one would need much less work than the Russians did.

@ Alex

Thank you for the comment, I’ll come back to it tomorrow.

Sorry, my longer comment was addressed to Paramecium, not Alex. Alex, I’ll return to yours.

Alex, your numbers seem very reasonable to me! Though please clarify:

"We split it up into three levels: Beginner, Intermediate and Advanced.

[Q: what do you split into three levels – Steve’s book, or the ratios of the counts that you apply for the three levels?]

Here are the rounded numbers that we got:
English = 1, 1, 1
Spanish, French, Portuguese, Italian, Japanese, Swedish = 1.15, 1.25, 1.3
German, Chinese = 1.35, 1.45, 1.5
Russian = 1.7, 2.1, 2.2
Korean = 1.8, 2.2, 2.5 "

Q: Are these “measured” numbers (or do they reflect someone’s, maybe my, intuition)? How did you get them? Like Paramecium did, or differently?

I like these 3 sets of numbers. They are in agreement with what one could expect from Heaps’ law. For each language but the one conventionally chosen as the base, in your case English, the ratios of the counts are expected, in theory, to slowly change with the number of counts (with the user’s experience, with the vocabulary size).

Only, with Heaps’ law whose parameters are known, you don’t have to arbitrarily define user experience at all. Heaps’ law automatically changes the ratios of the counts with the number of accumulated counts.

If the English counts are used as the ‘normalized’ counts, as in your case, and the user’s current English count is V_eng, then the matching count in any other language is to be computed as:

V_lang = V_lang_Ref * (V_eng / V_eng_Ref)^(beta_lang / beta_eng) – the Count Conversion (CC) formula.

In the CC formula, V_lang_Ref and V_eng_Ref are the vocabulary (LingQ) counts of the entire Reference document (let it be Steve’s book) in the ‘converted’ language and in English respectively; the betas are the Heaps’ exponents for the two languages. These parameters must be measured. Later in the day I’ll describe again how these parameters are to be measured.
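To spell the CC formula out as code, here is a minimal Python sketch. The reference counts in it are invented placeholders, and the betas are simply the two Russians’ published values, standing in until the real parameters are measured on Steve’s book:

```python
# The reference counts are invented placeholders, and the betas are the two
# Russians' published values, standing in until the real parameters are
# measured on the Reference (Steve's book) itself.
REF_COUNT = {"english": 15_000, "russian": 28_000}  # V_lang_Ref per language
BETA = {"english": 0.79, "russian": 0.84}           # Heaps' exponents per language

def convert_count(v_eng: float, target: str) -> float:
    """CC formula: V_lang = V_lang_Ref * (V_eng / V_eng_Ref) ** (beta_lang / beta_eng)."""
    scaled = v_eng / REF_COUNT["english"]
    return REF_COUNT[target] * scaled ** (BETA[target] / BETA["english"])

# e.g. the Russian count matching an English count of 20,000
print(round(convert_count(20_000, "russian")))
```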

“…it’s maybe a little above and beyond what we were looking to go with” - I agree, and I understand it is not what attracts users. It’s fully up to LingQ.

However, neither does it repel users. And it seems to describe how these vocabulary counts turn out to work in real life. And LingQ may be the first in the world to come up with a language-independent vocabulary characterization, and with the determination of the Heaps’ parameters for a wide set of languages. (Actually, I haven’t researched this much, just a bit more than I have cited. Still, everybody seems to cite the same two Russians when it comes to the actual dependence of the Heaps’ parameters on the language.)