Word count in different languages

I realise this may be an oversimplification, but does it not follow from Paramecium’s experiment that a vocabulary of 10,000 words in English corresponds to a vocabulary of 21,200 words in Russian ?

After all, the analysis was performed on the same work of literature, so should be directly comparable.

I’m not talking about particular inflection differences between the languages. I’m just being very, very general.

Is it true that you need to have twice the number of “known” words in Russian as you do in English to be at (roughly) the same level?

If this is true, then the word count / proficiency levels within LingQ will be fairly straightforward to calculate for each language.

By “word” I realise I am talking about a discrete combination of letters (i.e. one that LingQ recognises), and not a word “family”.

Jamie, yes you are right as far as I can see. I have always suspected that the difference between Russian and English was at least twice.

My questions are:

How can we quickly compare the number of unique words in like content? Paramecium do you have any advice?

Which content should we use for comparison? Do we have to use books that are no longer under copyright, or can we use books like The Alchemist or Harry Potter, which have been translated into all of our languages?

Jamie, on second thought, after you have formulated it as below, I think you might be absolutely right ;-). Thank you.

"Is it true that you need to have twice the number of ‘known’ words in Russian as you do in English [according to the LingQ counts] to be at (roughly) the same level ?

If this is true, then the word count / proficiency levels within LingQ will be fairly straightforward to calculate for each language. "

Sounds true. Perhaps the comparison ratio (2.12) wouldn’t much depend on the text (e.g. “Master and Margarita” translated from Russian into English ) chosen.

I didn’t see what Steve had answered, so it was my “independent vote” :wink:

Sounds like the Harry Potter series could be perfect, since it’s contemporary and available in all languages. For this analysis, the content does not need to be shared so copyright is not an issue. All that would need to be done is import them into LingQ, from e-books perhaps ?

Two things: one off topic: I loved your paragraph, Ilya!

One on-topic: Appropriately weighted word count = level is all very well in theory.

In practice (at least for me) it doesn’t quite add up, so to speak. I have a huge word count in French and quite a good one in Spanish but still only consider myself Intermediate I in French and less than that in Spanish if I were really picky (this is based on an internal, personal measurement of my speaking ability). My ‘official’ levels will always lag behind those of most other people because of my strict internal monitor. I’d have to be really good in order to put myself onto Intermediate 2 and extremely good for Advanced…

Jamie,

The problem with Harry Potter etc, is that someone has to buy the ebook in all languages, and then the PDF format does not work with LingQ. It would also be a lot of work.

I wonder if there are not some handy numbers out there.

I also think it would be enough to compare a few chapters. So if we could find a book on Gutenberg that has been translated into all languages, like War and Peace or something, we could divide up the work and just look at a few chapters. We do not need to be so accurate.

Sanne

We need a Gordian knot number for levels. The word count serves that purpose. It represents your potential. What you do with that potential is up to you!!

Steve wrote: “I also think it would be enough to compare a few chapters.”

I think comparing just a few chapters won’t be enough. When I compared the English and Russian versions of Master and Margarita, the difference between the English and Russian word counts grew with each chapter. If I remember rightly, after three chapters the ratio was about 1:1,5.

I don’t think that anybody has to buy an e-book. There are enough free resources on the net. “Crime and Punishment”, for example, is freely available in different languages:

Russian: Ф. М. Достоевский. Преступление и наказание
English: http://www.kiosek.com/dostoevsky/library/crimeandpunishment.txt
German: Literatur - Kultur - DER SPIEGEL
French: Crime et Châtiment - Wikisource
Spanish: Crimen y castigo - Wikisource

Thanks for this, Paramecium. Did you just import the Bulgakov books into LingQ, or how did you get your word count?

Thanks for all the input, everyone! We’ll take a look at comparing translations as we work towards more even numbers across all languages. Looks like Russian might already be done with the magical “2.12”, but we’ll see :wink:

Just one more irrelevant comment.

The original Russian text of “Master and Margarita” may be objectively more complex than the text of its English translation. (For example, it is commonly accepted that it is easier to read a piece of literature in a foreign language if that piece is itself a translation from another language.)

Still, this comment is not relevant, because none of this is an exact science. What Steve seems to be looking for is a simple way to get some ‘normalizing’ factor. And what you boys and girls, grandmams and paps (that long because again I forgot the spelling of ‘gues’) have suggested seems good. I like it, and the rationale behind it can easily be explained to users.

I’ll be irrelevant to the end. I wouldn’t advise a learner of Russian to start with either Master and Margarita or Crime and Punishment in Russian. They are objectively very complex Russian texts. Tolstoi is simpler IMHO. :slight_smile:

@ Sanne

Thanks Sanne, and as you see, I never confuse the first and the last letter in gyes, and you still seem to understand me. Sorry for many errors :wink:

I agree with Ilya that Tolstoi’s language is quite easy for a learner.

Steve wrote: “Did you just import the Bulgakov books into LingQ, or how did you get your word count?”

Yes. I uploaded the chapters 1-23 to LingQ. The English book I imported into my unused Portuguese profile and the Russian one into my unused Italian profile. Then I simply pressed “Know all words”. That’s it.
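For anyone who wants to try the same comparison outside LingQ, here is a minimal sketch in Python. It assumes a "word" is simply a distinct lower-cased run of letters, which is only a guess at how LingQ counts; the function names and the two toy sentences are made up for illustration and are not LingQ's actual method.

```python
import re


def unique_word_forms(text: str) -> set[str]:
    """Return the distinct lower-cased word forms found in text.

    Assumption: a word form is a run of letters. This splits on
    apostrophes and hyphens, which LingQ may handle differently.
    """
    # [^\W\d_]+ matches word characters except digits and underscore,
    # so Cyrillic and accented Latin letters are included.
    return set(re.findall(r"[^\W\d_]+", text.lower()))


def vocabulary_ratio(text_a: str, text_b: str) -> float:
    """Unique forms in text_b divided by unique forms in text_a."""
    return len(unique_word_forms(text_b)) / len(unique_word_forms(text_a))


# Toy parallel sentences (hypothetical, not taken from any book).
english = "The cat sees the dog. The dog sees the cat."
russian = "Кошка видит собаку. Собака видит кошку."

print(len(unique_word_forms(english)))
print(len(unique_word_forms(russian)))
print(vocabulary_ratio(english, russian))
```

On the toy sentences the English side has four forms and the Russian side five, because the inflected forms собака/собаку and кошка/кошку are counted separately. That is the same effect that inflates the Russian count in the full book.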

Ilya wrote: “I wouldn’t advise a learner of Russian to start with either Master and Margarita or Crime and Punishment in Russian.”

I agree, Tolstoi is surely easier for a beginner to understand. Unfortunately, there are many different abridged versions of “Voyna i mir”. Projekt Gutenberg, for example, contains one such heavily shortened German version. So you would have to be careful not to compare different versions of the book. I have no idea what the situation is with Tolstoi’s other novels, e.g. “Anna Karenina”.

I propose that we have a LingQ account set up for corpus testing. I have several ebooks in multiple languages (Harry Potter in particular) and I will volunteer to upload them, chapter by chapter, so we can do lexical analysis on them. Once we have a LingQ corpus, we are bound to think of more comparative tests we can perform. I still want to do some cross-language readability testing.

If we restrict access to the account to people doing linguistic analysis then we shouldn’t be breaking copyright laws. Should we?

I scoured the internet for common corpus analyses across languages, but to no avail. You can easily get word frequencies, but not, it seems, figures for total number of words vs. unique words.

In any case, I think Paramecium’s solution is good enough and wouldn’t take long to do. I am happy to help out.

I agree with Helen - we should have separate LingQ accounts to do this; I’d rather not pollute languages on my account that I may want to learn in the future !

Question @Paramecium:
As you were uploading the chapters, did you notice a definite convergence between English and Russian, i.e. was the final 2.12 ratio fairly stable ?

@Jamie:

To get a complete picture I have now analyzed all 32 chapters of the book.

English words recognized by LingQ: 10451
Russian words recognized by LingQ: 23363

Ratio English/Russian: 1:2,23

As one would expect, the difference between English and Russian grows slowly, the more content is analyzed. I think you have to analyze more books to notice a definite convergence. Anyway, I think that counting with a ratio of 1:2,1 or 1:2,2 is accurate enough.
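The arithmetic behind such a normalizing factor can be sketched as follows. It uses only the two counts quoted above; the idea of multiplying an English target by the factor is the one Jamie proposed, and the function name is made up for illustration. Whether LingQ would set its levels this way is an assumption.

```python
# Counts for the full 32 chapters, as reported in this thread.
ENGLISH_FORMS = 10451
RUSSIAN_FORMS = 23363

# Russian word forms per English word form for this one book.
ratio = RUSSIAN_FORMS / ENGLISH_FORMS


def scaled_target(english_target: int, factor: float = ratio) -> int:
    """Scale an English known-words target by a normalizing factor.

    Hypothetical helper: it assumes the levels in the other language
    are simply the English levels multiplied by a constant.
    """
    return round(english_target * factor)


print(ratio)
print(scaled_target(10_000))        # using the 32-chapter ratio
print(scaled_target(10_000, 2.12))  # using the earlier 23-chapter ratio
```

With the earlier factor of 2.12 this gives the 21,200 from the first post; with the full-book counts the same 10,000 English words come out a little over 22,000. Since the factor is still drifting upwards as more text is added, a rounded value such as 2.1 or 2.2 is about as precise as the data supports.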

Paramecium wrote:

“the difference between English and Russian grows slowly, the more content is analyzed”.

That reminds me of something. Define a unique word as a word that occurs just once in a given text. People studied the ratio Ru_nu of unique words to (say) non-unique words in various texts, in various (but not many) languages. They initially expected to find a certain ‘stable’ Ru_nu. But it turned out that the bigger the text they analyzed, the bigger the ratio Ru_nu they obtained. There was no convergence, if I am not mistaken, even as the texts approached the entire known corpora of the language.

So maybe Paramecium has discovered an effect of this sort? The relevant distribution I am talking about was called either the K- or the Z-distribution; I don’t actually remember. But it is hardly relevant to the problem you have solved, which was suggesting an approximate practical coefficient.

I have now taken a look at Wikipedia and found that, again, I confused the terminology and the distribution names. However, the idea that “explains” why “the difference between English and Russian [or any pair of languages] grows slowly” [with the total number of words N in a text] seems to be OK.

I put “explains” in quotes because the explanation is based on the long-known Zipf’s law [rather than a Z distribution]. However, nobody seems to know exactly why Zipf’s law holds. It’s an empirical law.

To stay closer to the accepted terminology, let me drop my previous definition of a unique word. From now on, by the number V of unique words in a text comprising N words in total, we will mean the number of all distinguishable words. Distinguishable in the sense that LingQ distinguishes them: a unique combination of characters.

So V(N) is the LingQ word count in a text comprising N words. (V is called the vocabulary in the link I’ll give below.)

Heaps’ law, which is derived from Zipf’s law, states:

V = K * N^beta

where K is a constant and beta is the exponent. K and beta generally vary with language.
Typically K is between 10 and 100, and beta is between 0.4 and 0.6.
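To see why this would make the ratio drift rather than settle, here is a small Python sketch. The K and beta values are illustrative only, picked from inside the typical ranges above and not measured on any real text; it also assumes both versions of the book have the same N, which is not quite true for a real translation. The fitting helper is a plain least-squares line through log V against log N.

```python
import math


def heaps_vocabulary(n_tokens: float, k: float, beta: float) -> float:
    """Vocabulary size predicted by V = K * N^beta."""
    return k * n_tokens ** beta


def fit_heaps(points: list[tuple[float, float]]) -> tuple[float, float]:
    """Estimate (K, beta) from (N, V) pairs.

    Fits log V = log K + beta * log N by ordinary least squares.
    Needs at least two points with different N.
    """
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope_num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope_den = sum((x - mean_x) ** 2 for x in xs)
    beta = slope_num / slope_den
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


# Hypothetical parameters: same K, slightly larger beta for Russian.
ENGLISH = (40.0, 0.50)
RUSSIAN = (40.0, 0.55)

for n in (10_000, 100_000, 1_000_000):
    ratio = heaps_vocabulary(n, *RUSSIAN) / heaps_vocabulary(n, *ENGLISH)
    print(n, round(ratio, 2))
```

If the two exponents differ at all, the ratio of the two vocabularies behaves like N to the power of the difference in beta, so it keeps creeping up as more text is added and never converges. With cumulative (N, V) counts per chapter, `fit_heaps` could be used to check whether the real figures behave this way.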

Below is the readable link: