Some Results from My Experiment with Conversion Ratios of Known Words Needed in Different Languages

So, I did some experiments with conversion between English and Romance languages, after reading various threads here on the subject. I figured the community could build on these, since translated foreign editions may be more easily available to others. Data from Slavic-language editions of the same books would be especially interesting.

This is very unscientific, but it may provide some useful data points.

The Experiment: Compare unique word counts of different language editions of the same book to get an idea of the ratio of known words needed in different languages to understand the same book.

Tool: Calibre → Edit Book → Tools → Reports to obtain total and unique word counts.

I'll use English as the common denominator, with a comparison value of "1".
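(If you'd rather script this step than click through Calibre, a rough equivalent in Python is sketched below. Calibre tokenizes a bit differently, with hyphens, apostrophes and numbers, so expect the counts to match only approximately.)

```python
# Rough scripted stand-in for Calibre's word report, assuming the book has
# been exported as a plain-text file.
import re

def word_counts(path):
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    words = re.findall(r"\w+", text)  # \w covers accented and Cyrillic letters
    return len(words), len(set(words))

total, unique = word_counts("les_miserables_fr.txt")  # hypothetical filename
print(f"Total words: {total:,}  Unique words: {unique:,}")
```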

Here is what I have so far:

LES MISERABLES
French Original
Total Words: 515,420
Unique Words: 35,652

English Edition
Total Words: 576,039
Unique Words: 28,019

Unique word ratio French to English: 1.27

THE CHILDREN OF CAPTAIN GRANT
French Original
Total Words: 192,221
Unique Words: 18,328

English Edition
Total Words: 155,596
Unique Words: 12,263

Unique word ratio French to English: 1.49

AROUND THE WORLD IN 80 DAYS
French Original
Total Words: 67,524
Unique Words: 9,749

English Edition
Total Words: 63,890
Unique Words: 7,712

Unique word ratio French to English: 1.26

PRIDE AND PREJUDICE
English Original
Total Words: 125,352
Unique Words: 7,260

French Edition
Total Words: 106,396
Unique Words: 9,623

Unique word ratio French to English: 1.32

THE ENGLISH SPY
English original:
Total Words: 110,246
Unique Words: 11,349

Spanish Edition
Total Words: 114,234
Unique Words: 14,193

Unique word ratio Spanish to English: 1.25

Though there are many caveats to this experiment, the ratios seem pretty consistent for translations between Romance languages and English in both directions, with an average ratio of 1.31.

I'd be very curious to see numbers for German and other languages for the same titles.

8 Likes

Interesting. Thanks for the research!

I have the German version of the first Harry Potter book, if someone has the English version to compare against.

If I follow your instructions (does it count the front matter in the book, i.e. copyright, publishing info, numbers, etc.?) I get:

Harry Potter - The Sorcerer's Stone
German edition:
Total Words: 83,236
Unique Words: 10,568

I didn't have the English version, but I found this link that details the English statistics. Whether they used Calibre to come up with these numbers I can't tell, but it seems relatively close. The details at this particular link are very interesting too.

How many words do you really need to know? - Literature Statistics

English Edition:
Total Words: 77,883
Unique Words: 6,185

Unique word ratio German to English: 1.70

2 Likes

This is good. Yeah, I included all the front matter, copyright, table of contents, etc. because it would be too much of a hassle to remove for the small margin of error it would eliminate. That stuff makes up less than 0.2% of the book, and some of the terms in it actually correspond to the target language.

2 Likes

Great job! The most interesting question for most users is how many word forms ("LingQ" words) correspond to a dictionary word ("word family"). To answer that question from your data we need to find out what that ratio is for English, then convert. Studies suggest that the general word-form-to-word-family ratio for English is 1.5. Because it is rather small, I think we can accept it as a good enough predictor of how many word forms learners actually encounter for each word family in normal texts. As I have argued elsewhere, that would not be the case for much more inflected languages.
If we assume that the number of underlying word families is similar across the different versions of the same book, and thus that the differences you found are mostly due to a difference in word forms, then we could conclude that for Romance languages the w2f (word form to word family) ratio would be 1.5 * 1.31 = very close to 2!!!, which is the ratio that seems to work for Romance languages, based on anecdotal evidence, as we've discussed in multiple threads.
This is a promising sanity check for these estimations. However, as you have mentioned, this is still not very scientific, because of the small sample and the guesswork still involved in estimating the key English w2f ratio. Plus we would still want to know how many words we need in order to reach a certain level in a language.
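To make that arithmetic explicit, a tiny sketch (the 1.5 English baseline is the estimate from the studies mentioned above, and the ratios are the ones measured in this thread):

```python
# Rough w2f estimates for target languages: multiply the assumed English
# word-form-to-word-family ratio by the measured unique-word ratios above.
ENGLISH_W2F = 1.5  # estimated English forms-per-family ratio (see above)

unique_word_ratios = {
    "Romance": 1.31,  # average over the five book pairs in this thread
    "German": 1.70,   # single Harry Potter data point
}

for lang, ratio in unique_word_ratios.items():
    # Assumes word families per book are roughly constant across translations,
    # so the unique-word ratio scales the w2f ratio directly.
    print(f"{lang}: estimated w2f ~ {ENGLISH_W2F * ratio:.2f}")
```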

This is such a great idea that it has inspired me to finally get around to doing something I've been thinking about for a long time: trying to get some actual, hard data on this issue. In particular, estimating w2f (and other stats) for several languages based on a large amount of text, while addressing some usual problems, such as those you encountered: cleaning the text of headers, tables of contents, dates and other common junk, maybe even some "entities" such as names.
This is a challenging problem, mainly because it is hard to define what a "word family" is, plus it requires a lot of data. However, I think it is possible to get a good estimation by analysing a large collection of texts in different languages, e.g., the Gutenberg library, and applying natural language processing (NLP) tools. Of course, that would be an automated analysis (through software), rather than a manual one. I happen to teach NLP techniques applied to research in the psychology of language as part of my tutoring. For example, it is possible to programmatically "stem"/"lemmatize" texts; these are algorithms that try to find the word family a given form corresponds to, according to different definitions: Stemming - Wikipedia
I like the Python spaCy library (https://spacy.io/) and I've just installed another Python library that helps search Gutenberg texts (GitHub - c-w/gutenberg: A simple interface to the Project Gutenberg corpus.). I'm trying to index the texts right now but it may take forever on my laptop.
I only have access to a small laptop right now and I am, as usual, rather busy, but I may get some help.
If someone else has any ideas about this project or wants to lend a hand, just let me know. The goal is to calculate useful statistics for language learning, such as w2f, in the most "scientific" way possible. At the same time, I'll try to locate some good research about how many words people really need to master a language. I'm not sure how much time I can devote to this right now but I hope to set up a public repository with this info. I personally find this idea very exciting.
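As a starting point, a first cut of the form-vs-lemma count with spaCy might look like the sketch below. A lemma is only a rough stand-in for a "word family", the model name is one of spaCy's stock models, and the input file is a placeholder:

```python
# Count unique word forms vs. unique lemmas in a text with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")  # or fr_core_news_sm, de_core_news_sm, ...

with open("sample_book.txt", encoding="utf-8") as f:  # hypothetical file
    text = f.read()
nlp.max_length = max(nlp.max_length, len(text))  # whole novels exceed the default cap

doc = nlp(text)
tokens = [t for t in doc if t.is_alpha]  # drop digits, punctuation, junk
forms = {t.text.lower() for t in tokens}
lemmas = {t.lemma_.lower() for t in tokens}

print(f"Unique forms: {len(forms):,}  Unique lemmas: {len(lemmas):,}")
print(f"Form-to-lemma ratio: {len(forms) / len(lemmas):.2f}")
```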

I looked at the "front stuff" and it really didn't seem like much, so I agree it doesn't look like it would affect the results much at all.

Congrats on 500 posts =)

2 Likes

Ftornay, I had this exact same idea about a year ago. I think I ended up running a Python script to stem texts but stopped for some reason which I don't remember. I'll take a look at it this weekend and get back to you.

I'm not surprised you had to stop because these things often end up being trickier than they sound, but let's give it another go!
Yes, by all means explain what you tried and how far you got. What texts did you use?

Here are some Russian numbers. This is interesting enough, and the exchange rate is favorable enough, that I went and bought these off of litres.ru just for this purpose. So now I have some material that I can import into LingQ if I run dry. (I don't think I'm going to be tackling Les Miserables / Отверженные any time soon, though.)

I had posted these to your wall, not knowing when I'd get around to doing the math, but here it is. Do feel free to check the calculations. I noted the non-Russian words here, but just calculated using the unique word count, which includes them. Because of the different alphabets, Calibre sorts the non-Russian words to the front, making it easy to see their number.

Edit: non-Russian word counts are unique non-Russian words.
Edit 2: Removing the non-Russian words only changes the overall ratio by 0.01.

Les Miserables
Отверженные
442493 words
71439 unique words
663 non-Russian words
Unique word ratio Russian to English: 2.55

The Children of Captain Grant
Дети капитана Гранта
53198 words
16041 unique words
52 non-Russian words
Unique word ratio Russian to English: 1.31

Around the World in Eighty Days
Вокруг света в восемьдесят дней
56809 words
14839 unique words
38 non-Russian words
Unique word ratio Russian to English: 1.92

Pride and Prejudice
Гордость и предубеждение
101244 words
21128 unique words
44 non-Russian words
Unique word ratio Russian to English: 2.91

5 Likes

A couple of observations about these data.
Notice that the correct average for rates is the harmonic mean, rather than the arithmetic mean. For these data they are almost the same, but as we add more data they'll differ more, and adding more data is exactly what I'm trying to do.

We have so far:
Romance, arithmetic mean: 1.318, harmonic: 1.312
Russian, arithmetic: 2.17 , harmonic: 1.98
German (one single point) 1.7
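For anyone who wants to verify, here is a quick sketch of those two means computed from the ratios posted so far in the thread:

```python
# Check of the arithmetic vs. harmonic means quoted above, using the
# unique-word ratios collected in this thread.
from statistics import harmonic_mean, mean

romance = [1.27, 1.49, 1.26, 1.32, 1.25]  # the five Romance/English pairs above
russian = [2.55, 1.31, 1.92, 2.91]

print(f"Romance: arithmetic {mean(romance):.3f}, harmonic {harmonic_mean(romance):.3f}")
print(f"Russian: arithmetic {mean(russian):.2f}, harmonic {harmonic_mean(russian):.2f}")
# Romance: arithmetic 1.318, harmonic 1.312
# Russian: arithmetic 2.17, harmonic 1.98
```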

A first point is that the ratio grows more slowly than we would expect from the differences in inflection between the languages. I argued long ago that this would be the case.
Another point is that German sits between Romance and Slavic. This matches the subjective impression of learners. I think this is due to the German convention of writing compounds as one single word, because the language itself is somewhat less inflected than the Romance languages.
The Slavic to Romance ratio is close to my predicted factor of 2, but it falls short. I think this is because of the small sample. Notice that there is a large variation in the results for Russian. Besides, I would expect the difference in unique words between a more inflected and a less inflected language to grow as you add more texts, towards the overall difference between all possible forms, although very slowly. This is because you are likely to come across more and more different forms as you keep on reading.
For this reason it is important to calculate stats as a function of the number of words read, because they will change over the course of learning.

An additional German example:

Die Verwandlung
German original
Total Words: 19,216
Unique Words: 3,967

English Translation
Total Words: 22,168
Unique Words: 3,074

Unique word ratio German to English: 1.29

Interesting that the ratio is smaller than for the other German example, Harry Potter. I wonder if to some degree it's because the original is in German, and therefore the English translation might tend to be wordier than in a corresponding English-to-German example? Small sample size, so hard to tell.

2 Likes

"Slavic ratio falls short… reduced sample": The issue of small sample size is compounded by the fact that the two works with the lowest ratio are by the same author. It may be how his style gets translated. I would not be surprised if popular literature, if Verne counts as such, uses fewer, simpler word forms than that of the likes of Hugo and Austen, as translated.

Thanks everyone for contributing to this project! I dug through my library and found some more examples to add:

For German:

THE KILLER
English original:
Total Words: 140,137
Unique words: 10,910

German Edition:
Total Words: 146,472
Unique Words: 16,156

Unique word ratio German to English: 1.48

For Korean:

THE MARTIAN
English original:
Total Words: 107,105
Unique words: 9,387

Korean Edition:
Total Words: 80,648
Unique Words: 21,845

Unique word ratio Korean to English: 2.33

BAD LUCK AND TROUBLE
English original:
Total Words: 113,440
Unique words: 10,190

Korean Edition:
Total Words: 93,028
Unique Words: 27,366

Unique word ratio Korean to English: 2.68

3 Likes

My intuition tells me that the w2f ratio for Russian would increase as you add more texts. In other words, WFam(80k LingQ words) < 4 * WFam(20k LingQ words). I think that's because as someone learns more words, the new words have a higher likelihood of being a different form of a word already known. So the w2f ratio would have to be a function of known words, or word-database size. We already know the ratio as the database size increases towards infinity, since it's just going to be the average number of word forms per word family. Maybe that would be something like 15 or 20.

Maybe: 20k LingQ words corresponds to 6k word families (w2f = 3.3); 40k corresponds to 10k word families (w2f = 4); 120k corresponds to 20k word families (w2f = 6), etc.
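That intuition lines up with Heaps' law (vocabulary grows roughly as V = K * n^beta in the number of running words n), which comes up again further down the thread. If unique forms and word families each follow such a curve, with a smaller exponent for families, the w2f ratio grows automatically as you read. A sketch where all four parameters are invented, purely to illustrate the shape:

```python
# Illustrative only: model unique word forms and word families as Heaps-law
# curves V(n) = K * n**beta over running words n. A smaller exponent for
# families makes w2f = forms / families grow with reading volume.
# All four constants below are made up for the illustration.
K_FORMS, BETA_FORMS = 12.0, 0.62
K_FAMS, BETA_FAMS = 14.0, 0.52

for n in (100_000, 1_000_000, 10_000_000):
    forms = K_FORMS * n ** BETA_FORMS
    families = K_FAMS * n ** BETA_FAMS
    print(f"{n:>12,} running words: w2f ~ {forms / families:.2f}")
# -> roughly 2.7, 3.4, 4.3: the ratio keeps climbing, as intuited above.
```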

Yeah, I'm not sure of the reason. I think I may have even gotten it to work, but stopped because I couldn't get a known-word list out of LingQ. The goal was to see how many of my known words were from different word families.

Yes, the main thing is to find good data. And I think it is better to use real texts rather than simple word lists, because then we can get predictions about what part of speech each word might be, which helps in finding the "family".

I found this post, which I thought was interesting. I'm quoting Mr. Ilya_L:
"Alex, your numbers seem very reasonable to me! Though please clarify:
'We split it up into three levels: Beginner, Intermediate and Advanced.
[Q: what do you split into three levels, Steve's book or the ratios of the counts that you are applying for the three levels?]
Here are the rounded numbers that we got:
English = 1, 1, 1
Spanish, French, Portuguese, Italian, Japanese, Swedish = 1.15, 1.25, 1.3
German, Chinese = 1.35, 1.45, 1.5
Russian = 1.7, 2.1, 2.2
Korean = 1.8, 2.2, 2.5'
Q: Are these 'measured' numbers (or do they reflect someone's, maybe my, intuition)? How did you get them? Like Paramecium did, or differently?
I like these 3 sets of numbers. They are in agreement with what one could expect from Heaps' law. For each language but one conventionally chosen, in your case English, the ratios of the counts are expected, in theory, to change slowly with the size of the counts (with the user's experience, with vocabulary size).
Only, with Heaps' law, whose parameters are known, you don't have to arbitrarily define user experience at all. Heaps' law automatically changes the ratios of the counts as the counts accumulate.
If the English counts are used as the 'normalized' counts, as in your case, and the user's current English count is V_eng, then the matching count in any other language is to be computed as:
V_lang = V_lang_Ref * (V_eng / V_eng_Ref) ^ (beta_lang / beta_eng), the Count Conversion (CC) formula.
In the CC formula, V_lang_Ref and V_eng_Ref are the vocabulary (LingQ) counts over the entire reference document (let it be Steve's book) in the 'converted' language and in English respectively; the betas are the Heaps exponents for the two languages. These parameters must be measured. Later in the day I'll describe again how these parameters are to be measured.
'…it's maybe a little above and beyond what we were looking to go with' - I agree, and I understand it is not what attracts users. It's fully up to LingQ.
However, it doesn't repel users either. And it seems to describe how these vocabulary counts work in real life. And LingQ may be the first in the world to come up with a language-independent vocabulary characterization, and with the determination of the Heaps parameters for a wide set of languages. (Actually, I haven't researched this much, just a bit more than I cited. Still, everybody seems to cite the same two Russians when it comes to the actual dependence of the Heaps parameters on the language.)"
This also raises the question: how many English words correspond to A1, A2, B1, B2, C1, C2, etc.?
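A minimal sketch of that Count Conversion formula in Python, assuming you already have the reference counts and per-language Heaps exponents (all the numbers below are invented placeholders, not measured values):

```python
# Count Conversion (CC) formula from the quoted post:
#   V_lang = V_lang_Ref * (V_eng / V_eng_Ref) ** (beta_lang / beta_eng)
# V_*_Ref are unique-word counts over one reference document available in
# both languages; the betas are Heaps-law exponents, measured per language.

def convert_count(v_eng, v_eng_ref, v_lang_ref, beta_eng, beta_lang):
    """Map an English vocabulary count to the matching count in another language."""
    return v_lang_ref * (v_eng / v_eng_ref) ** (beta_lang / beta_eng)

# Invented example: a reference book with 8,000 unique English words and
# 16,000 unique Russian words, plus made-up exponents.
print(round(convert_count(v_eng=20_000, v_eng_ref=8_000, v_lang_ref=16_000,
                          beta_eng=0.60, beta_lang=0.70)))  # ~46,600
```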

My God, this is wonderful.

Like Francisco I only have access to a tiny laptop and very slow desktop at the moment, but I have to find out how this calibre software works regardless. Iā€™ve heard the Master mention it many times over the years in his YouTube vids, but havenā€™t used it as of yet and the few Spanish books I had were able to import directly into linq cutting and pasting from a PDF.