Some Results from My Experiment with Conversion Ratios of Known Words Needed in Different Languages

So, I did some experiments with conversion between English and Romance languages, after reading various threads here on the subject. I figured the community could build on these, since translated foreign editions may be more easily available to others. Data from Slavic-language editions of the same books would be especially interesting.

This is very unscientific, but it may provide some useful data points.

The Experiment: Compare unique word counts of different language editions of the same book to get an idea of the ratio of known words needed in different languages to understand the same book.

Tool: Calibre → Edit Book → Tools → Reports to obtain total and unique word counts.

I'll use English as the common denominator, with a comparison value of "1".
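(If you'd rather script this step than click through Calibre, a rough equivalent in Python is sketched below. Calibre tokenizes a bit differently, with hyphens, apostrophes and numbers, so expect the counts to match only approximately.)

```python
# Rough scripted stand-in for Calibre's word report, assuming the book has
# been exported as a plain-text file.
import re

def word_counts(path):
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    words = re.findall(r"\w+", text)  # \w covers accented and Cyrillic letters
    return len(words), len(set(words))

total, unique = word_counts("les_miserables_fr.txt")  # hypothetical filename
print(f"Total words: {total:,}  Unique words: {unique:,}")
```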

Here is what I have so far:

LES MISERABLES
French Original
Total Words: 515,420
Unique Words: 35,652

English Edition
Total Words: 576,039
Unique Words: 28,019

Unique word ratio French to English: 1.27

THE CHILDREN OF CAPTAIN GRANT
French Original
Total Words: 192,221
Unique Words: 18,328

English Edition
Total Words: 155,596
Unique Words: 12,263

Unique word ratio French to English: 1.49

AROUND THE WORLD IN 80 DAYS
French Original
Total Words: 67,524
Unique Words: 9,749

English Edition
Total Words: 63,890
Unique Words: 7,712

Unique word ratio French to English: 1.26

PRIDE AND PREJUDICE
English Original
Total Words: 125,352
Unique Words: 7,260

French Edition
Total Words: 106,396
Unique Words: 9,623

Unique word ratio French to English: 1.32

THE ENGLISH SPY
English original:
Total Words: 110,246
Unique Words: 11,349

Spanish Edition
Total Words: 114,234
Unique Words: 14,193

Unique word ratio Spanish to English: 1.25

Though there are many caveats to this experiment, the ratios seem pretty consistent for translations between Romance languages and English in both directions, with an average ratio of 1.31.

I'd be very curious to see numbers for German and other languages for the same titles.

8 Likes

Interesting. Thanks for the research!

I have the German version of the first Harry Potter book, if someone has the English version to compare against.

If I follow your instructions (does it count the front matter in the book, i.e. copyright, publishing info, numbers, etc.?) I get:

Harry Potter - The Sorcerer's Stone
German edition:
Total Words: 83,236
Unique Words: 10,568

I didn't have the English version, but I found this link that details the English statistics. Whether they used Calibre to come up with these numbers I can't tell, but it seems relatively close. The details at this particular link are very interesting too.

How many words do you really need to know? - Literature Statistics

English Edition:
Total Words: 77,883
Unique Words: 6,185

Unique word ratio German to English: 1.70

2 Likes

This is good. Yeah, I included all the front matter, copyright, table of contents, etc. because it would be too much of a hassle to remove for the small margin of error it would eliminate. That stuff makes up less than 0.2% of the book, and some of the terms in it actually correspond to the target language.

2 Likes

Great job! The most interesting question for most users is how many word forms ("LingQ" words) correspond to a dictionary word ("word family"). To answer that question from your data we need to find out what that ratio is for English, then convert. Studies suggest that the general word-form-to-word-family ratio for English is 1.5. Because it is rather small, I think we can accept it as a good enough predictor of how many word forms learners actually encounter for each word family in normal texts. As I have argued elsewhere, that would not be the case for much more inflected languages.
If we assume that the number of underlying word families is similar across the different versions of the same book, and thus that the differences you found are mostly due to a difference in word forms, then we could conclude that for Romance languages the w2f (word form to word family) ratio would be 1.5 * 1.31 = very close to 2!!!, which is the ratio that seems to work for Romance languages, based on anecdotal evidence, as we've discussed in multiple threads.
This is a promising sanity check for these estimations. However, as you have mentioned, this is still not very scientific, because of the small sample and the guesswork still involved in estimating the key English w2f ratio. Plus we would still want to know how many words we need in order to reach a certain level in a language.
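To make that arithmetic explicit, a tiny sketch (the 1.5 English baseline is the estimate from the studies mentioned above, and the ratios are the ones measured in this thread):

```python
# Rough w2f estimates for target languages: multiply the assumed English
# word-form-to-word-family ratio by the measured unique-word ratios above.
ENGLISH_W2F = 1.5  # estimated English forms-per-family ratio (see above)

unique_word_ratios = {
    "Romance": 1.31,  # average over the five book pairs in this thread
    "German": 1.70,   # single Harry Potter data point
}

for lang, ratio in unique_word_ratios.items():
    # Assumes word families per book are roughly constant across translations,
    # so the unique-word ratio scales the w2f ratio directly.
    print(f"{lang}: estimated w2f ~ {ENGLISH_W2F * ratio:.2f}")
```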

This is such a great idea that it has inspired me to finally get around to doing something I've been thinking about for a long time: trying to get some actual, hard data on this issue. In particular, estimating w2f (and other stats) for several languages based on a large amount of text, while addressing some usual problems, such as those you encountered: cleaning the text of headers, tables of contents, dates and other common junk, maybe even some "entities" such as names.
This is a challenging problem, mainly because it is hard to define what a "word family" is, plus it requires a lot of data. However, I think it is possible to get a good estimation by analysing a large collection of texts in different languages, e.g., the Gutenberg library, and applying natural language processing (NLP) tools. Of course, that would be an automated analysis (through software), rather than a manual one. I happen to teach NLP techniques applied to research in the psychology of language as part of my tutoring. For example, it is possible to programmatically "stem"/"lemmatize" texts; these are algorithms that try to find the word family a given form corresponds to, according to different definitions: Stemming - Wikipedia
I like the Python spaCy library (https://spacy.io/) and I've just installed another Python library that helps search Gutenberg texts (GitHub - c-w/gutenberg: A simple interface to the Project Gutenberg corpus.). I'm trying to index the texts right now but it may take forever on my laptop.
I only have access to a small laptop right now and I am, as usual, rather busy, but I may get some help.
If someone else has any ideas about this project or wants to lend a hand, just let me know. The goal is to calculate useful statistics for language learning, such as w2f, in the most "scientific" way possible. At the same time, I'll try to locate some good research about how many words people really need to master a language. I'm not sure how much time I can devote to this right now but I hope to set up a public repository with this info. I personally find this idea very exciting.
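As a starting point, a first cut of the form-vs-lemma count with spaCy might look like the sketch below. A lemma is only a rough stand-in for a "word family", the model name is one of spaCy's stock models, and the input file is a placeholder:

```python
# Count unique word forms vs. unique lemmas in a text with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")  # or fr_core_news_sm, de_core_news_sm, ...

with open("sample_book.txt", encoding="utf-8") as f:  # hypothetical file
    text = f.read()
nlp.max_length = max(nlp.max_length, len(text))  # whole novels exceed the default cap

doc = nlp(text)
tokens = [t for t in doc if t.is_alpha]  # drop digits, punctuation, junk
forms = {t.text.lower() for t in tokens}
lemmas = {t.lemma_.lower() for t in tokens}

print(f"Unique forms: {len(forms):,}  Unique lemmas: {len(lemmas):,}")
print(f"Form-to-lemma ratio: {len(forms) / len(lemmas):.2f}")
```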

I looked at the "front stuff" and it really didn't seem like much, so I agree it doesn't look like it would affect the results much at all.

Congrats on 500 posts =)

2 Likes

Ftornay, I had this exact same idea about a year ago. I think I ended up running a Python script to stem texts but stopped for some reason which I don't remember. I'll take a look at it this weekend and get back to you.

I'm not surprised you had to stop because these things often end up being trickier than they sound, but let's give it another go!
Yes, by all means explain what you tried and how far you got. What texts did you use?

Here are some Russian numbers. This is interesting enough, and the exchange rate is favorable enough, that I went and bought these off of litres.ru just for this purpose. So now I have some material that I can import into LingQ if I run dry. (I don't think I'm going to be tackling Les Miserables / Отверженные any time soon, though.)

I had posted these to your wall, not knowing when I'd get around to doing the math, but here it is. Do feel free to check the calculations. I noted the non-Russian words here, but just calculated using the unique word count, which includes them. Because of the different alphabets, Calibre sorts the non-Russian words to the front, making it easy to see their number.

Edit: non-Russian word counts are unique non-Russian words.
Edit 2: Removing the non-Russian words only changes the overall ratio by 0.01.

Les Miserables
Отверженные
442493 words
71439 unique words
663 non-Russian words
Unique word ratio Russian to English: 2.55

The Children of Captain Grant
Дети капитана Гранта
53198 words
16041 unique words
52 non-Russian words
Unique word ratio Russian to English: 1.31

Around the World in Eighty Days
Вокруг света в восемьдесят дней
56809 words
14839 unique words
38 non-Russian words
Unique word ratio Russian to English: 1.92

Pride and Prejudice
Гордость и предубеждение
101244 words
21128 unique words
44 non-Russian words
Unique word ratio Russian to English: 2.91

5 Likes

A couple of observations about these data.
Notice that the correct average for rates is the harmonic mean, rather than the arithmetic mean. For these data they are almost the same, but as we add more data they'll differ more, and adding more data is exactly what I'm trying to do.

We have so far:
Romance, arithmetic mean: 1.318, harmonic: 1.312
Russian, arithmetic: 2.17 , harmonic: 1.98
German (one single point) 1.7
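For anyone who wants to verify, here is a quick sketch of those two means computed from the ratios posted so far in the thread:

```python
# Check of the arithmetic vs. harmonic means quoted above, using the
# unique-word ratios collected in this thread.
from statistics import harmonic_mean, mean

romance = [1.27, 1.49, 1.26, 1.32, 1.25]  # the five Romance/English pairs above
russian = [2.55, 1.31, 1.92, 2.91]

print(f"Romance: arithmetic {mean(romance):.3f}, harmonic {harmonic_mean(romance):.3f}")
print(f"Russian: arithmetic {mean(russian):.2f}, harmonic {harmonic_mean(russian):.2f}")
# Romance: arithmetic 1.318, harmonic 1.312
# Russian: arithmetic 2.17, harmonic 1.98
```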

A first point is that the ratio grows more slowly than we would expect from the differences in inflection between the languages. I argued long ago that this would be the case.
Another point is that German sits between Romance and Slavic. This matches the subjective impression of learners. I think this is due to the German convention of writing compounds as one single word, because the language itself is somewhat less inflected than the Romance languages.
The Slavic to Romance ratio is close to my predicted factor of 2, but it falls short. I think this is because of the small sample. Notice that there is a large variation in the results for Russian. Besides, I would expect the difference in unique words between a more inflected and a less inflected language to grow as you add more texts, towards the overall difference between all possible forms, although very slowly. This is because you are likely to come across more and more different forms as you keep on reading.
For this reason it is important to calculate stats as a function of the number of words read, because they will change over the course of learning.

An additional German example:

Die Verwandlung
German original
Total Words: 19,216
Unique Words: 3,967

English Translation
Total Words: 22,168
Unique Words: 3,074

Unique word ratio German to English: 1.29

Interesting that the ratio is smaller than for the other German example, Harry Potter. I wonder if to some degree it's because the original is in German, and therefore the English translation might tend to be wordier than in a corresponding English-to-German example? Small sample size, so hard to tell.

2 Likes

"Slavic ratio falls short… reduced sample": The issue of small sample size is compounded by the fact that the two works with the lowest ratio are by the same author. It may be how his style gets translated. I would not be surprised if popular literature, if Verne counts as such, uses fewer, simpler word forms than that of the likes of Hugo and Austen, as translated.

Thanks everyone for contributing to this project! I dug through my library and found some more examples to add:

For German:

THE KILLER
English original:
Total Words: 140,137
Unique words: 10,910

German Edition:
Total Words: 146,472
Unique Words: 16,156

Unique word ratio German to English: 1.48

For Korean:

THE MARTIAN
English original:
Total Words: 107,105
Unique words: 9,387

Korean Edition:
Total Words: 80,648
Unique Words: 21,845

Unique word ratio Korean to English: 2.33

BAD LUCK AND TROUBLE
English original:
Total Words: 113,440
Unique words: 10,190

Korean Edition:
Total Words: 93,028
Unique Words: 27,366

Unique word ratio Korean to English: 2.68

3 Likes

My intuition tells me that the w2f ratio for Russian would increase as you add more texts. In other words, WFam(80k LingQ words) < 4 * WFam(20k LingQ words). I think that's because as someone learns more words, the new words have a higher likelihood of being a different form of a word already known. So the w2f ratio would have to be a function of known words, or word-database size. We already know the ratio as the database size increases towards infinity, since it's just going to be the average number of word forms per word family. Maybe that would be something like 15 or 20.

Maybe: 20k LingQ words corresponds to 6k word families (w2f = 3.3); 40k corresponds to 10k word families (w2f = 4); 120k corresponds to 20k word families (w2f = 6), etc.
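That intuition lines up with Heaps' law (vocabulary grows roughly as V = K * n^beta in the number of running words n), which comes up again further down the thread. If unique forms and word families each follow such a curve, with a smaller exponent for families, the w2f ratio grows automatically as you read. A sketch where all four parameters are invented, purely to illustrate the shape:

```python
# Illustrative only: model unique word forms and word families as Heaps-law
# curves V(n) = K * n**beta over running words n. A smaller exponent for
# families makes w2f = forms / families grow with reading volume.
# All four constants below are made up for the illustration.
K_FORMS, BETA_FORMS = 12.0, 0.62
K_FAMS, BETA_FAMS = 14.0, 0.52

for n in (100_000, 1_000_000, 10_000_000):
    forms = K_FORMS * n ** BETA_FORMS
    families = K_FAMS * n ** BETA_FAMS
    print(f"{n:>12,} running words: w2f ~ {forms / families:.2f}")
# -> roughly 2.7, 3.4, 4.3: the ratio keeps climbing, as intuited above.
```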

Yeah, I'm not sure of the reason. I think I may have even gotten it to work, but stopped because I couldn't get a known-word list out of LingQ. The goal was to see how many of my known words were from different word families.

Yes, the main thing is to find good data. And I think it is better to use real texts rather than simple word lists, because then we can get predictions about what part of speech each word might be, which helps in finding the "family".

I found this post, which I thought was interesting. I'm quoting Mr. Ilya_L:
"Alex, your numbers seem very reasonable to me! Though please clarify:
'We split it up into three levels: Beginner, Intermediate and Advanced.
[Q: what do you split into three levels, Steve's book or the ratios of the counts that you are applying for the three levels?]
Here are the rounded numbers that we got:
English = 1, 1, 1
Spanish, French, Portuguese, Italian, Japanese, Swedish = 1.15, 1.25, 1.3
German, Chinese = 1.35, 1.45, 1.5
Russian = 1.7, 2.1, 2.2
Korean = 1.8, 2.2, 2.5'
Q: Are these 'measured' numbers (or do they reflect someone's, maybe my, intuition)? How did you get them? Like Paramecium did, or differently?
I like these 3 sets of numbers. They are in agreement with what one could expect from Heaps' law. For each language but one conventionally chosen, in your case English, the ratios of the counts are expected, in theory, to change slowly with the size of the counts (with the user's experience, with vocabulary size).
Only, with Heaps' law, whose parameters are known, you don't have to arbitrarily define user experience at all. Heaps' law automatically changes the ratios of the counts as the counts accumulate.
If the English counts are used as the 'normalized' counts, as in your case, and the user's current English count is V_eng, then the matching count in any other language is to be computed as:
V_lang = V_lang_Ref * (V_eng / V_eng_Ref) ^ (beta_lang / beta_eng), the Count Conversion (CC) formula.
In the CC formula, V_lang_Ref and V_eng_Ref are the vocabulary (LingQ) counts over the entire reference document (let it be Steve's book) in the 'converted' language and in English respectively; the betas are the Heaps exponents for the two languages. These parameters must be measured. Later in the day I'll describe again how these parameters are to be measured.
'…it's maybe a little above and beyond what we were looking to go with' - I agree, and I understand it is not what attracts users. It's fully up to LingQ.
However, it doesn't repel users either. And it seems to describe how these vocabulary counts work in real life. And LingQ may be the first in the world to come up with a language-independent vocabulary characterization, and with the determination of the Heaps parameters for a wide set of languages. (Actually, I haven't researched this much, just a bit more than I cited. Still, everybody seems to cite the same two Russians when it comes to the actual dependence of the Heaps parameters on the language.)"
This also raises the question: how many English words correspond to A1, A2, B1, B2, C1, C2, etc.?
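A minimal sketch of that Count Conversion formula in Python, assuming you already have the reference counts and per-language Heaps exponents (all the numbers below are invented placeholders, not measured values):

```python
# Count Conversion (CC) formula from the quoted post:
#   V_lang = V_lang_Ref * (V_eng / V_eng_Ref) ** (beta_lang / beta_eng)
# V_*_Ref are unique-word counts over one reference document available in
# both languages; the betas are Heaps-law exponents, measured per language.

def convert_count(v_eng, v_eng_ref, v_lang_ref, beta_eng, beta_lang):
    """Map an English vocabulary count to the matching count in another language."""
    return v_lang_ref * (v_eng / v_eng_ref) ** (beta_lang / beta_eng)

# Invented example: a reference book with 8,000 unique English words and
# 16,000 unique Russian words, plus made-up exponents.
print(round(convert_count(v_eng=20_000, v_eng_ref=8_000, v_lang_ref=16_000,
                          beta_eng=0.60, beta_lang=0.70)))  # ~46,600
```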

My God, this is wonderful.

Like Francisco I only have access to a tiny laptop and very slow desktop at the moment, but I have to find out how this calibre software works regardless. Iā€™ve heard the Master mention it many times over the years in his YouTube vids, but havenā€™t used it as of yet and the few Spanish books I had were able to import directly into linq cutting and pasting from a PDF.