Word count in different languages

steve · January 5, 2011, 8:19pm

We want to revise our known words goals for our different languages. Since LingQ counts every form of a word, the more inflected a language is (that is, the more forms of each word there are), the higher the word count for the same level of language proficiency, so to speak. This is not realistic.

I would be interested in any research or ideas on how best to do this.

To start the ball rolling here is a preliminary recommended list of the equivalent known word levels to make things equal. Comments please.

English, Chinese, Swedish 100
French, Spanish, Portuguese, Italian 125
German, Japanese, Korean 135
Russian 150

nobody · January 5, 2011, 8:24pm

Sorry, but I don’t understand the list. What’s it for?

steve · January 5, 2011, 8:32pm

We have “known words” goals which equate to different levels at LingQ.

It looks like this for most of our languages.

Beginner 1 up to 2,500 known words
Beginner 2 up to 5,000 known words
Intermediate 1 up to 10,000 known words
Intermediate 2 up to 15,000 known words
Advanced 1 up to 20,000 known words
Advanced 2 up to 25,000 known words and beyond

We would like to make these goals more realistic for those languages that are heavily inflected, i.e. have many forms of essentially the same word, but with different endings etc.

Jamie · January 5, 2011, 8:59pm

Is German more heavily inflected than French and Italian?
I’m not so sure. Although German has its case structure for nouns, verbs and adjectives are less inflected.

This is just a thought and I’m not sure about it, but it seems to be true based on my experience in reviewing LingQs in both French and German. In French I seem to have many more LingQs for what is essentially the same word.

steve · January 5, 2011, 9:08pm

Thanks Jamie, I have really very little to go on here other than gut feel. I would be happy to put German together with the Romance languages. I await other comments.

VeraI · January 6, 2011, 6:01am

I would agree that the number of inflections for German and French is about the same. But German has a really huge number of compound words. Probably this leads to more words.

Jamie · January 6, 2011, 9:18am

Vera, I would argue that compound words are words in their own right and should be given the same weight as the words that make them up.

nobody · January 6, 2011, 10:58am

It may not be as simple as a straight multiplication factor across the levels. I lingQ all forms of a word as I meet them, and after 2 years am still coming across different inflections of words I met in “Who is She?” right at the beginning. In other words, the more learned LingQs and known words I achieve, the higher the proportion of variants I have encountered.
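One way to check whether a single factor holds across levels is to track how the number of distinct forms grows as more text is read, and compare the two languages at matching checkpoints. The sketch below is only an illustration: `growth_curve` and `ratio_by_checkpoint` are hypothetical names, tokens are assumed to be already split into word forms, and it says nothing about how LingQ itself counts.

```python
def growth_curve(tokens, step):
    """Distinct-form count after every `step` tokens read, and at the end."""
    seen = set()
    curve = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        # Record a checkpoint every `step` tokens and once at the last token.
        if i % step == 0 or i == len(tokens):
            curve.append(len(seen))
    return curve


def ratio_by_checkpoint(curve_a, curve_b):
    """Ratio of distinct forms (B over A) at each shared checkpoint."""
    return [b / a for a, b in zip(curve_a, curve_b)]


if __name__ == "__main__":
    # Toy token lists, invented purely for illustration.
    lang_a = ["the", "cat", "saw", "the", "cat", "again"]
    lang_b = ["kot", "kota", "kotu", "kot", "kotom", "kote"]
    print(ratio_by_checkpoint(growth_curve(lang_a, 2), growth_curve(lang_b, 2)))
```

If the ratios drift upward from checkpoint to checkpoint, a flat multiplier would understate the gap at the higher levels.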

jeff_lindqvist · January 6, 2011, 4:17pm

German still has a handful of adjective endings which may confuse the beginner, and just about as many for verbs. Whether this alone puts German in another difficulty category, I’m not sure.

If we count compound words, German is beaten by the so-called “easy” (but highly agglutinative) Esperanto. How about this:

malsanulejestraranino = female board member of the hospital
mal- = un-
sana = healthy
-ul- = suffix for people
-ej- = suffix for places
=> malsanulejo (“place for sick people”) is a commonly-used word for ‘hospital’
-estr- = suffix for a leader or head of sth
-ar- = suffix for a group or collection
=> estraro is a group of leaders, e. g. a board of directors or the like
-an- = suffix for a member of a group
-in- = suffix for a female
-o = noun

(Source: Choose a global language: E vs G (General discussion) Language Learning Forum )

VeraI · January 6, 2011, 4:25pm

@Jamie: This works in “the real world” but LingQ counts each compound word as a new word. If you know “Auto” and “Schlüssel” you can guess what “Autoschlüssel” means. LingQ will count this as 3 words! Therefore you have to “know” more words compared to English where you’ll have 2 words.

steve · January 6, 2011, 4:44pm

I think there are compensating factors in all languages. I wonder if there is any research out there. In any case I am leaning towards having three groups of languages.

group one English, Swedish, Chinese factor 100
group two Romance languages, German, Japanese, Korean factor 130
group three Slavic languages factor 160

Any thoughts? This need not be accurate, just fairer than what we have now.
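To see roughly what these factors would do to the goals, here is a minimal sketch. It assumes each factor is a percentage applied to the current thresholds, which the thread implies but does not spell out. The names `BASE_GOALS`, `GROUP_FACTORS` and `scaled_goals` are made up for illustration.

```python
# Current "known words" thresholds quoted earlier in the thread.
BASE_GOALS = {
    "Beginner 1": 2500,
    "Beginner 2": 5000,
    "Intermediate 1": 10000,
    "Intermediate 2": 15000,
    "Advanced 1": 20000,
    "Advanced 2": 25000,
}

# Proposed group factors; reading them as percentages is an assumption.
GROUP_FACTORS = {
    "group one": 100,
    "group two": 130,
    "group three": 160,
}


def scaled_goals(factor):
    """Scale every base threshold by factor/100 using integer arithmetic."""
    return {level: goal * factor // 100 for level, goal in BASE_GOALS.items()}


if __name__ == "__main__":
    for group, factor in GROUP_FACTORS.items():
        print(group, scaled_goals(factor))
```

Under that reading, a group three learner would need 4,000 known words to finish Beginner 1 instead of 2,500.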

Eginhard · January 6, 2011, 6:01pm

For languages with a reasonable (and at best similar) amount of content on LingQ you might also be able to derive these factors from the total number of different words in the respective language.

steve · January 6, 2011, 6:13pm

Eginhard,

You are right. We have compared the word count for our common content in the LingQ beginner series. We should probably look at the situation for a larger sample. However, all the libraries are not the same size. Any suggestions?

08nessh · January 6, 2011, 7:15pm

I would suggest taking a sample like you did for the beginner series, but for every level: take around ten lessons from each library that are marked as intermediate and ten that are marked as advanced.

Jamie · January 6, 2011, 8:29pm

You shouldn’t need to analyse LingQ data at all.

Corpus studies exist that will give you this kind of information (word frequency statistics etc.) over a vastly larger sample size.
It must be possible to normalise these data to enable you to compare different languages directly to see which ones require more words to be learned for the same total word count. I think you can get such data on the internet.

I will try and have a look at this tomorrow, time permitting… but would welcome others’ thoughts…

nobody · January 6, 2011, 8:36pm

I did a small experiment. I have analysed the first 23 chapters (about 300 pages) of “Master and Margarita” by Bulgakov with LingQ in English and Russian.

Result:

The 23 chapters of the English version contain 8793 different words recognized by LingQ.
The Russian version contains 18651 different words.

That would make a ratio of 1:2.12 (English:Russian).
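The arithmetic behind that ratio, plus a rough way to repeat the experiment on any pair of texts, can be sketched as follows. The tokenisation here (lowercasing and splitting on `\w+`) is an assumption and may well differ from how LingQ recognises words. `distinct_forms` and `form_ratio` are hypothetical helper names.

```python
import re


def distinct_forms(text):
    """Return the set of distinct lowercase word forms in text.

    Every inflected form counts separately; no lemmatisation is done.
    """
    return set(re.findall(r"\w+", text.lower()))


def form_ratio(count_a, count_b):
    """Ratio of distinct forms in language B to those in language A."""
    return count_b / count_a


if __name__ == "__main__":
    # Figures reported in the post above: English 8793, Russian 18651.
    print(round(form_ratio(8793, 18651), 2))
```

For two translations of the same book, `form_ratio(len(distinct_forms(text_a)), len(distinct_forms(text_b)))` would give the equivalent figure.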

Jamie · January 6, 2011, 9:02pm

Interesting!
I suppose the conclusion from this is that for each word the English student learns, the Russian student needs to learn 2.12 words to keep to the same level of proficiency.

Shame this book isn’t translated into all the LingQ languages.

Ilya_L · January 6, 2011, 9:45pm

@Jamie,

“I suppose the conclusion from this is that for each word the English student learns, the Russian student needs to learn 2.12 words to keep to the same level of proficiency”.

I understand you might be joking, but the conclusion might be wrong :-). The correct conclusion is only that LingQ has distinguished 2.12 times as many different word tokens (character combinations) in Russian as in English.

Some of these character combinations in Russian are as similar as the words “similar” and “similarly” in English. Starting from a rather early level, the student does not learn such Russian variants separately, as independent words, just as one would not with the English “similar” and “similarly”.

I agree that an estimate can be found through the statistics, but 2.12 seems to be an exaggeration, IMO. I also agree that LingQ needs no exact estimate, and believe it is just a way for Steve to launch a thread and for me to type in English.

Ilya_L · January 6, 2011, 10:05pm

Just for fun, with no disrespect nor relation to Steve’s question.

How many unknown English words will LingQ count in this paragraph?

“Aocdrnig to a rscheeahcr at an Elingsh uinervtisy, it deosn’t mttaer in what oredr the ltteers in a wrod are, the olny iprmoetnt tihng is that the frist and lsat ltteer are in the rghit pclae. The rset can be a toatl mses and you can still raed it wouthit porbelm. This is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe”.

Ilya_L · January 6, 2011, 10:09pm

Above, all characters except the first and the last in each word were randomly permuted by the computer. Sorry for the off-topic.
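For anyone curious, that kind of scrambling can be sketched in a few lines of Python. This is not the program used for the paragraph above, just one way to do it. `scramble_word` and `scramble_text` are made-up names, and only runs of ASCII letters are treated as words (so “doesn’t” is handled as “doesn” and “t”).

```python
import random
import re


def scramble_word(word, rng):
    """Shuffle the interior letters of word, keeping first and last in place."""
    if len(word) <= 3:
        return word  # nothing to permute in words of three letters or fewer
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def scramble_text(text, seed=None):
    """Scramble every run of ASCII letters in text; leave the rest untouched."""
    rng = random.Random(seed)
    return re.sub(r"[A-Za-z]+", lambda m: scramble_word(m.group(0), rng), text)


if __name__ == "__main__":
    print(scramble_text("According to a researcher at an English university"))
```

Each scrambled word would show up in LingQ as a new, unknown word form, even though a human reader recognises it straight away.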