How many words would the average native speaker know using LingQ's way of counting?

mark.e · September 28, 2021, 7:19pm

This is meant to be just a fun question. I’m curious what you guys think? 100K?
(German/French)

ericb100 · September 28, 2021, 8:00pm

Not sure…At one point I had started to throw English articles into LingQ to see where I ended up at, but I stopped =D.

aronald · September 28, 2021, 8:16pm

It would be a lot more than 100k. Keep in mind that there is a difference between “knowing” a word and “encountering a known word”. The big difference is that it could take years of reading to encounter a word that you already know. A better comparison would be: if a native speaker read 5M words, how many known words in LingQ would they have? I’m sure there are some people who read in their native language on LingQ that would be great for this question. 120k is probably a decent guess for Spanish and I’m sure French wouldn’t be too much different.

If a native speaker knows 20k word families and each word family contains 10 word forms (assuming Spanish) then that’s 200k right there. Maybe the average number of word forms is less than 10, but it’s definitely more than 5. So I’d say the total word forms is 150k-175k.
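The arithmetic above can be sketched in a few lines of Python. This only illustrates the multiplication: the function name `estimated_word_forms` is made up for the example, and the inputs are the guesses from this post (20k word families, 5 to 10 forms per family), not measured values.

```python
def estimated_word_forms(word_families: int, forms_per_family: float) -> int:
    """Rough upper estimate: every known family contributes all of its forms."""
    return round(word_families * forms_per_family)


# Guessed inputs from the post above, not measured data.
WORD_FAMILIES = 20_000

for forms in (5, 7.5, 8.75, 10):
    total = estimated_word_forms(WORD_FAMILIES, forms)
    print(f"{forms} forms per family -> {total:,} word forms")
```

With these inputs, an average of 7.5 to 8.75 forms per family corresponds to the 150k-175k range.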

For German it could be 1.5x higher (total guess here since I’m not learning German). Slavic languages are probably at least 2x higher.

Also, if a language learner knows, say 20k word forms, the reading easiness will probably be the same for any language even though the declined language learners are still “learning” new words to get from 150k to 300k known words.

I’d take a look at people who have the highest words read counts in French and German (excluding maybe the top 1-2 because I don’t trust their numbers), add their total known words + LingQs not learned yet, and then assume a native knows at least 98% of those words. That’s probably the best estimate that you’re going to get.
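A minimal sketch of that estimate, assuming you copy the two numbers by hand from a heavy reader's profile. The function name and the sample figures are hypothetical and only show the calculation; the 98% share is the assumption stated above.

```python
def native_estimate(known_words: int, lingqs_not_learned: int,
                    native_share: float = 0.98) -> int:
    """Estimate a native speaker's count from one heavy reader's statistics."""
    words_seen = known_words + lingqs_not_learned
    return round(words_seen * native_share)


# Hypothetical learner statistics, chosen only to show the calculation.
print(native_estimate(known_words=100_000, lingqs_not_learned=10_000))
```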

edit: I checked a few of the top people in French and it looks like 100k-110k known words per 5M read would be a decent estimate. For German it’s about 180k. Russian numbers are close to German.

Kotatsukatze · September 29, 2021, 11:03am

Depends on the language. I use LingQ for Japanese and Swedish. Japanese has very few conjugations and the amount of “words” in LingQ is nearly identical to the words I know (I know 5000, LingQ shows 5067). Swedish has many more conjugations and I know around 2000 Swedish words but LingQ shows nearly 3000. For German, which has even more conjugations than Swedish, it would probably be higher still.

For Japanese I would guess 100k. For Swedish I would say maybe 200k (the smaller amount of words already considered, Japanese has more words than Swedish).

Hagowingchun · September 29, 2021, 12:22pm

Do you use an anki deck to keep track of how many base forms you know or how are you accomplishing this?

Hagowingchun · September 29, 2021, 12:30pm

Is the 20k a basic native, and would 40k be more accurate for, say, an educated older native who reads? Also, do you think a native knows 98% of words encountered? That’s obviously the case for TV, but idk if I’m just a stickler for marking known words: in my native language I would be closer to 90% if books were being read on my account. Wikipedia articles also have random big words that are extremely specific, words that (if I cared) I would try to learn. The percentage might even be 80-85% depending on difficulty (Bible, old books, new domains, weird slang, dialects), right? This is just from my experience reading stuff in Spanish and not knowing the word in English, plus dialects in English (idk words etc).

Kotatsukatze · September 29, 2021, 12:38pm

No, I just guessed. I also wouldn’t expect that a native would have every conjugation of every word in LingQ or use every conjugation of every word he/she knows in his/her life. Otherwise the numbers would be much higher.

nobody · September 29, 2021, 7:25pm

My experience is the exact opposite. I find that there are tons of different forms for many Japanese words. Besides, many of the stems can be written in more than one way. The end result is that Lingq word count is hugely larger than the “word family” count would be.

noxialisrex · September 29, 2021, 10:13pm

This is an important concept that in LingQ a Blue/Unknown word might be totally known but you simply haven’t encountered that form of the word yet. In the beginning, yes all your unknown words are truly unknown, but there are certain inflections and conjugations you might simply see less and then when you encounter them for the first time later on you wouldn’t need to “LingQ” it.

This made me think that a “fun” way to speedrun known words would be to read a dictionary with inflections and conjugations. By fun, I mean about as fun as reading a dictionary. But it was fun to think about for a moment.

From what I’ve read, estimates seem to vary between 20,000 and 40,000 known word families for an average adult. So take {that estimate} * {the average forms per word} and you’d have their theoretical “maximum” or upper bound for known words in LingQ.

nobody · September 29, 2021, 11:20pm

This effect is very easy to notice. For example, in Russian there are some lessons that list a conjugation or declension. Reading such a lesson gives your known word count a huge boost.
I agree that it would be unbearably boring to read such material for a continued period of time. Plus I think it would just be self-deception about my actual level in the language.

Kotatsukatze · September 30, 2021, 6:30am

Maybe an example makes it more clear:
I eat/you eat/he/she eats/we eat/you eat/they eat = 食べる or 食べます

German:
I eat = ich esse
you eat = du isst / Sie essen (polite form)
he/she eats = er/sie isst
we eat = wir essen
you eat = ihr esst / Sie essen (polite form)
they eat = sie essen

Japanese has two forms (dictionary and polite) in this example and German has eight conjugations (two for polite form), of which only a few are identical if you ignore irregular verbs.
And German has multiple versions for other forms like the past too. You’ll get maybe 20 conjugations of one verb (I haven’t really counted it, just a guess).

In languages like German you have a much bigger amount of conjugations per verb. And nouns and adjectives are conjugated too. Adjectives are conjugated differently for the three genders German has for nouns. In Japanese they stay the same in all cases. You only change them for negation, time etc. That’s way fewer variants.

nobody · September 30, 2021, 9:33am

taberu, tabeta, tabete, tabenakatta, tabemasu, taberareru, tabesasenai, tabetai, tabetakunai, tabetakunakatta, tabesou, tabesaserareta
The list goes on

Kotatsukatze · September 30, 2021, 10:56am

But they are still fewer than conjugations in some languages. And the transitive and intransitive versions can also be completely separate words (e.g. 入れる (ireru) and 入る (hairu)) like in other languages.
Besides, in languages like German those constructs also change the word to its infinitive form, which Japanese doesn’t have as a separate form: ich esse/ich will essen. That would also create two “words” in LingQ.

Some of them are also much less common than a conjugated verb and also LingQ doesn’t treat something like 食べたい as one word but breaks it up into 食べ and たい. So if you have three verbs with dictionary form and the “want to” suffix it would create four “words” in LingQ and not six as expected.

I’d prefer six words actually but the Japanese analyzer is a bit strange in LingQ. :-/

I can see the effect that conjugation creates more “words” in my Swedish and Japanese accounts. The amount of “words” compared to real words I actually know is much higher in Swedish (around 50% more). I don’t have that effect in Japanese. And Swedish doesn’t even have that many conjugations, it’s really simplified compared to e.g. German.

nobody · September 30, 2021, 11:37am

They are not.
But the main point is this: you stated that there is a one to one correspondence between word forms and word families in Japanese. This is simply far from true. To what extent the proportion is larger or smaller in other languages is outside the scope of this discussion (although, for the record, I’d argue that in fact Japanese ranks rather high in the word forms to word families metric). Your assertion was simply misleading. I don’t discuss how many words you know or whether in your case they match your Lingq count. I have no way to know that.
However, it’s simply a fact that a person reading Japanese texts on Lingq and marking known word forms would have a much higher “Lingq” known word count than the number of word families they know.

nobody · September 30, 2021, 11:56am

In case you’re interested in some hard data, have a look at this paper:

It compares several measures of morphological complexity across languages and finds a high correlation. At the beginning you have a table with languages ranked according to one of those measures (Cwals). Notice that the explanation of the measure mentions some of the factors you alluded to, such as number of genders (so-called “typological” features).
Japanese ranks way higher than German and even higher than Russian, which matches my experience.
Of course, this is a general measure and there are differences with what you’ll find on Lingq. In fact the difference is even higher because you have to add inconsistencies in word parsing, which are common in Japanese and rare in alphabetic scripts and the different spellings of the same root, which can be written in kana or kanji depending on situation and often in several different kanji and occasionally in whimsical ways (e.g. I’ve found a tendency to spell words for animals in katakana in some media).
[Edit] This is the Research Gate page for the above paper, with publication details:
https://www.researchgate.net/publication/310247943_A_Comparison_Between_Morphological_Complexity_Measures_Typological_Data_vs_Language_Corpora

Hagowingchun · September 30, 2021, 1:29pm

On paper, and according to this study, Japanese beats Russian (it’s agglutinative vs. fusional), but after looking over LingQ users’ stats, the Russian learner runs into more word forms with a similar number of words read. How does LingQ’s way of counting Japanese lower its number of word forms? (I have very little actual Japanese knowledge, just hypothetical knowledge etc.) I looked at the study and looked up the word token comparison, which is how LingQ counts words, so idk where I’m failing here lol.

nobody · September 30, 2021, 6:40pm

@Hagowingchun
In the conversation above, user Kotatsukatze has been moving between two very different issues:
a) How many different word forms you can expect to encounter according to Lingq count.
b) How many word forms there are in the language per word family on average.
This confusion makes the discussion hard to follow. I tried to address both of them. My conclusion for point b is that Japanese may in fact have more morphological complexity than German or Russian, contrary to the stereotype, or, even if you allow for a big measurement error, not substantially less than those.
For point a, which is what you ask, my point is that the ratio between Lingq word forms and word families in Japanese is certainly much higher than one to one, which is what Kotatsukatze had claimed, and not very different from other morphologically complex languages.

That doesn’t mean, however, that the Lingq word count is necessarily higher in Japanese than in Russian or German, because there are other factors at play. In fact, as you point out, the opposite may be true.
To begin with another example, which may be easier to follow:
In the paper, Spanish ranks higher in morphological complexity than German. I think this is quite correct, people tend to freak out because German has noun declension but in practice it only slightly increases the number of noun forms and has even less impact in Lingq word count because some forms are rare in normal speech. At any rate, this increase is very far from making up for the larger inventory of verb forms in Spanish.
So, in reality (point b) Spanish has more forms than German. The difference in practice is not huge, so German in fact groups closely with the romance languages in Lingq word forms, probably being on the lower end.
However, if you look at the data for Lingq users, you’ll find that you need way more known words in German than in the romance languages to reach the same level. Why is that? IMO it’s mostly because of the German convention of writing together nouns that complement other nouns. So, what in English would be a very lame “tennis ball” (two words, already known, nothing to write home about) would become a formidable German “Tennisball” (new word, in blue). This is a silly example but such words add up to make a huge difference. Besides, the learner is forced to make hard choices about what to ignore and what to consider a known word: e.g. I would probably ignore “Tennisball” but consider “Bildschirm” (picture + umbrella, meaning “screen”) a separate word.
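The “Tennisball” effect can be illustrated with a toy counter. This is a sketch under simple assumptions (words are split on spaces and compared case-insensitively against a hypothetical known-word list); it does not reproduce how Lingq actually counts.

```python
def count_new_words(text: str, known: set[str]) -> int:
    """Count distinct whitespace-separated tokens that are not yet known."""
    tokens = {token.lower() for token in text.split()}
    return len(tokens - known)


# Hypothetical known-word lists for a learner who knows both parts.
known_english = {"tennis", "ball"}
known_german = {"tennis", "ball"}

print(count_new_words("tennis ball", known_english))  # both parts already known
print(count_new_words("Tennisball", known_german))    # the compound is one new token
```

The English phrase adds nothing to the count, while the German compound shows up as one new word even though both of its parts are known.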
This example goes on to show that there’s no simple relationship between actual morphological complexity and Lingq word count.
So, there’s no simple answer to your question. It’s very hard in practice to compare, say, Russian and Japanese because of the different ways of expression, writing system, frequency of the different forms and so on.
For example, the different verb prefixes in Russian probably have as much or even more impact on word count than the declension system itself. For example, English “go out” is Russian “выходить” where the вы prefix is the equivalent of English “out”.
Another important factor (maybe the most important) is learner strategy. Kotatsukatze has pointed out that Lingq’s flaky parsing of Japanese words may in fact be used by learners to reduce the number of different forms. I don’t agree that this results in a lower number of forms encountered by learners per se (because it’s very inconsistent and other parsings do artificially increase word count, you get both “kirei” and “kireina” as word forms), much less that it makes Japanese simpler than German, because you could use the same strategy there.
However, it’s true that the faulty parsing may encourage many learners to ignore a large chunk of inconsistently parsed Japanese word forms, much more so than in languages in which the parsing is consistent, for which the tendency is to take the proposed word forms at face value, even if you could just as easily ignore obvious derivations.
To sum up, it’s very possible that many learners are lingqing as proper words, forms that only exist because of spelling conventions (such as “Tennisball”) in some languages, while ignoring proper conjugations such as “tabetakunakatta” or “tabetari” in others. In the end, however, neither strategy is more correct than the other.

Hagowingchun · October 1, 2021, 4:16am

This explains it. Compound nouns are a big thing in Chinese. I know Chinese has no conjugations except a simple progressive marker, but it still has a ton of word forms (on LingQ), and that’s because it has an extremely large number of compound nouns, like German. I’ve only really studied Spanish on here so the German example was good to see. Thank you!

Kotatsukatze · October 3, 2021, 4:37pm

I’m sorry, but I just don’t have the amount of spare time to go through all the references you posted.

The topic is very complex anyhow. I still think that conjugations in languages like German will beat the amount of conjugations in Japanese easily. But on the other hand Japanese has many more words (e.g. 10,000 as basic vocabulary where Western languages have around 5,000). But a lot of them are just compound words where you just leave out the space (same in German btw, I can create millions of German words artificially that way via a script if I wanted to and they all would be valid words, you just wouldn’t find them in a dictionary).
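A minimal sketch of such a script, assuming a tiny hypothetical noun list and plain concatenation. It ignores linking elements (such as the -s- that some German compounds take), so not every output of a larger run would be well formed.

```python
from itertools import permutations

# Tiny hypothetical noun list; a real script would load thousands of nouns.
NOUNS = ["Tennis", "Ball", "Bild", "Schirm"]


def two_part_compounds(nouns: list[str]) -> list[str]:
    """Join every ordered pair of distinct nouns into one written word."""
    return [first + second.lower() for first, second in permutations(nouns, 2)]


compounds = two_part_compounds(NOUNS)
print(len(compounds))
print("Tennisball" in compounds, "Bildschirm" in compounds)
```

A list of n nouns yields n * (n - 1) two-part candidates, so about a thousand nouns already gives roughly a million strings.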

And if you count something like Japanese relative clause structures as conjugation of verbs (which you could, it works the same way as adding “to want” with “tai”), it doesn’t really make sense but you could increase the amount of “words” endlessly.

And LingQ has its own way to count words. Japanese conjugations are often counted as two words and not as one like for e.g. Swedish words where the parser just separates everything by spaces which isn’t doable with Japanese because there are no spaces.
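The point about spaces can be shown with Python’s built-in `str.split`. This is only an illustration of whitespace tokenization, not LingQ’s actual parser; the two sample sentences are chosen for the example.

```python
def split_on_spaces(text: str) -> list[str]:
    """Tokenize by whitespace only, with no language-specific analysis."""
    return text.split()


swedish = "jag vill äta"    # "I want to eat"
japanese = "私は食べたい"     # "I want to eat", written without spaces

print(split_on_spaces(swedish))
print(split_on_spaces(japanese))
```

The Swedish sentence comes back as three tokens, while the Japanese one comes back as a single unsplit string, which is why Japanese needs a separate analyzer that decides where the word boundaries are.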

So discussing this will have no real result that covers anything anyway.