Korean Number of Words (per LingQ) and language level

kraemder · March 4, 2023, 12:55am

I’m wondering if LingQ keeps a reference per language of how many known words you need in LingQ to be A1/A2/B1/B2 etc. I’m studying Korean, where learning one word can earn me credit for maybe 100 words because of all the conjugations, so it’s hard to judge my CEFR level even roughly. But maybe someone tracks this stuff?

Edit: I’m currently at 16,873 words in Korean per LingQ.

Hagowingchun · March 6, 2023, 5:57am

A rough conversion, from the data I have collected, is to grab whatever people say for Romance languages and multiply it by 5. But generally you could base it on what you can understand: if you can follow a newscast where they speak clearly, then you’re at least B2, and things of this nature. Or reading a Harry Potter book without a dictionary would be at least B2. These would be my best assessments. I have just under 40,000 words in Spanish but would not consider myself C2, because I have read over 6 million words, which increases the number of word forms/tokens that I have confronted and therefore raises the bar, so to speak.

Another thing you could do is look at what % of the total word forms you know. A native would be at 95+% if it’s normal stuff, so one could deduce levels from there, from known words vs. LingQs, etc.
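As a rough sketch of that percentage idea (the function name and the toy word lists are made up for illustration, and this is not how LingQ computes anything):

```python
def known_share(known_forms, text_forms):
    """Percentage of the distinct word forms in a text that are marked known."""
    distinct = set(text_forms)
    if not distinct:
        return 0.0
    return 100.0 * len(distinct & set(known_forms)) / len(distinct)


# Toy data, invented for illustration only.
text = ["아버지가", "방에", "들어가신다", "방에"]
known = {"아버지가", "방에"}
print(f"{known_share(known, text):.1f}% of distinct forms known")  # 66.7%
```

Repeated forms are counted once here, since the question is about distinct word forms rather than running words.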

I wish you the best of luck in your Korean adventures.

noxialisrex · March 6, 2023, 4:15pm

Something to note with this: not all possible inflections/conjugations of a word are likely to come up in reading.
In Swedish, for example, nouns have these forms:

Indefinite Singular
Definite Singular
Indefinite Singular Genitive
Definite Singular Genitive
Indefinite Plural
Definite Plural
Indefinite Plural Genitive
Definite Plural Genitive

So each noun has roughly 8 words as LingQ would define them. But LingQ would only capture them as known if they appeared in a text. So before you are tempted to do complicated comparisons between Korean and another language to come to a guess, I’d think about which lemmas are likely to come up in a text.

From there I’d go back to the morphemes you need to know to reach the CEFR levels. As an example, I read a paper that estimated the following was needed for Swedish:

A1 - 2.500
A2 - 7.500
B1 - 20.000
B2 - 32.000
C1 - 40.000

I would guess this pattern is pretty similar for all languages, but you can do research on this for Korean. Because Korean is agglutinative, I would think it combines its morphemes in regular ways. So then the question is, roughly how many of these combinations are you likely to run into? With Swedish, I’d be tempted to multiply these by 2,5, so let’s double it to 5 for Korean:

A1 - 12.500
A2 - 37.500
B1 - 100.000
B2 - 160.000
C1 - 200.000

Based on some of the Korean known word counts I have seen on LingQ, I think this doesn’t look completely unreasonable, but who knows, this is paper napkin math after all.
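For anyone who wants to play with the multiplier, the napkin math can be sketched like this. The Swedish figures are the ones quoted above; the factor of 5 is only my guess, and the function names are made up for this example:

```python
# Swedish estimates quoted above.
SWEDISH = {"A1": 2_500, "A2": 7_500, "B1": 20_000, "B2": 32_000, "C1": 40_000}


def scale_thresholds(base, factor):
    """Multiply every level threshold by the same guessed factor."""
    return {level: round(count * factor) for level, count in base.items()}


def estimate_level(known_words, thresholds):
    """Highest level whose threshold has been reached, or None if below A1."""
    reached = None
    for level, minimum in sorted(thresholds.items(), key=lambda kv: kv[1]):
        if known_words >= minimum:
            reached = level
    return reached


korean = scale_thresholds(SWEDISH, 5)  # factor 5 is a guess, not a measurement
print(korean["B1"])                    # 100000
print(estimate_level(16_873, korean))  # A1
```

With these guessed thresholds, the 16,873 known words mentioned in the first post would land in A1, still well short of the 37.500 guessed for A2.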

One thing to stress: LingQ only knows the words you know because you saw them in a text. This means that over time, more and more of the words you meet will already be “known” to you before you ever see them in a text. To get the most out of LingQ, I would not recommend importing inflection or conjugation tables, dictionaries, etc. with the purpose of getting that known word count raised. If you’re at a point where you truly knew all of that, you’d be at a point where what you need is actually the opposite of what LingQ naturally provides, i.e., you only need a pop-up dictionary for the handful of words you don’t know.

nfera · March 7, 2023, 4:32pm

LingQ doesn’t claim that their levels are the same as the CEFR, but their levels are very similar. If you use LingQ in the way it’s designed to be used, then the lower and intermediate levels probably do reasonably correspond to the CEFR. That is, if you have marked lots of Known Words but don’t have much reading and listening practice, you wouldn’t really be at that comprehension level. Obviously there are lots of caveats, but the only way to know is to take a test yourself.

I tested it for Italian and my experience was that Intermediate 1 = B1. I’m almost at Intermediate 2 (a few hundred Known Words away) and I feel I’m probably B2. I can watch a lot of TV shows and even though I don’t understand every word, I understand the story. I’ll take an online B2 test when I cross the threshold.

With regard to Korean specifically, my best guess would be to follow LingQ’s levels. They are probably a reasonable guess. That assumes you do lots of reading and listening though, not just marking lists of Known Words.

noxialisrex · March 7, 2023, 6:56pm

The LingQ levels are a good estimate for English and Romance languages, but they are too low for synthetic languages because the same morpheme will be combined in so many different words.

Russian, Korean, and a few others I think, used to put Advanced 2 at something like 60.000 words before moving that number down to the current system.

Hagowingchun · March 8, 2023, 6:13am

Also, Korean is an East Asian language, which means that not too long ago it was written with characters. I don’t know if this has a direct effect, but Korean has official spacing rules that natives struggle with, so lots of things are written without spaces when they should have spaces, and all Sino-Korean (Chinese-derived) words are written without spaces. Add the 20 noun “particles” and the 50-100 common conjugations that exist, and there are tons of combinations.

bamboozled · March 8, 2023, 6:46am

I literally know nothing about Korean, but I did notice that Korean on LingQ doesn’t employ a word splitter or tokenizer. This came as a surprise because LingQ does so in Chinese and Japanese.
For example, import this sentence into LingQ: “아버지가방에들어가신다”. LingQ considers this one word?! Wouldn’t the expected result be: “아버지가 + 방에 + 들어가신다”?
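A tiny sketch of what I suspect is happening. This assumes the import simply splits on whitespace, which is my guess from the behaviour and not something I have seen in LingQ's code:

```python
def whitespace_tokens(text):
    """Split on whitespace only, with no Korean-aware segmentation."""
    return text.split()


unspaced = "아버지가방에들어가신다"
spaced = "아버지가 방에 들어가신다"

print(len(whitespace_tokens(unspaced)))  # 1
print(len(whitespace_tokens(spaced)))    # 3
```

Under that assumption, the unspaced sentence becomes a single "word", while the properly spaced version yields the three expected pieces.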

Hagowingchun · March 8, 2023, 10:01am

Very rarely it is written as 아버지가방에들어가신다, but this could take a few different forms, like
아버지가 방에 들어 가신다 or 아버지가 방에 들어가신다. With longer sentences that add grammar onto verbs or use modifying noun phrases, people will leave out spaces where there should be some. And since it’s agglutinative, technically there are something like 500 verb conjugations because of so many (commonly used) abbreviations.

I’ve seen 산들 바람 (gentle + wind) and 산들바람 (gentle breeze) written both ways.

Korean also has tons of native compound verbs like 날아다니다 (to fly around), which combines two already existing verbs, 날다 (to fly) and 다니다 (to go about), and which could be written as 날아 다니다 or 날아다니다.

Just an example of Korean verb conjugation (the quote form, basically " "):
다고 하다 (normal)
대 (abbreviated form)
대요 (abbreviated polite form)
다고 해요 (polite form; can be written like this or without a space, like below)
다고해요
다고해 (more casual form without a space; can also be written as 다고 해)
다고하지 (normal form + 지; can also be written as 다고 하지)
다고하 + like 10-20 grammar forms added on to this (with or without a space)
A few examples of this:
다고 하면서 while x person said
다고 하면 if x person said
다고 해도 even if x person said
다고 하고 and x person said
다고 하고요 and x person said, but with a trailing and…
다고 하는데 but x person said
다고 하는데요 same as above, with a hanging but…
다고 하는데도 even though x person said
다고 하고 있다 x person is saying
다고 하니까 x person said x so
다고 해서 x person said x (a different way to say it, with other nuances)
다고 한 후에 after x person said
다고 하기 전에 before x person said
다고 하자마자 right after x person said
다고 한 뒤에 after x person said
다고 할 때 when x person said
다고 하기 때문에 because of x person saying
다고 하지만 x person said x but
There are probably another 20 common ones, but any of these could be written with one space or two (or none), like:
다고 하기 전에 (correct in books and stuff)
다고 하기전에 (can be seen on blogs and stuff)
다고하기 전에 (same as above)
or the worst being
다고하기전에 (way more rare, more like text messages or anything an uneducated person would write)
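To show how fast the spacing alone multiplies "words", here is a small sketch that generates every spacing variant of a phrase. The function name is made up; the phrase is the one from the list above:

```python
from itertools import product


def spacing_variants(parts):
    """Every way of joining the parts with or without a space at each gap."""
    variants = []
    for gaps in product([" ", ""], repeat=len(parts) - 1):
        text = parts[0]
        for gap, part in zip(gaps, parts[1:]):
            text += gap + part
        variants.append(text)
    return variants


for variant in spacing_variants(["다고", "하기", "전에"]):
    print(variant)
# 다고 하기 전에
# 다고 하기전에
# 다고하기 전에
# 다고하기전에
```

Three parts give 2 × 2 = 4 written forms, and a tool that splits only on spaces will see different "words" in each of them.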

This is just for declarative sentences. The 다고 (the quote marker) can be any of the following depending on sentence type:
다고/라고 (indirect quotes; 라고 is used with the copula “이다” aka sentences that end in nouns instead of verbs or adjectives)
냐고 (quoted questions)
자고 (quoted suggestions)
라고 (quoted commands)

But yeah, Korean is a bit messy, haha. Sorry if this is long, but it does a decent job of showing why Korean has so many words. There are multiple people on LingQ with over 120k known words with only 1-2 million words read. This is unprecedented compared with other languages.

Nouns are less crazy, with only, say, 10-20 common particles, but there are abbreviations for those as well, so maybe 15-30. But yeah, Korean verbs and adjectives are both conjugated like this when they end a sentence. There are also different kinds of conjugations, probably only 20-50 total, that modify a verb or adjective when it comes before a noun, so lots of possibilities. If you have any questions about this, just let me know.

A Korean word splitter would be amazing, and I think it would be easier to employ for Korean than for Chinese or Japanese because the spacing is more predictable, although I have no idea how hard it is to “turn on” a tokenizer for Korean.

bamboozled · March 8, 2023, 11:03am

Word segmentation / tokenization is part of NLP (natural language processing) and remains an active research area. You can probably find many articles dealing with this problem. I guess the problem cannot truly be solved, but modern applications can get close to correct results. At least for Chinese some authors claim accuracy levels above 97%. But these implementations typically use machine learning and neural networks, and are thus unsuited for use by LingQ. I assume they require very light (memory / CPU) implementations, so that costs remain low and importing doesn’t take forever. The tokenizer they use for Chinese is a bit primitive and achieves up to 85% or so in standardized tests, i.e. not real life. But you are right, the results should be far superior in Korean because most words have already been split correctly by native speakers.
As I said, I’m clueless about Korean, but a quick Google search turned up this software as an example: KoNLPy (Korean NLP in Python, KoNLPy 0.6.0 documentation). That’s also where I got the above sample sentence from.
I’ll raise this issue on LingQ’s Slack channel, maybe they will look into implementing something like this.

nfera · March 8, 2023, 11:17am

If the difference is so massive that people have 120k Known Words for ~2 million words read, then, as you guys are saying, surely the Korean levels need to be updated. The issue is that, as @noxialisrex said, you can only mark words as Known when you encounter them (in reading). Because of this, you can’t directly apply theoretical morphemes to the levels on LingQ. I guess the easiest way to establish the most accurate levels is from direct feedback from those who have studied Korean on LingQ. Perhaps there are a few people who have used LingQ for Korean and can provide some feedback and their opinion? (It would be better to do it, as I did, by taking an online CEFR test, but it’s not always possible.)

Hagowingchun · March 8, 2023, 11:32am

@nfera I am not completely sure about the word counts for specific levels. That would be amazingly helpful, haha, at least as general guidelines. But for any language one can count: this person read 1 million words in Spanish and encountered 40k tokens, 1 million words in Russian and encountered 80k tokens, and 1 million words in Korean and encountered 170k tokens, or something like this, just for ballparks. I have done this with a friend of mine for a few of the languages and plotted them. A few things can affect this, like complexity of material, re-reading, etc. What originally started this was the fact that Chinese has more tokens than Spanish, even though Chinese has 0 conjugations vs. the 60-80 of Spanish (not counting infinitives with pronouns or double pronouns). This is because of the compound words of Chinese: a language with compound words will have way more words overall, since it is abusing suffixes and prefixes, or in this case characters.
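Just to put those ballparks side by side, here is a quick sketch. The numbers are the rough ones from this post, not measured data, and the names are made up:

```python
# Ballpark figures from above: distinct word forms met per 1 million words read.
FORMS_PER_MILLION = {"Spanish": 40_000, "Russian": 80_000, "Korean": 170_000}


def relative_to(baseline, counts):
    """Express each language's count as a multiple of the baseline language."""
    base = counts[baseline]
    return {lang: count / base for lang, count in counts.items()}


for lang, ratio in relative_to("Spanish", FORMS_PER_MILLION).items():
    print(f"{lang}: {ratio:.2f}x Spanish")
# Spanish: 1.00x Spanish
# Russian: 2.00x Spanish
# Korean: 4.25x Spanish
```

The Korean figure comes out at 4.25 times the Spanish one, which is in the same ballpark as the "multiply by 5" rule of thumb from earlier in the thread.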

nfera · March 8, 2023, 11:44am

@Hagowingchun Then the question is: is it faster to read 1 million tokens in Korean/Chinese than 1 million tokens in Spanish? In other words, for the same given time, would you end up reading more in Chinese/Korean than in, say, Spanish?

noxialisrex · March 8, 2023, 1:19pm

I do wonder how implementing segmentation/tokenization may affect existing users’ words already known/LingQ’d. Do you leave them be, or go back and retroactively apply them? A similar strategy could be used for any agglutinative language on LingQ. Any polysynthetic languages on LingQ would require this on launch due to the incredibly high volume of morphemes per word.

A fun example from Greenlandic:
Qiteqatigerusuppingaa? – Would you like to dance with me?

Tamarind · March 8, 2023, 2:14pm

@noxialisrex
➞ so let’s double it to 5 for Korean:
A1 - 12.500
A2 - 37.500
B1 - 100.000
B2 - 160.000
C1 - 200.000

On top of the agglutination Korean has native words topped with Sino-Korean vocabulary, regional Korean dialects, English loan words, etc.

My guess is that I reached B1 much earlier than 100K as a result of becoming very familiar with native standard Korean words.

Just the other day I learned 부친 also means father, 父親. According to the Daum dictionary’s definitions of father I’m at level 3, and have a few more to learn:

1. 아버지
2. 부친
3. 신부
4. 창시자
5. 성직자

All this to say, it’s very possible to be comfortably fluent in daily life Korean, and then turn on the news or open a book and not understand anything. My guesses for word count/CEFR levels are:

A1 - 15K
A2 - 30K
B1 - 75K
B2 - 225K
C1 - 675K

The higher your level, the bigger the climb to reach the next level. But not quite exponential.
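For the curious, here is a quick check of how big each climb is in these guesses. The numbers are only my guesses from above, and the function name is made up:

```python
# Guessed known-word thresholds from above.
GUESS = {"A1": 15_000, "A2": 30_000, "B1": 75_000, "B2": 225_000, "C1": 675_000}


def step_ratios(thresholds):
    """Ratio between each level's threshold and the one before it."""
    levels = list(thresholds)
    return {
        f"{lower}->{upper}": thresholds[upper] / thresholds[lower]
        for lower, upper in zip(levels, levels[1:])
    }


print(step_ratios(GUESS))
# {'A1->A2': 2.0, 'A2->B1': 2.5, 'B1->B2': 3.0, 'B2->C1': 3.0}
```

A strictly exponential progression would multiply by the same factor at every step; these guesses multiply by 2, 2.5, 3 and 3.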

bamboozled · March 8, 2023, 2:30pm

Well, the chances of this getting implemented appear to be slim, so this discussion is probably going to remain theoretical. Anyway, if a word splitter were implemented, or in fact if the splitter software for Japanese or Chinese changed, it wouldn’t really impact users. Newly imported lessons would be split using the new algorithm; everything else remains as is. Previously LingQed words that are now split appear as phrases.

What I had in mind, and what is currently done for Chinese and Japanese, is just a simple word splitter. In essence this is just a glorified dictionary with frequency information at the backend. A more encompassing solution could attempt to identify inflected word forms, derived forms (suffixes / prefixes) etc. and only count word families. I don’t know how well this works in practice, but this would of course have a dramatic impact. This isn’t going to happen at LingQ obviously.
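To make the “glorified dictionary with frequency information” idea concrete, here is a minimal sketch. The lexicon entries and their frequencies are invented for illustration, and this is not LingQ's actual splitter; it simply picks the split whose words have the highest combined frequency score:

```python
import math

# Hypothetical toy lexicon: word form -> made-up relative frequency.
LEXICON = {
    "아버지가": 50,    # father + subject particle
    "아버지": 40,      # father
    "방에": 30,        # room + location particle
    "가방에": 5,       # bag + location particle
    "들어가신다": 10,  # goes in (honorific)
}


def segment(text, lexicon=LEXICON):
    """Most probable split of text into lexicon words, or None if impossible."""
    total = sum(lexicon.values())
    max_len = max(len(word) for word in lexicon)
    # best[i] holds (log-probability, tokens) for the best split of text[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            prev_score, prev_tokens = best[start]
            if piece in lexicon and prev_tokens is not None:
                score = prev_score + math.log(lexicon[piece] / total)
                if score > best[end][0]:
                    best[end] = (score, prev_tokens + [piece])
    return best[-1][1]


print(segment("아버지가방에들어가신다"))
# ['아버지가', '방에', '들어가신다']
```

With these made-up frequencies the splitter prefers “father goes into the room” over the other dictionary-valid reading, 아버지 + 가방에 + 들어가신다 (“father goes into the bag”). Change the frequencies and the result flips, which is exactly why the quality of the dictionary matters so much.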

Coming back to the topic, I think the best way would indeed be to collect feedback from users, gather opinions and experiences, and then adjust the levels accordingly. There is no right or wrong, these are just heuristics. That the current levels are almost identical across languages is strange and doesn’t seem to reflect reality.

On a tangent, I’m not really a fan of word segmentation, it just happens to be a necessary evil. I have struggled with the system quite a bit, here are some of my opinions: Chinese Word Recognition (Wrong Characters Are Combined) ...

The quality of the word segmenter in Chinese is also the reason for the inflated word counts over there. Of course it depends on how you use LingQ, what and how much you read. If one only reads fiction the (blue word) system works quite well, but, for example, inaccurate transcripts or anything technical make the blue word count balloon. Users who want to preserve their sanity will either take their reading elsewhere or disable highlighting thus inflating their word count. I’ll attach an example for a piece of tech news (here the English words join the surrounding Chinese characters creating an infinite stream of blues):

Tamarind · March 8, 2023, 3:05pm

And THIS is why I like using LingQ on a computer. I’m wearing out my ‘X’ key “remove/ignore.”

Hagowingchun · March 8, 2023, 3:12pm

I’m glad a way more experienced Korean learner posted. I hypothesized that Korean was that crazy, but with theory vs. practice one never knows what it will be. Also, do you study hanja at all, or was 父親 just from the dictionary? Man, 700k would look awesome on the leaderboards.
4 and 5 are both very easy with hanja, but then again learning characters is insanely time- and energy-consuming. I wonder if it would be worth the time if I had to do it over again.
부친 父親
신부 神父
창시자 創始者
성직자 聖職者

noxialisrex · March 8, 2023, 3:19pm

I have almost no knowledge of Korean except a few basic factlets, so feel free to manipulate my numbers however you guys like. One point I do have on C1: I don’t doubt that the total pool of words that an advanced Korean speaker would recognize is ridiculous, say 675K like you said. However, there is a phenomenon I think is not quite captured, and that’s what percentage of the words you know LingQ knows you know.

When you’re in the beginning, that’s really easy. You know nothing, and LingQ knows exactly what words you know. But as your knowledge grows, the quantity of words you’ll effortlessly recognize at first sight in LingQ will start to increase.

That’s why I think that in real terms, you’re absolutely right, the difference between B2 and C1 will be an exponential one. But in LingQ known word terms, that difference could be even smaller than B1 to B2.

Tamarind · March 8, 2023, 3:22pm

I’m not consciously studying Hanja, I’m just using them (from the dictionary) to parse multiple Korean definitions. 부친 is a good example.

I knew 부친 very well as it’s commonly used to mean pan-fried. While reading I came across many other definitions of that word. Hanja helps me clarify word definitions when there are many, many meanings.

Someday I’d like to return to studying the Japanese I learned in college decades ago, so I don’t mind using Chinese characters.

Tamarind · March 8, 2023, 4:05pm

Yeah, I use the LingQ numbers to keep moving, to continue going through text, and to make sure I’m not giving up or slowing down.

Beyond that I don’t find them useful. Knowing a Korean word on LingQ that has dozens of different meanings depending on context does not tell you much about your abilities. Every once in a while I’ll look at CEFR descriptions to judge my level. That’s good enough for me.