LingQ Data Analysis: Visualizing Patterns in My Known Words

Findings

  1. You can estimate the number of real known words as roughly 57.6% of the number LingQ displays (in my case, 24,157 of 41,906).
  2. The part-of-speech ratio is 40% Noun, 25.5% Proper Noun, 16.2% Verb, 13.8% Adjective, and 4.5% Adverb.

Recently, I found a way to download all the known words from LingQ. (You can export every known/unknown word and LingQ as a CSV file.)

I’m still not sure how I can use this data for something useful for my learning. Maybe I can make some statistics or visualizations with word embeddings and an LLM, but I don’t have a firm idea yet.

This is what I tried today:

  1. Downloaded all the known words (41,906 words to start)
  2. Lemmatized the words (e.g., Bones → bone, did → do, was → be) (30,234 words left)
  3. Removed non-English, broken, or meaningless words (because I’m an English learner) (24,393 words left)

Interestingly, there were a lot of non-English words (including words with typos, foreign names, etc.).

  4. Removed the stop words (e.g., the, a, become, per, myself); this seems to be a standard step in the NLP community. (24,157 words left.) A sketch of steps 2–4 is below.
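In code, steps 2–4 look roughly like this. It is only a simplified sketch: spaCy for lemmatization and stop words, the wordfreq package standing in as a crude validity check for step 3, and a placeholder filename; the real cleanup was messier.

```python
# Simplified sketch of steps 2-4. Assumptions: the export is a CSV with the
# word in the first column ("known_words.csv" is a placeholder name), spaCy's
# en_core_web_sm model is installed, and wordfreq stands in for a proper
# "is this a real English word?" check.
import csv

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from wordfreq import zipf_frequency

nlp = spacy.load("en_core_web_sm")

with open("known_words.csv", newline="", encoding="utf-8") as f:
    raw_words = [row[0].strip().lower() for row in csv.reader(f) if row]

# Step 2: lemmatize (Bones -> bone, did -> do, was -> be) and deduplicate.
lemmas = {doc[0].lemma_ for doc in nlp.pipe(raw_words) if len(doc) > 0}

# Step 3: drop non-English, broken, or meaningless tokens: keep only
# alphabetic words that wordfreq has English frequency data for.
valid = {w for w in lemmas if w.isalpha() and zipf_frequency(w, "en") > 0}

# Step 4: remove stop words (the, a, per, myself, ...).
content_words = valid - STOP_WORDS

print(len(raw_words), len(lemmas), len(valid), len(content_words))
```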

This is the part-of-speech ratio of my words:
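Computing the ratio from the cleaned list takes only a few lines, continuing from the sketch above (tagging each word in isolation, without sentence context, so the tags are approximate):

```python
# Tag each cleaned word in isolation and report the share of each coarse
# part-of-speech tag (NOUN, PROPN, VERB, ADJ, ADV, ...).
from collections import Counter

pos_counts = Counter(doc[0].pos_ for doc in nlp.pipe(sorted(content_words)))
total = sum(pos_counts.values())
for pos, n in pos_counts.most_common():
    print(f"{pos:>6}: {100 * n / total:.1f}%")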


To find unknown but common words, I used the Google 10,000 words dataset, which contains the 10,000 most common English words.

  1. Conducted the same cleaning process as before (9,894 → 7,361 → 7,304 → 7,051)

This dataset doesn’t have many outliers, so the outliers detected in my known-words data were real outliers.

  2. Counted the number of unknown but common words and drew a plot.

As I expected, less common words are less likely to be known to me. In total, 709 of these common words were unknown to me.
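The comparison itself is a simple set difference. A sketch, assuming a one-word-per-line file ordered from most to least common (the filename is a placeholder, and the same cleaning as above should be applied to the list first, which is what 9,894 → 7,051 refers to):

```python
# Count which of the most common English words are missing from my known
# set and plot them by frequency rank ("google-10000.txt" is a placeholder).
import matplotlib.pyplot as plt

with open("google-10000.txt", encoding="utf-8") as f:
    common = [w.strip().lower() for w in f if w.strip()]

unknown = [(rank, w) for rank, w in enumerate(common, 1) if w not in content_words]
print(f"{len(unknown)} common words are unknown")

# Histogram per 1,000-rank bucket: rarer words should be unknown more often.
plt.hist([rank for rank, _ in unknown], bins=range(0, len(common) + 1000, 1000))
plt.xlabel("frequency rank")
plt.ylabel("unknown words")
plt.show()
```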


With NLP techniques, we can assign each word a vector that corresponds to its meaning. I computed such vector embeddings for my known words and visualized them. You can see that words with similar meanings are located close together.

This is the 2D plot:

This is the 3D plot:
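A sketch of how such plots can be produced. The embedding source is an assumption (spaCy’s en_core_web_md vectors here; any word-embedding model would do), and n_components=3 gives the 3D version:

```python
# Embed every known word and project the vectors to 2D with t-SNE, which
# tends to pull semantically similar words into visible clusters.
import matplotlib.pyplot as plt
import numpy as np
import spacy
from sklearn.manifold import TSNE

nlp_vec = spacy.load("en_core_web_md")
known_words = sorted(content_words)
known_vecs = np.array([nlp_vec.vocab[w].vector for w in known_words])

xy = TSNE(n_components=2, random_state=0).fit_transform(known_vecs)
plt.scatter(xy[:, 0], xy[:, 1], s=2)
plt.title("Known words, 2D projection")
plt.show()
```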


I’m still thinking about how I can use this to make something useful, but I haven’t reached a conclusion yet. If you have any ideas, please share them.


This is why LingQ would do well to have a good method for dealing with all these words.

Sorry to say, determining whether a word is valid, and finding its base form, are quite challenging problems. I’m not sure it’s technically possible, considering users can import any content.

When a word is in a different alphabet from the target language’s, it could be ignored. URLs could be ignored. Perhaps a word could be ignored if it is not in the LingQ dictionary. Proper nouns could be ignored, etc.

That’s right.
It’s easier to think of the number as just a rough approximation of the number of words you know. It’s at least monotonically increasing. That’s all.

Interesting! I tried to export the words to Excel once, but they came out wrong (well, maybe not technically wrong per se, since it’s Excel’s fault). Any Spanish words with an Ñ, í, or ü didn’t get written correctly.

Open the file with a text editor instead. MS Excel often fails to open CSV files written in non-English languages because it doesn’t assume UTF-8 encoding by default.
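If you do want Excel to handle it, re-saving the file with a UTF-8 byte-order mark usually makes Excel detect the encoding correctly. A minimal sketch (the filenames are placeholders):

```python
# Re-save the export with a UTF-8 byte-order mark ("utf-8-sig"); Excel uses
# the BOM to detect the encoding, so characters like Ñ, í, ü survive.
with open("lingq_export.csv", encoding="utf-8") as src, \
        open("lingq_export_for_excel.csv", "w", encoding="utf-8-sig") as dst:
    dst.write(src.read())
```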

FYI, you can download the words without waiting by using this script: [UPDATED] Customizable LingQ Reader + Add-ons - #79 by vet8t6z79pc4. Try it.

Sample:

TODO

Personalized Learning:

  1. Assess the difficulty of each unknown word by its distance from the known words’ embeddings.
  2. Find the closest known words for each unknown word (a sketch of items 1–2 follows below).
  3. Generate example sentences with an LLM for each unknown word, for study.
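A sketch of items 1 and 2. It assumes known_words/known_vecs from the earlier embedding sketch, plus an analogous unknown_words/unknown_vecs pair built the same way from the unknown words; treating distance from everything known as “difficulty” is itself an assumption.

```python
# Rank unknown words by their cosine distance to the nearest known words
# (farther from everything known = presumably harder to learn).
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(known_vecs)
dists, idxs = nn.kneighbors(unknown_vecs)

for word, d, i in zip(unknown_words, dists, idxs):
    closest = [known_words[j] for j in i]
    print(f"{word}: difficulty ~{d.mean():.2f}, closest known words: {closest}")
```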

Embedding Visualization:

  1. Overlay the unknown words’ embeddings on the known words’ embeddings to get a visual map of my lexical knowledge (see the sketch below).
  2. Check whether the unknown words form clusters, i.e., find semantic holes in my vocabulary.
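The overlay itself is a joint projection (a sketch, with the same assumed variables as above):

```python
# Project known and unknown vectors together, then draw unknown words in
# color on top of known words in black.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

xy = TSNE(n_components=2, random_state=0).fit_transform(
    np.vstack([known_vecs, unknown_vecs]))
n = len(known_vecs)
plt.scatter(xy[:n, 0], xy[:n, 1], s=2, c="black", label="known")
plt.scatter(xy[n:, 0], xy[n:, 1], s=8, c="red", label="unknown")
plt.legend()
plt.show()
```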

Graph Analysis:

  1. Build a graph from the words (nodes: words; edges: strong similarities; node classes: known, unknown common, and unknown from LingQ). A sketch follows below.
  2. Find communities of words in the graph.
  3. Find important words based on centrality.
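A sketch with networkx, reusing the assumed word lists and vectors from above. The 0.7 cosine-similarity threshold is arbitrary, and for tens of thousands of words the dense similarity matrix would have to be replaced with approximate nearest-neighbor search.

```python
# Connect word pairs whose embeddings are very similar, then look for
# communities and central words in the resulting graph.
import networkx as nx
import numpy as np

words = known_words + unknown_words
vecs = np.vstack([known_vecs, unknown_vecs]).astype(float)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9
sim = vecs @ vecs.T

G = nx.Graph()
G.add_nodes_from(words)
ii, jj = np.where(np.triu(sim, k=1) > 0.7)
G.add_edges_from((words[i], words[j]) for i, j in zip(ii, jj))

communities = nx.community.louvain_communities(G, seed=0)
central = sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1])
print(f"{len(communities)} communities; most central words: {central[:10]}")
```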

If these work well and I feel they are useful, I will share the code or a program so you can do the same analysis.

I see value in analysing things in various ways as part of a broad and creative way to improve learning.
I’m interested in how to read more interesting things at a high level of understanding, or
to have a high level of understanding even if the text isn’t all that interesting.
Weirdly, the LingQ % isn’t ideal, as it’s just blue words as far as I remember.
I’m learning Greek, so texts are moderately hard to find, with some but limited supplies of interesting texts/books at an intermediate level.
In a way, I’m interested in analysing the texts (books etc.) in comparison to my known vocabulary, rather than analysing my vocabulary itself.
Quickly picking high-interest, moderately difficult texts (or some variation of that) might need some analysis to support it.
Picking texts based on known words or semi-known words might be another way.
Mostly I just want to read but suffer in terms of vocabulary or interest.
And with AI, maybe we are near creating texts at a certain level of difficulty (possible) and at a target level of interest (I am less sure).

Which text can I read that is closest to i+1?
What’s the minimum number of words to learn in order to know 95% of a text? For which text can I reach 95% known words by learning the fewest new words?
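That last question is easy to sketch: greedily learning the most frequent unknown words in a text reaches the coverage target with the fewest new words. A rough Python sketch (the tokens should really be lemmatized to match the known list; skipped here):

```python
# Greedy answer to the 95% question: for each text, learn the most frequent
# unknown words first until 95% of the running tokens are covered.
import re
from collections import Counter

def words_to_target(text: str, known: set, target: float = 0.95) -> list:
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    counts = Counter(tokens)
    covered = sum(n for w, n in counts.items() if w in known)
    to_learn = []
    for word, n in counts.most_common():
        if covered >= target * len(tokens):
            break
        if word not in known:
            to_learn.append(word)
            covered += n
    return to_learn

# The text needing the shortest list is the one closest to the goal, e.g.:
# print(len(words_to_target(open("book.txt").read(), content_words)))
```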


Embedding Visualization:

Contrary to my expectation, the unknown words didn’t form a cluster located far from the known words; the two groups overlapped quite well. (Black dots are known words, and colored dots are unknown words.)

In the embedding space, words are clustered quite well by topic. Except for some regions that contain too diverse a mix of words, I can figure out the topic of a region by skimming the words located there.
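A sketch of how such a topic map can be bootstrapped, reusing the assumed words/vecs from the earlier sketches (the cluster count is arbitrary, and the region labels still have to be assigned by hand):

```python
# Cluster the embeddings and skim a few words per cluster to name each
# region; 30 clusters is an arbitrary starting point.
from sklearn.cluster import KMeans

labels = KMeans(n_clusters=30, n_init=10, random_state=0).fit_predict(vecs)
for c in range(30):
    sample = [w for w, l in zip(words, labels) if l == c][:10]
    print(f"cluster {c}: {sample}")
```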

This is a topic map I made.

This is 10% of the unknown words plotted with text labels. You can see that the words match the topics I assigned.


I’m thinking about how this info could be used for personalized learning.

Personalized vocabulary lists by topic?
Would it help learning to memorize words from the same topic together?

Always nice to see more stats, especially if I can sort my courses/books according to them too. If the data from those stats can be used to generate a readability score for the material we import, that would be useful: a score enhanced by the words we already know.

There is a grammar site, https://www.kwiziq.com/, that has a mind-map system. Your analysis and visualization remind me of that.

I like that you have a Harry Potter topic circle in your visualization. Being able to read that book in my TL is one of my goals. Your visualization would show me how close I am to being able to read it, which can be motivating. Maybe if it could even recommend books/material from my library to read in an efficient sequence to reach that goal, that would be great.