How hard is a piece of text?

Greetings fellow tutors and other interested parties,

As some of you may know, I’ve been puzzling for some time over a simple yardstick to measure how hard a piece of text is. Obviously, as a learner I use the “known words” statistics provided by LingQ (that’s the idea!). But no one else will be able to evaluate a piece of text on my behalf, not even someone who started learning the same language at the same time.

There are measures of sentence complexity, but they are all fairly…complex. We want something simple which will yield the same rule-of-thumb results for all tutors. Here is my latest suggestion:

I have imported a list of the 1000 most frequently used words in English into my workdesk and selected “read”. I now have a word-stock of 1000 common words. I figure that this is roughly comparable to the level of an intermediate 1 student, although, the laws of probability being what they are, such a student will not have encountered some of them and will instead know a whole bunch of really quite obscure words (like “dodo”).

Now I have some meaningful statistics on my workdesk. If I had paid more attention in probability theory classes I could do some sums here, but sadly I didn’t so I have to make this next bit up.

Less than 25% unknown words - beginner 1
25% - 40% - beginner 2
40% - 55% - intermediate 1, challenging but manageable with LingQ and a tutor
55% - 60% - intermediate 2, ditto
over 60% - quite challenging even for a native. Advanced 1.
over 70% - don’t go there!
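
For anyone who wants to play with these bands, here is a rough Python sketch of the lookup. The function name `band_for` is made up for illustration, and the cut-offs are just the guesses from the list above. The list does not say which band a boundary value such as exactly 25% belongs to, so the sketch assumes it goes to the higher band.

```python
# Bands copied from the list above: (upper limit in %, label).
# Assumption: a value sitting exactly on a boundary goes to the higher band.
BANDS = [
    (25, "beginner 1"),
    (40, "beginner 2"),
    (55, "intermediate 1"),
    (60, "intermediate 2"),
    (70, "advanced 1"),
]


def band_for(percent_unknown: float) -> str:
    """Return the suggested level for a given percentage of unknown words."""
    if not 0 <= percent_unknown <= 100:
        raise ValueError("percentage must be between 0 and 100")
    for upper, label in BANDS:
        if percent_unknown < upper:
            return label
    # Everything from 70% upwards falls outside the usable bands.
    return "don't go there"
```

Calling `band_for(45)` would give "intermediate 1" under these assumed cut-offs; changing the numbers in `BANDS` is all it takes to try a different scale.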

Any thoughts?

I should add I used this list: Wiktionary:Frequency lists/PG/2005/10/1-1000 - Wiktionary


You are a genius. I presume that you mean that you import the most common 1000 words and then click on Update to put them in your database.

Normally the first 1000 words will account for 60%-70% of just about any content. I suggest we use the first 2000, which account for 70%-80%.

So I would suggest we go

Beginner one <5%
B-2 6-10%
I-1 11-20%
I-2 21-30%
Advanced > 30%

Once a learner is past intermediate it will depend more on his or her own word list than a general frequency indicator.

We can also suggest that learners import word lists themselves in order to make their “new words” count correspond to their level. But of course that means finding frequency lists for the different languages.

Any more ideas or thoughts? And do people have frequency lists, or other word lists, for the different languages?

The idea is really great!
I hope I can find such a list for German. Does anyone have such a list?

I got my English words from here:

I got the first 2000 German words and the first 1000 Russian words from there too.

Thinking further on this, we could create content items which consist of the 1000 most frequent words in each language and put them into the Library. It would mean that someone would have to read this list.

The purpose of the list is primarily just to make the “new words” count meaningful for the learner who has just arrived at our site.

We could provide more lists, but it is best if the learner imports lists for him or herself since we can only share content that has sound. Furthermore listening to a list of words kind of goes against the LingQ philosophy of meaningful content.

We do have the intention of introducing better difficulty yardsticks like this into the library, but the more things that we can do without bothering our programmer for now, the better.

I will look for frequency lists in German and other languages. I hope some of you can share any lists you know of.

I just tried updating my recently imported lists. I am not sure this function is working properly right now. I will check. In any case, it seems that it measures the number of new words as a percentage of total running words, not unique words, so the percentages that I spoke of do not apply.

Let’s do a little more experimenting before we decide on levels. I will report further on whether this is working as it should right now.

Dr Seuss, when he wrote his Cat in the Hat books, did exactly what you suggest, Steve. He was given a list of words and was told to write whatever he liked as long as he stuck to the vocabulary in the list.

I was considering doing a Seuss myself when I have time and inspiration.

I spoke to Mike. Our “new word” calculation is based on unique words, not running words. Thus the numbers I indicated above will not work. In other words, if the 1000 most frequent words account for 70% of the running words in a text, they will account for a lower share of its unique words. So we should experiment in each language to find out where the dividing line is between the different levels.
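
To make the difference between the two counts concrete, here is a small Python sketch. It is only an illustration: the function name, the crude tokeniser and the tiny sample "known" list are all made up, and it is not how LingQ itself does the calculation.

```python
import re


def new_word_percentages(text: str, known: set[str]) -> tuple[float, float]:
    """Return (% of running words that are new, % of unique words that are new)."""
    # Crude tokeniser (an assumption): lower-cased runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0, 0.0
    new_running = [t for t in tokens if t not in known]  # every occurrence counts
    unique = set(tokens)                                 # each word counts once
    new_unique = unique - known
    return (100 * len(new_running) / len(tokens),
            100 * len(new_unique) / len(unique))


# Hypothetical sample data, just to show the effect.
known = {"the", "cat", "sat", "on", "and"}
text = "The cat sat on the mat and the cat sat on the dodo"
running_pct, unique_pct = new_word_percentages(text, known)
```

In this sample, 2 of the 13 running words are new (about 15%), but 2 of the 7 unique words are new (about 29%). Common words repeat a lot, so the same text shows a higher "new words" percentage when each word is counted only once.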

Ok, I have just uploaded the top 2000 words, and had a look at the new words statistics on my workdesk for the items I am considering using for conversations. If I’ve understood you right Steve then the % new words isn’t what we should be using, but rather the number of (unique) new words divided by the total number of words used. We don’t have that statistic (though presumably it wouldn’t be too hard to display it…) so we either count them ourselves (import the text into Word, say) or we can just divide the number of new words by the length of the audio and waggle our hands a bit to indicate that it isn’t a precise measurement.

e.g.: Interesting Thing of the Day, Decimal Time
Time: 1:33 New words: 201 (45%)

is advanced, whereas:

A Magic Ring
Time: 2:53 New words: 29 (13%)

is a good beginner (beginner 2?)

Interesting Thing of the Day, Esperanto
Time: 5:05 New words: 151 (42%)

Is a good intermediate (intermediate 2)?

It has “face validity”, which means roughly “I like the sound of it”.
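
The "new words divided by length of audio" idea can be worked out for the three items above with a few lines of Python. This is a hand-waving sketch in the same spirit: the function name is invented, and the times and counts are simply the figures quoted above.

```python
def new_words_per_minute(duration: str, new_words: int) -> float:
    """Divide the new-word count by the audio length given as 'm:ss'."""
    minutes, seconds = duration.split(":")
    total_minutes = int(minutes) + int(seconds) / 60
    return new_words / total_minutes


# (title, audio length, new words) as quoted in the post above.
items = [
    ("Decimal Time", "1:33", 201),
    ("A Magic Ring", "2:53", 29),
    ("Esperanto", "5:05", 151),
]

for title, duration, count in items:
    rate = new_words_per_minute(duration, count)
    print(f"{title}: {rate:.1f} new words per minute")
```

That comes to roughly 130 new words per minute for Decimal Time, about 10 for A Magic Ring and about 30 for Esperanto, which spreads the three items out more than the percentages alone do.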


I doubt that we need to go to that much trouble. However, if you are interested, you can find the total number of running words in any text at LingQ by clicking on the little arrow to update your reading statistics. If you divide the number of new words by that number you will get the “new words” per running word count (not necessarily accurate, since we do not know how many times the new words were used in the text, but good enough).

Alternatively, you can submit any text to the Compleat Lexical Tutor, put out by Tom Cobb of the University of Montreal. Look under Frequency Indexer. This will give a breakdown of words used in the text by frequency levels for both English and French. See link.

I think, however, that this is all too complicated. We only need to come up with a guideline, or share our own opinions on what % of new words, as counted by LingQ now, corresponds to our learners’ levels. The % will perhaps vary depending on the language. But if we just use the count that we have, we will soon come up with the appropriate levels.

I think you’re right about the length of the material also being critical. A lot of what I considered to be advanced is only a couple of minutes long. An advanced student really should be able to assimilate a quarter of an hour’s worth at a time. Intermediates between 5 - 10 mins maybe.