How to determine the level of a piece of text?

As regular forum readers will know, I have been brooding over this question ever since I started at LingQ. A text is more or less difficult to understand depending on the proportion of nouns and verbs to other words, the length and obscurity of the words used, and the verb tenses used. You could probably create a computer program to factor in all of these variables. Well, I couldn’t. And frankly, life’s too short to try.

I was lying in bed yesterday, enjoying the latest of my kids’ winter colds, and I did an experiment. I counted the average sentence length, in words and syllables, of my LingQ Russian lessons. I got these results:

Beginner 1: words per sentence: 8; syllables per sentence: 8
Beginner 2: words per sentence: 12; syllables per sentence: 16
Intermediate 1: words per sentence: 16; syllables per sentence: 32
Intermediate 2: words per sentence: 20; syllables per sentence: 48

If this relationship holds true, and isn’t just the product of my feverish fantasy, it means that we can all pick up a piece of text, and using just our eyeballs, estimate its LingQ level. As long as it’s in Russian, anyway.
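The eyeball method above can be sketched in code. Here is a rough Python sketch, assuming syllables can be approximated by counting Russian vowels (а е ё и о у ы э ю я) and that sentences split on terminal punctuation; both are crude heuristics, not a definitive implementation:

```python
import re

# Approximation: one syllable per Russian vowel.
RU_VOWELS = set("аеёиоуыэюя")

def sentence_stats(text):
    # Split on sentence-ending punctuation (a rough heuristic).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = syllables = 0
    for s in sentences:
        ws = re.findall(r"\w+", s)
        words += len(ws)
        # Count vowels in each word as a stand-in for syllables.
        syllables += sum(1 for w in ws for ch in w.lower() if ch in RU_VOWELS)
    n = len(sentences)
    return words / n, syllables / n

wps, sps = sentence_stats("Я читаю книгу. Это интересно!")
```

Comparing the two averages against the table above would then give a rough level estimate.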

When the next cold incapacitates me I shall look at French and German and see if there is a corresponding relationship.

That’s assuming short words are easier to remember and/or more important/frequent than long ones though, isn’t it?

I think in order to fully know, you’d need a base level of 1 (covering the most frequent words, which are usually the easiest because you’re exposed to them more)… and then judge higher levels based on how many unknown words above that level there are?

The level thing is arbitrary. We tried to automate it using word frequency and got some strange results. Maybe we should try this approach. For now we are hoping that providers will assign a level, and then learners can rely on their own new word count.

We may experiment with this approach when our programmers get some free time. I think you may be on to something, Helen.

I’ve managed to stay out of bed long enough to play around on Google for sentence analysers. The trouble is that most are language-specific. Any program that does a syllable analysis based on English text gets confused by Russian text, because there are more consonants per syllable than it can cope with.

However, the Automated Readability Index (there’s a Wikipedia article on it) doesn’t count syllables, just words and letters. I notice Google Docs has this built in. It produces results for Russian, but I can’t work out how much sense the numbers make. Checking the results relies on me finding some easy and some hard Russian texts, and my brain’s a bit too foggy at the moment.
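Since the Automated Readability Index only counts letters, words, and sentences, it is easy to sketch. A minimal version, using the published ARI formula (4.71 × characters/word + 0.5 × words/sentence − 21.43) with the same rough sentence-splitting heuristic as before:

```python
import re

def ari(text):
    # Automated Readability Index: needs only letters, words and
    # sentences, so no syllable rules are required and it can be
    # applied to any alphabetic script.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    letters = sum(len(w) for w in words)
    return 4.71 * letters / len(words) + 0.5 * len(words) / len(sentences) - 21.43
```

Whether the resulting grade numbers mean anything for Russian is exactly the open question here; the formula’s constants were fitted on English text.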

Please could someone provide me with examples of easy, middling and difficult text in Russian or the language of your choice so I can explore this question further?

You’re going to love this.

Варкалось. Хливкие шорьки
Пырялись по наве,
И хрюкотали зелюки,
Как мюмзики в мове.

О бойся Бармаглота, сын!
Он так свирлеп и дик,
А в глуше рымит исполин —
Злопастный Брандашмыг.


Good joke! :slight_smile: But word frequency alone could decide this issue :slight_smile:

I think using word frequency is a mandatory part of the technology.
But the idea of calculating words per sentence and syllables per sentence is great! That calculation might become an important part too. Can we combine both of these parts?
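One hypothetical way to combine the two signals, purely as a sketch: blend the normalized average sentence length with the share of words falling outside a frequency list. The weights and the 30-word ceiling below are invented for illustration, not anything LingQ uses:

```python
def difficulty_score(words_per_sentence, rare_word_fraction,
                     w_len=0.5, w_freq=0.5, max_wps=30.0):
    # Normalize average sentence length to [0, 1] against an assumed
    # ceiling of 30 words per sentence, then blend it with the
    # fraction of words outside a frequency list. Weights are
    # arbitrary and would need tuning per language.
    length_part = min(words_per_sentence / max_wps, 1.0)
    return w_len * length_part + w_freq * rare_word_fraction
```

A text averaging 15 words per sentence with 20% rare words would score 0.35 on this 0-to-1 scale; the thresholds mapping scores to levels would need to be determined empirically.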

One day we can experiment with these things. Remember that all languages are different and I do not see us having a different algorithm for each language. One day, one day…

We can use this mechanism for languages where it works :slight_smile: Like a handy
option. And not use it for other languages :wink:
One day :slight_smile:

There are things aptly called complex sentences, which contain two or more propositions embedded within each other; in English, mainly in the form of relative clauses.

In my experience, a text with a lot of multi-proposition sentences is a complex text.

I usually level lessons by this algorithm:
Beginner I – short, very simple and rather unnatural texts (“unnatural” as natives tend to use more complicated sentences).
Beginner II – rather short, rather simple and quite natural texts
Intermediate I – natural texts about the daily routine of any length, and quite short texts on any other theme.
Intermediate II - Advanced I – natural texts on more complex themes

I would like to come up with a rule of thumb that will work even for a language you can’t read. Rasana, as a Russian tutor you can grade lessons, but as a Russian learner it’s not as obvious to me. As a Japanese beginner, nearly all texts are too hard for me; what I’d like is an idea of how many months or years it will take me to be able to understand any given novel or website.

It would have to be based on the number of words and characters per sentence, because you can count them without even knowing what a vowel looks like. It can’t involve counting prepositions, because that assumes you can recognise a preposition when you see it. It would need a different fudge factor per language, which would need determining empirically.
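That rule of thumb might look something like this toy sketch. The fudge factors and thresholds below are made up for illustration (the thresholds loosely echo the Russian words-per-sentence numbers earlier in the thread) and would, as the post says, need determining empirically:

```python
# Hypothetical per-language scaling constants (would need fitting
# against texts of known difficulty in each language).
FUDGE = {"ru": 1.0, "fr": 1.3, "ja": 0.4}

def estimated_level(words_per_sentence, lang):
    # Scale raw sentence length by the language's fudge factor, then
    # map it onto levels using thresholds derived from the Russian
    # measurements (8, 12, 16, 20 words per sentence).
    wps = words_per_sentence * FUDGE.get(lang, 1.0)
    thresholds = [(10, "Beginner 1"), (14, "Beginner 2"),
                  (18, "Intermediate 1")]
    for limit, level in thresholds:
        if wps < limit:
            return level
    return "Intermediate 2"
```

The appeal is that counting words (or characters) per sentence needs no knowledge of the language at all; only the fudge factor does.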

I think some graphs may be involved.

Readability Estimation

What if LingQ set up a jury to do this work?

There are too many factors that have an influence on text complexity.
What about intonations and cultural affinity?

Is this text easy to understand?

Why does he say “КОпЕнгагене” and not “Капингагене“, as Russians usually do? Or “Венэции“ instead of “Венеции“? Why does he say “Брюссссель”?
Why are they laughing when he repeats “20 в Антверпене”? Why is he talking about windsurfing and freestyle?

No complex words. Very short sentences. But is this text easy to understand? And what about ease of listening (in the linguistic sense)?

Interesting! I wonder what the algorithm is that they use?

Anyone know the Russian for “readability test”?


@Astamoore: Ah! I’ve just recognised the poem you quote. One of the finest ever written in the English language! It still sounds wonderful in translation.

Readability Estimation

"For the criterion, we use a textbook corpus that we compiled. It consists of 1478 samples (about 1,000,000 characters) extracted from 127 textbooks.
First, the program calculates a likelihood value for each grade by using the character bigram model, which is constructed from the textbook corpus. The grade whose likelihood value is the best among 13 obtained values is the estimated grade."
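The quoted approach, one character-bigram model per grade, score the text under each, pick the grade with the best likelihood, can be sketched roughly like this. The add-one smoothing and the toy training texts are my assumptions for illustration, not details from the paper:

```python
import math
from collections import Counter

def train_bigram(text):
    # Count character bigrams and return a log-probability function
    # with add-one smoothing over the observed alphabet.
    pairs = Counter(zip(text, text[1:]))
    unigrams = Counter(text[:-1])
    vocab = len(set(text))
    def logp(a, b):
        return math.log((pairs[(a, b)] + 1) / (unigrams[a] + vocab))
    return logp

def score(logp, text):
    # Log-likelihood of the text under one grade's bigram model.
    return sum(logp(a, b) for a, b in zip(text, text[1:]))

def estimate_grade(models, text):
    # Pick the grade whose model assigns the highest likelihood.
    return max(models, key=lambda g: score(models[g], text))

models = {"easy": train_bigram("aa bb aa bb aa"),
          "hard": train_bigram("xq zj xq zj")}
```

With real textbook corpora for 13 grades instead of the two toy strings, this is essentially the estimator the quote describes.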