I have been pondering document readability for a while. In English there are several well-known formulae designed to give a rough-and-ready index of how hard a piece of text is to read: the Gunning Fog index, SMOG, etc. There are online calculators which clever Herberts have written, so you just have to copy and paste in a sample passage from your text to get a relative measure of its complexity.
The problem is that they mostly don’t work for Russian, for the simple reason that most count syllables, and if you can’t recognise a vowel, your syllable count will fail.
A little thought suggests that there are two measures which are simple to determine and give a rough estimate of linguistic complexity: average words per sentence and average characters per word. As long as you can tell where one word ends and the next one starts (tricky for Japanese but not usually for European languages) then you are away. Of course you have to recognise that Russian, by its prefix, infix and suffix nature, favours longer words, while French, with its fiddly little particle nature, favours longer sentences. But it shouldn’t be hard for a sufficiently clever person to come up with a multilanguage readability scale, and as long as we calibrate it correctly (e.g. run some “average” texts through it in each language and see what “average” scores look like) we should be away.
Someone has in fact done this, and come up with the LIX readability scale, described here: Readability of Newspapers in 11 Languages on JSTOR. Damn you Doctor Björnsson for nicking my idea!
Anyway, here’s the thing. I still can’t find any online readability calculator which will calculate the LIX index for a text in Russian. It looks like a simple programming job, but still well beyond my powers. Could anyone help me, either writing some code or finding me a program already written?
A little more googling shows that its proper name is the Laesbarhedsindex (LIX) formula and it is quoted here: HOW DO I DECIDE WHICH READABILITY FORMULA OR FORMULAS TO USE ON MY DOCUMENT? by Brian Scott as being “Useful for documents of any Western European language”.
Here’s the formula:
Lix = words/sentences+100*(words >= 6 characters)/words
and here’s the paper I found it in:
Here is my offer Helen: I’ll do it by tomorrow and as exe file for Windows, and you speak to me additional 15min in English
P.S. WAT and Software support is not included. Let me know if you don’t need it for the Windows, or if you have a chance to find s-th ready by tomorrow.
It turned out to demand highly professional skills Helen. I have to change you more for my little RussianLex_ForHelen.exe
Therefore, here are the changed conditions of the offer I speak to YOU additional 15 min in English
Debugging: change you more => charge you more
Debugging: I don’t mean 30 min, I min 15 min all at all.
Quality Assurance: I don’t mean 30 min, I mean 15 min all at all.
I have sent the working Windows application to your profile’s address Helen. Cut and paste, or type in any text. You are still silent and I am afraid you favor Mac. You are a real woman. Just say so…
Not silent, just very stressed with a sudden outbreak of illness here (you do NOT want to know the details).
Ilya, if you can help me with this one I am happy to talk with you for as long as you want. I have spent two days googling, and I figure this LIX is probably the best readability index across languages, but I want to check the results against LingQ library material, and I really don’t have time to do that by hand. Ideally I would talk Mark into adding it into LingQ, but that’s more practical once I have figures to show it is useful.
Aargh! There was a typo in the paper I copied the formula from! I have just checked Bjornsson’s original paper and long words are defined as greater than 6 letters, not greater than or equal to six letters.
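For anyone who wants to try the corrected definition without the exe, here is a minimal Python sketch. It is not Ilya's program: the function name `lix`, the parameter `min_long_len` and the rule that a sentence ends at any run of ".", "!" or "?" are my own assumptions for illustration. Words are taken as runs of Unicode word characters, so Cyrillic is counted just like Latin.

```python
import re

# Assumption: a "word" is any run of Unicode word characters,
# so Cyrillic letters count exactly like Latin ones.
WORD = re.compile(r"\w+")
# Assumption: a sentence ends at any run of '.', '!' or '?'.
SENTENCE_END = re.compile(r"[.!?]+")


def lix(text: str, min_long_len: int = 7) -> float:
    """LIX = words/sentences + 100 * long_words/words.

    min_long_len=7 means "more than 6 letters" (the corrected definition);
    pass 6 to reproduce the mistyped ">= 6" version.
    """
    words = WORD.findall(text)
    sentences = len(SENTENCE_END.findall(text))
    if not words or sentences == 0:
        raise ValueError("need at least one word and one sentence terminator")
    long_words = sum(1 for w in words if len(w) >= min_long_len)
    return len(words) / sentences + 100 * long_words / len(words)


sample = "Кошка сидит на окне. Собака громко лает."
print(lix(sample))     # corrected definition: no word has more than 6 letters
print(lix(sample, 6))  # mistyped definition: the two 6-letter words count as long
```

The sample has 7 words in 2 sentences and two words of exactly six letters, so the two definitions give noticeably different scores on the same text.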
Sorry Ilya, I will rethink the spec. It would be worth adding in RIX and possibly ARI too.
There will be no problem to change the formula Helen. It takes more time to choose the useful one, to build the interface for it to work (which seems to be finished), as well as to read Benny’s thread (which is not). The last is the most pleasure of all. We need to make Benny stay with LingQ!
By the way, I have been fiddling with the LIX exe some more and either there is a counter which isn’t reset when you clear the text box, or I’m failing to clear the text box, because the index keeps going up when you clear out the input box and then keep pressing the “calculate” button.
I’ve also had a look at 10 online tools this morning. All look nice, none can handle Russian. What have Russian language obsessed computer programmers been doing with their time since the 1960s???
Reading the Benny thread is terribly addictive, isn’t it? You keep thinking you can give it up, then you find yourself logging on to sneak another peek…
I absolutely believe that in any group of people you need at least one member who thinks that a lot of the other members are nuts. Otherwise you get a mutual admiration society that can evolve into a cult.
I’ve been trying to wind people up here for 2 years with little success. Benny has a knack for getting people animated that I can only envy. I agree that his presence is doing us a lot of good.
Re: Benny - big things first - I fully agree with you. I am afraid Steve is missing (or at least not checking) the opportunity. I even had an impulsive desire to skype him and talk him into cooperating with Benny… but dropped the idea, as it’s their business after all.
Yes, these forums are very addictive, not only Benny’s thread. I find them especially addictive after making my own comment, however silly. I am soon strongly drawn to come back and check if somebody has responded. Looking at the time wasted, I console myself with the myth about the writing practice ;-)
As to the LIX, to be certain, I refer to LIX2.exe, which I sent you yesterday to the Yahoo address, not to the Gmail one. By default LIX2.exe should open with just a single, hardly visible full stop in the text box, and with the following values of the indexes:
LIX = 1; Lix1 = 1; Lix2 = 0; RIX = 0; ARI = 5.21.
Trying to reproduce what you have written, I had first thought that you were right. Once, after clearing the text box, the indexes indeed did not reset to the expected message “Enter at least one period”. However, then I scrolled the seemingly empty text field all the way down, to find that it was not empty.
Since then I could not “reproduce” the non-reset effect. (Yeah, it is more reliable to delete the text with the Windows Delete key rather than Ctrl-X. Ctrl-A (select all) does not work in the box.) I am sure there are bugs, and I see that the application is not yet robust. If it behaves weirdly, please try to reproduce the weirdness and e-mail me. You can always restart it. How was I at practicing my English?
Thanks for telling me about the e-mail Ilya, it ended up in the spam folder and I wouldn’t have thought to look there, as it isn’t my account and I’m not familiar with Yahoo. I have downloaded the files and I will play with them today.
I’m working from my new netbook, which means I can’t see the whole text box at once. I did wonder if it was non-empty further down the page.
I had a play with 3 common word-processing packages, the latest Word, Open Office and Google documents. Although they all offer multilingual support, their readability statistics are rubbish for Russian.
When my Russian gets half as good as your English, I will be very happy.
=> “I’m working from my new netbook, which means I can’t see the whole text box at once.”
Helen, tell me which size of the application in pixels, width*height, would be convenient for you to play with, and on the next change I will include the new size as well. To answer, you should look at your netbook’s screen resolution (right-click on the empty screen/Properties/Settings). It gives the size of the screen in pixels. Currently the Lex’s size is set to 640*800 pixels.
The British are so polite that I wonder whether you meant that Lex was or was not rubbish with Russian. If it is not, it must be the same with all other languages and texts in Unicode. When my English gets half as polite as yours, I shall be happy.
I think I’m working with 1024 * whatever, which doesn’t actually all fit on the screen
I’ve had a quick test of the LIX2 and the statistics look right in English and Russian, certainly more plausible than any of the dozen or so utilities I’ve used so far! I’m sitting down now to give it another test.
As far as I can see, the problem with all the other utilities is that they only recognise the ASCII characters A-Z and a-z as legitimate characters. И for instance is mapped to some non-ASCII value and is therefore ignored. The LIX2 works I think because you make no distinction between text characters and non-text characters, just between periods and spaces and everything else counts as text. It does mean that it doesn’t recognise sentences ending in a question mark or an exclamation mark as being sentences, but that shouldn’t greatly affect the results. Not unless the input text is The Cat in the Hat. Actually, I should try that one…
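To make the guess above concrete, here is a small Python sketch of the difference. It does not reproduce any of the online tools or the LIX2 internals; the patterns and the helper `count_matches` are hypothetical, chosen only to show what happens when a counter accepts A-Z and a-z alone, and what is lost when only periods end a sentence.

```python
import re

# Hypothetical ASCII-only tokenizer, as suspected of the online tools.
ASCII_WORD = re.compile(r"[A-Za-z]+")
# Unicode-aware alternative: word characters minus digits and underscore,
# i.e. letters in any script.
UNICODE_WORD = re.compile(r"[^\W\d_]+")

PERIOD_ONLY = re.compile(r"\.+")     # sentences end only at periods
ANY_TERMINATOR = re.compile(r"[.!?]+")  # also count '?' and '!'


def count_matches(pattern: re.Pattern, text: str) -> int:
    """Number of non-overlapping matches of pattern in text."""
    return len(pattern.findall(text))


mixed = "Иван читает book."
print(count_matches(ASCII_WORD, mixed))    # only "book" is seen
print(count_matches(UNICODE_WORD, mixed))  # all three words are seen

questions = "Кто там? Никого!"
print(count_matches(PERIOD_ONLY, questions))     # no sentence found
print(count_matches(ANY_TERMINATOR, questions))  # both sentences found
```

On ordinary prose the period-only rule loses few sentences, but a text made mostly of questions and exclamations would be badly undercounted.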
Mike regained the power of coherent speech just long enough to look at and approve of your program this morning.
The width 1024 is usually accompanied by whatever = 768, which is less than the 800 of the LIX, so it doesn’t fully fit in. Pick the size you wish it to fill on your screen and tell me. Take into account that you probably wish to have a part of the screen uncovered for other stuff to see. You probably look at it from a close distance, so you may advise me to decrease the font of the text in the box as well.
In the meantime you may try to increase your vertical resolution a bit (Properties/Settings) for the whole LIX to fit in. I will be sitting at the computer part of the day, leaving my skype green, so we could improve LIX and my English with one shot.
After a bit more playing I have a couple of thoughts:
All the calculations look right, hurrah! You’ve missed the end off the ARI formula, there should be minus 21.43 at the end. But as that’s a constant I can still use the formula. LIXes and RIX look fine, although I wonder why you round the RIX to the nearest integer?
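For reference, a minimal Python sketch of RIX and ARI with the constant included, returning the raw counts as well. It is an illustration under my own assumptions, not the code in LIX2.exe: the name `readability_stats` is made up, characters are counted as letters and digits inside words, and sentences end at ".", "!" or "?". The formulas used are RIX = long words / sentences (long meaning more than 6 letters) and ARI = 4.71 * characters/words + 0.5 * words/sentences - 21.43.

```python
import re

WORD = re.compile(r"\w+")             # Unicode-aware: works for Cyrillic
SENTENCE_END = re.compile(r"[.!?]+")  # assumption: any of . ! ? ends a sentence


def readability_stats(text: str) -> dict:
    """Return raw counts plus RIX and ARI for a text (hypothetical helper)."""
    words = WORD.findall(text)
    sentences = len(SENTENCE_END.findall(text))
    if not words or sentences == 0:
        raise ValueError("need at least one word and one sentence terminator")
    chars = sum(len(w) for w in words)               # characters inside words
    long_words = sum(1 for w in words if len(w) > 6)  # more than 6 letters
    n = len(words)
    return {
        "words": n,
        "long_words": long_words,
        "sentences": sentences,
        "chars": chars,
        "rix": long_words / sentences,  # left unrounded
        "ari": 4.71 * chars / n + 0.5 * n / sentences - 21.43,
    }


stats = readability_stats("Читабельность текста важна. Да.")
print(stats)
```

Without the trailing constant, a text of one character, one word and one sentence gives 4.71 + 0.5 = 5.21, which matches the default ARI value quoted earlier in the thread.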
Secondly, would it be possible to display the total number of words, total number of hard words, total number of sentences and total number of non-space, non-period characters? I would like to include them in my spreadsheet, and maybe use them in calculating other statistics.
Thirdly, you don’t have to spend time on making the front end user-friendly. I can use it as it is, and until anyone else shows an interest in it, that is all that is needed.
I’ve started putting together a spreadsheet of LingQ lessons, checking the assigned level against the readability indices.
=> “and until anyone else shows an interest in it”
Not a chance unless Benny is involved in it
Thanks Helen, I will resend the fixes to your yahoo later. The additional info will be opened in the additional window, so I will not have to change the interface.