Frequency lists

Thanks for the link, skyblueteapot. It didn’t work, but I got to Частотный словарь современного русского языка (Frequency Dictionary of Modern Russian) and found the lists on the Russian page. They look interesting, but unfortunately the frequency lists have the file extensions .al.zip and .num.zip, and when I import them I can unzip them but can’t open the files. Did you manage it somehow?

Ah. Try editing the “.”s out of the file name (who put them in there?) and adding “.txt” as the file extension.

Well, that’s some trick! It worked for the .num file at least. Thanks, skyblueteapot! Then I got a text file full of gobbledegook, but managed to convert the characters to Cyrillic. So now I’ve got a list of the 32 459 most frequent words, with и at no. 1 (no surprise there) and экстремизм at no. 32 459. The list is hard to use at the minute, as it needs formatting to get it into Excel in a sortable form. Still, that’s for another day … getting late now!

How do you convert it into Cyrillic?

It should in theory be possible to load the 2 000 most frequent words into a concordance program so that, for any given text, you can analyse the total number of words, the total number of unique words, and the total number of unique “hard” (i.e. infrequent) words. That, combined with the text’s Lix score, should give you a reasonable picture of how hard it will be for a learner to read.
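
For anyone who would rather script this than use a concordance program, here is a rough Python sketch of the analysis described above. The function name `analyse`, the word-splitting rule, and the sample word list are my own assumptions for illustration. Lix is taken here as average sentence length plus the percentage of words longer than six letters, with sentences counted very roughly by their end punctuation.

```python
import re

# Runs of letters in any alphabet (no digits, no underscores).
WORD_RE = re.compile(r"[^\W\d_]+")

def analyse(text, frequent_words):
    """Return basic difficulty statistics for a text.

    frequent_words: an iterable of lower-case words treated as "easy".
    """
    words = [w.lower() for w in WORD_RE.findall(text)]
    unique = set(words)
    hard = unique - set(frequent_words)

    # Rough sentence count: runs of ., !, ? or : (never less than 1).
    sentences = max(1, len(re.findall(r"[.!?:]+", text)))
    long_words = sum(1 for w in words if len(w) > 6)

    if words:
        lix = len(words) / sentences + 100 * long_words / len(words)
    else:
        lix = 0.0

    return {
        "total_words": len(words),
        "unique_words": len(unique),
        "hard_words": sorted(hard),
        "lix": lix,
    }

# Hypothetical usage with a tiny stand-in for the real frequency list.
sample_frequent = {"и", "я", "ты", "мы"}
stats = analyse("Я и ты. Мы читаем экстремизм и книги!", sample_frequent)
print(stats)
```

Note that this compares the exact word forms found in the text, so a list of headwords will flag inflected forms as “hard” unless the text is lemmatised first.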

I have found two concordance programs, but can’t get either to work with Russian: the Simple Concordance Program http://www.textworld.eu/scp/index.html and Concordance http://www.concordancesoftware.co.uk/. The first is free; for the second they want money. Random!

If anyone knows what I’m talking about and knows of any other concordance program that will work across languages, I would be eager for details.

skyblueteapot,
The textworld.eu program (Simple Concordance Program 4.09) works in Russian on my computer, but I have no idea why. Maybe, try: on the View menu, pick Project, then set the alphabet to Russian-CP1251, and pick a nice Cyrillic font.

(In addition, I was able to use the most frequent 2000 list as a stop list, but this list is “headwords,” and my text has the inflected forms. I don’t know how to solve this problem.)

For some reason, the encoding always resets itself to Western European 1252, which means the Cyrillic gets decoded as ?’s, which aren’t word characters. I suspect it may be an operating system thing. I’m running Windows 7 Starter, which lacks certain capabilities. What are you running, @lastsafari?
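
To illustrate the kind of mismatch that can produce these symptoms, here is a small Python sketch. It only shows how the two code pages interact in general; it is an assumption that the program behaves this way internally. Cyrillic forced into Western European 1252 turns into question marks, while CP1251 bytes read as 1252 turn into accented Latin letters.

```python
word = "словарь"

# The bytes a Windows Cyrillic (CP1251) text file would contain.
cp1251_bytes = word.encode("cp1251")

# Reading those bytes as Western European 1252 gives accented Latin letters.
mojibake = cp1251_bytes.decode("cp1252")

# Forcing Cyrillic text into 1252 replaces every letter with "?".
question_marks = word.encode("cp1252", errors="replace").decode("cp1252")

# Text misread as 1252 can be repaired, because no bytes were lost.
repaired = mojibake.encode("cp1252").decode("cp1251")

print(mojibake)
print(question_marks)
print(repaired)
```

Once the letters have become question marks the original text is gone, whereas the accented-letter version can still be converted back.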

skyblueteapot,
My operating system is Windows XP Home Edition. I just tried on another computer with Vista Home Premium, and the program runs fine on that machine also. I just realized that you can only set the font when you start a new project.

By the way, an older version of Simple Word Sorter is still available for free download:

http://www.softholm.com/download-software-free6239.htm

This will give you a list of words in a text, with the frequency counts. It ran a whole novel in about 10 minutes. It is not a concordance, however, and probably has its own encoding issues.

There is a space in the URL above (before the .) that needs to be removed.

“How do you convert it into Cyrillic?”
James123 - Convert the .num file to a text file by changing the extension to .txt. Open MS Word 2007, then open the .txt file. A dialog box will open. Select the radio button for “other encoding”, then select the Cyrillic (Windows) option. You will see the characters change to Cyrillic. Click OK, then save the file as a Word file. You can then import it into Excel and manipulate the data to get it into separate columns, ending up with the list of Russian words on their own in one column. I’m just doing this now. Then the fun will really start: working out how to use it to compare texts against the list. But I agree with skyblueteapot’s point about importing lists - I don’t want a lot of new words without phrases for them. Hmm…
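
If you would rather skip the trip through Word, the same conversion can be sketched in a few lines of Python. This is only an outline under assumptions: the function name `num_to_csv` and the file names are made up, the file is taken to be Windows Cyrillic (CP1251) as in the Word steps above, and the columns are assumed to be separated by spaces or tabs, which I have not checked against the real file.

```python
import csv

def num_to_csv(src_path, dst_path, src_encoding="cp1251"):
    """Decode a CP1251 text file and write its fields as CSV columns.

    Assumes each line holds fields separated by spaces or tabs.
    Returns the number of rows written.
    """
    rows = 0
    # utf-8-sig adds a byte-order mark so Excel recognises the file as UTF-8.
    with open(src_path, encoding=src_encoding) as src, \
         open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        writer = csv.writer(dst)
        for line in src:
            fields = line.split()
            if fields:  # skip blank lines
                writer.writerow(fields)
                rows += 1
    return rows

# Hypothetical usage (file names are placeholders):
# num_to_csv("frequency_list.txt", "frequency_list.csv")
```

The resulting .csv file should open in Excel with each field in its own column, ready for sorting.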

@lastsafari: Which Cyrillic font did you pick?

Had no success on the full Windows 7 either. Will try tomorrow on husband’s Windows XP.

skyblueteapot,
Courier New is the default font, and that one works fine. I also tried Arial, Arial Bold, and Times New Roman.

This is what I am doing:
On the File menu, pick New. A new project box opens up.
At the lower right-hand corner, under Alphabet, pick Russian CP-1251, and click on Font.
In the new window that opens up, on the right side of the window, pick Windows Cyrillic (CP 1251).
Click OK. The window will close, and there should now be Cyrillic characters in the Alphabet boxes.
Select a text file in Russian and pick OK.

The text file in Russian has to be in the Windows default encoding. Unicode UTF-8 files don’t seem to load.
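
If your Russian text is saved as UTF-8, one possible workaround is to re-save it as Windows Cyrillic before loading it. Here is a minimal Python sketch, assuming that “Windows default encoding” means CP1251 for this purpose; the function name `utf8_to_cp1251` and the file names are placeholders of mine.

```python
def utf8_to_cp1251(src_path, dst_path):
    """Re-save a UTF-8 text file as Windows Cyrillic (CP1251).

    Characters that CP1251 cannot represent are written as "?".
    """
    # utf-8-sig also accepts files that start with a byte-order mark.
    # newline="" keeps the original line endings unchanged.
    with open(src_path, encoding="utf-8-sig", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="cp1251",
              errors="replace", newline="") as dst:
        dst.write(text)

# Hypothetical usage (file names are placeholders):
# utf8_to_cp1251("novel_utf8.txt", "novel_cp1251.txt")
```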

I cannot duplicate your question marks. Either everything is fine, or I get nothing at all, or I get a bunch of funny-looking characters.