Following the recent Russian thread on LingQ with Google Translate (GT), it seems that GT is doing a fairly decent job most of the time. My experience in using GT for Spanish to English translation is that the quality is often pretty good, sometimes downright impressive. However, English/Japanese translation is often rather embarrassing, and I wonder why.
There is a brief Wikipedia article which explains GT’s approach:
A few key points:
GT does not apply grammatical rules, since its algorithms are based on statistical analysis rather than traditional rule-based analysis.
A solid base for developing a usable statistical machine translation system for a new language pair from scratch would be a bilingual text corpus (or parallel collection) of more than a million words, plus two monolingual corpora of more than a billion words each.
To acquire this huge amount of linguistic data, Google used United Nations documents. The same document is normally available in all six official UN languages (Arabic, Chinese, English, French, Russian, Spanish), so Google now has a six-language corpus of 20 billion words' worth of human translations.
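To make the statistical idea concrete, here is a minimal toy sketch (not Google's actual system, and the tiny English–Spanish "corpus" below is invented for illustration): count how often source and target words co-occur in aligned sentence pairs, then translate a word by picking the most strongly associated target word.

```python
from collections import Counter, defaultdict

# Tiny invented parallel "corpus" of aligned sentence pairs (for illustration only).
parallel_corpus = [
    ("the house", "la casa"),
    ("the cat", "el gato"),
    ("the white house", "la casa blanca"),
    ("the white cat", "el gato blanco"),
    ("the table", "la mesa"),
    ("the dog", "el perro"),
]

cooccur = defaultdict(Counter)   # cooccur[src_word][tgt_word] = joint count
src_count = Counter()            # number of pairs containing each source word
tgt_count = Counter()            # number of pairs containing each target word

for src, tgt in parallel_corpus:
    src_words, tgt_words = set(src.split()), set(tgt.split())
    src_count.update(src_words)
    tgt_count.update(tgt_words)
    for s in src_words:
        for t in tgt_words:
            cooccur[s][t] += 1

def translate_word(word):
    """Pick the target word with the strongest association (Jaccard score),
    which down-weights words like 'la'/'el' that co-occur with everything."""
    candidates = cooccur.get(word)
    if not candidates:
        return word  # unseen word: leave it untranslated
    def jaccard(t):
        joint = candidates[t]
        return joint / (src_count[word] + tgt_count[t] - joint)
    return max(candidates, key=jaccard)

print(translate_word("house"))  # -> casa
print(translate_word("cat"))    # -> gato
```

Real systems work with phrases and much more sophisticated alignment models rather than single-word counts, but the point stands: with no grammatical rules at all, enough parallel text lets the statistics do the work — and without enough parallel text (as for Japanese), the statistics have nothing to work with.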
Now I think I know why the Japanese translation is often so poor: Japanese is not a UN language and hence GT does not have a sufficiently large corpus for the necessary statistical analysis.
The article also says that “Google representatives have been very active at domestic conferences in Japan in the field asking researchers to provide them with bilingual corpora.” I guess it would take them some time to get a sufficiently large text corpus, and so I am not holding my breath for a dramatic improvement in the Japanese translations any time soon.
The interesting development is that the European Union has many official languages, and all official documents have to be translated into each of them. I don't know exactly how long it will take, but a number of years down the road GT should have the same texts available in many different languages. This should mean that some time in the future we would have good-quality machine translations in many languages.
In the meantime, LingQ can make a valuable contribution to help Google further its goals. There are a number of lessons available in all LingQ languages, e.g. Steve’s book. Maybe Steve should submit all versions of his book to Google. Who knows, maybe after that we will have much improved Swedish to Italian translations!
I think that as long as GT's algorithms are based only on statistical analysis, the quality of translation cannot be perfect, because statistics deals only in probabilities. A human being with a good knowledge of grammatical rules does not necessarily need a huge database.
Thank you, Cantotango. By reading your post, I was able to understand the characteristics of the GT system.
I don’t know whether the Russian above (Тогда я пошел немного чокнутый) makes any sense; the Google translation into German isn’t quite correct. “Dann ging ich ein bisschen verrückt” should be “Dann wurde ich ein bisschen verrückt”, but one certainly could have understood it.
That’s because both English and German belong to the same family. The Russian “Тогда я пошел немного чокнутый” is a calque. I can recognize it as a calque only because I know English. A Russian speaker with no background in foreign languages would have a hard time understanding it. Just as an English speaker with no knowledge of Russian would have trouble understanding the phrase “Then I gently dislocated.”