How Google Translate works

Following the recent Russian thread on LingQ about Google Translate (GT), it seems that GT does a fairly decent job most of the time. My experience with GT for Spanish-to-English translation is that the quality is often pretty good, sometimes downright impressive. However, English/Japanese translation is often rather embarrassing, and I wonder why.

There is a brief Wikipedia article which explains GT’s approach:

A few key points:

  • GT does not apply grammatical rules, since its algorithms are based on statistical analysis rather than traditional rule-based analysis.

  • A solid base for developing a usable statistical machine translation system for a new pair of languages from scratch would consist of a bilingual text corpus (or parallel collection) of more than a million words, plus two monolingual corpora, each of more than a billion words.

  • To acquire this huge amount of linguistic data, Google used United Nations documents. The same document is normally available in all six official UN languages (Arabic, Chinese, English, French, Russian, Spanish), thus Google now has a 6-language corpus of 20 billion words’ worth of human translations.
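To make the "statistics instead of grammar rules" point more concrete, here is a minimal sketch of how word translations can be estimated from nothing but a parallel corpus. It uses IBM Model 1, a classic textbook word-alignment model; this is not Google's actual system, which is far larger and more sophisticated. The three-sentence corpus and the names `PARALLEL_CORPUS`, `train_model1` and `most_likely_translation` are made up for illustration.

```python
from collections import defaultdict

# Toy parallel corpus (invented for illustration): English source, German target.
PARALLEL_CORPUS = [
    ("the house", "das haus"),
    ("the book", "das buch"),
    ("a book", "ein buch"),
]


def train_model1(corpus, iterations=20):
    """Estimate t[(target_word, source_word)] = P(target_word | source_word) via EM."""
    pairs = [(src.split(), tgt.split()) for src, tgt in corpus]
    target_vocab = {tw for _, tgt in pairs for tw in tgt}

    # Start with no preference: every co-occurring word pair is equally likely.
    t = {}
    for src, tgt in pairs:
        for tw in tgt:
            for sw in src:
                t[(tw, sw)] = 1.0 / len(target_vocab)

    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: split each target word's "credit" among the source words
        # of the same sentence pair, in proportion to the current estimates.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    share = t[(tw, sw)] / norm
                    count[(tw, sw)] += share
                    total[sw] += share
        # M-step: turn the collected fractional counts back into probabilities.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


def most_likely_translation(t, source_word):
    """Return the highest-probability target word, or None for an unseen word."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get) if candidates else None


if __name__ == "__main__":
    table = train_model1(PARALLEL_CORPUS)
    for word in ["the", "house", "book", "a"]:
        print(word, "->", most_likely_translation(table, word))
```

No grammar or dictionary goes in, only sentence pairs. On this toy corpus the model ends up pairing "the" with "das", "house" with "haus", "book" with "buch" and "a" with "ein", purely because of which words keep turning up together. A word that never appears in the corpus gets no translation at all, which hints at why the size of the parallel corpus matters so much.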

Now I think I know why the Japanese translation is often so poor: Japanese is not a UN language and hence GT does not have a sufficiently large corpus for the necessary statistical analysis.

The article also says that “Google representatives have been very active at domestic conferences in Japan in the field asking researchers to provide them with bilingual corpora.” I guess it would take them some time to get a sufficiently large text corpus, and so I am not holding my breath for a dramatic improvement in the Japanese translations any time soon.

The interesting development is that there are many languages spoken in the European Union, and all official documents have to be translated into all those languages. I don’t know exactly how long it will take, but a number of years down the road GT should have the same text available in many different languages. That should mean good quality machine translations between many languages some time in the future.

In the meantime, LingQ can make a valuable contribution to help Google further its goals. There are a number of lessons available in all LingQ languages, e.g. Steve’s book. Maybe Steve should submit all versions of his book to Google. Who knows, maybe after that we will have much improved Swedish to Italian translations!

I think that as long as GT’s algorithms are based only on statistical analysis, the quality of translation cannot be perfect because statistics relies on “probability” theory. A human being with a good knowledge of grammatical rules does not necessarily need a huge database.

Thank you, Cantotango. By reading your post, I was able to understand the characteristics of the GT system.

An experiment with the GT system

我が輩は、猫である。—> My fellow is a cat.
吾輩は猫である。—> I Am a Cat.
私は、猫である。—> I have a cat.
私は猫である。—> I am a cat.

吾輩は、猫である。—> Spence is a cat. ( What is SPENCE? )

Spence may be a name, short for Spencer.

吾輩は、犬である。—> Spence is a dog.
吾輩は犬である。—> Am a dog.
私は、犬である。—> I have a dog.
私は犬である。—> I am a dog.

I am a cat. My name is Spencer…

Ich bin eine Katze. Mein Name ist Spencer.

Yutaka,

No doubt about it, machines will not be able to match good human translators. However there have also been times that I was genuinely impressed by GT.

Here’s what Steve wrote in one of the Russian threads:

Я был бизнесменом, пока я не начал LingQ. Тогда я как бы пошли немного сумасшедшие.

Translation by GT:

I was a businessman, until I began LingQ. Then I kind of went a little crazy.

I was stunned when I read this. The tone and manner are so Steve-like! This could be exactly what Steve would have said had he written in English.

私は猫です。私の名前はスペンサーです。

私(watashi)は(wa)猫(neko)です(desu)。私(watashi)の(no)名前(namae)は(wa)スペンサー(supensaa)です(desu)。

But "Тогда я как бы пошли немного сумасшедшие. " does not make sense in Russian at all…

Cantotango,

"However there have also been times that I was genuinely impressed by GT. "
Yes, I agree with you on that.

Does “Тогда я пошел немного чокнутый” make sense in Russian?

Я кошку. Меня зовут Спенсер.

Je suis un chat. Mon nom est Spencer.

I don’t know whether the Russian above (Тогда я пошел немного чокнутый) makes any sense, but the Google translation into German isn’t quite correct. “Dann ging ich ein bisschen verrückt” should be “Dann wurde ich ein bisschen verrückt”, although one could certainly have understood it.

That’s because both English and German belong to the same family. The Russian “Тогда я пошел немного чокнутый” is a calque. I can recognize it as a calque only because I know English. A Russian speaker with no background in foreign languages would have a hard time understanding it. Just as an English speaker with no knowledge of Russian would have trouble understanding the phrase “Then I gently dislocated.”

I hope it wasn’t too painful!

No, I’m all ri—
(faints)

I hope you have recovered and are strong enough now to enlighten me: What was distributed or deployed gently?