How Google Translate works

nobody · December 21, 2009, 4:46pm

Following the recent Russian thread on LingQ with Google Translate (GT), it seems that GT does a fairly decent job most of the time. My experience in using GT for Spanish to English translation is that the quality is often pretty good, sometimes downright impressive. However, English/Japanese translation is often rather embarrassing, and I wonder why.

There is a brief Wikipedia article which explains GT’s approach:

A few key points:

GT does not apply grammatical rules, since its algorithms are based on statistical analysis rather than traditional rule-based analysis.
A solid base for developing a usable statistical machine translation system for a new pair of languages from scratch would consist of a bilingual text corpus (or parallel collection) of more than a million words and two monolingual corpora, each of more than a billion words.
To acquire this huge amount of linguistic data, Google used United Nations documents. The same document is normally available in all six official UN languages (Arabic, Chinese, English, French, Russian, Spanish), thus Google now has a 6-language corpus of 20 billion words’ worth of human translations.
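
For anyone curious what "statistical analysis rather than grammatical rules" can look like, here is a toy sketch in Python. It uses IBM Model 1, a classic textbook word-alignment model, and is certainly not Google's actual system. The three-sentence English/German corpus and the names train_ibm_model1 and best_translation are made up for illustration. The point is that the program is never told any grammar or dictionary entries; it only sees which sentences are translations of each other.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=20):
    """Estimate t(target_word | source_word) by EM (IBM Model 1, no NULL word).

    pairs is a list of (source_tokens, target_tokens) sentence pairs.
    This is an illustrative sketch, not a production translation model.
    """
    src_vocab = {w for src, _ in pairs for w in src}
    tgt_vocab = {w for _, tgt in pairs for w in tgt}

    # Start with no knowledge: every target word is equally likely.
    t = {(tw, sw): 1.0 / len(tgt_vocab) for sw in src_vocab for tw in tgt_vocab}

    for _ in range(iterations):
        count = defaultdict(float)  # expected (target, source) link counts
        total = defaultdict(float)  # expected link counts per source word

        # E-step: split each target word's "credit" among the source words
        # of the same sentence pair, in proportion to the current estimates.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    share = t[(tw, sw)] / norm
                    count[(tw, sw)] += share
                    total[sw] += share

        # M-step: re-estimate the probabilities from the expected counts.
        for tw, sw in t:
            t[(tw, sw)] = count[(tw, sw)] / total[sw]

    return t


def best_translation(t, source_word):
    """Return the target word with the highest estimated probability."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


# Hypothetical toy parallel corpus (English source, German target).
pairs = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]

t = train_ibm_model1(pairs)
for word in ["the", "house", "book", "a"]:
    print(word, "->", best_translation(t, word))
```

With only three sentence pairs the counts already favour sensible word pairings, because "the" keeps showing up with "das" and "book" with "buch". A real system needs vastly more text than this, which fits the corpus sizes quoted above.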

Now I think I know why the Japanese translation is often so poor: Japanese is not a UN language and hence GT does not have a sufficiently large corpus for the necessary statistical analysis.

The article also says that “Google representatives have been very active at domestic conferences in Japan in the field asking researchers to provide them with bilingual corpora.” I guess it would take them some time to get a sufficiently large text corpus, and so I am not holding my breath for a dramatic improvement in the Japanese translations any time soon.

An interesting development is that many languages are spoken in the European Union, and all official documents have to be translated into all of those languages. I don’t know exactly how long it will take, but a number of years down the road GT should have the same text available in many different languages. That should mean that some time in the future we will have good-quality machine translations in many languages.

In the meantime, LingQ can make a valuable contribution to help Google further its goals. There are a number of lessons available in all LingQ languages, e.g. Steve’s book. Maybe Steve should submit all versions of his book to Google. Who knows, maybe after that we will have much improved Swedish to Italian translations!

Yutaka · December 24, 2009, 11:13am

I think that as long as GT’s algorithms are based only on statistical analysis, the quality of translation cannot be perfect because statistics relies on “probability” theory. A human being with a good knowledge of grammatical rules does not necessarily need a huge database.

Thank you, Cantotango. By reading your post, I was able to understand the characteristics of the GT system.

Yutaka · December 24, 2009, 11:37am

An experiment with the GT system

我が輩は、猫である。—> My fellow is a cat.
吾輩は猫である。—> I Am a Cat.
私は、猫である。—> I have a cat.
私は猫である。—> I am a cat.

Yutaka · December 24, 2009, 11:47am

吾輩は、猫である。—> Spence is a cat. ( What is SPENCE? )

SanneT · December 24, 2009, 11:59am

Spence may be a name, short for Spencer.

Yutaka · December 24, 2009, 12:04pm

吾輩は、犬である。—> Spence is a dog.
吾輩は犬である。—> Am a dog.
私は、犬である。—> I have a dog.
私は犬である。—> I am a dog.

Yutaka · December 24, 2009, 12:08pm

I am a cat. My name is Spencer…

Yutaka · December 24, 2009, 12:15pm

Ich bin eine Katze. Mein Name ist Spencer.

nobody · December 24, 2009, 12:17pm

Yutaka,

No doubt about it, machines will not be able to match good human translators. However, there have also been times that I was genuinely impressed by GT.

Here’s what Steve wrote in one of the Russian threads:

Я был бизнесменом, пока я не начал LingQ. Тогда я как бы пошли немного сумасшедшие.

Translation by GT:

I was a businessman, until I began LingQ. Then I kind of went a little crazy.

I was stunned when I read this. The tone and manner are so Steve-like! This could be exactly what Steve would have said had he written in English.

Yutaka · December 24, 2009, 12:17pm

私は猫です。私の名前はスペンサーです。

私(watashi)は(wa)猫(neko)です(desu)。私(watashi)の(no)名前(namae)は(wa)スペンサー(supensaa)です(desu)。

nobody · December 24, 2009, 12:25pm

But "Тогда я как бы пошли немного сумасшедшие." does not make sense in Russian at all…

Yutaka · December 24, 2009, 12:31pm

Cantotango,

"However, there have also been times that I was genuinely impressed by GT."
Yes, I agree with you on that.

Yutaka · December 24, 2009, 1:14pm

Does “Тогда я пошел немного чокнутый” make sense in Russian?

Yutaka · December 24, 2009, 1:38pm

Я кошку. Меня зовут Спенсер.

Yutaka · December 24, 2009, 1:40pm

Je suis un chat. Mon nom est Spencer.

SanneT · December 25, 2009, 10:28pm

I don’t know whether the Russian above (Тогда я пошел немного чокнутый) makes any sense, but the Google translation into German isn’t quite correct. “Dann ging ich ein bisschen verrückt” should be “Dann wurde ich ein bisschen verrückt”, but one would certainly have been able to understand it.

nobody · December 25, 2009, 10:39pm

That’s because both English and German belong to the same family. The Russian “Тогда я пошел немного чокнутый” is a calque. I can recognize it as a calque only because I know English. A Russian speaker with no background in foreign languages would have a hard time understanding it. Just as an English speaker with no knowledge of Russian would have trouble understanding the phrase “Then I gently dislocated.”

SanneT · December 25, 2009, 10:58pm

I hope it wasn’t too painful!

nobody · December 25, 2009, 11:01pm

No, I’m all ri—
(faints)

SanneT · December 25, 2009, 11:19pm

I hope you have recovered and are strong enough now to enlighten me: What was distributed or deployed gently?