Introduction to AI

AI (Artificial Intelligence) is the study and design of intelligent agents. AI programs are called Intelligent Agents. Here is how it works:


The Intelligent Agent (on the left) interacts with an Environment (on the right). The Agent perceives the state of the Environment through its sensors and, at the same time, changes that state through its actuators.

The real challenge in AI is the function that maps sensors to actuators: this is called the Control Policy of the Agent.

Based on the data received from its sensors, the agent makes decisions and passes them on to its actuators. These decisions are made over and over again, and the loop of environment, sensor feedback, agent decision and actuator interaction with the environment is called the Perception-Action Cycle.
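To make the cycle concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the thermostat-style environment, the names Environment, control_policy and run_cycle, and the target temperature are not from any real system. It only shows the shape of the loop: sense, decide, act, repeat.

```python
import random


class Environment:
    """Hypothetical toy environment: a room whose temperature drifts randomly."""

    def __init__(self, temperature=15.0, seed=0):
        self.temperature = temperature
        self._rng = random.Random(seed)

    def sense(self):
        # What the agent's sensor reads from the environment.
        return self.temperature

    def apply(self, action):
        # Effect of the agent's actuator, plus a small random drift
        # that stands in for a stochastic environment.
        effect = {"heat": 1.0, "cool": -1.0, "idle": 0.0}[action]
        self.temperature += effect + self._rng.uniform(-0.2, 0.2)


def control_policy(reading, target=20.0, tolerance=0.5):
    """Map a sensor reading to an actuator command."""
    if reading < target - tolerance:
        return "heat"
    if reading > target + tolerance:
        return "cool"
    return "idle"


def run_cycle(env, steps=20):
    """Run the perception-action cycle and record (reading, action) pairs."""
    history = []
    for _ in range(steps):
        reading = env.sense()             # perception
        action = control_policy(reading)  # decision
        env.apply(action)                 # action on the environment
        history.append((reading, action))
    return history


if __name__ == "__main__":
    room = Environment()
    for reading, action in run_cycle(room):
        print(f"{reading:5.2f} -> {action}")
```

The policy here is a hand-written rule. In a real agent, finding a good control_policy is exactly the hard part described above.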

AI is used in many fields, among which:

  • Finance
  • Robotics
  • Games
  • Medicine
  • And of course: the Web

AI and uncertainty

AI is all about uncertainty management. In other words, we use AI if we want to know what to do when we don’t know what to do. There could be many reasons for uncertainty in a computer program:

  • Sensor limits
  • Adversaries that make it hard for you to understand what’s happening
  • Stochastic environment (where behaviors are intrinsically non-deterministic)
  • Laziness
  • Plain ignorance (many people don’t know what’s going on and could easily find out, but they just don’t care)

All of the above are possible causes of uncertainty, and therefore reasons to use AI.

Example of AI in practice

One of the many key applications of AI techniques is Machine Translation. How does Machine Translation work?

Machine Translation generates translations using AI techniques based on bilingual text corpora. Where such corpora are available, impressive results can be achieved translating texts of a very similar kind. Unfortunately, such corpora of bilingual texts are still very rare and the size of the available corpora varies significantly from one language combination to the other.

So what does Machine Translation look like? For a large-scale Machine Translation system, examples are found on the web. On a small scale, they can be found anywhere. This example was found in a Chinese restaurant in Cupertino:

In this type of text, each line in Chinese corresponds to a line in English. To learn from this text, we need to find the correspondence between words in Chinese and words in English. For example, we can highlight the word “wonton” in English. It appears 3 times throughout the text. In each of those lines there is also one Chinese character that appears: 雲. So it seems highly probable that this ideogram in Chinese corresponds to the word “wonton” in English. Please note that we are talking about probabilities here. As a matter of fact, “wonton” in Chinese is 雲吞 and not just 雲. For some reason, 雲吞 is abbreviated to just 雲 on line 65, and that is not a common abbreviation.

You can go further and try to find out which ideogram in Chinese corresponds to the word “chicken” in English:


Please note that we aren’t 100% sure that 雞 is the ideogram for “chicken” in Chinese, but we do know that there is a good chance, because each time the word “chicken” appears in English this ideogram appears in Chinese.

Now let’s see if we can find a correspondence for the word Soup:


As you can see, the word “soup” occurs in most of these phrases but not in all of them. On the English side of the menu, it is missing in 1 place (65. Egg Drop Wonton Mix). Similarly, on the Chinese side of the menu, it is missing in 1 different place (廣東雲吞 60).

The correspondence doesn’t have to be 100% to tell us that there is a good chance of a correlation.

In Machine Translation, this type of alignment is used to create probability tables, hence the name Statistical Machine Translation. In other words, the tables hold the probability that a phrase in one language corresponds to a phrase in another language.
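The counting exercise above can be sketched in a few lines of Python. Please note the assumptions: the five menu lines below are invented for illustration (they are not the lines from the restaurant menu), the function names are hypothetical, and plain co-occurrence counting is a simplification of the alignment models that real Statistical Machine Translation systems use. For each English word, the sketch estimates how often each Chinese character appears on the lines where that word appears.

```python
from collections import Counter, defaultdict

# Hypothetical parallel menu lines (English, Chinese), invented for this sketch.
PARALLEL_MENU = [
    ("wonton soup", "雲吞湯"),
    ("chicken soup", "雞湯"),
    ("chicken wonton soup", "雞雲吞湯"),
    ("egg drop soup", "蛋花湯"),
    ("chicken fried rice", "雞炒飯"),
]


def cooccurrence_table(pairs):
    """For each English word w and Chinese character c, estimate
    (lines containing both w and c) / (lines containing w)."""
    word_lines = Counter()        # number of lines each English word appears on
    joint = defaultdict(Counter)  # joint[w][c]: lines containing both w and c
    for english, chinese in pairs:
        words = set(english.lower().split())
        chars = set(chinese)
        for w in words:
            word_lines[w] += 1
            for c in chars:
                joint[w][c] += 1
    return {
        w: {c: n / word_lines[w] for c, n in joint[w].items()}
        for w in word_lines
    }


def most_likely(table, word):
    """Return the character(s) with the highest co-occurrence score for a word."""
    scores = table[word]
    best = max(scores.values())
    return sorted(c for c, p in scores.items() if p == best)


if __name__ == "__main__":
    table = cooccurrence_table(PARALLEL_MENU)
    for word in ("chicken", "wonton", "soup"):
        print(word, most_likely(table, word), table[word])
```

On this toy data, “chicken” has a single best candidate, while “wonton” ties between several characters because it always appears next to “soup”. That is the same kind of ambiguity as in the menu example: with so little text, the table gives probabilities, not certainties, and more parallel lines are needed to separate the candidates.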

More on Machine Translation in future posts. Stay tuned.

Francesco Pugliano

 

The end of free Machine Translation API

Last June, Adam Feldman (API Product Manager at Google) announced they were pulling the plug on their Google Translate API, causing a lot of concern and some protests in the developer and localization communities. You can read the announcement here.

Then in August, Jeff Chin (International Product Manager at Google) took that back and announced that they would offer the Translate API for a fee instead of free of charge. You can read the announcement here.

Here is Google’s pricing model:
$20 per million characters of text translated.

In September, Vikram Dendi (Director of Product Management at Microsoft) announced something very similar, but not many people took notice. You can read the announcement here.

Here’s Microsoft’s pricing model:
No cost up to 4M characters a month. Then $10 per million characters.

Unlike Google, Microsoft will only charge you once you reach the threshold of 4M characters a month, and its service will then cost half as much ($10 per million characters instead of $20).
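To compare the two pricing models for a given monthly volume, here is a rough Python sketch. The rates come from the figures quoted above. The function names are made up, and two things are my assumptions rather than anything stated in the announcements: that Microsoft bills only the characters above the 4M free allowance, and that both services bill pro rata instead of in fixed blocks.

```python
GOOGLE_USD_PER_MILLION = 20.0
MICROSOFT_USD_PER_MILLION = 10.0
MICROSOFT_FREE_CHARS = 4_000_000  # free monthly allowance


def google_cost(chars):
    """Monthly cost in USD at $20 per million characters, billed from the first character."""
    return chars / 1_000_000 * GOOGLE_USD_PER_MILLION


def microsoft_cost(chars):
    """Monthly cost in USD at $10 per million characters.

    Assumption: only characters above the free allowance are billed.
    """
    billable = max(0, chars - MICROSOFT_FREE_CHARS)
    return billable / 1_000_000 * MICROSOFT_USD_PER_MILLION


if __name__ == "__main__":
    for volume in (1_000_000, 4_000_000, 10_000_000):
        print(f"{volume:>12,} chars  Google ${google_cost(volume):8.2f}  "
              f"Microsoft ${microsoft_cost(volume):8.2f}")
```

Under these assumptions, 10M characters a month would cost $200 with Google and $60 with Microsoft.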

Quality of Google and Bing Machine Translation services

Now that the technology is mature, the quality of the Google and Bing Statistical Machine Translation systems depends heavily on the quality of the parallel text found on the web and crawled by their MT engines. Before the advent of Google and Bing Translate, parallel text found on the web was, more often than not, produced by professional translators and therefore of good quality.

Now, translating content professionally is expensive. Depending on the domain of translation and the language pair, professional translation can cost as much as $0.50 per word for a language such as Japanese, and between $0.18 and $0.21 per word for European languages.

During the recent financial crunch in 2008, many web publishers needed to cut costs. It’s no surprise that they started to abuse the free Google Translate and Bing Translate APIs to translate content and then publish it as is, with no professional review.

This is a common technique that SEO companies have been applying to bring more users to a website and then steer them toward premium content (professionally translated content).

The problem is that no algorithm is (yet) capable of telling whether content has been translated by a Machine Translation system or by a professional translator. Only trained human translators who speak the language can do that.

Today, both the Microsoft and Google Machine Translation engines are crawling and processing web content that may have been published without any human proofreading after being translated using the very same Google or Microsoft translation APIs.

In other words, these two companies are “polluting their own drinking water”.

I hope that by starting to charge for their Machine Translation services, both Google and Microsoft can decrease, or at least control, the amount of sub-standard translations published on the web, so that in turn their MT engines can produce more reliable translations. Feeding their engines with United Nations and European Union bilingual documents is not enough to produce high-quality translations.

Size doesn’t matter without quality

In recent years, many publishers have started to build their own corpora of bilingual texts to feed their Machine Translation engines. It’s a given that an ad-hoc Machine Translation database fed only with high-quality, human-translated and proofread bilingual text in a specific domain can produce higher quality than Bing and Google Translate.

Unfortunately, at some point these publishers may start to pollute their MT systems with content that has been machine translated and not carefully reviewed by professional translators.

We have seen this happen in the past, for example when the hype was all about Translation Memories rather than MT engines, as it appears to be today. Some companies saw their Translation Memories grow bigger and bigger with little or no control over the quality of the content being fed into them, thus polluting their TMs and making them almost unusable.

Francesco Pugliano