Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation

Information for General Public

In this section you can find questions and answers that illustrate the field of the ACCURAT project with examples. A more technical presentation of our work can be found in our Publications section or Video lectures section.

Is fully automated high-quality Machine Translation possible?

The answer to this question has changed throughout the history of research in Machine Translation (MT). The early optimism, with its completely positive answer, was replaced by deep scepticism and a completely negative answer after the legendary ALPAC report in 1966, which almost put an end to all research in MT. The revival in the recent decade lets us believe that the answer could be positive again. The huge amount and availability of the same e-text in two or more languages (i.e. parallel corpora) provide the fundamental data for Statistical MT (SMT) systems. Today we witness an unforeseen growth in the number of these e-texts on the World Wide Web and in other sources (such as parliamentary debates in multilingual countries, translations of the Acquis Communautaire, translated newswire services, localised technical manuals etc.). The success of contemporary SMT systems is based on this fact.

Is a large amount of parallel e-texts enough?

Given the current state of (S)MT technology, even an extremely large amount of parallel e-texts does not always guarantee that the SMT systems built around them will perform satisfactorily. Different languages behave differently because they have different linguistic structures. While translation from language X to language Y using SMT could be rather easy, the same does not necessarily apply to the reverse direction, i.e. translation from language Y to language X. A number of factors play a role in this process: lexical richness, morphological complexity, syntactic complexity etc.

Does a large amount of parallel e-texts exist for all languages?

Parallel e-texts do not appear on the World Wide Web in sufficient amounts for all languages equally, and this is certainly true for rare language pairs or "under-resourced languages", i.e. languages serving linguistic communities with a smaller number of speakers. While English-French or English-German parallel e-text is easy to find, Romanian-Greek or Latvian-Croatian is very hard to find in amounts sufficient for training SMT systems. This is why ACCURAT is investigating how a different kind of e-text, i.e. non-parallel or weakly parallel e-texts, could be used for training SMT systems. This is particularly useful for under-resourced languages, where any kind of e-text covering a certain domain adds a valuable contribution to a limited repository.

Does a large amount of parallel e-texts exist for all domains?

Even if a sufficient amount of parallel e-texts exists for a certain language pair, there is no guarantee that SMT systems trained for one domain will yield satisfactory translations for texts in another domain. The training stage of all SMT systems is known for its high domain dependency, leading to very poor results when an SMT system trained in domain A is applied to domain B. In this respect ACCURAT will try to apply its methodology and results in two test cases of domain-restricted e-texts in under-resourced languages.

| 2010-03-04 |
