Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation

About the Project

ACCURAT is a collaborative project funded under the FP7-ICT-2009-4 call, action ICT-2009.2.2: Language-based interaction, under grant agreement no. 248347.

Project summary

The aim of the ACCURAT project is to research methods and techniques to overcome one of the central problems of machine translation (MT): the lack of linguistic resources for under-resourced areas of machine translation. The main goal is to find, analyze and evaluate novel methods that exploit comparable corpora in order to compensate for the shortage of linguistic resources, and ultimately to significantly improve MT quality for under-resourced languages and narrow domains.

The applicability of current data-driven methods depends directly on the availability of very large quantities of parallel corpus data. For this reason, the translation quality of current data-driven MT systems varies dramatically, from quite good for language pairs with large corpora available (e.g. English and French) to almost unusable for under-resourced languages and domains (e.g. Latvian and Croatian). The ultimate goal of ACCURAT is therefore to achieve a significant increase in translation quality for under-resourced languages and narrow domains.

The key innovation of ACCURAT will be the creation of a methodology and tools to measure, find and use comparable corpora to improve the quality of MT for under-resourced languages and domains. The ACCURAT project will thus contribute significantly not only to the theory of MT, but also to information extraction and natural language processing in general, and will strongly advance the theoretical foundations and methodology of research in corpus linguistics.

Scientific objectives

  • Create comparability metrics: develop the methodology and determine the criteria to measure the comparability of source and target language documents in comparable corpora;
  • Research methods for alignment and extraction of lexical, terminological and other linguistic data from comparable corpora;
  • Research methods for automatic acquisition of a comparable corpus from the Web;
  • Measure improvements from applying acquired data against baseline results from SMT and RBMT systems.
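
As an illustration of what a comparability metric could look like, the following is a minimal sketch, not the metric developed by ACCURAT. It assumes a small bilingual lexicon is available, maps the source document's word counts into the target language, and scores the pair by cosine similarity of the two bags of words. The function names and the toy lexicon are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Return lowercase, whitespace-tokenised term counts."""
    return Counter(text.lower().split())


def translate_counts(counts, lexicon):
    """Map source-language counts into the target language.

    Words missing from the (hypothetical) bilingual lexicon are dropped.
    """
    translated = Counter()
    for word, freq in counts.items():
        if word in lexicon:
            translated[lexicon[word]] += freq
    return translated


def cosine(a, b):
    """Cosine similarity of two count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def comparability(source_text, target_text, lexicon):
    """Score a document pair between 0.0 (unrelated) and 1.0 (same vocabulary)."""
    source_in_target = translate_counts(bag_of_words(source_text), lexicon)
    return cosine(source_in_target, bag_of_words(target_text))


# Toy Latvian-to-English lexicon, assumed for illustration only.
toy_lexicon = {"kaķis": "cat", "suns": "dog", "māja": "house"}
score = comparability("kaķis suns", "cat dog", toy_lexicon)
```

A score like this ignores word order, morphology and lexicon coverage, so it could serve at most as a rough baseline against which more refined comparability criteria are judged.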

The project will use state-of-the-art SMT and rule-based MT (RBMT) systems as a baseline and will provide novel methods to achieve much better results by extending these systems through the use of comparable corpora. Initial research demonstrates promising results from the use of comparable corpora in SMT (Munteanu and Marcu, 2005; see also chapter on the state-of-the-art below) and RBMT (Thurmair, 2006), which makes the ACCURAT consortium confident of the feasibility of the proposed approach.

Technological objectives

  • To develop methods and tools to automatically select similar documents from comparable corpora and align them at paragraph/sentence level for texts with different degrees of parallelism;
  • To elaborate advanced techniques for extraction of lexical, terminological and other linguistic data from comparable corpora in order to provide training and customization data for MT;
  • To develop methods and tools for automatic acquisition of comparable corpora from the Web;
  • To improve the quality of baseline SMT and RBMT systems by integrating data extracted from the comparable corpora;
  • To evaluate and validate the ACCURAT project results in three practical applications.

The ACCURAT project will investigate two broad use cases where the scarcity of linguistic resources poses a major challenge: adapting machine translation to under-resourced languages and to narrow domains.

The ACCURAT project will provide researchers and developers with a methodology and a fully functional model for exploiting comparable corpora in MT. This includes corpus acquisition from the Web and other sources, analysis and metrics of comparability, multi-level alignment and extraction of lexical data, and techniques for applying aligned text and extracted lexical data to increase the translation quality of existing SMT and RBMT systems.

ACCURAT will provide an approach to achieving quality MT for a number of new EU official languages and languages of associated countries, as well as novel approaches for adapting existing MT technologies to specific narrow domains, significantly increasing the language and domain coverage of automated translation.

ACCURAT will make its novel methodology for under-resourced areas of MT openly accessible. This covers comparability metrics, methods and techniques of alignment for comparable corpora, methods and techniques of information extraction from aligned comparable corpora at different levels (document, paragraph, phrase/word), and methods and techniques of collecting comparable corpora from the Web, as well as collections of comparable corpora for the project languages.

2010-03-04