Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation

Publications

ACCURAT publications at Mendeley

You can follow our publications and the discussions about them in our Mendeley group.

Papers

2010

Eisele, A., Xu, J. Improving Machine Translation Performance Using Comparable Corpora // Proceedings of the 3rd Workshop on Building and Using Comparable Corpora, European Language Resources Association (ELRA), La Valletta, Malta, pp 35-41, May 2010. (PDF)

Skadiņa, I., Vasiļjevs, A., Skadiņš, R., Gaizauskas, R., Tufiş, D., Gornostay, T. Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation // Proceedings of the 3rd Workshop on Building and Using Comparable Corpora. European Language Resources Association (ELRA), La Valletta, Malta, pp 6-14, May 2010. (PDF)

Irimia, E., Ceauşu, A. Augmenting a statistical machine translation baseline system with syntactically motivated translation examples // Proceedings of the Workshop on Exploitation of Multilingual Resources and Tools for Central and (South) Eastern European Languages. European Language Resources Association (ELRA), La Valletta, Malta, pp 1-8, May 2010 (PDF)

Guthrie, D., Hepple, M., Liu, W. Efficient minimal perfect hash language models // Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), Valletta, Malta, pp 2889-2896, May 2010. (PDF)

Irimia, E., Ceauşu, A. Dependency-based translation equivalents for factored machine translation // Gelbukh, A. (ed.) Research in Computing Science, Special Issue: Natural Language Processing and Applications, vol 46, Instituto Politécnico Nacional, Centro de Investigación en Computación, Mexico, pp 205-216, 2010. (PDF)

Guthrie, D., Hepple, M. Minimal perfect hash rank: Compact storage of large language models // Proceedings of the Microsoft N-gram Workshop, Geneva, Switzerland, June 2010. (PDF)

Boroş, T., Tufiş, D., Ceauşu, A. Construcţia automată de corpusuri multilinguale // Proceedings of the CONSILR2010 Conference, Editura Universităţii “A.I. Cuza”, Iaşi, pp 103-112, 2010. (PDF)

Ion, R., Tufiş, D., Boroş, T., Ceauşu, A., Ştefănescu, D. On-Line Compilation of Comparable Corpora and their Evaluation // Proceedings of the 7th International Conference Formal Approaches to South Slavic and Balkan Languages (FASSBL7), Croatian Language Technologies Society – Faculty of Humanities and Social Sciences, University of Zagreb, Dubrovnik, Croatia, pp 29-34, October 2010. (PDF)

Šojat, K., Agić, Ž., Tadić, M. Verb Valency Frame Extraction Using Morphological and Syntactic Features of Croatian // Proceedings of the 7th International Conference Formal Approaches to South Slavic and Balkan Languages (FASSBL7), Croatian Language Technologies Society – Faculty of Humanities and Social Sciences, University of Zagreb, Dubrovnik, Croatia, pp 119-126, October 2010. (PDF)

Vučković, K., Agić, Ž., Tadić, M. Sentence Classification and Clause Detection for Croatian // Proceedings of the 7th International Conference Formal Approaches to South Slavic and Balkan Languages (FASSBL7), Croatian Language Technologies Society – Faculty of Humanities and Social Sciences, University of Zagreb, Dubrovnik, Croatia, pp 131-138, October 2010. (PDF)

Skadiņa, I., Aker, A., Giouli, V., Tufiş, D., Gaizauskas, R., Mieriņa, M., Mastropavlos, N. A Collection of Comparable Corpora for Under-resourced Languages // Proceedings of the Fourth International Conference Baltic HLT 2010, IOS Press, Frontiers in Artificial Intelligence and Applications, Vol. 219, Riga, Latvia, pp 161-168, October 2010. (PDF)

Guthrie, D., Hepple, M. Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval // Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cambridge, MA, October 2010. (PDF)

Tufiş, D., Ion, R., Ceauşu, A., Ştefănescu, D. Reifying the Alignments // Tufiş, D., Forăscu, C. (eds.) Multilinguality and Interoperability in Language Processing with Emphasis on Romanian, Editura Academiei, pp 69-86, 2010 (PDF)

2011

Berović, D., Merkler, D., Agić, Ž. Disambiguation of homographic adjective and adverb forms in Croatian // Proceedings of the NooJ2011 Conference, Dubrovnik, Croatia, 13-15 June 2011, Cambridge Scientific Publishers, pp 137-145. (PDF)

Fišer, D., Ljubešić, N., Vintar, Š., Pollak, S. Building and using comparable corpora for domain-specific bilingual lexicon extraction // Proceedings of the 4th Workshop on Building and Using Comparable Corpora (BUCC), Portland, USA, 24 June 2011, pp 19-26. (PDF)

Ion, R., Ceauşu, A., Irimia, E. An Expectation Maximization Algorithm for Textual Unit Alignment // Proceedings of the 4th Workshop on Building and Using Comparable Corpora (BUCC), Portland, USA, 24 June 2011, pp 128-135. (PDF)

Ceauşu, A., Tufiş, D. Addressing SMT Data Sparseness when Translating into Morphologically-Rich Languages // Sharp, B.; Zock, M.; Carl, M.; Jakobsen, A. L. (eds.) Proceedings of the 8th international NLPCS workshop. Special theme: Human-machine interaction in translation, Copenhagen Business School, 2011, pp. 57-68. (PDF)

Pinnis, M., Goba, K. Maximum Entropy Model for Disambiguation of Rich Morphological Tags // Systems and Frameworks for Computational Morphology, Communications in Computer and Information Science, 1, Volume 100, The 2nd Workshop on Systems and Frameworks for Computational Morphology (SFCM2011), Zürich, 26 August 2011, Springer, Heidelberg, pp 14-22. (PDF)

Ljubešić, N., Fišer, D. Bootstrapping bilingual lexicons from comparable corpora for closely related languages // Proceedings of the 14th International Conference Text, Speech and Dialogue (TSD2011), Plzeň, Czech Republic, 1-5 September 2011, Lecture Notes in Artificial Intelligence 6836, Springer, Heidelberg, pp 91-98. (PDF)

Ljubešić, N., Erjavec, T. hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene // Proceedings of the 14th International Conference Text, Speech and Dialogue (TSD2011), Plzeň, Czech Republic, 1-5 September 2011, Lecture Notes in Artificial Intelligence 6836, Springer, Heidelberg, pp 395-402. (PDF)

Fišer, D., Ljubešić, N. Bilingual lexicon extraction from comparable corpora for closely related languages // Proceedings of the International conference Recent Advances in Natural Language Processing (RANLP2011), Hissar, Bulgaria, 12-14 September 2011, Bulgarian Academy of Sciences, Sofia, pp 125-131. (PDF)

Agić, Ž., Berović, D., Merkler, D., Tadić, M. Development and Applications of the Croatian 1984 Corpus for the MULTEXT-East Resources // Proceedings of the 2nd International Conference on Slavic Corpora (SlaviCorp2011), Dubrovnik, Croatia, 12-14 September 2011, (in press).

2012

Irimia, E. DEACC – Lexical Dictionary Extractor from Comparable Corpora // Moruz, M. A., Cristea, D., Tufiş, D., Iftene, A., Teodorescu, H.-N. (eds.) Proceedings of the 8th International Conference Linguistic resources and tools for processing of the Romanian language, Bucharest, Romania, 8-9 December 2011 and 26-27 April 2012, “Alexandru Ioan Cuza” University Publishing House, Iaşi, pp. 173-180. (PDF)

Ion, R. Graphic Comparability Levels for Comparable Corpora // Moruz, M. A., Cristea, D., Tufiş, D., Iftene, A., Teodorescu, H.-N. (eds.) Proceedings of the 8th International Conference Linguistic resources and tools for processing of the Romanian language, Bucharest, Romania, 8-9 December 2011 and 26-27 April 2012, “Alexandru Ioan Cuza” University Publishing House, Iaşi, pp. 127-133. (PDF)

Ştefănescu, D. Extracting Parallel Terminology from Comparable Corpora // Moruz, M. A., Cristea, D., Tufiş, D., Iftene, A., Teodorescu, H.-N. (eds.) Proceedings of the 8th International Conference Linguistic resources and tools for processing of the Romanian language, Bucharest, Romania, 8-9 December 2011 and 26-27 April 2012, “Alexandru Ioan Cuza” University Publishing House, Iaşi, pp. 181-188. (PDF)

Brunello, M. Understanding the composition of parallel corpora from the web // Proceedings of the 7th Web as Corpus Workshop (WAC7), Lyon, France, 17 April 2012, pp. 7-13. (PDF)

Su, F., Babych, B. Measuring Comparability of Documents in Non-Parallel Corpora for Efficient Extraction of (Semi-)Parallel Translation Equivalents // Proceedings of the EACL'12 Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra), Avignon, France, 23-27 April 2012, pp. 10-19. (PDF)

Aker, A., Kanoulas, E., Gaizauskas, R. A light way to collect comparable corpora from the web // Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, 21-27 May 2012, pp. 1258-1265. (PDF)

Paramita, M. L., Clough, P., Aker, A., Gaizauskas, R. Correlation between Similarity Measures for Inter-Language Linked Wikipedia Articles // Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, 21-27 May 2012, pp. 790-797. (PDF)

Barker, E., Gaizauskas, R. Assessing the Comparability of News Texts // Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, 21-27 May 2012, pp. 3996-4003. (PDF)

Su, F., Babych, B. Development and Application of a Cross-language Document Comparability Metric // Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, 21-27 May 2012, pp. 3956-3962. (PDF)

Pinnis, M. Latvian and Lithuanian Named Entity Recognition with TildeNER // Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, 21-27 May 2012, pp. 1258-1265. (PDF)

Ion, R. PEXACC: A Parallel Sentence Mining Algorithm from Comparable Corpora // Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, 21-27 May 2012, pp. 2181-2188. (PDF)

Skadiņa, I., Aker, A., Mastropavlos, N., Su, F., Tufiş, D., Verlič, M., Vasiļjevs, A., Babych, B., Clough, P., Gaizauskas, R., Glaros, N., Paramita, M. L., Pinnis, M. Collecting and Using Comparable Corpora for Statistical Machine Translation // Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 21-27 May 2012, pp. 438-445. (PDF)

Berović, D., Agić, Ž., Tadić, M. Croatian Dependency Treebank: Recent Development and Initial Experiments // Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 21-27 May 2012, pp. 1902-1906. (PDF)

Ştefănescu, D. Mining for Term Translations in Comparable Corpora // Proceedings of the 5th Workshop on Building and Using Comparable Corpora (BUCC 2012), Istanbul, Turkey, 26 May 2012, pp. 98-103. (PDF)

Skadiņa, I. Analysis and Evaluation of Comparable Corpora for Under-Resourced Areas of Machine Translation // Proceedings of the 5th Workshop on Building and Using Comparable Corpora (BUCC 2012), Istanbul, Turkey, 26 May 2012, pp. 17-19. (PDF)

Ljubešić, N., Vintar, Š., Fišer, D. Multi-word term extraction from comparable corpora by combining contextual and constituent clues // Proceedings of the 5th Workshop on Building and Using Comparable Corpora (BUCC 2012), Istanbul, Turkey, 26 May 2012, pp. 143-147. (PDF)

Irimia, E. Experimenting with Extracting Lexical Dictionaries from Comparable Corpora for English-Romanian language pair // Proceedings of the 5th Workshop on Building and Using Comparable Corpora (BUCC 2012), Istanbul, Turkey, 26 May 2012, pp. 49-55. (PDF)

Ştefănescu, D., Ion, R., Hunsicker, S. Hybrid Parallel Sentence Mining from Comparable Corpora // Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT 2012), Trento, Italy, May 28-30, 2012, pp. 137-144. (PDF)

Tufiş, D., Dumitrescu, S. D. Cascaded Phrase-Based Statistical Machine Translation Systems // Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT 2012), Trento, Italy, May 28-30, 2012, pp. 129-136. (PDF)

Thurmair, G., Aleksić, V. Creating Term and Lexicon Entries from Phrase Tables // Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT 2012), Trento, Italy, May 28-30, 2012, pp. 253-260. (PDF)

Preiss, J. Identifying Comparable Corpora Using LDA // Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Montréal, Canada, 3-8 June 2012, pp. 558-562. (PDF)

Pinnis, M., Ljubešić, N., Ştefănescu, D., Skadiņa, I., Tadić, M., Gornostay, T. Term Extraction, Tagging, and Mapping Tools for Under-Resourced Languages // Proceedings of the 10th Terminology and Knowledge Engineering Conference (TKE 2012), Madrid, Spain, 19-22 June 2012, pp. 193-208. (PDF)

Peters, C., Braschler, M., Clough, P. Multilingual Information Retrieval: From Research to Practice // Springer, Heidelberg, Germany, 2012. (PDF)

Pinnis, M., Ion, R., Ştefănescu, D., Su, F., Skadiņa, I., Vasiļjevs, A., Babych, B. Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora // Proceedings of ACL 2012, System Demonstrations Track, Jeju Island, Republic of Korea, 8-14 July 2012. (PDF)

Invited talks

Eisele, A. From corpora to resources and tools – towards a proper treatment of Eastern European languages, The Fourth International Conference Human Language Technologies – the Baltic Perspective, Riga, Latvia, October 7–8, 2010.

Vasiļjevs, A. ACCURAT - using comparable corpora for MT, LT Days 2010, Luxembourg, March 22-23, 2010. (Slides)

Vasiļjevs, A. How to get more data for under-resourced languages and domains?, FLaReNet Forum 2011, Venice, 26-27 May 2011. (Slides)

Tufiş, D. Analysis and Evaluation of Comparable Corpora for Under-Resourced Areas of Machine Translation, Polytechnic University of Valencia, 2011-10-25.

Conference Presentations

Eisele, A., Xu, J. Improving Machine Translation Performance Using Comparable Corpora // The 3rd Workshop on Building and Using Comparable Corpora, LREC2010, La Valletta, Malta, 22 May 2010. (Slides)

Skadiņa, I., Vasiļjevs, A., Skadiņš, R., Gaizauskas, R., Tufiş, D., Gornostay, T. Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation // The 3rd Workshop on Building and Using Comparable Corpora, LREC2010, La Valletta, Malta, 22 May 2010. (Slides)

Vasiļjevs, A. ACCURAT: Metrics for the evaluation of comparability of multilingual corpora // The Workshop on Methods for the Automatic Acquisition of Language Resources and their Evaluation Methods, LREC2010, La Valletta, Malta, 23 May 2010. (Slides)

Eisele, A. ACCURAT poster // 14th Annual Conference of the European Association for Machine Translation, Saint-Raphaël, France, 27-28 May 2010.

Štefanec, V., Vučković, K., Dovedan, Z. Towards Parsing Croatian Complex Sentences: Dependent Noun Clauses // NooJ2010 Conference, Komotini, Greece, 27-29 May 2010. (Slides)

Vučković, K., Bekavac, B., Dovedan, Z. Improved Parser for Simple Croatian Sentences // NooJ2010 Conference, Komotini, Greece, 27-29 May 2010. (Slides)

Ion, R., Tufiş, D., Boroş, T., Ceauşu, A., Ştefănescu, D. On-Line Compilation of Comparable Corpora and their Evaluation // The 7th International Conference Formal Approaches to South Slavic and Balkan Languages (FASSBL7), Dubrovnik, Croatia, 4-6 October 2010. (Slides)

Šojat, K., Agić, Ž., Tadić, M. Verb Valency Frame Extraction Using Morphological And Syntactic Features Of Croatian // The 7th International Conference Formal Approaches to South Slavic and Balkan Languages (FASSBL7), Dubrovnik, Croatia, 4-6 October 2010. (Slides)

Vučković, K., Agić, Ž., Tadić, M. Sentence Classification and Clause Detection for Croatian // The 7th International Conference Formal Approaches to South Slavic and Balkan Languages (FASSBL7), Dubrovnik, Croatia, 4-6 October 2010. (Slides)

Skadiņa, I., Aker, A., Giouli, V., Tufiş, D., Gaizauskas, R., Mieriņa, M., Mastropavlos, N. A Collection of Comparable Corpora for Under-resourced Languages // The Fourth International Conference Baltic HLT 2010, Riga, Latvia, 7-8 October 2010.

Goba, K., Skadiņš, R. Improving SMT with Morphology Knowledge for Baltic Languages // Workshop on Machine Translation and Morphologically rich Languages, Haifa, Israel, 23-27 January 2011.

Babych, B., Hartley, A. Meta-evaluation of comparability metrics using parallel corpora // 12th International Conference Computational Linguistics and Intelligent Text Processing CICLing2011, Tokyo, Japan, 20-26 February 2011.

Vasiļjevs, A. Bridging technological gap between smaller and larger languages // W3C Workshop: Content on the Multilingual Web, Pisa, Italy, 4-5 April 2011. (Slides)

Berović, D., Merkler, D. Problemi lematizacije priloga i veznika u hrvatskim tekstovima // Annual Conference of the Croatian Applied Linguistics Society, Osijek, Croatia, 12-14 May 2011.

Berović, D., Merkler, D., Agić, Ž. Disambiguation of homographic adjective and adverb forms in Croatian // NooJ2011 Conference, Dubrovnik, Croatia, 13-15 June 2011. (Slides)

Fišer, D., Ljubešić, N., Vintar, Š., Pollak, S. Building and using comparable corpora for domain-specific bilingual lexicon extraction // The 4th Workshop on Building and Using Comparable Corpora (BUCC), Portland, USA, 24 June 2011.

Ion, R., Ceauşu, A., Irimia, E. An Expectation Maximization Algorithm for Textual Unit Alignment // The 4th Workshop on Building and Using Comparable Corpora (BUCC), Portland, USA, 24 June 2011.

Ceauşu, A., Tufiş, D. Addressing SMT Data Sparseness when Translating into Morphologically-Rich Languages // 8th international NLPCS workshop. Special theme: Human-machine interaction in translation, Copenhagen Business School, 20-21 August 2011.

Pinnis, M., Goba, K. Maximum Entropy Model for Disambiguation of Rich Morphological Tags // The 2nd Workshop on Systems and Frameworks for Computational Morphology (SFCM2011), Zürich, 26 August 2011.

Ljubešić, N., Fišer, D. Bootstrapping bilingual lexicons from comparable corpora for closely related languages // The 14th International Conference Text, Speech and Dialogue (TSD2011), Plzeň, Czech Republic, 1-5 September 2011.

Ljubešić, N., Erjavec, T. hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene // The 14th International Conference Text, Speech and Dialogue (TSD2011), Plzeň, Czech Republic, 1-5 September 2011.

Fišer, D., Ljubešić, N. Bilingual lexicon extraction from comparable corpora for closely related languages // The International conference Recent Advances in Natural Language Processing (RANLP2011), Hissar, Bulgaria, 12-14 September 2011.

Agić, Ž., Berović, D., Merkler, D., Tadić, M. Development and Applications of the Croatian 1984 Corpus for the MULTEXT-East Resources // The 2nd International Conference on Slavic Corpora (SlaviCorp2011), Dubrovnik, Croatia, 12-14 September 2011. (Slides)

Ljubešić, N., Erjavec, T. hrWaC and slWac: Web Corpora for Croatian and Slovene // The 2nd International Conference on Slavic Corpora (SlaviCorp2011), Dubrovnik, Croatia, 12-14 September 2011. (Slides)
