News Release

In the quest for excellence in machine translation

Business Announcement

Elhuyar Fundazioa

All machine translators have their limitations. Their output is far from perfect, but they are often useful tools. The UPV/EHU's IXA Group has recently secured a European project, QTLeap, to overcome these limitations and continue its research into machine translation. The IXA Group will be working in collaboration with many other European institutions that are pioneers in machine translation: the DFKI of Germany, the University of Lisbon, the Charles University in Prague, the Bulgarian Academy of Sciences (IICT-BAS), the Humboldt University of Berlin, and the University of Groningen.

"It's easier to produce a good machine translation between language pairs that are grammatically and morphologically similar, like the Spanish-Catalan or the Spanish-Galician pairs," explained Kepa Sarasola, a member of the IXA Group. In the case of Basque, however, the difficulty is greater and the quality more doubtful. "There are three main difficulties: firstly, the big morphological and grammatical differences between Basque and other languages; secondly, the selection of suitable equivalents that one word has in other languages (having to choose which meaning to use in each specific context); and finally, having a limited corpus of translated texts." Compared with other languages, Basque has a totally different structure and it is very difficult for machine translators to get the order of the elements in the translation right. In addition, one of the main challenges facing Basque machine translators is to obtain a sufficiently large number of translated texts so that large corpora can be produced.

Closer to perfection

Members of the UPV/EHU's IXA Group have long been working on machine translators, and with the European QTLeap project they want to take a new step forward in both the technology and the research behind it. In this project they will endeavour to overcome the limitations of today's machine translators.

For this purpose they will be using treebanks. "The aim behind using a lot of sentences that have been properly analysed syntactically, in other words by means of treebanks, is to help machine translators select the syntax better," says Sarasola.

In addition, the information needed to distinguish the meaning of a word is found not only in dictionaries; it can also be obtained from the Internet. Certain Internet resources will therefore be used to resolve ambiguity between word meanings. In large text collections like Wikipedia, for example, the intended meaning of a word has often been made explicit. In other words, "if there is a link below a word, the link will lead to one Wikipedia meaning or another." If many of these links are gathered together, computers can learn from them to distinguish between meanings. Wikipedia is just one of the possible sources. "Today there are more and more texts on the Internet with links of this kind; all these possibilities are known as Linked Open Data," he added. That would help, for example, to clarify whether the Basque word baso should be rendered in Spanish as bosque (forest) or vaso (glass).
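The idea of learning word meanings from links can be illustrated with a minimal sketch. It is not the QTLeap method: the training examples, the context words and the simple word-overlap scoring below are all assumptions made for illustration, standing in for the linked texts a real system would collect.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: each entry pairs the words found around a linked
# occurrence of "baso" with the meaning its link points to.
LINKED_EXAMPLES = [
    (["zuhaitz", "mendi", "pago"], "forest"),
    (["zuhaitz", "animalia", "natura"], "forest"),
    (["ur", "edan", "mahai"], "glass"),
    (["ardo", "edan", "beira"], "glass"),
]

def train(examples):
    """Count how often each context word appears with each meaning."""
    counts = defaultdict(Counter)
    for context, sense in examples:
        counts[sense].update(context)
    return counts

def disambiguate(counts, context):
    """Pick the meaning whose training contexts overlap most with `context`."""
    def score(sense):
        return sum(counts[sense][word] for word in context)
    # Sorting the senses makes ties resolve deterministically.
    return max(sorted(counts), key=score)

model = train(LINKED_EXAMPLES)
print(disambiguate(model, ["ur", "edan"]))        # context about drinking water
print(disambiguate(model, ["zuhaitz", "mendi"]))  # context about trees and mountains
```

With more linked examples, the same counting approach has more evidence to draw on, which is why gathering many links matters.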

"We will also be working specifically on proper names, people's names, names of organisations or geographical locations. In fact, it is impossible to have all proper names properly tagged, but having a large number of them sorted would greatly improve the quality of the translations," said Sarasola. That would make it possible, for example, to keep the name Pilar del Castillo unchanged in translations rather than ending up with gazteluko pilarea (the castle pillar).
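One common way to keep tagged proper names intact, sketched here under stated assumptions, is to swap each known name for a placeholder before translating and restore it afterwards. The name list, the placeholder format and the `toy_translate` stand-in are hypothetical; the release does not say how QTLeap will handle names internally.

```python
# Hypothetical list of proper names that have already been tagged.
KNOWN_NAMES = ["Pilar del Castillo"]

def protect(text, names):
    """Replace each known name with a placeholder the translator will not touch."""
    mapping = {}
    for i, name in enumerate(names):
        if name in text:
            token = f"__NAME{i}__"
            text = text.replace(name, token)
            mapping[token] = name
    return text, mapping

def restore(text, mapping):
    """Put the original names back in place of their placeholders."""
    for token, name in mapping.items():
        text = text.replace(token, name)
    return text

def translate_with_names(text, translate, names=KNOWN_NAMES):
    """Translate `text` while leaving the tagged proper names unchanged."""
    masked, mapping = protect(text, names)
    return restore(translate(masked), mapping)

def toy_translate(text):
    """Stand-in translator that wrongly translates the name word by word."""
    return text.replace("Pilar del Castillo", "gazteluko pilarea")

print(toy_translate("Pilar del Castillo"))                         # name is mangled
print(translate_with_names("Pilar del Castillo", toy_translate))   # name is preserved
```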

These resources can be consulted in two ways: off-line or on-line. Off-line, a lot of information can be gathered and organised before translation starts, and it can then be used easily during translation. In an on-line consultation, on the other hand, the program can turn to the Internet while translating and look up how to translate a word or proper name that it does not recognise.
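The two modes of consultation can be combined, as in the following minimal sketch. It assumes a table prepared in advance (off-line) and an optional `online_lookup` function that is called only when the table has no entry; both names and the sample entries are hypothetical, and no real web service is contacted.

```python
# Hypothetical table gathered and organised before translation starts (off-line).
OFFLINE_TABLE = {"baso": "bosque"}

def lookup(word, offline, online_lookup=None):
    """Return a translation from the off-line table, falling back to on-line lookup."""
    if word in offline:
        return offline[word]
    if online_lookup is not None:
        result = online_lookup(word)
        if result is not None:
            offline[word] = result  # remember the answer for next time
            return result
    return word  # nothing found: leave the word untranslated

def fake_online_lookup(word):
    """Stand-in for an on-line resource; a real system would query the Internet."""
    return {"etxe": "casa"}.get(word)

table = dict(OFFLINE_TABLE)
print(lookup("baso", table))                      # found off-line
print(lookup("etxe", table, fake_online_lookup))  # found on-line, then cached
print(lookup("xyz", table, fake_online_lookup))   # not found anywhere
```

Caching on-line answers in the off-line table is a design choice made for this sketch: it keeps repeated words from triggering repeated look-ups.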

From November onwards, therefore, the IXA Group will study three main strands in the QTLeap project (treebanks, Internet resources and proper names) in search of new solutions to the three main problems dogging current Basque machine translators: morphological and grammatical differences, meaning disambiguation, and a limited corpus of translated texts.

The long journey of machine translation

In the 1950s people started working on ideas to develop machine translation. Since then, many approaches have been put forward with the aim of achieving a successful system. The UPV/EHU's IXA Group has been working hard on these tools over the last fifteen years in collaboration with Elhuyar.

Recent years have seen a flowering of machine translation tools on the Internet, including ones that anyone can download and use. One of them is Opentrad, an open-source machine translation system available in all of Spain's official languages: Spanish, Catalan, Galician and Basque. Opentrad is underpinned by two technologies: Matxin (Spanish-Basque), developed by the UPV/EHU's IXA Group, and Apertium (Spanish-Catalan-Galician, etc.).

###

