Automatic evaluation metrics and QE measures both need to be validated through what is called meta-evaluation, in order to establish their reliability and identify their strengths and weaknesses. Meta-evaluation has become one of the main themes in the various open MT evaluation campaigns. The reliability of an evaluation metric depends on its consistency with human judgments, i.e., the correlation of its evaluation results with manual assessment, as measured by correlation coefficients. Commonly used correlation coefficients for this purpose include Pearson's r (Pearson 1900: 157–175), Spearman's ρ (Spearman 1904: 72–101) and Kendall's τ (Kendall 1938: 81–93). The strength of the correlation between evaluation scores and human judgments serves as the most important indicator of a metric's performance.
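The correlation computation itself is straightforward. The following sketch, using invented metric and human scores purely for illustration, shows how the three coefficients can be computed with SciPy:

```python
# Minimal sketch of metric meta-evaluation at the system level.
# The score lists are hypothetical: one automatic metric score and one
# averaged human adequacy score per MT system.
from scipy.stats import pearsonr, spearmanr, kendalltau

metric_scores = [0.31, 0.28, 0.42, 0.35, 0.22]
human_scores  = [3.1, 2.9, 4.0, 3.6, 2.4]

r,   _ = pearsonr(metric_scores, human_scores)    # linear correlation
rho, _ = spearmanr(metric_scores, human_scores)   # rank correlation
tau, _ = kendalltau(metric_scores, human_scores)  # pairwise-ordering agreement

print(f"Pearson's r:    {r:.3f}")
print(f"Spearman's rho: {rho:.3f}")
print(f"Kendall's tau:  {tau:.3f}")
```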
Using the correlation with human judgment as the objective function, the parameters of an evaluation metric can be optimized. Two parameters that have been studied extensively are the amount of test data and the number of reference translations needed to rank MT systems reliably. Several experiments (Coughlin 2003: 63–70; Estrella et al. 2007: 167–174; Zhang and Vogel 2004: 85–94) support the conclusion that a minimum of 250 sentences is required for texts of the same domain, and 500 for texts of different domains. As there are many ways to translate a sentence, relying on a single reference may miss many other acceptable translations. Multiple references are therefore recommended to ensure adequate coverage of translation options. Finch et al. (2004: 2019–2022) find that the correlation of a metric with human judgment usually rises with the number of references in use and levels off at four; no significant gain is obtained from additional references. Furthermore, Coughlin (2003: 63–70) shows that even a single reference can yield a reliable evaluation result if the test set is large enough, i.e., 500 sentences or more, or if the text domain is highly technical, e.g., the computer domain.
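This kind of experiment can be sketched as a small meta-evaluation loop. The following Python code, assuming NLTK and SciPy are available and using hypothetical data structures (one tokenized hypothesis list per system, several tokenized references per segment, and one averaged human score per system), re-scores each system with an increasing number of references and reports the correlation with human judgment:

```python
# Sketch of the reference-count experiment: score every system with the
# first k references (k = 1..max_refs) and check how the correlation of
# BLEU with human scores changes. All inputs are hypothetical.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from scipy.stats import pearsonr

smooth = SmoothingFunction().method1

def meta_evaluate(systems, references, human_scores, max_refs=4):
    """systems:      list of systems, each a list of tokenized hypotheses
       references:   per segment, a list of tokenized reference translations
       human_scores: one averaged human judgment per system"""
    for k in range(1, max_refs + 1):
        refs_k = [refs[:k] for refs in references]   # keep first k references
        bleu_scores = [
            corpus_bleu(refs_k, hypotheses, smoothing_function=smooth)
            for hypotheses in systems
        ]
        r, _ = pearsonr(bleu_scores, human_scores)
        print(f"{k} reference(s): Pearson's r = {r:.3f}")
```

Plotting the reported correlations against the number of references would be expected to reproduce the pattern described by Finch et al. (2004), with the curve flattening once roughly four references are in use.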
Nevertheless, the reliability of evaluation metrics remains a highly disputed issue. Although the results of automatic metrics correlate well with human judgments in most cases, discordant cases remain. For instance, Culy and Riehemann (2003: 1–8) show that BLEU performs poorly at ranking MT output and human translation of literary texts, with some MT outputs even erroneously outscoring professional human translations. Callison-Burch et al. (2006: 249–256) likewise cite a 2005 NIST MT evaluation exercise in which a system ranked at the top in human evaluation was ranked only sixth by BLEU. Thurmair (2005) attributes the unreliable performance of evaluation metrics, especially BLEU, to the way they score translation quality: since most metrics rely heavily on word matching against reference translations, a direct word-for-word translation is likely to receive a high score, whereas a free translation fares badly (see the sketch after the list below). Babych et al. (2005: 412–418) argue that the evaluation metrics currently in use cannot give a 'universal' prediction of human perception of translation quality, and that their predictive power is 'local' to a particular language or text type. The Metrics for Machine Translation Challenge, which aims at formally evaluating existing automatic MT evaluation technology, reports the following views on the shortcomings of current metrics (NIST 2010):
• They have not yet been proved able to consistently predict the usefulness, adequacy, and reliability of MT technologies.
• They have not been demonstrated to be as meaningful for target languages other than English as they are for English.
• More insight is needed into what properties of a translation should be evaluated and into how to evaluate those properties.
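The word-matching bias noted by Thurmair (2005) above is easy to reproduce with a single-reference BLEU score. In the following sketch the sentences are invented and NLTK's sentence-level BLEU with smoothing is used: a near-literal hypothesis that echoes the reference scores close to 1, while a fluent but freely worded hypothesis scores close to 0, regardless of how a human might judge it.

```python
# Illustration of the word-matching bias of n-gram metrics such as BLEU.
# The reference and the two hypotheses are invented for this example.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
reference = [["the", "negotiations", "were", "broken", "off", "yesterday"]]

literal = ["the", "negotiations", "were", "broken", "off", "yesterday"]
free    = ["talks", "collapsed", "the", "day", "before"]

print(sentence_bleu(reference, literal, smoothing_function=smooth))  # ~1.0
print(sentence_bleu(reference, free,    smoothing_function=smooth))  # ~0.0
```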
Currently, MT evaluation results based on automatic metrics are mainly used for ranking systems; they provide no further useful information about the quality of a particular piece of translation. Human evaluation therefore remains indispensable whenever an in-depth and informative analysis is needed.