Automatic evaluation metrics and QE measures both need to be validated through what is called meta-evaluation, in order to establish their reliability and identify their strengths and weaknesses. Meta-evaluation has become one of the main themes in the various open MT evaluation campaigns. The reliability of an evaluation metric depends on its consistency with human judgments, i.e., the correlation of its evaluation results with manual assessment, as measured by correlation coefficients. Commonly used correlation coefficients for this purpose include Pearson's r (Pearson 1900: 157–175), Spearman's ρ (Spearman 1904: 72–101) and Kendall's τ (Kendall 1938: 81–93). The strength of the correlation between evaluation scores and human judgments serves as the most important indicator of a metric's performance.
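The correlation computation itself is straightforward. The following sketch, using invented metric and human scores purely for illustration, shows how the three coefficients can be computed with SciPy:

```python
# Minimal sketch of metric meta-evaluation at the system level.
# The score lists are hypothetical: one automatic metric score and one
# averaged human adequacy score per MT system.
from scipy.stats import pearsonr, spearmanr, kendalltau

metric_scores = [0.31, 0.28, 0.42, 0.35, 0.22]
human_scores  = [3.1, 2.9, 4.0, 3.6, 2.4]

r,   _ = pearsonr(metric_scores, human_scores)    # linear correlation
rho, _ = spearmanr(metric_scores, human_scores)   # rank correlation
tau, _ = kendalltau(metric_scores, human_scores)  # pairwise-ordering agreement

print(f"Pearson's r:    {r:.3f}")
print(f"Spearman's rho: {rho:.3f}")
print(f"Kendall's tau:  {tau:.3f}")
```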
Using the correlation with human judgment as the objective function, the parameters of an evaluation metric can be optimized. Two parameters that have been studied extensively are the amount of test data and the number of reference translations needed to rank MT systems reliably. Several experiments (Coughlin 2003: 63–70; Estrella et al. 2007: 167–174; Zhang and Vogel 2004: 85–94) support the conclusion that a minimum of 250 sentences is required for texts of the same domain, and 500 for texts of different domains. As there are many ways to translate a sentence, relying on a single reference may miss many other acceptable translations. Multiple references are therefore recommended to ensure adequate coverage of translation options. Finch et al. (2004: 2019–2022) find that the correlation of a metric with human judgment usually rises with the number of references in use and levels off at four; no significant gain is obtained from additional references. Furthermore, Coughlin (2003: 63–70) shows that even a single reference can yield a reliable evaluation result if the test set is large enough, i.e., 500 sentences or more, or if the text domain is highly technical, e.g., the computer domain.
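This kind of experiment can be sketched as a small meta-evaluation loop. The following Python code, assuming NLTK and SciPy are available and using hypothetical data structures (one tokenized hypothesis list per system, several tokenized references per segment, and one averaged human score per system), re-scores each system with an increasing number of references and reports the correlation with human judgment:

```python
# Sketch of the reference-count experiment: score every system with the
# first k references (k = 1..max_refs) and check how the correlation of
# BLEU with human scores changes. All inputs are hypothetical.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from scipy.stats import pearsonr

smooth = SmoothingFunction().method1

def meta_evaluate(systems, references, human_scores, max_refs=4):
    """systems:      list of systems, each a list of tokenized hypotheses
       references:   per segment, a list of tokenized reference translations
       human_scores: one averaged human judgment per system"""
    for k in range(1, max_refs + 1):
        refs_k = [refs[:k] for refs in references]   # keep first k references
        bleu_scores = [
            corpus_bleu(refs_k, hypotheses, smoothing_function=smooth)
            for hypotheses in systems
        ]
        r, _ = pearsonr(bleu_scores, human_scores)
        print(f"{k} reference(s): Pearson's r = {r:.3f}")
```

Plotting the reported correlations against the number of references would be expected to reproduce the pattern described by Finch et al. (2004), with the curve flattening once roughly four references are in use.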
Nevertheless, the reliability of evaluation metrics remains a highly disputed issue. Although the results of automatic metrics correlate well with human judgments in most cases, discordant cases remain. For instance, Culy and Riehemann (2003: 1–8) show that BLEU performs poorly at ranking MT output and human translation of literary texts, with some MT outputs even erroneously outscoring professional human translations. Callison-Burch et al. (2006: 249–256) likewise cite a 2005 NIST MT evaluation exercise in which a system ranked at the top in human evaluation was ranked only sixth by BLEU. Thurmair (2005) attributes the unreliable performance of evaluation metrics, especially BLEU, to the way they score translation quality: since most metrics rely heavily on word matching against reference translations, a direct word-for-word translation is likely to receive a high score, whereas a free translation fares badly (see the sketch after the list below). Babych et al. (2005: 412–418) argue that the evaluation metrics currently in use cannot give a 'universal' prediction of human perception of translation quality, and that their predictive power is 'local' to a particular language or text type. The Metrics for Machine Translation Challenge, which aims at formally evaluating existing automatic MT evaluation technology, reports the following views on the shortcomings of current metrics (NIST 2010):
• They have not yet been proved able to consistently predict the usefulness, adequacy, and reliability of MT technologies.
• They have not been demonstrated to be as meaningful for target languages other than English as they are for English.
• More insight is needed into what properties of a translation should be evaluated and into how to evaluate those properties.
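The word-matching bias noted by Thurmair (2005) above is easy to reproduce with a single-reference BLEU score. In the following sketch the sentences are invented and NLTK's sentence-level BLEU with smoothing is used: a near-literal hypothesis that echoes the reference scores close to 1, while a fluent but freely worded hypothesis scores close to 0, regardless of how a human might judge it.

```python
# Illustration of the word-matching bias of n-gram metrics such as BLEU.
# The reference and the two hypotheses are invented for this example.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
reference = [["the", "negotiations", "were", "broken", "off", "yesterday"]]

literal = ["the", "negotiations", "were", "broken", "off", "yesterday"]
free    = ["talks", "collapsed", "the", "day", "before"]

print(sentence_bleu(reference, literal, smoothing_function=smooth))  # ~1.0
print(sentence_bleu(reference, free,    smoothing_function=smooth))  # ~0.0
```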
Currently, MT evaluation results based on automatic metrics are mainly used for ranking systems; they provide no further useful information about the quality of a particular piece of translation. Human evaluation therefore remains indispensable whenever an in-depth and informative analysis is needed.