Automatic Machine Translation Evaluation - COMET Explained
Motivation
While advances in deep learning have dramatically improved machine translation quality, the evaluation of machine translation models has seen comparatively little development. The most widely used metrics, such as BLEU [Papineni et al., 2002] and METEOR [Lavie and Denkowski, 2009], simply match n-grams between the hypothesis text and the reference text. This approach is too rigid: it ignores the natural variation among valid reference translations, fails to differentiate today's highest-performing machine translation systems, and correlates poorly with human judgment at the level of individual segments.
Background
Recently, model-based evaluation metrics have been proposed. Some metrics, such as METEOR-VECTOR [Servan et al., 2016], BLEU2VEC [Tättar and Fishel, 2017], YISI-1 [Lo, 2019], MOVERSCORE [Zhao et al., 2019], and BERTSCORE [Zhang et al., 2020], build on pre-trained embeddings such as word2vec and BERT to capture semantic-level similarity between hypotheses and references. Others, like BLEURT [Sellam et al., 2020], aim to directly optimize the correlation with human judgment by training a machine learning model to learn an appropriate metric. There is also a class of evaluation metrics called Quality Estimation that does not require a reference text; COMET is built upon this category of metrics.
Method
COMET predicts human evaluation scores using a cross-lingual pre-trained language model (e.g. multilingual BERT [Devlin et al., 2019], XLM [Conneau and Lample, 2019], or XLM-RoBERTa [Conneau et al., 2019]), which lets it make use of both the reference and the source sentence. The authors propose two architectures for COMET: an Estimator model, which directly regresses a human evaluation score (e.g. DA, MQM, or HTER), and a Translation Ranking model, which is trained to rank hypotheses according to human evaluation.
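To make the encoding step concrete, below is a minimal sketch (not the authors' code) of how a cross-lingual encoder such as XLM-RoBERTa can turn the source, hypothesis, and reference sentences into fixed-size embeddings. The mean-pooling over tokens here is an assumption for illustration; COMET itself learns how to pool over tokens and encoder layers.

```python
# Sketch: sentence embeddings from a cross-lingual encoder (assumed mean pooling).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def embed(sentence: str) -> torch.Tensor:
    """Encode one sentence and mean-pool the last hidden states into a vector."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)       # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1) # (1, dim)

src = embed("Der Hund bellt.")       # source sentence
hyp = embed("The dog is barking.")   # MT hypothesis
ref = embed("The dog barks.")        # human reference
```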
First, a sentence embedding is produced for each input with the pre-trained language model (as in the sketch above). For the Estimator model, the embeddings of the source, hypothesis, and reference sentences are combined and a feed-forward layer regresses the score. For the Translation Ranking model, two hypotheses are paired per source/reference such that one hypothesis is ranked higher than the other by human evaluation; a triplet margin loss is then applied twice, once with the source as anchor and once with the reference as anchor, and the two losses are summed (see the sketch below). The results show that the Translation Ranking model outperforms the Estimator model and many state-of-the-art baseline metrics in terms of correlation with human evaluation.
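The following PyTorch sketch illustrates both heads under stated assumptions: the exact feature combination, hidden size, and margin are illustrative choices based on the paper's description, not a drop-in reimplementation of COMET.

```python
# Sketch of the two COMET heads: an Estimator regressor and a ranking loss.
import torch
import torch.nn as nn

DIM = 768  # embedding size of the encoder sketched above (assumption)

class Estimator(nn.Module):
    """Combine source/hypothesis/reference embeddings and regress a quality score."""
    def __init__(self, dim: int = DIM):
        super().__init__()
        # 6 * dim: hyp, ref, two element-wise products, two absolute differences
        # (one plausible combination; layer sizes are assumptions).
        self.regressor = nn.Sequential(
            nn.Linear(6 * dim, 1024), nn.Tanh(), nn.Linear(1024, 1)
        )

    def forward(self, src, hyp, ref):
        features = torch.cat(
            [hyp, ref, hyp * src, hyp * ref, (hyp - src).abs(), (hyp - ref).abs()],
            dim=-1,
        )
        return self.regressor(features)  # predicted score (e.g. DA)

# Translation Ranking: the better hypothesis should be closer to both the source
# and the reference than the worse one, by at least a margin.
triplet = nn.TripletMarginLoss(margin=1.0)

def ranking_loss(src, ref, better_hyp, worse_hyp):
    return triplet(src, better_hyp, worse_hyp) + triplet(ref, better_hyp, worse_hyp)
```

At inference time the ranking model has no regression head, so a quality score is typically derived from the embedding distances between the hypothesis and the source/reference anchors.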