
Alibaba-Translate China’s Submission for
WMT 2022 Metrics Shared Task
Yu Wan1,2∗  Keqin Bao1,3∗  Dayiheng Liu1  Baosong Yang1  Derek F. Wong2
Lidia S. Chao2  Wenqiang Lei4  Jun Xie1
1DAMO Academy, Alibaba Group  2NLP2CT Lab, University of Macau
3University of Science and Technology of China  4National University of Singapore
nlp2ct.ywan@gmail.com  baokq@mail.ustc.edu.cn
{liudayiheng.ldyh,yangbaosong.ybs,qingjing.xj}@alibaba-inc.com
{derekfw,lidiasc}@um.edu.mo  wenqianglei@gmail.com
∗Equal contribution. Work was done when Yu Wan and Keqin Bao were interning at DAMO Academy, Alibaba Group.
Abstract
In this report, we present our submission to the WMT 2022 Metrics Shared Task. We build our system based on the core idea of UNITE (Unified Translation Evaluation), which unifies source-only, reference-only, and source-reference-combined evaluation scenarios into one single model. Specifically, during the model pre-training phase, we first apply pseudo-labeled data examples to continuously pre-train UNITE. Notably, to reduce the gap between pre-training and fine-tuning, we use data cropping and a ranking-based score normalization strategy. During the fine-tuning phase, we use both Direct Assessment (DA) and Multidimensional Quality Metrics (MQM) data from past years' WMT competitions. Finally, we collect the results from models with different pre-trained language model backbones, and use different ensembling strategies for the involved translation directions.
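The strategies above are detailed later in the report. As a rough, hypothetical illustration, one plausible form of ranking-based score normalization replaces each raw pseudo-label with its empirical percentile, discarding the noisy absolute scale of the pseudo scores while keeping their ordering (a minimal sketch under that assumption, not necessarily the submission's exact formulation):

```python
import numpy as np

def rank_normalize(scores: np.ndarray) -> np.ndarray:
    """Map raw pseudo-labels to rank-based scores in [0, 1].

    Hypothetical illustration: each example's score becomes its
    empirical percentile among all pseudo-labeled examples, so only
    the relative ordering of the pseudo scores is kept.
    """
    ranks = scores.argsort().argsort()       # rank of each score (0 = lowest)
    return ranks / max(len(scores) - 1, 1)   # scale ranks to [0, 1]

# Example: raw metric scores for three pseudo-labeled pairs
print(rank_normalize(np.array([0.3, 2.5, 1.1])))  # -> [0.  1.  0.5]
```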
1 Introduction
Translation metrics aim at delivering accurate and convincing predictions to identify the translation quality of outputs, given access to one or many gold-standard reference translations (Ma et al., 2018, 2019; Mathur et al., 2020; Freitag et al., 2021b). With the development of neural machine translation research (Vaswani et al., 2017; Wei et al., 2022), metric methods should be capable of evaluating high-quality translations at the level of semantics rather than surface-level features (Sellam et al., 2020; Ranasinghe et al., 2020; Rei et al., 2020; Wan et al., 2022a). In this paper, we describe Alibaba Translate China's submissions to the WMT 2022 Metrics Shared Task, which aim to deliver a more adequate evaluation solution at the level of semantics.
Pre-trained language models (PLMs) like BERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) have shown promising results in identifying the quality of translation outputs. Compared to conventional statistical methods (e.g., BLEU, Papineni et al., 2002) and representation-based methods (e.g., BERTSCORE, Zhang et al., 2020), model-based approaches (e.g., BLEURT, Sellam et al., 2020; COMET, Rei et al., 2020; UNITE, Wan et al., 2022a) show a strong ability to deliver more accurate quality predictions, especially those approaches which apply source sentences as additional input to the metric model (Rei et al., 2020; Takahashi et al., 2020; Wan et al., 2021, 2022a). Specifically, those metric models are designed as a combination of a PLM and a feedforward network, where the former is in charge of deriving representations of the input sequence, and the latter predicts the translation quality based on those representations. The metric model, which is trained on synthetic or human annotations with a regression objective, learns to mimic human predictions to identify the translation quality of the hypothesis sentence.
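As a minimal sketch of this architecture (assuming an XLM-R backbone, CLS-token pooling, and a two-layer regression head; these choices are illustrative, and the exact configurations of UNITE are given in the cited papers):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MetricModel(nn.Module):
    """PLM encoder plus a feedforward head that regresses a quality score."""

    def __init__(self, backbone: str = "xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        # Feedforward regressor on top of the sentence representation.
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, **inputs) -> torch.Tensor:
        # Use the first (CLS-style) token as the sequence representation.
        rep = self.encoder(**inputs).last_hidden_state[:, 0]
        return self.head(rep).squeeze(-1)  # one scalar score per example

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = MetricModel()
batch = tokenizer(["Das ist gut."], ["This is good."],
                  return_tensors="pt", padding=True, truncation=True)
score = model(**batch)  # trained with, e.g., MSE against human ratings
```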
Although those model-based metrics have shown promising results in modern applications and translation quality estimation, they still have the following shortcomings. First, each of them often handles only one specific evaluation scenario, e.g., COMET only serves source-reference-combined evaluation, where the source and the reference sentence must be concurrently fed to the model for prediction. For the other evaluation scenarios, such models hardly give accurate predictions, revealing the limitations of metric models caused by the disagreement between training and inference.
Besides, recent studies have investigated the feasibility of unifying those evaluation scenarios into one single model, which can further improve the correlation with human ratings in any scenario among source-only, reference-only, and source-reference-combined evaluation (Wan et al., 2021, 2022a). This indicates that training with multiple input formats, rather than a single one, can deliver more appropriate predictions for translation evaluation.
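To illustrate the unified design (a minimal sketch: the segment ordering and the use of a plain space to join source and reference are simplifying assumptions, not UNITE's exact preprocessing), the three scenarios differ only in which gold-side segments are concatenated with the hypothesis:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def build_input(hyp: str, src: str = None, ref: str = None):
    """Build one input for source-only, reference-only, or
    source-reference-combined evaluation with a single model."""
    # Pair the hypothesis with whichever gold-side segments exist.
    context = " ".join(s for s in (src, ref) if s is not None)
    return tokenizer(hyp, context, return_tensors="pt", truncation=True)

hyp = "This is good."
build_input(hyp, src="Das ist gut.")                       # source-only
build_input(hyp, ref="That is good.")                      # reference-only
build_input(hyp, src="Das ist gut.", ref="That is good.")  # combined
```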