the translation quality scores following a source-only setting. As for the training data, we collect synthetic examples as supervision for continuous pre-training and apply a dataset pruning strategy to improve the quality of the training set. In addition, when fine-tuning our QE model,
we use all available Direct Assessment (DA; Bojar et al., 2017; Ma et al., 2018, 2019; Mathur et al., 2020) and Multidimensional Quality Metrics (MQM; Freitag et al., 2021a,b) datasets from previous WMT competitions to further improve the
performance of our model. Besides, regarding the PLM applied in our UNITE models, we find that for the English-Russian (En-Ru) and Chinese-English (Zh-En) directions, a PLM enhanced with cross-lingual alignments (INFOXLM; Chi et al., 2021) delivers better results than a conventional one (XLM-R; Conneau et al., 2020). Moreover, for each subtask
including English to German (En-De), En-Ru, Zh-En, and the multilingual direction, we build an ensemble QE system to derive more accurate and reliable final predictions.
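The score-level ensembling described above can be sketched as follows; the uniform averaging scheme and the example scores are illustrative assumptions, not the exact recipe of our submission:

```python
# Minimal sketch of score-level ensembling for segment-level QE:
# average each segment's score across member QE models.
# (Uniform averaging is an assumption for illustration.)
from statistics import mean

def ensemble_scores(per_model_scores):
    """Average segment-level scores across member models.

    per_model_scores: one list of scores per model, all lists
    aligned on the same segments.
    """
    # All member models must score the same number of segments.
    assert len({len(s) for s in per_model_scores}) == 1
    return [mean(col) for col in zip(*per_model_scores)]

# Two hypothetical member models scoring three segments:
fused = ensemble_scores([[0.2, 0.5, 0.9], [0.4, 0.5, 0.7]])
```

In practice, member scores are often z-normalized per model before averaging so that models with different score scales contribute comparably.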
Our models perform strongly in all translation directions. When considering only the primary metric, Spearman's correlation, we place 2nd, 3rd, and 3rd in the En-Ru, Zh-En, and multilingual directions, respectively. More notably, when taking all metrics into account, despite slightly lower Spearman's correlations, our systems show better overall performance than other systems, achieving 1st place in the En-Ru and multilingual directions, and 2nd in the En-De and Zh-En directions.
2 Method
As outlined in §1, we apply the UNITE framework (Wan et al., 2022a) to obtain QE models. We unify three types of input formats (i.e., source-only, reference-only, and source-reference-combined) into a single model during training, while during inference, we use only the source-only paradigm to collect evaluation scores. In this section, we introduce the applied model architecture (§2.1), the synthetic data construction method (§2.2), and the model training strategy (§2.3).
2.1 Model architecture
Input Format Following Wan et al. (2022a), we design our QE model to be capable of processing source-only, reference-only, and source-reference-combined evaluation scenarios. Consequently, for consistency of training across all input formats, we construct the input sequences for the source-only, reference-only, and source-reference-combined formats as follows:
$x_{\mathrm{SRC}} = \langle s \rangle \, h \, \langle /s \rangle \, \langle /s \rangle \, s \, \langle /s \rangle$, (1)
$x_{\mathrm{REF}} = \langle s \rangle \, h \, \langle /s \rangle \, \langle /s \rangle \, r \, \langle /s \rangle$, (2)
$x_{\mathrm{SRC+REF}} = \langle s \rangle \, h \, \langle /s \rangle \, \langle /s \rangle \, s \, \langle /s \rangle \, \langle /s \rangle \, r \, \langle /s \rangle$, (3)
where $h$, $s$, and $r$ represent the hypothesis, source, and reference sentences, respectively. During the pre-training phase, we apply all input formats to enhance the performance of our QE models. Notably, we only use the source-only format when fine-tuning on this year's dev set and when running inference on the test set.
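The three input formats of Eqs. (1)–(3) can be sketched as plain string concatenation with XLM-R-style special tokens; in a real system the PLM tokenizer builds these sequences, and the token strings and example sentences below are illustrative assumptions:

```python
# Sketch of the three UniTE-style input formats, using XLM-R-style
# special tokens <s> (BOS) and </s> (EOS/separator).
BOS, EOS = "<s>", "</s>"

def build_input(h, s=None, r=None):
    """Concatenate hypothesis h with optional source s and/or reference r,
    following the layout of Eqs. (1)-(3): <s> h </s> [</s> s </s>] [</s> r </s>]."""
    parts = [BOS, h, EOS]
    for segment in (s, r):
        if segment is not None:
            parts += [EOS, segment, EOS]
    return " ".join(parts)

# The three evaluation scenarios for a hypothetical De->En example:
x_src = build_input("guten Tag", s="good morning")                    # Eq. (1)
x_ref = build_input("guten Tag", r="guten Morgen")                    # Eq. (2)
x_both = build_input("guten Tag", s="good morning", r="guten Morgen")  # Eq. (3)
```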
Model Backbone Selection The core of quality estimation is to evaluate the translation quality of the output given only the source sentence. As the source and hypothesis sentences come from different languages, evaluating translation quality requires multilingual processing ability. Furthermore, we believe that PLMs which possess cross-lingual semantic alignments can ease the learning of translation quality evaluation.
Existing methods (Ranasinghe et al., 2020; Rei et al., 2020; Sellam et al., 2020; Wan et al., 2022a) often apply XLM-R (Conneau et al., 2020) as the backbone of evaluation models for better multilingual support. To test whether cross-lingual alignments can help evaluation model training, we further apply INFOXLM (Chi et al., 2021), which enhances the XLM-R model with cross-lingual alignments, as the backbone of our evaluation models.
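The per-direction backbone choice can be sketched as a simple lookup. The checkpoint identifiers are common public Hugging Face names given as assumptions, and falling back to XLM-R for the remaining directions is an illustrative default rather than a confirmed detail:

```python
# Hedged sketch of per-direction backbone selection, reflecting the
# finding that a cross-lingually aligned PLM (InfoXLM) helps the
# En-Ru and Zh-En directions. Checkpoint names are assumptions.
def pick_backbone(direction):
    """Return a PLM checkpoint name for a language direction like 'en-ru'."""
    if direction in {"en-ru", "zh-en"}:
        return "microsoft/infoxlm-large"   # enhanced with cross-lingual alignments
    return "xlm-roberta-large"             # conventional multilingual PLM

backbone = pick_backbone("en-ru")
```

The chosen name can then be passed to, e.g., `AutoModel.from_pretrained` in the Hugging Face `transformers` library to load the encoder.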
Model Training Given a training dataset containing source, reference, and hypothesis sentences, we first split all examples equally into three parts, each of which serves only one input format during training. For each training example, after concatenating the required input sentences into one sequence and feeding it to the PLM, we collect the corresponding representations $\mathbf{H}_{\mathrm{REF}}$, $\mathbf{H}_{\mathrm{SRC}}$, and $\mathbf{H}_{\mathrm{SRC+REF}}$ for each input format, respectively. After that, we use the output embedding assigned to the CLS token, $h$, as the sequence representation. Finally, a feedforward network takes $h$ as input and outputs a scalar $p$ as a