Alibaba-Translate China’s Submission for
WMT 2022 Quality Estimation Shared Task
Keqin Bao¹,²  Yu Wan¹,³  Dayiheng Liu¹  Baosong Yang¹  Wenqiang Lei⁴
Xiangnan He²  Derek F. Wong³  Jun Xie¹
¹DAMO Academy, Alibaba Group  ²University of Science and Technology of China
³NLP2CT Lab, University of Macau  ⁴National University of Singapore
baokq@mail.ustc.edu.cn nlp2ct.ywan@gmail.com
{liudayiheng.ldyh,yangbaosong.ybs,qingjing.xj}@alibaba-inc.com
wenqianglei@gmail.com xiangnanhe@gmail.com derekfw@um.edu.mo
Abstract
In this paper, we present our submission to the sentence-level MQM benchmark at the WMT 2022 Quality Estimation Shared Task, named UNITE (Unified Translation Evaluation). Specifically, our systems employ the UNITE framework, which combines three types of input formats during training with a pre-trained language model. First, we apply pseudo-labeled data examples in the continued pre-training phase. Notably, to reduce the gap between pre-training and fine-tuning, we use data pruning and a ranking-based score normalization strategy. For the fine-tuning phase, we use both Direct Assessment (DA) and Multidimensional Quality Metrics (MQM) data from past years' WMT competitions. Finally, we collect source-only evaluation results and ensemble the predictions generated by two UNITE models, whose backbones are XLM-R and INFOXLM, respectively. Results show that our models reach the 1st overall ranking in the Multilingual and English-Russian settings, and the 2nd overall ranking in the English-German and Chinese-English settings, showing relatively strong performance in this year's quality estimation competition.
1 Introduction
Quality Estimation (QE) aims at evaluating machine translation without access to a gold-standard reference translation (Blatz et al., 2004; Specia et al., 2018). Different from other evaluation tasks (e.g., metrics), QE performs its evaluation with access to the source input only. As the performance of modern machine translation approaches increases (Vaswani et al., 2017; Lin et al., 2022; Wei et al., 2022; Zhang et al., 2022), QE systems should better quantify the agreement of cross-lingual semantics between the source sentence and the translation hypothesis. The QE evaluation paradigm shows great potential for real-world applications (Wang et al., 2021; Park et al., 2021; Specia et al., 2021). This paper describes Alibaba Translate China's submission to the sentence-level MQM benchmark at the WMT 2022 Quality Estimation Shared Task (Zerva et al., 2022).

* Equal contribution. Work was done when Keqin Bao and Yu Wan were interning at DAMO Academy, Alibaba Group.
In recent years, pre-trained language models (PLMs) have shown a strong ability to extract cross-lingual information (Conneau et al., 2020; Chi et al., 2021). To achieve a higher correlation with human ratings of translation quality, plenty of trainable model-based QE approaches have appeared, e.g., COMET-QE (Rei et al., 2020) and QEMIND (Wang et al., 2021). Both first derive embeddings for the source and hypothesis sentences with a given PLM, then predict the overall score from those embeddings with a feedforward network. Such model-based approaches have greatly facilitated the development of the QE community. However, they can only handle the source-only input format, neglecting the other two evaluation scenarios, i.e., reference-only and source-reference-combined evaluation. More importantly, training with multiple input formats can achieve a higher correlation with human assessments than individually training on a specific evaluation scenario (Wan et al., 2021, 2022a). These findings indicate that the QE and Metrics tasks share plenty of knowledge when identifying the quality of translated outputs, and that unifying the functionalities of the three evaluation scenarios into one model can enhance the performance of the evaluation model on each scenario.

As a consequence, when building a single model for the sentence-level QE task, we use the pipeline of UNITE (Wan et al., 2022a), which integrates source-only, reference-only, and source-reference-combined translation evaluation abilities into one single model. When collecting the system outputs for the WMT 2022 Quality Estimation Shared Task, we employ our UNITE models to predict
the translation quality scores following a source-only setting. As for the training data, we collect synthetic data examples as supervision for continued pre-training and apply a dataset pruning strategy to increase the translation quality of the training set. Also, when fine-tuning our QE model, we use all available Direct Assessment (DA; Bojar et al., 2017; Ma et al., 2018, 2019; Mathur et al., 2020) and Multidimensional Quality Metrics (MQM; Freitag et al., 2021a,b) datasets from previous WMT competitions to further improve the performance of our model. Besides, regarding the PLM applied in the UNITE models, we find that for the English-Russian (En-Ru) and Chinese-English (Zh-En) directions, a PLM enhanced with cross-lingual alignments (INFOXLM; Chi et al., 2021) delivers better results than a conventional one (XLM-R; Conneau et al., 2020). Moreover, for each subtask, including English-German (En-De), En-Ru, Zh-En, and the multilingual direction, we build an ensembled QE system to derive more accurate and convincing final predictions, as sketched below.
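As a rough illustration of this ensembling step, the following is a minimal sketch that averages the per-segment, source-only scores predicted by the two UNITE models. The uniform-averaging rule and the function name `ensemble_scores` are assumptions for illustration; the paper does not spell out the exact combination scheme in this section.

```python
# Hedged sketch: combine per-segment QE scores from the XLM-R-based
# and INFOXLM-based UNITE models. Uniform averaging is an assumption.
import numpy as np

def ensemble_scores(scores_xlmr: np.ndarray,
                    scores_infoxlm: np.ndarray) -> np.ndarray:
    """Average the predictions of the two backbones segment by segment."""
    return (scores_xlmr + scores_infoxlm) / 2.0

# Example with dummy per-segment scores:
final = ensemble_scores(np.array([0.71, -0.12]), np.array([0.65, -0.20]))
```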
Our models show impressive performance in all translation directions. When only considering the primary metric, Spearman's correlation, we take 2nd, 3rd, and 3rd place in the En-Ru, Zh-En, and multilingual directions, respectively. More notably, when taking all metrics into account, despite the slight decrease in Spearman's correlations, our systems show better overall performance than other systems, achieving 1st place in the En-Ru and multilingual settings, and 2nd place in the En-De and Zh-En directions.
2 Method
As outlined in §1, we apply the UNITE framework (Wan et al., 2022a) to obtain our QE models. We unify three types of input formats (i.e., source-only, reference-only, and source-reference-combined) into one single model during training, while during inference we only use the source-only paradigm to collect evaluation scores. In this section, we introduce the applied model architecture (§2.1), the synthetic data construction method (§2.2), and the model training strategy (§2.3).
2.1 Model architecture
Input Format  Following Wan et al. (2022a), we design our QE model to be capable of processing source-only, reference-only, and source-reference-combined evaluation scenarios. Consequently, for consistency of training across all input formats, we construct the input sequences for the source-only, reference-only, and source-reference-combined input formats as follows:
x_SRC     = <s> h </s> </s> s </s>,                 (1)
x_REF     = <s> h </s> </s> r </s>,                 (2)
x_SRC+REF = <s> h </s> </s> s </s> </s> r </s>,     (3)
where h, s, and r represent the hypothesis, source, and reference sentences, respectively. During the pre-training phase, we apply all input formats to enhance the performance of the QE models. Notably, we only use the source-only format when fine-tuning on this year's dev set and inferring on the test set. A minimal sketch of this input construction is given below.
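The sketch below builds the three sequences of Eqs. (1)-(3) with an XLM-R tokenizer, whose special tokens are exactly <s> and </s>. The helper `build_input` is a hypothetical name for illustration, not from the authors' released code; in practice, the tokenizer's text-pair encoding produces the same layout automatically.

```python
# Minimal sketch of UNITE-style input construction, assuming the
# XLM-R tokenizer (<s> / </s> special tokens).
from typing import Optional

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

def build_input(hypothesis: str,
                source: Optional[str] = None,
                reference: Optional[str] = None) -> str:
    """Concatenate the hypothesis with source and/or reference,
    mirroring Eqs. (1)-(3): <s> h </s> [</s> s </s>] [</s> r </s>]."""
    bos, eos = tokenizer.bos_token, tokenizer.eos_token  # "<s>", "</s>"
    parts = [bos, hypothesis, eos]
    for segment in (source, reference):
        if segment is not None:
            parts += [eos, segment, eos]
    return " ".join(parts)

# Source-only (QE) input, Eq. (1):
x_src = build_input("Das ist ein Test.", source="This is a test.")
# Reference-only input, Eq. (2):
x_ref = build_input("Das ist ein Test.", reference="Das ist ein Versuch.")
# Source-reference-combined input, Eq. (3):
x_both = build_input("Das ist ein Test.",
                     source="This is a test.",
                     reference="Das ist ein Versuch.")
```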
Model Backbone Selection  The core of quality estimation is evaluating the translation quality of an output given the source sentence. As the source and hypothesis sentences come from different languages, evaluating translation quality requires the ability of multilingual processing. Furthermore, we believe that PLMs which possess cross-lingual semantic alignments can ease the learning of translation quality evaluation.

Existing methods (Ranasinghe et al., 2020; Rei et al., 2020; Sellam et al., 2020; Wan et al., 2022a) often apply XLM-R (Conneau et al., 2020) as the backbone of evaluation models for better multilingual support. To test whether cross-lingual alignments can help the training of evaluation models, we further apply INFOXLM (Chi et al., 2021), which enhances the XLM-R model with cross-lingual alignments, as the backbone of our evaluation models. Both backbones can be loaded as in the sketch below.
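The following is a minimal sketch of loading the two candidate backbones via Hugging Face Transformers. The checkpoint names are assumptions based on the publicly available hubs for XLM-R and INFOXLM, not the authors' actual configuration.

```python
# Hedged sketch: load the two candidate PLM backbones.
from transformers import AutoModel, AutoTokenizer

# Conventional multilingual backbone:
xlmr = AutoModel.from_pretrained("xlm-roberta-large")

# Backbone enhanced with cross-lingual alignments (assumed checkpoint):
infoxlm = AutoModel.from_pretrained("microsoft/infoxlm-large")

# INFOXLM reuses the XLM-R vocabulary, so one tokenizer serves both.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
```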
Model Training  For a training dataset including source, reference, and hypothesis sentences, we first split all examples equally into three parts, each of which serves only one input format during training. For each training example, after concatenating the required input sentences into one sequence and feeding it to the PLM, we collect the corresponding representations H_SRC, H_REF, and H_SRC+REF for each input format, respectively. After that, we use the output embedding assigned to the CLS token, h, as the sequence representation. Finally, a feedforward network takes h as input and gives a scalar p as the predicted quality score, as in the sketch below.
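The following is a minimal sketch of this scoring head, assuming a PyTorch setup: the PLM encodes the concatenated sequence, the CLS embedding h is taken as the sequence representation, and a feedforward network maps h to a scalar p. The class name `UniteStyleScorer` and the hidden layer sizes are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the CLS-based scoring head described above.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class UniteStyleScorer(nn.Module):
    def __init__(self, backbone_name: str = "xlm-roberta-large"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone_name)
        hidden = self.encoder.config.hidden_size
        # Feedforward network mapping the CLS embedding h to a scalar p.
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, **inputs) -> torch.Tensor:
        hidden_states = self.encoder(**inputs).last_hidden_state
        h = hidden_states[:, 0]       # embedding of the CLS (<s>) token
        p = self.head(h).squeeze(-1)  # one scalar score per example
        return p

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = UniteStyleScorer()
# Text-pair encoding yields <s> hyp </s> </s> src </s>, matching Eq. (1):
batch = tokenizer("Das ist ein Test.", "This is a test.",
                  return_tensors="pt")
score = model(**batch)  # tensor of shape (1,)
```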