Alibaba-Translate China’s Submission for
WMT 2022 Quality Estimation Shared Task
Keqin Bao¹,²  Yu Wan¹,³  Dayiheng Liu¹  Baosong Yang¹  Wenqiang Lei⁴
Xiangnan He²  Derek F. Wong³  Jun Xie¹
¹DAMO Academy, Alibaba Group  ²University of Science and Technology of China
³NLP2CT Lab, University of Macau  ⁴National University of Singapore
baokq@mail.ustc.edu.cn nlp2ct.ywan@gmail.com
{liudayiheng.ldyh,yangbaosong.ybs,qingjing.xj}@alibaba-inc.com
wenqianglei@gmail.com xiangnanhe@gmail.com derekfw@um.edu.mo
Abstract
In this paper, we present our submission to the sentence-level MQM benchmark at the WMT 2022 Quality Estimation Shared Task, named UNITE (Unified Translation Evaluation). Specifically, our systems employ the UNITE framework, which combines three types of input formats during training with a pre-trained language model. First, we apply pseudo-labeled data examples in the continued pre-training phase. Notably, to reduce the gap between pre-training and fine-tuning, we use data pruning and a ranking-based score normalization strategy. For the fine-tuning phase, we use both Direct Assessment (DA) and Multidimensional Quality Metrics (MQM) data from past years' WMT competitions. Finally, we collect source-only evaluation results and ensemble the predictions generated by two UNITE models, whose backbones are XLM-R and INFOXLM, respectively. Results show that our models reach the 1st overall ranking in the Multilingual and English-Russian settings, and the 2nd overall ranking in the English-German and Chinese-English settings, showing relatively strong performance in this year's quality estimation competition.
1 Introduction
Quality Estimation (QE) aims at evaluating machine translation without access to a gold-standard reference translation (Blatz et al., 2004; Specia et al., 2018). Different from other evaluation tasks (e.g., metrics), QE performs its evaluation with access to the source input only. As the performance of modern machine translation approaches increases (Vaswani et al., 2017; Lin et al., 2022; Wei et al., 2022; Zhang et al., 2022), QE systems should better quantify the agreement of cross-lingual semantics between the source sentence and the translation hypothesis. The QE evaluation paradigm shows great potential for real-world applications (Wang et al., 2021; Park et al., 2021; Specia et al., 2021). This paper describes Alibaba Translate China's submission to the sentence-level MQM benchmark at the WMT 2022 Quality Estimation Shared Task (Zerva et al., 2022).

* Equal contribution. Work was done when Keqin Bao and Yu Wan were interning at DAMO Academy, Alibaba Group.
In recent years, pre-trained language models (PLMs) have shown a strong ability to extract cross-lingual information (Conneau et al., 2020; Chi et al., 2021). To achieve a higher correlation with human ratings of translation quality, plenty of trainable model-based QE approaches have appeared, e.g., COMET-QE (Rei et al., 2020) and QEMIND (Wang et al., 2021). Both first derive embeddings for the source and hypothesis sentences with a given PLM, then predict the overall score from those embeddings with a feedforward network. Such model-based approaches have greatly facilitated the development of the QE community. However, they can only handle the source-only input format, neglecting the other two evaluation scenarios, i.e., reference-only and source-reference-combined evaluation. More importantly, training with multiple input formats can achieve a higher correlation with human assessments than individually training on a specific evaluation scenario (Wan et al., 2021, 2022a). These findings indicate that the QE and Metrics tasks share plenty of knowledge when identifying the quality of translated outputs, and that unifying the functionalities of the three evaluation scenarios into one model can enhance the performance of the evaluation model on each scenario.

As a consequence, when building a single model for the sentence-level QE task, we use the pipeline of UNITE (Wan et al., 2022a), which integrates source-only, reference-only, and source-reference-combined translation evaluation abilities into one single model. When collecting the system outputs for the WMT 2022 Quality Estimation Shared Task, we employ our UNITE models to predict
the translation quality scores following a source-only setting. As for the training data, we collect synthetic data examples as supervision for continued pre-training and apply a dataset pruning strategy to increase the translation quality of the training set. Also, when fine-tuning our QE model, we use all available Direct Assessment (DA; Bojar et al., 2017; Ma et al., 2018, 2019; Mathur et al., 2020) and Multidimensional Quality Metrics (MQM; Freitag et al., 2021a,b) datasets from previous WMT competitions to further improve the performance of our model. Besides, regarding the PLM applied in the UNITE models, we find that for the English-Russian (En-Ru) and Chinese-English (Zh-En) directions, a PLM enhanced with cross-lingual alignments (INFOXLM; Chi et al., 2021) delivers better results than a conventional one (XLM-R; Conneau et al., 2020). Moreover, for each subtask, including English-German (En-De), En-Ru, Zh-En, and the multilingual direction, we build an ensembled QE system to derive more accurate and convincing final predictions, as sketched below.
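As a rough illustration of this ensembling step, the following is a minimal sketch that averages the per-segment, source-only scores predicted by the two UNITE models. The uniform-averaging rule and the function name `ensemble_scores` are assumptions for illustration; the paper does not spell out the exact combination scheme in this section.

```python
# Hedged sketch: combine per-segment QE scores from the XLM-R-based
# and INFOXLM-based UNITE models. Uniform averaging is an assumption.
import numpy as np

def ensemble_scores(scores_xlmr: np.ndarray,
                    scores_infoxlm: np.ndarray) -> np.ndarray:
    """Average the predictions of the two backbones segment by segment."""
    return (scores_xlmr + scores_infoxlm) / 2.0

# Example with dummy per-segment scores:
final = ensemble_scores(np.array([0.71, -0.12]), np.array([0.65, -0.20]))
```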
Our models show impressive performance in all translation directions. When only considering the primary metric, Spearman's correlation, we take 2nd, 3rd, and 3rd place in the En-Ru, Zh-En, and multilingual directions, respectively. More notably, when taking all metrics into account, despite the slight decrease in Spearman's correlations, our systems show better overall performance than other systems, achieving 1st place in the En-Ru and multilingual settings, and 2nd place in the En-De and Zh-En directions.
2 Method
As outlined in §1, we apply the UNITE framework (Wan et al., 2022a) to obtain our QE models. We unify three types of input formats (i.e., source-only, reference-only, and source-reference-combined) into one single model during training, while during inference we only use the source-only paradigm to collect evaluation scores. In this section, we introduce the applied model architecture (§2.1), the synthetic data construction method (§2.2), and the model training strategy (§2.3).
2.1 Model architecture
Input Format  Following Wan et al. (2022a), we design our QE model to be capable of processing source-only, reference-only, and source-reference-combined evaluation scenarios. Consequently, for consistency of training across all input formats, we construct the input sequences for the source-only, reference-only, and source-reference-combined input formats as follows:
x_SRC     = <s> h </s> </s> s </s>,                 (1)
x_REF     = <s> h </s> </s> r </s>,                 (2)
x_SRC+REF = <s> h </s> </s> s </s> </s> r </s>,     (3)
where h, s, and r represent the hypothesis, source, and reference sentences, respectively. During the pre-training phase, we apply all input formats to enhance the performance of the QE models. Notably, we only use the source-only format when fine-tuning on this year's dev set and inferring on the test set. A minimal sketch of this input construction is given below.
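The sketch below builds the three sequences of Eqs. (1)-(3) with an XLM-R tokenizer, whose special tokens are exactly <s> and </s>. The helper `build_input` is a hypothetical name for illustration, not from the authors' released code; in practice, the tokenizer's text-pair encoding produces the same layout automatically.

```python
# Minimal sketch of UNITE-style input construction, assuming the
# XLM-R tokenizer (<s> / </s> special tokens).
from typing import Optional

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

def build_input(hypothesis: str,
                source: Optional[str] = None,
                reference: Optional[str] = None) -> str:
    """Concatenate the hypothesis with source and/or reference,
    mirroring Eqs. (1)-(3): <s> h </s> [</s> s </s>] [</s> r </s>]."""
    bos, eos = tokenizer.bos_token, tokenizer.eos_token  # "<s>", "</s>"
    parts = [bos, hypothesis, eos]
    for segment in (source, reference):
        if segment is not None:
            parts += [eos, segment, eos]
    return " ".join(parts)

# Source-only (QE) input, Eq. (1):
x_src = build_input("Das ist ein Test.", source="This is a test.")
# Reference-only input, Eq. (2):
x_ref = build_input("Das ist ein Test.", reference="Das ist ein Versuch.")
# Source-reference-combined input, Eq. (3):
x_both = build_input("Das ist ein Test.",
                     source="This is a test.",
                     reference="Das ist ein Versuch.")
```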
Model Backbone Selection  The core of quality estimation is evaluating the translation quality of an output given the source sentence. As the source and hypothesis sentences come from different languages, evaluating translation quality requires the ability of multilingual processing. Furthermore, we believe that PLMs which possess cross-lingual semantic alignments can ease the learning of translation quality evaluation.

Existing methods (Ranasinghe et al., 2020; Rei et al., 2020; Sellam et al., 2020; Wan et al., 2022a) often apply XLM-R (Conneau et al., 2020) as the backbone of evaluation models for better multilingual support. To test whether cross-lingual alignments can help the training of evaluation models, we further apply INFOXLM (Chi et al., 2021), which enhances the XLM-R model with cross-lingual alignments, as the backbone of our evaluation models. Both backbones can be loaded as in the sketch below.
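The following is a minimal sketch of loading the two candidate backbones via Hugging Face Transformers. The checkpoint names are assumptions based on the publicly available hubs for XLM-R and INFOXLM, not the authors' actual configuration.

```python
# Hedged sketch: load the two candidate PLM backbones.
from transformers import AutoModel, AutoTokenizer

# Conventional multilingual backbone:
xlmr = AutoModel.from_pretrained("xlm-roberta-large")

# Backbone enhanced with cross-lingual alignments (assumed checkpoint):
infoxlm = AutoModel.from_pretrained("microsoft/infoxlm-large")

# INFOXLM reuses the XLM-R vocabulary, so one tokenizer serves both.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
```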
Model Training  For a training dataset including source, reference, and hypothesis sentences, we first split all examples equally into three parts, each of which serves only one input format during training. For each training example, after concatenating the required input sentences into one sequence and feeding it to the PLM, we collect the corresponding representations H_SRC, H_REF, and H_SRC+REF for each input format, respectively. After that, we use the output embedding assigned to the CLS token, h, as the sequence representation. Finally, a feedforward network takes h as input and gives a scalar p as the predicted quality score, as in the sketch below.
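The following is a minimal sketch of this scoring head, assuming a PyTorch setup: the PLM encodes the concatenated sequence, the CLS embedding h is taken as the sequence representation, and a feedforward network maps h to a scalar p. The class name `UniteStyleScorer` and the hidden layer sizes are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the CLS-based scoring head described above.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class UniteStyleScorer(nn.Module):
    def __init__(self, backbone_name: str = "xlm-roberta-large"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone_name)
        hidden = self.encoder.config.hidden_size
        # Feedforward network mapping the CLS embedding h to a scalar p.
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, **inputs) -> torch.Tensor:
        hidden_states = self.encoder(**inputs).last_hidden_state
        h = hidden_states[:, 0]       # embedding of the CLS (<s>) token
        p = self.head(h).squeeze(-1)  # one scalar score per example
        return p

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = UniteStyleScorer()
# Text-pair encoding yields <s> hyp </s> </s> src </s>, matching Eq. (1):
batch = tokenizer("Das ist ein Test.", "This is a test.",
                  return_tensors="pt")
score = model(**batch)  # tensor of shape (1,)
```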