
Alibaba-Translate China’s Submission for
WMT 2022 Metrics Shared Task
Yu Wan1,2∗  Keqin Bao1,3∗  Dayiheng Liu1  Baosong Yang1  Derek F. Wong2
Lidia S. Chao2  Wenqiang Lei4  Jun Xie1
1DAMO Academy, Alibaba Group  2NLP2CT Lab, University of Macau
3University of Science and Technology of China  4National University of Singapore
nlp2ct.ywan@gmail.com  baokq@mail.ustc.edu.cn
{liudayiheng.ldyh,yangbaosong.ybs,qingjing.xj}@alibaba-inc.com
{derekfw,lidiasc}@um.edu.mo  wenqianglei@gmail.com
∗Equal contribution. Work was done when Yu Wan and Keqin Bao were interning at DAMO Academy, Alibaba Group.
Abstract
In this report, we present our submission to the WMT 2022 Metrics Shared Task. We build our system based on the core idea of UNITE (Unified Translation Evaluation), which unifies source-only, reference-only, and source-reference-combined evaluation scenarios into one single model. Specifically, during the model pre-training phase, we first apply pseudo-labeled data examples to continuously pre-train UNITE. Notably, to reduce the gap between pre-training and fine-tuning, we use data cropping and a ranking-based score normalization strategy. During the fine-tuning phase, we use both Direct Assessment (DA) and Multidimensional Quality Metrics (MQM) data from past years' WMT competitions. Finally, we collect the results from models with different pre-trained language model backbones, and use different ensembling strategies for the involved translation directions.
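The strategies above are detailed later in the report. As a rough, hypothetical illustration, one plausible form of ranking-based score normalization replaces each raw pseudo-label with its empirical percentile, discarding the noisy absolute scale of the pseudo scores while keeping their ordering (a minimal sketch under that assumption, not necessarily the submission's exact formulation):

```python
import numpy as np

def rank_normalize(scores: np.ndarray) -> np.ndarray:
    """Map raw pseudo-labels to rank-based scores in [0, 1].

    Hypothetical illustration: each example's score becomes its
    empirical percentile among all pseudo-labeled examples, so only
    the relative ordering of the pseudo scores is kept.
    """
    ranks = scores.argsort().argsort()       # rank of each score (0 = lowest)
    return ranks / max(len(scores) - 1, 1)   # scale ranks to [0, 1]

# Example: raw metric scores for three pseudo-labeled pairs
print(rank_normalize(np.array([0.3, 2.5, 1.1])))  # -> [0.  1.  0.5]
```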
1 Introduction
Translation metrics aim at delivering accurate and convincing predictions to identify the translation quality of outputs, given access to one or many gold-standard reference translations (Ma et al., 2018, 2019; Mathur et al., 2020; Freitag et al., 2021b). With the development of neural machine translation research (Vaswani et al., 2017; Wei et al., 2022), metric methods should be capable of evaluating high-quality translations at the level of semantics rather than surface-level features (Sellam et al., 2020; Ranasinghe et al., 2020; Rei et al., 2020; Wan et al., 2022a). In this paper, we describe Alibaba Translate China's submissions to the WMT 2022 Metrics Shared Task, which aim to deliver a more adequate evaluation solution at the level of semantics.
Pre-trained language models (PLMs) like BERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) have shown promising results in identifying the quality of translation outputs. Compared to conventional statistical methods (e.g., BLEU, Papineni et al., 2002) and representation-based methods (e.g., BERTSCORE, Zhang et al., 2020), model-based approaches (e.g., BLEURT, Sellam et al., 2020; COMET, Rei et al., 2020; UNITE, Wan et al., 2022a) show a strong ability to deliver more accurate quality predictions, especially those approaches which apply source sentences as additional input to the metric model (Rei et al., 2020; Takahashi et al., 2020; Wan et al., 2021, 2022a). Specifically, those metric models are designed as a combination of a PLM and a feedforward network, where the former is in charge of deriving representations of the input sequence, and the latter predicts the translation quality based on those representations. The metric model, which is trained on synthetic or human annotations with a regression objective, learns to mimic human predictions to identify the translation quality of the hypothesis sentence.
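As a minimal sketch of this architecture (assuming an XLM-R backbone, CLS-token pooling, and a two-layer regression head; these choices are illustrative, and the exact configurations of UNITE are given in the cited papers):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MetricModel(nn.Module):
    """PLM encoder plus a feedforward head that regresses a quality score."""

    def __init__(self, backbone: str = "xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        # Feedforward regressor on top of the sentence representation.
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, **inputs) -> torch.Tensor:
        # Use the first (CLS-style) token as the sequence representation.
        rep = self.encoder(**inputs).last_hidden_state[:, 0]
        return self.head(rep).squeeze(-1)  # one scalar score per example

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = MetricModel()
batch = tokenizer(["Das ist gut."], ["This is good."],
                  return_tensors="pt", padding=True, truncation=True)
score = model(**batch)  # trained with, e.g., MSE against human ratings
```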
Although those model-based metrics have shown promising results in modern applications and translation quality estimation, they still have the following shortcomings. First, each of them often handles only one specific evaluation scenario, e.g., COMET only serves source-reference-combined evaluation, where the source and the reference sentence must be concurrently fed to the model for prediction. For the other evaluation scenarios, such models hardly give accurate predictions, revealing the limitations of metric models caused by the disagreement between training and inference.
Besides, recent studies have investigated the feasibility of unifying those evaluation scenarios into one single model, which can further improve the correlation with human ratings in any scenario among source-only, reference-only, and source-reference-combined evaluation (Wan et al., 2021, 2022a). This indicates that training with multiple input formats, rather than a single one, can deliver more appropriate predictions for translation evaluation.
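To illustrate the unified design (a minimal sketch: the segment ordering and the use of a plain space to join source and reference are simplifying assumptions, not UNITE's exact preprocessing), the three scenarios differ only in which gold-side segments are concatenated with the hypothesis:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def build_input(hyp: str, src: str = None, ref: str = None):
    """Build one input for source-only, reference-only, or
    source-reference-combined evaluation with a single model."""
    # Pair the hypothesis with whichever gold-side segments exist.
    context = " ".join(s for s in (src, ref) if s is not None)
    return tokenizer(hyp, context, return_tensors="pt", truncation=True)

hyp = "This is good."
build_input(hyp, src="Das ist gut.")                       # source-only
build_input(hyp, ref="That is good.")                      # reference-only
build_input(hyp, src="Das ist gut.", ref="That is good.")  # combined
```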