Alibaba-Translate China’s Submission for
WMT 2022 Metrics Shared Task
Yu Wan1,2∗  Keqin Bao1,3∗  Dayiheng Liu1  Baosong Yang1  Derek F. Wong2
Lidia S. Chao2  Wenqiang Lei4  Jun Xie1
1DAMO Academy, Alibaba Group 2NLP2CT Lab, University of Macau
3University of Science and Technology of China 4National University of Singapore
nlp2ct.ywan@gmail.com baokq@mail.ustc.edu.cn
{liudayiheng.ldyh,yangbaosong.ybs,qingjing.xj}@alibaba-inc.com
{derekfw,lidiasc}@um.edu.mo wenqianglei@gmail.com

∗Equal contribution. Work was done when Yu Wan and Keqin Bao were interning at DAMO Academy, Alibaba Group.
Abstract
In this report, we present our submission to the WMT 2022 Metrics Shared Task. We build our system based on the core idea of UNITE (Unified Translation Evaluation), which unifies source-only, reference-only, and source-reference-combined evaluation scenarios into one single model. Specifically, during the model pre-training phase, we first apply pseudo-labeled data examples to continuously pre-train UNITE. Notably, to reduce the gap between pre-training and fine-tuning, we use data cropping and a ranking-based score normalization strategy. During the fine-tuning phase, we use both Direct Assessment (DA) and Multidimensional Quality Metrics (MQM) data from past years' WMT competitions. In particular, we collect the results from models with different pre-trained language model backbones, and use different ensembling strategies for the involved translation directions.
1 Introduction
Translation metrics aim at delivering accurate and convincing predictions of the translation quality of outputs, given access to one or more gold-standard reference translations (Ma et al., 2018, 2019; Mathur et al., 2020; Freitag et al., 2021b). With the development of neural machine translation research (Vaswani et al., 2017; Wei et al., 2022), metric methods should be capable of evaluating high-quality translations at the level of semantics rather than surface-level features (Sellam et al., 2020; Ranasinghe et al., 2020; Rei et al., 2020; Wan et al., 2022a). In this paper, we describe Alibaba Translate China's submissions to the WMT 2022 Metrics Shared Task, which deliver a more adequate evaluation solution at the level of semantics.
Pre-trained language models (PLMs) like BERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) have shown promising results in identifying the quality of translation outputs. Compared to conventional statistical methods (e.g., BLEU; Papineni et al., 2002) and representation-based methods (e.g., BERTSCORE; Zhang et al., 2020), model-based approaches (e.g., BLEURT, Sellam et al., 2020; COMET, Rei et al., 2020; UNITE, Wan et al., 2022a) show a strong ability to deliver more accurate quality predictions, especially those approaches which take source sentences as additional input to the metric model (Rei et al., 2020; Takahashi et al., 2020; Wan et al., 2021, 2022a).
Specifically, those metric models are designed as a combination of a PLM and a feedforward network, where the former is in charge of deriving representations of the input sequence, and the latter predicts the translation quality based on those representations. The metric model, which is trained on synthetic or human annotations following a regressive objective, learns to mimic human predictions to identify the translation quality of the hypothesis sentence.
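As a rough illustration of this regressive objective, one training step could look like the following sketch. The mean-squared-error loss and the (input_ids, attention_mask) → score interface are common choices we assume here, not necessarily the exact details of the cited systems.

```python
# Illustrative regression step for a model-based metric: a PLM encoder with a
# feedforward head is fitted to human quality judgments. The MSE loss and the
# model interface are assumptions; the cited systems may differ in detail.
import torch
import torch.nn as nn

def train_step(model: nn.Module,
               optimizer: torch.optim.Optimizer,
               input_ids: torch.Tensor,
               attention_mask: torch.Tensor,
               human_scores: torch.Tensor) -> float:
    """One regression step: predict a scalar per segment, fit human ratings."""
    model.train()
    optimizer.zero_grad()
    pred = model(input_ids, attention_mask)            # (batch,) quality scores
    loss = nn.functional.mse_loss(pred, human_scores)  # regressive objective
    loss.backward()
    optimizer.step()
    return loss.item()
```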
Although those model-based metrics have shown promising results in modern applications and translation quality estimation, they still have the following shortcomings. First, they often handle only one specific evaluation scenario; e.g., COMET only serves source-reference-combined evaluation, where the source and reference sentences must be concurrently fed to the model for prediction. For the other evaluation scenarios, they hardly give accurate predictions, revealing the limitations of metric models caused by the disagreement between training and inference. Besides, recent studies have investigated the feasibility of unifying those evaluation scenarios into one single model, which can further improve the evaluation correlation with human ratings in any scenario among source-only, reference-only, and source-reference-combined evaluation (Wan et al., 2021, 2022a). This indicates that training with multiple input formats, rather than a single one, can deliver more appropriate predictions for translation quality identification. More importantly, unifying all translation evaluation functionalities into one single model can serve as a more convenient toolkit in real-world applications.
Following the idea of Wan et al. (2022a) and our experience in the previous competition (Wan et al., 2021), we directly use the pipeline of UNITE (Wan et al., 2022a) to build models for this year's metric task. Each of our models integrates the functionalities of source-only, reference-only, and source-reference-combined translation evaluation. When collecting the system outputs for the WMT 2022 Metrics Shared Task, we employ our UNITE models to predict translation quality scores following the source-reference-combined setting. Compared to the previous version of UNITE (Wan et al., 2022a), we rebuild the synthetic training set for the continuous pre-training phase, raising the ratio of training examples consisting of high-quality hypothesis sentences. Also, when fine-tuning our metric model, we apply the available Direct Assessment (DA; Bojar et al., 2017; Ma et al., 2018, 2019; Mathur et al., 2020) and Multidimensional Quality Metrics (MQM; Freitag et al., 2021a,b) datasets from previous WMT competitions to further improve the performance of our model. Specifically, for each translation direction among English to German (En-De), English to Russian (En-Ru), and Chinese to English (Zh-En), we apply a different ensembling strategy to achieve a better correlation with human ratings on the MQM 2021 dataset; a sketch of this ensembling is given below. Results on the WMT 2021 MQM dataset further demonstrate the effectiveness of our method.
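Since this report does not detail the exact ensembling strategy here, the following sketch shows one natural realization: a per-direction weighted average of z-normalized model scores. The model names, weights, and the z-normalization step are our illustrative assumptions.

```python
# A sketch of per-direction score ensembling. Model names, weights, and the
# z-normalization step are illustrative assumptions, not the exact strategy
# used in the submission.
from typing import Dict, Optional
import numpy as np

def ensemble_scores(per_model: Dict[str, np.ndarray],
                    weights: Optional[Dict[str, float]] = None) -> np.ndarray:
    """Weighted average of z-normalized segment scores from several models."""
    names = sorted(per_model)
    if weights is None:               # default: uniform ensemble
        weights = {n: 1.0 / len(names) for n in names}
    combined = np.zeros_like(per_model[names[0]], dtype=float)
    for n in names:
        s = per_model[n].astype(float)
        z = (s - s.mean()) / s.std()  # put all models on a shared scale
        combined += weights[n] * z
    return combined

# Hypothetical per-direction configurations (En-De, En-Ru, Zh-En).
direction_weights = {
    "En-De": {"unite-xlmr": 0.5, "unite-infoxlm": 0.5},
    "En-Ru": {"unite-xlmr": 0.3, "unite-infoxlm": 0.7},
    "Zh-En": None,                    # plain uniform average
}
```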
2 Method
As outlined in §1, we apply the UNITE framework (Wan et al., 2022a) to obtain metric models. We use three types of input formats (i.e., source-only, reference-only, and source-reference-combined) during training, while during inference we only use the source-reference-combined paradigm to collect evaluation scores. In this section, we introduce the applied model architecture (§2.1), synthetic data construction method (§2.2), and model training strategy (§2.3) for this year's metric competition.
2.1 Model architecture
Input Format
Following Wan et al. (2022a), we construct the input sequences for the source-only, reference-only, and source-reference-combined input formats as follows:
  x_SRC     = [BOS] h [DEL] s [EOS],           (1)
  x_REF     = [BOS] h [DEL] r [EOS],           (2)
  x_SRC+REF = [BOS] h [DEL] s [DEL] r [EOS],   (3)
where [BOS], [DEL], and [EOS] represent the beginning, the delimiter, and the ending of the sequence,¹ and h, s, and r are the hypothesis, source, and reference sentences, respectively. During the pre-training phase, we applied all input formats to enhance the performance of UNITE models.
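To make the input formats concrete, the sketch below builds the three sequences of Eqs. (1)-(3), instantiating the special tokens with XLM-R's vocabulary as given in footnote 1. The helper function is our illustration, not the released implementation.

```python
# Illustrative construction of the three UNITE input formats (Eqs. 1-3).
# For XLM-R: [BOS]="<s>", [DEL]="</s> </s>", [EOS]="</s>" (see footnote 1).
BOS, DEL, EOS = "<s>", "</s> </s>", "</s>"

def build_inputs(h: str, s: str, r: str) -> dict:
    """Return the source-only, reference-only, and combined sequences."""
    return {
        "src":     f"{BOS} {h} {DEL} {s} {EOS}",
        "ref":     f"{BOS} {h} {DEL} {r} {EOS}",
        "src+ref": f"{BOS} {h} {DEL} {s} {DEL} {r} {EOS}",
    }

inputs = build_inputs(
    h="The cat sat on the mat.",           # hypothesis
    s="Die Katze saß auf der Matte.",      # source
    r="The cat was sitting on the mat.",   # reference
)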
Model Backbone Selection
Aside from the reference sentence, which is written in the same language as the hypothesis sentence, the source is in a different language. We believe that cross-lingual semantic alignments can ease model training in the source-only and source-reference-combined scenarios. Existing methods (Ranasinghe et al., 2020; Rei et al., 2020; Sellam et al., 2020; Wan et al., 2022a) apply XLM-R (Conneau et al., 2020) as the backbone of evaluation models for better multilingual support. In this competition, we additionally use INFOXLM (Chi et al., 2021), which enhances the XLM-R model with cross-lingual alignments, as the backbone of our UNITE models.
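Both backbones are available as public checkpoints; a minimal loading sketch with Hugging Face transformers could look as follows. The large-size checkpoint names are our assumption, since the report does not state the model sizes here.

```python
# Hypothetical backbone loading; the checkpoint sizes are assumptions.
from transformers import AutoModel, AutoTokenizer

backbones = {
    "xlm-r":   "xlm-roberta-large",        # multilingual baseline backbone
    "infoxlm": "microsoft/infoxlm-large",  # XLM-R + cross-lingual alignment
}

models = {name: AutoModel.from_pretrained(ckpt)
          for name, ckpt in backbones.items()}
tokenizers = {name: AutoTokenizer.from_pretrained(ckpt)
              for name, ckpt in backbones.items()}
```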
Model Training
Following Wan et al. (2022a), we first equally split all examples into three parts, each of which serves the training of only one input format. For each training example, after concatenating the required input sentences into one sequence and feeding it to the PLM, we collect the corresponding representations H_REF, H_SRC, and H_SRC+REF for each input format, respectively. After that, we use the output embedding assigned to the CLS token, h, as the sequence representation. Finally, a feedforward network takes h as input and gives a scalar p as the prediction. Taking x_SRC as an example:
  H_SRC = PLM(x_SRC) ∈ R^((l_h + l_s) × d),    (4)
  h_SRC = CLS(H_SRC) ∈ R^d,                    (5)
  p_SRC = FeedForward(h_SRC) ∈ R^1,            (6)

where l_h and l_s are the lengths of h and s, respectively.
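As a concrete reading of Eqs. (4)-(6), a minimal PyTorch-style forward pass is sketched below. The feedforward head layout and the use of the first token's embedding as the CLS representation are assumptions, since the report does not specify these details.

```python
# Minimal sketch of the UNITE scoring head (Eqs. 4-6); the head layout and
# CLS pooling are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class UniteScorer(nn.Module):
    def __init__(self, backbone: str = "xlm-roberta-large"):
        super().__init__()
        self.plm = AutoModel.from_pretrained(backbone)
        d = self.plm.config.hidden_size
        # FeedForward(.): maps the CLS embedding h to a scalar prediction p.
        self.head = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, 1))

    def forward(self, input_ids, attention_mask):
        # Eq. (4): H = PLM(x), one d-dimensional vector per input token.
        H = self.plm(input_ids=input_ids,
                     attention_mask=attention_mask).last_hidden_state
        # Eq. (5): h = CLS(H); we take the first token's embedding.
        h = H[:, 0]
        # Eq. (6): p = FeedForward(h), a scalar quality prediction.
        return self.head(h).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
scorer = UniteScorer()
# Text-pair encoding yields "<s> h </s> </s> s </s>", matching Eq. (1) for XLM-R.
batch = tokenizer("The cat sat on the mat.",       # hypothesis h
                  "Die Katze saß auf der Matte.",  # source s
                  return_tensors="pt")
score = scorer(**batch)  # tensor of shape (1,): the predicted quality p_SRC
```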
¹Those symbols may vary if we use different PLMs, e.g., "[CLS]", "[SEP]", and "[SEP]" for English BERT (Devlin et al., 2019), and "<s>", "</s> </s>", and "</s>" for XLM-R (Conneau et al., 2020).