arXiv:2210.15861v1 [cs.CL] 28 Oct 2022
Domain Adaptation of Machine Translation with Crowdworkers
Makoto Morishita1, Jun Suzuki2, Masaaki Nagata1
NTT Communication Science Laboratories, NTT Corporation1
Tohoku University2
{makoto.morishita.gr, masaaki.nagata.et}@hco.ntt.co.jp
jun.suzuki@tohoku.ac.jp
Abstract
Although a machine translation model trained with a large in-domain parallel corpus achieves remarkable results, it still works poorly when no in-domain data are available. This situation restricts the applicability of machine translation when the target domain's data are limited. However, there is great demand for high-quality domain-specific machine translation models for many domains. We propose a framework that efficiently and effectively collects parallel sentences in a target domain from the web with the help of crowdworkers. With the collected parallel data, we can quickly adapt a machine translation model to the target domain. Our experiments show that the proposed method can collect target-domain parallel data over a few days at a reasonable cost. We tested it with five domains, and the domain-adapted model improved the BLEU scores by up to +19.7 points, an average of +7.8 points, compared to a general-purpose translation model.
1 Introduction
Although recent Neural Machine Translation (NMT) methods have achieved remarkable performance, their translation quality drastically drops when the input domain is not covered by training data (Müller et al., 2020). One typical approach for translating such inputs is adapting the machine translation model to a domain with a small portion of in-domain parallel sentences (Chu and Wang, 2018). Such sentences are normally extracted from a large existing parallel corpus (Wang et al., 2017; van der Wees et al., 2017) or created synthetically from a monolingual corpus (Chinea-Ríos et al., 2017). However, the existing parallel/monolingual data may not include enough sentences relevant to the target domain.
There is a real-world need for a method that can adapt a machine translation model to any domain. For example, users reading or writing in such specific fields as the scientific, medical, or patent domains would benefit from access to a domain-adapted machine translation model. Unfortunately, the often limited availability of in-domain parallel data complicates this task. For example, it is difficult to adapt a model to the COVID-19 domain because the topic is too new, and the currently available data do not sufficiently cover it.

Figure 1: Overview of the proposed domain-adaptation method with crowdworkers, who collect URLs that include parallel sentences of the target domain. We then fine-tune a general-purpose model with the collected target-domain parallel sentences. See Section 3 for details.
To alleviate this issue, we propose a method that rapidly adapts a machine translation model to many domains at a reasonable cost and within a short time period with crowdworkers. Fig. 1 shows an overview of our framework. We hypothesize that a small number of in-domain parallel sentences of the target domain are available on the web, and we ask crowdworkers to report these web URLs as a web-mining task. Our task does not require translation skills, unlike some previous research (Zaidan and Callison-Burch, 2011; Behnke et al., 2018; Kalimuthu et al., 2019) that attempted manual translations of in-domain monolingual sentences by crowdworkers. Thus, workers who are not professional translators can participate.
Furthermore, to collect effective parallel sentences, we also vary the crowdworkers' rewards based on the quality of their reported URLs. After collecting parallel sentences by our method, we adapted the machine translation model with the collected target-domain parallel sentences. Our method has the advantage of being applicable to many domains, in contrast to previous works that use existing parallel/monolingual data.

We experimentally show that our method quickly collects in-domain parallel sentences and improves the translation performance of the target domains in a few days and at a reasonable cost.
Our contributions can be summarized as follows:
• We propose a new domain-adaptation method that quickly collects in-domain parallel sentences from the web with crowdworkers.
• We empirically show that crowdworkers are motivated by variable rewards to find more valuable websites, achieving better performance than under a fixed reward system.
2 Related Work
2.1 Domain Adaptation
Domain adaptation is a method that improves the performance of a machine translation model for a specific domain. The most common method for neural machine translation models is to fine-tune the model with target-domain parallel sentences (Chu and Wang, 2018). Kiyono et al. (2020), who ranked first in the WMT 2020 news shared task (Barrault et al., 2020), fine-tuned a model with a news-domain parallel corpus and improved the BLEU scores by +2.2 points. Since the availability of a target-domain parallel corpus is limited, we typically select similar-domain sentences from a large parallel corpus (Moore and Lewis, 2010; Axelrod et al., 2011). However, its applicability remains limited because some domains are not covered by existing parallel corpora.

We take a different approach that freshly collects target-domain parallel sentences from the web. Since we do not rely on an existing corpus, our method can be applied to many domains.
2.2 Collecting Parallel Sentences from the Web
Recently, some works successfully built a large-scale parallel corpus by collecting parallel sentences from the web. The BUCC workshop organized shared tasks on extracting parallel sentences from the web (Sharoff et al., 2015; Zweigenbaum et al., 2017). The ParaCrawl project successfully created a large-scale parallel corpus between English and other European languages by extensively crawling the web (Bañón et al., 2020). Typical bitext-mining projects, including ParaCrawl, took the following steps to identify parallel sentences from the web (Resnik and Smith, 2003): (1) find multilingual websites, which may contain parallel sentences, from the web (Papavassiliou et al., 2018; Bañón et al., 2020); (2) find parallel documents from websites (Thompson and Koehn, 2020; El-Kishky and Guzmán, 2020); (3) extract parallel sentences from parallel web URLs (Thompson and Koehn, 2019; Chousa et al., 2020). Our work focuses on the first step: finding bilingual target-domain web URLs. Bañón et al. (2020) analyzed all of the CommonCrawl data to find crawl-candidate websites that contain a certain amount of both source- and target-language texts. Their method efficiently collected parallel sentences from the web. However, since CommonCrawl only covers a small portion of the web, it may overlook websites that contain valuable resources. Thus, the current web-based corpora (Bañón et al., 2020; Morishita et al., 2020) may not cover all the domains we want to adapt to. It is also difficult to focus on a specific topic. In contrast, our work does not rely on CommonCrawl but on crowdworkers, who can search the whole web and focus on specific domains.
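The three-step mining pipeline described above can be sketched as a sequence of filters. The following is a minimal illustration only: the data model, the path-based document pairing, and the position-based sentence alignment are toy stand-ins for the far more elaborate components used by the cited projects.

```python
# Toy sketch of the three-step bitext-mining pipeline described above.
# All data structures and matching heuristics are illustrative assumptions.

def find_multilingual_sites(sites, langs=("en", "ja")):
    # Step 1: keep sites whose pages cover both languages.
    return [s for s in sites
            if {p["lang"] for p in s["pages"]} >= set(langs)]

def align_documents(site, langs=("en", "ja")):
    # Step 2: pair documents across languages (here: naively by shared path;
    # real systems use URL patterns or document embeddings).
    by_path = {}
    for p in site["pages"]:
        by_path.setdefault(p["path"], {})[p["lang"]] = p
    return [(d[langs[0]], d[langs[1]])
            for d in by_path.values() if set(langs) <= d.keys()]

def extract_sentence_pairs(doc_pair):
    # Step 3: align sentences within a document pair (here: by position,
    # standing in for embedding-based aligners such as vecalign).
    src, tgt = doc_pair
    return list(zip(src["sents"], tgt["sents"]))

sites = [{"pages": [
    {"path": "/news/1", "lang": "en", "sents": ["Hello.", "Bye."]},
    {"path": "/news/1", "lang": "ja", "sents": ["こんにちは。", "さようなら。"]},
]}]
pairs = [p for s in find_multilingual_sites(sites)
         for dp in align_documents(s)
         for p in extract_sentence_pairs(dp)]
print(pairs)
```

In this framing, our crowdworkers replace the automated step 1 (and, since they report page pairs, effectively step 2 as well).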
2.3 Creating a Parallel Corpus with Crowdworkers
Some researchers have used crowdsourcing platforms to create new language resources (Roit et al., 2020; Jiang et al., 2018). Some work created a parallel corpus for domain adaptation by asking crowdworkers to translate in-domain monolingual sentences (Zaidan and Callison-Burch, 2011; Behnke et al., 2018; Kalimuthu et al., 2019). Although this approach is straightforward, it suffers from several drawbacks. For example, it is often difficult to find a sufficient number of crowdworkers, since translation tasks require an understanding of both of the languages involved. Note that although we also use a crowdsourcing platform, our approach entirely differs from the approaches introduced in this section, which ask crowdworkers to do translation tasks.
3 Collecting Parallel URLs with Crowdworkers
Fig. 1 shows an overview of our collection protocol. Our method asks workers to find URLs that are related to the target domain and written in parallel. We then extract the parallel sentences from these URLs and fine-tune the general-purpose machine translation model with the collected data.

This section is organized as follows: In Section 3.1, we explain why we focus on collecting parallel URLs and describe their advantages. We overview the details of our crowdsourcing task definition in Section 3.2. In Section 3.3, we describe how we extract parallel sentences from the reported URLs. We describe the details of our reward setting in Section 3.4.
3.1 Advantages
Previous works, which adapted a machine translation model to a specific domain, created resources by asking crowdworkers to translate text (Lewis et al., 2011; Anastasopoulos et al., 2020; Zaidan and Callison-Burch, 2011; Behnke et al., 2018; Kalimuthu et al., 2019). In contrast, our method asks workers to find web URLs (instead of translating sentences) that have parallel sentences in the target domain. This method has two advantages.

The first concerns task difficulty. To achieve rapid domain adaptation, the task must be easy enough that many crowdworkers can participate. Thus, we do not assume that the workers fluently understand both the source and target languages. Finding potential web URLs that have parallel sentences is relatively easy and can be done by any crowdworker.

The other advantage involves task efficiency. We asked workers to collect the URLs of parallel web pages instead of parallel sentences because recent previous works successfully extracted parallel sentences from parallel URLs (Bañón et al., 2020). Efficiency is important for our method, since we focus on speed to create a domain-specific model.
3.2 Crowdsourcing Task Definition
We focus on collecting parallel sentences in languages e and f. We created a web application to accept reports from the crowdworkers and extracted parallel sentences from the reported web URLs. We prepared a development set (a small portion of the parallel sentences) of the target domain and distributed it to the workers as examples of the type of sentences we want them to collect. The crowdworkers are asked to find pairs of web URLs that contain parallel sentences of the target domain. We call this URL pair a parallel URL. Note that we collect the URLs of pages written in parallel; this means that workers act as parallel document aligners. We do not accept parallel URLs that have already been reported by others.
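The report-acceptance logic can be sketched as below. The paper does not describe the web application's internals, so the URL normalization and in-memory storage here are assumptions for illustration.

```python
# Hypothetical sketch of accepting crowdworker reports while rejecting
# parallel URLs that have already been reported by others.

seen_pairs = set()

def accept_report(url_e, url_f):
    # Treat the (e-side, f-side) pair as unordered, so a swapped or
    # trailing-slash variant of an earlier report is also rejected.
    key = frozenset((url_e.rstrip("/"), url_f.rstrip("/")))
    if key in seen_pairs:
        return False  # duplicate: already reported
    seen_pairs.add(key)
    return True

print(accept_report("https://ex.com/en/covid", "https://ex.com/ja/covid"))
print(accept_report("https://ex.com/ja/covid", "https://ex.com/en/covid/"))
```

A production system would additionally need canonicalization of redirects and query strings, which this sketch ignores.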
3.3 Parallel Sentence Extraction
After obtaining parallel URLs from workers, we extract parallel sentences from the reported URLs. First, we downloaded the reported web URLs, extracted the texts¹, and removed the sentences that are not in language e or f based on CLD2². Then we used vecalign (Thompson and Koehn, 2019) to extract the parallel sentences, a step that aligns them based on the multilingual sentence embeddings of LASER (Artetxe and Schwenk, 2019). We discard noisy sentence pairs based on sentence alignment scores³ and do not use them for model training.
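The final filtering step can be sketched as follows. The 0.7 cost threshold comes from the footnoted setting, while the (cost, source, target) tuple format is an assumption for illustration; vecalign's actual output format differs.

```python
# Sketch of discarding noisy sentence pairs by alignment cost.
# vecalign reports a cost where lower means a better alignment; following
# the setting in footnote 3, pairs whose cost exceeds 0.7 are removed.

COST_THRESHOLD = 0.7

def filter_pairs(aligned, threshold=COST_THRESHOLD):
    # Keep only pairs whose alignment cost does not exceed the threshold.
    return [(src, tgt) for cost, src, tgt in aligned if cost <= threshold]

aligned = [
    (0.12, "The vaccine is effective.", "ワクチンは有効です。"),
    (0.95, "Click here to subscribe.", "天気は晴れです。"),  # likely misalignment
]
print(filter_pairs(aligned))
```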
3.4 Reward Settings
To bolster the crowdworkers' motivation, the reward setting is one of the most important issues (Posch et al., 2019). In this paper, we tested two types of rewards: fixed and variable. In the following, we describe both reward settings.

3.4.1 Fixed Reward
The fixed reward pays a set amount for each reported URL if we can extract at least one parallel sentence from it. This is a very typical setting for crowdsourcing.
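The fixed-reward rule can be sketched as below. The payment amount and the report format are illustrative assumptions; only the rule itself (pay per reported URL pair that yields at least one parallel sentence) comes from the text.

```python
# Sketch of the fixed-reward rule: a set amount is paid for each reported
# URL pair from which at least one parallel sentence was extracted.

FIXED_REWARD = 0.5  # hypothetical payment per accepted report

def payout(reports):
    # reports: list of (worker_id, number_of_extracted_sentence_pairs)
    totals = {}
    for worker, n_pairs in reports:
        if n_pairs >= 1:  # pay only if extraction succeeded
            totals[worker] = totals.get(worker, 0.0) + FIXED_REWARD
    return totals

print(payout([("w1", 12), ("w1", 0), ("w2", 3)]))
```

Note that under this rule a report yielding 12 sentence pairs earns the same as one yielding a single pair, which is the motivation for the variable reward described next.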
¹ Since we expect the workers to act as document aligners, we focus on the reported URLs and do not crawl the links in the reported URLs.
² https://github.com/CLD2Owners/cld2
³ Since vecalign outputs a scoring cost where a lower score means better alignment, our implementation removes a sentence pair if its cost exceeds 0.7.