arXiv:2210.15861v1 [cs.CL] 28 Oct 2022
Domain Adaptation of Machine Translation with Crowdworkers
Makoto Morishita1, Jun Suzuki2, Masaaki Nagata1
NTT Communication Science Laboratories, NTT Corporation1
Tohoku University2
{makoto.morishita.gr, masaaki.nagata.et}@hco.ntt.co.jp
jun.suzuki@tohoku.ac.jp
Abstract
Although a machine translation model trained with a large in-domain parallel corpus achieves remarkable results, it still works poorly when no in-domain data are available. This situation restricts the applicability of machine translation when the target domain's data are limited. However, there is great demand for high-quality domain-specific machine translation models for many domains. We propose a framework that efficiently and effectively collects parallel sentences in a target domain from the web with the help of crowdworkers. With the collected parallel data, we can quickly adapt a machine translation model to the target domain. Our experiments show that the proposed method can collect target-domain parallel data over a few days at a reasonable cost. We tested it with five domains, and the domain-adapted model improved the BLEU scores by up to +19.7 points, an average of +7.8 points, compared to a general-purpose translation model.
1 Introduction
Although recent Neural Machine Translation (NMT) methods have achieved remarkable performance, their translation quality drastically drops when the input domain is not covered by training data (Müller et al., 2020). One typical approach for translating such inputs is adapting the machine translation model to a domain with a small portion of in-domain parallel sentences (Chu and Wang, 2018). Such sentences are normally extracted from a large existing parallel corpus (Wang et al., 2017; van der Wees et al., 2017) or created synthetically from a monolingual corpus (Chinea-Ríos et al., 2017). However, the existing parallel/monolingual data may not include enough sentences relevant to the target domain.
There is a real-world need for a method that can adapt a machine translation model to any domain. For example, users reading or writing in such specific fields as the scientific, medical, or patent domains would benefit from access to a domain-adapted machine translation model. Unfortunately, the often limited availability of in-domain parallel data complicates this task. For example, it is difficult to adapt a model to the COVID-19 domain because the topic is too new, and the currently available data do not sufficiently cover it.

Figure 1: Overview of the proposed domain-adaptation method with crowdworkers, who collect URLs that include parallel sentences of the target domain. We then fine-tune a general-purpose model with the collected target-domain parallel sentences. See Section 3 for details.
To alleviate this issue, we propose a method that rapidly adapts a machine translation model to many domains at a reasonable cost and within a short time period with crowdworkers. Fig. 1 shows an overview of our framework. We hypothesize that a small number of in-domain parallel sentences of the target domain are available on the web, and we ask crowdworkers to report these web URLs as a web-mining task. Our task does not require translation skills, unlike some previous research (Zaidan and Callison-Burch, 2011; Behnke et al., 2018; Kalimuthu et al., 2019) that attempted manual translations of in-domain monolingual sentences by crowdworkers. Thus, workers who are not professional translators can participate.
Furthermore, to collect effective parallel sentences, we also vary the crowdworkers' rewards based on the quality of their reported URLs. After collecting parallel sentences by our method, we adapted the machine translation model with the collected target-domain parallel sentences. Our method has the advantage of being applicable to many domains, in contrast to previous works that use existing parallel/monolingual data.

We experimentally show that our method quickly collects in-domain parallel sentences and improves the translation performance of the target domains in a few days and at a reasonable cost.
Our contributions can be summarized as follows:
• We propose a new domain-adaptation method that quickly collects in-domain parallel sentences from the web with crowdworkers.
• We empirically show that crowdworkers are motivated by variable rewards to find more valuable websites, achieving better performance than under a fixed reward system.
2 Related Work
2.1 Domain Adaptation
Domain adaptation is a method that improves the performance of a machine translation model for a specific domain. The most common method for neural machine translation models is to fine-tune the model with target-domain parallel sentences (Chu and Wang, 2018). Kiyono et al. (2020), who ranked first in the WMT 2020 news shared task (Barrault et al., 2020), fine-tuned a model with a news-domain parallel corpus and improved the BLEU scores by +2.2 points. Since the availability of a target-domain parallel corpus is limited, we typically select similar-domain sentences from a large parallel corpus (Moore and Lewis, 2010; Axelrod et al., 2011). However, its applicability remains limited because some domains are not covered by existing parallel corpora.

We take a different approach that freshly collects target-domain parallel sentences from the web. Since we do not rely on an existing corpus, our method can be applied to many domains.
2.2 Collecting Parallel Sentences from the Web
Recently, some works successfully built a large-scale parallel corpus by collecting parallel sentences from the web. The BUCC workshop organized shared tasks on extracting parallel sentences from the web (Sharoff et al., 2015; Zweigenbaum et al., 2017). The ParaCrawl project successfully created a large-scale parallel corpus between English and other European languages by extensively crawling the web (Bañón et al., 2020). Typical bitext-mining projects, including ParaCrawl, took the following steps to identify parallel sentences from the web (Resnik and Smith, 2003): (1) find multilingual websites, which may contain parallel sentences, from the web (Papavassiliou et al., 2018; Bañón et al., 2020); (2) find parallel documents from websites (Thompson and Koehn, 2020; El-Kishky and Guzmán, 2020); (3) extract parallel sentences from parallel web URLs (Thompson and Koehn, 2019; Chousa et al., 2020). Our work focuses on the first step: finding bilingual target-domain web URLs. Bañón et al. (2020) analyzed all of the CommonCrawl data to find crawl-candidate websites that contain a certain amount of both source- and target-language texts. Their method efficiently collected parallel sentences from the web. However, since CommonCrawl only covers a small portion of the web, it may overlook websites that contain valuable resources. Thus, the current web-based corpora (Bañón et al., 2020; Morishita et al., 2020) may not cover all the domains we want to adapt to. It is also difficult to focus on a specific topic. In contrast, our work does not rely on CommonCrawl but on crowdworkers, who can search the whole web and focus on specific domains.
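The three-step mining pipeline described above can be sketched as a sequence of filters. The following is a minimal illustration only: the data model, the path-based document pairing, and the position-based sentence alignment are toy stand-ins for the far more elaborate components used by the cited projects.

```python
# Toy sketch of the three-step bitext-mining pipeline described above.
# All data structures and matching heuristics are illustrative assumptions.

def find_multilingual_sites(sites, langs=("en", "ja")):
    # Step 1: keep sites whose pages cover both languages.
    return [s for s in sites
            if {p["lang"] for p in s["pages"]} >= set(langs)]

def align_documents(site, langs=("en", "ja")):
    # Step 2: pair documents across languages (here: naively by shared path;
    # real systems use URL patterns or document embeddings).
    by_path = {}
    for p in site["pages"]:
        by_path.setdefault(p["path"], {})[p["lang"]] = p
    return [(d[langs[0]], d[langs[1]])
            for d in by_path.values() if set(langs) <= d.keys()]

def extract_sentence_pairs(doc_pair):
    # Step 3: align sentences within a document pair (here: by position,
    # standing in for embedding-based aligners such as vecalign).
    src, tgt = doc_pair
    return list(zip(src["sents"], tgt["sents"]))

sites = [{"pages": [
    {"path": "/news/1", "lang": "en", "sents": ["Hello.", "Bye."]},
    {"path": "/news/1", "lang": "ja", "sents": ["こんにちは。", "さようなら。"]},
]}]
pairs = [p for s in find_multilingual_sites(sites)
         for dp in align_documents(s)
         for p in extract_sentence_pairs(dp)]
print(pairs)
```

In this framing, our crowdworkers replace the automated step 1 (and, since they report page pairs, effectively step 2 as well).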
2.3 Creating a Parallel Corpus with Crowdworkers
Some researchers have used crowdsourcing platforms to create new language resources (Roit et al., 2020; Jiang et al., 2018). Some work created a parallel corpus for domain adaptation by asking crowdworkers to translate in-domain monolingual sentences (Zaidan and Callison-Burch, 2011; Behnke et al., 2018; Kalimuthu et al., 2019). Although this approach is straightforward, it suffers from several drawbacks. For example, it is often difficult to find a sufficient number of crowdworkers, since translation tasks require an understanding of both of the languages involved. Note that although we also use a crowdsourcing platform, our approach entirely differs from the approaches introduced in this section, which ask crowdworkers to do translation tasks.
3 Collecting Parallel URLs with Crowdworkers
Fig. 1 shows an overview of our collection protocol. Our method asks workers to find URLs that are related to the target domain and written in parallel. We then extract the parallel sentences from these URLs and fine-tune the general-purpose machine translation model with the collected data.

This section is organized as follows: In Section 3.1, we explain why we focus on collecting parallel URLs and describe their advantages. We overview the details of our crowdsourcing task definition in Section 3.2. In Section 3.3, we describe how we extract parallel sentences from the reported URLs. We describe the details of our reward setting in Section 3.4.
3.1 Advantages
Previous works, which adapted a machine translation model to a specific domain, created resources by asking crowdworkers to translate text (Lewis et al., 2011; Anastasopoulos et al., 2020; Zaidan and Callison-Burch, 2011; Behnke et al., 2018; Kalimuthu et al., 2019). In contrast, our method asks workers to find web URLs (instead of translating sentences) that have parallel sentences in the target domain. This method has two advantages.

The first concerns task difficulty. To achieve rapid domain adaptation, the task must be easy enough that many crowdworkers can participate. Thus, we do not assume that the workers fluently understand both the source and target languages. Finding potential web URLs that have parallel sentences is relatively easy and can be done by any crowdworker.

The other advantage involves task efficiency. We asked workers to collect the URLs of parallel web pages instead of parallel sentences because recent previous works successfully extracted parallel sentences from parallel URLs (Bañón et al., 2020). Efficiency is important for our method, since we focus on speed to create a domain-specific model.
3.2 Crowdsourcing Task Definition
We focus on collecting parallel sentences in languages e and f. We created a web application to accept reports from the crowdworkers and extracted parallel sentences from the reported web URLs. We prepared a development set (a small portion of the parallel sentences) of the target domain and distributed it to the workers as examples of the type of sentences we want them to collect. The crowdworkers are asked to find pairs of web URLs that contain parallel sentences of the target domain. We call this URL pair a parallel URL. Note that we collect the URLs of pages written in parallel; this means that workers act as parallel document aligners. We do not accept parallel URLs that have already been reported by others.
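The report-acceptance logic can be sketched as below. The paper does not describe the web application's internals, so the URL normalization and in-memory storage here are assumptions for illustration.

```python
# Hypothetical sketch of accepting crowdworker reports while rejecting
# parallel URLs that have already been reported by others.

seen_pairs = set()

def accept_report(url_e, url_f):
    # Treat the (e-side, f-side) pair as unordered, so a swapped or
    # trailing-slash variant of an earlier report is also rejected.
    key = frozenset((url_e.rstrip("/"), url_f.rstrip("/")))
    if key in seen_pairs:
        return False  # duplicate: already reported
    seen_pairs.add(key)
    return True

print(accept_report("https://ex.com/en/covid", "https://ex.com/ja/covid"))
print(accept_report("https://ex.com/ja/covid", "https://ex.com/en/covid/"))
```

A production system would additionally need canonicalization of redirects and query strings, which this sketch ignores.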
3.3 Parallel Sentence Extraction
After obtaining parallel URLs from workers, we extract parallel sentences from the reported URLs. First, we downloaded the reported web URLs, extracted the texts¹, and removed the sentences that are not in language e or f based on CLD2². Then we used vecalign (Thompson and Koehn, 2019) to extract the parallel sentences, a step that aligns them based on the multilingual sentence embeddings of LASER (Artetxe and Schwenk, 2019). We discard noisy sentence pairs based on sentence alignment scores³ and do not use them for model training.
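The final filtering step can be sketched as follows. The 0.7 cost threshold comes from the footnoted setting, while the (cost, source, target) tuple format is an assumption for illustration; vecalign's actual output format differs.

```python
# Sketch of discarding noisy sentence pairs by alignment cost.
# vecalign reports a cost where lower means a better alignment; following
# the setting in footnote 3, pairs whose cost exceeds 0.7 are removed.

COST_THRESHOLD = 0.7

def filter_pairs(aligned, threshold=COST_THRESHOLD):
    # Keep only pairs whose alignment cost does not exceed the threshold.
    return [(src, tgt) for cost, src, tgt in aligned if cost <= threshold]

aligned = [
    (0.12, "The vaccine is effective.", "ワクチンは有効です。"),
    (0.95, "Click here to subscribe.", "天気は晴れです。"),  # likely misalignment
]
print(filter_pairs(aligned))
```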
3.4 Reward Settings
To bolster the crowdworkers' motivation, the reward setting is one of the most important issues (Posch et al., 2019). In this paper, we tested two types of rewards: fixed and variable. In the following, we describe both reward settings.

3.4.1 Fixed Reward
The fixed reward pays a set amount for each reported URL if we can extract at least one parallel sentence from it. This is a very typical setting for crowdsourcing.
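The fixed-reward rule can be sketched as below. The payment amount and the report format are illustrative assumptions; only the rule itself (pay per reported URL pair that yields at least one parallel sentence) comes from the text.

```python
# Sketch of the fixed-reward rule: a set amount is paid for each reported
# URL pair from which at least one parallel sentence was extracted.

FIXED_REWARD = 0.5  # hypothetical payment per accepted report

def payout(reports):
    # reports: list of (worker_id, number_of_extracted_sentence_pairs)
    totals = {}
    for worker, n_pairs in reports:
        if n_pairs >= 1:  # pay only if extraction succeeded
            totals[worker] = totals.get(worker, 0.0) + FIXED_REWARD
    return totals

print(payout([("w1", 12), ("w1", 0), ("w2", 3)]))
```

Note that under this rule a report yielding 12 sentence pairs earns the same as one yielding a single pair, which is the motivation for the variable reward described next.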
¹ Since we expect the workers to act as document aligners, we focus on the reported URLs and do not crawl the links in the reported URLs.
² https://github.com/CLD2Owners/cld2
³ Since vecalign outputs a scoring cost where a lower score means better alignment, our implementation removes a sentence pair if its cost exceeds 0.7.