domain monolingual sentences by crowdworkers.
Thus, workers who are not professional translators
can participate.
Furthermore, to collect useful parallel sentences,
we also vary the crowdworkers’ rewards
based on the quality of their reported URLs. After
collecting parallel sentences with our method, we
adapt the machine translation model using the
collected target-domain parallel sentences. Our
method has the advantage of being applicable to
many domains, in contrast to previous works that
use existing parallel/monolingual data.
We experimentally show that our method quickly
collects in-domain parallel sentences and improves
translation performance on the target domains
within a few days and at a reasonable cost.
Our contributions can be summarized as follows:
• We proposed a new domain-adaptation
method that quickly collects in-domain parallel
sentences from the web with crowdworkers.
• We empirically showed that variable rewards
motivate crowdworkers to find more valuable
websites, which yields better performance than
a fixed reward system.
2 Related Work
2.1 Domain Adaptation
Domain adaptation is a method that improves the
performance of a machine translation model for
a specific domain. The most common method
for neural machine translation models is to fine-tune
the model with target-domain parallel sentences
(Chu and Wang, 2018). Kiyono et al. (2020),
who ranked first in the WMT 2020 news shared
task (Barrault et al., 2020), fine-tuned a model with
a news-domain parallel corpus and improved the
BLEU score by 2.2 points. Since target-domain
parallel corpora are often scarce, a common
alternative is to select sentences similar to the
target domain from a large parallel corpus
(Moore and Lewis, 2010; Axelrod et al., 2011).
However, the applicability of such data selection
remains limited because some domains are not
covered by existing parallel corpora.
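To make the data-selection idea concrete, the following is a minimal sketch of cross-entropy-difference scoring in the spirit of Moore and Lewis (2010): each sentence s in a large pool is scored by H_in(s) - H_gen(s), and the lowest-scoring sentences are kept. It is an illustration only. The add-one-smoothed unigram models stand in for the n-gram language models normally used, and all function names, the whitespace tokenization, and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count whitespace tokens; returns (token counts, total token count)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy (nats) under an add-one-smoothed unigram model."""
    counts, total = model
    tokens = sentence.split()
    if not tokens:
        return float("inf")
    log_prob = sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -log_prob / len(tokens)


def select_in_domain(pool, in_domain, general, top_k):
    """Rank pool sentences by H_in(s) - H_gen(s); lower means more in-domain."""
    # Shared vocabulary size so both models are smoothed over the same events.
    vocab_size = len({t for s in in_domain + general + pool for t in s.split()})
    m_in = train_unigram(in_domain)
    m_gen = train_unigram(general)
    ranked = sorted(
        pool,
        key=lambda s: cross_entropy(s, m_in, vocab_size)
        - cross_entropy(s, m_gen, vocab_size),
    )
    return ranked[:top_k]


# Toy data (hypothetical): a medical target domain vs. general text.
in_domain = ["the patient received a dose", "the dose was reduced for the patient"]
general = ["the market closed higher today", "the team won the match"]
pool = ["the team lost the match today", "the patient needs a lower dose"]
selected = select_in_domain(pool, in_domain, general, top_k=1)
```

The sketch also makes the limitation visible: selection can only return sentences that already exist in the pool, so a domain absent from the pool cannot be recovered this way.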
We take a different approach that freshly col-
lects target-domain parallel sentences from the web.
Since we do not rely on an existing corpus, our
method can be applied to many domains.
2.2 Collecting Parallel Sentences from the Web
Recently, several studies have built large-scale
parallel corpora by collecting parallel sentences
from the web. The BUCC workshop organized
shared tasks on extracting parallel sentences
from the web (Sharoff et al., 2015; Zweigenbaum
et al., 2017). The ParaCrawl project
created a large-scale parallel corpus between English
and other European languages by extensively
crawling the web (Bañón et al., 2020). Typical
bitext-mining projects, including ParaCrawl, took
the following steps to identify parallel sentences
from the web (Resnik and Smith, 2003): (1) find
multilingual websites, which may contain parallel
sentences, from the web (Papavassiliou et al.,
2018; Bañón et al., 2020); (2) find parallel documents
from websites (Thompson and Koehn, 2020;
El-Kishky and Guzmán, 2020); (3) extract parallel
sentences from parallel web URLs (Thompson
and Koehn, 2019; Chousa et al., 2020). Our work
focuses on the first step: finding bilingual target-
domain web URLs. Bañón et al. (2020) analyzed
all of the CommonCrawl data to find crawl candi-
date websites that contain a certain amount of both
source and target language texts. Their method ef-
ficiently collected parallel sentences from the web.
However, since CommonCrawl only covers a small
portion of the web, it may overlook websites that
contain valuable resources. Thus, current web-based
corpora (Bañón et al., 2020; Morishita et al.,
2020) may not cover all the domains to which we
want to adapt. It is also difficult to target a
specific topic with this approach.
In contrast, our work does not rely on Common-
Crawl but on crowdworkers who can search the
whole web and focus on specific domains.
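As an illustration of the first step discussed above (finding candidate websites that contain a certain amount of text in both languages), the sketch below filters hosts by per-language text volume. It is not the ParaCrawl implementation. The function name, the (host, language, bytes) record layout, and the threshold value are assumptions for this example, and language identification is assumed to have been done beforehand.

```python
from collections import defaultdict


def candidate_hosts(pages, src_lang, tgt_lang, min_bytes):
    """Return hosts with at least min_bytes of text in both languages.

    pages: iterable of (host, lang, n_bytes) records, one per crawled page.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for host, lang, n_bytes in pages:
        totals[host][lang] += n_bytes
    return sorted(
        host
        for host, by_lang in totals.items()
        if by_lang.get(src_lang, 0) >= min_bytes
        and by_lang.get(tgt_lang, 0) >= min_bytes
    )


# Hypothetical crawl statistics: only a.example has enough text in both languages.
pages = [
    ("a.example", "en", 500),
    ("a.example", "ja", 400),
    ("a.example", "en", 100),
    ("b.example", "en", 900),
    ("c.example", "en", 600),
    ("c.example", "ja", 50),
]
hosts = candidate_hosts(pages, "en", "ja", min_bytes=300)
```

A filter of this kind only sees hosts that are present in the crawl, which is the coverage limitation noted above.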
2.3 Creating Parallel Corpus with Crowdworkers
Some researchers have used crowdsourcing platforms
to create new language resources (Roit et al.,
2020; Jiang et al., 2018). Others have created
a parallel corpus for domain adaptation by asking
crowdworkers to translate in-domain monolingual
sentences (Zaidan and Callison-Burch, 2011;
Behnke et al., 2018; Kalimuthu et al., 2019).
Although this approach is straightforward, it
suffers from several drawbacks. For example, it is
often difficult to find a sufficient number of
crowdworkers since translation tasks generally require an
understanding of both the languages that are actu-