Exploring Document-Level Literary Machine Translation
with Parallel Paragraphs from World Literature

Katherine Thai*, Marzena Karpinska*, Kalpesh Krishna, William Ray,
Moira Inghilleri, John Wieting, Mohit Iyyer

Manning College of Information and Computer Sciences, UMass Amherst
Department of Languages, Literatures, and Cultures, UMass Amherst
Google Research
{kbthai,mkarpinska,kalpesh,miyyer}@cs.umass.edu
minghilleri@complit.umass.edu, jwieting@google.com

arXiv:2210.14250v1 [cs.CL] 25 Oct 2022

*Authors contributed equally.
Abstract

Literary translation is a culturally significant task, but it is bottlenecked by the small number of qualified literary translators relative to the many untranslated works published around the world. Machine translation (MT) holds potential to complement the work of human translators by improving both training procedures and their overall efficiency. Literary translation is less constrained than more traditional MT settings since translators must balance meaning equivalence, readability, and critical interpretability in the target language. This property, along with the complex discourse-level context present in literary texts, also makes literary MT more challenging to computationally model and evaluate. To explore this task, we collect a dataset (PAR3) of non-English language novels in the public domain, each aligned at the paragraph level to both human and automatic English translations. Using PAR3, we discover that expert literary translators prefer reference human translations over machine-translated paragraphs at a rate of 84%, while state-of-the-art automatic MT metrics do not correlate with those preferences. The experts note that MT outputs contain not only mistranslations, but also discourse-disrupting errors and stylistic inconsistencies. To address these problems, we train a post-editing model whose output is preferred over normal MT output at a rate of 69% by experts. We publicly release PAR3 to spur future research into literary MT.[1]

[1] https://github.com/katherinethai/par3/
1 Introduction

While the quality of machine translation (MT) systems has greatly improved with recent advances in modeling and dataset collection, the application of these new technologies to the task of automatically translating literary text (e.g., novels, short stories) has remained limited to small-scale studies (Genzel et al., 2010; Jones and Irvine, 2013; Toral et al., 2018). Translating literary works differs from translating standard MT corpora (e.g., news articles or parliamentary proceedings) in several key ways. For one, it is much more difficult to evaluate. The techniques[2] used by literary translators differ fundamentally from those applied in more standard MT domains (see Table 8 in the Appendix). Literary translators have the freedom (or burden) of both semantic and critical interpretation, as they must solve the problem of equivalence, often beyond the word level (Neubert, 1983; Baker, 2018; Baker and Saldanha, 2021). The task of conveying an author's ideas highlights yet another difference between literary and traditional MT: document-level context is especially critical for the literary domain due to the presence of complex discourse structure, rendering the typical sentence-level MT pipeline insufficient for this task (Voigt and Jurafsky, 2012; Taivalkoski-Shilov, 2019).

In this work, we seek to understand how both state-of-the-art MT systems and MT evaluation metrics fail in the literary domain, and we also leverage large pretrained language models to improve literary MT. To facilitate our experiments, we introduce PAR3, a large-scale dataset to study paragraph-level literary translation into English. PAR3 consists of 121K paragraphs taken from 118 novels originally written in a non-English language, where each paragraph is aligned to multiple human-written English translations of that paragraph as well as a machine-translated paragraph produced by Google Translate (see Table 2).

[2] Many terms have been employed by translation scholars to refer to various operations used by translators (Chesterman, 2005). Here, we employ the term "techniques" argued for by Molina and Hurtado Albir (2004) and recently used in the field of NLP (Zhai et al., 2018, 2020).
We show that MT evaluation metrics such as BLEU and BLEURT are not effective for literary MT. In fact, we discover that two of our tested metrics (BLEU and the document-level BLONDE) show a preference for Google Translate outputs over reference translations in PAR3. In reality, MT outputs are much worse than reference translations: our human evaluation reveals that professional translators prefer reference translations at a rate of 84%.

While the translators in our study identified overly literal translations and discourse-level errors (e.g., coreference, pronoun consistency) as the main faults of modern MT systems, a monolingual human evaluation comparing human reference translations and MT outputs reveals additional hurdles in readability and fluency. To tackle these issues, we fine-tune GPT-3 (Brown et al., 2020) on an automatic post-editing task in which the model attempts to transform an MT output into a human reference translation. Human translators prefer the post-edited translations at a rate of 69% and also observe a lower incidence of the above errors.
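As a rough illustration of this post-editing setup, the sketch below shows one plausible way to serialize aligned (MT output, human reference) paragraph pairs into the JSONL format used for GPT-3 fine-tuning; the prompt separator, stop token, and file name are illustrative assumptions rather than the authors' exact configuration.

```python
# A minimal sketch, not the authors' exact pipeline: each training record pairs a
# machine-translated paragraph (prompt) with its aligned human reference (completion).
# The "###" separator, " END" stop token, and file name are assumptions.
import json

def make_postedit_record(mt_paragraph: str, human_reference: str) -> dict:
    return {
        "prompt": mt_paragraph.strip() + "\n\n###\n\n",
        "completion": " " + human_reference.strip() + " END",
    }

pairs = [("<Google Translate paragraph>", "<aligned human translation>")]  # placeholders
with open("par3_postedit_train.jsonl", "w", encoding="utf-8") as f:
    for mt, ref in pairs:
        f.write(json.dumps(make_postedit_record(mt, ref), ensure_ascii=False) + "\n")
```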
Overall, we identify critical roadblocks in evaluation that hinder meaningful progress in literary MT, and we also show through expert human evaluations that pretrained language models can improve the quality of existing MT systems in this domain. We release PAR3 to spur more meaningful future research in literary MT.
2 The PAR3 Dataset: Parallel Paragraph-Level Paraphrases

To study literary MT, we collect a dataset of parallel paragraph-level paraphrases (PAR3) from public domain non-English-language (source) novels with their corresponding English translations generated by both humans and Google Translate. PAR3 is a step up in both scale and linguistic diversity compared to prior studies in literary MT, which generally focus on one novel (Toral et al., 2018) or a small set of poems or short stories (Jones and Irvine, 2013). PAR3 contains at least two human translations for every source paragraph (Table 2). In Table 1, we report corpus statistics by the 19 unique source languages[4] represented in PAR3. PAR3 was curated in four stages: selection of source texts, machine translation of source texts, paragraph alignment, and final filtering. This process closely resembles the paraphrase mining methodology described by Barzilay and McKeown (2001); the major distinctions are (1) our collection of literary works, which is 20 times the size of the previous work, (2) our inclusion of the aligned source text to enable translation study, and (3) our alignment at the paragraph, not sentence, level. In this section, we describe the data collection process and disclose choices we made during curation of Version 1 of PAR3. See Section A in the Appendix for more details on the different versions of PAR3.

[3] The Chinese texts in PAR3 were written in Classical Chinese, an archaic and very different form of the language currently used today.
[4] Languages in PAR3 represent different language families (Romance, Germanic, Slavic, Japonic, Sino-Tibetan, Iranian, Dravidian, Ugric, and Bantu), with different morphological traits (synthetic, fusional, agglutinative), and use different writing systems (Latin alphabet, Cyrillic alphabet, Bengali script, Persian alphabet, Tamil script, Hanzi, and Kanji/Hiragana/Katakana).

Src lang         #texts  #src paras  sents/para
French (fr)          32      50,070         2.7
Russian (ru)         27      36,117         3.3
German (de)          16       9,170         4.3
Spanish (es)          1       3,279         2.0
Czech (cs)            4       2,930         3.0
Norwegian (no)        2       2,655         3.4
Swedish (sv)          3       2,620         3.2
Portuguese (pt)       4       2,288         3.7
Italian (it)          2       1,931         2.6
Japanese (ja)         9       1,857         4.4
Bengali (bn)          2       1,499         3.3
Tamil (ta)            1       1,489         3.1
Danish (da)           1       1,384         3.6
Chinese (zh)[3]       7       1,320         8.8
Dutch (nl)            1         963         3.4
Hungarian (hu)        1         892         3.7
Polish (pl)           1         399         3.9
Sesotho (st)          1         374         4.2
Persian (fa)          1         148         4.2
All                 118     121,385         3.2

Table 1: Corpus statistics for Version 2 of PAR3 by each of the 19 source languages. The average number of sentences per paragraph refers only to the English human and Google translations of the source paragraphs. We did not count tokens or sentences for source paragraphs because of the lack of a reliable tokenizer and sentence segmenter for all source languages.
2.1 Selecting works of literature

For a source text to be included in PAR3, it must be (1) a literary work that has entered the public domain of its country of publication by 2022, with (2) a published electronic version, along with (3) multiple versions of human-written English translations. The first requirement skews our corpus towards older works of fiction. The second requirement ensures the preservation of the source texts' paragraph breaks. The third requirement limits us to texts that had achieved enough mainstream popularity to warrant (re)translations in English. Our most recently published source text, The Book of Disquietude, was published posthumously in 1982, 47 years after the author's death. The oldest source text in our dataset, Romance of the Three Kingdoms, was written in the 14th century. The full list of literary works with source language, author information, and publication year is available in Table 5 in the Appendix.

SRC (ru): — Извините меня: я, увидевши издали, как вы вошли в лавку, решился вас побеспокоить. Если вам будет после свободно и по дороге мимо моего дома, так сделайте милость, зайдите на малость времени. Мне с вами нужно будет переговорить.

GTr: “Excuse me; seeing from a distance how you entered the shop, I decided to disturb you. If you will be free after and on the way past my house, so do yourself a favour, stop by for a little time. I will need to speak with you.”

HUM1: “Pardon me, I saw you from a distance going into the shop and ventured to disturb you. If you will be free in a little while and will be passing by my house, do me the favour to come in for a few minutes. I want to have a talk with you.”

HUM2: “I saw you enter the shop,” he said, “and therefore followed you, for I have something important for your ear. Could you spare me a minute or two?”

HUM3: ‘Excuse me: I saw you from far off going into the shop, and decided to trouble you. If you’re free afterwards and my house is not out of your way, kindly stop by for a short while. I must have a talk with you.’

SRC (st): Ho bile jwalo ho fela ha Chaka, mora wa Senzangakhona. Mazulu le kajeno a bokajeno ha a hopola kamoo a kileng ya eba batho kateng, mehleng ya Chaka, kamoo ditjhaba di neng di jela kgwebeleng ke ho ba tshoha, leha ba hopola borena ba bona bo weleng, eba ba sekisa mahlong, ba re: "Di a bela, di a hlweba! Madiba ho pjha a maholo!"

GTr: Such was the end of Chaka, son of Senzangakhona. The Zulus of today when they remember how they once became people, in the days of Chaka, how the nations ate in the sun because of fear of them, even when they remember their fallen kingdom, they wince in their eyes, saying: "They’re boiling, they’re boiling! The springs are big!"

HUM1: So it came about, the end of Chaka, son of Senzangakhona. Even to this very day the Zulus, when they think how they were once a strong nation in the days of Chaka, and how other nations dreaded them so much that they could hardly swallow their food, and when they remember their kingdom which has fallen, tears well up in their eyes, and they say: “They ferment, they curdle! Even great pools dry away!”

HUM2: And this was the last of Chaka, the son of Senzangakona. Even to-day the Mazulu remember how that they were men once, in the time of Chaka, and how the tribes in fear and trembling came to them for protection. And when they think of their lost empire the tears pour down their cheeks and they say: ‘Kingdoms wax and wane. Springs that once were mighty dry away.’

Table 2: Two example source paragraphs in PAR3, from Nikolai Gogol’s Dead Souls (upper example) and from Thomas Mofolo’s Chaka (lower example), with their corresponding Google translations into English and aligned paragraphs from human-written translations.
2.2 Translating works using Google Translate

Before being fed to Google Translate, the data was preprocessed to convert ebooks into lists of plain-text paragraphs and to remove tables of contents, translator notes, and text-specific artifacts.[5] Each paragraph was passed to the default model of the Google Translate API between April 20 and April 27, 2022. The total cost of source text translation was about 900 USD.[6]

[5] From Japanese texts, we removed artifacts of furigana, a reading aid placed above difficult Japanese characters in order to help readers unfamiliar with higher-level ideograms.
[6] Google charges 20 USD per 1M characters of translation.
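For concreteness, the sketch below shows one way such a translation pass could be scripted; the choice of the google-cloud-translate (Basic/v2) Python client, the per-paragraph loop, and the cost estimate are assumptions layered on the details reported above, not the authors' actual script.

```python
# A hedged sketch of the translation pass, assuming the google-cloud-translate
# (Basic/v2) Python client and pre-configured credentials; not the authors' script.
from google.cloud import translate_v2 as translate

def translate_paragraphs(paragraphs, target="en"):
    client = translate.Client()
    translated = []
    for para in paragraphs:
        # format_="text" returns plain text rather than HTML-escaped output.
        result = client.translate(para, target_language=target, format_="text")
        translated.append(result["translatedText"])
    return translated

def estimate_cost_usd(paragraphs):
    # Google charges 20 USD per 1M characters (footnote 6 above).
    return sum(len(p) for p in paragraphs) / 1_000_000 * 20
```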
2.3 Aligning paragraphs

All English translations, both human and Google Translate-generated, were separated into sentences using spaCy's Sentencizer.[7] The sentences of each human translation were aligned to the sentences of the Google translation of the corresponding source text using the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) for global alignment. Since this algorithm requires scores between each pair of human-Google sentences, we compute scores using the embedding-based SIM measure developed by Wieting et al. (2019), which performs well on semantic textual similarity (STS) benchmarks (Agirre et al., 2016). Final paragraph-level alignments were computed using the paragraph segmentations in the original source text.

[7] https://spacy.io/usage/linguistic-features#sbd
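The sketch below illustrates this alignment step: sentences are produced with spaCy's rule-based Sentencizer, and a Needleman-Wunsch global alignment is computed over pairwise sentence similarities. The `sim` argument is a placeholder for the SIM measure of Wieting et al. (2019), and the gap penalty is an assumed value rather than one reported in the paper.

```python
# A minimal sketch of the sentence-alignment step, not the authors' exact code.
import numpy as np
import spacy

# Sentence segmentation with spaCy's rule-based Sentencizer.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def sentences(paragraph: str):
    return [s.text.strip() for s in nlp(paragraph).sents]

def needleman_wunsch(human_sents, gtr_sents, sim, gap=-0.5):
    """Globally align two sentence lists, maximizing total pairwise similarity.
    `sim(a, b)` stands in for the embedding-based SIM measure; `gap` is an assumed penalty."""
    n, m = len(human_sents), len(gtr_sents)
    score = np.zeros((n + 1, m + 1))
    score[:, 0] = np.arange(n + 1) * gap
    score[0, :] = np.arange(m + 1) * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i, j] = max(
                score[i - 1, j - 1] + sim(human_sents[i - 1], gtr_sents[j - 1]),  # match
                score[i - 1, j] + gap,   # gap in the Google translation
                score[i, j - 1] + gap,   # gap in the human translation
            )
    # Traceback: recover aligned sentence pairs (gaps leave sentences unpaired).
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if np.isclose(score[i, j], score[i - 1, j - 1] + sim(human_sents[i - 1], gtr_sents[j - 1])):
            pairs.append((human_sents[i - 1], gtr_sents[j - 1]))
            i, j = i - 1, j - 1
        elif np.isclose(score[i, j], score[i - 1, j] + gap):
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))
```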
2.4 Post-processing and filtering

We considered an alignment to be "short" if any English paragraph in it, human- or Google-generated, contained fewer than 4 tokens or 20 characters. We discarded any alignments that were "short" and contained the word "chapter" or a Roman numeral, as these were overwhelmingly chapter titles. We also discarded any alignments where one English paragraph contained more than 3 times as many words as another, reasoning that these were actually misalignments, as well as any alignments with a BLEU score of less than 5. Alignments were sampled for the final version of PAR3 such that no more than 50% of the paragraphs from any human translation were included. Finally, alignments for each source text were shuffled at the paragraph level to prevent reconstruction of the human translations, which may not be in the public domain.
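A compact sketch of these filtering heuristics appears below; the whitespace tokenization, the Roman-numeral pattern, and the choice to compute BLEU between the Google paragraph and the human paragraphs are assumptions about details the paper does not spell out.

```python
# A hedged sketch of the post-processing filters, not the authors' exact rules.
import re
import sacrebleu

ROMAN_NUMERAL = re.compile(r"\b[IVXLCDM]+\b")

def is_short(paragraph: str) -> bool:
    # "Short": fewer than 4 (whitespace) tokens or 20 characters.
    return len(paragraph.split()) < 4 or len(paragraph) < 20

def keep_alignment(human_paragraphs, google_paragraph) -> bool:
    paragraphs = list(human_paragraphs) + [google_paragraph]
    # Discard likely chapter titles: short paragraphs with "chapter" or a Roman numeral.
    if any(is_short(p) and ("chapter" in p.lower() or ROMAN_NUMERAL.search(p)) for p in paragraphs):
        return False
    # Discard likely misalignments: one paragraph over 3x the word count of another.
    lengths = [max(1, len(p.split())) for p in paragraphs]
    if max(lengths) > 3 * min(lengths):
        return False
    # Discard alignments with BLEU below 5 (assumed: Google output scored against the human paragraphs).
    if sacrebleu.sentence_bleu(google_paragraph, list(human_paragraphs)).score < 5:
        return False
    return True
```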
2.5 Train, test, and validation splits

Instead of randomly creating splits of the 121K paragraphs in PAR3, we define train, test, and validation splits at the document level. Each literary text belongs to one split, and all translations associated with its source paragraphs belong to that split as well. This decision allows us to better test the generalization ability of systems trained on PAR3, and avoid cases where an MT model memorizes entities or stylistic patterns located within a particular book to artificially inflate its evaluation scores. The training split contains around 80% of the total number of source paragraphs (97,611), the test split contains around 10% (11,606), and the validation split contains around 10% (11,606). Table 5 in the Appendix shows the texts belonging to each split.
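As an illustration of this document-level protocol, the sketch below assigns whole books to splits until the roughly 80/10/10 paragraph proportions are reached; the greedy assignment and fixed random seed are assumptions, not the authors' exact procedure.

```python
# A minimal sketch of document-level splitting: every paragraph of a book lands in
# the same split. The 80/10/10 target matches the paper; the greedy assignment
# and seed are assumptions.
import random

def split_by_book(paragraphs_by_book, ratios=(("train", 0.8), ("valid", 0.1), ("test", 0.1)), seed=0):
    books = sorted(paragraphs_by_book)
    random.Random(seed).shuffle(books)
    total = sum(len(v) for v in paragraphs_by_book.values())
    splits = {name: [] for name, _ in ratios}
    counts = {name: 0 for name, _ in ratios}
    targets = dict(ratios)
    for book in books:
        # Send each whole book to the split currently furthest below its target share.
        name = min(splits, key=lambda k: counts[k] / total - targets[k])
        splits[name].extend(paragraphs_by_book[book])
        counts[name] += len(paragraphs_by_book[book])
    return splits
```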
3 How good are existing MT systems for literary translation?

Armed with our PAR3 dataset, we next turn to evaluating the ability of commercial-grade MT systems to perform literary translation. First, we describe a study in which we hired both professional literary translators and monolingual English experts to compare reference translations to those produced by Google Translate at the paragraph level. In an A/B test, the translators showed a strong preference (on 84% of examples) for human-written translations, finding MT output to be far too literal and riddled with discourse-level errors (e.g., pronoun consistency or contextual word sense issues). The monolingual raters preferred the human-written translations over the Google Translate outputs 85% of the time, suggesting that discourse-level errors made by MT systems are prevalent and noticeable when the MT outputs are evaluated independently of the source texts. Finally, we address deficiencies in existing automatic MT evaluation metrics, including BLEU, BLEURT, and the document-level BLONDE metric. These metrics failed to distinguish human from machine translation, even preferring the MT outputs on average.
3.1 Diagnosing literary MT with judgments from expert translators

As literary MT is understudied (especially at the document level), it is unclear how state-of-the-art MT systems perform on this task and what systematic errors they make. To shed light on this issue, we hire human experts (both monolingual English experts as well as literary translators fluent in both languages) to perform A/B tests on PAR3, indicating their preference for a Google Translate output paragraph (GTr) versus a reference translation written by a human (HUM). We additionally solicit detailed free-form comments for each example explaining the raters' justifications. We find that both monolingual raters and literary translators strongly prefer HUM over GTr paragraphs, noting that overly literal translation and discourse errors are the main error sources with GTr.

Experimental setup: We administer A/B tests to two sets of raters: (1) monolingual English experts (e.g., creative writers or copy editors), and (2) professional literary translators. For the latter group, we first provided a source paragraph in German, French, or Russian. Under the source paragraph, we showed two English translations of the source paragraph: one produced by Google Translate and one from a published, human-written translation.[8] We asked each rater to choose the "better" translation and also to give written justification for their choice (2-3 sentences). While all raters knew that the texts were translations, they did NOT know that one paragraph was machine-generated. Each translator completed 50 tasks in their language of expertise. For the monolingual task, the setup was similar except for two important distinctions: (1) NO source paragraph was provided and (2) each monolingual rater rated all 150 examples (50 from each of the 3 language-specific tasks). Tasks were designed and administered via Label Studio,[9] an open-source data-labeling tool, and raters[10] were hired using Upwork, an online platform for freelancers.[11] For the completion of 50 language-specific tasks, translators were paid $200 each. For the set of 150 monolingual tasks, raters were paid $250 each. All raters were given at least 4 days to complete their tasks.

[8] Each English paragraph was 130-180 words long.
[9] https://labelstud.io/
[10] For the language-specific task, raters were required to be professional literary translators with experience translating German, French, or Russian to English. We hired one translator for each language. For the monolingual task, we hired three raters with extensive experience in creative writing, copy-editing, or English literature.
[11] https://www.upwork.com/
Common MT errors: We roughly categorize the errors highlighted by the professional literary translators into five groups. The most pervasive error (constituting nearly half of all translation errors identified) is the overly literal translation of the source text, where a translator adheres too closely to the syntax of the source language, resulting in awkward phrasing or the mistranslation of idioms. The second most prevalent errors are discourse errors, such as pronoun inconsistency or coreference issues, which occur when context is ignored; these errors are exacerbated at the paragraph and document levels. We define the rest of the categories and report their distribution in Table 3.
Monolingual vs translator ratings: Though the source text is essential to the practice of translation, the monolingual setting of our A/B testing allows us to identify attributes other than translation errors that distinguish the MT system outputs from human-written text. Both monolingual and bilingual raters strongly preferred HUM to GTr across all three tested languages,[12] as shown in Figure 1, although this preference was weaker for the Russian examples. In a case where all 3 monolingual raters chose HUM while the translator chose GTr, their comments reveal that the monolingual raters prioritized clarity and readability:

    [HUM] "is preferable because it flows better and makes better sense" and "made complete sense and was much easier to read"

while the translator diagnosed HUM with a catastrophic error:

    "[HUM] contains several mistakes, mainly small omissions that change the meaning of the sentence, but also wrong translations ('trained European chef' instead of 'European-educated chef')."

For an example where all 3 monolingual raters chose [GTr] while the translator chose [HUM], the monolingual raters much preferred the contemporary language in [GTr]:

    [GTr] was "much easier for me to grasp because of its structure compared to the similar sentence in [HUM]" and praised for its "use of commonplace vocabulary that is understandable to the reader."

However, the translator, with access to the source text, identified a precision error in GTr and ultimately declared HUM to be the better translation:

    "lord" from [HUM] is the exact translation of the Russian бари, while "bard" from [GTr] doesn't convey a necessary meaning.[13]

[12] We report Krippendorff's alpha (Krippendorff, 2011) as the measure of inter-annotator agreement (IAA). The IAA between the monolingual raters was 0.546 (0.437 for Russian, 0.494 for German, and 0.707 for French). The IAA between the aggregated votes of monolingual raters (majority vote) and the translator was 0.524 for Russian, 0.683 for German, and 0.681 for French. These numbers suggest moderate to substantial agreement (Artstein and Poesio, 2008).
[13] To view the SRC, HUM, and GTr texts for these examples, see Tables 13 and 14 in the Appendix.
Figure 1: The percentage of cases in which raters preferred the human-written translation to the Google translation, by source language. Note that the value for monolingual raters is the average of the percentages for the 3 monolingual raters.
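For reference, the agreement figures in footnote 12 can be computed with the open-source `krippendorff` package; the sketch below uses an illustrative 0/1 coding of the HUM-vs-GTr choices and a toy ratings matrix, not the study's actual annotations.

```python
# A hedged sketch of the IAA computation; the ratings matrix here is illustrative.
import numpy as np
import krippendorff

# Rows = raters, columns = A/B items; 1 = preferred HUM, 0 = preferred GTr,
# np.nan = item not rated by that rater.
ratings = np.array([
    [1, 1, 0, 1, np.nan],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 1, 1],
])
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```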
3.2 Can automatic MT metrics evaluate literary translation?

Expert human evaluation, while insightful, is also time-consuming and expensive, which precludes its use in most model development scenarios. The MT community thus relies extensively on automatic metrics that score candidate translations against references. In this section, we explore the usage of three metrics (BLEU, BLEURT, and BLONDE) for literary MT evaluation, and we discover that none of them can accurately distinguish GTr text from HUM. Regardless of their performance, we also note that most automatic metrics are designed to work with sentence-level alignments, which are rarely available for literary translations because translators merge and combine sentences. Thus, developing domain-specific evaluation metrics is crucial to making meaningful progress in literary MT.

MT Metrics: To study the ability of MT metrics to distinguish between machine and human translations, we compute three metrics on PAR3:

BLEU (Papineni et al., 2002)[14] is a string-based multi-reference metric originally proposed to evaluate sentence-level translations but also used for document-level MT (Liu et al., 2020).

BLEURT (Sellam et al., 2020) is a pretrained language model fine-tuned on human judgments of translation quality.

[14] We compute the default, case-sensitive implementation of BLEU from https://github.com/mjpost/sacrebleu.
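As a concrete reference point for how such scores are produced, the sketch below computes multi-reference BLEU with sacreBLEU and single-reference BLEURT scores at the paragraph level; the placeholder strings and the BLEURT checkpoint path are assumptions, and this is not the authors' evaluation script.

```python
# A minimal sketch of paragraph-level metric computation on PAR3-style data.
import sacrebleu
from bleurt import score as bleurt_score

gtr = ["<Google Translate paragraph>"]        # candidate translations
refs_a = ["<first human translation>"]        # reference stream 1
refs_b = ["<second human translation>"]       # reference stream 2

# Multi-reference corpus BLEU (default, case-sensitive sacreBLEU; see footnote 14).
bleu = sacrebleu.corpus_bleu(gtr, [refs_a, refs_b])
print("BLEU:", bleu.score)

# BLEURT scores each candidate against a single reference.
scorer = bleurt_score.BleurtScorer("/path/to/BLEURT-20")  # assumed local checkpoint
scores = scorer.score(references=refs_a, candidates=gtr)
print("BLEURT:", sum(scores) / len(scores))
```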