Exploring Document-Level Literary Machine Translation
with Parallel Paragraphs from World Literature

Katherine Thai*, Marzena Karpinska*, Kalpesh Krishna, William Ray,
Moira Inghilleri, John Wieting, Mohit Iyyer

Manning College of Information and Computer Sciences, UMass Amherst
Department of Languages, Literatures, and Cultures, UMass Amherst
Google Research
{kbthai,mkarpinska,kalpesh,miyyer}@cs.umass.edu
minghilleri@complit.umass.edu, jwieting@google.com

arXiv:2210.14250v1 [cs.CL] 25 Oct 2022

*Authors contributed equally.
Abstract

Literary translation is a culturally significant task, but it is bottlenecked by the small number of qualified literary translators relative to the many untranslated works published around the world. Machine translation (MT) holds potential to complement the work of human translators by improving both training procedures and their overall efficiency. Literary translation is less constrained than more traditional MT settings since translators must balance meaning equivalence, readability, and critical interpretability in the target language. This property, along with the complex discourse-level context present in literary texts, also makes literary MT more challenging to computationally model and evaluate. To explore this task, we collect a dataset (PAR3) of non-English language novels in the public domain, each aligned at the paragraph level to both human and automatic English translations. Using PAR3, we discover that expert literary translators prefer reference human translations over machine-translated paragraphs at a rate of 84%, while state-of-the-art automatic MT metrics do not correlate with those preferences. The experts note that MT outputs contain not only mistranslations, but also discourse-disrupting errors and stylistic inconsistencies. To address these problems, we train a post-editing model whose output is preferred over normal MT output at a rate of 69% by experts. We publicly release PAR3 to spur future research into literary MT.[1]

[1] https://github.com/katherinethai/par3/
1 Introduction

While the quality of machine translation (MT) systems has greatly improved with recent advances in modeling and dataset collection, the application of these new technologies to the task of automatically translating literary text (e.g., novels, short stories) has remained limited to small-scale studies (Genzel et al., 2010; Jones and Irvine, 2013; Toral et al., 2018). Translating literary works differs from translating standard MT corpora (e.g., news articles or parliamentary proceedings) in several key ways. For one, it is much more difficult to evaluate. The techniques[2] used by literary translators differ fundamentally from those applied in more standard MT domains (see Table 8 in the Appendix). Literary translators have the freedom (or burden) of both semantic and critical interpretation, as they must solve the problem of equivalence, often beyond the word level (Neubert, 1983; Baker, 2018; Baker and Saldanha, 2021). The task of conveying an author's ideas highlights yet another difference between literary and traditional MT: document-level context is especially critical for the literary domain due to the presence of complex discourse structure, rendering the typical sentence-level MT pipeline insufficient for this task (Voigt and Jurafsky, 2012; Taivalkoski-Shilov, 2019).

In this work, we seek to understand how both state-of-the-art MT systems and MT evaluation metrics fail in the literary domain, and we also leverage large pretrained language models to improve literary MT. To facilitate our experiments, we introduce PAR3, a large-scale dataset to study paragraph-level literary translation into English. PAR3 consists of 121K paragraphs taken from 118 novels originally written in a non-English language, where each paragraph is aligned to multiple human-written English translations of that paragraph as well as a machine-translated paragraph produced by Google Translate (see Table 2).

[2] Many terms have been employed by translation scholars to refer to various operations used by translators (Chesterman, 2005). Here, we employ the term "techniques" argued for by Molina and Hurtado Albir (2004) and recently used in the field of NLP (Zhai et al., 2018, 2020).
We show that MT evaluation metrics such as BLEU and BLEURT are not effective for literary MT. In fact, we discover that two of our tested metrics (BLEU and the document-level BLONDE) show a preference for Google Translate outputs over reference translations in PAR3. In reality, MT outputs are much worse than reference translations: our human evaluation reveals that professional translators prefer reference translations at a rate of 84%.

While the translators in our study identified overly literal translations and discourse-level errors (e.g., coreference, pronoun consistency) as the main faults of modern MT systems, a monolingual human evaluation comparing human reference translations and MT outputs reveals additional hurdles in readability and fluency. To tackle these issues, we fine-tune GPT-3 (Brown et al., 2020) on an automatic post-editing task in which the model attempts to transform an MT output into a human reference translation. Human translators prefer the post-edited translations at a rate of 69% and also observe a lower incidence of the above errors.
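As a rough illustration of this post-editing setup, the sketch below shows one plausible way to serialize aligned (MT output, human reference) paragraph pairs into the JSONL format used for GPT-3 fine-tuning; the prompt separator, stop token, and file name are illustrative assumptions rather than the authors' exact configuration.

```python
# A minimal sketch, not the authors' exact pipeline: each training record pairs a
# machine-translated paragraph (prompt) with its aligned human reference (completion).
# The "###" separator, " END" stop token, and file name are assumptions.
import json

def make_postedit_record(mt_paragraph: str, human_reference: str) -> dict:
    return {
        "prompt": mt_paragraph.strip() + "\n\n###\n\n",
        "completion": " " + human_reference.strip() + " END",
    }

pairs = [("<Google Translate paragraph>", "<aligned human translation>")]  # placeholders
with open("par3_postedit_train.jsonl", "w", encoding="utf-8") as f:
    for mt, ref in pairs:
        f.write(json.dumps(make_postedit_record(mt, ref), ensure_ascii=False) + "\n")
```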
Overall, we identify critical roadblocks in evaluation that hinder meaningful progress in literary MT, and we also show through expert human evaluations that pretrained language models can improve the quality of existing MT systems in this domain. We release PAR3 to spur more meaningful future research in literary MT.
2 The PAR3 Dataset: Parallel Paragraph-Level Paraphrases

To study literary MT, we collect a dataset of parallel paragraph-level paraphrases (PAR3) from public domain non-English-language (source) novels with their corresponding English translations generated by both humans and Google Translate. PAR3 is a step up in both scale and linguistic diversity compared to prior studies in literary MT, which generally focus on one novel (Toral et al., 2018) or a small set of poems or short stories (Jones and Irvine, 2013). PAR3 contains at least two human translations for every source paragraph (Table 2). In Table 1, we report corpus statistics by the 19 unique source languages[4] represented in PAR3. PAR3 was curated in four stages: selection of source texts, machine translation of source texts, paragraph alignment, and final filtering. This process closely resembles the paraphrase mining methodology described by Barzilay and McKeown (2001); the major distinctions are (1) our collection of literary works, which is 20 times the size of the previous work, (2) our inclusion of the aligned source text to enable translation study, and (3) our alignment at the paragraph, not sentence, level. In this section, we describe the data collection process and disclose choices we made during curation of Version 1 of PAR3. See Section A in the Appendix for more details on the different versions of PAR3.

[3] The Chinese texts in PAR3 were written in Classical Chinese, an archaic and very different form of the language currently used today.
[4] Languages in PAR3 represent different language families (Romance, Germanic, Slavic, Japonic, Sino-Tibetan, Iranian, Dravidian, Ugric, and Bantu), with different morphological traits (synthetic, fusional, agglutinative), and use different writing systems (Latin alphabet, Cyrillic alphabet, Bengali script, Persian alphabet, Tamil script, Hanzi, and Kanji/Hiragana/Katakana).

Src lang         #texts  #src paras  sents/para
French (fr)          32      50,070         2.7
Russian (ru)         27      36,117         3.3
German (de)          16       9,170         4.3
Spanish (es)          1       3,279         2.0
Czech (cs)            4       2,930         3.0
Norwegian (no)        2       2,655         3.4
Swedish (sv)          3       2,620         3.2
Portuguese (pt)       4       2,288         3.7
Italian (it)          2       1,931         2.6
Japanese (ja)         9       1,857         4.4
Bengali (bn)          2       1,499         3.3
Tamil (ta)            1       1,489         3.1
Danish (da)           1       1,384         3.6
Chinese (zh)[3]       7       1,320         8.8
Dutch (nl)            1         963         3.4
Hungarian (hu)        1         892         3.7
Polish (pl)           1         399         3.9
Sesotho (st)          1         374         4.2
Persian (fa)          1         148         4.2
All                 118     121,385         3.2

Table 1: Corpus statistics for Version 2 of PAR3 by each of the 19 source languages. The average number of sentences per paragraph refers only to the English human and Google translations of the source paragraphs. We did not count tokens or sentences for source paragraphs because of the lack of a reliable tokenizer and sentence segmenter for all source languages.
2.1 Selecting works of literature

For a source text to be included in PAR3, it must be (1) a literary work that has entered the public domain of its country of publication by 2022, with (2) a published electronic version, along with (3) multiple versions of human-written English translations. The first requirement skews our corpus towards older works of fiction. The second requirement ensures the preservation of the source texts' paragraph breaks. The third requirement limits us to texts that had achieved enough mainstream popularity to warrant (re)translations in English. Our most recently published source text, The Book of Disquietude, was published posthumously in 1982, 47 years after the author's death. The oldest source text in our dataset, Romance of the Three Kingdoms, was written in the 14th century. The full list of literary works with source language, author information, and publication year is available in Table 5 in the Appendix.

SRC (ru): — Извините меня: я, увидевши издали, как вы вошли в лавку, решился вас побеспокоить. Если вам будет после свободно и по дороге мимо моего дома, так сделайте милость, зайдите на малость времени. Мне с вами нужно будет переговорить.

GTr: “Excuse me; seeing from a distance how you entered the shop, I decided to disturb you. If you will be free after and on the way past my house, so do yourself a favour, stop by for a little time. I will need to speak with you.”

HUM1: “Pardon me, I saw you from a distance going into the shop and ventured to disturb you. If you will be free in a little while and will be passing by my house, do me the favour to come in for a few minutes. I want to have a talk with you.”

HUM2: “I saw you enter the shop,” he said, “and therefore followed you, for I have something important for your ear. Could you spare me a minute or two?”

HUM3: ‘Excuse me: I saw you from far off going into the shop, and decided to trouble you. If you’re free afterwards and my house is not out of your way, kindly stop by for a short while. I must have a talk with you.’

SRC (st): Ho bile jwalo ho fela ha Chaka, mora wa Senzangakhona. Mazulu le kajeno a bokajeno ha a hopola kamoo a kileng ya eba batho kateng, mehleng ya Chaka, kamoo ditjhaba di neng di jela kgwebeleng ke ho ba tshoha, leha ba hopola borena ba bona bo weleng, eba ba sekisa mahlong, ba re: "Di a bela, di a hlweba! Madiba ho pjha a maholo!"

GTr: Such was the end of Chaka, son of Senzangakhona. The Zulus of today when they remember how they once became people, in the days of Chaka, how the nations ate in the sun because of fear of them, even when they remember their fallen kingdom, they wince in their eyes, saying: "They’re boiling, they’re boiling! The springs are big!"

HUM1: So it came about, the end of Chaka, son of Senzangakhona. Even to this very day the Zulus, when they think how they were once a strong nation in the days of Chaka, and how other nations dreaded them so much that they could hardly swallow their food, and when they remember their kingdom which has fallen, tears well up in their eyes, and they say: “They ferment, they curdle! Even great pools dry away!”

HUM2: And this was the last of Chaka, the son of Senzangakona. Even to-day the Mazulu remember how that they were men once, in the time of Chaka, and how the tribes in fear and trembling came to them for protection. And when they think of their lost empire the tears pour down their cheeks and they say: ‘Kingdoms wax and wane. Springs that once were mighty dry away.’

Table 2: Two example source paragraphs in PAR3, from Nikolai Gogol’s Dead Souls (upper example) and from Thomas Mofolo’s Chaka (lower example), with their corresponding Google translations into English and aligned paragraphs from human-written translations.
2.2 Translating works using Google Translate

Before being fed to Google Translate, the data was preprocessed to convert ebooks into lists of plain-text paragraphs and to remove tables of contents, translator notes, and text-specific artifacts.[5] Each paragraph was passed to the default model of the Google Translate API between April 20 and April 27, 2022. The total cost of source text translation was about 900 USD.[6]

[5] From Japanese texts, we removed artifacts of furigana, a reading aid placed above difficult Japanese characters in order to help readers unfamiliar with higher-level ideograms.
[6] Google charges 20 USD per 1M characters of translation.
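For concreteness, the sketch below shows one way such a translation pass could be scripted; the choice of the google-cloud-translate (Basic/v2) Python client, the per-paragraph loop, and the cost estimate are assumptions layered on the details reported above, not the authors' actual script.

```python
# A hedged sketch of the translation pass, assuming the google-cloud-translate
# (Basic/v2) Python client and pre-configured credentials; not the authors' script.
from google.cloud import translate_v2 as translate

def translate_paragraphs(paragraphs, target="en"):
    client = translate.Client()
    translated = []
    for para in paragraphs:
        # format_="text" returns plain text rather than HTML-escaped output.
        result = client.translate(para, target_language=target, format_="text")
        translated.append(result["translatedText"])
    return translated

def estimate_cost_usd(paragraphs):
    # Google charges 20 USD per 1M characters (footnote 6 above).
    return sum(len(p) for p in paragraphs) / 1_000_000 * 20
```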
2.3 Aligning paragraphs

All English translations, both human and Google Translate-generated, were separated into sentences using spaCy's Sentencizer.[7] The sentences of each human translation were aligned to the sentences of the Google translation of the corresponding source text using the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) for global alignment. Since this algorithm requires scores between each pair of human-Google sentences, we compute scores using the embedding-based SIM measure developed by Wieting et al. (2019), which performs well on semantic textual similarity (STS) benchmarks (Agirre et al., 2016). Final paragraph-level alignments were computed using the paragraph segmentations in the original source text.

[7] https://spacy.io/usage/linguistic-features#sbd
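The sketch below illustrates this alignment step: sentences are produced with spaCy's rule-based Sentencizer, and a Needleman-Wunsch global alignment is computed over pairwise sentence similarities. The `sim` argument is a placeholder for the SIM measure of Wieting et al. (2019), and the gap penalty is an assumed value rather than one reported in the paper.

```python
# A minimal sketch of the sentence-alignment step, not the authors' exact code.
import numpy as np
import spacy

# Sentence segmentation with spaCy's rule-based Sentencizer.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def sentences(paragraph: str):
    return [s.text.strip() for s in nlp(paragraph).sents]

def needleman_wunsch(human_sents, gtr_sents, sim, gap=-0.5):
    """Globally align two sentence lists, maximizing total pairwise similarity.
    `sim(a, b)` stands in for the embedding-based SIM measure; `gap` is an assumed penalty."""
    n, m = len(human_sents), len(gtr_sents)
    score = np.zeros((n + 1, m + 1))
    score[:, 0] = np.arange(n + 1) * gap
    score[0, :] = np.arange(m + 1) * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i, j] = max(
                score[i - 1, j - 1] + sim(human_sents[i - 1], gtr_sents[j - 1]),  # match
                score[i - 1, j] + gap,   # gap in the Google translation
                score[i, j - 1] + gap,   # gap in the human translation
            )
    # Traceback: recover aligned sentence pairs (gaps leave sentences unpaired).
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if np.isclose(score[i, j], score[i - 1, j - 1] + sim(human_sents[i - 1], gtr_sents[j - 1])):
            pairs.append((human_sents[i - 1], gtr_sents[j - 1]))
            i, j = i - 1, j - 1
        elif np.isclose(score[i, j], score[i - 1, j] + gap):
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))
```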
2.4 Post-processing and filtering

We considered an alignment to be "short" if any English paragraph in it, human- or Google-generated, contained fewer than 4 tokens or 20 characters. We discarded any alignments that were "short" and contained the word "chapter" or a Roman numeral, as these were overwhelmingly chapter titles. We also discarded any alignments where one English paragraph contained more than 3 times as many words as another, reasoning that these were actually misalignments, as well as any alignments with a BLEU score of less than 5. Alignments were sampled for the final version of PAR3 such that no more than 50% of the paragraphs from any human translation were included. Finally, alignments for each source text were shuffled at the paragraph level to prevent reconstruction of the human translations, which may not be in the public domain.
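A compact sketch of these filtering heuristics appears below; the whitespace tokenization, the Roman-numeral pattern, and the choice to compute BLEU between the Google paragraph and the human paragraphs are assumptions about details the paper does not spell out.

```python
# A hedged sketch of the post-processing filters, not the authors' exact rules.
import re
import sacrebleu

ROMAN_NUMERAL = re.compile(r"\b[IVXLCDM]+\b")

def is_short(paragraph: str) -> bool:
    # "Short": fewer than 4 (whitespace) tokens or 20 characters.
    return len(paragraph.split()) < 4 or len(paragraph) < 20

def keep_alignment(human_paragraphs, google_paragraph) -> bool:
    paragraphs = list(human_paragraphs) + [google_paragraph]
    # Discard likely chapter titles: short paragraphs with "chapter" or a Roman numeral.
    if any(is_short(p) and ("chapter" in p.lower() or ROMAN_NUMERAL.search(p)) for p in paragraphs):
        return False
    # Discard likely misalignments: one paragraph over 3x the word count of another.
    lengths = [max(1, len(p.split())) for p in paragraphs]
    if max(lengths) > 3 * min(lengths):
        return False
    # Discard alignments with BLEU below 5 (assumed: Google output scored against the human paragraphs).
    if sacrebleu.sentence_bleu(google_paragraph, list(human_paragraphs)).score < 5:
        return False
    return True
```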
2.5 Train, test, and validation splits

Instead of randomly creating splits of the 121K paragraphs in PAR3, we define train, test, and validation splits at the document level. Each literary text belongs to one split, and all translations associated with its source paragraphs belong to that split as well. This decision allows us to better test the generalization ability of systems trained on PAR3, and avoid cases where an MT model memorizes entities or stylistic patterns located within a particular book to artificially inflate its evaluation scores. The training split contains around 80% of the total number of source paragraphs (97,611), the test split contains around 10% (11,606), and the validation split contains around 10% (11,606). Table 5 in the Appendix shows the texts belonging to each split.
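As an illustration of this document-level protocol, the sketch below assigns whole books to splits until the roughly 80/10/10 paragraph proportions are reached; the greedy assignment and fixed random seed are assumptions, not the authors' exact procedure.

```python
# A minimal sketch of document-level splitting: every paragraph of a book lands in
# the same split. The 80/10/10 target matches the paper; the greedy assignment
# and seed are assumptions.
import random

def split_by_book(paragraphs_by_book, ratios=(("train", 0.8), ("valid", 0.1), ("test", 0.1)), seed=0):
    books = sorted(paragraphs_by_book)
    random.Random(seed).shuffle(books)
    total = sum(len(v) for v in paragraphs_by_book.values())
    splits = {name: [] for name, _ in ratios}
    counts = {name: 0 for name, _ in ratios}
    targets = dict(ratios)
    for book in books:
        # Send each whole book to the split currently furthest below its target share.
        name = min(splits, key=lambda k: counts[k] / total - targets[k])
        splits[name].extend(paragraphs_by_book[book])
        counts[name] += len(paragraphs_by_book[book])
    return splits
```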
3 How good are existing MT systems for literary translation?

Armed with our PAR3 dataset, we next turn to evaluating the ability of commercial-grade MT systems to perform literary translation. First, we describe a study in which we hired both professional literary translators and monolingual English experts to compare reference translations to those produced by Google Translate at the paragraph level. In an A/B test, the translators showed a strong preference (on 84% of examples) for human-written translations, finding MT output to be far too literal and riddled with discourse-level errors (e.g., pronoun consistency or contextual word sense issues). The monolingual raters preferred the human-written translations over the Google Translate outputs 85% of the time, suggesting that discourse-level errors made by MT systems are prevalent and noticeable when the MT outputs are evaluated independently of the source texts. Finally, we address deficiencies in existing automatic MT evaluation metrics, including BLEU, BLEURT, and the document-level BLONDE metric. These metrics failed to distinguish human from machine translation, even preferring the MT outputs on average.
3.1 Diagnosing literary MT with judgments from expert translators

As literary MT is understudied (especially at the document level), it is unclear how state-of-the-art MT systems perform on this task and what systematic errors they make. To shed light on this issue, we hire human experts (both monolingual English experts as well as literary translators fluent in both languages) to perform A/B tests on PAR3, indicating their preference for a Google Translate output paragraph (GTr) versus a reference translation written by a human (HUM). We additionally solicit detailed free-form comments for each example explaining the raters' justifications. We find that both monolingual raters and literary translators strongly prefer HUM over GTr paragraphs, noting that overly literal translation and discourse errors are the main error sources with GTr.

Experimental setup: We administer A/B tests to two sets of raters: (1) monolingual English experts (e.g., creative writers or copy editors), and (2) professional literary translators. For the latter group, we first provided a source paragraph in German, French, or Russian. Under the source paragraph, we showed two English translations of the source paragraph: one produced by Google Translate and one from a published, human-written translation.[8] We asked each rater to choose the "better" translation and also to give written justification for their choice (2-3 sentences). While all raters knew that the texts were translations, they did NOT know that one paragraph was machine-generated. Each translator completed 50 tasks in their language of expertise. For the monolingual task, the setup was similar except for two important distinctions: (1) NO source paragraph was provided and (2) each monolingual rater rated all 150 examples (50 from each of the 3 language-specific tasks). Tasks were designed and administered via Label Studio,[9] an open-source data-labeling tool, and raters[10] were hired using Upwork, an online platform for freelancers.[11] For the completion of 50 language-specific tasks, translators were paid $200 each. For the set of 150 monolingual tasks, raters were paid $250 each. All raters were given at least 4 days to complete their tasks.

[8] Each English paragraph was 130-180 words long.
[9] https://labelstud.io/
[10] For the language-specific task, raters were required to be professional literary translators with experience translating German, French, or Russian to English. We hired one translator for each language. For the monolingual task, we hired three raters with extensive experience in creative writing, copy-editing, or English literature.
[11] https://www.upwork.com/
Common MT errors: We roughly categorize the errors highlighted by the professional literary translators into five groups. The most pervasive error (constituting nearly half of all translation errors identified) is the overly literal translation of the source text, where a translator adheres too closely to the syntax of the source language, resulting in awkward phrasing or the mistranslation of idioms. The second most prevalent errors are discourse errors, such as pronoun inconsistency or coreference issues, which occur when context is ignored; these errors are exacerbated at the paragraph and document levels. We define the rest of the categories and report their distribution in Table 3.
Monolingual vs translator ratings: Though the source text is essential to the practice of translation, the monolingual setting of our A/B testing allows us to identify attributes other than translation errors that distinguish the MT system outputs from human-written text. Both monolingual and bilingual raters strongly preferred HUM to GTr across all three tested languages,[12] as shown in Figure 1, although this preference was weaker for the Russian examples. In a case where all 3 monolingual raters chose HUM while the translator chose GTr, their comments reveal that the monolingual raters prioritized clarity and readability:

    [HUM] "is preferable because it flows better and makes better sense" and "made complete sense and was much easier to read"

while the translator diagnosed HUM with a catastrophic error:

    "[HUM] contains several mistakes, mainly small omissions that change the meaning of the sentence, but also wrong translations ('trained European chef' instead of 'European-educated chef')."

For an example where all 3 monolingual raters chose [GTr] while the translator chose [HUM], the monolingual raters much preferred the contemporary language in [GTr]:

    [GTr] was "much easier for me to grasp because of its structure compared to the similar sentence in [HUM]" and praised for its "use of commonplace vocabulary that is understandable to the reader."

However, the translator, with access to the source text, identified a precision error in GTr and ultimately declared HUM to be the better translation:

    "lord" from [HUM] is the exact translation of the Russian бари, while "bard" from [GTr] doesn't convey a necessary meaning.[13]

[12] We report Krippendorff's alpha (Krippendorff, 2011) as the measure of inter-annotator agreement (IAA). The IAA between the monolingual raters was 0.546 (0.437 for Russian, 0.494 for German, and 0.707 for French). The IAA between the aggregated votes of monolingual raters (majority vote) and the translator was 0.524 for Russian, 0.683 for German, and 0.681 for French. These numbers suggest moderate to substantial agreement (Artstein and Poesio, 2008).
[13] To view the SRC, HUM, and GTr texts for these examples, see Tables 13 and 14 in the Appendix.
Figure 1: The percentage of cases in which raters preferred the human-written translation to the Google translation, by source language. Note that the value for monolingual raters is the average of the percentages for the 3 monolingual raters.
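For reference, the agreement figures in footnote 12 can be computed with the open-source `krippendorff` package; the sketch below uses an illustrative 0/1 coding of the HUM-vs-GTr choices and a toy ratings matrix, not the study's actual annotations.

```python
# A hedged sketch of the IAA computation; the ratings matrix here is illustrative.
import numpy as np
import krippendorff

# Rows = raters, columns = A/B items; 1 = preferred HUM, 0 = preferred GTr,
# np.nan = item not rated by that rater.
ratings = np.array([
    [1, 1, 0, 1, np.nan],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 1, 1],
])
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```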
3.2 Can automatic MT metrics evaluate literary translation?

Expert human evaluation, while insightful, is also time-consuming and expensive, which precludes its use in most model development scenarios. The MT community thus relies extensively on automatic metrics that score candidate translations against references. In this section, we explore the usage of three metrics (BLEU, BLEURT, and BLONDE) for literary MT evaluation, and we discover that none of them can accurately distinguish GTr text from HUM. Regardless of their performance, we also note that most automatic metrics are designed to work with sentence-level alignments, which are rarely available for literary translations because translators merge and combine sentences. Thus, developing domain-specific evaluation metrics is crucial to making meaningful progress in literary MT.

MT Metrics: To study the ability of MT metrics to distinguish between machine and human translations, we compute three metrics on PAR3:

BLEU (Papineni et al., 2002)[14] is a string-based multi-reference metric originally proposed to evaluate sentence-level translations but also used for document-level MT (Liu et al., 2020).

BLEURT (Sellam et al., 2020) is a pretrained language model fine-tuned on human judgments of translation quality.

[14] We compute the default, case-sensitive implementation of BLEU from https://github.com/mjpost/sacrebleu.
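As a concrete reference point for how such scores are produced, the sketch below computes multi-reference BLEU with sacreBLEU and single-reference BLEURT scores at the paragraph level; the placeholder strings and the BLEURT checkpoint path are assumptions, and this is not the authors' evaluation script.

```python
# A minimal sketch of paragraph-level metric computation on PAR3-style data.
import sacrebleu
from bleurt import score as bleurt_score

gtr = ["<Google Translate paragraph>"]        # candidate translations
refs_a = ["<first human translation>"]        # reference stream 1
refs_b = ["<second human translation>"]       # reference stream 2

# Multi-reference corpus BLEU (default, case-sensitive sacreBLEU; see footnote 14).
bleu = sacrebleu.corpus_bleu(gtr, [refs_a, refs_b])
print("BLEU:", bleu.score)

# BLEURT scores each candidate against a single reference.
scorer = bleurt_score.BleurtScorer("/path/to/BLEURT-20")  # assumed local checkpoint
scores = scorer.score(references=refs_a, candidates=gtr)
print("BLEURT:", sum(scores) / len(scores))
```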