
SRC (ru): — Извините меня: я, увидевши издали, как вы вошли в лавку, решился вас побеспокоить. Если вам будет после свободно и по дороге мимо моего дома, так сделайте милость, зайдите на малость времени. Мне с вами нужно будет переговорить
GTr: “Excuse me; seeing from a distance how you entered the shop, I decided to disturb you. If you will be free after and on the way past my house, so do yourself a favour, stop by for a little time. I will need to speak with you.
HUM1: “Pardon me, I saw you from a distance going into the shop and ventured to disturb you. If you will be free in a little while and will be passing by my house, do me the favour to come in for a few minutes. I want to have a talk with you.”
HUM2: “I saw you enter the shop,” he said, “and therefore followed you, for I have something important for your ear. Could you spare me a minute or two?”
HUM3: ‘Excuse me: I saw you from far off going into the shop, and decided to trouble you. If you’re free afterwards and my house is not out of your way, kindly stop by for a short while. I must have a talk with you.”
SRC (st): Ho bile jwalo ho fela ha Chaka, mora wa Senzangakhona. Mazulu le kajeno a bokajeno ha a hopola kamoo a kileng ya eba batho kateng, mehleng ya Chaka, kamoo ditjhaba di neng di jela kgwebeleng ke ho ba tshoha, leha ba hopola borena ba bona bo weleng, eba ba sekisa mahlong, ba re: "Di a bela, di a hlweba! Madiba ho pjha a maholo!"
GTr: Such was the end of Chaka, son of Senzangakhona. The Zulus of today when they remember how they once became people, in the days of Chaka, how the nations ate in the sun because of fear of them, even when they remember their fallen kingdom, they wince in their eyes, saying: "They’re boiling, they’re boiling! The springs are big!"
HUM1: So it came about, the end of Chaka, son of Senzangakhona. Even to this very day the Zulus, when they think how they were once a strong nation in the days of Chaka, and how other nations dreaded them so much that they could hardly swallow their food, and when they remember their kingdom which has fallen, tears well up in their eyes, and they say: “They ferment, they curdle! Even great pools dry away!”
HUM2: And this was the last of Chaka, the son of Senzangakona. Even to-day the Mazulu remember how that they were men once, in the time of Chaka, and how the tribes in fear and trembling came to them for protection. And when they think of their lost empire the tears pour down their cheeks and they say: ‘Kingdoms wax and wane. Springs that once were mighty dry away.’
Table 2: An example of one source paragraph in PAR3, from Nikolai Gogol’s Dead Souls (upper example) and
from Thomas Mofolo’s Chaka (lower example) with their corresponding Google translation to English and aligned
paragraphs from human-written translations.
to texts that had achieved enough mainstream popularity to warrant (re)translations into English. Our most recently published source text, The Book of Disquietude, was published posthumously in 1982, 47 years after the author’s death. The oldest source text in our dataset, Romance of the Three Kingdoms, was written in the 14th century. The full list of literary works with source language, author information, and publication year is available in Table 5 in the Appendix.
2.2 Translating works using Google Translate
Before being fed to Google Translate, the data was preprocessed to convert ebooks to lists of plain text paragraphs and to remove tables of contents, translator notes, and text-specific artifacts.5 Each paragraph was passed to the default model of the Google Translate API between April 20 and April 27, 2022. The total cost of source text translation was about 900 USD.6
2.3 Aligning paragraphs
All English translations, both human and Google Translate-generated, were separated into sentences using spaCy’s Sentencizer.7 The sentences of each human translation were aligned to the sentences of the Google translation of the corresponding source text using the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) for global alignment. Since this algorithm requires a score for each pair of human-Google sentences, we compute scores using the embedding-based SIM measure developed by Wieting et al. (2019), which performs well on semantic textual similarity (STS) benchmarks (Agirre et al., 2016). Final paragraph-level alignments were computed using the paragraph segmentations in the original source text.

5 From Japanese texts, we removed artifacts of furigana, a reading aid placed above difficult Japanese characters in order to help readers unfamiliar with higher-level ideograms.
6 Google charges 20 USD per 1M characters of translation.
7 https://spacy.io/usage/linguistic-features#sbd
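
The global sentence alignment described in Section 2.3 can be illustrated with a minimal sketch of Needleman-Wunsch over two sentence lists. This is not the pipeline used to build PAR3: the `jaccard` function is a toy token-overlap stand-in for the embedding-based SIM measure, and the gap penalty of -0.1, the function names, and the output format are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[Optional[int], Optional[int]]


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for SIM (assumption): whitespace-token set overlap in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def needleman_wunsch(
    xs: List[str],
    ys: List[str],
    sim: Callable[[str, str], float] = jaccard,
    gap: float = -0.1,  # hypothetical gap penalty, not taken from the paper
) -> List[Pair]:
    """Globally align xs to ys; None marks a sentence left unmatched."""
    n, m = len(xs), len(ys)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[""] * (m + 1) for _ in range(n + 1)]

    # Borders: aligning a prefix against nothing costs one gap per sentence.
    for i in range(1, n + 1):
        score[i][0], move[i][0] = score[i - 1][0] + gap, "up"
    for j in range(1, m + 1):
        score[0][j], move[0][j] = score[0][j - 1] + gap, "left"

    # Fill the table, remembering which move produced each best score.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (score[i - 1][j - 1] + sim(xs[i - 1], ys[j - 1]), "diag"),
                (score[i - 1][j] + gap, "up"),
                (score[i][j - 1] + gap, "left"),
            ]
            score[i][j], move[i][j] = max(options, key=lambda o: o[0])

    # Trace back from the bottom-right corner to recover the alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if move[i][j] == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    return pairs[::-1]


human = ["I saw you enter the shop.", "Please stop by my house."]
google = [
    "Seeing how you entered the shop.",
    "Extra sentence about nothing.",
    "Stop by my house for a little time.",
]
alignment = needleman_wunsch(human, google)
```

In this toy input the second Google sentence has no counterpart in the human translation, so it is paired with `None`. Sentence-level pairs of this kind could then be grouped by the source paragraph boundaries to obtain paragraph-level alignments.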
2.4 Post-processing and filtering
We considered alignments to be “short” if any English paragraph, human or Google generated, contained fewer than 4 tokens or 20 characters. We discarded any alignments that were “short” and contained the word “chapter” or a Roman numeral, as these were overwhelmingly chapter titles. We also discarded any alignments where one English paragraph contained more than 3 times as many words as another, reasoning that these were actually misalignments. For the same reason, we also discarded any alignments with a BLEU score of less than 5. Alignments were sampled for the final version of PAR3 such that no more than 50% of the paragraphs for any human translation were included.
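
The discard rules above can be sketched as a single predicate. This is an illustrative reading of the text, not the authors' code: whitespace tokenization, the crude Roman-numeral pattern, and the assumption that a BLEU score on a 0-100 scale has already been computed for the alignment are all choices made here, and the 50% sampling step is not covered.

```python
import re
from typing import List

# Crude Roman-numeral matcher (assumption): any standalone run of these letters.
ROMAN = re.compile(r"\b[IVXLCDM]+\b")


def is_short(paragraphs: List[str]) -> bool:
    """True if any paragraph has fewer than 4 tokens or fewer than 20 characters."""
    return any(len(p.split()) < 4 or len(p) < 20 for p in paragraphs)


def looks_like_chapter_title(paragraphs: List[str]) -> bool:
    """True if any paragraph mentions 'chapter' or contains a Roman numeral."""
    return any("chapter" in p.lower() or ROMAN.search(p) for p in paragraphs)


def keep_alignment(paragraphs: List[str], bleu: float) -> bool:
    """Apply the three discard rules to one alignment.

    `paragraphs` holds the aligned English paragraphs (human and Google);
    `bleu` is a precomputed score for the alignment (hypothetical input).
    """
    if is_short(paragraphs) and looks_like_chapter_title(paragraphs):
        return False  # most likely a chapter title
    counts = [len(p.split()) for p in paragraphs]
    if max(counts) > 3 * min(counts):
        return False  # length mismatch suggests a misalignment
    if bleu < 5:
        return False  # very low overlap also suggests a misalignment
    return True


pair = ["I want to have a talk with you.", "I will need to speak with you."]
title = ["CHAPTER II", "Chapter 2"]
lopsided = ["one two three four five six", "word " * 30]
```

Keeping the BLEU score as an input avoids committing to a particular BLEU implementation or to which paragraph pairs it is computed over, since the text does not specify either.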
Finally, alignments for each source text were shuffled at the paragraph level to prevent reconstruction of the human translations, which may not