DEMETR: Diagnosing Evaluation Metrics for Translation
Marzena Karpinska, Nishant Raj, Katherine Thai,
Yixiao Song, Ankita Gupta, Mohit Iyyer
Manning College of Information and Computer Sciences, UMass Amherst
Department of Linguistics, UMass Amherst
{mkarpinska,kbthai,ankitagupta,miyyer}@cs.umass.edu
{nishantraj,yixiaosong}@umass.edu
Abstract
While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their limitations, their computations are transparent: the BLEU score assigned to a particular candidate translation can be traced back to the presence or absence of certain words. The operations of newer learned metrics (e.g., BLEURT, COMET), which leverage pretrained language models to achieve higher correlations with human quality judgments than BLEU, are opaque in comparison. In this paper, we shed light on the behavior of these learned metrics by creating DEMETR, a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories. All perturbations were carefully designed to form minimal pairs with the actual translation (i.e., differ in only one aspect). We find that learned metrics perform substantially better than string-based metrics on DEMETR. Additionally, learned metrics differ in their sensitivity to various phenomena (e.g., BERTSCORE is sensitive to untranslated words but relatively insensitive to gender manipulation, while COMET is much more sensitive to word repetition than to aspectual changes). We publicly release DEMETR to spur more informed future development of machine translation evaluation metrics.[1]
1 Introduction
Automatically evaluating the output quality of machine translation (MT) systems remains a difficult challenge. The BLEU metric (Papineni et al., 2002), which is a function of n-gram overlap between system and reference outputs, is still used widely today despite its obvious limitations in measuring semantic similarity (Fomicheva and Specia, 2019; Marie et al., 2021; Kocmi et al., 2021; Freitag et al., 2021). Recently developed learned evaluation metrics such as BLEURT (Sellam et al., 2020a), COMET (Rei et al., 2020), MOVERSCORE (Zhao et al., 2019), or BARTSCORE (Yuan et al., 2021a) seek to address these limitations by either fine-tuning pretrained language models directly on human judgments of translation quality or by simply utilizing contextualized word embeddings. While learned metrics exhibit higher correlation with human judgments than BLEU (Barrault et al., 2021), their relative lack of interpretability leaves it unclear why they assign a particular score to a given translation. This is a major reason why some MT researchers are reluctant to employ learned metrics to evaluate their MT systems (Marie et al., 2021; Gehrmann et al., 2022; Leiter et al., 2022).

[1] https://github.com/marzenakrp/demetr

SOURCE (de): Murray verlor den ersten Satz im Tiebreak, nachdem beide Männer jeden einzelnen Aufschlag im Satz gehalten hatten.
REF: Murray lost the first set in a tie break after both men held each and every serve in the set.
MT: Murray lost the first set in the tiebreak after both men held every single serve in the set.
PERTURBED MT: Murray won the first set in the tiebreak after both men held every single serve in the set.
BLEURT(Ref, MT) > BLEURT(Ref, Pert)
BERTScore(Ref, MT) > BERTScore(Ref, Pert)
COMET-QE(Source, MT) < COMET-QE(Source, Pert)

Figure 1: An example perturbation (antonym replacement) from our DEMETR dataset. We measure whether different MT evaluation metrics score the unperturbed translation higher than the perturbed translation; in this case, BLEURT and BERTSCORE accurately identify the perturbation, while COMET-QE fails to do so.

In this paper, we build on previous metric explainability work (Specia et al., 2010; Macketanz et al., 2018; Fomicheva and Specia, 2019; Kaster et al., 2021; Sai et al., 2021a; Barrault et al., 2021; Fomicheva et al., 2021; Leiter et al., 2022) by introducing DEMETR, a dataset for Diagnosing Evaluation METRics for machine translation, which measures the sensitivity of an MT metric to 35 different types of linguistic perturbations spanning common syntactic (e.g., incorrect word order), semantic (e.g., undertranslation), and morphological (e.g., incorrect suffix) translation error categories. Each example in DEMETR is a tuple containing {source, reference, machine translation, perturbed machine translation}, as shown in Figure 1. The entire dataset contains 31K total examples across 10 different source languages (the target language is always English). The perturbations in DEMETR are produced semi-automatically by manipulating translations produced by commercial MT systems such as Google Translate, and they are manually validated to ensure that the only source of variation is associated with the desired perturbation.
We measure the accuracy of a suite of 14 evaluation metrics on DEMETR (as shown in Figure 1), discovering that learned metrics perform far better than string-based ones. We also analyze the relative sensitivity of metrics to different grades of perturbation severity. We find that metrics at times struggle to differentiate between minor errors (e.g., punctuation removal or word repetition) and semantics-warping errors such as incorrect gender or numeracy. We also observe that the reference-free[2] COMET-QE learned metric is more sensitive to word repetition and misspelled words than to severe errors such as entirely unrelated translations or named entity replacement. We publicly release DEMETR and associated code to facilitate more principled research into MT evaluation.
2 Diagnosing MT evaluation metrics
Most existing MT evaluation metrics compute a score for a candidate translation t against a reference sentence r.[3] These scores can be either a simple function of character or token overlap between t and r (e.g., BLEU), or they can be the result of a complex neural network model that embeds t and r (e.g., BLEURT). While the latter class of learned metrics[4] provides more meaningful judgments of translation quality than the former, they are also relatively uninterpretable: the reason a particular translation t receives a high or low score is difficult to discern. In this section, we first explain our perturbation-based methodology for better understanding MT metrics before describing the collection of DEMETR, a dataset of linguistic perturbations.

[2] While prior work also uses terms such as "reference-less" and "quality estimation," we employ the term "reference-free" as it is more self-explanatory.
[3] Some metrics, such as COMET, additionally condition the score on the source sentence.
2.1 Using translation perturbations to diagnose MT metrics

Inspired by prior work in minimal pair-based linguistic evaluation of pretrained language models such as BLIMP (Warstadt et al., 2020), we investigate how sensitive MT evaluation metrics are to various perturbations of the candidate translation t. Consider the following example, which is designed to evaluate the impact of word order in the candidate translation:

reference translation r: Pronunciation is relatively easy in Italian since most words are pronounced exactly how they are written.
machine translation t: Pronunciation is relatively easy in Italian, as most words are pronounced exactly as they are spelled.
perturbed machine translation t′: Spelled pronunciation as Italian, relatively are most is as they pronounced exactly in words easy.

If a particular evaluation metric SCORE is sensitive to this shuffling perturbation, SCORE(r, t′), the score of the perturbed translation, should be lower than SCORE(r, t).[5] Note that while other minor translation errors may be present in t, the perturbed translation t′ differs only in a specific, controlled perturbation (in this case, shuffling).
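The following minimal sketch makes this protocol concrete; it is a simplified illustration rather than the released DEMETR code, using sentence-level BLEU from the sacrebleu package purely as an example scorer, with helper names chosen for exposition. Accuracy on a perturbation type is then simply the fraction of minimal pairs ranked correctly.

import sacrebleu

def sentence_bleu(reference: str, hypothesis: str) -> float:
    # sacrebleu expects the hypothesis first, followed by a list of reference strings
    return sacrebleu.sentence_bleu(hypothesis, [reference]).score

def prefers_unperturbed(score_fn, reference, mt, perturbed_mt) -> bool:
    # A metric "detects" the perturbation if it ranks the original translation higher
    return score_fn(reference, mt) > score_fn(reference, perturbed_mt)

# Example from Figure 1 (antonym replacement); accuracy over a perturbation type
# is the fraction of such minimal pairs that are ranked correctly.
examples = [
    ("Murray lost the first set in a tie break after both men held each and every serve in the set.",
     "Murray lost the first set in the tiebreak after both men held every single serve in the set.",
     "Murray won the first set in the tiebreak after both men held every single serve in the set."),
]
hits = sum(prefers_unperturbed(sentence_bleu, r, t, tp) for r, t, tp in examples)
print(f"accuracy: {hits / len(examples):.2f}")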
2.2 Creating the DEMETR dataset

To explore the above methodology at scale, we create DEMETR, a dataset that evaluates MT metrics on 35 different linguistic phenomena with 1K perturbations per phenomenon.[6] Each example in DEMETR consists of (1) a sentence in one of 10 source languages, (2) an English translation written by a human translator, (3) a machine translation produced by Google Translate,[7] and (4) a perturbed version of the Google Translate output which introduces exactly one mistake (semantic, syntactic, or typographical).

[4] We define learned metrics as any metric that uses a machine learning model (including both pretrained and supervised methods).
[5] For reference-free metrics like COMET-QE, we include the source sentence s as an input to the scoring function instead of the reference.
[6] As some perturbations require the presence of specific items (e.g., to omit a named entity, one has to be present), not all perturbations include exactly 1K sentences.

ID  Category    Description                              Error severity
1   accuracy    word repetition (twice)                  minor
2   accuracy    word repetition (four times)             minor
3   accuracy    too general word (undertranslation)      major
4   accuracy    untranslated word (codemix)              major
5   accuracy    omitted prepositional phrase             major
6   accuracy    incorrect word added                     critical
7   accuracy    change to antonym                        critical
8   accuracy    change to negation                       critical
9   accuracy    replaced named entity                    critical
10  accuracy    incorrect numeric                        critical
11  accuracy    incorrect gender pronoun                 critical
12  fluency     omitted conjunction                      minor
13  fluency     part of speech shift                     minor
14  fluency     switched word order (word swap)          minor
15  fluency     incorrect case (pronouns)                minor
16  fluency     incorrect preposition or article         minor-major
17  fluency     incorrect tense                          major
18  fluency     incorrect aspect                         major
19  fluency     change to interrogative                  major
20  mixed       omitted adj/adv                          minor-major
21  mixed       omitted content verb                     critical
22  mixed       omitted noun                             critical
23  mixed       omitted subject                          critical
24  mixed       omitted named entity                     critical
25  typography  misspelled word                          minor
26  typography  deleted character                        minor
27  typography  omitted final punctuation                minor
28  typography  added punctuation                        minor
29  typography  tokenized sentence                       minor
30  typography  lowercased sentence                      minor
31  typography  first word lowercased                    minor
32  baseline    empty string                             base
33  baseline    unrelated translation                    base
34  baseline    shuffled words                           base
35  baseline    reference as translation                 base

Table 1: List of perturbations included in DEMETR with their corresponding error severity. Details can be found in Appendix A.
Data sources and filtering: We utilize X-to-English translation pairs from two different datasets, WMT (Callison-Burch et al., 2009; Bojar et al., 2013, 2014, 2015; Akhbardeh et al., 2021; Barrault et al., 2020) and FLORES (Guzmán et al., 2019), aiming for wide coverage of topics from different sources. WMT has been widely used over the years as a popular MT shared task, while FLORES was recently curated to aid MT evaluation. We consider only the test split of each dataset to prevent possible leaks, as both current and future metrics are likely to be trained on these two datasets. We sample 100 sentences (50 from each of the two datasets) for each of the following 10 languages: French (fr), Italian (it), Spanish (es), German (de), Czech (cs), Polish (pl), Russian (ru), Hindi (hi), Chinese (zh), and Japanese (ja).[8] We pay special attention to the language selection, as newer MT evaluation metrics, such as COMET-QE or PRISM-QE, employ only the source text and the candidate translation. We control for sentence length by including only sentences between 15 and 25 words long, measured by the length of the tokenized reference translation. Since we re-use the same sentences across multiple perturbations, we did not include shorter sentences because they are less likely to contain multiple linguistic phenomena of interest.[9] As the quality of sampled sentences varies, we manually check each source sentence and its translation to make sure they are of satisfactory quality.[10]
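A minimal sketch of the length filter described above follows; the 15-25 word window comes from the text, while the whitespace tokenization and function names are simplifications introduced here for illustration (the exact tokenizer used for DEMETR is not specified in this section).

def within_length_window(reference: str, min_len: int = 15, max_len: int = 25) -> bool:
    # Keep a pair only if the tokenized English reference has between 15 and 25 words.
    n_tokens = len(reference.split())  # simplification: whitespace tokenization
    return min_len <= n_tokens <= max_len

def filter_pairs(pairs):
    # pairs: (source_sentence, reference_translation) tuples from the WMT/FLORES test splits
    return [(src, ref) for src, ref in pairs if within_length_window(ref)]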
Translating the data: Given the filtered collection of source sentences, we next translate them into English using the Google Translate API.[11] We manually verify each translation, editing or resampling the instances where the machine translation contains critical errors. Through this process, we obtain 1K curated examples per perturbation (100 sentences × 10 languages) that each consist of source and reference sentences along with a machine translation of reasonable quality.

[7] We edit the machine translation to assure satisfactory quality. In cases where the Google Translate output is exceptionally poor, we either replace the sentence or replace the translation with one produced by DeepL (Frahling, 2022) or GPT-3 (Brown et al., 2020).
[8] We choose languages that represent different families (Romance, Germanic, Slavic, Indo-Iranian, Sino-Tibetan, and Japonic) with different morphological traits (fusional, agglutinative, and analytic) and a wide range of writing systems (Latin alphabet, Cyrillic alphabet, Devanagari script, Hanzi, and Kanji/Hiragana/Katakana).
[9] Similarly, we do not include sentences over 25 words long in DEMETR, as some languages may naturally allow longer sentences than others, and we wanted to control the length distribution.
[10] In the sentences sampled from WMT, we noticed multiple translation and grammar errors, such as translating the Japanese "…れています" as "(the biggest being Honshu), making Japan the 7th largest island in the world", which would suggest that Japan is an island, instead of "the largest of which is the Honshu island, considered to be the seventh largest island in the world.", or "kakao" ("cacao") incorrectly declined as "kakaa" in Polish. These sentences were rejected, and new ones were sampled in their place. We also resampled sentences whose translations contained artifacts from neighboring sentences due to partial splits and merges, and sentences which exhibit translationese, that is, sentences with source artifacts (Koppel and Ordan, 2011). Finally, we omit or edit sentences with translation artifacts due to the direction of translation. Both WMT and FLORES contain sentences translated from English to other languages. Since the translation process is not always fully reversible, we omit sentences where translation from the given language to English would not be possible in the form included in these datasets (e.g., due to addition or omission of information).
[11] All sentences were translated in May 2022.
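For illustration, the translation step above could be scripted as follows; the paper states only that the Google Translate API was used (in May 2022), so the choice of the google-cloud-translate v2 Python client and the helper name below are assumptions made for this sketch.

from google.cloud import translate_v2 as translate  # requires GOOGLE_APPLICATION_CREDENTIALS

client = translate.Client()

def translate_to_english(text: str, source_lang: str) -> str:
    # One source sentence in, its English machine translation out.
    result = client.translate(text, source_language=source_lang, target_language="en")
    return result["translatedText"]

mt = translate_to_english(
    "Murray verlor den ersten Satz im Tiebreak, nachdem beide Männer "
    "jeden einzelnen Aufschlag im Satz gehalten hatten.", "de")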
2.3 Perturbations in DEMETR

We perturb the machine translations obtained above in order to create minimal pairs, which allow us to investigate the sensitivity of MT evaluation metrics to different types of errors. Our perturbations are loosely based on the Multidimensional Quality Metrics (MQM; Burchardt, 2013) framework developed to identify and categorize MT errors. Most perturbations were performed semi-automatically by utilizing STANZA (Qi et al., 2020), SPACY,[12] or GPT-3 (Brown et al., 2020), applying hand-crafted rules and then manually correcting any errors. Some of the more elaborate perturbations (e.g., translation by a too general term, where one had to be sure that a better, more precise term exists) were performed manually by the authors or linguistically-savvy freelancers hired on the Upwork platform.[13] Special care was given to the plausibility of perturbations (e.g., numbers for replacement were selected from a probable range, such as 1-12 for months). See Table 2 for descriptions and examples of most perturbations; the full list is in Appendix A.

[12] https://spacy.io/usage/linguistic-features
[13] See https://www.upwork.com/. Freelancers were paid an equivalent of $15 per hour.
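As one example of this semi-automatic pipeline, the sketch below implements a single rule-based perturbation (the adjacent word swap, perturbation 14 in Table 1) with spaCy; the specific rule shown here (choosing a random pair of non-punctuation neighbors) is an illustrative simplification, and in DEMETR every automatic output was additionally checked manually.

import random
import spacy

nlp = spacy.load("en_core_web_sm")

def swap_adjacent_words(translation: str, seed: int = 0) -> str:
    # Mimic a word-order error by swapping two neighboring non-punctuation tokens.
    doc = nlp(translation)
    tokens = [tok.text for tok in doc]
    candidates = [i for i in range(len(doc) - 1)
                  if not doc[i].is_punct and not doc[i + 1].is_punct]
    if not candidates:
        return translation
    i = random.Random(seed).choice(candidates)
    tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    # Re-attach each token's original trailing whitespace to keep spacing sensible.
    return "".join(tok + doc[j].whitespace_ for j, tok in enumerate(tokens)).strip()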
We roughly categorize our perturbations into the following four categories:

ACCURACY: Perturbations in the accuracy category modify the semantics of the translation by either incorporating misleading information (e.g., by adding plausible yet inadequate text or changing a word to its antonym) or omitting information (e.g., by leaving a word untranslated).

FLUENCY: Perturbations in the fluency category focus on grammatical accuracy (e.g., word form agreement, tense, or aspect) and on overall cohesion. Compared to the mistakes in the accuracy category, the true meaning of the sentence can usually be recovered from the context more easily.

MIXED: Certain perturbations can be classified as both accuracy and fluency errors. Concretely, this category consists of omission errors that not only obscure the meaning but also affect the grammaticality of the sentence. One such error is subject removal, which results not only in an ungrammatical sentence, leaving a gap where the subject should appear, but also in information loss.

TYPOGRAPHY: This category concerns punctuation and minor orthographic errors. Examples of mistakes in this category include punctuation removal, tokenization, lowercasing, and common spelling mistakes.

BASELINE: Finally, we include both upper and lower bounds, since learned metrics such as BLEURT and COMET do not have a specified range that their scores can fall into. Specifically, we provide three baselines: as lower bounds, we either change the translation to an unrelated one or provide an empty string,[14] while as an upper bound, we set the perturbed translation t′ equal to the reference translation r, which should return the highest possible score for reference-based metrics (a small sketch of these baselines follows this list).
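A short sketch of how these baselines can be constructed; the full-stop substitution follows footnote [14], and the shuffled-words baseline from Table 1 is included as well, with function names chosen here for illustration.

import random

def empty_string_baseline() -> str:
    # Footnote [14]: most metrics reject a truly empty string, so a full stop is passed instead.
    return "."

def shuffled_baseline(machine_translation: str, seed: int = 0) -> str:
    # Lower bound: the words of the candidate translation in random order.
    words = machine_translation.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def reference_baseline(reference: str) -> str:
    # Upper bound for reference-based metrics: the reference itself stands in as the "perturbed" translation.
    return reference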
Error severity: Our perturbations can also be categorized by their severity (see Table 1). We use the following categorization scheme for our analysis experiments:

MINOR: In this type of error, which includes perturbations such as dropping punctuation or using the wrong article, the meaning of the source sentence can be easily and correctly interpreted by human readers.

MAJOR: Errors in this category may not affect the overall fluency of the sentence but will result in some missing details. Examples of major errors include undertranslation (e.g., translating "church" as "building") or leaving a word in the source language untranslated.

CRITICAL: These are catastrophic errors in which crucial pieces of information go missing or incorrect information is added in a way that the reader cannot recognize; such sentences are also likely to suffer from severe fluency issues. Errors in this category include subject deletion or replacement of a named entity.
3 Performance of MT evaluation metrics on DEMETR

We test the accuracy and sensitivity of 14 popular MT evaluation metrics on the perturbations

[14] Since most of the metrics will not accept an empty string, we pass a full stop instead.
ACCURACY / repetition [implementation: automatic; severity: minor]
Original: I don’t know if you realize that most of the goods imported into this country from Central America are duty free.
Perturbed: I don’t know if you realize that most of the goods imported into this country from Central America are duty *free free*.
Description: The last word is repeated twice. Punctuation is added after the last repeated word.

ACCURACY / repetition [implementation: automatic; severity: minor]
Original: Gordon Johndroe, Bush’s spokesman, referred to the North Korean commitment as "an important advance towards the goal of achieving verifiable denuclearization of the Korean penisula."
Perturbed: Gordon Johndroe, Bush’s spokesman, referred to the North Korean commitment as "an important advance towards the goal of achieving verifiable denuclearization of the Korean *penisula penisula penisula penisula*."
Description: The last word is repeated four times. Punctuation is added after the last repeated word.

ACCURACY / hypernym [implementation: manual with suggestions from GPT-3; severity: major]
Original: The language most of the people working in the Vatican City use on a daily basis is Italian, and Latin is often used in religious *ceremonies*.
Perturbed: The language most of the people working in the Vatican City use on a daily basis is Italian, and Latin is often used in religious *activities*.
Description: A word is translated by a too general term (undertranslation). Special care was given to ensure that the word used in the perturbed text is a more general, and incorrect, translation of the original word.

ACCURACY / untranslated [implementation: manual; severity: major]
Original: The Polish Air Force will eventually be equipped with 32 F-35 Lightning II fighters *manufactured* by Lockheed Martin.
Perturbed: The Polish Air Force will eventually be equipped with 32 F-35 Lightning II fighters *produkowane* by Lockheed Martin.
Description: One word is left untranslated. We manually ensure that each time only one word is left untranslated.

ACCURACY / completeness [implementation: automatic (Stanza) with manual check; severity: major]
Original: She is *in custody* pending prosecution and trial; but any witness evidence could be negatively impacted because her image has been widely published.
Perturbed: She is _____ pending prosecution and trial; but any witness evidence could be negatively impacted because her image has been widely published.
Description: One prepositional phrase is removed. Whenever possible, we remove the shortest prepositional phrase to ensure that the perturbed sentence is not much shorter than the original translation.

ACCURACY / addition [implementation: manual; severity: critical]
Original: _____ Plants look their best when they are in a natural environment, so resist the temptation to remove "just one."
Perturbed: *Power* plants look their best when they are in a natural environment, so resist the temptation to remove "just one."
Description: One word is added. We make sure that the added word does not disturb the grammaticality of the sentence but changes the meaning in a significant way.

ACCURACY / antonym [implementation: manual with suggestions from GPT-3; severity: critical]
Original: He has been unable to relieve the *pain* with medication, which the competition prohibits competitors from taking.
Perturbed: He has been unable to relieve the *pleasure* with medication, which the competition prohibits competitors from taking.
Description: One word (noun, verb, adjective, or adverb) is changed to its antonym.

ACCURACY / mistranslation (negation) [implementation: manual; severity: critical]
Original: Last month, a presidential committee *recommended* the resignation of the former CEP as part of measures to push the country toward new elections.
Perturbed: Last month, a presidential committee *didn’t recommend* the resignation of the former CEP as part of measures to push the country toward new elections.
Description: Affirmative sentences are changed into negations. Rare negations are changed to affirmative sentences.

ACCURACY / mistranslation (named entity) [implementation: automatic (Stanza) with manual check; severity: critical]
Original: Late night presenter *Stephen Colbert* welcomed 17-year-old Thunberg to his show on Tuesday and conducted a lengthy interview with the Swede.
Perturbed: Late night presenter *John Oliver* welcomed 17-year-old Thunberg to his show on Tuesday and conducted a lengthy interview with the Swede.
Description: A named entity is replaced with another named entity from the same category (person, geographic location, or organization).

ACCURACY / mistranslation (numbers) [implementation: manual; severity: critical]
Original: The Chinese Consulate General in Houston was established in *1979* and is the first Chinese consulate in the United States.
Perturbed: The Chinese Consulate General in Houston was established in *1997* and is the first Chinese consulate in the United States.
Description: A number is replaced with an incorrect one. Special attention was given to keep the numerals within a reasonable/common range for the given category (e.g., 0-100 for percentages; 1-12 for months). We also ensure that the replacement does not create an illogical sentence (e.g., replacing "1920" with "1940" in "from 1920 to 1930").

ACCURACY / mistranslation (gender) [implementation: automatic with manual check; severity: critical]
Original: *He* has been unable to relieve the pain with medication, which the competition prohibits competitors from taking.
Perturbed: *She* has been unable to relieve the pain with medication, which the competition prohibits competitors from taking.
Description: Exactly one feminine pronoun in the sentence (such as "she" or "her") is replaced with a masculine pronoun (such as "he" or "him"), or vice versa. This includes reflexive pronouns (i.e., "him/herself") and possessive adjectives (i.e., "his/her").
FLUENCY / cohesion [implementation: automatic (spaCy) with manual check; severity: minor]
Original: Scientists want to understand how planets have formed *since* a comet collided with Earth long ago, and especially how Earth has formed.
Perturbed: Scientists want to understand how planets have formed _____ a comet collided with Earth long ago, and especially how Earth has formed.
Description: A conjunction, such as "thus" or "therefore," is removed. Special attention was given to keep the rest of the sentence unperturbed.

FLUENCY / grammar (pos shift) [implementation: manual; severity: minor]
Original: The U.S. Supreme Court last year blocked the Trump *administration* from including the citizenship question on the 2020 census form.
Perturbed: The U.S. Supreme Court last year blocked the Trump *administrate* from including the citizenship question on the 2020 census form.
Description: The affix of a word is changed while the stem is kept constant (e.g., "bad" to "badly"), which results in a part-of-speech shift. The degree to which the original meaning is affected varies; however, the intended meaning is easily retrievable from the stem and context.

FLUENCY / grammar (swap order) [implementation: automatic (spaCy); severity: minor]
Original: I don’t know if you realize that most of the goods imported *into this* country from Central America are duty free.
Perturbed: I don’t know if you realize that most of the goods imported *this into* country from Central America are duty free.
Description: Two neighboring words are swapped to mimic a word order error.

FLUENCY / grammar (case) [implementation: automatic (spaCy) with manual check; severity: minor]
Original: *She* announced that after a break of several years, a Rakoczy horse show will take place again in 2021.
Perturbed: *Her* announced that after a break of several years, a Rakoczy horse show will take place again in 2021.
Description: One pronoun in the sentence is changed into a different, incorrect case (e.g., "he" to "him").

FLUENCY / grammar (function word) [implementation: automatic with manual check; severity: minor-major]
Original: Last month, *a* presidential committee recommended the resignation of the former CEP as part of measures to push the country toward new elections.
Perturbed: Last month, *an* presidential committee recommended the resignation of the former CEP as part of measures to push the country toward new elections.
Description: A preposition or article is changed into an incorrect one to mimic a mistake in function word usage. While most perturbations result in minor mistakes (i.e., the original meaning is easily retrievable), some may be more severe.

FLUENCY / grammar (tense) [implementation: manual; severity: major]
Original: Cyanuric acid and melamine *were* both found in urine samples of pets who died after eating contaminated pet food.
Perturbed: Cyanuric acid and melamine *are* both found in urine samples of pets who died after eating contaminated pet food.
Description: A tense is changed into an incorrect one. We consider past, present, as well as the future tense (although this may be classified as a modal verb in English).

FLUENCY / grammar (aspect) [implementation: manual; severity: major]
Original: He *has been* unable to relieve the pain with medication, which the competition prohibits competitors from taking.
Perturbed: He *is being* unable to relieve the pain with medication, which the competition prohibits competitors from taking.
Description: Aspect is changed to an incorrect one (e.g., perfective to progressive) without changing the tense.

FLUENCY / grammar (interrogative) [implementation: manual; severity: major]
Original: *This is* the tenth time since the start of the pandemic that Florida’s daily death toll has surpassed 100.
Perturbed: *Is this* the tenth time since the start of the pandemic that Florida’s daily death toll has surpassed 100?
Description: Affirmative mood is changed to interrogative mood.
MIXED / omission (adj/adv) [implementation: automatic (spaCy) with manual check; severity: minor-major]
Original: Rangers *closely* monitor shooters participating in supplemental pest control trials as the trials are monitored and their effectiveness assessed.
Perturbed: Rangers _____ monitor shooters participating in supplemental pest control trials as the trials are monitored and their effectiveness assessed.
Description: An adjective or adverb is removed.

MIXED / omission (content verb) [implementation: automatic with manual check; severity: critical]
Original: Catri *said* that 85% of new coronavirus cases in Belgium last week were under the age of 60.
Perturbed: Catri _____ that 85% of new coronavirus cases in Belgium last week were under the age of 60.
Description: A content verb is removed (this excludes auxiliary verbs and copulae).

MIXED / omission (noun) [implementation: automatic (spaCy) with manual check; severity: critical]
Original: In 1940 he stood up to other government *aristocrats* who wanted to discuss an "agreement" with the Nazis and he very ably won.
Perturbed: In 1940 he stood up to other government _____ who wanted to discuss an "agreement" with the Nazis and he very ably won.
Description: A noun, which is not a named entity or a subject, is removed. We remove the head of the noun phrase, including compound nouns.

MIXED / omission (subject) [implementation: automatic (spaCy) with manual check; severity: critical]
Original: His *research* shows that the administration of hormones can accelerate the maturation of the baby’s fetal lungs.
Perturbed: His _____ shows that the administration of hormones can accelerate the maturation of the baby’s fetal lungs.
Description: The subject is removed. We remove the head of the noun phrase, including compound nouns.

MIXED / omission (named entity) [implementation: automatic (Stanza) with manual check; severity: critical]
Original: I don’t know if you realize that most of the goods imported into this country from *Central America* are duty free.
Perturbed: I don’t know if you realize that most of the goods imported into this country from _____ are duty free.
Description: A named entity, which is not a subject, is removed.

Table 2: A subset of perturbations in DEMETR along with examples (the changed material is marked with asterisks; removed material is shown as "_____"). A full list of perturbations is provided in Table A1 and Table A2 in Appendix A.