DEMETR: Diagnosing Evaluation Metrics for Translation
Marzena Karpinska, Nishant Raj, Katherine Thai,
Yixiao Song, Ankita Gupta, Mohit Iyyer
Manning College of Information and Computer Sciences, UMass Amherst
Department of Linguistics, UMass Amherst
{mkarpinska,kbthai,ankitagupta,miyyer}@cs.umass.edu
{nishantraj,yixiaosong}@umass.edu
Abstract
While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their limitations, their computations are transparent: the BLEU score assigned to a particular candidate translation can be traced back to the presence or absence of certain words. The operations of newer learned metrics (e.g., BLEURT, COMET), which leverage pretrained language models to achieve higher correlations with human quality judgments than BLEU, are opaque in comparison. In this paper, we shed light on the behavior of these learned metrics by creating DEMETR, a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories. All perturbations were carefully designed to form minimal pairs with the actual translation (i.e., differ in only one aspect). We find that learned metrics perform substantially better than string-based metrics on DEMETR. Additionally, learned metrics differ in their sensitivity to various phenomena (e.g., BERTSCORE is sensitive to untranslated words but relatively insensitive to gender manipulation, while COMET is much more sensitive to word repetition than to aspectual changes). We publicly release DEMETR to spur more informed future development of machine translation evaluation metrics.[1]
1 Introduction
Automatically evaluating the output quality of machine translation (MT) systems remains a difficult challenge. The BLEU metric (Papineni et al., 2002), which is a function of n-gram overlap between system and reference outputs, is still used widely today despite its obvious limitations in measuring semantic similarity (Fomicheva and Specia, 2019; Marie et al., 2021; Kocmi et al., 2021; Freitag et al., 2021). Recently developed learned evaluation metrics such as BLEURT (Sellam et al., 2020a), COMET (Rei et al., 2020), MOVERSCORE (Zhao et al., 2019), or BARTSCORE (Yuan et al., 2021a) seek to address these limitations by either fine-tuning pretrained language models directly on human judgments of translation quality or by simply utilizing contextualized word embeddings. While learned metrics exhibit higher correlation with human judgments than BLEU (Barrault et al., 2021), their relative lack of interpretability leaves it unclear why they assign a particular score to a given translation. This is a major reason why some MT researchers are reluctant to employ learned metrics to evaluate their MT systems (Marie et al., 2021; Gehrmann et al., 2022; Leiter et al., 2022).

[1] https://github.com/marzenakrp/demetr

SOURCE (de): Murray verlor den ersten Satz im Tiebreak, nachdem beide Männer jeden einzelnen Aufschlag im Satz gehalten hatten.
REF: Murray lost the first set in a tie break after both men held each and every serve in the set.
MT: Murray lost the first set in the tiebreak after both men held every single serve in the set.
PERTURBED MT: Murray won the first set in the tiebreak after both men held every single serve in the set.
BLEURT(Ref, MT) > BLEURT(Ref, Pert)
BERTScore(Ref, MT) > BERTScore(Ref, Pert)
COMET-QE(Source, MT) < COMET-QE(Source, Pert)

Figure 1: An example perturbation (antonym replacement) from our DEMETR dataset. We measure whether different MT evaluation metrics score the unperturbed translation higher than the perturbed translation; in this case, BLEURT and BERTSCORE accurately identify the perturbation, while COMET-QE fails to do so.

In this paper, we build on previous metric explainability work (Specia et al., 2010; Macketanz et al., 2018; Fomicheva and Specia, 2019; Kaster et al., 2021; Sai et al., 2021a; Barrault et al., 2021; Fomicheva et al., 2021; Leiter et al., 2022) by introducing DEMETR, a dataset for Diagnosing Evaluation METRics for machine translation, which measures the sensitivity of an MT metric to 35 different types of linguistic perturbations spanning common syntactic (e.g., incorrect word order), semantic (e.g., undertranslation), and morphological (e.g., incorrect suffix) translation error categories. Each example in DEMETR is a tuple containing {source, reference, machine translation, perturbed machine translation}, as shown in Figure 1. The entire dataset contains 31K total examples across 10 different source languages (the target language is always English). The perturbations in DEMETR are produced semi-automatically by manipulating translations produced by commercial MT systems such as Google Translate, and they are manually validated to ensure that the only source of variation is associated with the desired perturbation.
We measure the accuracy of a suite of 14 evaluation metrics on DEMETR (as shown in Figure 1), discovering that learned metrics perform far better than string-based ones. We also analyze the relative sensitivity of metrics to different grades of perturbation severity. We find that metrics at times struggle to differentiate between minor errors (e.g., punctuation removal or word repetition) and semantics-warping errors such as incorrect gender or numeracy. We also observe that the reference-free[2] COMET-QE learned metric is more sensitive to word repetition and misspelled words than to severe errors such as entirely unrelated translations or named entity replacement. We publicly release DEMETR and associated code to facilitate more principled research into MT evaluation.
2 Diagnosing MT evaluation metrics
Most existing MT evaluation metrics compute a score for a candidate translation t against a reference sentence r.[3] These scores can be either a simple function of character or token overlap between t and r (e.g., BLEU), or they can be the result of a complex neural network model that embeds t and r (e.g., BLEURT). While the latter class of learned metrics[4] provides more meaningful judgments of translation quality than the former, they are also relatively uninterpretable: the reason a particular translation t receives a high or low score is difficult to discern. In this section, we first explain our perturbation-based methodology for better understanding MT metrics before describing the collection of DEMETR, a dataset of linguistic perturbations.

[2] While prior work also uses terms such as "reference-less" and "quality estimation," we employ the term "reference-free" as it is more self-explanatory.
[3] Some metrics, such as COMET, additionally condition the score on the source sentence.
2.1 Using translation perturbations to diagnose MT metrics

Inspired by prior work in minimal pair-based linguistic evaluation of pretrained language models such as BLIMP (Warstadt et al., 2020), we investigate how sensitive MT evaluation metrics are to various perturbations of the candidate translation t. Consider the following example, which is designed to evaluate the impact of word order in the candidate translation:

reference translation r: Pronunciation is relatively easy in Italian since most words are pronounced exactly how they are written.
machine translation t: Pronunciation is relatively easy in Italian, as most words are pronounced exactly as they are spelled.
perturbed machine translation t′: Spelled pronunciation as Italian, relatively are most is as they pronounced exactly in words easy.

If a particular evaluation metric SCORE is sensitive to this shuffling perturbation, SCORE(r, t′), the score of the perturbed translation, should be lower than SCORE(r, t).[5] Note that while other minor translation errors may be present in t, the perturbed translation t′ differs only in a specific, controlled perturbation (in this case, shuffling).
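The following minimal sketch makes this protocol concrete; it is a simplified illustration rather than the released DEMETR code, using sentence-level BLEU from the sacrebleu package purely as an example scorer, with helper names chosen for exposition. Accuracy on a perturbation type is then simply the fraction of minimal pairs ranked correctly.

import sacrebleu

def sentence_bleu(reference: str, hypothesis: str) -> float:
    # sacrebleu expects the hypothesis first, followed by a list of reference strings
    return sacrebleu.sentence_bleu(hypothesis, [reference]).score

def prefers_unperturbed(score_fn, reference, mt, perturbed_mt) -> bool:
    # A metric "detects" the perturbation if it ranks the original translation higher
    return score_fn(reference, mt) > score_fn(reference, perturbed_mt)

# Example from Figure 1 (antonym replacement); accuracy over a perturbation type
# is the fraction of such minimal pairs that are ranked correctly.
examples = [
    ("Murray lost the first set in a tie break after both men held each and every serve in the set.",
     "Murray lost the first set in the tiebreak after both men held every single serve in the set.",
     "Murray won the first set in the tiebreak after both men held every single serve in the set."),
]
hits = sum(prefers_unperturbed(sentence_bleu, r, t, tp) for r, t, tp in examples)
print(f"accuracy: {hits / len(examples):.2f}")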
2.2 Creating the DEMETR dataset

To explore the above methodology at scale, we create DEMETR, a dataset that evaluates MT metrics on 35 different linguistic phenomena with 1K perturbations per phenomenon.[6] Each example in DEMETR consists of (1) a sentence in one of 10 source languages, (2) an English translation written by a human translator, (3) a machine translation produced by Google Translate,[7] and (4) a perturbed version of the Google Translate output which introduces exactly one mistake (semantic, syntactic, or typographical).

[4] We define learned metrics as any metric that uses a machine learning model (including both pretrained and supervised methods).
[5] For reference-free metrics like COMET-QE, we include the source sentence s as an input to the scoring function instead of the reference.
[6] As some perturbations require the presence of specific items (e.g., to omit a named entity, one has to be present), not all perturbations include exactly 1K sentences.

ID  Category    Description                              Error severity
1   accuracy    word repetition (twice)                  minor
2   accuracy    word repetition (four times)             minor
3   accuracy    too general word (undertranslation)      major
4   accuracy    untranslated word (codemix)              major
5   accuracy    omitted prepositional phrase             major
6   accuracy    incorrect word added                     critical
7   accuracy    change to antonym                        critical
8   accuracy    change to negation                       critical
9   accuracy    replaced named entity                    critical
10  accuracy    incorrect numeric                        critical
11  accuracy    incorrect gender pronoun                 critical
12  fluency     omitted conjunction                      minor
13  fluency     part of speech shift                     minor
14  fluency     switched word order (word swap)          minor
15  fluency     incorrect case (pronouns)                minor
16  fluency     incorrect preposition or article         minor-major
17  fluency     incorrect tense                          major
18  fluency     incorrect aspect                         major
19  fluency     change to interrogative                  major
20  mixed       omitted adj/adv                          minor-major
21  mixed       omitted content verb                     critical
22  mixed       omitted noun                             critical
23  mixed       omitted subject                          critical
24  mixed       omitted named entity                     critical
25  typography  misspelled word                          minor
26  typography  deleted character                        minor
27  typography  omitted final punctuation                minor
28  typography  added punctuation                        minor
29  typography  tokenized sentence                       minor
30  typography  lowercased sentence                      minor
31  typography  first word lowercased                    minor
32  baseline    empty string                             base
33  baseline    unrelated translation                    base
34  baseline    shuffled words                           base
35  baseline    reference as translation                 base

Table 1: List of perturbations included in DEMETR with their corresponding error severity. Details can be found in Appendix A.
Data sources and filtering: We utilize X-to-English translation pairs from two different datasets, WMT (Callison-Burch et al., 2009; Bojar et al., 2013, 2014, 2015; Akhbardeh et al., 2021; Barrault et al., 2020) and FLORES (Guzmán et al., 2019), aiming for wide coverage of topics from different sources. WMT has been widely used over the years as a popular MT shared task, while FLORES was recently curated to aid MT evaluation. We consider only the test split of each dataset to prevent possible leaks, as both current and future metrics are likely to be trained on these two datasets. We sample 100 sentences (50 from each of the two datasets) for each of the following 10 languages: French (fr), Italian (it), Spanish (es), German (de), Czech (cs), Polish (pl), Russian (ru), Hindi (hi), Chinese (zh), and Japanese (ja).[8] We pay special attention to the language selection, as newer MT evaluation metrics, such as COMET-QE or PRISM-QE, employ only the source text and the candidate translation. We control for sentence length by including only sentences between 15 and 25 words long, measured by the length of the tokenized reference translation. Since we re-use the same sentences across multiple perturbations, we did not include shorter sentences because they are less likely to contain multiple linguistic phenomena of interest.[9] As the quality of sampled sentences varies, we manually check each source sentence and its translation to make sure they are of satisfactory quality.[10]
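A minimal sketch of the length filter described above follows; the 15-25 word window comes from the text, while the whitespace tokenization and function names are simplifications introduced here for illustration (the exact tokenizer used for DEMETR is not specified in this section).

def within_length_window(reference: str, min_len: int = 15, max_len: int = 25) -> bool:
    # Keep a pair only if the tokenized English reference has between 15 and 25 words.
    n_tokens = len(reference.split())  # simplification: whitespace tokenization
    return min_len <= n_tokens <= max_len

def filter_pairs(pairs):
    # pairs: (source_sentence, reference_translation) tuples from the WMT/FLORES test splits
    return [(src, ref) for src, ref in pairs if within_length_window(ref)]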
Translating the data: Given the filtered collection of source sentences, we next translate them into English using the Google Translate API.[11] We manually verify each translation, editing or resampling the instances where the machine translation contains critical errors. Through this process, we obtain 1K curated examples per perturbation (100 sentences × 10 languages) that each consist of source and reference sentences along with a machine translation of reasonable quality.

[7] We edit the machine translation to assure satisfactory quality. In cases where the Google Translate output is exceptionally poor, we either replace the sentence or replace the translation with one produced by DeepL (Frahling, 2022) or GPT-3 (Brown et al., 2020).
[8] We choose languages that represent different families (Romance, Germanic, Slavic, Indo-Iranian, Sino-Tibetan, and Japonic) with different morphological traits (fusional, agglutinative, and analytic) and a wide range of writing systems (Latin alphabet, Cyrillic alphabet, Devanagari script, Hanzi, and Kanji/Hiragana/Katakana).
[9] Similarly, we do not include sentences over 25 words long in DEMETR, as some languages may naturally allow longer sentences than others, and we wanted to control the length distribution.
[10] In the sentences sampled from WMT, we noticed multiple translation and grammar errors, such as translating the Japanese "…れています" as "(the biggest being Honshu), making Japan the 7th largest island in the world", which would suggest that Japan is an island, instead of "the largest of which is the Honshu island, considered to be the seventh largest island in the world.", or "kakao" ("cacao") incorrectly declined as "kakaa" in Polish. These sentences were rejected, and new ones were sampled in their place. We also resampled sentences whose translations contained artifacts from neighboring sentences due to partial splits and merges, and sentences which exhibit translationese, that is, sentences with source artifacts (Koppel and Ordan, 2011). Finally, we omit or edit sentences with translation artifacts due to the direction of translation. Both WMT and FLORES contain sentences translated from English to other languages. Since the translation process is not always fully reversible, we omit sentences where translation from the given language to English would not be possible in the form included in these datasets (e.g., due to addition or omission of information).
[11] All sentences were translated in May 2022.
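For illustration, the translation step above could be scripted as follows; the paper states only that the Google Translate API was used (in May 2022), so the choice of the google-cloud-translate v2 Python client and the helper name below are assumptions made for this sketch.

from google.cloud import translate_v2 as translate  # requires GOOGLE_APPLICATION_CREDENTIALS

client = translate.Client()

def translate_to_english(text: str, source_lang: str) -> str:
    # One source sentence in, its English machine translation out.
    result = client.translate(text, source_language=source_lang, target_language="en")
    return result["translatedText"]

mt = translate_to_english(
    "Murray verlor den ersten Satz im Tiebreak, nachdem beide Männer "
    "jeden einzelnen Aufschlag im Satz gehalten hatten.", "de")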
2.3 Perturbations in DEMETR

We perturb the machine translations obtained above in order to create minimal pairs, which allow us to investigate the sensitivity of MT evaluation metrics to different types of errors. Our perturbations are loosely based on the Multidimensional Quality Metrics (MQM; Burchardt, 2013) framework developed to identify and categorize MT errors. Most perturbations were performed semi-automatically by utilizing STANZA (Qi et al., 2020), SPACY,[12] or GPT-3 (Brown et al., 2020), applying hand-crafted rules and then manually correcting any errors. Some of the more elaborate perturbations (e.g., translation by a too general term, where one had to be sure that a better, more precise term exists) were performed manually by the authors or linguistically-savvy freelancers hired on the Upwork platform.[13] Special care was given to the plausibility of perturbations (e.g., numbers for replacement were selected from a probable range, such as 1-12 for months). See Table 2 for descriptions and examples of most perturbations; the full list is in Appendix A.

[12] https://spacy.io/usage/linguistic-features
[13] See https://www.upwork.com/. Freelancers were paid an equivalent of $15 per hour.
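As one example of this semi-automatic pipeline, the sketch below implements a single rule-based perturbation (the adjacent word swap, perturbation 14 in Table 1) with spaCy; the specific rule shown here (choosing a random pair of non-punctuation neighbors) is an illustrative simplification, and in DEMETR every automatic output was additionally checked manually.

import random
import spacy

nlp = spacy.load("en_core_web_sm")

def swap_adjacent_words(translation: str, seed: int = 0) -> str:
    # Mimic a word-order error by swapping two neighboring non-punctuation tokens.
    doc = nlp(translation)
    tokens = [tok.text for tok in doc]
    candidates = [i for i in range(len(doc) - 1)
                  if not doc[i].is_punct and not doc[i + 1].is_punct]
    if not candidates:
        return translation
    i = random.Random(seed).choice(candidates)
    tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    # Re-attach each token's original trailing whitespace to keep spacing sensible.
    return "".join(tok + doc[j].whitespace_ for j, tok in enumerate(tokens)).strip()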
We roughly categorize our perturbations into the following four categories:

ACCURACY: Perturbations in the accuracy category modify the semantics of the translation by either incorporating misleading information (e.g., by adding plausible yet inadequate text or changing a word to its antonym) or omitting information (e.g., by leaving a word untranslated).

FLUENCY: Perturbations in the fluency category focus on grammatical accuracy (e.g., word form agreement, tense, or aspect) and on overall cohesion. Compared to the mistakes in the accuracy category, the true meaning of the sentence can usually be recovered from the context more easily.

MIXED: Certain perturbations can be classified as both accuracy and fluency errors. Concretely, this category consists of omission errors that not only obscure the meaning but also affect the grammaticality of the sentence. One such error is subject removal, which results not only in an ungrammatical sentence, leaving a gap where the subject should appear, but also in information loss.

TYPOGRAPHY: This category concerns punctuation and minor orthographic errors. Examples of mistakes in this category include punctuation removal, tokenization, lowercasing, and common spelling mistakes.

BASELINE: Finally, we include both upper and lower bounds, since learned metrics such as BLEURT and COMET do not have a specified range that their scores can fall into. Specifically, we provide three baselines: as lower bounds, we either change the translation to an unrelated one or provide an empty string,[14] while as an upper bound, we set the perturbed translation t′ equal to the reference translation r, which should return the highest possible score for reference-based metrics (a small sketch of these baselines follows this list).
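A short sketch of how these baselines can be constructed; the full-stop substitution follows footnote [14], and the shuffled-words baseline from Table 1 is included as well, with function names chosen here for illustration.

import random

def empty_string_baseline() -> str:
    # Footnote [14]: most metrics reject a truly empty string, so a full stop is passed instead.
    return "."

def shuffled_baseline(machine_translation: str, seed: int = 0) -> str:
    # Lower bound: the words of the candidate translation in random order.
    words = machine_translation.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def reference_baseline(reference: str) -> str:
    # Upper bound for reference-based metrics: the reference itself stands in as the "perturbed" translation.
    return reference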
Error severity: Our perturbations can also be categorized by their severity (see Table 1). We use the following categorization scheme for our analysis experiments:

MINOR: In this type of error, which includes perturbations such as dropping punctuation or using the wrong article, the meaning of the source sentence can be easily and correctly interpreted by human readers.

MAJOR: Errors in this category may not affect the overall fluency of the sentence but will result in some missing details. Examples of major errors include undertranslation (e.g., translating "church" as "building") or leaving a word in the source language untranslated.

CRITICAL: These are catastrophic errors in which crucial pieces of information go missing or incorrect information is added in a way that the reader cannot recognize; such sentences are also likely to suffer from severe fluency issues. Errors in this category include subject deletion or replacement of a named entity.
3 Performance of MT evaluation metrics on DEMETR

We test the accuracy and sensitivity of 14 popular MT evaluation metrics on the perturbations

[14] Since most of the metrics will not accept an empty string, we pass a full stop instead.
ACCURACY / repetition [implementation: automatic; severity: minor]
Original: I don’t know if you realize that most of the goods imported into this country from Central America are duty free.
Perturbed: I don’t know if you realize that most of the goods imported into this country from Central America are duty *free free*.
Description: The last word is repeated twice. Punctuation is added after the last repeated word.

ACCURACY / repetition [implementation: automatic; severity: minor]
Original: Gordon Johndroe, Bush’s spokesman, referred to the North Korean commitment as "an important advance towards the goal of achieving verifiable denuclearization of the Korean penisula."
Perturbed: Gordon Johndroe, Bush’s spokesman, referred to the North Korean commitment as "an important advance towards the goal of achieving verifiable denuclearization of the Korean *penisula penisula penisula penisula*."
Description: The last word is repeated four times. Punctuation is added after the last repeated word.

ACCURACY / hypernym [implementation: manual with suggestions from GPT-3; severity: major]
Original: The language most of the people working in the Vatican City use on a daily basis is Italian, and Latin is often used in religious *ceremonies*.
Perturbed: The language most of the people working in the Vatican City use on a daily basis is Italian, and Latin is often used in religious *activities*.
Description: A word is translated by a too general term (undertranslation). Special care was given to ensure that the word used in the perturbed text is a more general, and incorrect, translation of the original word.

ACCURACY / untranslated [implementation: manual; severity: major]
Original: The Polish Air Force will eventually be equipped with 32 F-35 Lightning II fighters *manufactured* by Lockheed Martin.
Perturbed: The Polish Air Force will eventually be equipped with 32 F-35 Lightning II fighters *produkowane* by Lockheed Martin.
Description: One word is left untranslated. We manually ensure that each time only one word is left untranslated.

ACCURACY / completeness [implementation: automatic (Stanza) with manual check; severity: major]
Original: She is *in custody* pending prosecution and trial; but any witness evidence could be negatively impacted because her image has been widely published.
Perturbed: She is _____ pending prosecution and trial; but any witness evidence could be negatively impacted because her image has been widely published.
Description: One prepositional phrase is removed. Whenever possible, we remove the shortest prepositional phrase to ensure that the perturbed sentence is not much shorter than the original translation.

ACCURACY / addition [implementation: manual; severity: critical]
Original: _____ Plants look their best when they are in a natural environment, so resist the temptation to remove "just one."
Perturbed: *Power* plants look their best when they are in a natural environment, so resist the temptation to remove "just one."
Description: One word is added. We make sure that the added word does not disturb the grammaticality of the sentence but changes the meaning in a significant way.

ACCURACY / antonym [implementation: manual with suggestions from GPT-3; severity: critical]
Original: He has been unable to relieve the *pain* with medication, which the competition prohibits competitors from taking.
Perturbed: He has been unable to relieve the *pleasure* with medication, which the competition prohibits competitors from taking.
Description: One word (noun, verb, adjective, or adverb) is changed to its antonym.

ACCURACY / mistranslation (negation) [implementation: manual; severity: critical]
Original: Last month, a presidential committee *recommended* the resignation of the former CEP as part of measures to push the country toward new elections.
Perturbed: Last month, a presidential committee *didn’t recommend* the resignation of the former CEP as part of measures to push the country toward new elections.
Description: Affirmative sentences are changed into negations. Rare negations are changed to affirmative sentences.

ACCURACY / mistranslation (named entity) [implementation: automatic (Stanza) with manual check; severity: critical]
Original: Late night presenter *Stephen Colbert* welcomed 17-year-old Thunberg to his show on Tuesday and conducted a lengthy interview with the Swede.
Perturbed: Late night presenter *John Oliver* welcomed 17-year-old Thunberg to his show on Tuesday and conducted a lengthy interview with the Swede.
Description: A named entity is replaced with another named entity from the same category (person, geographic location, or organization).

ACCURACY / mistranslation (numbers) [implementation: manual; severity: critical]
Original: The Chinese Consulate General in Houston was established in *1979* and is the first Chinese consulate in the United States.
Perturbed: The Chinese Consulate General in Houston was established in *1997* and is the first Chinese consulate in the United States.
Description: A number is replaced with an incorrect one. Special attention was given to keep the numerals within a reasonable/common range for the given category (e.g., 0-100 for percentages; 1-12 for months). We also ensure that the replacement does not create an illogical sentence (e.g., replacing "1920" with "1940" in "from 1920 to 1930").

ACCURACY / mistranslation (gender) [implementation: automatic with manual check; severity: critical]
Original: *He* has been unable to relieve the pain with medication, which the competition prohibits competitors from taking.
Perturbed: *She* has been unable to relieve the pain with medication, which the competition prohibits competitors from taking.
Description: Exactly one feminine pronoun in the sentence (such as "she" or "her") is replaced with a masculine pronoun (such as "he" or "him"), or vice versa. This includes reflexive pronouns (i.e., "him/herself") and possessive adjectives (i.e., "his/her").
FLUENCY / cohesion [implementation: automatic (spaCy) with manual check; severity: minor]
Original: Scientists want to understand how planets have formed *since* a comet collided with Earth long ago, and especially how Earth has formed.
Perturbed: Scientists want to understand how planets have formed _____ a comet collided with Earth long ago, and especially how Earth has formed.
Description: A conjunction, such as "thus" or "therefore," is removed. Special attention was given to keep the rest of the sentence unperturbed.

FLUENCY / grammar (pos shift) [implementation: manual; severity: minor]
Original: The U.S. Supreme Court last year blocked the Trump *administration* from including the citizenship question on the 2020 census form.
Perturbed: The U.S. Supreme Court last year blocked the Trump *administrate* from including the citizenship question on the 2020 census form.
Description: The affix of a word is changed while the stem is kept constant (e.g., "bad" to "badly"), which results in a part-of-speech shift. The degree to which the original meaning is affected varies; however, the intended meaning is easily retrievable from the stem and context.

FLUENCY / grammar (swap order) [implementation: automatic (spaCy); severity: minor]
Original: I don’t know if you realize that most of the goods imported *into this* country from Central America are duty free.
Perturbed: I don’t know if you realize that most of the goods imported *this into* country from Central America are duty free.
Description: Two neighboring words are swapped to mimic a word order error.

FLUENCY / grammar (case) [implementation: automatic (spaCy) with manual check; severity: minor]
Original: *She* announced that after a break of several years, a Rakoczy horse show will take place again in 2021.
Perturbed: *Her* announced that after a break of several years, a Rakoczy horse show will take place again in 2021.
Description: One pronoun in the sentence is changed into a different, incorrect case (e.g., "he" to "him").

FLUENCY / grammar (function word) [implementation: automatic with manual check; severity: minor-major]
Original: Last month, *a* presidential committee recommended the resignation of the former CEP as part of measures to push the country toward new elections.
Perturbed: Last month, *an* presidential committee recommended the resignation of the former CEP as part of measures to push the country toward new elections.
Description: A preposition or article is changed into an incorrect one to mimic a mistake in function word usage. While most perturbations result in minor mistakes (i.e., the original meaning is easily retrievable), some may be more severe.

FLUENCY / grammar (tense) [implementation: manual; severity: major]
Original: Cyanuric acid and melamine *were* both found in urine samples of pets who died after eating contaminated pet food.
Perturbed: Cyanuric acid and melamine *are* both found in urine samples of pets who died after eating contaminated pet food.
Description: A tense is changed into an incorrect one. We consider past, present, as well as the future tense (although this may be classified as a modal verb in English).

FLUENCY / grammar (aspect) [implementation: manual; severity: major]
Original: He *has been* unable to relieve the pain with medication, which the competition prohibits competitors from taking.
Perturbed: He *is being* unable to relieve the pain with medication, which the competition prohibits competitors from taking.
Description: Aspect is changed to an incorrect one (e.g., perfective to progressive) without changing the tense.

FLUENCY / grammar (interrogative) [implementation: manual; severity: major]
Original: *This is* the tenth time since the start of the pandemic that Florida’s daily death toll has surpassed 100.
Perturbed: *Is this* the tenth time since the start of the pandemic that Florida’s daily death toll has surpassed 100?
Description: Affirmative mood is changed to interrogative mood.
MIXED / omission (adj/adv) [implementation: automatic (spaCy) with manual check; severity: minor-major]
Original: Rangers *closely* monitor shooters participating in supplemental pest control trials as the trials are monitored and their effectiveness assessed.
Perturbed: Rangers _____ monitor shooters participating in supplemental pest control trials as the trials are monitored and their effectiveness assessed.
Description: An adjective or adverb is removed.

MIXED / omission (content verb) [implementation: automatic with manual check; severity: critical]
Original: Catri *said* that 85% of new coronavirus cases in Belgium last week were under the age of 60.
Perturbed: Catri _____ that 85% of new coronavirus cases in Belgium last week were under the age of 60.
Description: A content verb is removed (this excludes auxiliary verbs and copulae).

MIXED / omission (noun) [implementation: automatic (spaCy) with manual check; severity: critical]
Original: In 1940 he stood up to other government *aristocrats* who wanted to discuss an "agreement" with the Nazis and he very ably won.
Perturbed: In 1940 he stood up to other government _____ who wanted to discuss an "agreement" with the Nazis and he very ably won.
Description: A noun, which is not a named entity or a subject, is removed. We remove the head of the noun phrase, including compound nouns.

MIXED / omission (subject) [implementation: automatic (spaCy) with manual check; severity: critical]
Original: His *research* shows that the administration of hormones can accelerate the maturation of the baby’s fetal lungs.
Perturbed: His _____ shows that the administration of hormones can accelerate the maturation of the baby’s fetal lungs.
Description: The subject is removed. We remove the head of the noun phrase, including compound nouns.

MIXED / omission (named entity) [implementation: automatic (Stanza) with manual check; severity: critical]
Original: I don’t know if you realize that most of the goods imported into this country from *Central America* are duty free.
Perturbed: I don’t know if you realize that most of the goods imported into this country from _____ are duty free.
Description: A named entity, which is not a subject, is removed.

Table 2: A subset of perturbations in DEMETR along with examples (the changed material is marked with asterisks; removed material is shown as "_____"). A full list of perturbations is provided in Table A1 and Table A2 in Appendix A.