
Perplexity by PLM Is Unreliable for Evaluating Text Quality
Yequan Wang1∗, Jiawen Deng2∗, Aixin Sun3, Xuying Meng4
1Beijing Academy of Artificial Intelligence, Beijing, China
2CoAI Group, DCST, IAI, BNRIST, Tsinghua University, Beijing, China
3School of Computer Science and Engineering, Nanyang Technological University, Singapore
4Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
tshwangyequan@gmail.com, dengjw2021@mail.tsinghua.edu.cn,
axsun@ntu.edu.sg, mengxuying@ict.ac.cn
Abstract
Recently, a few studies have adopted perplexity (PPL) to evaluate the quality, or more specifically the fluency, of generated text. A smaller PPL value is taken to indicate better text quality or better fluency. Through carefully designed experiments, we show that PPL is an unreliable measure of text quality. Specifically, we show that: (i) the PPL of short text is more likely to be larger than that of long text; (ii) repeated text spans lead to lower PPL values, although repeated spans often do not contribute to better text quality; and (iii) PPL values can be largely affected by punctuation marks. Based on these findings, we further discuss the key issues in evaluating text quality using language models.
1 Introduction
The rapid development of natural language processing, particularly the success of pre-trained language models, has brought tremendous growth and progress to various text generation tasks. Examples include machine translation (Tu et al., 2016; Zhang et al., 2021), question answering (Duan et al., 2017), and generation-based dialog systems (Tu et al., 2022). How to evaluate the quality of the generated text in a cost-efficient manner has become a key challenge.
Researchers have adopted various statistical metrics to evaluate generated text. These measures include word-based metrics like BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), character-based metrics like chrF (Popović, 2015), and embedding-based metrics like Vector Extrema (Forgues et al., 2014) and Greedy Matching (Rus and Lintean, 2012). Specifically, BLEU reflects the ratio of overlapping n-grams to the total n-grams, making it a precision-based measure. ROUGE and its variants, which also evaluate text based on n-grams, are recall-based measures (Sai et al., 2023). Vector Extrema prioritizes informative words by taking the extreme value along each dimension. All these measures are widely adopted in many experiments and tasks. However, such statistical measures cannot well evaluate the creativeness, diversity, and complexity of text, particularly when the same semantics are expressed in different ways, e.g., with different words/phrases or different sentence structures.
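As a rough illustration, the clipped n-gram precision at the core of BLEU can be sketched in a few lines of Python; the helper names ngrams and clipped_precision below are ours, and the full metric further combines several n-gram orders with a brevity penalty.

from collections import Counter

def ngrams(tokens, n):
    # All n-grams (as tuples) in a token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hypothesis, reference, n=1):
    # Fraction of hypothesis n-grams that also occur in the reference,
    # with each n-gram's count clipped to its count in the reference.
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

# Unigram precision of a candidate against a single reference:
hyp = "the cat sat on the mat".split()
ref = "there is a cat on the mat".split()
print(clipped_precision(hyp, ref, n=1))  # 4 matching unigrams out of 6, about 0.67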
In addition to the aforementioned statistical measures, perplexity (PPL) has also been used to evaluate text quality or fluency in generation tasks. PPL is an intrinsic measure that quantifies to what extent classical language models, e.g., n-gram models, learn natural language (Meister and Cotterell, 2021a). Considering that large-scale pre-trained language models (PLMs), e.g., BERT (Devlin et al., 2019) and GPT (Radford et al., 2019), have well captured language knowledge, PPL has also been used to evaluate the quality of generated text.1 Given a PLM and a sequence of generated text, perplexity reflects how likely the model is to generate this text sequence. If we assume a large PLM well captures language knowledge and is well behaved, then the PPL value computed in this way could reflect the quality of the input sequence to some extent.
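As a minimal sketch, the PPL of a single sentence under GPT-2 can be computed with the HuggingFace Transformers library roughly as follows; the function name sentence_ppl is ours, and the perplexity metric referenced in footnote 1 is built on the same token-level cross-entropy idea, with additional batching and windowing details.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_ppl(text: str) -> float:
    # PPL(x) = exp(-(1/N) * sum_i log p(x_i | x_<i)) under the language model.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing input_ids as labels makes the model return the mean
        # token-level cross-entropy (negative log-likelihood) as `loss`.
        out = model(input_ids=enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

print(sentence_ppl("The quick brown fox jumps over the lazy dog."))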
In this paper, we use a PLM to compute PPL values of high-quality sentences, as if these sentences were outputs from some generative models. Based on the distributions of the resulting PPL values, we claim that PPL computed in this way cannot fairly evaluate text quality. Specifically, we use the GPT-2 model (Radford et al., 2019) to compute the PPL of sentences in the WikiText-2 dataset.2 As the sentences in the WikiText dataset were extracted from verified good and featured articles on Wikipedia, we trust
* Indicates equal contribution
1 https://huggingface.co/spaces/evaluate-metric/perplexity
2 https://huggingface.co/datasets/wikitext