Perplexity by PLM Is Unreliable for Evaluating Text Quality
Yequan Wang1, Jiawen Deng2, Aixin Sun3, Xuying Meng4
1Beijing Academy of Artificial Intelligence, Beijing, China
2CoAI Group, DCST, IAI, BNRIST, Tsinghua University, Beijing, China
3School of Computer Science and Engineering, Nanyang Technological University, Singapore
4Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
tshwangyequan@gmail.com, dengjw2021@mail.tsinghua.edu.cn,
axsun@ntu.edu.sg, mengxuying@ict.ac.cn
*Indicates equal contribution
Abstract
Recently, a few studies have adopted perplexity (PPL) to evaluate the quality, or more specifically the fluency, of generated text. A smaller PPL value is taken to indicate better text quality or better fluency. Through carefully designed experiments, we show that PPL is an unreliable measure of text quality. Specifically, we show that: (i) the PPL of short text is more likely to be larger than that of long text; (ii) repeated text spans lead to lower PPL values, although repeated spans often do not contribute to better text quality; and (iii) PPL values can be largely affected by punctuation marks. Based on these findings, we further discuss the key issues in evaluating text quality using language models.
1 Introduction
The rapid development in natural language processing, particularly the success of pre-trained language models, has brought tremendous growth and progress to various text generation tasks. Examples include machine translation (Tu et al., 2016; Zhang et al., 2021), question answering (Duan et al., 2017), and generation-based dialog systems (Tu et al., 2022). How to evaluate the quality of the generated text in a cost-efficient manner has become a key challenge.
Researchers have adopted various statistical metrics to evaluate generated text. These measures include word-based measures like BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), character-based metrics like chrF (Popović, 2015), and embedding-based metrics like Vector Extrema (Forgues et al., 2014) and Greedy Matching (Rus and Lintean, 2012). Specifically, BLEU reflects the ratio of overlapping n-grams to the total number of n-grams, making it a precision-based measure. ROUGE and its variants, which also evaluate text based on n-grams, are recall-based measures (Sai et al., 2023). Vector Extrema prioritizes informative words by taking the extreme value along each dimension. All these measures are widely adopted in many experiments and tasks. However, such statistical measures cannot well evaluate the creativity, diversity, and complexity of texts, particularly when the same semantics are expressed in different ways, e.g., with different words/phrases or different sentence structures.
In addition to the aforementioned statistical measures, perplexity (PPL) has also been used to evaluate text quality or fluency in generation tasks. PPL is an intrinsic measure of how well classical language models, e.g., n-gram models, learn natural language (Meister and Cotterell, 2021a). Considering that large-scale pre-trained language models (PLMs), e.g., BERT (Devlin et al., 2019) and GPT (Radford et al., 2019), have well captured language knowledge, PPL has also been used to evaluate the quality of generated text.¹ Given a PLM and a sequence of generated text, perplexity reflects how likely the model is to generate this text sequence. If we assume a large PLM well captures language knowledge and is well-behaved, then the PPL value computed in this way could reflect the quality of the input sequence to some extent.
In this paper, we use a PLM to compute the PPL values of high-quality sentences, as if these sentences were outputs from some generative models. Based on the distributions of PPL values, we claim that PPL computed in this way cannot fairly evaluate text quality. Specifically, we used the GPT-2 model (Radford et al., 2019) to compute the PPL of sentences in the WikiText-2 dataset.² As the sentences in the WikiText dataset were extracted from verified good and featured articles on Wikipedia, we trust that these sentences are of high quality. However, our experiments lead to the following findings.

¹https://huggingface.co/spaces/evaluate-metric/perplexity
²https://huggingface.co/datasets/wikitext
(i) PPL is sensitive to text length, i.e., the PPL of short text is likely to be much larger than that of long text. However, the generated texts to be evaluated may have very different lengths (Meister and Cotterell, 2021b). Strictly speaking, text quality is independent of text length.

(ii) PPL is lower for text with repeated span(s). Generated text may contain repeated span(s). Although legitimate repeated text spans can be used to express emphasis in sentences, PPL cannot distinguish valid semantic emphasis from unreasonable, straightforward repetition.

(iii) PPL is sensitive to punctuation marks in sentences. Simply removing the last punctuation mark in a sentence may lead to a significant increase in its PPL, even though removing the last punctuation mark has only a very small impact on human perception of the sentence. These sensitivities are straightforward to probe, as the sketch below illustrates.
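A minimal sketch of such probes, using the HuggingFace perplexity metric cited in footnote 1; the probe sentences are our own hypothetical examples, not taken from the paper.

import evaluate

# Perplexity metric from footnote 1; runs GPT-2 locally via transformers.
perplexity = evaluate.load("perplexity", module_type="metric")

# Hypothetical probe sentences (ours), one pair per sensitivity above.
probes = [
    "The cat sat.",                                      # (i) short text
    "The cat sat quietly on the mat in the afternoon.",  # (i) longer text
    "It was a very, very, very, very good day.",         # (ii) repeated span
    "The cat sat on the mat.",                           # (iii) final period kept
    "The cat sat on the mat",                            # (iii) final period removed
]

results = perplexity.compute(model_id="gpt2", predictions=probes)
for text, p in zip(probes, results["perplexities"]):
    print(f"{p:9.2f}  {text!r}")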
To the best of our knowledge, this is the first attempt to systematically analyze the suitability of PPL as a quality measure for generated text. Based on our findings, we call for more carefully designed metrics, which are expected to be (i) insensitive to length; (ii) sensitive to common mistakes, e.g., unnecessary repeated text; and (iii) insensitive to minor punctuation changes. In other words, a measure of text fluency shall not be much affected by text length, while penalizing unnecessary text spans and not attending to insignificant punctuation marks.
2 Preliminary and Experiment Setup
In our experiments, we follow the mainstream approach of using GPT-2 (Radford et al., 2019) as the pre-trained language model to calculate PPL. More specifically, we use the GPT2-large model.
Given an input sentence s, we obtain its token sequence s = [t_1, t_2, \ldots, t_m] of length m using the PLM's tokenizer. We use GPT2-large to compute the per-token predictions:

P_1, P_2, \ldots, P_m = \text{GPT-2}([t_1, t_2, \ldots, t_m]),  (1)

where P_i denotes the predicted probability distribution of the i-th token. The PPL of the input sentence s is then the exponential of the average token-level cross-entropy:

\text{PPL}(s) = \exp\Big(\frac{1}{m} \sum_{i=1}^{m} \text{cross-entropy}(t_i, P_i)\Big).  (2)

Here, PPL(s) is the perplexity value of the input sentence s.

[Figure 1: The PPL of texts of different lengths. The x-axis denotes text length in number of tokens, and the y-axis is the PPL value in log scale.]
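The following is a minimal sketch of how Eq. (2) can be computed with GPT2-large through the HuggingFace transformers library; this is our reconstruction, not the authors' released code. With the labels set to the input ids, the returned loss is the mean next-token cross-entropy (averaged over the m-1 shifted targets), so its exponential gives the perplexity.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large")
model.eval()

def ppl(sentence: str) -> float:
    # Token sequence [t_1, ..., t_m] produced by the PLM's tokenizer.
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # With labels=input_ids, .loss is the mean token-level cross-entropy
        # over the shifted next-token targets, matching Eq. (2).
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return torch.exp(loss).item()

print(ppl("The quick brown fox jumps over the lazy dog."))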
The sentences in our experiments are from the test split of the WikiText-2 dataset. We filter out sentences with fewer than 3 words to avoid extremely short sentences. As a result, 2,786 texts remain in our experiments. The maximum, minimum, and average lengths are 481, 3, and 86.52 tokens, respectively. A sketch of this data preparation is given below.
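The following reconstructs the setup as we understand it; the "wikitext-2-raw-v1" configuration name is our assumption, as the paper only specifies WikiText-2.

from datasets import load_dataset

# Test split of WikiText-2; the exact config name is our assumption.
wiki = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

# Drop extremely short texts (fewer than 3 words), as described above.
texts = [t.strip() for t in wiki["text"] if len(t.split()) >= 3]

# The paper reports 2,786 texts after filtering; the exact count here may
# differ slightly depending on preprocessing details not specified.
print(len(texts))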
As sentences in WikiText were originally drawn from a set of carefully selected, high-quality Wikipedia articles, we assume the quality of all these sentences is high. Accordingly, if PPL computed by a PLM is a suitable text quality measure, we expect a reasonably stable PPL value across all these sentences. On the other hand, if the PPL values of these high-quality sentences spread over a large range, then PPL values may not well reflect the quality of generated text.
3 Experiments and Findings
3.1 PPL vs Text Length
We first evaluate whether PPL is sensitive to the length of high-quality text. In human perception, text quality is not strongly correlated with text length. Given the high-quality sentences from WikiText, we expect a stable PPL value across all sentences. Figure 1 plots the PPL values of texts against their lengths in number of tokens; note that the PPL values on the y-axis are in log scale. A sketch of how such a plot could be produced is shown below.
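This sketch assumes the ppl() helper from Section 2 and the filtered texts list from the data setup above; plotting details (marker size, labels) are our choices.

import matplotlib.pyplot as plt

# Token length and PPL value for every text (slow: one forward pass each).
lengths = [len(tokenizer(t).input_ids) for t in texts]
values = [ppl(t) for t in texts]

plt.scatter(lengths, values, s=4)
plt.yscale("log")  # PPL on a log scale, as in Figure 1
plt.xlabel("text length (tokens)")
plt.ylabel("PPL")
plt.show()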
Finding 1: PPL values are unstable for short texts, and they become lower as text length increases.

Observe that a good number of sentences are short, i.e., shorter than 25 tokens. These short sentences have a very wide range of PPL values.