
Perplexity by PLM Is Unreliable for Evaluating Text Quality
Yequan Wang1∗, Jiawen Deng2∗, Aixin Sun3, Xuying Meng4
1Beijing Academy of Artificial Intelligence, Beijing, China
2CoAI Group, DCST, IAI, BNRIST, Tsinghua University, Beijing, China
3School of Computer Science and Engineering, Nanyang Technological University, Singapore
4Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
tshwangyequan@gmail.com, dengjw2021@mail.tsinghua.edu.cn,
axsun@ntu.edu.sg, mengxuying@ict.ac.cn
Abstract
Recently, a few studies have adopted perplexity (PPL) to evaluate the quality, or more specifically the fluency, of generated text. A smaller PPL value is taken to indicate better text quality or better fluency. Through carefully designed experiments, we show that PPL is an unreliable measure of text quality. Specifically, we show that: (i) the PPL of short text is more likely to be larger than that of long text; (ii) repeated text spans lead to lower PPL values, although repeated spans often do not contribute to better text quality; and (iii) PPL values can be largely affected by punctuation marks. Based on these findings, we further discuss the key issues in evaluating text quality using language models.
1 Introduction
The rapid development of natural language processing, particularly the success of pre-trained language models, has brought tremendous growth and progress to various text generation tasks. Examples include machine translation (Tu et al., 2016; Zhang et al., 2021), question answering (Duan et al., 2017), and generation-based dialog systems (Tu et al., 2022). How to evaluate the quality of the generated text in a cost-efficient manner has become a key challenge.
Researchers have adopted various statistical metrics to evaluate generated text. These measures include word-based metrics like BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), character-based metrics like chrF (Popović, 2015), and embedding-based metrics like Vector Extrema (Forgues et al., 2014) and Greedy Matching (Rus and Lintean, 2012). Specifically, BLEU reflects the ratio of overlapping n-grams to the total n-grams, making it a precision-based measure. ROUGE and its variants, which also evaluate text based on n-grams, are recall-based measures (Sai et al., 2023). Vector Extrema prioritizes informative words by taking the extreme value along each dimension. All these measures are widely adopted in many experiments and tasks. However, such statistical measures cannot well evaluate the creativeness, diversity, and complexity of text, particularly when the same semantics are expressed in different ways, e.g., with different words/phrases or different sentence structures.
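As a rough illustration, the clipped n-gram precision at the core of BLEU can be sketched in a few lines of Python; the helper names ngrams and clipped_precision below are ours, and the full metric further combines several n-gram orders with a brevity penalty.

from collections import Counter

def ngrams(tokens, n):
    # All n-grams (as tuples) in a token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hypothesis, reference, n=1):
    # Fraction of hypothesis n-grams that also occur in the reference,
    # with each n-gram's count clipped to its count in the reference.
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

# Unigram precision of a candidate against a single reference:
hyp = "the cat sat on the mat".split()
ref = "there is a cat on the mat".split()
print(clipped_precision(hyp, ref, n=1))  # 4 matching unigrams out of 6, about 0.67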
In addition to the aforementioned statistical measures, perplexity (PPL) has also been used to evaluate text quality or fluency in generation tasks. PPL is an intrinsic measure that quantifies to what extent classical language models, e.g., n-gram models, learn natural language (Meister and Cotterell, 2021a). Considering that large-scale pre-trained language models (PLMs), e.g., BERT (Devlin et al., 2019) and GPT (Radford et al., 2019), have well captured language knowledge, PPL has also been used to evaluate the quality of generated text.1 Given a PLM and a sequence of generated text, perplexity reflects how likely the model is to generate this text sequence. If we assume a large PLM well captures language knowledge and is well behaved, then the PPL value computed in this way could reflect the quality of the input sequence to some extent.
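As a minimal sketch, the PPL of a single sentence under GPT-2 can be computed with the HuggingFace Transformers library roughly as follows; the function name sentence_ppl is ours, and the perplexity metric referenced in footnote 1 is built on the same token-level cross-entropy idea, with additional batching and windowing details.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_ppl(text: str) -> float:
    # PPL(x) = exp(-(1/N) * sum_i log p(x_i | x_<i)) under the language model.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing input_ids as labels makes the model return the mean
        # token-level cross-entropy (negative log-likelihood) as `loss`.
        out = model(input_ids=enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

print(sentence_ppl("The quick brown fox jumps over the lazy dog."))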
In this paper, we use a PLM to compute PPL values of high-quality sentences, as if these sentences were outputs from some generative models. Based on the distributions of the resulting PPL values, we claim that PPL computed in this way cannot fairly evaluate text quality. Specifically, we use the GPT-2 model (Radford et al., 2019) to compute the PPL of sentences in the WikiText-2 dataset.2 As the sentences in the WikiText dataset were extracted from verified good and featured articles on Wikipedia, we trust
* Indicates equal contribution
1 https://huggingface.co/spaces/evaluate-metric/perplexity
2 https://huggingface.co/datasets/wikitext