
LLMEffiChecker: Understanding and Testing Efficiency Degradation of Large Language Models
XIAONING FENG, Taiyuan University of Technology, China
XIAOHONG HAN∗, Taiyuan University of Technology, China
SIMIN CHEN†, The University of Texas at Dallas, USA
WEI YANG, The University of Texas at Dallas, USA
Large Language Models (LLMs) have received much recent attention due to their human-level accuracy. While existing works mostly focus on either improving accuracy or testing accuracy robustness, the computation efficiency of LLMs, which is of paramount importance due to often vast generation demands and real-time requirements, has surprisingly received little attention. In this paper, we make the first attempt to understand and test potential computation efficiency robustness in state-of-the-art LLMs. By analyzing the working mechanism and implementation of 20,543 publicly accessible LLMs, we observe a fundamental property in LLMs that could be manipulated in an adversarial manner to reduce computation efficiency significantly. Our interesting observation is that the output length, rather than the input, determines the computation efficiency of LLMs, where the output length depends on two factors: an often sufficiently large yet pessimistic pre-configured threshold controlling the maximum number of iterations, and a runtime-generated end-of-sentence (EOS) token. Our key motivation is to generate test inputs that sufficiently delay the generation of EOS such that LLMs would have to go through enough iterations to reach the pre-configured threshold.
We present LLMEffiChecker, which can work under both white-box and black-box settings. In the white-box scenario, LLMEffiChecker develops a gradient-guided technique that searches for a minimal and unnoticeable perturbation at the character, token, and structure levels. In the black-box scenario, LLMEffiChecker employs a causal inference-based approach to find critical tokens and similarly applies the three levels of imperceptible perturbation to them. Both the white-box and black-box approaches effectively delay the appearance of EOS, compelling the perturbed inputs to reach the naturally unreachable threshold. To demonstrate the effectiveness of LLMEffiChecker, we conduct a systematic evaluation on nine publicly available LLMs: Google T5, AllenAI WMT14, Helsinki-NLP translator, Facebook FairSeq, UNICAMP-DL translator, MarianMT, Google FLAN-T5, MBZUAI LaMini-GPT, and Salesforce CodeGen. Experimental results show that LLMEffiChecker can increase LLMs' response latency and energy consumption on average by 325% to 3244% and 344% to 3616%, respectively, by perturbing just one character or token in the input sentence. Our case study shows that inputs generated by LLMEffiChecker significantly affect battery power on real-world mobile devices (i.e., they drain more than 30 times the battery power of normal inputs).
CCS Concepts: • Software and its engineering → Search-based software engineering; Software testing and debugging; Automatic programming; Software evolution.
Additional Key Words and Phrases: Machine learning, software testing, large language model
1 INTRODUCTION
Large Language Models (LLMs) are a promising approach that applies neural networks to solve various text generation problems. LLMs have received significant recent attention from both academia [4, 10, 42, 53] and industry [2, 36, 46, 54, 66, 92, 95] due to their advantages over traditional text generation methods (e.g., N-gram language models [67]). For instance, because they are capable of capturing rather long dependencies in sentences, LLMs are seeing wide adoption in commercial
∗Corresponding author
†Corresponding author
Authors’ addresses: Xiaoning Feng, fengxiaoning1746@link.tyut.edu.cn, Taiyuan University of Technology, Taiyuan, China; Xiaohong Han, hanxiaohong@tyut.edu.cn, Taiyuan University of Technology, Taiyuan, China; Simin Chen, simin.chen@UTDallas.edu, The University of Texas at Dallas, Dallas, USA; Wei Yang, wei.yang@utdallas.edu, The University of Texas at Dallas, Dallas, USA.