LLMEffiChecker: Understanding and Testing Efficiency Degradation of Large Language Models
XIAONING FENG, Taiyuan University of Technology, China
XIAOHONG HAN∗, Taiyuan University of Technology, China
SIMIN CHEN†, The University of Texas at Dallas, USA
WEI YANG, The University of Texas at Dallas, USA
Large Language Models (LLMs) have received much recent attention due to their human-level accuracy. While existing works mostly focus on either improving accuracy or testing accuracy robustness, the computation efficiency of LLMs, which is of paramount importance due to often vast generation demands and real-time requirements, has surprisingly received little attention. In this paper, we make the first attempt to understand and test potential computation efficiency robustness in state-of-the-art LLMs. By analyzing the working mechanism and implementation of 20,543 publicly accessible LLMs, we observe a fundamental property in LLMs that could be manipulated in an adversarial manner to reduce computation efficiency significantly. Our key observation is that the output length, rather than the input, determines the computation efficiency of LLMs, where the output length depends on two factors: an often sufficiently large yet pessimistic pre-configured threshold controlling the maximum number of iterations, and a runtime-generated end-of-sentence (EOS) token. Our key motivation is to generate test inputs that could sufficiently delay the generation of EOS such that LLMs would have to go through enough iterations to satisfy the pre-configured threshold. We present LLMEffiChecker, which can work under both white-box and black-box settings. In the white-box scenario, LLMEffiChecker develops a gradient-guided technique that searches for a minimal and unnoticeable perturbation at the character, token, and structure levels. In the black-box scenario, LLMEffiChecker employs a causal inference-based approach to find critical tokens and similarly applies three levels of imperceptible perturbation to them. Both the white-box and black-box settings effectively delay the appearance of EOS, compelling these inputs to reach the naturally-unreachable threshold. To demonstrate the effectiveness of LLMEffiChecker, we conduct a systematic evaluation on nine publicly available LLMs: Google T5, AllenAI WMT14, Helsinki-NLP translator, Facebook FairSeq, UNICAMP-DL translator, MarianMT, Google FLAN-T5, MBZUAI LaMini-GPT and Salesforce CodeGen. Experimental results show that LLMEffiChecker can increase LLMs' response latency and energy consumption by, on average, 325% to 3244% and 344% to 3616%, respectively, by perturbing just one character or token in the input sentence. Our case study shows that inputs generated by LLMEffiChecker significantly affect battery power in real-world mobile devices (i.e., they drain more than 30 times the battery power of normal inputs).
CCS Concepts: • Software and its engineering → Search-based software engineering; Software testing and debugging; Automatic programming; Software evolution.
Additional Key Words and Phrases: Machine learning, software testing, large language model
1 INTRODUCTION
Large Language Models (LLMs) are a promising approach that applies neural networks to solve various text generation problems. LLMs have received significant recent attention from both academia [4, 10, 42, 53] and industry [2, 36, 46, 54, 66, 92, 95], due to their advantages over traditional text generation methods (e.g., N-gram language models [67]). For instance, because they can capture long-range dependencies in sentences, LLMs are seeing wide adoption in commercial text generation, including OpenAI's GPT products (e.g., ChatGPT) [6, 11, 57, 60] and Meta's LLaMA products [64, 73, 74].

∗Corresponding author
†Corresponding author
Authors' addresses: Xiaoning Feng, fengxiaoning1746@link.tyut.edu.cn, Taiyuan University of Technology, Taiyuan, China; Xiaohong Han, hanxiaohong@tyut.edu.cn, Taiyuan University of Technology, Taiyuan, China; Simin Chen, simin.chen@UTDallas.edu, The University of Texas at Dallas, Dallas, USA; Wei Yang, wei.yang@utdallas.edu, The University of Texas at Dallas, Dallas, USA.
arXiv:2210.03696v2 [cs.CL] 25 May 2024
Much research has been done on enhancing the accuracy of LLMs [47, 86]. Recently, research [30, 33, 34, 69] has been conducted to understand the accuracy robustness of existing LLMs by developing a series of adversarial test input generation frameworks that reduce the generation accuracy of existing LLMs. While accuracy robustness is clearly important, we observe that the computation efficiency of LLMs, particularly in terms of the latency and energy spent on generating the output for an input of a specific length, is an equally critical property that has surprisingly received little attention. A common and unique characteristic of the LLM domain is the need to process a huge number of real-time requests (e.g., OpenAI's ChatGPT has an average monthly visit volume of 15 billion and an average daily consultation volume of approximately 270 million [28, 49, 62]). The vast demand for generation requests, combined with the real-time requirements, naturally makes the computation efficiency of any LLM one of the most critical optimization goals. In this paper, we make the first attempt to understand and test potential vulnerabilities in the computation efficiency of existing LLMs.
Key observations revealing vulnerabilities in LLMs' computation efficiency. Our findings are motivated by several observations. In particular, through analyzing the working mechanisms and detailed implementation of 20,543 publicly accessible LLMs (e.g., Google FLAN-T5 [19], BigScience T0 [65]), we observe a fundamental property of LLMs that could be manipulated in an adversarial manner to significantly reduce computation efficiency. Specifically, we observe that the computation efficiency of LLMs is highly sensitive to different inputs, even those exhibiting only minor differences. For instance, slightly modifying an input could incur an order of magnitude more computation (e.g., as shown in Fig. 2, inserting a character "b" into the token "Genäckstück" increases the latency of HuggingFace's LLM from 0.876s to 20.382s, an over 20× latency increase). Such a dramatic impact on computation efficiency can occur fundamentally because LLMs often need to invoke the underlying decoder for a non-deterministic number of iterations to generate outputs [50, 76]. Intuitively, the computation efficiency of LLMs is determined by the output length rather than the input, where the output length depends on two factors: an often sufficiently large yet pessimistic pre-configured threshold controlling the maximum number of iterations (e.g., as shown in Fig. 3, a dominant number of our studied LLMs set this threshold to be over 300, which is significantly larger than the actual output length in most cases), and a runtime-generated end-of-sentence (EOS) token. Given these properties, our key motivation is that it may be possible to generate test inputs that sufficiently delay the generation of EOS such that LLMs would have to go through the maximum number of iterations to satisfy the pessimistic pre-configured threshold.
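To make this concrete, the following minimal sketch (ours, not part of LLMEffiChecker; the model name, prompt, and threshold value are illustrative assumptions) measures how the number of generated tokens, rather than the input length, drives the response latency of a HuggingFace-style seq2seq LLM:

import time
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def timed_generate(text, max_length=300):
    # max_length mirrors the pessimistic pre-configured iteration threshold.
    inputs = tokenizer("translate English to German: " + text, return_tensors="pt")
    start = time.perf_counter()
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_length=max_length)
    latency = time.perf_counter() - start
    # Decoding stops at the first EOS token or after max_length iterations,
    # so latency tracks the number of generated tokens, not the input length.
    return output_ids.shape[-1], latency

for sentence in ["I like reading.", "I like raeding."]:  # one-character typo
    n_tokens, latency = timed_generate(sentence)
    print(f"{sentence!r}: {n_tokens} output tokens, {latency:.3f}s")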
This implies an important yet unexplored vulnerability of LLMs: adversarially designed inputs may cause enormous, abnormal computation demand in existing LLMs, thus significantly wasting computational resources and energy, and may adversely impair user experience and even service availability. Such adversarial inputs could have devastating consequences for many real-world applications (as our experiments also demonstrate). For example, abusing computational resources on commercial text generation service providers (e.g., HuggingFace [84]) could negatively impact the quality of service (e.g., enormously long response times or even denial of service). For application domains that are sensitive to latency or energy, such as mobile and IoT devices, abusing computational resources might drain the battery unacceptably fast.
Motivated by these observations, we aim to systematically develop a framework that generates inputs to test the robustness of LLMs w.r.t. computation efficiency. The generated test inputs may significantly increase the computational demand and thus hinder computation efficiency in terms of response latency, energy consumption, and availability. To make such testing practical, any generated test inputs shall not be attack-obvious. One objective is thus to make trivial
or unnoticeable modications on normal textual inputs to generate such test inputs. We present
LLMEffiChecker
that eectively achieves our objectives.
LLMEffiChecker
is developed based on
the aforementioned observation. Specically, LLMs iteratively compute the output token until either
the system generates an end-of-sentence (EOS) token or a pre-congured threshold controlling the
max number of iterations has been met. For our studied 20,543 LLMs
1
, the appearance of EOS is
computed from the underlying DNNs output probability.
LLMEffiChecker
develops techniques that
could perturb input sentences to change the underlying DNNs output probability and suciently
delay the generation of EOS, thus forcing these inputs to reach the naturally-unreachable threshold.
In the white-box setting, LLMEffiChecker develops a gradient-guided technique that searches for a minimal perturbation (at the character, token, and structure levels) that can effectively delay the generation of EOS. In the black-box setting, LLMEffiChecker utilizes a causal inference-based method to identify crucial tokens without relying on gradient information and correspondingly applies three levels of imperceptible perturbation to effectively degrade the efficiency of LLMs. Applying the above minimal perturbation to the seed input results in significantly longer output, costing LLMs more computational resources and thus reducing computation efficiency.
Implementation and evaluation. We have conducted extensive experiments to evaluate the effectiveness of LLMEffiChecker. In particular, we applied LLMEffiChecker to nine real-world, publicly available, and widely used (e.g., with more than 2,714,275 downloads in Nov 2023) LLMs (i.e., Google T5 [29, 61], AllenAI WMT14 [1], Helsinki-NLP [35], Facebook Fairseq [55], UNICAMP-DL Translator [51], MarianMT [52], Google FLAN-T5 [19], MBZUAI LaMini-GPT [85] and Salesforce CodeGen [56]). The selected LLMs are trained on different corpora and feature diverse DNN architectures as well as various configurations. We compare LLMEffiChecker against four state-of-the-art methods that focus on testing LLMs' accuracy and correctness. Evaluation results show that LLMEffiChecker is highly effective in generating test inputs that degrade the computation efficiency of the LLMs under test. Specifically, LLMEffiChecker generates test inputs that increase the LLMs' CPU latency, CPU energy consumption, GPU latency, and GPU energy consumption by 322% to 3154%, 366% to 3053%, 327% to 1969%, and 322% to 1966%, respectively, by perturbing only one character or token in the seed input sentences. Our case study shows that inputs generated by LLMEffiChecker significantly affect battery power in real-world mobile devices (i.e., they drain more than 30 times the battery power of normal inputs).
Contribution. Our contributions are summarized as follows:
• Characterization: We are the first to study and characterize the computation efficiency vulnerability in state-of-the-art LLMs, which may critically impair latency and energy performance, as well as user experience and service availability. This vulnerability is revealed by conducting extensive empirical studies on 20,543 publicly available LLMs, which were downloaded more than 3,260,064 times in Nov 2023. The results show that the revealed vulnerability could exist widely due to a fundamental property of LLMs.
• Approach: We design and implement LLMEffiChecker, the first framework for testing LLMs' computation efficiency. Specifically, given a seed input, LLMEffiChecker applies gradient-guided and causal inference-based methods to mutate the seed input and generate test inputs in the white-box and black-box settings, respectively. Test inputs generated by LLMEffiChecker perturb only one to three tokens in any seed input.
• Evaluation: We evaluate LLMEffiChecker on nine real-world, publicly available LLMs (i.e., Google T5, AllenAI WMT14, Helsinki-NLP, Facebook FairSeq, U-DL Translator, MarianMT, FLAN-T5, LaMini-GPT and CodeGen) against four correctness-based testing methods. In addition, we propose a series of metrics (Eq. (5)) to quantify the effectiveness of the triggered computation efficiency degradation. Evaluation results suggest that existing correctness-based testing methods cannot generate test inputs that impact computation efficiency. In contrast, LLMEffiChecker generates test inputs that increase LLMs' latency and energy consumption by 291% to 12536% and 207% to 11172%, respectively.
• Mitigation: We propose a lightweight method to mitigate possible computation efficiency degradation: running a detector at runtime for input validation. We evaluate the performance of our proposed mitigation method in terms of accuracy and additional overhead. The results confirm the efficacy and efficiency of our proposed mitigation method.

¹https://huggingface.co/models?pipeline_tag=text2text-generation&sort=downloads
This article is a substantial extension of our prior research published at ESEC/FSE 2022 [15]. The extension encompasses several key advancements: (1) Diversification of testing scope: we broaden our focus from efficiency testing specific to neural machine translation (NMT) models to general Large Language Models (LLMs); the scope of our study is now more inclusive, as detailed in Sec. 3. (2) Introduction of a black-box approach: in addition to the original white-box methodology, we introduce a novel black-box approach, as explained in Sec. 5.3; this methodology is designed to operate effectively under realistic scenarios, offering a more robust evaluation of the model's performance. (3) Expanded subject evaluation: going beyond the confines of NMT models, we evaluate our proposed framework on a wider array of subjects, including a comprehensive assessment of the framework's applicability to LLMs for diverse applications such as sentence completion and code generation.
2 BACKGROUND
2.1 Working Mechanism of Large Language Models
[Fig. 1. Working mechanism of LLMs: (a) the Encoder-Decoder architecture; (b) the Decoder-Only architecture.]
Much recent research has been done towards developing more accurate and efficient large language models (LLMs) [9, 50, 59, 70, 75, 76, 86]. A language model computes the conditional probability P(Y|X), where X = [x_1, x_2, ..., x_m] is the input token sequence and Y = [y_1, y_2, ..., y_n] is the output token sequence. Modern LLMs apply neural networks to approximate this conditional probability P(Y|X).
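In practice (this factorization is standard for autoregressive models and is implied by the decoding loop in Listing 1, though not stated explicitly above), the probability is decomposed token by token, which is exactly why the decoder must be invoked once per generated token:

P(Y \mid X) = \prod_{i=1}^{n} P\bigl(y_i \mid y_1, \ldots, y_{i-1}, X\bigr).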
As shown in Fig. 1, the structure of LLMs can be broadly categorized into two types: the Encoder-Decoder architecture (e.g., the Google T5 series) and the Decoder-Only architecture (e.g., the OpenAI GPT series). The encoder f_en(·) encodes the source input X into a hidden representation H, which is then fed into the decoder for decoding. Notably, the attention layers in the encoder can analyze all words within the initial sentence, whereas the attention layers of the decoder f_de(·) can only access the words positioned before a given word in the input.
Consequently, these two architectures are often chosen for different tasks. The Encoder-Decoder architecture is well suited for tasks involving sequence-to-sequence mappings (e.g., translation and summarization). The Decoder-Only architecture is better suited for autoregressive generation tasks, characterized by the sequential generation of output sequences (e.g., text continuation and dialogue systems); it excels at predicting the next piece of text based on the sequence that has already been generated (or a given initial text). An implementation example of LLMs' decoding process is shown in Listing 1². From the code snippet, we observe that the decoding process starts with a special start-of-sentence (SOS) token and iteratively accesses H to generate each token y_i auto-regressively until the end-of-sequence (EOS) token is produced or the maximum number of iterations (e.g., max_length) is reached, whichever comes first. To improve LLMs' accuracy, a common practice is to apply the beam search algorithm, which tracks multiple top tokens at each iteration and selects the best sequence after the whole decoding process.
'''
Decoding process
'''
decoded_words = ['<SOS>']
decoder_input = torch.tensor([[SOS_token]])  # decoding starts from the SOS token
for di in range(max_length):
    decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden,
                                             encoder_outputs)
    topv, topi = decoder_output.data.topk(1)  # greedy: pick the most likely token
    if topi.item() == EOS_token:
        decoded_words.append('<EOS>')
        break
    else:
        decoded_words.append(index2word[topi.item()])
    decoder_input = topi.squeeze().detach()   # feed the predicted token back in
return decoded_words
Listing 1. Source Code Example of LLMs Implementation
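For comparison, the same two stopping conditions are exposed directly by modern generation APIs. The sketch below (the model name, threshold, and beam width are illustrative assumptions, not the configurations evaluated in this paper) shows how the pre-configured iteration threshold, the EOS token, and beam search appear in a HuggingFace-style generate call:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: I like reading.", return_tensors="pt")

# Generation stops either when the EOS token is emitted or when max_length
# decoder iterations have been executed, whichever comes first. With
# num_beams > 1, beam search keeps several candidate continuations per
# iteration and returns the highest-scoring finished sequence.
output_ids = model.generate(
    **inputs,
    max_length=300,                       # pre-configured iteration threshold
    eos_token_id=tokenizer.eos_token_id,  # runtime-generated termination token
    num_beams=4,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))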
2.2 Robustness Testing for NLP Systems
Although modern NLP systems demonstrate human-level performance in terms of accuracy, NLP systems are still far from robust due to the complexity and intractability of the underlying neural networks. To improve the robustness of NLP systems, a series of testing methods have been proposed, which focus on accuracy testing. The core idea of existing work is to perturb seed input sentences with different perturbations and detect output inconsistency between the perturbed and seed outputs. At a high level, the perturbations in existing work can be categorized into three types. (i) Character-level: this type of perturbation [4, 20, 21, 44, 97] represents natural typos and noise in textual inputs, for example, character swap (e.g., noise → nosie), order randomization (e.g., noise → nisoe), character insertion (e.g., noise → noisde), and keyboard typos (e.g., noise → noide). (ii) Token-level: this type of perturbation [18, 44, 63, 69, 90, 94] replaces a few tokens in the seed sentences with other tokens. However, token replacement sometimes completely changes the semantics of the input text; thus, this type of perturbation usually appears in adversarial scenarios. (iii) Structure-level: different from the above two, this type of perturbation [30, 33, 34, 45] seeks to generate legal sentences that do not contain lexical or syntactic errors. For example, [33] proposes a structure-invariant testing method that perturbs seed inputs with Bert [40], such that the perturbed sentences exhibit a sentence structure similar to that of the seed sentences.
²The code snippet is from the PyTorch LLM tutorial.
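To make the character-level category above concrete, the following helper functions (our own illustrative simplifications, not the implementation of any cited testing tool) produce the kinds of edits listed in (i); token-level and structure-level perturbations would replace or rephrase whole tokens instead:

import random

def swap_chars(word):
    # character swap, e.g., noise -> nosie
    if len(word) < 3:
        return word
    i = random.randrange(1, len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ''.join(chars)

def insert_char(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    # character insertion, e.g., noise -> noisde
    i = random.randrange(len(word) + 1)
    return word[:i] + random.choice(alphabet) + word[i:]

def shuffle_inner(word):
    # order randomization (first and last character kept), e.g., noise -> nisoe
    if len(word) < 4:
        return word
    inner = list(word[1:-1])
    random.shuffle(inner)
    return word[0] + ''.join(inner) + word[-1]

print(swap_chars("noise"), insert_char("noise"), shuffle_inner("noise"))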