LLMEffiChecker: Understanding and Testing Efficiency Degradation of Large Language Models
XIAONING FENG, Taiyuan University of Technology, China
XIAOHONG HAN∗, Taiyuan University of Technology, China
SIMIN CHEN†, The University of Texas at Dallas, USA
WEI YANG, The University of Texas at Dallas, USA
Large Language Models (LLMs) have received much recent attention due to their human-level accuracy. While existing works mostly focus on either improving accuracy or testing accuracy robustness, the computation efficiency of LLMs, which is of paramount importance due to often vast generation demands and real-time requirements, has surprisingly received little attention. In this paper, we make the first attempt to understand and test potential computation efficiency robustness in state-of-the-art LLMs. By analyzing the working mechanism and implementation of 20,543 publicly accessible LLMs, we observe a fundamental property in LLMs that could be manipulated in an adversarial manner to reduce computation efficiency significantly. Our key observation is that the output length, rather than the input, determines the computation efficiency of LLMs, where the output length depends on two factors: an often sufficiently large yet pessimistic pre-configured threshold controlling the maximum number of iterations, and a runtime-generated end-of-sentence (EOS) token. Our key motivation is to generate test inputs that could sufficiently delay the generation of EOS such that LLMs would have to go through enough iterations to satisfy the pre-configured threshold. We present LLMEffiChecker, which can work under both white-box and black-box settings. In the white-box scenario, LLMEffiChecker develops a gradient-guided technique that searches for a minimal and unnoticeable perturbation at the character, token, and structure levels. In the black-box scenario, LLMEffiChecker employs a causal inference-based approach to find critical tokens and similarly applies three levels of imperceptible perturbation to them. Both the white-box and black-box settings effectively delay the appearance of EOS, compelling these inputs to reach the naturally-unreachable threshold. To demonstrate the effectiveness of LLMEffiChecker, we conduct a systematic evaluation on nine publicly available LLMs: Google T5, AllenAI WMT14, Helsinki-NLP translator, Facebook FairSeq, UNICAMP-DL translator, MarianMT, Google FLAN-T5, MBZUAI LaMini-GPT and Salesforce CodeGen. Experimental results show that LLMEffiChecker can increase LLMs' response latency and energy consumption by, on average, 325% to 3244% and 344% to 3616%, respectively, by perturbing just one character or token in the input sentence. Our case study shows that inputs generated by LLMEffiChecker significantly affect battery power in real-world mobile devices (i.e., they drain more than 30 times the battery power of normal inputs).
CCS Concepts: • Software and its engineering → Search-based software engineering; Software testing and debugging; Automatic programming; Software evolution.
Additional Key Words and Phrases: Machine learning, software testing, large language model
1 INTRODUCTION
Large Language Models (LLMs) are a promising approach that applies neural networks to solve various text generation problems. LLMs have received significant recent attention from both academia [4, 10, 42, 53] and industry [2, 36, 46, 54, 66, 92, 95], due to their advantages over traditional text generation methods (e.g., N-gram language models [67]). For instance, because they can capture long-range dependencies in sentences, LLMs are seeing wide adoption in commercial text generation, including OpenAI's GPT products (e.g., ChatGPT) [6, 11, 57, 60] and Meta's LLaMA products [64, 73, 74].

∗Corresponding author
†Corresponding author
Authors' addresses: Xiaoning Feng, fengxiaoning1746@link.tyut.edu.cn, Taiyuan University of Technology, Taiyuan, China; Xiaohong Han, hanxiaohong@tyut.edu.cn, Taiyuan University of Technology, Taiyuan, China; Simin Chen, simin.chen@UTDallas.edu, The University of Texas at Dallas, Dallas, USA; Wei Yang, wei.yang@utdallas.edu, The University of Texas at Dallas, Dallas, USA.
arXiv:2210.03696v2 [cs.CL] 25 May 2024
Much research has been done on enhancing the accuracy of LLMs [47, 86]. Recently, research [30, 33, 34, 69] has been conducted to understand the accuracy robustness of existing LLMs by developing a series of adversarial test input generation frameworks that reduce the generation accuracy of existing LLMs. While accuracy robustness is clearly important, we observe that the computation efficiency of LLMs, particularly in terms of the latency and energy spent on generating the output for an input of a specific length, is an equally critical property that has surprisingly received little attention. A common and unique characteristic of the LLM domain is the need to process a huge number of real-time requests (e.g., OpenAI's ChatGPT has an average monthly visit volume of 15 billion and an average daily consultation volume of approximately 270 million [28, 49, 62]). The vast demand for generation requests, combined with the real-time requirements, naturally makes the computation efficiency of any LLM one of the most critical optimization goals. In this paper, we make the first attempt to understand and test potential vulnerabilities in the computation efficiency of existing LLMs.
Key observations revealing vulnerabilities in LLMs' computation efficiency. Our findings are motivated by several observations. In particular, through analyzing the working mechanisms and detailed implementation of 20,543 publicly accessible LLMs (e.g., Google FLAN-T5 [19], BigScience T0 [65]), we observe a fundamental property of LLMs that could be manipulated in an adversarial manner to significantly reduce computation efficiency. Specifically, we observe that the computation efficiency of LLMs is highly sensitive to different inputs, even those exhibiting only minor differences. For instance, slightly modifying an input could incur an order of magnitude more computation (e.g., as shown in Fig. 2, inserting a character "b" into the token "Genäckstück" increases the latency of HuggingFace's LLM from 0.876s to 20.382s, an over 20× latency increase). Such a dramatic impact on computation efficiency can occur fundamentally because LLMs often need to invoke the underlying decoder for a non-deterministic number of iterations to generate outputs [50, 76]. Intuitively, the computation efficiency of LLMs is determined by the output length rather than the input, where the output length depends on two factors: an often sufficiently large yet pessimistic pre-configured threshold controlling the maximum number of iterations (e.g., as shown in Fig. 3, a dominant number of our studied LLMs set this threshold to be over 300, which is significantly larger than the actual output length in most cases), and a runtime-generated end-of-sentence (EOS) token. Given these properties, our key motivation is that it may be possible to generate test inputs that sufficiently delay the generation of EOS such that LLMs would have to go through the maximum number of iterations to satisfy the pessimistic pre-configured threshold.
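To make this concrete, the following minimal sketch (ours, not part of LLMEffiChecker; the model name, prompt, and threshold value are illustrative assumptions) measures how the number of generated tokens, rather than the input length, drives the response latency of a HuggingFace-style seq2seq LLM:

import time
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def timed_generate(text, max_length=300):
    # max_length mirrors the pessimistic pre-configured iteration threshold.
    inputs = tokenizer("translate English to German: " + text, return_tensors="pt")
    start = time.perf_counter()
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_length=max_length)
    latency = time.perf_counter() - start
    # Decoding stops at the first EOS token or after max_length iterations,
    # so latency tracks the number of generated tokens, not the input length.
    return output_ids.shape[-1], latency

for sentence in ["I like reading.", "I like raeding."]:  # one-character typo
    n_tokens, latency = timed_generate(sentence)
    print(f"{sentence!r}: {n_tokens} output tokens, {latency:.3f}s")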
This implies an important yet unexplored vulnerability of LLMs: adversarially designed inputs may cause enormous, abnormal computation demand in existing LLMs, thus significantly wasting computational resources and energy, and may adversely impair user experience and even service availability. Such adversarial inputs could have devastating consequences for many real-world applications (as our experiments also demonstrate). For example, abusing computational resources on commercial text generation service providers (e.g., HuggingFace [84]) could negatively impact the quality of service (e.g., enormously long response times or even denial of service). For application domains that are sensitive to latency or energy, such as mobile and IoT devices, abusing computational resources might drain the battery unacceptably fast.
Motivated by these observations, we aim to systematically develop a framework that generates inputs to test the robustness of LLMs w.r.t. computation efficiency. The generated test inputs may significantly increase the computational demand and thus hinder computation efficiency in terms of response latency, energy consumption, and availability. To make such testing practical, any generated test inputs shall not be attack-obvious. One objective is thus to make trivial
or unnoticeable modications on normal textual inputs to generate such test inputs. We present
LLMEffiChecker
that eectively achieves our objectives.
LLMEffiChecker
is developed based on
the aforementioned observation. Specically, LLMs iteratively compute the output token until either
the system generates an end-of-sentence (EOS) token or a pre-congured threshold controlling the
max number of iterations has been met. For our studied 20,543 LLMs
1
, the appearance of EOS is
computed from the underlying DNNs output probability.
LLMEffiChecker
develops techniques that
could perturb input sentences to change the underlying DNNs output probability and suciently
delay the generation of EOS, thus forcing these inputs to reach the naturally-unreachable threshold.
In the white-box setting, LLMEffiChecker develops a gradient-guided technique that searches for a minimal perturbation (at the character, token, and structure levels) that can effectively delay the generation of EOS. In the black-box setting, LLMEffiChecker utilizes a causal inference-based method to identify crucial tokens without relying on gradient information and correspondingly applies three levels of imperceptible perturbation to effectively degrade the efficiency of LLMs. Applying the above minimal perturbation to the seed input results in significantly longer output, costing LLMs more computational resources and thus reducing computation efficiency.
Implementation and evaluation. We have conducted extensive experiments to evaluate the effectiveness of LLMEffiChecker. In particular, we applied LLMEffiChecker to nine real-world, publicly available, and widely used (e.g., with more than 2,714,275 downloads in Nov 2023) LLMs (i.e., Google T5 [29, 61], AllenAI WMT14 [1], Helsinki-NLP [35], Facebook Fairseq [55], UNICAMP-DL Translator [51], MarianMT [52], Google FLAN-T5 [19], MBZUAI LaMini-GPT [85] and Salesforce CodeGen [56]). The selected LLMs are trained on different corpora and feature diverse DNN architectures as well as various configurations. We compare LLMEffiChecker against four state-of-the-art methods that focus on testing LLMs' accuracy and correctness. Evaluation results show that LLMEffiChecker is highly effective in generating test inputs that degrade the computation efficiency of the LLMs under test. Specifically, LLMEffiChecker generates test inputs that increase the LLMs' CPU latency, CPU energy consumption, GPU latency, and GPU energy consumption by 322% to 3154%, 366% to 3053%, 327% to 1969%, and 322% to 1966%, respectively, by perturbing only one character or token in the seed input sentences. Our case study shows that inputs generated by LLMEffiChecker significantly affect battery power in real-world mobile devices (i.e., they drain more than 30 times the battery power of normal inputs).
Contribution. Our contributions are summarized as follows:
• Characterization: We are the first to study and characterize the computation efficiency vulnerability in state-of-the-art LLMs, which may critically impair latency and energy performance, as well as user experience and service availability. This vulnerability is revealed by conducting extensive empirical studies on 20,543 publicly available LLMs, which were downloaded more than 3,260,064 times in Nov 2023. The results show that the revealed vulnerability could exist widely due to a fundamental property of LLMs.
• Approach: We design and implement LLMEffiChecker, the first framework for testing LLMs' computation efficiency. Specifically, given a seed input, LLMEffiChecker applies gradient-guided and causal inference-based methods to mutate the seed input and generate test inputs in the white-box and black-box settings, respectively. Test inputs generated by LLMEffiChecker perturb only one to three tokens in any seed input.
• Evaluation: We evaluate LLMEffiChecker on nine real-world, publicly available LLMs (i.e., Google T5, AllenAI WMT14, Helsinki-NLP, Facebook FairSeq, U-DL Translator, MarianMT, FLAN-T5, LaMini-GPT and CodeGen) against four correctness-based testing methods. In addition, we propose a series of metrics (Eq. (5)) to quantify the effectiveness of the triggered computation efficiency degradation. Evaluation results suggest that existing correctness-based testing methods cannot generate test inputs that impact computation efficiency. In contrast, LLMEffiChecker generates test inputs that increase LLMs' latency and energy consumption by 291% to 12536% and 207% to 11172%, respectively.
• Mitigation: We propose a lightweight method to mitigate possible computation efficiency degradation: running a detector at runtime for input validation. We evaluate the performance of our proposed mitigation method in terms of accuracy and additional overhead. The results confirm the efficacy and efficiency of our proposed mitigation method.

¹https://huggingface.co/models?pipeline_tag=text2text-generation&sort=downloads
This article is a substantial extension of our prior research published at ESEC/FSE 2022 [15]. The extension encompasses several key advancements: (1) Diversification of testing scope: we broaden our focus from efficiency testing specific to neural machine translation (NMT) models to general Large Language Models (LLMs); the scope of our study is now more inclusive, as detailed in Sec. 3. (2) Introduction of a black-box approach: in addition to the original white-box methodology, we introduce a novel black-box approach, as explained in Sec. 5.3; this methodology is designed to operate effectively under realistic scenarios, offering a more robust evaluation of the model's performance. (3) Expanded subject evaluation: going beyond the confines of NMT models, we evaluate our proposed framework on a wider array of subjects, including a comprehensive assessment of the framework's applicability to LLMs for diverse applications such as sentence completion and code generation.
2 BACKGROUND
2.1 Working Mechanism of Large Language Models
[Fig. 1. Working mechanism of LLMs: (a) the Encoder-Decoder architecture; (b) the Decoder-Only architecture.]
Much recent research has been done towards developing more accurate and efficient large language models (LLMs) [9, 50, 59, 70, 75, 76, 86]. A language model computes the conditional probability P(Y|X), where X = [x_1, x_2, ..., x_m] is the input token sequence and Y = [y_1, y_2, ..., y_n] is the output token sequence. Modern LLMs apply neural networks to approximate this conditional probability P(Y|X).
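In practice (this factorization is standard for autoregressive models and is implied by the decoding loop in Listing 1, though not stated explicitly above), the probability is decomposed token by token, which is exactly why the decoder must be invoked once per generated token:

P(Y \mid X) = \prod_{i=1}^{n} P\bigl(y_i \mid y_1, \ldots, y_{i-1}, X\bigr).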
As shown in Fig. 1, the structure of LLMs can be broadly categorized into two types: the Encoder-Decoder architecture (e.g., the Google T5 series) and the Decoder-Only architecture (e.g., the OpenAI GPT series). The encoder f_en(·) encodes the source input X into a hidden representation H, which is then fed into the decoder for decoding. Notably, the attention layers in the encoder can analyze all words within the initial sentence, whereas the attention layers of the decoder f_de(·) can only access the words positioned before a given word in the input.
Consequently, these two architectures are often chosen for different tasks. The Encoder-Decoder architecture is well suited for tasks involving sequence-to-sequence mappings (e.g., translation and summarization). The Decoder-Only architecture is better suited for autoregressive generation tasks, characterized by the sequential generation of output sequences (e.g., text continuation and dialogue systems); it excels at predicting the next piece of text based on the sequence that has already been generated (or a given initial text). An implementation example of LLMs' decoding process is shown in Listing 1². From the code snippet, we observe that the decoding process starts with a special start-of-sentence (SOS) token and iteratively accesses H to generate each token y_i auto-regressively until the end-of-sequence (EOS) token is produced or the maximum number of iterations (e.g., max_length) is reached, whichever comes first. To improve LLMs' accuracy, a common practice is to apply the beam search algorithm, which tracks multiple top tokens at each iteration and selects the best sequence after the whole decoding process.
'''
Decoding process
'''
decoded_words = ['<SOS>']
decoder_input = torch.tensor([[SOS_token]])  # decoding starts from the SOS token
for di in range(max_length):
    decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden,
                                             encoder_outputs)
    topv, topi = decoder_output.data.topk(1)  # greedy: pick the most likely token
    if topi.item() == EOS_token:
        decoded_words.append('<EOS>')
        break
    else:
        decoded_words.append(index2word[topi.item()])
    decoder_input = topi.squeeze().detach()   # feed the predicted token back in
return decoded_words
Listing 1. Source Code Example of LLMs Implementation
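For comparison, the same two stopping conditions are exposed directly by modern generation APIs. The sketch below (the model name, threshold, and beam width are illustrative assumptions, not the configurations evaluated in this paper) shows how the pre-configured iteration threshold, the EOS token, and beam search appear in a HuggingFace-style generate call:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: I like reading.", return_tensors="pt")

# Generation stops either when the EOS token is emitted or when max_length
# decoder iterations have been executed, whichever comes first. With
# num_beams > 1, beam search keeps several candidate continuations per
# iteration and returns the highest-scoring finished sequence.
output_ids = model.generate(
    **inputs,
    max_length=300,                       # pre-configured iteration threshold
    eos_token_id=tokenizer.eos_token_id,  # runtime-generated termination token
    num_beams=4,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))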
2.2 Robustness Testing for NLP Systems
Although modern NLP systems demonstrate human-level performance in terms of accuracy, NLP systems are still far from robust due to the complexity and intractability of the underlying neural networks. To improve the robustness of NLP systems, a series of testing methods have been proposed, which focus on accuracy testing. The core idea of existing work is to perturb seed input sentences with different perturbations and detect output inconsistency between the perturbed and seed outputs. At a high level, the perturbations in existing work can be categorized into three types. (i) Character-level: this type of perturbation [4, 20, 21, 44, 97] represents natural typos and noise in textual inputs, for example, character swap (e.g., noise → nosie), order randomization (e.g., noise → nisoe), character insertion (e.g., noise → noisde), and keyboard typos (e.g., noise → noide). (ii) Token-level: this type of perturbation [18, 44, 63, 69, 90, 94] replaces a few tokens in the seed sentences with other tokens. However, token replacement sometimes completely changes the semantics of the input text; thus, this type of perturbation usually appears in adversarial scenarios. (iii) Structure-level: different from the above two, this type of perturbation [30, 33, 34, 45] seeks to generate legal sentences that do not contain lexical or syntactic errors. For example, [33] proposes a structure-invariant testing method that perturbs seed inputs with Bert [40], such that the perturbed sentences exhibit a sentence structure similar to that of the seed sentences.
²The code snippet is from the PyTorch LLM tutorial.
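To make the character-level category above concrete, the following helper functions (our own illustrative simplifications, not the implementation of any cited testing tool) produce the kinds of edits listed in (i); token-level and structure-level perturbations would replace or rephrase whole tokens instead:

import random

def swap_chars(word):
    # character swap, e.g., noise -> nosie
    if len(word) < 3:
        return word
    i = random.randrange(1, len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ''.join(chars)

def insert_char(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    # character insertion, e.g., noise -> noisde
    i = random.randrange(len(word) + 1)
    return word[:i] + random.choice(alphabet) + word[i:]

def shuffle_inner(word):
    # order randomization (first and last character kept), e.g., noise -> nisoe
    if len(word) < 4:
        return word
    inner = list(word[1:-1])
    random.shuffle(inner)
    return word[0] + ''.join(inner) + word[-1]

print(swap_chars("noise"), insert_char("noise"), shuffle_inner("noise"))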