Practical Program Repair in the Era of Large
Pre-trained Language Models
Chunqiu Steven Xia
University of Illinois at
Urbana-Champaign
chunqiu2@illinois.edu
Yuxiang Wei
University of Illinois at
Urbana-Champaign
ywei40@illinois.edu
Lingming Zhang
University of Illinois at
Urbana-Champaign
lingming@illinois.edu
Abstract—Automated Program Repair (APR) aims to help
developers automatically patch software bugs. However, current
state-of-the-art traditional and learning-based APR techniques
face the problem of limited patch variety, failing to fix complicated
bugs. This is mainly due to the reliance on bug-fixing
datasets to craft fix templates (traditional) or directly predict
potential patches (learning-based). Large Pre-Trained Language
Models (PLMs), trained using billions of text/code tokens, can
potentially help avoid this issue. Very recently, researchers have
directly leveraged PLMs for APR without relying on any bug-
fixing datasets. Meanwhile, such existing work either failed to
include state-of-the-art PLMs or was not evaluated on realistic
datasets. Thus, the true power of modern PLMs on the important
APR problem is yet to be revealed.
In this work, we perform the first extensive study on directly
applying PLMs for APR. We select 9 recent state-of-the-art
PLMs, including both generative and infilling models, ranging
from 125M to 20B in size. We design 3 different repair settings
to evaluate the different ways we can use PLMs to generate
patches: 1) generating the entire patch function, 2) filling in a chunk
of code given the prefix and suffix, and 3) outputting a single-line fix.
We apply the PLMs under these repair settings on 5 datasets
across 3 different languages and compare different PLMs in the
number of bugs fixed, generation speed and compilation rate.
We also compare the PLMs against recent state-of-the-art APR
tools. Our study demonstrates that directly applying state-of-
the-art PLMs can already substantially outperform all existing
APR techniques on all our datasets. Among the studied PLMs,
the scaling effect exists for APR where larger models tend to
achieve better performance. Also, we show for the first time that
suffix code after the buggy line (adopted in infilling-style APR)
is important for generating not only more fixes but also more patches
with a higher compilation rate. Besides patch generation, the PLMs
consider correct patches to be more natural than other ones,
and can even be leveraged for effective patch ranking or patch
correctness checking. Lastly, we show that PLM-based APR can
be further substantially boosted via: 1) increasing the sample
size, and 2) incorporating fix template information.
I. INTRODUCTION
As software programs and systems become more and more
ubiquitous in everyday life, so do software bugs. Due to
the wide-ranging adoption of software systems in fields from
healthcare [1] to transportation [2], these bugs can potentially
cause dangerous safety issues [3] and financial losses [4]. As
such, developers often need to spend a significant amount of
time and effort to fix software bugs [5]. In order to help
developers reduce this manual effort, Automated Program
Repair (APR) tools have been built to automatically generate
potential patches given the original buggy program [6].
Among traditional APR techniques [7]–[16], template-based
APR has been widely recognized as the state of the art [17],
[18]. These techniques leverage fix templates, often designed
by human experts, to fix specific types of bugs in the source
code. As a result, these APR tools are constrained by the
underlying fix templates in the types of bugs that can be
fixed. To combat this, researchers have proposed learning-
based APR tools [19]–[22], which typically model program
repair as a Neural Machine Translation (NMT) problem [23],
where the goal is to translate a buggy program into a fixed
program. The core component of these learning-based APR
tools is an encoder and decoder pair, where the model aims
to capture the buggy context via the encoder and then autoregressively
generate the patch using the decoder. As such, these
learning-based APR tools require supervised training datasets
containing pairs of buggy and patched code, usually obtained
by mining historical bug fixes from open-source repositories.
While learning-based APR tools have shown improvements in
both the number and variety of bugs that can be fixed [19],
[20], they are still restricted by their training data, which
only contains a limited number of bug-fix types and may not
generalize to unseen bug types [24].
Recent developments in building Large Pre-Trained
Language Models (PLMs) offer an alternative solution that can
be applied for program repair without relying on historical
bug fixes. While PLMs are usually general-purpose tools for
NLP tasks (e.g., GPT-3 [25]), they have also been used for
programming languages by finetuning on code (e.g., Codex [26]).
Unlike the specifically designed learning-based APR models,
PLMs are trained in an unsupervised fashion using up to
billions of text/code tokens and can be used in a variety of
code tasks. Recently, AlphaRepair [24] proposes to leverage
CodeBERT [27], a large code model pre-trained on millions
of code snippets, directly for APR. The key insight of
AlphaRepair is that, instead of learning transformations to go from
buggy code to fixed code, we can directly use the model
to predict what the correct code should look like given its
surrounding context (including both prefix and suffix), i.e.,
infilling-style APR. Using this idea, AlphaRepair demonstrated
state-of-the-art repair results without finetuning on any bug-fixing
dataset. While AlphaRepair has shown improvements over
previous learning-based APR, the model (125M parameters)
it uses is far smaller than the current state-of-the-art PLMs
(Codex: 12B parameters and GPT-3: 175B parameters). Besides
AlphaRepair, researchers have also directly leveraged Codex
for generative APR [28], [29], i.e., generating the fixes based
on the context before bugs (i.e., prefix only). However, these
studies mostly focus on Codex and are only evaluated on a
small dataset with 40 bugs on simple programming tasks.
Current state-of-the-art PLMs such as Codex [26] and
INCODER [30] have also included evaluations for code-related
tasks such as code completion, docstring generation, and
variable/type prediction. However, these evaluations still mainly
focus on NLP metrics such as BLEU score [31] which do not
accurately measure the functional or semantic correctness of
the generated code. Furthermore, the datasets consist of hand-
curated code problems which do not accurately reflect the type
of projects developers work on in the real world.
Our Work. We present the first extensive evaluation of recent
PLMs for fixing real-world projects. We design 3 different
APR experimental settings: 1) complete function generation, 2)
correct code infilling, and 3) single-line generation, to showcase
the different ways PLMs can be applied for APR. In our
study, we include both popular types of PLM architectures
(generative and infilling models) to show the advantages and
flaws of using each type for APR. We include models with
a wide range of different parameter sizes, spanning from 125
million to 20 billion. We evaluate not only the improvement in
repair effectiveness but also the trade-off with respect to speed
when increasing the model size. In total, we use 5 different
repair datasets containing real open-source bugs and developer
written tests across 3 programming languages to evaluate APR
under realistic settings. Compared with existing applications of
PLMs for APR [24], [28], [29], our study is the first to include
state-of-the-art PLMs for both infilling-style and generative
APR on various datasets and programming languages. To
summarize, this paper makes the following contributions.
Dimension. This paper bridges the gap between the recent
advances in PLMs and a crucial software engineering
problem – APR. This paper not only demonstrates the
potential and future for directly leveraging PLMs for solving
the important APR problem, but also provides a realistic
evaluation scenario for the recent PLMs, which were mainly
evaluated on simple/synthetic coding problems rather than
real-world systems as studied in the APR area.
Study. We conduct extensive evaluations using 9 different
recent PLMs on 5 different repair datasets across 3 different
programming languages (Java, Python, and C). We compare
the PLMs against each other using the 3 repair settings
we designed. Using the popular repair datasets, we further
compare the PLMs with state-of-the-art APR tools.
Practical Guidelines. Our study shows for the first time
that directly applying state-of-the-art PLMs can already
substantially outperform all existing APR tools on the widely
studied Defects4J 1.2 dataset (and other ones), e.g., Codex
can fix 32 more bugs than the existing best APR technique.
Among the studied PLMs, the scaling effect exists for APR
where larger models tend to deliver stronger APR results.
Also, we show for the first time that suffix code after the
buggy line (adopted in infilling-style APR) is important for
generating not only more fixes but also more patches with
a higher compilation rate. Besides patch generation, the PLMs
consider correct patches to be more natural than other
ones, and can even be used for effective patch ranking or
correctness checking. Lastly, we show that PLM-based APR
can be further substantially improved via: 1) increasing the
sample size, and 2) incorporating fix template information.
II. BACKGROUND AND RELATED WORK
A. Large Pre-Trained Language Model
Large Pre-Trained Language Models (PLMs) have become
ubiquitous in the domain of NLP, achieving impressive
performance in many tasks such as machine translation [23], text
summarization [32] and classification [33]. PLMs follow the
Transformer architecture [34] – an encoder to capture input
representation and a decoder to generate output tokens. These
PLMs are first pre-trained in an unsupervised manner on
large amounts of text data and then finetuned for downstream
tasks. However, certain tasks may not have an abundance of
finetuning data available. As such, researchers have evaluated
the ability of PLMs to perform downstream tasks without
finetuning. This is achieved via prompt engineering [35] –
providing the model with natural language descriptions and
demonstrations of the task it is trying to solve before giving the
model the target input. This works by leveraging the general-
purpose setup of PLMs where the unsupervised pretraining
dataset already encompasses many domains of problems/tasks.
Using this idea and the exponential growth in PLM size [36],
impressive performance in many tasks can be achieved even
without any finetuning [25].
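To illustrate, the snippet below sketches what a few-shot repair prompt could look like; the prompt format here is a hypothetical example, not the exact template used in this study or by any particular PLM.

```python
# Hypothetical few-shot prompt for repair via prompt engineering: a task
# description, one buggy/fixed demonstration, and the target buggy input
# that the PLM is asked to continue. Real prompt templates vary by model.
FEW_SHOT_REPAIR_PROMPT = """\
# Fix the bug in the following function.

# Buggy:
def is_even(n):
    return n % 2 == 1

# Fixed:
def is_even(n):
    return n % 2 == 0

# Buggy:
def max_of(a, b):
    return a if a < b else b

# Fixed:
"""
# A generative PLM continues the prompt, ideally producing the fixed
# version of max_of (i.e., returning a if a > b else b).
```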
PLMs can be classified into encoder-only, decoder-only and
encoder-decoder models based on their architectures. Encoder-
only models (such as BERT [37]) contain only the encoder
component of a Transformer. They are typically designed to
learn data representations and are trained using the Masked
Language Modeling (MLM) objective – a small percentage
(e.g., 15%) of tokens in the training data will be replaced by
masked tokens, and then the models are trained to predict the
original values of the masked tokens based on the bidirectional
contexts. Decoder-only models (such as GPT-3 [25] and GPT-
Neo [38]) are large generative models that use the decoder to
predict the next token output given all previous tokens (i.e., left
context or prefix only). To combine the usage of both encoder
and decoder, encoder-decoder models (such as T5 [39] and
BART [40]) have also been proposed for sequence-to-sequence
tasks where the training objective aims to recover the correct
output sequence given the original input (e.g., corrupted to
uncorrupted). One such training objective is the span prediction
task, where random spans (multiple tokens) are replaced with
artificial span tokens and the model is tasked with recovering
the original tokens. For inference, one can use the encoder-
decoder models to infill text by inserting the artificial
span token at the desired location. Recently, researchers have also combined
MLM with generative models to perform both bidirectional
and autoregressive text generation or infilling [41]. In our APR
scenario, all types of PLMs can potentially be leveraged for
generative or infilling-style APR, and we select 9 state-of-the-
art PLMs for our study (detailed in Section III-A).
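As a minimal sketch of how an encoder-decoder model can be queried for span infilling, the snippet below assumes the public Salesforce/codet5-base checkpoint and its <extra_id_0> sentinel convention; other encoder-decoder models use different special tokens.

```python
# Minimal sketch of encoder-decoder span infilling (assuming the public
# Salesforce/codet5-base checkpoint and its <extra_id_0> sentinel token;
# other encoder-decoder models use different special tokens).
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# Replace the span we want the model to recover with a sentinel token.
code = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(code, return_tensors="pt").input_ids

# The decoder autoregressively predicts the tokens hidden by the sentinel.
generated_ids = model.generate(input_ids, max_length=10)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```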
B. Automated Program Repair
Automated Program Repair (APR) tools are used to generate
patched code given the original code and the corresponding
buggy location. Each patch generated by the APR tool is
validated against the test suite. Plausible patches are ones
which pass the entire suite. Correct patches are plausible
patches which correctly fix the underlying bug.
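A minimal sketch of this validation loop is shown below; the apply_patch and run_test_suite helpers are hypothetical placeholders for a concrete build/test harness.

```python
# Hypothetical sketch of the patch validation workflow described above:
# a patch is *plausible* if the patched program passes the entire test
# suite; judging *correctness* (semantic equivalence to the developer
# fix) still requires a separate, typically manual, check.
def classify_patches(candidate_patches, apply_patch, run_test_suite):
    plausible = []
    for patch in candidate_patches:
        patched_program = apply_patch(patch)   # splice the patch into the buggy code
        if patched_program is None:            # e.g., the patch does not compile
            continue
        if run_test_suite(patched_program):    # all tests pass -> plausible patch
            plausible.append(patch)
    return plausible   # correct patches are a (manually verified) subset of these
```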
Traditional APR tools can be classified as heuristic-
based [7]–[9], constraint-based [10]–[12] and template-
based [13]–[17]. Traditionally, template-based APR tools
achieve the best performance, where each template is hand-
crafted by human experts and designed to provide a fix for a
specific type of bug. However, these template-based APR tools
can only fix the bug types that are part of the templates. As
a result, researchers employed learning-based APR tools to
generate more expressive patches. Learning-based APR tools
such as Recoder [19], RewardRepair [21], and CURE [20]
are based on NMT techniques [23] which require specific bug
fixing data to train the NMT model to generate a fix line
given the buggy line. Due to this reliance on the bug-fixing
data, these learning-based tools are still limited in terms of the
types of fixes they can apply. The recent work AlphaRepair [24]
addresses this by performing APR under a zero-shot setting,
directly using the CodeBERT model for repair. AlphaRepair
fills the original buggy line with masked tokens and uses
CodeBERT to replace the masked tokens with correct code
tokens to generate repair, i.e., infilling-style (also called cloze-
style) APR. While AlphaRepair is able to achieve state-of-the-
art results, CodeBERT is considerably smaller than the newest
PLMs. Additionally, AlphaRepair is only tested on a single
setting where the correct location of the buggy line is known.
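The snippet below gives an illustrative sketch of this cloze-style idea, assuming the microsoft/codebert-base-mlm checkpoint (which retains the masked-language-modeling head) and the standard fill-mask pipeline; AlphaRepair's actual mask templates and re-ranking are more elaborate than shown here.

```python
# Illustrative sketch of cloze-style repair with a masked language model
# (assumed checkpoint: microsoft/codebert-base-mlm; AlphaRepair's actual
# mask templates and re-ranking are more elaborate than shown here).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="microsoft/codebert-base-mlm")

# Mask out part of the suspicious line and let the model propose tokens,
# using both the prefix and the suffix around the mask as context.
buggy_context = "def is_even(n):\n    return n % 2 == <mask>\n"
for candidate in fill_mask(buggy_context):
    print(candidate["token_str"], candidate["score"])
```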
Recent work [28], [29] has also looked into directly applying
PLMs for APR. Prenner et al. [29] conducted a small-
scale evaluation for the Codex model on a simple dataset
containing both Java and Python versions of buggy algorithm
implementations. Codex is given the buggy function and,
by using prompt engineering, is then asked to generate
a complete fixed function. The results show that Codex is
competitive with state-of-the-art learning-based APR tools in
Python but worse in Java. In contrast, we show that by using
our practical repair settings, PLMs are able to outperform
state-of-the-art APR tools on both Java and Python. Kolak
et al. [28] also used Codex along with 2 smaller PLMs and
evaluated their ability to generate the correct patch line when
given the code prefix on the same dataset as the previous
work [29]. The evaluation demonstrated the scaling effect of
PLMs where the repair results can be improved by using larger
models. Interestingly, the study leverages sum entropy for
patch ranking while AlphaRepair leverages mean entropy (i.e.,
both favor more natural [42] patches). Thus, we also perform
a study of leveraging various recent PLMs for computing both
entropies for patch ranking on real-world systems.

TABLE I: Evaluation PLM overview

Model      #Parameters       Training Dataset            Type
GPT-Neo    125M/1.3B/2.7B    The Pile                    Generative
GPT-J      6.7B              The Pile                    Generative
GPT-NeoX   20B               The Pile                    Generative
Codex      12B               N.R.                        Generative & Infilling
CodeT5     220M              CodeSearchNet & BigQuery    Infilling
INCODER    1.3B/6.7B         N.R.                        Infilling
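To make the entropy-based ranking above concrete, the sketch below scores a candidate patch by the sum and mean negative log-probability of its tokens under a causal PLM; GPT-Neo 125M is used purely as an example, and this formulation is one common definition that may differ in detail from the cited tools.

```python
# Hedged sketch: scoring a candidate patch by the sum / mean negative
# log-probability ("entropy") of its tokens under a causal PLM. GPT-Neo
# 125M is used purely as an example; the exact entropy definition used by
# the cited tools may differ from this common formulation.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")

def patch_entropies(patch_text: str):
    ids = tokenizer(patch_text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Probability of each token conditioned on its prefix (shift by one).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_nll = -log_probs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]
    return token_nll.sum().item(), token_nll.mean().item()

# Lower entropy means the patch looks more "natural" to the model.
sum_entropy, mean_entropy = patch_entropies("    return n % 2 == 0;")
```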
Overall, the 2 prior studies [28], [29] are done on a small
dataset with synthetic bugs using only a small number of
PLMs. Moreover, the input and repair settings used in these
studies are also limited, e.g., they only consider generative APR.
In this paper, we present an extensive study of applying various
state-of-the-art PLMs for both infilling-style and generative
APR on diverse repair datasets across programming languages.
III. APPROACH
In this section we describe the PLMs selected for evaluation
and introduce 3 different APR generation settings we use to
evaluate each PLM. These settings are designed to showcase
the different practical ways we can directly use PLMs for
APR and highlight advantages and differences of the studied
PLM types. Also, we detail the patch ranking strategy of using
entropy to prioritize patches that are more likely to be correct.
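As a hedged illustration of how model inputs could be constructed under the three settings, consider the sketch below; the concrete prompt templates used in the study may differ.

```python
# Hypothetical construction of model inputs for the three repair settings;
# the concrete prompt templates used in the study may differ.
prefix = "def is_even(n):\n"             # code before the buggy line
buggy_line = "    return n % 2 == 1\n"   # line reported as buggy
suffix = ""                              # code after the buggy line (empty here)

# 1) Complete function generation: a generative model rewrites the whole
#    function given the buggy function (and optionally a short instruction).
setting_1 = "# Provide a fix for the buggy function\n" + prefix + buggy_line + "\n# Fixed function:\n"

# 2) Correct code infilling: an infilling model fills in the missing chunk
#    between the prefix and the suffix (it sees context on both sides).
setting_2 = {"prefix": prefix, "suffix": suffix}

# 3) Single-line generation: a generative model produces one replacement
#    line given only the prefix (left context).
setting_3 = prefix
```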
A. Models
We begin by describing the different PLMs we use for
evaluation. Our selection process starts with the list of popular
models hosted on Hugging Face [43] – an open-source
platform to host and deploy large models. We sort the list
of models based on popularity (#downloads this month) and
select the PLMs which contain code as training data.
Furthermore, we also pick models from different organizations
and types (described below) to obtain a diverse set of models.
Along with the open-source models, we also use the closed-
source Codex model [26] (accessible only via API) since it
has been shown to achieve impressive performance on code-related
tasks. In total, we use 9 different PLMs for our experiments.
Our chosen PLMs range from 125M to 20B in parameter
size. Table I presents the PLM overview. Column Model is
the model name, #Parameters presents the number of model
parameters, Training Dataset indicates the dataset used for
pre-training (N.R. is not released), and Type refers to the type
of APR the model can perform (infilling or generative).
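For reference, the sketch below shows how one of the open-source PLMs can be loaded from Hugging Face and sampled for candidate completions; the generation hyperparameters are illustrative placeholders rather than the study's configuration.

```python
# Illustrative loading and sampling of one open-source PLM from Hugging
# Face; the generation hyperparameters are placeholders rather than the
# configuration used in the study.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "def is_even(n):\n    return"
inputs = tokenizer(prompt, return_tensors="pt")
samples = model.generate(
    **inputs,
    do_sample=True,            # sample multiple diverse candidate patches
    top_p=0.95,
    temperature=0.8,
    max_new_tokens=32,
    num_return_sequences=5,
    pad_token_id=tokenizer.eos_token_id,
)
for sample in samples:
    print(tokenizer.decode(sample, skip_special_tokens=True))
```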
1) Generative Models:
GPT-Neo [38], GPT-J [44], GPT-NeoX [45]: All three
models are open-source implementations of the GPT-3
Transformer architecture [25]. In our experiments, we use GPT-
Neo models with 125M, 1.3B and 2.7B parameters. GPT-J
and GPT-NeoX are even larger models with 6.7B and 20B
parameters. These models were trained on The Pile [46],