previous learning-based APR, the model (125M parameters)
it uses is far smaller than the current state-of-the-art PLMs
(Codex: 12B parameters and GPT-3: 175B parameters). Besides
AlphaRepair, researchers have also directly leveraged Codex
for generative APR [28], [29], i.e., generating fixes based
on the context before the bug (i.e., prefix only). However, these
studies mostly focus on Codex and are only evaluated on a
small dataset of 40 bugs from simple programming tasks.
Current state-of-the-art PLMs such as Codex [26] and
INCODER [30] have also been evaluated on code-related
tasks such as code completion, docstring generation, and
variable/type prediction. However, these evaluations still mainly
focus on NLP metrics such as the BLEU score [31], which do not
accurately measure the functional or semantic correctness of
the generated code. Furthermore, the evaluation datasets consist
of hand-curated code problems that do not accurately reflect the
types of projects developers work on in the real world.
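As a concrete illustration of this limitation, consider the toy sketch below (our own hypothetical example, not drawn from any of the studied benchmarks): a candidate patch that flips a single comparison operator, and would therefore fail its tests, still receives a high token-level BLEU score against the reference fix.

```python
# Toy example: BLEU rewards surface similarity, not functional correctness.
from nltk.translate.bleu_score import sentence_bleu

reference = "if ( index < size ) return values [ index ] ;".split()
candidate = "if ( index <= size ) return values [ index ] ;".split()  # off-by-one bug

# References are passed as a list of tokenized reference sentences.
score = sentence_bleu([reference], candidate)
print(f"BLEU = {score:.2f}")  # high score despite the patch being functionally wrong
```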
Our Work. We present the first extensive evaluation of recent
PLMs for fixing real-world projects. We design 3 different
APR experimental settings: 1) complete function generation,
2) correct code infilling, and 3) single-line generation, to showcase
the different ways PLMs can be applied for APR. In our
study, we include both popular types of PLM architectures
(generative and infilling models) to show the advantages and
drawbacks of using each type for APR. We include models
spanning a wide range of parameter sizes, from 125
million to 20 billion. We evaluate not only the improvement in
repair effectiveness but also the trade-off with respect to speed
when increasing the model size. In total, we use 5 different
repair datasets containing real open-source bugs and developer-
written tests across 3 programming languages to evaluate APR
under realistic settings. Compared with existing applications of
PLMs for APR [24], [28], [29], our study is the first to include
state-of-the-art PLMs for both infilling-style and generative
APR on various datasets and programming languages. To
summarize, this paper makes the following contributions.
⋆Dimension. This paper bridges the gap between the re-
cent advances in PLMs and a crucial software engineering
problem – APR. It not only demonstrates the potential
of directly leveraging PLMs to solve the important APR
problem, but also provides a realistic evaluation scenario
for recent PLMs, which have mainly been evaluated on
simple/synthetic coding problems rather than the real-world
systems studied in the APR area.
⋆Study. We conduct extensive evaluations using 9 different
recent PLMs on 5 different repair datasets across 3 different
programming languages (Java, Python, and C). We compare
the PLMs against each other using the 3 repair settings
we designed. Using the popular repair datasets, we further
compare the PLMs with state-of-the-art APR tools.
⋆Practical Guidelines. Our study shows for the first time
that directly applying state-of-the-art PLMs can already sub-
stantially outperform all existing APR tools on the widely
studied Defects4J 1.2 dataset (and other ones), e.g., Codex
can fix 32 more bugs than the existing best APR technique.
Among the studied PLMs, the scaling effect holds for APR:
larger models tend to deliver stronger repair results.
Also, we show for the first time that the suffix code after the
buggy line (adopted in infilling-style APR) is important for
not only generating more fixes but also producing patches with a
higher compilation rate. Besides patch generation, the PLMs
consider correct patches to be more natural than other
patches, and can even be used for effective patch ranking or
correctness checking. Lastly, we show that PLM-based APR
can be further substantially improved via: 1) increasing the
sample size, and 2) incorporating fix template information.
II. BACKGROUND AND RELATED WORK
A. Large Pre-Trained Language Model
Large Pre-Trained Language Models (PLMs) have become
ubiquitous in the domain of NLP, achieving impressive per-
formance in many tasks such as machine translation [23], text
summarization [32] and classification [33]. PLMs follow the
Transformer architecture [34] – an encoder to capture input
representation and a decoder to generate output tokens. These
PLMs are first pre-trained in an unsupervised manner on
large amounts of text data and then finetuned for downstream
tasks. However, certain tasks may not have an abundance of
finetuning data available. As such, researchers have evaluated
the ability of PLMs to perform downstream tasks without
finetuning. This is achieved via prompt engineering [35] –
providing the model with natural language descriptions and
demonstrations of the task it is trying to solve before giving the
model the target input. This works by leveraging the general-
purpose setup of PLMs, where the unsupervised pretraining
dataset already encompasses many problem and task domains.
Using this idea and the exponential growth in PLM size [36],
impressive performance in many tasks can be achieved even
without any finetuning [25].
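To make this zero-shot usage concrete, the sketch below shows how one might prompt an off-the-shelf decoder-only PLM for a simple repair task using the HuggingFace transformers library. It is only an illustration: the checkpoint, prompt wording, and decoding parameters are placeholders rather than the exact configuration evaluated in this study.

```python
# Minimal sketch: zero-shot prompting of a decoder-only PLM for repair.
# The model checkpoint and prompt format below are illustrative placeholders.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "EleutherAI/gpt-neo-125M"  # any causal (decoder-only) PLM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The prompt gives a natural-language task description plus the buggy code
# (prefix-only context); the model's continuation is treated as a candidate fix.
prompt = (
    "# Provide a fix for the buggy function\n\n"
    "# Buggy function\n"
    "def is_even(n):\n"
    "    return n % 2 == 1\n\n"
    "# Fixed function\n"
    "def is_even(n):\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=True,            # sampling lets us draw many candidate patches
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
# Print only the generated continuation, i.e., the candidate patch.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```

Sampling the continuation many times yields a pool of candidate patches that can then be validated against the project's test suite.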
PLMs can be classified into encoder-only, decoder-only and
encoder-decoder models based on their architectures. Encoder-
only models (such as BERT [37]) contain only the encoder
component of a Transformer. They are typically designed to
learn data representations and are trained using the Masked
Language Modeling (MLM) objective – a small percentage
(e.g., 15%) of tokens in the training data are replaced by
mask tokens, and the models are trained to predict the
original values of the masked tokens based on the bidirectional
context. Decoder-only models (such as GPT-3 [25] and GPT-
Neo [38]) are large generative models that use the decoder to
predict the next output token given all previous tokens (i.e., left
context or prefix only). To combine the usage of both encoder
and decoder, encoder-decoder models (such as T5 [39] and
BART [40]) have also been proposed for sequence-to-sequence
tasks where the training objective aims to recover the correct
output sequence given the original input (e.g., corrupted to
uncorrupted). One such training objective is the span prediction
task, where random spans (multiple tokens) are replaced with
artificial span tokens and the model is tasked with recovering
the original tokens. For inference, one can use the encoder-
decoder models to infill text by also adding the artificial