previous learning-based APR, the model (125M parameters)
it uses is far smaller than the current state-of-the-art PLMs
(Codex: 12B parameters and GPT-3: 175B parameters). Besides
AlphaRepair, researchers have also directly leveraged Codex
for generative APR [28], [29], i.e., generating fixes based
on the context before the bug (i.e., prefix only). However, these
studies mostly focus on Codex and are only evaluated on a
small dataset of 40 bugs from simple programming tasks.
Current state-of-the-art PLMs such as Codex [26] and
INCODER [30] have also been evaluated on code-related
tasks such as code completion, docstring generation, and
variable/type prediction. However, these evaluations still mainly
focus on NLP metrics such as the BLEU score [31], which do not
accurately measure the functional or semantic correctness of
the generated code. Furthermore, the evaluation datasets consist
of hand-curated code problems that do not accurately reflect the
types of projects developers work on in the real world.
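As a concrete illustration of this limitation, consider the toy sketch below (our own hypothetical example, not drawn from any of the studied benchmarks): a candidate patch that flips a single comparison operator, and would therefore fail its tests, still receives a high token-level BLEU score against the reference fix.

```python
# Toy example: BLEU rewards surface similarity, not functional correctness.
from nltk.translate.bleu_score import sentence_bleu

reference = "if ( index < size ) return values [ index ] ;".split()
candidate = "if ( index <= size ) return values [ index ] ;".split()  # off-by-one bug

# References are passed as a list of tokenized reference sentences.
score = sentence_bleu([reference], candidate)
print(f"BLEU = {score:.2f}")  # high score despite the patch being functionally wrong
```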
Our Work. We present the first extensive evaluation of recent
PLMs for fixing real-world projects. We design 3 different
APR experimental settings: 1) complete function generation,
2) correct code infilling, and 3) single-line generation, to showcase
the different ways PLMs can be applied for APR. In our
study, we include both popular types of PLM architectures
(generative and infilling models) to show the advantages and
drawbacks of using each type for APR. We include models
spanning a wide range of parameter sizes, from 125
million to 20 billion. We evaluate not only the improvement in
repair effectiveness but also the trade-off with respect to speed
when increasing the model size. In total, we use 5 different
repair datasets containing real open-source bugs and developer-
written tests across 3 programming languages to evaluate APR
under realistic settings. Compared with existing applications of
PLMs for APR [24], [28], [29], our study is the first to include
state-of-the-art PLMs for both infilling-style and generative
APR on various datasets and programming languages. To
summarize, this paper makes the following contributions.
⋆Dimension. This paper bridges the gap between the re-
cent advances in PLMs and a crucial software engineering
problem – APR. It not only demonstrates the potential
of directly leveraging PLMs to solve the important APR
problem, but also provides a realistic evaluation scenario
for recent PLMs, which have mainly been evaluated on
simple/synthetic coding problems rather than the real-world
systems studied in the APR area.
⋆Study. We conduct extensive evaluations using 9 different
recent PLMs on 5 different repair datasets across 3 different
programming languages (Java, Python, and C). We compare
the PLMs against each other using the 3 repair settings
we designed. Using the popular repair datasets, we further
compare the PLMs with state-of-the-art APR tools.
⋆Practical Guidelines. Our study shows for the first time
that directly applying state-of-the-art PLMs can already sub-
stantially outperform all existing APR tools on the widely
studied Defects4J 1.2 dataset (and other ones), e.g., Codex
can fix 32 more bugs than the existing best APR technique.
Among the studied PLMs, the scaling effect holds for APR:
larger models tend to deliver stronger repair results.
Also, we show for the first time that the suffix code after the
buggy line (adopted in infilling-style APR) is important for
not only generating more fixes but also producing patches with a
higher compilation rate. Besides patch generation, the PLMs
consider correct patches to be more natural than other
patches, and can even be used for effective patch ranking or
correctness checking. Lastly, we show that PLM-based APR
can be further substantially improved via: 1) increasing the
sample size, and 2) incorporating fix template information.
II. BACKGROUND AND RELATED WORK
A. Large Pre-Trained Language Model
Large Pre-Trained Language Models (PLMs) have become
ubiquitous in the domain of NLP, achieving impressive per-
formance in many tasks such as machine translation [23], text
summarization [32] and classification [33]. PLMs follow the
Transformer architecture [34] – an encoder to capture input
representation and a decoder to generate output tokens. These
PLMs are first pre-trained in an unsupervised manner on
large amounts of text data and then finetuned for downstream
tasks. However, certain tasks may not have an abundance of
finetuning data available. As such, researchers have evaluated
the ability of PLMs to perform downstream tasks without
finetuning. This is achieved via prompt engineering [35] –
providing the model with natural language descriptions and
demonstrations of the task it is trying to solve before giving the
model the target input. This works by leveraging the general-
purpose setup of PLMs, where the unsupervised pretraining
dataset already encompasses many problem and task domains.
Using this idea and the exponential growth in PLM size [36],
impressive performance in many tasks can be achieved even
without any finetuning [25].
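To make this zero-shot usage concrete, the sketch below shows how one might prompt an off-the-shelf decoder-only PLM for a simple repair task using the HuggingFace transformers library. It is only an illustration: the checkpoint, prompt wording, and decoding parameters are placeholders rather than the exact configuration evaluated in this study.

```python
# Minimal sketch: zero-shot prompting of a decoder-only PLM for repair.
# The model checkpoint and prompt format below are illustrative placeholders.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "EleutherAI/gpt-neo-125M"  # any causal (decoder-only) PLM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The prompt gives a natural-language task description plus the buggy code
# (prefix-only context); the model's continuation is treated as a candidate fix.
prompt = (
    "# Provide a fix for the buggy function\n\n"
    "# Buggy function\n"
    "def is_even(n):\n"
    "    return n % 2 == 1\n\n"
    "# Fixed function\n"
    "def is_even(n):\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=True,            # sampling lets us draw many candidate patches
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
# Print only the generated continuation, i.e., the candidate patch.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```

Sampling the continuation many times yields a pool of candidate patches that can then be validated against the project's test suite.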
PLMs can be classified into encoder-only, decoder-only and
encoder-decoder models based on their architectures. Encoder-
only models (such as BERT [37]) contain only the encoder
component of a Transformer. They are typically designed to
learn data representations and are trained using the Masked
Language Modeling (MLM) objective – a small percentage
(e.g., 15%) of tokens in the training data are replaced by
mask tokens, and the models are trained to predict the
original values of the masked tokens based on the bidirectional
context. Decoder-only models (such as GPT-3 [25] and GPT-
Neo [38]) are large generative models that use the decoder to
predict the next output token given all previous tokens (i.e., left
context or prefix only). To combine the usage of both encoder
and decoder, encoder-decoder models (such as T5 [39] and
BART [40]) have also been proposed for sequence-to-sequence
tasks where the training objective aims to recover the correct
output sequence given the original input (e.g., corrupted to
uncorrupted). One such training objective is the span prediction
task, where random spans (multiple tokens) are replaced with
artificial span tokens and the model is tasked with recovering
the original tokens. For inference, one can use the encoder-
decoder models to infill text by also adding the artificial