data in the training set, is not helpful in improving the generalization ability of DNN models. Therefore, it remains an open problem to design data augmentation approaches that can effectively enhance DNN training for code analysis.
In this paper, for source code classification tasks, we propose a novel data augmentation framework named MIXCODE that aims to enhance the DNN model training process. Roughly speaking, MIXCODE follows a similar paradigm to Mixup [20] but adapts the technique to handle the specific data type of source code. Mixup is a well-known data augmentation technique originally proposed for image classification, which linearly mixes training data, including their labels, to increase the data volume. In our case, MIXCODE consists of two steps: first, it generates transformed data by different code refactoring methods; then, it linearly mixes the original code and the transformed code to generate new data, as sketched below. Specifically, we study 18 types of existing code refactoring methods, such as argument renaming and statement enhancement. More details are in Section III-B.
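To make the mixing step concrete, the following Python sketch shows a Mixup-style combination of an original sample and a transformed sample. It is a minimal illustration: it assumes the code has already been embedded into fixed-size vectors and the labels are one-hot encoded, and the Beta parameter is illustrative rather than the setting used in our experiments.

import numpy as np

def mixup_code(x_orig, x_trans, y_orig, y_trans, alpha=0.2):
    # Draw the mixing coefficient from a Beta(alpha, alpha) distribution.
    lam = np.random.beta(alpha, alpha)
    # Linearly interpolate both the inputs and their labels.
    x_mixed = lam * x_orig + (1.0 - lam) * x_trans
    y_mixed = lam * y_orig + (1.0 - lam) * y_trans
    return x_mixed, y_mixed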
We conduct experiments to evaluate the effectiveness of
MIXCODE on two programming languages (Java and Python),
two widely-studied code learning tasks (problem classification
and bug detection), and seven model architectures. Based on
that, we answer three research questions as follows:
RQ1: How effective is MIXCODE for enhancing the accuracy and robustness of DNNs? We compare MIXCODE to standard training (without data augmentation) and to the existing simple code data augmentation approach, which relies on transformed or adversarial data only. The results show that MIXCODE outperforms the baselines by up to 6.24% in accuracy and 26.06% in robustness. Here, accuracy is the basic
metric that measures the effectiveness of the trained DNN
models. Moreover, robustness [21] reflects the generalization
ability of the trained model to handle unseen data, which is
also an important metric for the deployment of DNN models
in practice [22].
RQ2: How do different Mixup strategies affect the effectiveness of MIXCODE? We study the effectiveness of MIXCODE under different settings of Mixup. First, we study the effect of using different data mixture strategies, namely 1) mixing only original code, 2) mixing original code and transformed code, and 3) mixing only transformed code. Moreover, we also study the effect of different hyperparameters in MIXCODE (see the sketch after this paragraph). Our evaluation demonstrates that the second strategy, i.e., mixing original code with transformed code, achieves the best performance, and we also provide suggestions on the most suitable hyperparameter settings for MIXCODE.
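For intuition on the main hyperparameter involved, Mixup draws its mixing coefficient λ from a Beta(α, α) distribution, so α controls how strongly two samples are blended. The following snippet, with illustrative α values rather than our recommended settings, shows this effect empirically.

import numpy as np

# Small alpha concentrates lambda near 0 or 1, so mixed samples stay close
# to one parent; alpha = 1 spreads lambda uniformly over [0, 1].
for alpha in (0.1, 0.4, 1.0):
    lam = np.random.beta(alpha, alpha, size=10000)
    frac = ((lam < 0.1) | (lam > 0.9)).mean()
    print(f"alpha={alpha}: mean={lam.mean():.2f}, near-endpoint fraction={frac:.2f}")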
RQ3: How does the refactoring method affect the effectiveness of MIXCODE? To investigate the impact of the code refactoring methods on MIXCODE, we evaluate MIXCODE using different refactoring methods individually. We find that there is a trade-off between the original test accuracy and robustness when choosing different refactoring methods, i.e., refactoring methods that lead to higher accuracy can harm the model's robustness.
In summary, the main contributions of this paper are:
• We propose MIXCODE, the first Mixup-based data augmentation framework for source code analysis. Experimental results demonstrate that MIXCODE outperforms the baseline data augmentation approach by up to 6.24% in accuracy and 26.06% in robustness. The implementation of MIXCODE is available online.1
• We empirically demonstrate that simply mixing the original code is not the best strategy in MIXCODE. In addition, MIXCODE using original code and transformed code can achieve up to 9.23% higher accuracy.
• We empirically show that the selection of refactoring methods is also an important factor affecting the performance of MIXCODE.
II. PRELIMINARIES
We briefly introduce the preliminaries of this work from the
perspectives of DNNs for source code analysis, DNN model
training methods, and Mixup for data augmentation.
A. DNNs for Source Code Analysis
DNNs have been widely used in NLP and have achieved great success. Similar to natural language text, source code consists of discrete symbols that can be processed as sequential or structural data and fed into DNN models. Thus, in recent years, researchers have employed DNNs to help programmers process and understand source code. The impressive performance of DNNs has been demonstrated in multiple important code-related tasks, such as automated program repair [23]–[30], automated program synthesis [31], [32], and automated code comment generation [33].
To unlock the potential of DNNs for code-related tasks, properly representing the code snippets that are fed into DNNs is necessary. Code representation, which transforms the raw code into machine-readable data, plays an important role in source code analysis. Existing representation techniques can be roughly divided into two categories, namely, sequence representation [34] and graph representation [8], [35], [36]. Sequence representation converts the code into a sequence of tokens. The input features of classical neural networks in sequence representation learning are typically embeddings or features that live in Euclidean space. In this way, the original source code is processed into multiple tokens, e.g., from “def func(a, b)” to “[def, func, (, a, b, )]”. Sequence representation is useful for learning the semantic information of the source code because it retains the context of the source code.
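As a minimal illustration of sequence representation (using Python's standard tokenize module rather than the tokenizer of any particular code model), the snippet below flattens a small function into a token sequence.

import io
import tokenize

source = "def func(a, b):\n    return a + b\n"

# Convert the raw source into a flat sequence of tokens, dropping
# whitespace-only tokens such as NEWLINE and INDENT.
tokens = [tok.string
          for tok in tokenize.generate_tokens(io.StringIO(source).readline)
          if tok.string.strip()]
print(tokens)
# ['def', 'func', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']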
On the other hand, graph representation builds structural data. In source code, the structural information can be represented by the abstract syntax tree (AST) and different code flows (control flow, data flow). By learning from these structural data, the model can perceive the functional information of the code.
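For intuition only, Python's built-in ast module exposes the tree structure that graph representations build on; real pipelines additionally add control-flow and data-flow edges on top of such nodes.

import ast

source = "def func(a, b):\n    return a + b\n"

# Parse the snippet into an abstract syntax tree and list its node types;
# a graph representation would treat these nodes and their edges as input.
tree = ast.parse(source)
print([type(node).__name__ for node in ast.walk(tree)])
# e.g. ['Module', 'FunctionDef', 'arguments', 'Return', 'arg', 'arg', 'BinOp', ...]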
Recently, more researchers have focused on the field that
1https://github.com/zemingd/Mixup4Code