MIXCODE: Enhancing Code Classification by
Mixup-Based Data Augmentation
Zeming Dong†, Qiang Hu‡, Yuejun Guo§, Maxime Cordy‡,
Mike Papadakis‡, Zhenya Zhang†, Yves Le Traon‡, and Jianjun Zhao†
†Kyushu University, Japan, dong.zeming.011@s.kyushu-u.ac.jp, {zhang, zhao}@ait.kyushu-u.ac.jp
‡University of Luxembourg, Luxembourg, {qiang.hu, maxime.cordy, michail.papadakis, Yves.LeTraon}@uni.lu
§Luxembourg Institute of Science and Technology, Luxembourg, yuejun.guo@list.lu
*Qiang Hu is the corresponding author.
Abstract—Inspired by the great success of Deep Neural Net-
works (DNNs) in natural language processing (NLP), DNNs have
been increasingly applied in source code analysis and attracted
significant attention from the software engineering community.
Due to its data-driven nature, a DNN model requires massive
and high-quality labeled training data to achieve expert-level
performance. Collecting such data is often not hard, but the
labeling process is notoriously laborious. The task of DNN-based
code analysis even worsens the situation because source code
labeling also demands sophisticated expertise. Data augmentation
has been a popular approach to supplement training data in
domains such as computer vision and NLP. However, existing
data augmentation approaches in code analysis adopt simple
methods, such as data transformation and adversarial example
generation, thus bringing limited performance superiority. In this
paper, we propose a data augmentation approach MIXCODE that
aims to effectively supplement valid training data, inspired by
the recent advance named Mixup in computer vision. Specifically,
we first utilize multiple code refactoring methods to generate
transformed code that holds consistent labels with the original
data. Then, we adapt the Mixup technique to mix the original
code with the transformed code to augment the training data. We
evaluate MIXCODE on two programming languages (Java and
Python), two code tasks (problem classification and bug detec-
tion), four benchmark datasets (JAVA250, Python800, CodRep1,
and Refactory), and seven model architectures (including two pre-
trained models CodeBERT and GraphCodeBERT). Experimental
results demonstrate that MIXCODE outperforms the baseline
data augmentation approach by up to 6.24% in accuracy and
26.06% in robustness.
Index Terms—Data augmentation, Mixup, Source code analysis
I. INTRODUCTION
Due to its remarkable performance, deep learning (DL)
has gained widespread adoption in different application do-
mains, such as face recognition [1], language translation [2],
video games [3], and autonomous driving [3]. More recently,
researchers from the software engineering community have
attempted to use DL techniques to automate multiple down-
stream code tasks, e.g., code search [4], problem classifica-
tion [5], and bug detection [6]. Relevant studies [7], [8] reveal
that DL benefits source code analysis.
As the key pillar of DL systems, deep neural networks
(DNNs) automatically gain knowledge from training data and
make inferences for unseen data after deployment. Generally,
two important factors could affect the performance of the
trained DNNs, namely, the model architecture and the training
data. In the context of code analysis, for the former factor, a
common practice of building proper model architectures of
DNNs is to directly apply natural language processing (NLP)
models to source code. For example, Feng et al. [7] have
modified BERT to create CodeBERT that solves downstream
tasks effectively. For the latter factor, though adequate labeled
training data are necessary for the training process, producing
high-quality labeled source code data is not yet sufficiently
investigated. The main challenge is that data labeling requires
not only extensive human efforts but also sophisticated domain
knowledge. According to [9], labeling code from only four
libraries can take up to 600 man-hours. In a nutshell, data
preparation is indispensable but challenging for developing
desirable models, and therefore in this paper, we take a specific
focus on this important issue.
Data augmentation is a technique that tackles the aforementioned data labeling issue: it produces additional training data by modifying existing data rather than relying on human effort.
Generally, the new data sample is semantically consistent with
the source data, i.e., they share the same functionalities and
labels. In this way, the model can learn more information and
gain better generalization, compared to the approach that relies
only on the original training data. In computer vision and
NLP tasks, data augmentation has been widely used and well-
studied [10]–[12] for model training. For example, in computer
vision tasks, many image transformation methods (e.g., image
rotation, shear) are designed to mimic different real-world
situations that the model could face after deployment. In
traditional NLP tasks, the typical augmentation is to perform
synonym substitution, which is also beneficial to cover more
context that might occur in the real world.
Although data augmentation has proved to be effective in
fields such as CV and NLP, the investigation of its application
in code analysis still remains at an early stage. Researchers
have borrowed ideas from other fields and proposed several
data augmentation approaches for code analysis [13]–[17];
usually, these techniques generate more transformed or ad-
versarial data simply via methods such as code refactoring.
However, existing studies [18] already show that these simple
strategies have limited effects. For example, Bielik et al. [19]
show that adversarial training, by simply adding adversarial
data in the training set, is not helpful in improving the gener-
alization property of DNN models. Therefore, it still remains
an open problem to design data augmentation approaches that
can effectively enhance DNN training for code analysis.
In this paper, for source code classification tasks, we
propose a novel data augmentation framework named MIX-
CODE that aims to enhance the DNN model training process.
Roughly speaking, MIXCODE follows a similar paradigm to
Mixup [20] but adapts the technique in order to handle the
specific data type of source code. Mixup is a well-known
data augmentation technique originally proposed for image
classification, which linearly mixes training data, including
their labels, to increase the data volume. In our case, MIX-
CODE consists of two steps: first, it generates transformed data
by different code refactoring methods, and then, it linearly
mixes the original code and transformed code to generate
new data. Specifically, we study 18 types of existing code
refactoring methods, such as argument renaming and statement
enhancement. More details are in Section III-B.
We conduct experiments to evaluate the effectiveness of
MIXCODE on two programming languages (Java and Python),
two widely-studied code learning tasks (problem classification
and bug detection), and seven model architectures. Based on
that, we answer three research questions as follows:
RQ1: How effective is MIXCODE for enhancing the ac-
curacy and robustness of DNNs? We compare MIXCODE
to the standard training (without data augmentation) and the
existing simple code data augmentation approach, which relies
on transformed or adversarial data only. The results show
that MIXCODE outperforms the baselines by up to 6.24% in
accuracy and 26.06% in robustness. Here, accuracy is the basic
metric that measures the effectiveness of the trained DNN
models. Moreover, robustness [21] reflects the generalization
ability of the trained model to handle unseen data, which is
also an important metric for the deployment of DNN models
in practice [22].
RQ2: How do different Mixup strategies affect the ef-
fectiveness of MIXCODE? We study the effectiveness of
MIXCODE under different settings of Mixup. First, we study
the effect of using different data mixture strategies, which
involve 1) mixing only original code, 2) mixing original code
and transformed code, and 3) mixing only transformed code.
Moreover, we also study the effect of different hyperparame-
ters in MIXCODE. Our evaluation demonstrates that strategy 2), namely, mixing original code with transformed code, achieves the best performance, and we provide suggestions on the most suitable hyperparameter settings for MIXCODE.
RQ3: How does the refactoring method affect the effec-
tiveness of MIXCODE? To investigate the impact of the code
refactoring methods on MIXCODE, we evaluate MIXCODE
using different refactoring methods individually. We find that
there is a trade-off between the original test accuracy and
robustness when choosing different refactoring methods, i.e.,
using the refactoring methods that lead to higher accuracy
could harm the model’s robustness.
In summary, the main contributions of this paper are:
• We propose MIXCODE, the first Mixup-based data augmentation framework for source code analysis. Experimental results demonstrate that MIXCODE outperforms the baseline data augmentation approach by up to 6.24% in accuracy and 26.06% in robustness. The implementation of MIXCODE is available online at https://github.com/zemingd/Mixup4Code.
• We empirically demonstrate that simply mixing the original code is not the best strategy in MIXCODE; mixing original code with transformed code can achieve a 9.23% superiority in accuracy.
• We empirically show that the selection of refactoring methods is also an important factor affecting the performance of MIXCODE.
II. PRELIMINARIES
We briefly introduce the preliminaries of this work from the
perspectives of DNNs for source code analysis, DNN model
training methods, and Mixup for data augmentation.
A. DNNs for Source Code Analysis
DNNs have been widely used in NLP and achieved great
success. Similar to the natural language text, source code also
consists of discrete symbols that can be processed as sequential
or structural data fed into DNN models. Thus, researchers
have tried to employ DNNs to help programmers process and
understand source code in recent years. The impressive perfor-
mance of DNNs has been demonstrated in multiple important
code-related tasks, such as automated program repair [23]–
[30], automated program synthesis [31], [32], and automated
code comment generation [33].
To unlock the potential of DNNs for code-related tasks,
properly representing snippets of code that are fed into DNNs
is necessary. Code representation, which transfers the raw
code to machine-readable data, plays an important role in
source code analysis. Existing representation techniques can
be roughly divided into two categories, namely, sequence
representation [34] and graph representation [8], [35], [36].
Sequence representation converts the code into a sequence
of tokens. The input features of classical neural networks in sequence representation learning are typically embeddings or features that live in Euclidean space. In this way, the original source code is processed into multiple tokens, e.g., from “def func(a, b)” to “[def, func, (, a, b, )]”. Sequence representation is useful for learning the semantic information of the source code because it retains the context of the source code. On the other hand, graph representation builds structural data. In source code, the structure information can be represented by the abstract syntax tree (AST) and different code flows (control flow, data flow). By learning these structural data, the model can perceive functional information of code.
Recently, more researchers have focused on the field that
applies graph representation to source code analysis based on
different variants of graph neural networks (GNNs). In our
study, we consider both categories of code representation to
evaluate MIXCODE.
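To make the sequence representation concrete, the short Python sketch below tokenizes a code snippet into a token sequence and maps the tokens to integer indices. The regular-expression tokenizer and the vocabulary are simplified illustrations, not the tokenizers used by the models evaluated in this paper.

```python
import re

def tokenize(code: str) -> list[str]:
    # Split code into identifiers, numbers, and individual punctuation marks.
    # This is a simplified lexer for illustration only.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)

tokens = tokenize("def func(a, b)")
print(tokens)  # ['def', 'func', '(', 'a', ',', 'b', ')']

# Map tokens to integer ids so a DNN can embed them.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[tok] for tok in tokens]
print(ids)
```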
B. DNN Model Training Methods
Given a set of training data, DNN training searches for the best parameters (e.g., weights, biases) that enable the model to fit the training data. Here, we introduce
the standard training process and the basic data augmentation
framework for the source code model. Algorithm 1 presents
the pseudocode of these two training strategies. In the standard
manner of training, all the training data are fed into several
epochs of training (See Lines 1-3 in Algorithm 1).
Algorithm 1 Existing model training strategies
Require: M: initialized DNN model
Require: X, Y: original training data and labels
Require: R = {r}: a set of data transformation methods
Ensure: M: trained model
Standard training (without augmentation)
1: for run ∈ {0, . . . , #epochs} do
2:   M.Fit(X, Y)
3: return M
Basic augmentation
4: for run ∈ {0, . . . , #epochs} do
5:   Xref ← ∅
6:   for x ∈ X do
7:     r ← RandomSelection(R)
8:     Xref ← Xref ∪ {r(x)}
9:   M.Fit(Xref, Y)
10: return M
However, since the prepared training data can only rep-
resent a limited part of data distribution, the training data
volume has been a bottleneck that prevents DNN models
from achieving high performance [37]. Data augmentation is
proposed to automatically increase the volume of the training
set, and thus enhance the quality of training. The basic idea of
data augmentation is to generate new data from the existing
training data by well-designed data transformation methods.
Generally, such data transformation methods modify the data
and do not change their semantic information; for example,
in image data processing, commonly-used methods include
random rotating, padding, and adding brightness [10]. Lines 4-10 in Algorithm 1 show the process of training with data
augmentation. Specifically, in each epoch, the DNN is trained
by using a transformed version of the data generated by
randomly selected data transformation methods.
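As a concrete (if simplified) rendering of Lines 4-10 of Algorithm 1, the Python sketch below trains a model on a freshly transformed copy of the dataset in every epoch. The model interface and the transformation callables are assumed placeholders, not part of any specific library.

```python
import random

def train_with_basic_augmentation(model, X, Y, transformations, epochs):
    """Basic augmentation (Algorithm 1, Lines 4-10): each epoch trains on a
    transformed version of the data produced by randomly chosen,
    semantics-preserving transformations. `model.fit` and the callables in
    `transformations` are assumed interfaces."""
    for _ in range(epochs):
        X_ref = []
        for x in X:
            r = random.choice(transformations)  # RandomSelection(R)
            X_ref.append(r(x))                  # transform; labels unchanged
        model.fit(X_ref, Y)
    return model
```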
C. Mixup: A Data Augmentation Approach in Image Classifi-
cation Tasks
Mixup [20] is an effective data augmentation technique
proposed for image classification tasks. Mixup contains two
steps: first, it randomly selects two data samples from the
training data; then, it mixes both the data features and the
labels of the selected data to generate a new sample as the
training data. In addition to image classification, recently,
researchers have achieved great success in applying Mixup
to text classification [38].
Technically, Mixup works as follows: given a pair of samples (x_i, y_i) and (x_j, y_j), where x represents the input feature and its corresponding output label y is denoted with one-hot encoding, Mixup produces a new data pair (x_mix^{ij}, y_mix^{ij}):

x_mix^{ij} = λ x_i + (1 − λ) x_j
y_mix^{ij} = λ y_i + (1 − λ) y_j        (1)

where λ is a mixing policy for the input sample pair, sampled from a Beta distribution with a shape parameter α (λ ∼ Beta(α, α)). Figure 1 depicts an example of an image
generated by Mixup. By mixing two images into one, a model
can gain knowledge from both sides.
[Fig. 1. An example of Mixup for image data: (a) original image 1, (b) original image 2, (c) mixed image. The mixed image is calculated using Eq. (1) with λ = 0.2.]
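As a minimal sketch of Eq. (1), the following Python/NumPy function mixes a pair of samples and their one-hot labels. The Beta-distribution sampling of λ follows the description above; the array shapes in the usage example are illustrative assumptions.

```python
import numpy as np

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.2):
    """Mix two samples and their one-hot labels as in Eq. (1)."""
    lam = np.random.beta(alpha, alpha)      # λ ~ Beta(α, α)
    x_mix = lam * x_i + (1.0 - lam) * x_j   # mix input features
    y_mix = lam * y_i + (1.0 - lam) * y_j   # mix one-hot labels
    return x_mix, y_mix

# Example: mix two 28x28 grayscale images with 10-class one-hot labels.
x1, x2 = np.random.rand(28, 28), np.random.rand(28, 28)
y1, y2 = np.eye(10)[3], np.eye(10)[7]
x_mix, y_mix = mixup_pair(x1, y1, x2, y2)
```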
III. MIXCODE: THE PROPOSED APPROACH
A. Methodology of MIXCODE
Inspired by the great success of Mixup and its variants in
image classification tasks, we propose MIXCODE, a simple
yet effective data augmentation framework for source code
classification tasks. Algorithm 2 presents the whole process
of MIXCODE. Essentially, Algorithm 2 is different from the
existing approaches in Algorithm 1 in the way it augments
the training data in each training epoch. Figure 2 presents an overview of MIXCODE in one epoch.
[Fig. 2. Workflow of MIXCODE within one training epoch: original code is refactored into transformed code; both are converted into representations and mixed by Mixup (O+O, O+T, T+T) to form the training data used to train the DNN.]
Concretely, this process
consists of the following three phases:
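As an end-to-end illustration of one such epoch, the Python sketch below pairs each original sample with the refactored version of another randomly chosen sample and mixes their representations and one-hot labels as in Eq. (1). The `refactorings`, `embed`, and `model.fit_batch` interfaces are hypothetical placeholders, not the authors' implementation.

```python
import random
import numpy as np

def mixcode_epoch(model, X, Y, refactorings, embed, alpha=0.2):
    """One MIXCODE epoch (illustrative sketch): refactor, then mix.

    `embed` maps a code snippet to a fixed-size vector; `refactorings`
    contains label-preserving code transformations; Y holds one-hot labels.
    """
    X_mix, Y_mix = [], []
    idx = list(range(len(X)))
    for i in idx:
        j = random.choice(idx)                # partner sample to mix with
        r = random.choice(refactorings)
        x_t = r(X[j])                         # transformed code, same label as X[j]
        lam = np.random.beta(alpha, alpha)    # Mixup coefficient, Eq. (1)
        X_mix.append(lam * embed(X[i]) + (1 - lam) * embed(x_t))
        Y_mix.append(lam * Y[i] + (1 - lam) * Y[j])
    model.fit_batch(np.stack(X_mix), np.stack(Y_mix))
    return model
```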