MIXCODE: Enhancing Code Classification by
Mixup-Based Data Augmentation
Zeming Dong†, Qiang Hu‡, Yuejun Guo§, Maxime Cordy‡,
Mike Papadakis‡, Zhenya Zhang†, Yves Le Traon‡, and Jianjun Zhao†
†Kyushu University, Japan, dong.zeming.011@s.kyushu-u.ac.jp, {zhang, zhao}@ait.kyushu-u.ac.jp
‡University of Luxembourg, Luxembourg, {qiang.hu, maxime.cordy, michail.papadakis, Yves.LeTraon}@uni.lu
§Luxembourg Institute of Science and Technology, Luxembourg, yuejun.guo@list.lu
*Qiang Hu is the corresponding author.
Abstract—Inspired by the great success of Deep Neural Net-
works (DNNs) in natural language processing (NLP), DNNs have
been increasingly applied in source code analysis and attracted
significant attention from the software engineering community.
Due to its data-driven nature, a DNN model requires massive
and high-quality labeled training data to achieve expert-level
performance. Collecting such data is often not hard, but the
labeling process is notoriously laborious. The task of DNN-based
code analysis even worsens the situation because source code
labeling also demands sophisticated expertise. Data augmentation
has been a popular approach to supplement training data in
domains such as computer vision and NLP. However, existing
data augmentation approaches in code analysis adopt simple
methods, such as data transformation and adversarial example
generation, thus bringing limited performance superiority. In this
paper, we propose a data augmentation approach MIXCODE that
aims to effectively supplement valid training data, inspired by
the recent advance named Mixup in computer vision. Specifically,
we first utilize multiple code refactoring methods to generate
transformed code that holds consistent labels with the original
data. Then, we adapt the Mixup technique to mix the original
code with the transformed code to augment the training data. We
evaluate MIXCODE on two programming languages (Java and
Python), two code tasks (problem classification and bug detec-
tion), four benchmark datasets (JAVA250, Python800, CodRep1,
and Refactory), and seven model architectures (including two pre-
trained models CodeBERT and GraphCodeBERT). Experimental
results demonstrate that MIXCODE outperforms the baseline
data augmentation approach by up to 6.24% in accuracy and
26.06% in robustness.
Index Terms—Data augmentation, Mixup, Source code analysis
I. INTRODUCTION
Due to its remarkable performance, deep learning (DL)
has gained widespread adoption in different application do-
mains, such as face recognition [1], language translation [2],
video games [3], and autonomous driving [3]. More recently,
researchers from the software engineering community have
attempted to use DL techniques to automate multiple down-
stream code tasks, e.g., code search [4], problem classifica-
tion [5], and bug detection [6]. Relevant studies [7], [8] reveal
that DL benefits source code analysis.
As the key pillar of DL systems, deep neural networks
(DNNs) automatically gain knowledge from training data and
make inferences for unseen data after deployment. Generally,
two important factors could affect the performance of the
trained DNNs, namely, the model architecture and the training
data. In the context of code analysis, for the former factor, a
common practice of building proper model architectures of
DNNs is to directly apply natural language processing (NLP)
models to source code. For example, Feng et al. [7] have
modified BERT to create CodeBERT that solves downstream
tasks effectively. For the latter factor, though adequate labeled
training data are necessary for the training process, producing
high-quality labeled source code data is not yet sufficiently
investigated. The main challenge is that data labeling requires
not only extensive human efforts but also sophisticated domain
knowledge. According to [9], labeling code from only four
libraries can take up to 600 man-hours. In a nutshell, data
preparation is indispensable but challenging for developing
desirable models, and therefore in this paper, we take a specific
focus on this important issue.
Data augmentation is a technique that tackles the aforementioned data labeling issue: it produces additional training data by modifying existing data rather than relying on human effort.
Generally, the new data sample is semantically consistent with
the source data, i.e., they share the same functionalities and
labels. In this way, the model can learn more information and
gain better generalization, compared to the approach that relies
only on the original training data. In computer vision and
NLP tasks, data augmentation has been widely used and well-
studied [10]–[12] for model training. For example, in computer
vision tasks, many image transformation methods (e.g., image
rotation, shear) are designed to mimic different real-world
situations that the model could face after deployment. In
traditional NLP tasks, the typical augmentation is to perform
synonym substitution, which is also beneficial to cover more
context that might occur in the real world.
Although data augmentation has proved to be effective in
fields such as CV and NLP, the investigation of its application
in code analysis still remains at an early stage. Researchers
have borrowed ideas from other fields and proposed several
data augmentation approaches for code analysis [13]–[17];
usually, these techniques generate more transformed or ad-
versarial data simply via methods such as code refactoring.
However, existing studies [18] already show that these simple
strategies have limited effects. For example, Bielik et al. [19]
show that adversarial training, by simply adding adversarial
data in the training set, is not helpful in improving the gener-
alization property of DNN models. Therefore, it still remains
an open problem to design data augmentation approaches that
can effectively enhance DNN training for code analysis.
In this paper, for source code classification tasks, we
propose a novel data augmentation framework named MIX-
CODE that aims to enhance the DNN model training process.
Roughly speaking, MIXCODE follows a similar paradigm to
Mixup [20] but adapts the technique in order to handle the
specific data type of source code. Mixup is a well-known
data augmentation technique originally proposed for image
classification, which linearly mixes training data, including
their labels, to increase the data volume. In our case, MIX-
CODE consists of two steps: first, it generates transformed data
by different code refactoring methods, and then, it linearly
mixes the original code and transformed code to generate
new data. Specifically, we study 18 types of existing code
refactoring methods, such as argument renaming and statement
enhancement. More details are in Section III-B.
We conduct experiments to evaluate the effectiveness of
MIXCODE on two programming languages (Java and Python),
two widely-studied code learning tasks (problem classification
and bug detection), and seven model architectures. Based on
that, we answer three research questions as follows:
RQ1: How effective is MIXCODE for enhancing the ac-
curacy and robustness of DNNs? We compare MIXCODE
to the standard training (without data augmentation) and the
existing simple code data augmentation approach, which relies
on transformed or adversarial data only. The results show
that MIXCODE outperforms the baselines by up to 6.24% in
accuracy and 26.06% in robustness. Here, accuracy is the basic
metric that measures the effectiveness of the trained DNN
models. Moreover, robustness [21] reflects the generalization
ability of the trained model to handle unseen data, which is
also an important metric for the deployment of DNN models
in practice [22].
RQ2: How do different Mixup strategies affect the ef-
fectiveness of MIXCODE? We study the effectiveness of
MIXCODE under different settings of Mixup. First, we study
the effect of using different data mixture strategies, which
involve 1) mixing only original code, 2) mixing original code
and transformed code, and 3) mixing only transformed code.
Moreover, we also study the effect of different hyperparame-
ters in MIXCODE. Our evaluation demonstrates that strategy 2), namely, mixing original code with transformed code, achieves the best performance, and we provide suggestions on the most suitable hyperparameter settings for MIXCODE.
RQ3: How does the refactoring method affect the effec-
tiveness of MIXCODE? To investigate the impact of the code
refactoring methods on MIXCODE, we evaluate MIXCODE
using different refactoring methods individually. We find that
there is a trade-off between the original test accuracy and
robustness when choosing different refactoring methods, i.e.,
using the refactoring methods that lead to higher accuracy
could harm the model’s robustness.
In summary, the main contributions of this paper are:
• We propose MIXCODE, the first Mixup-based data augmentation framework for source code analysis. Experimental results demonstrate that MIXCODE outperforms the baseline data augmentation approach by up to 6.24% in accuracy and 26.06% in robustness. The implementation of MIXCODE is available online at https://github.com/zemingd/Mixup4Code.
• We empirically demonstrate that simply mixing the original code is not the best strategy in MIXCODE; mixing original code with transformed code can achieve a 9.23% superiority in accuracy.
• We empirically show that the selection of refactoring methods is also an important factor affecting the performance of MIXCODE.
II. PRELIMINARIES
We briefly introduce the preliminaries of this work from the
perspectives of DNNs for source code analysis, DNN model
training methods, and Mixup for data augmentation.
A. DNNs for Source Code Analysis
DNNs have been widely used in NLP and achieved great
success. Similar to the natural language text, source code also
consists of discrete symbols that can be processed as sequential
or structural data fed into DNN models. Thus, researchers
have tried to employ DNNs to help programmers process and
understand source code in recent years. The impressive perfor-
mance of DNNs has been demonstrated in multiple important
code-related tasks, such as automated program repair [23]–
[30], automated program synthesis [31], [32], and automated
code comment generation [33].
To unlock the potential of DNNs for code-related tasks,
properly representing snippets of code that are fed into DNNs
is necessary. Code representation, which transfers the raw
code to machine-readable data, plays an important role in
source code analysis. Existing representation techniques can
be roughly divided into two categories, namely, sequence
representation [34] and graph representation [8], [35], [36].
Sequence representation converts the code into a sequence
of tokens. The input features of classical neural networks in sequence representation learning are typically embeddings or features that live in Euclidean space. In this way, the original source code is processed into multiple tokens, e.g., from “def func(a, b)” to “[def, func, (, a, b, )]”. Sequence representation is useful for learning the semantic information of the source code because it retains the context of the source code. On the other hand, graph representation builds structural data. In source code, the structure information can be represented by the abstract syntax tree (AST) and different code flows (control flow, data flow). By learning these structural data, the model can perceive functional information of code.
Recently, more researchers have focused on the field that
applies graph representation to source code analysis based on
different variants of graph neural networks (GNNs). In our
study, we consider both categories of code representation to
evaluate MIXCODE.
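To make the sequence representation concrete, the short Python sketch below tokenizes a code snippet into a token sequence and maps the tokens to integer indices. The regular-expression tokenizer and the vocabulary are simplified illustrations, not the tokenizers used by the models evaluated in this paper.

```python
import re

def tokenize(code: str) -> list[str]:
    # Split code into identifiers, numbers, and individual punctuation marks.
    # This is a simplified lexer for illustration only.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)

tokens = tokenize("def func(a, b)")
print(tokens)  # ['def', 'func', '(', 'a', ',', 'b', ')']

# Map tokens to integer ids so a DNN can embed them.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[tok] for tok in tokens]
print(ids)
```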
B. DNN Model Training Methods
Given a set of training data, DNN training searches for the best parameters (e.g., weights, biases) that enable the model to fit the training data. Here, we introduce
the standard training process and the basic data augmentation
framework for the source code model. Algorithm 1 presents
the pseudocode of these two training strategies. In the standard
manner of training, all the training data are fed into several
epochs of training (See Lines 1-3 in Algorithm 1).
Algorithm 1 Existing model training strategies
Require: M: initialized DNN model
Require: X, Y: original training data and labels
Require: R = {r}: a set of data transformation methods
Ensure: M: trained model
Standard training (without augmentation)
1: for run ∈ {0, . . . , #epochs} do
2:   M.Fit(X, Y)
3: return M
Basic augmentation
4: for run ∈ {0, . . . , #epochs} do
5:   Xref ← ∅
6:   for x ∈ X do
7:     r ← RandomSelection(R)
8:     Xref ← Xref ∪ {r(x)}
9:   M.Fit(Xref, Y)
10: return M
However, since the prepared training data can only rep-
resent a limited part of data distribution, the training data
volume has been a bottleneck that prevents DNN models
from achieving high performance [37]. Data augmentation is
proposed to automatically increase the volume of the training
set, and thus enhance the quality of training. The basic idea of
data augmentation is to generate new data from the existing
training data by well-designed data transformation methods.
Generally, such data transformation methods modify the data
and do not change their semantic information; for example,
in image data processing, commonly-used methods include
random rotating, padding, and adding brightness [10]. Lines 4-10 in Algorithm 1 show the process of training with data
augmentation. Specifically, in each epoch, the DNN is trained
by using a transformed version of the data generated by
randomly selected data transformation methods.
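As a concrete (if simplified) rendering of Lines 4-10 of Algorithm 1, the Python sketch below trains a model on a freshly transformed copy of the dataset in every epoch. The model interface and the transformation callables are assumed placeholders, not part of any specific library.

```python
import random

def train_with_basic_augmentation(model, X, Y, transformations, epochs):
    """Basic augmentation (Algorithm 1, Lines 4-10): each epoch trains on a
    transformed version of the data produced by randomly chosen,
    semantics-preserving transformations. `model.fit` and the callables in
    `transformations` are assumed interfaces."""
    for _ in range(epochs):
        X_ref = []
        for x in X:
            r = random.choice(transformations)  # RandomSelection(R)
            X_ref.append(r(x))                  # transform; labels unchanged
        model.fit(X_ref, Y)
    return model
```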
C. Mixup: A Data Augmentation Approach in Image Classifi-
cation Tasks
Mixup [20] is an effective data augmentation technique
proposed for image classification tasks. Mixup contains two
steps: first, it randomly selects two data samples from the
training data; then, it mixes both the data features and the
labels of the selected data to generate a new sample as the
training data. In addition to image classification, recently,
researchers have achieved great success in applying Mixup
to text classification [38].
Technically, Mixup works as follows: given a pair of samples (x_i, y_i) and (x_j, y_j), where x represents the input feature and its corresponding output label y is denoted with one-hot encoding, Mixup produces a new data pair (x_mix^{ij}, y_mix^{ij}):

x_mix^{ij} = λ x_i + (1 − λ) x_j
y_mix^{ij} = λ y_i + (1 − λ) y_j        (1)

where λ is a mixing policy for the input sample pair, sampled from a Beta distribution with a shape parameter α (λ ∼ Beta(α, α)). Figure 1 depicts an example of an image
generated by Mixup. By mixing two images into one, a model
can gain knowledge from both sides.
[Fig. 1. An example of Mixup for image data: (a) original image 1, (b) original image 2, (c) mixed image. The mixed image is calculated using Eq. (1) with λ = 0.2.]
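As a minimal sketch of Eq. (1), the following Python/NumPy function mixes a pair of samples and their one-hot labels. The Beta-distribution sampling of λ follows the description above; the array shapes in the usage example are illustrative assumptions.

```python
import numpy as np

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.2):
    """Mix two samples and their one-hot labels as in Eq. (1)."""
    lam = np.random.beta(alpha, alpha)      # λ ~ Beta(α, α)
    x_mix = lam * x_i + (1.0 - lam) * x_j   # mix input features
    y_mix = lam * y_i + (1.0 - lam) * y_j   # mix one-hot labels
    return x_mix, y_mix

# Example: mix two 28x28 grayscale images with 10-class one-hot labels.
x1, x2 = np.random.rand(28, 28), np.random.rand(28, 28)
y1, y2 = np.eye(10)[3], np.eye(10)[7]
x_mix, y_mix = mixup_pair(x1, y1, x2, y2)
```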
III. MIXCODE: THE PROPOSED APPROACH
A. Methodology of MIXCODE
Inspired by the great success of Mixup and its variants in
image classification tasks, we propose MIXCODE, a simple
yet effective data augmentation framework for source code
classification tasks. Algorithm 2 presents the whole process
of MIXCODE. Essentially, Algorithm 2 is different from the
existing approaches in Algorithm 1 in the way it augments
the training data in each training epoch. Figure 2 presents an overview of MIXCODE in one epoch.
[Fig. 2. Workflow of MIXCODE within one training epoch: original code is refactored into transformed code; both are converted into representations and mixed by Mixup (O+O, O+T, T+T) to form the training data used to train the DNN.]
Concretely, this process
consists of the following three phases:
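As an end-to-end illustration of one such epoch, the Python sketch below pairs each original sample with the refactored version of another randomly chosen sample and mixes their representations and one-hot labels as in Eq. (1). The `refactorings`, `embed`, and `model.fit_batch` interfaces are hypothetical placeholders, not the authors' implementation.

```python
import random
import numpy as np

def mixcode_epoch(model, X, Y, refactorings, embed, alpha=0.2):
    """One MIXCODE epoch (illustrative sketch): refactor, then mix.

    `embed` maps a code snippet to a fixed-size vector; `refactorings`
    contains label-preserving code transformations; Y holds one-hot labels.
    """
    X_mix, Y_mix = [], []
    idx = list(range(len(X)))
    for i in idx:
        j = random.choice(idx)                # partner sample to mix with
        r = random.choice(refactorings)
        x_t = r(X[j])                         # transformed code, same label as X[j]
        lam = np.random.beta(alpha, alpha)    # Mixup coefficient, Eq. (1)
        X_mix.append(lam * embed(X[i]) + (1 - lam) * embed(x_t))
        Y_mix.append(lam * Y[i] + (1 - lam) * Y[j])
    model.fit_batch(np.stack(X_mix), np.stack(Y_mix))
    return model
```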