Exploring Representation-Level Augmentation for Code Search
Haochen Li1, Chunyan Miao1,2, Cyril Leung1,2, Yanxian Huang3,
Yuan Huang3, Hongyu Zhang4, Yanlin Wang3
1School of Computer Science and Engineering, Nanyang Technological University, Singapore
2China-Singapore International Joint Research Institute (CSIJRI), China
3School of Software Engineering, Sun Yat-sen University, China
4The University of Newcastle, Australia
{haochen003,ascymiao,cleung}@ntu.edu.sg, huangyx353@mail2.sysu.edu.cn
{huangyuan5,wangylin36}@mail.sysu.edu.cn, hongyu.zhang@newcastle.edu.au
Abstract
Code search, which aims at retrieving the most relevant code fragment for a given natural language query, is a common activity in software development practice. Recently, contrastive learning has been widely used in code search research, and many data augmentation approaches for source code (e.g., semantic-preserving program transformations) have been proposed to learn better representations. However, these augmentations operate at the raw-data level, which requires additional code analysis in the preprocessing stage and incurs additional costs in the training stage. In this paper, we explore augmentation methods that augment data (both code and query) at the representation level, which requires no additional data processing or training, and based on this we propose a general format of representation-level augmentation that unifies existing methods. We then propose three new augmentation methods (linear extrapolation, binary interpolation, and Gaussian scaling) based on the general format. Furthermore, we theoretically analyze the advantages of the proposed augmentation methods over traditional contrastive learning methods on code search. We experimentally evaluate the proposed representation-level augmentation methods with state-of-the-art code search models on a large-scale public dataset consisting of six programming languages. The experimental results show that our approach can consistently boost the performance of the studied code search models. Our source code is available at https://github.com/Alex-HaochenLi/RACS.
1 Introduction
In software development, developers often search for and reuse commonly used functionalities to improve their productivity (Nie et al., 2016; Shuai et al., 2020). With the growing size of large-scale codebases such as GitHub, retrieving semantically
relevant code fragments accurately becomes increasingly important in this field (Allamanis et al., 2018; Liu et al., 2021).
Traditional approaches (Nie et al., 2016; Yang and Huang, 2017; Rosario, 2000; Hill et al., 2011; Satter and Sakib, 2016; Lv et al., 2015; Van Nguyen et al., 2017) leverage information retrieval techniques that treat code snippets as natural language text and match certain terms in code with queries, and hence suffer from the vocabulary mismatch problem (McMillan et al., 2011; Robertson et al., 1995). Deep siamese neural networks instead first embed queries and code fragments into a joint embedding space, then measure their similarity by dot product or cosine distance (Lv et al., 2015; Cambronero et al., 2019; Gu et al., 2021). Recently, with the popularity of large-scale pre-training techniques, large models for source code (Guo et al., 2021; Feng et al., 2020; Guo et al., 2022; Wang et al., 2021; Jain et al., 2021; Li et al., 2022) with various pre-training tasks have been proposed and significantly outperform previous models.
Contrastive learning is widely adopted by the above-mentioned models. It is suitable for code search because the learning objective pushes apart negative query-code pairs while pulling together positive pairs. In contrastive learning, negative pairs are usually generated by In-Batch Augmentation (Huang et al., 2021). For positive pairs, besides labeled ones, some researchers have proposed augmentation approaches to generate more positive pairs (Bui et al., 2021; Jain et al., 2021; Fang et al., 2020; Gao et al., 2021; He et al., 2020). The main hypothesis behind these approaches is that the augmentations do not change the original semantics. However, these approaches are resource-consuming (Yin et al., 2021; Jeong et al., 2022), because models have to embed the augmented data in addition to the original data.
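For concreteness, the sketch below shows one common way to compute an InfoNCE-style contrastive loss with in-batch negatives over query and code representations in PyTorch. It is our own minimal illustration rather than the implementation of any model cited above; the function name and the temperature value are assumptions.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_reprs, code_reprs, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives.

    query_reprs, code_reprs: (batch_size, dim) tensors where the i-th query
    and the i-th code form a labeled positive pair; every other code in the
    batch serves as a negative for that query.
    """
    q = F.normalize(query_reprs, dim=-1)
    c = F.normalize(code_reprs, dim=-1)
    logits = q @ c.t() / temperature                      # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)    # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

Each additional raw-data augmentation requires another forward pass through the encoder to obtain the extra positive representations, which is the cost that representation-level augmentation avoids.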
To solve this problem, some researchers proposed representation-level augmentation, which
augments the representations of the original data. For example, linear interpolation, a representation-level augmentation method, has been adopted for many classification tasks in NLP (Guo et al., 2019; Sun et al., 2020; Du et al., 2021). The augmented representation captures the structure of the data manifold and hence can force the model to learn better features, as argued by Verma et al. (2021). These augmentation approaches are also considered to be semantic-preserving.
Representation-level augmentation methods have not been investigated on the code search task before. To the best of our knowledge, Jeong et al. (2022) is the only work that brings representation-level augmentation approaches to a retrieval task. Besides linear interpolation, it also proposes another approach, called stochastic perturbation, for document retrieval. Although these augmentation methods improve model performance, they are not yet fully investigated: the relationships between the existing methods and how they affect model performance remain to be explored.
In this work, we first unify linear interpolation and stochastic perturbation into a general format of representation-level augmentation. We further propose three augmentation methods (linear extrapolation, binary interpolation, and Gaussian scaling) based on the general format. Then we theoretically analyze the advantages of the proposed augmentation methods based on the most commonly used InfoNCE loss (Van den Oord et al., 2018): since optimizing the InfoNCE loss amounts to maximizing a lower bound on the mutual information between positive pairs, applying representation-level augmentation leads to tighter lower bounds of mutual information. We evaluate representation-level augmentation on several siamese networks across several large-scale datasets. Experimental results show the effectiveness of the representation-level augmentation methods in boosting the performance of these code search models. To verify the generalization ability of our method to other tasks, we also conduct experiments on the paragraph retrieval task, and the results show that our method can also improve the performance of several paragraph retrieval models.
In summary, the contributions of this work are as follows:
• We unify previous representation-level augmentation methods into a general format. Based on this general format, we propose three novel augmentation methods.
• We conduct a theoretical analysis to show that representation-level augmentation yields tighter lower bounds of mutual information between positive pairs.
• We apply representation-level augmentation to several code search models and evaluate them on the public CodeSearchNet dataset with six programming languages. Improvements in MRR (Mean Reciprocal Rank) demonstrate the effectiveness of the representation-level augmentation methods.
The rest of the paper is organized as follows. We introduce related work on code search and data augmentation in Section 2. Section 3 introduces the main part, including the general format of representation-level augmentation, new augmentation methods, and their application to code search. In Section 4, we analyze the theoretical lower bounds of mutual information and study why our approach works. In Section 5 and Section 6, we conduct extensive experiments to show the effectiveness of our approach. Then we discuss the generality of our approach in Section 7, and Section 8 concludes this paper.
2 Related Work
2.1 Code search
As code search can significantly improve the productivity of software developers by enabling the reuse of functionalities in large codebases, finding semantically relevant code fragments precisely is one of the key challenges in code search.
Traditional approaches leverage information retrieval techniques that try to match keywords between queries and code (McMillan et al., 2011; Robertson et al., 1995). These approaches suffer from the vocabulary mismatch problem, where models fail to retrieve relevant code because queries and code express the same semantics with different terms.
Later, deep neural models for code search were proposed. They can be divided into two categories: early fusion and late fusion. Late fusion approaches (Gu et al., 2018; Husain et al., 2019) use a siamese network to embed queries and code into a shared vector space separately, then calculate the dot product or cosine distance to measure semantic similarity. Recently, following the idea of late fusion, transformer-based models with specifically designed pre-training tasks have been proposed (Feng et al., 2020; Guo et al., 2021, 2022). They significantly outperform previous models by improving the understanding of code semantics. Instead of calculating representations of queries and code independently, early fusion approaches model the correlations between queries and code during the embedding process (Li et al., 2020). Li et al. (2020) argue that early fusion makes it easier to capture implicit similarities. For an online code search system, the late fusion approach facilitates the use of neural models because code representations can be calculated and stored in advance; at run time, only query representations need to be computed. Thus, in this work, we focus on late fusion approaches.
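To illustrate why late fusion suits online code search, the sketch below assumes a matrix of code embeddings (code_index) produced offline by a trained code encoder; at query time only the query embedding (query_repr) is computed and scored against the cache. The names and shapes are hypothetical placeholders, not taken from any specific model above.

import torch
import torch.nn.functional as F

def search(query_repr: torch.Tensor, code_index: torch.Tensor, top_k: int = 10):
    """Rank cached code embeddings by cosine similarity to one query embedding.

    query_repr: (dim,) embedding of the incoming query.
    code_index: (num_snippets, dim) pre-computed code embeddings.
    """
    q = F.normalize(query_repr, dim=-1)
    c = F.normalize(code_index, dim=-1)
    scores = c @ q                       # cosine similarity per cached snippet
    return torch.topk(scores, k=top_k)   # indices of the best-matching snippets

Because only the query is encoded online, retrieval latency is dominated by a single encoder forward pass plus a matrix-vector product over the cached index.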
2.2 Data augmentation
Data augmentation has long been considered crucial for learning better representations in contrastive learning. The augmented data are assumed to have the same semantics as the original data. For the augmentation of queries, synonym replacement, random insertion, random swap, random deletion, back-translation, the spans technique, and word perturbation can potentially be used to generate individual augmentations (Wei and Zou, 2019; Giorgi et al., 2021; Fang et al., 2020). For the augmentation of code fragments, Bui et al. (2021) proposed six semantic-preserving transformations: Variable Renaming, Permute Statement, Unused Statement, Loop Exchange, Switch to If, and Boolean Exchange. These query and code augmentation approaches have one thing in common: the transformation is applied to the original input data. Another category augments data during the embedding process, where models generate different representations of the same data by leveraging time-varying mechanisms. MoCo (He et al., 2020) encodes the same data twice with encoders that share an architecture but have different parameters. SimCSE (Gao et al., 2021) leverages the property of dropout layers, which randomly deactivate different neurons for the same input. The methods described in this paragraph are resource-consuming because models embed the data twice to obtain representations of the original and augmented data.
For representation-level augmentation on NLP tasks, linear interpolation has been widely used for classification in previous work (Guo et al., 2019; Sun et al., 2020; Du et al., 2021). These works treat the interpolation result as noised data and train models to classify the noised sample into the original class. Verma et al. (2021) theoretically analyzed how interpolation noise benefits classification tasks and why it is better than Gaussian noise. Jeong et al. (2022) is the first to introduce linear interpolation and perturbation to the document retrieval task. However, the effects and the intrinsic relationship of these two methods have not been fully investigated.
3 Approach
In this section, we unify linear interpolation and stochastic perturbation into a general format. Based on it, we propose three further augmentation methods for the code retrieval task: linear extrapolation, binary interpolation, and Gaussian scaling. Then, we explain how to apply representation-level augmentation with the InfoNCE loss in code retrieval.
3.1 General format of representation-level augmentation
For simplicity, we take code augmentation as an example to elaborate the details; the calculation process is similar when applied to query augmentation. Given a dataset $\mathcal{D} = \{x_i\}_{i=1}^{K}$, where $x_i$ is a code snippet and $K$ is the size of the dataset, we use an encoder function $h: \mathcal{D} \rightarrow \mathcal{H}$ to map code snippets to their representations $\mathcal{H}$.
Linear interpolation
Linear interpolation randomly interpolates $h_i$ with another chosen sample $h_j$ from $\mathcal{H}$:

$$h_i^{+} = \lambda h_i + (1 - \lambda) h_j \qquad (1)$$

where $\lambda$ is a coefficient sampled from a random distribution. For example, $\lambda$ can be sampled from a uniform distribution $\lambda \sim U(\alpha, 1.0)$ with a high value of $\alpha$ to make sure that the augmented data has similar semantics to the original code $x_i$.
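As a concrete illustration of Eq. (1), the sketch below applies linear interpolation to a batch of representations in PyTorch. It is only a minimal sketch under our own assumptions: partners $h_j$ are drawn from the same mini-batch, and the function name and default $\alpha$ are illustrative rather than taken from the paper's implementation.

import torch

def linear_interpolation(reprs: torch.Tensor, alpha: float = 0.9) -> torch.Tensor:
    """Augment each representation following Eq. (1): h_i^+ = lam*h_i + (1-lam)*h_j.

    reprs: (batch_size, dim) tensor of code (or query) representations.
    lam is drawn per sample from U(alpha, 1.0); a high alpha keeps the augmented
    representation close to the original one.
    """
    batch_size = reprs.size(0)
    # choose a partner h_j for each h_i by shuffling the batch
    # (a sample may occasionally be paired with itself, which is harmless here)
    partner = reprs[torch.randperm(batch_size, device=reprs.device)]
    lam = torch.empty(batch_size, 1, device=reprs.device).uniform_(alpha, 1.0)
    return lam * reprs + (1.0 - lam) * partner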
Stochastic perturbation
Stochastic perturbation aims at randomly deactivating some features of the representation vectors. To do so, masks are sampled from a Bernoulli distribution $B(e, p)$, where $e$ is the embedding dimension and $p$ is a low probability value, since we only deactivate a small proportion of features. In implementation, Dropout layers can be used.
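A minimal sketch of this masking, assuming each of the $e$ features is zeroed independently with probability $p$; the function name and default $p$ are illustrative. Note that PyTorch's nn.Dropout additionally rescales surviving features by $1/(1-p)$ during training, so an explicit mask is shown here for clarity.

import torch

def stochastic_perturbation(reprs: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    """Randomly deactivate a small proportion of features of each representation.

    A binary mask over the embedding dimension is sampled element-wise from
    Bernoulli(1 - p), so each feature is zeroed out with the low probability p.
    """
    mask = torch.bernoulli(torch.full_like(reprs, 1.0 - p))
    return reprs * mask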
General format of representation-level augmentation
We revisit the above two augmentation