
Exploring Representation-Level Augmentation for Code Search
Haochen Li1   Chunyan Miao1,2∗   Cyril Leung1,2   Yanxian Huang3
Yuan Huang3   Hongyu Zhang4   Yanlin Wang3
1School of Computer Science and Engineering, Nanyang Technological University, Singapore
2China-Singapore International Joint Research Institute (CSIJRI), China
3School of Software Engineering, Sun Yat-sen University, China
4The University of Newcastle, Australia
{haochen003,ascymiao,cleung}@ntu.edu.sg, huangyx353@mail2.sysu.edu.cn
{huangyuan5,wangylin36}@mail.sysu.edu.cn, hongyu.zhang@newcastle.edu.au
∗Corresponding author.
Abstract
Code search, which aims at retrieving the most relevant code fragment for a given natural language query, is a common activity in software development practice. Recently, contrastive learning has been widely used in code search research, where many data augmentation approaches for source code (e.g., semantic-preserving program transformations) have been proposed to learn better representations. However, these augmentations operate at the raw-data level, which requires additional code analysis in the preprocessing stage and additional costs in the training stage. In this paper, we explore augmentation methods that augment data (both code and query) at the representation level, which require no additional data processing or training, and based on this we propose a general format of representation-level augmentation that unifies existing methods. We then propose three new augmentation methods (linear extrapolation, binary interpolation, and Gaussian scaling) based on this general format. Furthermore, we theoretically analyze the advantages of the proposed augmentation methods over traditional contrastive learning methods for code search. We experimentally evaluate the proposed representation-level augmentation methods with state-of-the-art code search models on a large-scale public dataset covering six programming languages. The experimental results show that our approach consistently boosts the performance of the studied code search models. Our source code is available at https://github.com/Alex-HaochenLi/RACS.
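The exact formulations are given later in the paper; as a rough, hypothetical sketch of what augmenting at the representation level can look like, the operations below perturb already-computed embedding vectors directly, so no extra encoder pass over transformed raw inputs is needed. The function bodies and hyperparameter values are illustrative assumptions, not the paper's definitions.

import torch

def linear_extrapolation(h, h_other, lam=0.1):
    # Assumed form: push a representation h slightly away from another
    # in-batch representation h_other along their difference direction.
    return h + lam * (h - h_other)

def binary_interpolation(h, h_other, p=0.1):
    # Assumed form: randomly replace a fraction p of h's dimensions
    # with the corresponding dimensions of h_other.
    mask = (torch.rand_like(h) < p).float()
    return (1.0 - mask) * h + mask * h_other

def gaussian_scaling(h, sigma=0.1):
    # Assumed form: rescale each dimension by Gaussian noise centred at 1.
    return h * (1.0 + sigma * torch.randn_like(h))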
1 Introduction
In software development, developers often search for and reuse commonly used functionalities to improve their productivity (Nie et al., 2016; Shuai et al., 2020). With the growing size of large-scale codebases such as GitHub, accurately retrieving semantically relevant code fragments becomes increasingly important (Allamanis et al., 2018; Liu et al., 2021).
Traditional approaches (Nie et al., 2016; Yang and Huang, 2017; Rosario, 2000; Hill et al., 2011; Satter and Sakib, 2016; Lv et al., 2015; Van Nguyen et al., 2017) leverage information retrieval techniques that treat code snippets as natural language text and match terms in the code against the query, and hence suffer from the vocabulary mismatch problem (McMillan et al., 2011; Robertson et al., 1995). Deep siamese neural networks instead first embed queries and code fragments into a joint embedding space and then measure their similarity by dot product or cosine distance (Lv et al., 2015; Cambronero et al., 2019; Gu et al., 2021). Recently, with the popularity of large-scale pre-training techniques, a number of large models for source code (Guo et al., 2021; Feng et al., 2020; Guo et al., 2022; Wang et al., 2021; Jain et al., 2021; Li et al., 2022) with various pre-training tasks have been proposed and significantly outperform previous models.
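As a rough illustration of this retrieval setup (a minimal sketch; the actual encoders and similarity functions vary across the cited models, and the function and variable names here are hypothetical):

import torch
import torch.nn.functional as F

def rank_code(query_emb, code_embs):
    # query_emb: (d,) and code_embs: (N, d), produced by a shared (siamese) encoder
    # that maps queries and code fragments into the same embedding space.
    sims = F.cosine_similarity(query_emb.unsqueeze(0), code_embs, dim=-1)  # (N,)
    return sims.argsort(descending=True)  # code indices, most similar first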
Contrastive learning is widely adopted by the above-mentioned models. It suits code search because the learning objective pushes apart negative query-code pairs while pulling together positive pairs. In contrastive learning, negative pairs are usually generated by In-Batch Augmentation (Huang et al., 2021). For positive pairs, besides the labeled ones, some researchers have proposed augmentation approaches to generate additional positive pairs (Bui et al., 2021; Jain et al., 2021; Fang et al., 2020; Gao et al., 2021; He et al., 2020). The main hypothesis behind these approaches is that the augmentations do not change the original semantics. However, these approaches are resource-consuming (Yin et al., 2021; Jeong et al., 2022): models have to encode the augmented data again.
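A common instantiation of this objective is an in-batch contrastive (InfoNCE-style) loss, sketched below under the assumption of one labeled positive code snippet per query; the exact loss and temperature used by each cited model may differ.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q, c, temperature=0.05):
    # q, c: (B, d) query and code embeddings for B labeled positive pairs.
    # Each query treats its own code as the positive and the other B-1 codes
    # in the batch as negatives (in-batch augmentation of negative pairs).
    q = F.normalize(q, dim=-1)
    c = F.normalize(c, dim=-1)
    logits = q @ c.t() / temperature                   # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)  # diagonal entries are positives
    return F.cross_entropy(logits, labels)

Raw-data augmentation enlarges the set of inputs that must pass through the encoder to obtain these embeddings, which is where the extra training cost comes from.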
To solve this problem, some researchers have proposed representation-level augmentation, which