Exploring Representation-Level Augmentation for Code Search
Haochen Li1, Chunyan Miao1,2, Cyril Leung1,2, Yanxian Huang3,
Yuan Huang3, Hongyu Zhang4, Yanlin Wang3
1School of Computer Science and Engineering, Nanyang Technological University, Singapore
2China-Singapore International Joint Research Institute (CSIJRI), China
3School of Software Engineering, Sun Yat-sen University, China
4The University of Newcastle, Australia
{haochen003,ascymiao,cleung}@ntu.edu.sg, huangyx353@mail2.sysu.edu.cn
{huangyuan5,wangylin36}@mail.sysu.edu.cn, hongyu.zhang@newcastle.edu.au
Abstract
Code search, which aims at retrieving the most relevant code fragment for a given natural language query, is a common activity in software development practice. Recently, contrastive learning has been widely used in code search research, and many data augmentation approaches for source code (e.g., semantic-preserving program transformations) have been proposed to learn better representations. However, these augmentations operate at the raw-data level, which requires additional code analysis in the preprocessing stage and incurs additional costs in the training stage. In this paper, we explore augmentation methods that augment data (both code and query) at the representation level, which requires no additional data processing or training, and based on this we propose a general format of representation-level augmentation that unifies existing methods. We then propose three new augmentation methods (linear extrapolation, binary interpolation, and Gaussian scaling) based on the general format. Furthermore, we theoretically analyze the advantages of the proposed augmentation methods over traditional contrastive learning methods on code search. We experimentally evaluate the proposed representation-level augmentation methods with state-of-the-art code search models on a large-scale public dataset consisting of six programming languages. The experimental results show that our approach can consistently boost the performance of the studied code search models. Our source code is available at https://github.com/Alex-HaochenLi/RACS.
1 Introduction
In software development, developers often search for and reuse commonly used functionalities to improve their productivity (Nie et al., 2016; Shuai et al., 2020). With the growing size of large-scale codebases such as GitHub, retrieving semantically
relevant code fragments accurately becomes increasingly important in this field (Allamanis et al., 2018; Liu et al., 2021).
Traditional approaches (Nie et al., 2016; Yang and Huang, 2017; Rosario, 2000; Hill et al., 2011; Satter and Sakib, 2016; Lv et al., 2015; Van Nguyen et al., 2017) leverage information retrieval techniques that treat code snippets as natural language text and match certain terms in code with queries, and hence suffer from the vocabulary mismatch problem (McMillan et al., 2011; Robertson et al., 1995). Deep siamese neural networks instead first embed queries and code fragments into a joint embedding space, then measure their similarity by dot product or cosine distance (Lv et al., 2015; Cambronero et al., 2019; Gu et al., 2021). Recently, with the popularity of large-scale pre-training techniques, large models for source code (Guo et al., 2021; Feng et al., 2020; Guo et al., 2022; Wang et al., 2021; Jain et al., 2021; Li et al., 2022) with various pre-training tasks have been proposed and significantly outperform previous models.
Contrastive learning is widely adopted by the above-mentioned models. It is suitable for code search because the learning objective pushes apart negative query-code pairs while pulling together positive pairs. In contrastive learning, negative pairs are usually generated by In-Batch Augmentation (Huang et al., 2021). For positive pairs, besides labeled ones, some researchers have proposed augmentation approaches to generate more positive pairs (Bui et al., 2021; Jain et al., 2021; Fang et al., 2020; Gao et al., 2021; He et al., 2020). The main hypothesis behind these approaches is that the augmentations do not change the original semantics. However, these approaches are resource-consuming (Yin et al., 2021; Jeong et al., 2022), because models have to embed the augmented data in addition to the original data.
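For concreteness, the sketch below shows one common way to compute an InfoNCE-style contrastive loss with in-batch negatives over query and code representations in PyTorch. It is our own minimal illustration rather than the implementation of any model cited above; the function name and the temperature value are assumptions.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_reprs, code_reprs, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives.

    query_reprs, code_reprs: (batch_size, dim) tensors where the i-th query
    and the i-th code form a labeled positive pair; every other code in the
    batch serves as a negative for that query.
    """
    q = F.normalize(query_reprs, dim=-1)
    c = F.normalize(code_reprs, dim=-1)
    logits = q @ c.t() / temperature                      # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)    # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

Each additional raw-data augmentation requires another forward pass through the encoder to obtain the extra positive representations, which is the cost that representation-level augmentation avoids.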
To solve this problem, some researchers proposed representation-level augmentation, which
augments the representations of the original data. For example, linear interpolation, a representation-level augmentation method, has been adopted for many classification tasks in NLP (Guo et al., 2019; Sun et al., 2020; Du et al., 2021). The augmented representation captures the structure of the data manifold and hence can force the model to learn better features, as argued by Verma et al. (2021). These augmentation approaches are also considered to be semantic-preserving.
Representation-level augmentation methods have not been investigated on the code search task before. To the best of our knowledge, Jeong et al. (2022) is the only work that brings representation-level augmentation approaches to a retrieval task. Besides linear interpolation, it also proposes another approach, called stochastic perturbation, for document retrieval. Although these augmentation methods improve model performance, they are not yet fully investigated: the relationships between the existing methods and how they affect model performance remain to be explored.
In this work, we first unify linear interpolation and stochastic perturbation into a general format of representation-level augmentation. We further propose three augmentation methods (linear extrapolation, binary interpolation, and Gaussian scaling) based on the general format. Then we theoretically analyze the advantages of the proposed augmentation methods based on the most commonly used InfoNCE loss (Van den Oord et al., 2018): since optimizing the InfoNCE loss amounts to maximizing a lower bound on the mutual information between positive pairs, applying representation-level augmentation leads to tighter lower bounds of mutual information. We evaluate representation-level augmentation on several siamese networks across several large-scale datasets. Experimental results show the effectiveness of the representation-level augmentation methods in boosting the performance of these code search models. To verify the generalization ability of our method to other tasks, we also conduct experiments on the paragraph retrieval task, and the results show that our method can also improve the performance of several paragraph retrieval models.
In summary, the contributions of this work are as follows:
• We unify previous representation-level augmentation methods into a general format. Based on this general format, we propose three novel augmentation methods.
• We conduct a theoretical analysis to show that representation-level augmentation yields tighter lower bounds of mutual information between positive pairs.
• We apply representation-level augmentation to several code search models and evaluate them on the public CodeSearchNet dataset with six programming languages. Improvements in MRR (Mean Reciprocal Rank) demonstrate the effectiveness of the representation-level augmentation methods.
The rest of the paper is organized as follows. We introduce related work on code search and data augmentation in Section 2. Section 3 introduces the main part, including the general format of representation-level augmentation, new augmentation methods, and their application to code search. In Section 4, we analyze the theoretical lower bounds of mutual information and study why our approach works. In Section 5 and Section 6, we conduct extensive experiments to show the effectiveness of our approach. Then we discuss the generality of our approach in Section 7, and Section 8 concludes this paper.
2 Related Work
2.1 Code search
As code search can significantly improve the productivity of software developers by enabling the reuse of functionalities in large codebases, finding semantically relevant code fragments precisely is one of the key challenges in code search.
Traditional approaches leverage information retrieval techniques that try to match keywords between queries and code (McMillan et al., 2011; Robertson et al., 1995). These approaches suffer from the vocabulary mismatch problem, where models fail to retrieve relevant code because queries and code express the same semantics with different terms.
Later, deep neural models for code search were proposed. They can be divided into two categories: early fusion and late fusion. Late fusion approaches (Gu et al., 2018; Husain et al., 2019) use a siamese network to embed queries and code into a shared vector space separately, then calculate the dot product or cosine distance to measure semantic similarity. Recently, following the idea of late fusion, transformer-based models with specifically designed pre-training tasks have been proposed (Feng et al., 2020; Guo et al., 2021, 2022). They significantly outperform previous models by improving the understanding of code semantics. Instead of calculating representations of queries and code independently, early fusion approaches model the correlations between queries and code during the embedding process (Li et al., 2020). Li et al. (2020) argue that early fusion makes it easier to capture implicit similarities. For an online code search system, the late fusion approach facilitates the use of neural models because code representations can be calculated and stored in advance; at run time, only query representations need to be computed. Thus, in this work, we focus on late fusion approaches.
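To illustrate why late fusion suits online code search, the sketch below assumes a matrix of code embeddings (code_index) produced offline by a trained code encoder; at query time only the query embedding (query_repr) is computed and scored against the cache. The names and shapes are hypothetical placeholders, not taken from any specific model above.

import torch
import torch.nn.functional as F

def search(query_repr: torch.Tensor, code_index: torch.Tensor, top_k: int = 10):
    """Rank cached code embeddings by cosine similarity to one query embedding.

    query_repr: (dim,) embedding of the incoming query.
    code_index: (num_snippets, dim) pre-computed code embeddings.
    """
    q = F.normalize(query_repr, dim=-1)
    c = F.normalize(code_index, dim=-1)
    scores = c @ q                       # cosine similarity per cached snippet
    return torch.topk(scores, k=top_k)   # indices of the best-matching snippets

Because only the query is encoded online, retrieval latency is dominated by a single encoder forward pass plus a matrix-vector product over the cached index.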
2.2 Data augmentation
Data augmentation has long been considered crucial for learning better representations in contrastive learning. The augmented data are assumed to have the same semantics as the original data. For the augmentation of queries, synonym replacement, random insertion, random swap, random deletion, back-translation, the spans technique, and word perturbation can potentially be used to generate individual augmentations (Wei and Zou, 2019; Giorgi et al., 2021; Fang et al., 2020). For the augmentation of code fragments, Bui et al. (2021) proposed six semantic-preserving transformations: Variable Renaming, Permute Statement, Unused Statement, Loop Exchange, Switch to If, and Boolean Exchange. These query and code augmentation approaches have one thing in common: the transformation is applied to the original input data. Another category augments data during the embedding process, where models generate different representations of the same data by leveraging time-varying mechanisms. MoCo (He et al., 2020) encodes the same data twice with encoders that share an architecture but have different parameters. SimCSE (Gao et al., 2021) leverages the property of dropout layers, which randomly deactivate different neurons for the same input. The methods described in this paragraph are resource-consuming because models embed the data twice to obtain representations of the original and augmented data.
For representation-level augmentation on NLP tasks, linear interpolation has been widely used for classification in previous work (Guo et al., 2019; Sun et al., 2020; Du et al., 2021). These works treat the interpolation result as noised data and train models to classify the noised sample into the original class. Verma et al. (2021) theoretically analyzed how interpolation noise benefits classification tasks and why it is better than Gaussian noise. Jeong et al. (2022) is the first to introduce linear interpolation and perturbation to the document retrieval task. However, the effects and the intrinsic relationship of these two methods have not been fully investigated.
3 Approach
In this section, we unify linear interpolation and stochastic perturbation into a general format. Based on it, we propose three further augmentation methods for the code retrieval task: linear extrapolation, binary interpolation, and Gaussian scaling. Then, we explain how to apply representation-level augmentation with the InfoNCE loss in code retrieval.
3.1 General format of representation-level augmentation
For simplicity, we take code augmentation as an example to elaborate the details; the calculation process is similar when applied to query augmentation. Given a dataset $\mathcal{D} = \{x_i\}_{i=1}^{K}$, where $x_i$ is a code snippet and $K$ is the size of the dataset, we use an encoder function $h: \mathcal{D} \rightarrow \mathcal{H}$ to map code snippets to their representations $\mathcal{H}$.
Linear interpolation
Linear interpolation randomly interpolates $h_i$ with another chosen sample $h_j$ from $\mathcal{H}$:

$$h_i^{+} = \lambda h_i + (1 - \lambda) h_j \qquad (1)$$

where $\lambda$ is a coefficient sampled from a random distribution. For example, $\lambda$ can be sampled from a uniform distribution $\lambda \sim U(\alpha, 1.0)$ with a high value of $\alpha$ to make sure that the augmented data has similar semantics to the original code $x_i$.
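As a concrete illustration of Eq. (1), the sketch below applies linear interpolation to a batch of representations in PyTorch. It is only a minimal sketch under our own assumptions: partners $h_j$ are drawn from the same mini-batch, and the function name and default $\alpha$ are illustrative rather than taken from the paper's implementation.

import torch

def linear_interpolation(reprs: torch.Tensor, alpha: float = 0.9) -> torch.Tensor:
    """Augment each representation following Eq. (1): h_i^+ = lam*h_i + (1-lam)*h_j.

    reprs: (batch_size, dim) tensor of code (or query) representations.
    lam is drawn per sample from U(alpha, 1.0); a high alpha keeps the augmented
    representation close to the original one.
    """
    batch_size = reprs.size(0)
    # choose a partner h_j for each h_i by shuffling the batch
    # (a sample may occasionally be paired with itself, which is harmless here)
    partner = reprs[torch.randperm(batch_size, device=reprs.device)]
    lam = torch.empty(batch_size, 1, device=reprs.device).uniform_(alpha, 1.0)
    return lam * reprs + (1.0 - lam) * partner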
Stochastic perturbation
Stochastic perturbation aims at randomly deactivating some features of the representation vectors. To do so, masks are sampled from a Bernoulli distribution $B(e, p)$, where $e$ is the embedding dimension and $p$ is a low probability value, since we only deactivate a small proportion of features. In implementation, Dropout layers can be used.
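A minimal sketch of this masking, assuming each of the $e$ features is zeroed independently with probability $p$; the function name and default $p$ are illustrative. Note that PyTorch's nn.Dropout additionally rescales surviving features by $1/(1-p)$ during training, so an explicit mask is shown here for clarity.

import torch

def stochastic_perturbation(reprs: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    """Randomly deactivate a small proportion of features of each representation.

    A binary mask over the embedding dimension is sampled element-wise from
    Bernoulli(1 - p), so each feature is zeroed out with the low probability p.
    """
    mask = torch.bernoulli(torch.full_like(reprs, 1.0 - p))
    return reprs * mask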
General format of representation-level augmentation
We revisit the above two augmentation