Low-Resources Project-Specific Code Summarization

Rui Xie†, Tianxiang Hu†, Wei Ye∗, Shikun Zhang∗
{ruixie,hutianxiang,wye,shangsk}@pku.edu.cn
National Engineering Research Center for Software Engineering, Peking University
Beijing, China
ABSTRACT
Code summarization generates brief natural language descriptions of source code pieces, which can assist developers in understanding code and reduce documentation workload. Recent neural models for code summarization are trained and evaluated on large-scale multi-project datasets consisting of independent code-summary pairs. Despite the technical advances, their effectiveness on a specific project is rarely explored. In practical scenarios, however, developers are more concerned with generating high-quality summaries for their working projects, and these projects may not maintain sufficient documentation, hence having few historical code-summary pairs. To this end, we investigate low-resource project-specific code summarization, a novel task more consistent with developers' requirements. To better characterize project-specific knowledge with limited training samples, we propose a meta transfer learning method that incorporates a lightweight fine-tuning mechanism into a meta-learning framework. Experimental results on nine real-world projects verify the superiority of our method over alternative ones and reveal how project-specific knowledge is learned.
KEYWORDS
low-resources project-specific code summarization, parameter-efficient transfer learning, meta learning
ACM Reference Format:
Rui Xie, Tianxiang Hu, Wei Ye, and Shikun Zhang. 2022. Low-Resources Project-Specific Code Summarization. In 37th IEEE/ACM International Conference on Automated Software Engineering (ASE '22), October 10–14, 2022, Rochester, MI, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3551349.3556909
† Both authors contributed equally to this research.
∗ Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ASE '22, October 10–14, 2022, Rochester, MI, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9475-8/22/10...$15.00
https://doi.org/10.1145/3551349.3556909

1 INTRODUCTION
Code summaries, also referred to as code documentation, are readable natural language texts that describe source code's functionality and serve as one of the most common supports for helping developers understand programs [8]. Data-driven automatic code summarization has now become a rapidly growing research topic. Researchers in the software engineering and natural language processing communities have proposed a variety of neural models for code summarization. These models are usually built upon techniques widely used in machine translation and text summarization, such as reinforcement learning [25], Variational AutoEncoders [4], dual learning [26, 28], and retrieval techniques [15, 29].
Previous neural code summarization models are typically trained and evaluated on large-scale datasets consisting of independent code-summary pairs from many software projects. We refer to them as general code summarization (GCS) models in this paper. Despite the promising results of recent GCS methods, few of them explore their effectiveness on a specific project, which, however, is of greater concern to developers in practical scenarios. After all, what developers need most is not a performant model over cross-project datasets, but a tool that generates high-quality and consistent code summaries for their specific working projects. We term the scenario of generating code summaries for an individual project project-specific code summarization (PCS).

Unfortunately, a good GCS model is not guaranteed to be a good PCS one. The reason mainly lies in that GCS models typically focus on capturing common cross-project semantics, so the generated summaries may fail to capture project-specific characteristics. For example, a software project has its own unique domain knowledge, and in most cases its documentation is written in a relatively consistent style. Under the task settings of GCS, however, models tend to generate monotonous summaries in the manner used by most projects, rather than in the way or style of the target project. Table 1 shows a concrete case, where a robust Transformer-based GCS model [1] is trained on a large-scale multi-project dataset [14]. For the two code snippets in Table 1, the Transformer generates code summaries with the expression style "return true if." These summaries are consistent with the finding of Li et al. [15] that the pattern "returns true if" frequently appears in the dataset provided by LeClair et al. [14]. The problem here is that the Flink project prefers another writing style, "check whether", making the generated summary incoherent with its historical summaries. Meanwhile, we can easily find that the generated summaries do not cover enough meaningful topic words, since a GCS model may have limited domain knowledge of specific projects (Guava and Flink here). These observations suggest that project-specific knowledge should be better distilled to improve code summary quality in PCS.
Table 1: Motivation scenario. Current code summarization models tend to generate summaries in the manner used by most projects (e.g., with the frequent pattern "return true if") rather than in the way of the target project (e.g., with the pattern "checks whether" of project Flink). Meanwhile, directly applying a code summarization model to specific projects may generate semantically poor summaries due to the lack of project-specific domain knowledge. The summaries in this example are generated by a Transformer-based code summarization model [1] trained on a large-scale dataset proposed by LeClair et al. [14].

Guava
  Source Code:
    public boolean contains(...) {
        final Monitor monitor = this.monitor;
        monitor.enter();
        ...
        return q.contains(o);
    }
  Human-Written: returns true if this queue contains the ...
  Transformer:   returns tt true tt if this multimap contains ...

Flink
  Source Code:
    public boolean isEmpty() {
        return size() == 0;
    }
  Human-Written: checks whether the queue is empty has no ...
  Transformer:   returns true if this map contains no key ...
A natural technical design for PCS models is to introduce transfer learning, treating each project as a unique domain. In our preliminary experiment, a classical fine-tuning strategy yields robust performance improvements on projects with numerous off-the-shelf code summaries. However, returning to the practical development scenario, we find that projects are often poorly documented, with insufficient code summaries for fine-tuning. According to our investigation of 300 projects from three prominent open-source organizations (Apache, Google, and Spring), a large number of the projects lack enough historical code summaries for model training. For example, as shown in Table 2, nearly one-third of the projects have fewer than 100 code summaries. Given the maturity of these three organizations, we believe there are many more open-source projects with few or even no existing code summaries. To this end, this paper proposes a novel and essential task, low-resource project-specific code summarization, to tackle a more practical code summarization challenge in the software engineering community.

Table 2: Statistics of the number of open-source projects whose code summary counts are less than 10 or 100. We selected 100 projects each from the open-source organizations Apache, Google, and Spring, and then counted the number of summaries for public methods.

Source   Total   Projects with #Summary < 10   Projects with #Summary < 100
Apache   100     3                             11
Google   100     11                            41
Spring   100     13                            44
Total    300     27                            96

As a pioneering effort on low-resource project-specific code summarization, we then propose a simple yet effective meta-learning-based approach with the following two characteristics.
1. Since one development organization usually has more than one working project, we investigate how to leverage multiple projects to promote transferring project-specific knowledge. Unlike conventional GCS methods that treat data samples from different projects uniformly, our method regards each project as an independent domain to better characterize project features. Specifically, given the efficacy of meta-learning in handling low-resource applications [5], we introduce Model-Agnostic Meta-Learning (MAML) [6] into PCS. Meta transfer learning in our scenario means learning to do transfer learning via fine-tuning multiple projects together. More specifically, we condense shared cross-project knowledge into the weight initialization of neural models, enabling better project-specific domain adaptation.
2. Code summarization models built on modern sequence-to-sequence architectures usually have a large number of parameters, which brings two limitations to PCS. On the one hand, project-specific knowledge is accumulated and updated frequently with code or documentation revisions, hindering the efficiency of iterative model optimization. On the other hand, the limited training data of a specific project can easily cause large models to overfit. Therefore, inspired by the recent advent of prompt learning in the NLP community [16], our meta transfer learning keeps the pre-trained model parameters frozen and only optimizes a sequence of continuous project-specific vectors. These vectors, named the project-specific prefix in this paper, involve only a small number of extra parameters, effectively improving the overall meta transfer learning process.
We curate a PCS dataset consisting of nine diversified real-world projects. The automatic and human evaluation results on this dataset verify the overall effectiveness of our method and the necessity of its individual components and, more importantly, suggest promising research opportunities on project-specific code summarization.

The contributions of this paper are listed as follows:

• We propose low-resource project-specific code summarization, an essential and novel task more consistent with practical development scenarios.
• As a pioneering exploration of low-resource project-specific code summarization, we design Meta Prefix-tuning for COde Summarization (MPCos). MPCos captures project knowledge effectively and efficiently by integrating a project-specific prefix-based fine-tuning mechanism into a meta-learning framework, serving as a solid baseline for future study.
• By looking into the token frequency patterns in our generated summaries, we reveal, to some extent, the internal process of project-specific knowledge learning.
2 PROBLEM FORMULATION
We dene the low-resource project-specic code summarization
problem as follows. Given a target project with limited code-summary
Low-Resources Project-Specific Code Summarization ASE ’22, October 10–14, 2022, Rochester, MI, USA
pairs corpus:
𝐶tgt ={(𝑋𝑖,^
𝑌𝑖)}𝑖=1..𝑁 tgt
, where
𝑁tgt
is the number
of code-summary pairs in the target project, the code summariza-
tion model is supposed learning how to generate correct summaries
for the target project, with the help of the general knowledge of
large-scale multi-project corpus and the cross-domain knowledge
of few accompanying projects. Note that a target project can also
serve as an accompanying one for other projects. Our task setting
in this paper involves nine target projects, which means each one
has eight accompanying projects. The task should care about the
overall performance of all target projects instead of a single one. In
the low-resource situation, the number of code-summary pairs in
target project
𝑁tgt
is usually small. Therefore, following previous
work and real-world development experience, we mainly focus on
two settings where 𝑁tgt =10 and 𝑁tgt =100.
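Purely as a toy illustration of this setting (the project names, pair counts, and placeholders below are hypothetical, not the paper's dataset or code), the nine-project rotation can be organized as follows:

```python
# Toy illustration of the low-resource PCS setting; all names and sizes are hypothetical.
N_TGT = 100                                   # low-resource setting: N_tgt = 10 or 100

# Each project is a list of (code, summary) pairs; placeholders stand in for real data.
projects = {f"project_{i}": [(f"code_{i}_{j}", f"summary_{i}_{j}") for j in range(500)]
            for i in range(9)}

for target_name, pairs in projects.items():
    target_train = pairs[:N_TGT]              # C_tgt: the few pairs available for adaptation
    accompanying = {n: p for n, p in projects.items() if n != target_name}
    # 1) start from a model pre-trained on a large multi-project corpus (general knowledge)
    # 2) meta transfer learning over the eight accompanying projects (cross-domain knowledge)
    # 3) adapt to target_train and evaluate on the rest of the target project
    print(target_name, len(target_train), len(accompanying))
```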
3 METHODOLOGIES
We employ a classical Transformer as the backbone of our method. As shown in Figure 1, MPCos mainly consists of three components: (1) the Code Summarization Module, which generates target summaries based on the encoder-decoder architecture of the Transformer; (2) the Meta-Transfer Learning Module, which leverages multiple projects to learn better initial weights for prefix-tuning; and (3) the Prefix-Tuning Module, which preserves a separate prefix for each project to promote project-specific transfer learning and avoid data cross-contamination. We describe the details of each component in the following subsections.
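As a rough illustration of the meta transfer learning idea behind component (2), learning a shared initialization by fine-tuning on multiple projects, the following first-order MAML-style sketch treats each accompanying project as a task. The `projects`, `support`, `query`, and `loss_fn` names are hypothetical placeholders rather than the paper's implementation, and in MPCos the adapted parameters would be the project-specific prefix rather than the full model.

```python
import copy
import torch

def meta_train(init_module, projects, loss_fn,
               inner_lr=1e-3, meta_lr=1e-4, meta_steps=1000, inner_batches=5):
    """First-order MAML over projects: adapt a copy of the shared initialization to
    each project (inner loop), then update the initialization from the adapted
    copies' query-set gradients (outer loop)."""
    meta_opt = torch.optim.Adam(init_module.parameters(), lr=meta_lr)
    for _ in range(meta_steps):
        meta_opt.zero_grad()
        for project in projects:                           # each project is one task/domain
            learner = copy.deepcopy(init_module)           # start from the shared initialization
            inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
            for batch in project.support(inner_batches):   # inner loop: project-specific fine-tuning
                inner_opt.zero_grad()
                loss_fn(learner, batch).backward()
                inner_opt.step()
            loss_fn(learner, project.query()).backward()   # evaluate the adapted learner
            # First-order approximation: reuse the adapted learner's gradients
            # as the gradient of the shared initialization.
            for p0, p1 in zip(init_module.parameters(), learner.parameters()):
                if p1.grad is not None:
                    p0.grad = p1.grad.clone() if p0.grad is None else p0.grad + p1.grad
        meta_opt.step()
```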
3.1 Code Summarization
In recent years, the Transformer architecture has been widely used in generative tasks and has achieved strong results. Therefore, we use a Transformer encoder-decoder framework as our base model. To process programming languages and natural languages, we first tokenize the input code $X$ into a code token sequence $[x_1; \dots; x_{N_X}]$, where $x_i$ denotes a code token in the original code and $N_X$ denotes the number of tokens; for the target summary $\hat{Y}$, we likewise tokenize it into a token sequence $[\hat{y}_1, \dots, \hat{y}_{N_{\hat{Y}}}]$, where $\hat{y}_i$ denotes a summary token in the target summary and $N_{\hat{Y}}$ denotes the number of tokens.
The base model consists of a Transformer encoder and a Transformer decoder. The Transformer encoder takes the code token sequence as input and generates hidden vectors for the input code:

$$ H = \mathrm{TransformerEncoder}(X) \tag{1} $$

The Transformer encoder consists of stacked Transformer layers. Each layer takes the previous layer's output as input and uses a multi-head attention mechanism to enhance the representations:

$$ \mathrm{MultiHead}(q, v) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_n), \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(q, v) \tag{2} $$
$$ \hat{H}_l = \mathrm{LN}(\mathrm{MultiHead}(H_{l-1}, H_{l-1}) + H_{l-1}) \tag{3} $$
$$ H_l = \mathrm{LN}(\mathrm{FFN}(\hat{H}_l) + \hat{H}_l) \tag{4} $$

where $l$ denotes the $l$-th layer of the Transformer encoder, $n$ denotes the number of attention heads, $\mathrm{Attention}$ denotes the classical attention mechanism proposed by [24], $\mathrm{LN}$ denotes layer normalization, and $\mathrm{FFN}$ denotes a feed-forward network; we use the input code embeddings $\mathrm{emb}(X)$ as the initial state $H_0$.
The Transformer decoder takes $H$ as input and generates the summary in an auto-regressive way. At time step $t$, given the hidden vectors $H$ and the previously generated summary $Y_{<t} = \{y_1, y_2, \dots, y_{t-1}\}$, the decoder produces a hidden vector representing the word to be generated:

$$ S_t = \mathrm{TransformerDecoder}(Y_{<t}, H) \tag{5} $$

The Transformer decoder also consists of several stacked Transformer layers, and the output of the last layer is used to estimate the probability distribution of word $y_t$:

$$ p(y_t \mid X) = \mathrm{softmax}(W_Y S_t) \tag{6} $$

where $W_Y$ is a weight matrix.
We denote the base model as $M_\theta$, where $\theta$ indicates the trainable parameters of the model. Thus, the model can be used to estimate the distribution of the target summary as:

$$ p(y_t \mid X, \theta) = M(Y_{<t}, X, \theta) \tag{7} $$
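To make the base model concrete, the following is a minimal sketch of such a Transformer encoder-decoder summarizer built from PyTorch's standard modules. The hyperparameters and the omission of positional encodings are simplifications for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CodeSummarizer(nn.Module):
    """Minimal Transformer encoder-decoder base model M_theta (Eqs. 1-7)."""
    def __init__(self, code_vocab=30000, sum_vocab=30000, d_model=512,
                 n_heads=8, n_layers=6, d_ff=2048):
        super().__init__()
        self.src_emb = nn.Embedding(code_vocab, d_model)   # emb(X)
        self.tgt_emb = nn.Embedding(sum_vocab, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            dim_feedforward=d_ff, batch_first=True)
        self.out_proj = nn.Linear(d_model, sum_vocab)      # W_Y in Eq. (6)

    def forward(self, code_ids, summary_ids):
        # Causal mask so position t only attends to Y_<t (auto-regressive generation).
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            summary_ids.size(1)).to(code_ids.device)
        # Encoder: H = TransformerEncoder(X); Decoder: S_t = TransformerDecoder(Y_<t, H)
        s = self.transformer(self.src_emb(code_ids), self.tgt_emb(summary_ids),
                             tgt_mask=tgt_mask)
        return self.out_proj(s)  # logits; softmax yields p(y_t | X, theta)
```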
To utilize general code summarization knowledge, we first train $M_\theta$ on a general corpus $C_{pre}$. Here, we use the general code summarization dataset proposed by [14].
Given an input code $X$ and its ground-truth summary $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \dots, \hat{y}_{|\hat{Y}|}\}$ from $C_{pre}$, we optimize the model to minimize the negative log-likelihood (NLL):

$$ \mathcal{L}_{NLL} = -\log p(\hat{Y} \mid X, \theta) = -\frac{1}{|\hat{Y}|} \sum_{t=1}^{|\hat{Y}|} \log \Pr(y_t = \hat{y}_t \mid X, \theta) \tag{8} $$
3.2 Prex Tuning
To prevent over-tting when training large pre-trained model on
low-resource scenario, we propose restricting the number of meta-
trainable parameters and layers. In particular, we apply prex tun-
ing to reduce trainable parameters.
The prex-tuning is a prompting mechanism prepending to
Transformer model by inserting a prex vector into each layer
of the transformer. Taking encoder as example, the prex-tuning
Transformer layer can be expressed as:
𝑃𝑙=𝑀𝐿𝑃𝑙(𝑒𝑚𝑏𝑙(𝑝𝑟𝑜 𝑗𝑒𝑐𝑡 )) (9)
^
𝐻𝑙=𝐿𝑁 (𝑀𝑢𝑙𝑡𝑖𝐻𝑒𝑎𝑑 (𝐻𝑙1,[𝑃𝑙;𝐻𝑙1]) + 𝐻𝑙1)(10)
𝐻𝑙=𝐿𝑁 (𝐹 𝐹 𝑁 (^
𝐻𝑙) + ^
𝐻𝑙)(11)
where 𝑃𝑙represents the prexed vectors of corresponding project
on the
𝑙𝑡
layer,
𝑒𝑚𝑏𝑙
represents the projection operation for layer
𝑙
from source project to its corresponding embedding based on the
embedding matrix
𝑀𝑙
, and
𝑀𝐿𝑃𝑙
represents the classical Multilayer
Perceptron network for layer
𝑙
. Following [
16
], we update the pa-
rameters of
𝑀𝐿𝑃𝑙
and the embedding matrix
𝑀𝑙
during training.
Once training is complete, these parameters can be dropped, and
only the prexed vectors
𝑃𝑙
needs to be saved. The illustration of
the proposed prex tuning model is shown in Figure 1.