Low-Resources Project-Specific Code Summarization

Rui Xie†, Tianxiang Hu†, Wei Ye∗, Shikun Zhang∗
{ruixie,hutianxiang,wye,shangsk}@pku.edu.cn
National Engineering Research Center for Software Engineering, Peking University
Beijing, China
ABSTRACT
Code summarization generates brief natural language descriptions of source code pieces, which can assist developers in understanding code and reduce documentation workload. Recent neural models for code summarization are trained and evaluated on large-scale multi-project datasets consisting of independent code-summary pairs. Despite the technical advances, their effectiveness on a specific project is rarely explored. In practical scenarios, however, developers are more concerned with generating high-quality summaries for their working projects, and these projects may not maintain sufficient documentation, hence having few historical code-summary pairs. To this end, we investigate low-resource project-specific code summarization, a novel task more consistent with developers' requirements. To better characterize project-specific knowledge with limited training samples, we propose a meta transfer learning method that incorporates a lightweight fine-tuning mechanism into a meta-learning framework. Experimental results on nine real-world projects verify the superiority of our method over alternative ones and reveal how project-specific knowledge is learned.
KEYWORDS
low-resources project-specific code summarization, parameter-efficient transfer learning, meta learning
ACM Reference Format:
Rui Xie, Tianxiang Hu, Wei Ye, and Shikun Zhang. 2022. Low-Resources Project-Specific Code Summarization. In 37th IEEE/ACM International Conference on Automated Software Engineering (ASE '22), October 10–14, 2022, Rochester, MI, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3551349.3556909
† Both authors contributed equally to this research.
∗ Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ASE '22, October 10–14, 2022, Rochester, MI, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9475-8/22/10...$15.00
https://doi.org/10.1145/3551349.3556909

1 INTRODUCTION
Code summaries, also referred to as code documentation, are readable natural language texts that describe source code's functionality and serve as one of the most common supports for helping developers understand programs [8]. Data-driven automatic code summarization has now become a rapidly growing research topic. Researchers in the software engineering and natural language processing communities have proposed a variety of neural models for code summarization. These models are usually built upon techniques widely used in machine translation and text summarization, such as reinforcement learning [25], Variational AutoEncoders [4], dual learning [26, 28], and retrieval techniques [15, 29].
Previous neural code summarization models are typically trained and evaluated on large-scale datasets consisting of independent code-summary pairs from many software projects. We refer to them as general code summarization (GCS) models in this paper. Despite the promising results of recent GCS methods, few of them explore their effectiveness on a specific project, which, however, is of greater concern to developers in practical scenarios. After all, what developers need most is not a performant model over cross-project datasets, but a tool that generates high-quality and consistent code summaries for their specific working projects. We term the scenario of generating code summaries for an individual project project-specific code summarization (PCS).

Unfortunately, a good GCS model is not guaranteed to be a good PCS one. The reason mainly lies in that GCS models typically focus on capturing common cross-project semantics, so the generated summaries may fail to capture project-specific characteristics. For example, a software project has its own unique domain knowledge, and in most cases its documentation is written in a relatively consistent style. Under the task settings of GCS, however, models tend to generate monotonous summaries in the manner used by most projects, rather than in the way or style of the target project. Table 1 shows a concrete case, where a robust Transformer-based GCS model [1] is trained on a large-scale multi-project dataset [14]. For the two code snippets in Table 1, the Transformer generates code summaries with the expression style "return true if." These summaries are consistent with the finding of Li et al. [15] that the pattern "returns true if" frequently appears in the dataset provided by LeClair et al. [14]. The problem here is that the Flink project prefers another writing style, "check whether", making the generated summary incoherent with its historical summaries. Meanwhile, we can easily find that the generated summaries do not cover enough meaningful topic words, since a GCS model may have limited domain knowledge of specific projects (Guava and Flink here). These observations suggest that project-specific knowledge should be better distilled to improve code summary quality in PCS.
Table 1: Motivation scenario. Current code summarization models tend to generate summaries in the manner used by most projects (e.g., with the frequent pattern "return true if") rather than in the way of the target project (e.g., with the pattern "checks whether" of project Flink). Meanwhile, directly applying a code summarization model to specific projects may generate semantically poor summaries due to the lack of project-specific domain knowledge. The summaries in this example are generated by a Transformer-based code summarization model [1] trained on a large-scale dataset proposed by LeClair et al. [14].

Guava
  Source Code:
    public boolean contains(...) {
        final Monitor monitor = this.monitor;
        monitor.enter();
        ...
        return q.contains(o);
    }
  Human-Written: returns true if this queue contains the ...
  Transformer:   returns tt true tt if this multimap contains ...

Flink
  Source Code:
    public boolean isEmpty() {
        return size() == 0;
    }
  Human-Written: checks whether the queue is empty has no ...
  Transformer:   returns true if this map contains no key ...
A natural technical design for PCS models is to introduce transfer learning, treating each project as a unique domain. In our preliminary experiment, a classical fine-tuning strategy yields robust performance improvements on projects with numerous off-the-shelf code summaries. However, returning to the practical development scenario, we find that projects are often poorly documented, with insufficient code summaries for fine-tuning. According to our investigation of 300 projects from three prominent open-source organizations (Apache, Google, and Spring), a large number of the projects lack enough historical code summaries for model training. For example, as shown in Table 2, nearly one-third of the projects have fewer than 100 code summaries. Given the maturity of these three organizations, we believe there are many more open-source projects with few or even no existing code summaries. To this end, this paper proposes a novel and essential task, low-resource project-specific code summarization, to tackle a more practical code summarization challenge in the software engineering community.

Table 2: Statistics of the number of open-source projects whose code summary counts are less than 10 or 100. We selected 100 projects each from the open-source organizations Apache, Google, and Spring, and then counted the number of summaries for public methods.

Source   Total   Projects with #Summary < 10   Projects with #Summary < 100
Apache   100     3                             11
Google   100     11                            41
Spring   100     13                            44
Total    300     27                            96

As a pioneering effort on low-resource project-specific code summarization, we then propose a simple yet effective meta-learning-based approach with the following two characteristics.
1. Since one development organization usually has more than one working project, we investigate how to leverage multiple projects to promote transferring project-specific knowledge. Unlike conventional GCS methods that treat data samples from different projects uniformly, our method regards each project as an independent domain to better characterize project features. Specifically, given the efficacy of meta-learning in handling low-resource applications [5], we introduce Model-Agnostic Meta-Learning (MAML) [6] into PCS. Meta transfer learning in our scenario means learning to do transfer learning via fine-tuning multiple projects together. More specifically, we condense shared cross-project knowledge into the weight initialization of neural models, enabling better project-specific domain adaptation.
2. Code summarization models built on modern sequence-to-sequence architectures usually have a large number of parameters, which brings two limitations to PCS. On the one hand, project-specific knowledge is accumulated and updated frequently with code or documentation revisions, hindering the efficiency of iterative model optimization. On the other hand, the limited training data of a specific project can easily cause large models to overfit. Therefore, inspired by the recent advent of prompt learning in the NLP community [16], our meta transfer learning keeps the pre-trained model parameters frozen and only optimizes a sequence of continuous project-specific vectors. These vectors, named the project-specific prefix in this paper, involve only a small number of extra parameters, effectively improving the overall meta transfer learning process.
We curate a PCS dataset consisting of nine diversified real-world projects. The automatic and human evaluation results on this dataset verify the overall effectiveness of our method and the necessity of its individual components and, more importantly, suggest promising research opportunities on project-specific code summarization.

The contributions of this paper are listed as follows:

• We propose low-resource project-specific code summarization, an essential and novel task more consistent with practical development scenarios.
• As a pioneering exploration of low-resource project-specific code summarization, we design Meta Prefix-tuning for COde Summarization (MPCos). MPCos captures project knowledge effectively and efficiently by integrating a project-specific prefix-based fine-tuning mechanism into a meta-learning framework, serving as a solid baseline for future study.
• By looking into the token frequency patterns in our generated summaries, we reveal, to some extent, the internal process of project-specific knowledge learning.
2 PROBLEM FORMULATION
We dene the low-resource project-specic code summarization
problem as follows. Given a target project with limited code-summary
Low-Resources Project-Specific Code Summarization ASE ’22, October 10–14, 2022, Rochester, MI, USA
pairs corpus:
𝐶tgt ={(𝑋𝑖,^
𝑌𝑖)}𝑖=1..𝑁 tgt
, where
𝑁tgt
is the number
of code-summary pairs in the target project, the code summariza-
tion model is supposed learning how to generate correct summaries
for the target project, with the help of the general knowledge of
large-scale multi-project corpus and the cross-domain knowledge
of few accompanying projects. Note that a target project can also
serve as an accompanying one for other projects. Our task setting
in this paper involves nine target projects, which means each one
has eight accompanying projects. The task should care about the
overall performance of all target projects instead of a single one. In
the low-resource situation, the number of code-summary pairs in
target project
𝑁tgt
is usually small. Therefore, following previous
work and real-world development experience, we mainly focus on
two settings where 𝑁tgt =10 and 𝑁tgt =100.
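Purely as a toy illustration of this setting (the project names, pair counts, and placeholders below are hypothetical, not the paper's dataset or code), the nine-project rotation can be organized as follows:

```python
# Toy illustration of the low-resource PCS setting; all names and sizes are hypothetical.
N_TGT = 100                                   # low-resource setting: N_tgt = 10 or 100

# Each project is a list of (code, summary) pairs; placeholders stand in for real data.
projects = {f"project_{i}": [(f"code_{i}_{j}", f"summary_{i}_{j}") for j in range(500)]
            for i in range(9)}

for target_name, pairs in projects.items():
    target_train = pairs[:N_TGT]              # C_tgt: the few pairs available for adaptation
    accompanying = {n: p for n, p in projects.items() if n != target_name}
    # 1) start from a model pre-trained on a large multi-project corpus (general knowledge)
    # 2) meta transfer learning over the eight accompanying projects (cross-domain knowledge)
    # 3) adapt to target_train and evaluate on the rest of the target project
    print(target_name, len(target_train), len(accompanying))
```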
3 METHODOLOGIES
We employ a classical Transformer as the backbone of our method. As shown in Figure 1, MPCos mainly consists of three components: (1) the Code Summarization Module, which generates target summaries based on the encoder-decoder architecture of the Transformer; (2) the Meta-Transfer Learning Module, which leverages multiple projects to learn better initial weights for prefix-tuning; and (3) the Prefix-Tuning Module, which preserves a separate prefix for each project to promote project-specific transfer learning and avoid data cross-contamination. We describe the details of each component in the following subsections.
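As a rough illustration of the meta transfer learning idea behind component (2), learning a shared initialization by fine-tuning on multiple projects, the following first-order MAML-style sketch treats each accompanying project as a task. The `projects`, `support`, `query`, and `loss_fn` names are hypothetical placeholders rather than the paper's implementation, and in MPCos the adapted parameters would be the project-specific prefix rather than the full model.

```python
import copy
import torch

def meta_train(init_module, projects, loss_fn,
               inner_lr=1e-3, meta_lr=1e-4, meta_steps=1000, inner_batches=5):
    """First-order MAML over projects: adapt a copy of the shared initialization to
    each project (inner loop), then update the initialization from the adapted
    copies' query-set gradients (outer loop)."""
    meta_opt = torch.optim.Adam(init_module.parameters(), lr=meta_lr)
    for _ in range(meta_steps):
        meta_opt.zero_grad()
        for project in projects:                           # each project is one task/domain
            learner = copy.deepcopy(init_module)           # start from the shared initialization
            inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
            for batch in project.support(inner_batches):   # inner loop: project-specific fine-tuning
                inner_opt.zero_grad()
                loss_fn(learner, batch).backward()
                inner_opt.step()
            loss_fn(learner, project.query()).backward()   # evaluate the adapted learner
            # First-order approximation: reuse the adapted learner's gradients
            # as the gradient of the shared initialization.
            for p0, p1 in zip(init_module.parameters(), learner.parameters()):
                if p1.grad is not None:
                    p0.grad = p1.grad.clone() if p0.grad is None else p0.grad + p1.grad
        meta_opt.step()
```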
3.1 Code Summarization
In recent years, the Transformer architecture has been widely used in generative tasks and has achieved strong results. Therefore, we use a Transformer encoder-decoder framework as our base model. To process programming languages and natural languages, we first tokenize the input code $X$ into a code token sequence $[x_1; \dots; x_{N_X}]$, where $x_i$ denotes a code token in the original code and $N_X$ denotes the number of tokens; for the target summary $\hat{Y}$, we likewise tokenize it into a token sequence $[\hat{y}_1, \dots, \hat{y}_{N_{\hat{Y}}}]$, where $\hat{y}_i$ denotes a summary token in the target summary and $N_{\hat{Y}}$ denotes the number of tokens.
The base model consists of a Transformer encoder and a Transformer decoder. The Transformer encoder takes the code token sequence as input and generates hidden vectors for the input code:

$$ H = \mathrm{TransformerEncoder}(X) \tag{1} $$

The Transformer encoder consists of stacked Transformer layers. Each layer takes the previous layer's output as input and uses a multi-head attention mechanism to enhance the representations:

$$ \mathrm{MultiHead}(q, v) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_n), \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(q, v) \tag{2} $$
$$ \hat{H}_l = \mathrm{LN}(\mathrm{MultiHead}(H_{l-1}, H_{l-1}) + H_{l-1}) \tag{3} $$
$$ H_l = \mathrm{LN}(\mathrm{FFN}(\hat{H}_l) + \hat{H}_l) \tag{4} $$

where $l$ denotes the $l$-th layer of the Transformer encoder, $n$ denotes the number of attention heads, $\mathrm{Attention}$ denotes the classical attention mechanism proposed by [24], $\mathrm{LN}$ denotes layer normalization, and $\mathrm{FFN}$ denotes a feed-forward network; we use the input code embeddings $\mathrm{emb}(X)$ as the initial state $H_0$.
The Transformer decoder takes $H$ as input and generates the summary in an auto-regressive way. At time step $t$, given the hidden vectors $H$ and the previously generated summary $Y_{<t} = \{y_1, y_2, \dots, y_{t-1}\}$, the decoder produces a hidden vector representing the word to be generated:

$$ S_t = \mathrm{TransformerDecoder}(Y_{<t}, H) \tag{5} $$

The Transformer decoder also consists of several stacked Transformer layers, and the output of the last layer is used to estimate the probability distribution of word $y_t$:

$$ p(y_t \mid X) = \mathrm{softmax}(W_Y S_t) \tag{6} $$

where $W_Y$ is a weight matrix.
We denote the base model as $M_\theta$, where $\theta$ indicates the trainable parameters of the model. Thus, the model can be used to estimate the distribution of the target summary as:

$$ p(y_t \mid X, \theta) = M(Y_{<t}, X, \theta) \tag{7} $$
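To make the base model concrete, the following is a minimal sketch of such a Transformer encoder-decoder summarizer built from PyTorch's standard modules. The hyperparameters and the omission of positional encodings are simplifications for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CodeSummarizer(nn.Module):
    """Minimal Transformer encoder-decoder base model M_theta (Eqs. 1-7)."""
    def __init__(self, code_vocab=30000, sum_vocab=30000, d_model=512,
                 n_heads=8, n_layers=6, d_ff=2048):
        super().__init__()
        self.src_emb = nn.Embedding(code_vocab, d_model)   # emb(X)
        self.tgt_emb = nn.Embedding(sum_vocab, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            dim_feedforward=d_ff, batch_first=True)
        self.out_proj = nn.Linear(d_model, sum_vocab)      # W_Y in Eq. (6)

    def forward(self, code_ids, summary_ids):
        # Causal mask so position t only attends to Y_<t (auto-regressive generation).
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            summary_ids.size(1)).to(code_ids.device)
        # Encoder: H = TransformerEncoder(X); Decoder: S_t = TransformerDecoder(Y_<t, H)
        s = self.transformer(self.src_emb(code_ids), self.tgt_emb(summary_ids),
                             tgt_mask=tgt_mask)
        return self.out_proj(s)  # logits; softmax yields p(y_t | X, theta)
```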
To utilize general code summarization knowledge, we first train $M_\theta$ on a general corpus $C_{pre}$. Here, we use the general code summarization dataset proposed by [14].
Given an input code $X$ and its ground-truth summary $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \dots, \hat{y}_{|\hat{Y}|}\}$ from $C_{pre}$, we optimize the model to minimize the negative log-likelihood (NLL):

$$ \mathcal{L}_{NLL} = -\log p(\hat{Y} \mid X, \theta) = -\frac{1}{|\hat{Y}|} \sum_{t=1}^{|\hat{Y}|} \log \Pr(y_t = \hat{y}_t \mid X, \theta) \tag{8} $$
3.2 Prex Tuning
To prevent over-tting when training large pre-trained model on
low-resource scenario, we propose restricting the number of meta-
trainable parameters and layers. In particular, we apply prex tun-
ing to reduce trainable parameters.
The prex-tuning is a prompting mechanism prepending to
Transformer model by inserting a prex vector into each layer
of the transformer. Taking encoder as example, the prex-tuning
Transformer layer can be expressed as:
𝑃𝑙=𝑀𝐿𝑃𝑙(𝑒𝑚𝑏𝑙(𝑝𝑟𝑜 𝑗𝑒𝑐𝑡 )) (9)
^
𝐻𝑙=𝐿𝑁 (𝑀𝑢𝑙𝑡𝑖𝐻𝑒𝑎𝑑 (𝐻𝑙1,[𝑃𝑙;𝐻𝑙1]) + 𝐻𝑙1)(10)
𝐻𝑙=𝐿𝑁 (𝐹 𝐹 𝑁 (^
𝐻𝑙) + ^
𝐻𝑙)(11)
where 𝑃𝑙represents the prexed vectors of corresponding project
on the
𝑙𝑡
layer,
𝑒𝑚𝑏𝑙
represents the projection operation for layer
𝑙
from source project to its corresponding embedding based on the
embedding matrix
𝑀𝑙
, and
𝑀𝐿𝑃𝑙
represents the classical Multilayer
Perceptron network for layer
𝑙
. Following [
16
], we update the pa-
rameters of
𝑀𝐿𝑃𝑙
and the embedding matrix
𝑀𝑙
during training.
Once training is complete, these parameters can be dropped, and
only the prexed vectors
𝑃𝑙
needs to be saved. The illustration of
the proposed prex tuning model is shown in Figure 1.