
Table 1: Motivation scenario. Current code summarization models tend to generate summaries in the manner used by most projects (e.g., with the frequent pattern of "return true if") rather than in the style of the target project (e.g., the "checks whether" pattern of project Flink). Meanwhile, directly applying a code summarization model to specific projects may generate semantically poor summaries due to the lack of project-specific domain knowledge. The summaries in this example are generated by a Transformer-based code summarization model [1] trained on a large-scale dataset proposed by LeClair et al. [14].
Guava
  Source Code:
      public boolean contains(...) {
          final Monitor monitor = this.monitor;
          monitor.enter();
          ...
          return q.contains(o);
      }
  Human-Written:  returns true if this queue contains the ...
  Transformer:    returns tt true tt if this multimap contains ...

Flink
  Source Code:
      public boolean isEmpty() {
          return size() == 0;
      }
  Human-Written:  checks whether the queue is empty has no ...
  Transformer:    returns true if this map contains no key ...
A natural technical design for PCS models is to introduce transfer learning, treating each project as a unique domain. In our preliminary experiment, a classical fine-tuning strategy yields robust performance improvements on projects with numerous off-the-shelf code summaries. However, returning to the practical development scenario, we find that projects are often poorly documented, with insufficient code summaries for fine-tuning. According to our investigation of 300 projects from three prominent open-source organizations (Apache, Google, and Spring), a large number of projects lack sufficient historical code summaries for model training. For example, as shown in Table 2, nearly one-third of the projects have fewer than 100 code summaries. Considering the authority of these three organizations, we believe that many more open-source projects have few or even no existing code summaries. To this end, this paper proposes a novel and essential task, low-resource project-specific code summarization, to tackle a more practical code summarization challenge in the software engineering community.
As a pioneering effort on low-resource project-specific code summarization, we propose a simple yet effective meta-learning-based approach with the following two characteristics.
1. Since one development organization usually has more than one active project, we investigate how to leverage multiple projects to facilitate the transfer of project-specific knowledge. Unlike conventional GCS methods that treat data samples from different projects uniformly, our method regards each project as an independent domain to better characterize project features.
Table 2: Statistics on the number of open-source projects whose code summary counts are less than 10 or 100. We selected 100 projects each from the open-source organizations Apache, Google, and Spring, and then counted the number of summaries for public methods.

Source   Total projects   Projects with <10 summaries   Projects with <100 summaries
Apache   100              3                             11
Google   100              11                            41
Spring   100              13                            44
Total    300              27                            96
Specifically, due to the efficacy of meta-learning in handling low-resource applications [5], we introduce Model-Agnostic Meta-Learning (MAML) [6] into PCS. Meta transfer learning in our scenario means learning to do transfer learning by fine-tuning on multiple projects together. More specifically, we condense shared cross-project knowledge into the weight initialization of neural models, enabling better project-specific domain adaptation (see the first sketch following characteristic 2).
2. Code summarization models built on modern sequence-to-sequence architectures usually have large-scale parameters, which brings two limitations to PCS. On the one hand, project-specific knowledge accumulates and is updated frequently with code or documentation revisions, hindering the efficiency of iterative model optimization. On the other hand, limited training data for a specific project easily makes big models overfit. Therefore, inspired by the recent advent of prompt learning in the NLP community [16], in our meta transfer learning we keep the pre-trained model parameters frozen and only optimize a sequence of continuous project-specific vectors. These vectors, named the project-specific prefix in our paper, involve only a small number of extra parameters, effectively improving the overall meta transfer learning process (see the second sketch below).
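
To make characteristic 1 concrete, below is a minimal, first-order MAML-style sketch of meta transfer learning over projects in PyTorch. The linear stand-in model, random batches, hyper-parameters, and the first-order gradient approximation are all illustrative assumptions rather than the actual MPCos implementation; the sketch only shows how each project is treated as a separate task whose adaptation feeds back into a shared weight initialization.

import copy
import torch
import torch.nn as nn

def inner_update(model, support, loss_fn, inner_lr, inner_steps):
    # Fine-tune a deep copy of the shared model on one project's support batch.
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    x, y = support
    for _ in range(inner_steps):
        opt.zero_grad()
        loss_fn(adapted(x), y).backward()
        opt.step()
    return adapted

def meta_train(model, projects, loss_fn, meta_lr=1e-3, inner_lr=1e-2,
               inner_steps=3, meta_iters=100):
    # Condense shared cross-project knowledge into the weight initialization.
    meta_opt = torch.optim.Adam(model.parameters(), lr=meta_lr)
    for _ in range(meta_iters):
        meta_opt.zero_grad()
        for support, query in projects:        # each project = one domain/task
            adapted = inner_update(model, support, loss_fn, inner_lr, inner_steps)
            x_q, y_q = query
            query_loss = loss_fn(adapted(x_q), y_q)
            # First-order approximation: gradients of the adapted copy are
            # accumulated directly onto the shared initialization instead of
            # back-propagating through the inner loop.
            grads = torch.autograd.grad(query_loss, list(adapted.parameters()))
            for p, g in zip(model.parameters(), grads):
                p.grad = g.detach() if p.grad is None else p.grad + g.detach()
        meta_opt.step()
    return model

# Toy usage: a linear layer stands in for the summarization model, and random
# tensors stand in for (code, summary) batches of two hypothetical projects.
model = nn.Linear(8, 4)
loss_fn = nn.MSELoss()
make_batch = lambda: (torch.randn(16, 8), torch.randn(16, 4))
projects = [(make_batch(), make_batch()), (make_batch(), make_batch())]
meta_train(model, projects, loss_fn, meta_iters=5)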
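
Similarly, here is a minimal sketch of the project-specific prefix in characteristic 2: a short sequence of continuous vectors is prepended to the embedded input while every pre-trained parameter stays frozen, so only the prefix carries project-specific knowledge. The stand-in transformer backbone, dimensions, and placeholder objective are assumptions for illustration, not the paper's model; since characteristic 2 freezes the pre-trained parameters, the quantity adapted (and meta-initialized) in MPCos would be such a prefix rather than the full model.

import torch
import torch.nn as nn

class PrefixTunedEncoder(nn.Module):
    def __init__(self, pretrained: nn.Module, d_model: int, prefix_len: int = 10):
        super().__init__()
        self.backbone = pretrained
        # Freeze every pre-trained parameter; only the prefix is trainable.
        for p in self.backbone.parameters():
            p.requires_grad = False
        # A sequence of continuous, project-specific vectors: [prefix_len, d_model].
        self.prefix = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)

    def forward(self, token_embeddings):           # [batch, seq_len, d_model]
        batch_size = token_embeddings.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch_size, -1, -1)
        # Prepend the learnable prefix to the embedded code tokens.
        return self.backbone(torch.cat([prefix, token_embeddings], dim=1))

# Toy usage: a single transformer encoder layer plays the frozen pre-trained
# model; only `model.prefix` receives gradient updates during adaptation.
d_model = 32
backbone = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
model = PrefixTunedEncoder(backbone, d_model=d_model, prefix_len=8)
optimizer = torch.optim.Adam([model.prefix], lr=1e-3)

x = torch.randn(2, 20, d_model)                   # embedded code tokens
loss = model(x).pow(2).mean()                     # placeholder objective
loss.backward()
optimizer.step()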
We curate a PCS dataset consisting of nine diversified real-world projects. The automatic and human evaluation results on this dataset verify the overall effectiveness of our method and the necessity of its individual components, and, more importantly, suggest promising research opportunities on project-specific code summarization.
The contributions of this paper are listed as follows:
• We propose low-resource project-specific code summarization, an essential and novel task that is more consistent with practical development scenarios.
• As a pioneering exploration of low-resource project-specific code summarization, we design Meta Prefix-tuning for COde Summarization (MPCos). MPCos captures project knowledge effectively and efficiently by integrating a project-specific prefix-based fine-tuning mechanism into a meta-learning framework, serving as a solid baseline for future study.
• By looking into the token frequency patterns in the generated summaries, we reveal, to some extent, the internal process of project-specific knowledge learning.
2 PROBLEM FORMULATION
We define the low-resource project-specific code summarization problem as follows. Given a target project with limited code-summary