Benchmarking Language Models for Code Syntax Understanding
Da Shen1, Xinyun Chen2†, Chenguang Wang3†, Koushik Sen4, Dawn Song4
1University of Maryland, College Park, 2Google Research, Brain Team
3Washington University in St. Louis, 4University of California, Berkeley
dashen@terpmail.umd.edu, xinyunchen@google.com, chenguangwang@wustl.edu,
{ksen,dawnsong}@cs.berkeley.edu
Abstract
Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding, which represent the input as a token sequence without explicitly modeling its structure. Some prior works show that pre-trained language models can capture the syntactic rules of natural languages without fine-tuning on syntax understanding tasks. However, there is so far limited understanding of how well pre-trained models understand code structure. In this work, we perform the first thorough benchmarking of state-of-the-art pre-trained models for identifying the syntactic structures of programs. Specifically, we introduce CodeSyntax, a large-scale dataset of programs annotated with the syntactic relationships in their corresponding abstract syntax trees. Our key observation is that existing language models pre-trained on code still lack an understanding of code syntax. In fact, these pre-trained programming language models fail to match the performance of simple baselines based on positional offsets and keywords. We also present a natural language benchmark to highlight the differences between natural languages and programming languages in terms of syntactic structure understanding. Our findings point out key limitations of existing pre-training methods for programming languages, and suggest the importance of modeling code syntactic structures.[1]
1 Introduction
Large-scale pre-training of language models has become the de-facto paradigm for a variety of natural language processing tasks. Furthermore, recent studies show that models pre-trained on a massive amount of code also achieve competitive performance on many tasks, e.g., code generation and
† Corresponding authors.
[1] Our code and dataset are available at https://github.com/dashends/CodeSyntax.
Figure 1: Examples of syntactic relations for (a) natural languages (NL) and (b) programming languages (PL). Each relation is represented by an arrow. The relations in PL represent the syntax of code in a way similar to those in NL. Panel (a) shows the sentence "There were many pioneer PC contributors." with dependency labels such as root, nsubj, amod, nn, and expl; panel (b) shows the statement result = object.function(argument) with the AST relations Assign, Attribute, and Call.
Figure 2: A preview of the model performance comparison on NL and PL syntax understanding tasks. Pre-trained models capture NL syntax relatively well, but perform worse in understanding PL syntax. The Offset baseline picks the token using a fixed positional offset. We use BERT-large and RoBERTa-base configurations (corresponding to the configurations of CuBERT and CodeBERT). The plot shows top-1 scores. See Tables 3 and 4 for the full results.
code classification. These tasks are closely related to natural language (NL) tasks in their problem formulation. Nowadays, the common practice for solving these coding tasks is to utilize the language model architectures and training schemes that are originally designed for NL. The design principle of these neural language models is significantly different from the classic rule-based program generation
systems. Specifically, neural language models take the program as a token sequence, while classic program generation systems utilize the language grammar and code structure. Despite the advanced performance of pre-trained language models on code understanding tasks, what these models have learned from the code corpus remains unclear.
In this work, we investigate whether large-scale pre-training is all we need for code representation learning. In particular, we conduct the first systematic study to analyze how pre-trained language models understand the syntactic structures of programs. To this end, we introduce CodeSyntax, a large-scale benchmark consisting of programs annotated with the syntactic relationships between different tokens. The ground-truth syntactic relationships are extracted from edges in the abstract syntax trees (ASTs) of the programs. Figure 1 shows some examples. These syntactic relations are functionally similar to dependency relations in NL, where prior work has demonstrated that the attention heads of pre-trained language models can help to identify NL relation types (Clark et al., 2019; Raganato et al., 2018). To measure how well pre-trained language models capture code syntactic structures, we adopt this approach for the PL domain. We focus on investigating the zero-shot capability of existing pre-training methods in our experiments, and we evaluate these pre-trained models without fine-tuning them on our benchmark.
We evaluate the state-of-the-art pre-trained language models for code representation learning, including CuBERT (Kanade et al., 2020) and CodeBERT (Feng et al., 2020). A common characteristic of these models is that they share the same Transformer-based architectural design as NL models (Vaswani et al., 2017; Devlin et al., 2019). This allows us to directly compare their performance in capturing syntactic structure. We present a preview of our key results in Figure 2. Our main observation is that pre-training is insufficient for learning the syntactic relations in code. First, we find that models pre-trained on code do not always outperform models pre-trained on an NL corpus alone. Surprisingly, RoBERTa, which shares the same model architecture as CodeBERT but is not trained on any code, achieves better performance than CodeBERT, which is trained on both text and code corpora. This indicates that pre-training on programs as token sequences does not help learn the syntactic relations. In contrast, even without supervision from dependency relations, pre-training still enables language models to understand NL syntax to some extent.
Moreover, for code syntax understanding, the pre-trained models even perform worse than simple baselines that pick tokens at a fixed offset. For example, always selecting the (p+2)-th token as the p-th token's dependent yields higher accuracy than any attention head for several relation types. On the other hand, the same model architectures pre-trained on text corpora achieve decent accuracy in identifying dependency relations in the NL domain, where the performance of the same simple baselines is far behind.
Our analysis reveals several key differences between NL and PL that lead to different syntax understanding capabilities of pre-trained models. First, programs are more structured than NL sentences. Programs usually contain hierarchical structures representing long-term dependencies between code tokens. Consequently, a large number of syntactic relation types connect distant tokens, which can be difficult for attention heads to recognize. In contrast, the dependency relations in NL sentences mostly connect nearby token pairs, and in this case the attention heads are more capable of identifying the correct relations. Meanwhile, language models are good at recognizing keyword-based relations, such as picking the corresponding else keyword for an if token. Interestingly, we find that the inclusion of tokens such as newlines and semicolons notably affects performance in the code domain.
Our findings suggest that existing pre-trained models perform quite differently in the PL and NL domains in terms of their ability to understand syntax. Thus, directly applying training paradigms developed for NL could be suboptimal for program learning, and we consider designing better approaches to model the code structure as future work.
2 CodeSyntax: Benchmarking Code Syntax Understanding
We construct the CodeSyntax benchmark to evaluate the performance of language models on code syntax understanding. We focus on the Python and Java languages, on which the publicly released model checkpoints of both CuBERT (Kanade et al., 2020) and CodeBERT (Feng et al., 2020) are pre-trained. We obtain the code samples from CodeSearchNet (Husain et al., 2019), a large-scale dataset consisting of code in different programming languages.
Assign: target → value (Python: 78,482; Java: 13,384)
  Assigning a value to a target variable.
  Python: target = 10
  Java:   int target = 10;

Call: func → args (Python: 110,949; Java: 50,890)
  Calling a function with some arguments.
  Python: function(arg)
  Java:   function(arg);

For: for → body (Python: 8,704; Java: 1,864)
  A for loop repeatedly executes the body block for some iterations.
  Python: for target in iter:
              body
  Java:   for (initializers; test; updaters) {
              body;
          }

If: if → else (Python: 11,024; Java: 5,038)
  An if statement conditionally executes a body based upon some criteria. The dependent is the else keyword.
  Python: if condition:
              body1
          else:
              body2
  Java:   if (condition) {
              body1;
          } else {
              body2;
          }

If: if → body (Python: 34,250; Java: 22,392)
  An if statement. The dependent is the body block.
  (Same code examples as If: if → else.)

If: body → orelse (Python: 11,024; Java: 4,976)
  An if statement. The head is the body block and the dependent is the body of the else block.
  (Same code examples as If: if → else.)

While: test → body (Python: 743; Java: 975)
  The while loop repeatedly executes the body block as long as the specified condition is true.
  Python: while condition:
              body
  Java:   while (condition) {
              body;
          }

Table 1: Dataset statistics of selected relation types in CodeSyntax. For each relation type, we highlight the head and dependent nodes in the examples, with the head in blue and the dependent in red. We defer the full statistics of all relation types to Table 8 in the appendix.
Its training set is also part of the pre-training data of CodeBERT, so we remove the data samples that are included in the pre-training data of either CuBERT or CodeBERT. Thus, none of the programs in CodeSyntax has been seen by CuBERT or CodeBERT in the pre-training phase.
In total, CodeSyntax contains 18,701 code samples annotated with 1,342,050 relation edges in 43 relation types for Python, and 13,711 code samples annotated with 864,411 relation edges in 39 relation types for Java. Each code sample is an entire function consisting of multiple statements, which is analogous to a paragraph in NL. Each relation corresponds to an edge in the program AST; specifically, we utilize the Python ast module (Foundation, 2021) and the Java org.eclipse.jdt.core.dom.ASTParser class (Contributors, 2014) to parse a code sample into an AST. We present some examples of relation types in Table 1, and we defer the description of all relation types to Table 8 in the appendix. More details about relation extraction are discussed in Appendix A. Note that we can easily extend the dataset to cover more languages, since the workflow for extracting relations is automated and AST parsers are available for most popular programming languages.
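To make the extraction workflow concrete, below is a minimal sketch, not the authors' actual pipeline, of how an edge such as Assign: target → value can be recovered with the Python ast module. The function name and the (line, column) encoding of node positions are our own illustrative choices.

```python
import ast

def extract_assign_edges(source: str):
    """Illustrative extraction of Assign: target -> value edges.
    Each edge is a pair of (line, column) positions for the head
    (the assignment target) and the dependent (the assigned value)."""
    tree = ast.parse(source)
    edges = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and node.targets:
            head = node.targets[0]      # head node: the target
            dependent = node.value      # dependent node: the value
            edges.append(((head.lineno, head.col_offset),
                          (dependent.lineno, dependent.col_offset)))
    return edges

print(extract_assign_edges("target = 10\nresult = obj.function(argument)"))
# -> [((1, 0), (1, 9)), ((2, 0), (2, 9))]
```

The benchmark itself maps such AST node positions onto code token indices (relation extraction details are in Appendix A), and the Java edges come from an analogous traversal with org.eclipse.jdt.core.dom.ASTParser.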
We observe several characteristics of the relations in CodeSyntax. First, the keywords in PL play an important role in recognizing the code structure. Specifically, some relation types have fixed keywords as the edge nodes, such as the If: if → else relation. Meanwhile, compared to the dependency relations in NL, the relation edges in the program AST tend to connect nodes that are much farther away from each other. As shown in Figure 3, the average offset between head and dependent nodes is no more than 10 for dependency relations in NL, while the average offset for a relation type in code can be more than 100 tokens. Specifically, in CodeSyntax, there are 22 near dependency types whose average offsets are less than 10, and 12 far dependency types whose average offsets are above 10.

Figure 3: Offset distribution of relation types in (a) CodeSyntax and (b) the NL corpus. The x-axis is the average positional offset between heads and dependents for each relation; the y-axis is the number of relation types with that average offset value. See Section 3 for more details on the NL corpus.
3 Evaluation Setup
Do pre-trained language models capture the code structure without direct supervision of the syntactic information? To investigate this question, we evaluate several pre-trained language models without fine-tuning, and compare their performance in understanding the syntax of NL and PL.
Natural language benchmark.
To compare the performance on CodeSyntax to NL syntax understanding, we construct an NL benchmark that includes English and German. Specifically, we use the English News Text Treebank: Penn Treebank Revised (Bies et al., 2015) labeled with Stanford Dependencies (de Marneffe and Manning, 2008a,b), and the German Hamburg Dependency Treebank (Foth et al., 2014) labeled with Universal Dependencies (de Marneffe et al., 2021). In total, the English dataset has 48,883 sentences, 43 relation types, and 1,147,526 relation edges; the German dataset has 18,459 sentences, 35 relation types, and 307,791 relation edges.
Attention probing approach.
Some prior works demonstrate that a Transformer architecture (Vaswani et al., 2017) pre-trained on a text corpus, such as BERT (Devlin et al., 2019), contains attention heads that specialize in certain dependency relations in NL (Raganato et al., 2018; Clark et al., 2019). Specifically, in the Transformer architecture, the vector e_i of each input token is transformed into the query and key vectors q_i and k_i via linear transformations, and the transformations vary among different attention heads. For the i-th token, the attention weight assigned to the j-th token is

$$\alpha_{i,j} = \frac{\exp(q_i^\top k_j)}{\sum_l \exp(q_i^\top k_l)}.$$

The attention weight indicates how important the j-th token is with respect to the i-th token. Typically, different attention heads learn different weights between input tokens. Therefore, to measure the correctness of recognizing a relation type r, for each edge <h, t, r> in the program AST, where h is the head node and t is the dependent node, we enumerate all attention heads and compute the attention weight α_{h,t}. If an attention head tends to assign high attention weights to the token pairs belonging to relation type r, we consider the relation type to be captured. We defer more implementation details of attention map extraction to Appendix B.
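As a concrete illustration of this probing procedure, the sketch below scores one (head, dependent) token pair against every attention head of a HuggingFace-style checkpoint. It is a simplified approximation of the setup described above: microsoft/codebert-base is assumed to be the public CodeBERT checkpoint, and the token indices are illustrative rather than taken from the benchmark's alignment between code tokens and subwords.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a pre-trained code model and expose its attention maps.
model_name = "microsoft/codebert-base"          # assumed public CodeBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

code = "result = obj.function(argument)"
inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    # Tuple with one tensor per layer, each of shape (batch, heads, seq, seq).
    attentions = model(**inputs).attentions

def heads_capturing_edge(attentions, h_idx, t_idx, k=1):
    """Return (layer, head) pairs whose top-k attention from the head token
    at position h_idx includes the dependent token at position t_idx."""
    hits = []
    for layer, att in enumerate(attentions):
        topk = att[0, :, h_idx].topk(k, dim=-1).indices   # (heads, k)
        for head in range(topk.size(0)):
            if t_idx in topk[head].tolist():
                hits.append((layer, head))
    return hits

# Illustrative probe of a single edge <h, t, r>, e.g. Assign: target -> value.
print(heads_capturing_edge(attentions, h_idx=1, t_idx=3, k=3))
```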
Metrics.
We use the unlabeled attachment score (UAS) to measure syntax understanding performance, and we consider top-k scores with different values of k. To compute top-k scores for language models, for each attention head, given the head token h in a relation edge <h, t, r>, we compute the attention weights over all tokens in the input code, and we consider the prediction to be correct if the dependent token t is among the k tokens with the highest attention weights. For each relation, we select the best-performing attention head and use its score as the model's score for that relation. We calculate a model's average score over all relations as the final score of the model.

In NL dependency parsing, the dependent node t usually corresponds to a single word. However, in PL, the dependent can be a block that contains multiple code tokens. For example, in the If: if → body relation, the head is the keyword if, while the dependent is the entire body block. Therefore, we measure three metrics. First-token and last-token metrics: the prediction is deemed correct if it successfully predicts the first or last token of the dependent block, respectively. Any-token metric: the prediction is considered correct if it predicts any token within the dependent block. While these metrics are not perfect and any single metric may be incomplete, we observe that our findings generally hold for all three metrics we evaluated. Note that the first-token metric is stricter than the any-token metric by design. Unless otherwise specified, we report top-k scores using the first-token metric by default.
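To make the three block-level metrics concrete, here is a small sketch under the assumption that a dependent block is represented by an inclusive range of token indices and a top-k prediction is a list of token indices; the function and variable names are illustrative, not taken from the benchmark code.

```python
def block_metrics(predicted_topk, block_start, block_end):
    """Evaluate a top-k prediction against a dependent block spanning
    token indices block_start..block_end (inclusive).
    Returns (first_token_hit, last_token_hit, any_token_hit)."""
    preds = set(predicted_topk)
    first_hit = block_start in preds
    last_hit = block_end in preds
    any_hit = any(block_start <= p <= block_end for p in preds)
    return first_hit, last_hit, any_hit

# e.g., an If: if -> body edge whose body block covers tokens 12..20,
# probed with an attention head whose top-3 predictions are tokens 5, 12, 30.
print(block_metrics([5, 12, 30], block_start=12, block_end=20))
# -> (True, False, True)
```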
Model architectures.
Table 2 summarizes the models evaluated in this work. For language models over code, we consider CuBERT (Kanade et al., 2020) and CodeBERT (Feng et al., 2020), and we evaluate their released pre-trained checkpoints. Both of them are based on architectures initially designed for NL. Specifically, CuBERT utilizes the BERT (Devlin et al., 2019) architecture, and CodeBERT utilizes the RoBERTa (Liu et al., 2019) architecture. For NL models, we also evaluate multilingual variants of BERT and RoBERTa on the German dataset, i.e., Multilingual BERT (Pires et al., 2019) and XLM-RoBERTa (Conneau et al., 2020). Both code language models are cased, so we also evaluate the cased versions of the NL models.
Programming Languages | Natural Languages
CuBERT                | BERT, Multilingual BERT
CodeBERT              | RoBERTa, XLM-RoBERTa

Table 2: Model architectures evaluated on the PL and NL benchmarks. Models in the same row share the same architecture, but are pre-trained on different corpora.
Baselines.
To examine how well the attention heads perform, we compare against a simple offset baseline and a simple keyword baseline. The offset baseline with an offset value of i always selects the token i positions after the input token as its prediction when i > 0, and the token i positions before the input token when i < 0. The keyword baseline with a keyword key always predicts the next occurrence of the key token. In our experiments, we evaluate offset baselines with every offset value between 0 and 512 for PL, and between -512 and 512 for NL. We use all Python and Java keywords for the keyword baselines on the Python and Java datasets respectively, including tokens such as if, for, in, etc. To evaluate top-k scores for baselines where k ≥ 2, we combine k simple baselines with different offset (keyword) values to give k predictions. To select the k offset (keyword) values, we repeatedly and greedily include the next value that yields the highest performance increase for the relation type under consideration.
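The following is a minimal sketch of the offset baseline and its greedy top-k combination over a toy list of edges, assuming each edge is already reduced to a pair of integer token positions (head, dependent); the function names and toy data are our own illustrative rendering of the procedure described above.

```python
def offset_baseline_score(edges, offset):
    """Top-1 accuracy of the offset baseline: predict head position + offset."""
    return sum(1 for h, t in edges if h + offset == t) / max(len(edges), 1)

def greedy_topk_offsets(edges, candidate_offsets, k):
    """Greedily pick k offsets, each time adding the offset that covers the
    most not-yet-covered edges, mirroring the top-k baseline combination."""
    chosen, covered = [], set()
    for _ in range(k):
        best_offset, best_gain = None, -1
        for off in candidate_offsets:
            gain = sum(1 for i, (h, t) in enumerate(edges)
                       if i not in covered and h + off == t)
            if gain > best_gain:
                best_offset, best_gain = off, gain
        chosen.append(best_offset)
        covered |= {i for i, (h, t) in enumerate(edges) if h + best_offset == t}
    return chosen, len(covered) / max(len(edges), 1)

# Toy edges as (head token index, dependent token index) pairs.
edges = [(0, 2), (5, 7), (10, 11), (20, 22)]
print(offset_baseline_score(edges, offset=2))            # 0.75
print(greedy_topk_offsets(edges, range(0, 6), k=2))      # ([2, 1], 1.0)
```

A keyword baseline can be implemented analogously by scanning forward from the head token for the next occurrence of a given keyword instead of applying a fixed offset.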
4 Experiments
In this section, we present the results of pre-trained language models for both PL and NL syntax understanding tasks, and discuss the key observations that distinguish PL from NL.
4.1 Main Results
Language Model              Top-k Score
                            k=1    k=3    k=10   k=20
Python
  Offset                    43.6   63.7   87.3   94.9
  Keyword                   15.7   21.9   23.6   23.8
  Combined                  49.4   69.7   90.1   96.3
  CuBERT                    39.2   58.4   81.3   91.4
  CodeBERT                  33.1   51.8   78.6   89.2
  RoBERTa                   34.5   56.9   82.5   91.3
  Diff (Model - Baseline)  -10.2  -11.3   -8.8   -4.9
Java
  Offset                    52.7   71.5   87.1   94.3
  Keyword                   22.4   27.3   30.2   30.6
  Combined                  60.4   77.2   90.0   96.1
  CuBERT                    39.7   59.8   80.0   90.2
  CodeBERT                  36.3   57.1   78.3   88.8
  RoBERTa                   34.7   57.8   80.3   90.5
  Diff (Model - Baseline)  -20.7  -17.4  -10.0   -5.9

Table 3: Top-k scores for code syntax understanding. For each language, the upper block contains the results of the baselines, including: (1) Offset: always picking the token at a fixed positional offset; (2) Keyword: matching a fixed keyword nearby; and (3) Combined: combining the best options from Offset and Keyword. Score differences are calculated as the best attention score minus the best baseline score for each language, where a positive value would indicate that the language model surpasses the baseline.
We present our main results comparing syntactic relation understanding performance on PL and NL in Tables 3 and 4, respectively. First, on CodeSyntax, language models generally perform worse than the simple offset baseline and its combination with the keyword baseline, which indicates