Benchmarking Language Models for Code Syntax Understanding
Da Shen1, Xinyun Chen2†, Chenguang Wang3†, Koushik Sen4, Dawn Song4
1University of Maryland, College Park, 2Google Research, Brain Team
3Washington University in St. Louis, 4University of California, Berkeley
dashen@terpmail.umd.edu, xinyunchen@google.com, chenguangwang@wustl.edu,
{ksen,dawnsong}@cs.berkeley.edu
Abstract
Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding, which represent the input as a token sequence without explicitly modeling its structure. Some prior works show that pre-trained language models can capture the syntactic rules of natural languages without fine-tuning on syntax understanding tasks. However, there is so far limited understanding of how well pre-trained models understand code structure. In this work, we perform the first thorough benchmarking of state-of-the-art pre-trained models for identifying the syntactic structures of programs. Specifically, we introduce CodeSyntax, a large-scale dataset of programs annotated with the syntactic relationships in their corresponding abstract syntax trees. Our key observation is that existing language models pre-trained on code still lack an understanding of code syntax. In fact, these pre-trained programming language models fail to match the performance of simple baselines based on positional offsets and keywords. We also present a natural language benchmark to highlight the differences between natural languages and programming languages in terms of syntactic structure understanding. Our findings point out key limitations of existing pre-training methods for programming languages, and suggest the importance of modeling code syntactic structures.[1]
1 Introduction
Large-scale pre-training of language models has become the de-facto paradigm for a variety of natural language processing tasks. Furthermore, recent studies show that models pre-trained on a massive amount of code also achieve competitive performance on many tasks, e.g., code generation and
† Corresponding authors.
[1] Our code and dataset are available at https://github.com/dashends/CodeSyntax.
Figure 1: Examples of syntactic relations for (a) natural languages (NL) and (b) programming languages (PL). Each relation is represented by an arrow. The relations in PL represent the syntax of code in a way similar to those in NL. Panel (a) shows the sentence "There were many pioneer PC contributors." with dependency labels such as root, nsubj, amod, nn, and expl; panel (b) shows the statement result = object.function(argument) with the AST relations Assign, Attribute, and Call.
Figure 2: A preview of the model performance comparison on NL and PL syntax understanding tasks. Pre-trained models capture NL syntax relatively well, but perform worse in understanding PL syntax. The Offset baseline picks the token using a fixed positional offset. We use BERT-large and RoBERTa-base configurations (corresponding to the configurations of CuBERT and CodeBERT). The plot shows top-1 scores. See Tables 3 and 4 for the full results.
code classification. These tasks are closely related to natural language (NL) tasks in their problem formulation. Nowadays, the common practice for solving these coding tasks is to utilize the language model architectures and training schemes that are originally designed for NL. The design principle of these neural language models is significantly different from the classic rule-based program generation
systems. Specifically, neural language models take the program as a token sequence, while classic program generation systems utilize the language grammar and code structure. Despite the advanced performance of pre-trained language models on code understanding tasks, what these models have learned from the code corpus remains unclear.
In this work, we investigate whether large-scale pre-training is all we need for code representation learning. In particular, we conduct the first systematic study to analyze how pre-trained language models understand the syntactic structures of programs. To this end, we introduce CodeSyntax, a large-scale benchmark consisting of programs annotated with the syntactic relationships between different tokens. The ground-truth syntactic relationships are extracted from edges in the abstract syntax trees (ASTs) of the programs. Figure 1 shows some examples. These syntactic relations are functionally similar to dependency relations in NL, where prior work has demonstrated that the attention heads of pre-trained language models can help to identify NL relation types (Clark et al., 2019; Raganato et al., 2018). To measure how well pre-trained language models capture code syntactic structures, we adopt this approach for the PL domain. We focus on investigating the zero-shot capability of existing pre-training methods in our experiments, and we evaluate these pre-trained models without fine-tuning them on our benchmark.
We evaluate the state-of-the-art pre-trained language models for code representation learning, including CuBERT (Kanade et al., 2020) and CodeBERT (Feng et al., 2020). A common characteristic of these models is that they share the same Transformer-based architectural design as NL models (Vaswani et al., 2017; Devlin et al., 2019). This allows us to directly compare their performance in capturing syntactic structure. We present a preview of our key results in Figure 2. Our main observation is that pre-training is insufficient for learning the syntactic relations in code. First, we find that models pre-trained on code do not always outperform models pre-trained on an NL corpus alone. Surprisingly, RoBERTa, which shares the same model architecture as CodeBERT but is not trained on any code, achieves better performance than CodeBERT, which is trained on both text and code corpora. This indicates that pre-training on programs as token sequences does not help learn the syntactic relations. In contrast, even without supervision from dependency relations, pre-training still enables language models to understand NL syntax to some extent.
Moreover, for code syntax understanding, the pre-trained models even perform worse than simple baselines that pick tokens at a fixed offset. For example, always selecting the (p+2)-th token as the p-th token's dependent yields higher accuracy than any attention head for several relation types. On the other hand, the same model architectures pre-trained on text corpora achieve decent accuracy in identifying dependency relations in the NL domain, where the performance of the same simple baselines is far behind.
Our analysis reveals several key differences between NL and PL that lead to different syntax understanding capabilities of pre-trained models. First, programs are more structured than NL sentences. Programs usually contain hierarchical structures representing long-term dependencies between code tokens. Consequently, a large number of syntactic relation types connect distant tokens, which can be difficult for attention heads to recognize. In contrast, the dependency relations in NL sentences mostly connect nearby token pairs, and in this case the attention heads are more capable of identifying the correct relations. Meanwhile, language models are good at recognizing keyword-based relations, such as picking the corresponding else keyword for an if token. Interestingly, we find that the inclusion of tokens such as newlines and semicolons notably affects performance in the code domain.
Our findings suggest that existing pre-trained models perform quite differently in the PL and NL domains in terms of their ability to understand syntax. Thus, directly applying training paradigms developed for NL could be suboptimal for program learning, and we consider designing better approaches to model the code structure as future work.
2 CodeSyntax: Benchmarking Code Syntax Understanding
We construct the CodeSyntax benchmark to evaluate the performance of language models on code syntax understanding. We focus on the Python and Java languages, on which the publicly released model checkpoints of both CuBERT (Kanade et al., 2020) and CodeBERT (Feng et al., 2020) are pre-trained. We obtain the code samples from CodeSearchNet (Husain et al., 2019), a large-scale dataset consisting of code in different programming languages.
Assign: target → value (Python: 78,482; Java: 13,384)
  Assigning a value to a target variable.
  Python: target = 10
  Java:   int target = 10;

Call: func → args (Python: 110,949; Java: 50,890)
  Calling a function with some arguments.
  Python: function(arg)
  Java:   function(arg);

For: for → body (Python: 8,704; Java: 1,864)
  A for loop repeatedly executes the body block for some iterations.
  Python: for target in iter:
              body
  Java:   for (initializers; test; updaters) {
              body;
          }

If: if → else (Python: 11,024; Java: 5,038)
  An if statement conditionally executes a body based upon some criteria. The dependent is the else keyword.
  Python: if condition:
              body1
          else:
              body2
  Java:   if (condition) {
              body1;
          } else {
              body2;
          }

If: if → body (Python: 34,250; Java: 22,392)
  An if statement. The dependent is the body block.
  (Same code examples as If: if → else.)

If: body → orelse (Python: 11,024; Java: 4,976)
  An if statement. The head is the body block and the dependent is the body of the else block.
  (Same code examples as If: if → else.)

While: test → body (Python: 743; Java: 975)
  The while loop repeatedly executes the body block as long as the specified condition is true.
  Python: while condition:
              body
  Java:   while (condition) {
              body;
          }

Table 1: Dataset statistics of selected relation types in CodeSyntax. For each relation type, we highlight the head and dependent nodes in the examples, with the head in blue and the dependent in red. We defer the full statistics of all relation types to Table 8 in the appendix.
Its training set is also part of the pre-training data of CodeBERT, so we remove the data samples that are included in the pre-training data of either CuBERT or CodeBERT. Thus, none of the programs in CodeSyntax has been seen by CuBERT or CodeBERT in the pre-training phase.
In total, CodeSyntax contains 18,701 code samples annotated with 1,342,050 relation edges in 43 relation types for Python, and 13,711 code samples annotated with 864,411 relation edges in 39 relation types for Java. Each code sample is an entire function consisting of multiple statements, which is analogous to a paragraph in NL. Each relation corresponds to an edge in the program AST; specifically, we utilize the Python ast module (Foundation, 2021) and the Java org.eclipse.jdt.core.dom.ASTParser class (Contributors, 2014) to parse a code sample into an AST. We present some examples of relation types in Table 1, and we defer the description of all relation types to Table 8 in the appendix. More details about relation extraction are discussed in Appendix A. Note that we can easily extend the dataset to cover more languages, since the workflow for extracting relations is automated and AST parsers are available for most popular programming languages.
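To make the extraction workflow concrete, below is a minimal sketch, not the authors' actual pipeline, of how an edge such as Assign: target → value can be recovered with the Python ast module. The function name and the (line, column) encoding of node positions are our own illustrative choices.

```python
import ast

def extract_assign_edges(source: str):
    """Illustrative extraction of Assign: target -> value edges.
    Each edge is a pair of (line, column) positions for the head
    (the assignment target) and the dependent (the assigned value)."""
    tree = ast.parse(source)
    edges = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and node.targets:
            head = node.targets[0]      # head node: the target
            dependent = node.value      # dependent node: the value
            edges.append(((head.lineno, head.col_offset),
                          (dependent.lineno, dependent.col_offset)))
    return edges

print(extract_assign_edges("target = 10\nresult = obj.function(argument)"))
# -> [((1, 0), (1, 9)), ((2, 0), (2, 9))]
```

The benchmark itself maps such AST node positions onto code token indices (relation extraction details are in Appendix A), and the Java edges come from an analogous traversal with org.eclipse.jdt.core.dom.ASTParser.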
We observe several characteristics of the relations in CodeSyntax. First, the keywords in PL play an important role in recognizing the code structure. Specifically, some relation types have fixed keywords as the edge nodes, such as the If: if → else relation. Meanwhile, compared to the dependency relations in NL, the relation edges in the program AST tend to connect nodes that are much farther away from each other. As shown in Figure 3, the average offset between head and dependent nodes is no more than 10 for dependency relations in NL, while the average offset for a relation type in code can be more than 100 tokens. Specifically, in CodeSyntax, there are 22 near dependency types whose average offsets are less than 10, and 12 far dependency types whose average offsets are above 10.

Figure 3: Offset distribution of relation types in (a) CodeSyntax and (b) the NL corpus. The x-axis is the average positional offset between heads and dependents for each relation; the y-axis is the number of relation types with that average offset value. See Section 3 for more details on the NL corpus.
3 Evaluation Setup
Do pre-trained language models capture the code structure without direct supervision of the syntactic information? To investigate this question, we evaluate several pre-trained language models without fine-tuning, and compare their performance in understanding the syntax of NL and PL.
Natural language benchmark.
To compare the performance on CodeSyntax to NL syntax understanding, we construct an NL benchmark that includes English and German. Specifically, we use the English News Text Treebank: Penn Treebank Revised (Bies et al., 2015) labeled with Stanford Dependencies (de Marneffe and Manning, 2008a,b), and the German Hamburg Dependency Treebank (Foth et al., 2014) labeled with Universal Dependencies (de Marneffe et al., 2021). In total, the English dataset has 48,883 sentences, 43 relation types, and 1,147,526 relation edges; the German dataset has 18,459 sentences, 35 relation types, and 307,791 relation edges.
Attention probing approach.
Some prior works demonstrate that a Transformer architecture (Vaswani et al., 2017) pre-trained on a text corpus, such as BERT (Devlin et al., 2019), contains attention heads that specialize in certain dependency relations in NL (Raganato et al., 2018; Clark et al., 2019). Specifically, in the Transformer architecture, the vector e_i of each input token is transformed into the query and key vectors q_i and k_i via linear transformations, and the transformations vary among different attention heads. For the i-th token, the attention weight assigned to the j-th token is

$$\alpha_{i,j} = \frac{\exp(q_i^\top k_j)}{\sum_l \exp(q_i^\top k_l)}.$$

The attention weight indicates how important the j-th token is with respect to the i-th token. Typically, different attention heads learn different weights between input tokens. Therefore, to measure the correctness of recognizing a relation type r, for each edge <h, t, r> in the program AST, where h is the head node and t is the dependent node, we enumerate all attention heads and compute the attention weight α_{h,t}. If an attention head tends to assign high attention weights to the token pairs belonging to relation type r, we consider the relation type to be captured. We defer more implementation details of attention map extraction to Appendix B.
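As a concrete illustration of this probing procedure, the sketch below scores one (head, dependent) token pair against every attention head of a HuggingFace-style checkpoint. It is a simplified approximation of the setup described above: microsoft/codebert-base is assumed to be the public CodeBERT checkpoint, and the token indices are illustrative rather than taken from the benchmark's alignment between code tokens and subwords.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a pre-trained code model and expose its attention maps.
model_name = "microsoft/codebert-base"          # assumed public CodeBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

code = "result = obj.function(argument)"
inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    # Tuple with one tensor per layer, each of shape (batch, heads, seq, seq).
    attentions = model(**inputs).attentions

def heads_capturing_edge(attentions, h_idx, t_idx, k=1):
    """Return (layer, head) pairs whose top-k attention from the head token
    at position h_idx includes the dependent token at position t_idx."""
    hits = []
    for layer, att in enumerate(attentions):
        topk = att[0, :, h_idx].topk(k, dim=-1).indices   # (heads, k)
        for head in range(topk.size(0)):
            if t_idx in topk[head].tolist():
                hits.append((layer, head))
    return hits

# Illustrative probe of a single edge <h, t, r>, e.g. Assign: target -> value.
print(heads_capturing_edge(attentions, h_idx=1, t_idx=3, k=3))
```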
Metrics.
We use the unlabeled attachment score (UAS) to measure syntax understanding performance, and we consider top-k scores with different values of k. To compute top-k scores for language models, for each attention head, given the head token h in a relation edge <h, t, r>, we compute the attention weights over all tokens in the input code, and we consider the prediction to be correct if the dependent token t is among the k tokens with the highest attention weights. For each relation, we select the best-performing attention head and use its score as the model's score for that relation. We calculate a model's average score over all relations as the final score of the model.

In NL dependency parsing, the dependent node t usually corresponds to a single word. However, in PL, the dependent can be a block that contains multiple code tokens. For example, in the If: if → body relation, the head is the keyword if, while the dependent is the entire body block. Therefore, we measure three metrics. First-token and last-token metrics: the prediction is deemed correct if it successfully predicts the first or last token of the dependent block, respectively. Any-token metric: the prediction is considered correct if it predicts any token within the dependent block. While these metrics are not perfect and any single metric may be incomplete, we observe that our findings generally hold for all three metrics we evaluated. Note that the first-token metric is stricter than the any-token metric by design. Unless otherwise specified, we report top-k scores using the first-token metric by default.
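To make the three block-level metrics concrete, here is a small sketch under the assumption that a dependent block is represented by an inclusive range of token indices and a top-k prediction is a list of token indices; the function and variable names are illustrative, not taken from the benchmark code.

```python
def block_metrics(predicted_topk, block_start, block_end):
    """Evaluate a top-k prediction against a dependent block spanning
    token indices block_start..block_end (inclusive).
    Returns (first_token_hit, last_token_hit, any_token_hit)."""
    preds = set(predicted_topk)
    first_hit = block_start in preds
    last_hit = block_end in preds
    any_hit = any(block_start <= p <= block_end for p in preds)
    return first_hit, last_hit, any_hit

# e.g., an If: if -> body edge whose body block covers tokens 12..20,
# probed with an attention head whose top-3 predictions are tokens 5, 12, 30.
print(block_metrics([5, 12, 30], block_start=12, block_end=20))
# -> (True, False, True)
```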
Model architectures.
Table 2 summarizes the models evaluated in this work. For language models over code, we consider CuBERT (Kanade et al., 2020) and CodeBERT (Feng et al., 2020), and we evaluate their released pre-trained checkpoints. Both of them are based on architectures initially designed for NL. Specifically, CuBERT utilizes the BERT (Devlin et al., 2019) architecture, and CodeBERT utilizes the RoBERTa (Liu et al., 2019) architecture. For NL models, we also evaluate multilingual variants of BERT and RoBERTa on the German dataset, i.e., Multilingual BERT (Pires et al., 2019) and XLM-RoBERTa (Conneau et al., 2020). Both code language models are cased, so we also evaluate the cased versions of the NL models.
Programming Languages | Natural Languages
CuBERT                | BERT, Multilingual BERT
CodeBERT              | RoBERTa, XLM-RoBERTa

Table 2: Model architectures evaluated on the PL and NL benchmarks. Models in the same row share the same architecture, but are pre-trained on different corpora.
Baselines.
To examine how well the attention heads perform, we compare against a simple offset baseline and a simple keyword baseline. The offset baseline with an offset value of i always selects the token i positions after the input token as its prediction when i > 0, and the token i positions before the input token when i < 0. The keyword baseline with a keyword key always predicts the next occurrence of the key token. In our experiments, we evaluate offset baselines with every offset value between 0 and 512 for PL, and between -512 and 512 for NL. We use all Python and Java keywords for the keyword baselines on the Python and Java datasets respectively, including tokens such as if, for, in, etc. To evaluate top-k scores for baselines where k ≥ 2, we combine k simple baselines with different offset (keyword) values to give k predictions. To select the k offset (keyword) values, we repeatedly and greedily include the next value that yields the highest performance increase for the relation type under consideration.
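The following is a minimal sketch of the offset baseline and its greedy top-k combination over a toy list of edges, assuming each edge is already reduced to a pair of integer token positions (head, dependent); the function names and toy data are our own illustrative rendering of the procedure described above.

```python
def offset_baseline_score(edges, offset):
    """Top-1 accuracy of the offset baseline: predict head position + offset."""
    return sum(1 for h, t in edges if h + offset == t) / max(len(edges), 1)

def greedy_topk_offsets(edges, candidate_offsets, k):
    """Greedily pick k offsets, each time adding the offset that covers the
    most not-yet-covered edges, mirroring the top-k baseline combination."""
    chosen, covered = [], set()
    for _ in range(k):
        best_offset, best_gain = None, -1
        for off in candidate_offsets:
            gain = sum(1 for i, (h, t) in enumerate(edges)
                       if i not in covered and h + off == t)
            if gain > best_gain:
                best_offset, best_gain = off, gain
        chosen.append(best_offset)
        covered |= {i for i, (h, t) in enumerate(edges) if h + best_offset == t}
    return chosen, len(covered) / max(len(edges), 1)

# Toy edges as (head token index, dependent token index) pairs.
edges = [(0, 2), (5, 7), (10, 11), (20, 22)]
print(offset_baseline_score(edges, offset=2))            # 0.75
print(greedy_topk_offsets(edges, range(0, 6), k=2))      # ([2, 1], 1.0)
```

A keyword baseline can be implemented analogously by scanning forward from the head token for the next occurrence of a given keyword instead of applying a fixed offset.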
4 Experiments
In this section, we present the results of pre-trained language models for both PL and NL syntax understanding tasks, and discuss the key observations that distinguish PL from NL.
4.1 Main Results
Language Model              Top-k Score
                            k=1    k=3    k=10   k=20
Python
  Offset                    43.6   63.7   87.3   94.9
  Keyword                   15.7   21.9   23.6   23.8
  Combined                  49.4   69.7   90.1   96.3
  CuBERT                    39.2   58.4   81.3   91.4
  CodeBERT                  33.1   51.8   78.6   89.2
  RoBERTa                   34.5   56.9   82.5   91.3
  Diff (Model - Baseline)  -10.2  -11.3   -8.8   -4.9
Java
  Offset                    52.7   71.5   87.1   94.3
  Keyword                   22.4   27.3   30.2   30.6
  Combined                  60.4   77.2   90.0   96.1
  CuBERT                    39.7   59.8   80.0   90.2
  CodeBERT                  36.3   57.1   78.3   88.8
  RoBERTa                   34.7   57.8   80.3   90.5
  Diff (Model - Baseline)  -20.7  -17.4  -10.0   -5.9

Table 3: Top-k scores for code syntax understanding. For each language, the upper block contains the results of the baselines, including: (1) Offset: always picking the token at a fixed positional offset; (2) Keyword: matching a fixed keyword nearby; and (3) Combined: combining the best options from Offset and Keyword. Score differences are calculated as the best attention score minus the best baseline score for each language, where a positive value would indicate that the language model surpasses the baseline.
We present our main results comparing syntactic relation understanding performance on PL and NL in Tables 3 and 4, respectively. First, on CodeSyntax, language models generally perform worse than the simple offset baseline and its combination with the keyword baseline, which indicates