systems. Specifically, neural language models take
the program as a token sequence, while classic
program generation systems utilize the language
grammar and code structure. Despite the advanced
performance of pre-trained language models on
code understanding tasks, what these models have
learned from the code corpus remains unclear.
In this work, we investigate whether large-scale
pre-training is all we need for code representation
learning. In particular, we conduct the first system-
atic study to analyze how the pre-trained language
models understand the syntactic structures of pro-
grams. To this end, we introduce CodeSyntax, a
large-scale benchmark consisting of programs an-
notated with the syntactic relationships between
different tokens. The ground truth syntactic rela-
tionships are extracted from edges in the abstract
syntax trees (AST) of the programs. Figure 1
shows some examples. These syntactic relations
are functionally analogous to dependency relations
for NL, where prior work has demonstrated that
the attention heads of pre-trained language models
can help to identify NL relation types (Clark et al.,
2019; Raganato et al., 2018). To measure how well
the pre-trained language models capture the code
syntactic structures, we adapt this approach to the
PL domain. We focus on investigating the zero-
shot capability of existing pre-training methods in
our experiments, and we evaluate these pre-trained
models without finetuning them on our benchmark.
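As a minimal sketch of this zero-shot probing setup (the helper names and tensor layout below are illustrative assumptions, not our exact implementation), each attention head is scored by how often a source token attends most strongly to its ground-truth dependent:

import numpy as np

def head_accuracy(attention, relation_pairs):
    # attention: [seq_len, seq_len] weights of one attention head,
    # where row i is the distribution of token i over all tokens.
    # relation_pairs: (source, target) token-index pairs derived from AST edges.
    # A pair counts as correct if the source token's largest attention
    # weight falls on the ground-truth target token.
    correct = sum(int(np.argmax(attention[src]) == tgt)
                  for src, tgt in relation_pairs)
    return correct / max(len(relation_pairs), 1)

def best_head_accuracy(all_attentions, relation_pairs):
    # all_attentions: [num_layers, num_heads, seq_len, seq_len],
    # taken from a frozen pre-trained model (no finetuning).
    return max(head_accuracy(all_attentions[layer, head], relation_pairs)
               for layer in range(all_attentions.shape[0])
               for head in range(all_attentions.shape[1]))

Reporting the best-scoring head per relation type follows the spirit of the attention analysis for NL dependencies cited above.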
We evaluate the state-of-the-art pre-trained lan-
guage models for code representation learning, in-
cluding CuBERT (Kanade et al., 2020) and Code-
BERT (Feng et al., 2020). A common character-
istic of these models is that they share the same
Transformer-based architectural design as NL mod-
els (Vaswani et al., 2017; Devlin et al., 2019). This
allows us to directly compare their performance
in capturing the syntactic structure. We present a
preview of our key results in Figure 2. Our main
observation is that pre-training is insufficient for
learning the syntactic relations in code. First, we
find that the models pre-trained on code do not al-
ways outperform models pre-trained on an NL corpus
alone. Surprisingly, RoBERTa, which has an identical
model architecture but is not trained on any code,
achieves better performance than CodeBERT, which
is trained on both text and code corpora. This
indicates that pre-training on programs as token
sequences does not help learn the syntactic rela-
tions. In contrast, even without supervision on
dependency relations, pre-training still enables
language models to understand NL syntax to some extent.
Moreover, for code syntax understanding, the
pre-trained models even perform worse than simple
baselines that pick the token at a fixed offset.
For example, always selecting the (p+2)-th token as
the p-th token’s dependency yields higher accuracy
than any attention head for several relation types.
On the other hand, the same model architectures
pre-trained on text corpora achieve decent accuracy
in identifying the dependency relations in the NL
domain, where the performance of the same simple
baselines is far behind.
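As a minimal sketch (illustrative only, with a hypothetical helper name), such a fixed-offset baseline can be written as:

def offset_baseline_accuracy(relation_pairs, offset=2):
    # Predict that the dependent of the p-th token is always the
    # (p + offset)-th token; offset=2 corresponds to the (p+2)-th-token
    # baseline mentioned above.
    # relation_pairs: (source, target) token-index pairs derived from AST edges.
    if not relation_pairs:
        return 0.0
    correct = sum(int(src + offset == tgt) for src, tgt in relation_pairs)
    return correct / len(relation_pairs)

The baseline uses no model at all; it only exploits the fact that many syntactic relations connect tokens at a nearly constant distance.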
Our analysis reveals several key differences be-
tween NL and PL that lead to different capabilities
of understanding the syntax for pre-trained mod-
els. First, programs are more structured than NL
sentences. Programs usually contain hierarchical
structures representing long-term dependencies be-
tween code tokens. Consequently, a large num-
ber of syntactic relation types are between distant
tokens, which can be difficult to recognize for at-
tention heads. In contrast, the dependency
relations in NL sentences mostly connect nearby
token pairs, and in this case the attention heads are
more capable of identifying the correct relations.
Meanwhile, language models are good at recog-
nizing keyword-based relations, such as picking
the corresponding else keyword for an if token.
Interestingly, we find that the inclusion of tokens
such as newlines and semicolons notably affects
the performance in the code domain.
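For illustration, consider the following small Python function (an example of our own, not one drawn from the benchmark): the if and else keywords form an easily anchored keyword pair, but the edge between them must span the entire body of the if branch, including newline tokens, whereas the edge from if to its body points to the immediately following statement.

def describe(x):
    # if -> body: a short-range edge to the next statement.
    # if -> else: a keyword-anchored edge that skips the whole body.
    if x is None:
        count = 0
        message = "no value"
    else:
        count = len(str(x))
        message = f"value of printed length {count}"
    return message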
Our findings suggest that existing pre-trained
models perform quite differently in PL and NL do-
mains in terms of the ability to understand syntax.
Thus, directly applying training paradigms devel-
oped for NL could be suboptimal for program learn-
ing, and we consider designing better approaches
to model the code structure as future work.
2 CodeSyntax: Benchmarking Code Syntax Understanding
We construct the CodeSyntax benchmark to eval-
uate the performance of language models on code
syntax understanding. We focus on Python and
Java, on which the publicly released
model checkpoints of both CuBERT (Kanade et al.,
2020) and CodeBERT (Feng et al., 2020) are pre-
trained. We obtain the code samples from Code-
SearchNet (Husain et al., 2019), which is a large-
scale dataset consisting of code in different pro-