CAT-probing: A Metric-based Approach to Interpret How Pre-trained Models for Programming Language Attend Code Structure
Nuo Chen, Qiushi Sun, Renyu Zhu, Xiang Li†, Xuesong Lu, and Ming Gao
School of Data Science and Engineering, East China Normal University, Shanghai, China
{nuochen,qiushisun,renyuzhu}@stu.ecnu.edu.cn, {xiangli,xslu,mgao}@dase.ecnu.edu.cn
Equal contribution, authors are listed alphabetically. †Corresponding author.
Abstract
Code pre-trained models (CodePTMs) have recently demonstrated significant success in code intelligence. To interpret these models, some probing methods have been applied, but these methods fail to consider the inherent characteristics of code. In this paper, to address this problem, we propose a novel probing method, CAT-probing, to quantitatively interpret how CodePTMs attend code structure. We first denoise the input code sequences based on the token types pre-defined by the compilers, filtering out tokens whose attention scores are too small. After that, we define a new metric, the CAT-score, to measure the commonality between the token-level attention scores generated by CodePTMs and the pair-wise distances between the corresponding AST nodes. The higher the CAT-score, the stronger the ability of a CodePTM to capture code structure. We conduct extensive experiments that integrate CAT-probing with representative CodePTMs for different programming languages. Experimental results show the effectiveness of CAT-probing in CodePTM interpretation. Our code and data are publicly available at https://github.com/nchen909/CodeAttention.
1 Introduction
In the era of “Big Code” (Allamanis et al., 2018), programming platforms such as GitHub and Stack Overflow have generated massive amounts of open-source code data. Under the assumption of “Software Naturalness” (Hindle et al., 2016), pre-trained models (Vaswani et al., 2017; Devlin et al., 2019; Liu et al., 2019) have been applied in the domain of code intelligence.
Existing code pre-trained models (CodePTMs) can be mainly divided into two categories: structure-free methods (Feng et al., 2020; Svyatkovskiy et al., 2020) and structure-based methods (Wang et al., 2021b; Niu et al., 2022b). The former utilize only the information from raw code texts, while the latter also employ code structures, such as data flow (Guo et al., 2021) and flattened ASTs (abstract syntax trees) (Guo et al., 2022), to enhance the performance of pre-trained models. For more details, readers can refer to Niu et al. (2022a). Recently, several works have used probing techniques (Clark et al., 2019a; Vig and Belinkov, 2019; Zhang et al., 2021) to investigate what CodePTMs learn. For example, Karmakar and Robbes (2021) first probe into CodePTMs and construct four probing tasks to explain them. Troshin and Chirkova (2022) also define a series of novel diagnostic probing tasks about code syntactic structure. Further, Wan et al. (2022) conduct qualitative structural analyses to evaluate how CodePTMs interpret code structure.

Despite these successes, all of these methods lack a quantitative characterization of how well CodePTMs learn code structure. A research question therefore arises: can we develop a new probing method to quantitatively evaluate how CodePTMs attend code structure?
In this paper, we propose a metric-based probing method, CAT-probing, to quantitatively evaluate how CodePTMs' Attention scores relate to distances between AST nodes. First, to denoise the input code sequence in the original attention score matrix, we classify the rows/columns by the token types pre-defined by compilers and retain only the tokens whose types have the highest proportion scores, deriving a filtered attention matrix (see Figure 1(b)). Meanwhile, inspired by prior works (Wang et al., 2020; Zhu et al., 2022), we add edges to improve the connectivity of the AST and calculate the distances between the nodes corresponding to the selected tokens, which yields a distance matrix as shown in Figure 1(c).
Figure 1: Visualization of the U-AST structure, the attention matrix generated by the last layer of CodeBERT (Feng et al., 2020), and the distance matrix. (a) A Python code snippet with its corresponding U-AST. (b) Heatmap of the averaged attention weights after attention matrix filtering. (c) Heatmap of the pair-wise token distances in the U-AST. In the heatmaps, the darker the color, the more salient the attention score or the closer the nodes. In this toy example, only the token “.” between “tmpbuf” and “append” is filtered. More visualization examples of filtering are given in Appendix D.
After that, we define the CAT-score to measure the matching degree between the filtered attention matrix and the distance matrix. Specifically, a pair of point-wise elements of the two matrices is matched if both of the following conditions are satisfied: 1) the attention score is larger than a threshold; 2) the distance value is smaller than a threshold. If only one of the conditions holds, the elements are unmatched. The CAT-score is then the ratio of the number of matched elements to the sum of matched and unmatched elements. Finally, the CAT-score is used to interpret how CodePTMs attend code structure, where a higher score indicates that the model has learned more structural information.
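To make the metric concrete, the following is a minimal NumPy sketch of the CAT-score computation described above, not the released implementation: attn and dist are assumed to be the already-filtered matrices, and the thresholds theta_a and theta_d are illustrative placeholders rather than values from the paper.

```python
import numpy as np

def cat_score(attn, dist, theta_a=0.1, theta_d=2):
    """attn: filtered token-level attention matrix (n x n);
    dist: filtered pair-wise U-AST distance matrix (n x n)."""
    high_attn = attn > theta_a    # condition 1: salient attention
    close_nodes = dist < theta_d  # condition 2: nearby AST nodes
    matched = np.sum(high_attn & close_nodes)    # both conditions hold
    unmatched = np.sum(high_attn ^ close_nodes)  # exactly one condition holds
    # element pairs that satisfy neither condition do not enter the ratio
    return matched / (matched + unmatched)
```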
Our main contributions can be summarized as follows:

• We propose a novel metric-based probing method, CAT-probing, to quantitatively interpret how CodePTMs attend code structure.

• We apply CAT-probing to several representative CodePTMs and perform extensive experiments to demonstrate the effectiveness of our method (see Section 4.3).

• We draw two notable observations from the empirical evaluation: 1) the token types that CodePTMs focus on vary with programming languages and differ considerably from the general perceptions of human programmers (see Section 4.2); 2) the ability of CodePTMs to capture code structure varies dramatically across layers (see Section 4.4).
2 Code Background
2.1 Code Basics
Code can be represented in two modalities: the source code and the code structure (AST), as shown in Figure 1(a). In this paper, we use Tree-sitter (github.com/tree-sitter) to generate ASTs, where each token in the raw code is tagged with a unique type, such as “identifier”, “return”, and “=”. Further, following prior works (Wang et al., 2020; Zhu et al., 2022), we connect adjacent leaf nodes with added data-flow edges, which increases the connectivity of the AST. We call the upgraded AST the U-AST.
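As a concrete sketch of this construction (not the authors' released code), the snippet below parses a small Python function with Tree-sitter and builds a U-AST graph whose leaves carry compiler-defined token types. It assumes the tree_sitter, tree_sitter_python, and networkx packages; the parser-construction call follows recent py-tree-sitter releases, whose API has changed across versions.

```python
import networkx as nx
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

parser = Parser(Language(tspython.language()))
code = b"def write(self, data):\n    self.tmpbuf.append(data)\n"
tree = parser.parse(code)

g = nx.Graph()  # the U-AST
leaves = []

def walk(node):
    """Add AST edges and collect leaf tokens; node.type is the
    compiler-defined token type (e.g. "identifier", ".")."""
    if node.child_count == 0:
        leaves.append(node)
    for child in node.children:
        g.add_edge(node.id, child.id)  # AST edge
        walk(child)

walk(tree.root_node)

# Connect adjacent leaves to improve connectivity, as described above.
for left, right in zip(leaves, leaves[1:]):
    g.add_edge(left.id, right.id)

print([(leaf.text.decode(), leaf.type) for leaf in leaves])
```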
2.2 Code Matrices
There are two types of code matrices: the attention matrix and the distance matrix. Specifically, the attention matrix holds the attention scores generated by a Transformer-based CodePTM, while the distance matrix captures the distances between nodes in the U-AST. We transform the original subtoken-level attention matrix into a token-level attention matrix by averaging the attention scores of the subtokens within each token. For the distance matrix, we use the shortest-path length to compute the distance between the leaf nodes of the U-AST. Our attention matrix and distance matrix are shown in Figure 1(b) and Figure 1(c), respectively.
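Both transformations can be sketched as follows; this is an illustrative sketch rather than the released implementation. The sub2tok mapping (subtoken index to token index) would come from the model's tokenizer, and g and leaves are the U-AST graph and leaf list from the construction in Section 2.1.

```python
import numpy as np
import networkx as nx

def token_level_attention(sub_attn, sub2tok, n_tokens):
    """Average a subtoken-level attention matrix (n_sub x n_sub) into a
    token-level matrix, where sub2tok[i] is the token index of subtoken i."""
    attn = np.zeros((n_tokens, n_tokens))
    counts = np.zeros((n_tokens, n_tokens))
    for i, ti in enumerate(sub2tok):
        for j, tj in enumerate(sub2tok):
            attn[ti, tj] += sub_attn[i, j]
            counts[ti, tj] += 1
    return attn / np.maximum(counts, 1)  # avoid division by zero

def leaf_distance_matrix(g, leaves):
    """Pairwise shortest-path lengths between U-AST leaf nodes."""
    d = np.zeros((len(leaves), len(leaves)))
    for i, u in enumerate(leaves):
        lengths = nx.single_source_shortest_path_length(g, u.id)
        for j, v in enumerate(leaves):
            d[i, j] = lengths[v.id]
    return d
```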