CAT-probing: A Metric-based Approach to Interpret How Pre-trained Models for Programming Language Attend Code Structure
Nuo Chen, Qiushi Sun, Renyu Zhu, Xiang Li†, Xuesong Lu, and Ming Gao
School of Data Science and Engineering, East China Normal University, Shanghai, China
{nuochen,qiushisun,renyuzhu}@stu.ecnu.edu.cn, {xiangli,xslu,mgao}@dase.ecnu.edu.cn
Equal contribution, authors are listed alphabetically. †Corresponding author.
Abstract
Code pre-trained models (CodePTMs) have recently demonstrated significant success in code intelligence. To interpret these models, some probing methods have been applied, but these methods fail to consider the inherent characteristics of code. In this paper, to address this problem, we propose a novel probing method, CAT-probing, to quantitatively interpret how CodePTMs attend code structure. We first denoise the input code sequences based on the token types pre-defined by the compilers, filtering out tokens whose attention scores are too small. After that, we define a new metric, the CAT-score, to measure the commonality between the token-level attention scores generated by CodePTMs and the pair-wise distances between the corresponding AST nodes. The higher the CAT-score, the stronger the ability of a CodePTM to capture code structure. We conduct extensive experiments that integrate CAT-probing with representative CodePTMs for different programming languages. Experimental results show the effectiveness of CAT-probing in CodePTM interpretation. Our code and data are publicly available at https://github.com/nchen909/CodeAttention.
1 Introduction
In the era of “Big Code” (Allamanis et al., 2018), programming platforms such as GitHub and Stack Overflow have generated massive amounts of open-source code data. Under the assumption of “Software Naturalness” (Hindle et al., 2016), pre-trained models (Vaswani et al., 2017; Devlin et al., 2019; Liu et al., 2019) have been applied in the domain of code intelligence.
Existing code pre-trained models (CodePTMs) can be mainly divided into two categories: structure-free methods (Feng et al., 2020; Svyatkovskiy et al., 2020) and structure-based methods (Wang et al., 2021b; Niu et al., 2022b). The former utilize only the information from raw code texts, while the latter also employ code structures, such as data flow (Guo et al., 2021) and flattened ASTs (abstract syntax trees) (Guo et al., 2022), to enhance the performance of pre-trained models. For more details, readers can refer to Niu et al. (2022a). Recently, several works have used probing techniques (Clark et al., 2019a; Vig and Belinkov, 2019; Zhang et al., 2021) to investigate what CodePTMs learn. For example, Karmakar and Robbes (2021) first probe into CodePTMs and construct four probing tasks to explain them. Troshin and Chirkova (2022) also define a series of novel diagnostic probing tasks about code syntactic structure. Further, Wan et al. (2022) conduct qualitative structural analyses to evaluate how CodePTMs interpret code structure.

Despite these successes, all of these methods lack a quantitative characterization of how well CodePTMs learn code structure. A research question therefore arises: can we develop a new probing method to quantitatively evaluate how CodePTMs attend code structure?
In this paper, we propose a metric-based probing method, CAT-probing, to quantitatively evaluate how CodePTMs' Attention scores relate to distances between AST nodes. First, to denoise the input code sequence in the original attention score matrix, we classify the rows/columns by the token types pre-defined by compilers and retain only the tokens whose types have the highest proportion scores, deriving a filtered attention matrix (see Figure 1(b)). Meanwhile, inspired by prior works (Wang et al., 2020; Zhu et al., 2022), we add edges to improve the connectivity of the AST and calculate the distances between the nodes corresponding to the selected tokens, which yields a distance matrix as shown in Figure 1(c).
Figure 1: Visualization of the U-AST structure, the attention matrix generated by the last layer of CodeBERT (Feng et al., 2020), and the distance matrix. (a) A Python code snippet with its corresponding U-AST. (b) Heatmap of the averaged attention weights after attention matrix filtering. (c) Heatmap of the pair-wise token distances in the U-AST. In the heatmaps, the darker the color, the more salient the attention score or the closer the nodes. In this toy example, only the token “.” between “tmpbuf” and “append” is filtered. More visualization examples of filtering are given in Appendix D.
After that, we define the CAT-score to measure the matching degree between the filtered attention matrix and the distance matrix. Specifically, a pair of point-wise elements of the two matrices is matched if both of the following conditions are satisfied: 1) the attention score is larger than a threshold; 2) the distance value is smaller than a threshold. If only one of the conditions holds, the elements are unmatched. The CAT-score is then the ratio of the number of matched elements to the sum of matched and unmatched elements. Finally, the CAT-score is used to interpret how CodePTMs attend code structure, where a higher score indicates that the model has learned more structural information.
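To make the metric concrete, the following is a minimal NumPy sketch of the CAT-score computation described above, not the released implementation: attn and dist are assumed to be the already-filtered matrices, and the thresholds theta_a and theta_d are illustrative placeholders rather than values from the paper.

```python
import numpy as np

def cat_score(attn, dist, theta_a=0.1, theta_d=2):
    """attn: filtered token-level attention matrix (n x n);
    dist: filtered pair-wise U-AST distance matrix (n x n)."""
    high_attn = attn > theta_a    # condition 1: salient attention
    close_nodes = dist < theta_d  # condition 2: nearby AST nodes
    matched = np.sum(high_attn & close_nodes)    # both conditions hold
    unmatched = np.sum(high_attn ^ close_nodes)  # exactly one condition holds
    # element pairs that satisfy neither condition do not enter the ratio
    return matched / (matched + unmatched)
```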
Our main contributions can be summarized as follows:

• We propose a novel metric-based probing method, CAT-probing, to quantitatively interpret how CodePTMs attend code structure.

• We apply CAT-probing to several representative CodePTMs and perform extensive experiments to demonstrate the effectiveness of our method (see Section 4.3).

• We draw two notable observations from the empirical evaluation: 1) the token types that CodePTMs focus on vary with programming languages and differ considerably from the general perceptions of human programmers (see Section 4.2); 2) the ability of CodePTMs to capture code structure varies dramatically across layers (see Section 4.4).
2 Code Background
2.1 Code Basics
Code can be represented in two modalities: the source code and the code structure (AST), as shown in Figure 1(a). In this paper, we use Tree-sitter (github.com/tree-sitter) to generate ASTs, where each token in the raw code is tagged with a unique type, such as “identifier”, “return”, and “=”. Further, following prior works (Wang et al., 2020; Zhu et al., 2022), we connect adjacent leaf nodes with added data-flow edges, which increases the connectivity of the AST. We call the upgraded AST the U-AST.
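As a concrete sketch of this construction (not the authors' released code), the snippet below parses a small Python function with Tree-sitter and builds a U-AST graph whose leaves carry compiler-defined token types. It assumes the tree_sitter, tree_sitter_python, and networkx packages; the parser-construction call follows recent py-tree-sitter releases, whose API has changed across versions.

```python
import networkx as nx
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

parser = Parser(Language(tspython.language()))
code = b"def write(self, data):\n    self.tmpbuf.append(data)\n"
tree = parser.parse(code)

g = nx.Graph()  # the U-AST
leaves = []

def walk(node):
    """Add AST edges and collect leaf tokens; node.type is the
    compiler-defined token type (e.g. "identifier", ".")."""
    if node.child_count == 0:
        leaves.append(node)
    for child in node.children:
        g.add_edge(node.id, child.id)  # AST edge
        walk(child)

walk(tree.root_node)

# Connect adjacent leaves to improve connectivity, as described above.
for left, right in zip(leaves, leaves[1:]):
    g.add_edge(left.id, right.id)

print([(leaf.text.decode(), leaf.type) for leaf in leaves])
```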
2.2 Code Matrices
There are two types of code matrices: the attention matrix and the distance matrix. Specifically, the attention matrix holds the attention scores generated by a Transformer-based CodePTM, while the distance matrix captures the distances between nodes in the U-AST. We transform the original subtoken-level attention matrix into a token-level attention matrix by averaging the attention scores of the subtokens within each token. For the distance matrix, we use the shortest-path length to compute the distance between the leaf nodes of the U-AST. Our attention matrix and distance matrix are shown in Figure 1(b) and Figure 1(c), respectively.
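Both transformations can be sketched as follows; this is an illustrative sketch rather than the released implementation. The sub2tok mapping (subtoken index to token index) would come from the model's tokenizer, and g and leaves are the U-AST graph and leaf list from the construction in Section 2.1.

```python
import numpy as np
import networkx as nx

def token_level_attention(sub_attn, sub2tok, n_tokens):
    """Average a subtoken-level attention matrix (n_sub x n_sub) into a
    token-level matrix, where sub2tok[i] is the token index of subtoken i."""
    attn = np.zeros((n_tokens, n_tokens))
    counts = np.zeros((n_tokens, n_tokens))
    for i, ti in enumerate(sub2tok):
        for j, tj in enumerate(sub2tok):
            attn[ti, tj] += sub_attn[i, j]
            counts[ti, tj] += 1
    return attn / np.maximum(counts, 1)  # avoid division by zero

def leaf_distance_matrix(g, leaves):
    """Pairwise shortest-path lengths between U-AST leaf nodes."""
    d = np.zeros((len(leaves), len(leaves)))
    for i, u in enumerate(leaves):
        lengths = nx.single_source_shortest_path_length(g, u.id)
        for j, v in enumerate(leaves):
            d[i, j] = lengths[v.id]
    return d
```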