Follow-up Attention: An Empirical Study of
Developer and Neural Model Code Exploration
Matteo Paltenghi, Rahul Pandita, Austin Z. Henley, Albert Ziegler
Abstract—Recent neural models of code, such as OpenAI
Codex and AlphaCode, have demonstrated remarkable pro-
ficiency at code generation due to the underlying attention
mechanism. However, it often remains unclear how the models
actually process code, and to what extent their reasoning and
the way their attention mechanism scans the code match the patterns of developers. A poor understanding of the models' reasoning process limits the way current neural models are leveraged today, so far mostly for their raw predictions. To
fill this gap, this work studies how the processed attention signal
of three open large language models (CodeGen, InCoder, and GPT-J) agrees with how developers look at and explore code
when each answers the same sensemaking questions about code.
Furthermore, we contribute an open-source eye-tracking dataset
comprising 92 manually-labeled sessions from 25 developers
engaged in sensemaking tasks. We empirically evaluate five attention-free heuristics and ten post-processing approaches of CodeGen's attention signal against our ground truth of developers exploring code, including the novel concept of follow-up attention, which exhibits the highest agreement between model and human attention. Our follow-up
attention method can predict the next line a developer will look
at with 47% accuracy. This outperforms a baseline that uses the session history of other developers to recommend the next line, which achieves 42.3% accuracy. These results demonstrate
the potential of leveraging the attention signal of pre-trained
models for effective code exploration.

Matteo Paltenghi is with the University of Stuttgart, Stuttgart, Germany. E-mail: mattepalte@live.it. Work done while at GitHub Next for a research internship. Rahul Pandita and Albert Ziegler are with GitHub Inc., San Francisco, CA, USA. E-mail: {rahulpandita, wunderalbert}@github.com. Austin Z. Henley is with Microsoft Research, Redmond, WA, USA. E-mail: azh321@gmail.com.
I. INTRODUCTION
Large language models (LLMs) pre-trained on code such
as Codex [1], CodeGen [2], and AlphaCode [3] have demon-
strated remarkable proficiency at program synthesis and com-
petitive programming tasks. Yet our understanding of why they
produce a particular solution is limited. In large-scale practical
applications, the models are often used for their prediction
alone, i.e., as generative models, and the way they reason about
code internally largely remains untapped.
These models are often based on the attention mechanism
[4], a key component of the transformer architecture [5].
Besides providing substantial performance benefits, attention
weights have been used to provide interpretability of neural
models [6, 7, 8]. Additionally, existing work [9, 10, 11, 12]
also suggests that the attention mechanism reflects or encodes
objective properties of the source code processed by the model.
We argue that just as software developers consider different
locations in the code individually and follow meaningful
connections between them, the self-attention of transformers
connects and creates information flow between similar and
linked code locations. This raises a question:
Are human attention and model attention comparable? And
if so, can the knowledge about source code conveyed by the
attention weights of neural models be leveraged to support
code exploration?
Although there are other observable signals that might cap-
ture the concept of relevance, such as gradient-based methods [13, 14]
or layer-wise relevance propagation [15], this work focuses on
approaches using only the attention signal. There are two reasons for this choice: (1) almost all state-of-the-art models of
code are based on the transformer block [5], and the attention
mechanism is ultimately its fundamental component, so we
expect the corresponding attention weights to carry directly
meaningful information about the models’ decision process;
(2) attention weights can be extracted almost for free during
the generation with little runtime overhead since the attention
is computed automatically during a single forward pass.
Answering the main question of this study requires a dataset
tracking developers’ attention. In this work, we use visual
attention as a proxy for the elements to which developers
are paying mental attention while looking at code. However,
the existing datasets of visual attention are not suitable for
our purposes. Indeed, they either put developers in an unnatural environment where most of the view is blurred [8], requiring participants to move the mouse over tokens to reveal them, an interaction that can bias how developers naturally explore and understand code, or they contain few and very specific code comprehension tasks [16] on code snippets too short to exhibit any interesting code navigation pattern.
and stimulate developers to not only glance at code, but
also to deeply reason about it, we prepare an ad-hoc code
understanding assignment called the sensemaking task. This
involves questions on code, including mental code execution,
side-effect detection, algorithmic complexity, and deadlock
detection. Moreover, using eye-tracking, we collect and share
a dataset of 92 valid sessions with developers.
On the neural model side, motivated by some recent successful applications of few-shot learning in code generation and code summarization [17, 18], and even zero-shot learning in program repair [19], the sensemaking task is designed to be a zero-shot task for the model, with a specific prompt that triggers it to reason about the question at hand. Then we
query three LLMs of code, namely CodeGen [2], InCoder [20]
and GPT-J [21] on the same sensemaking task and compare
their attention signal, i.e., the attention weights produced during a forward pass by the transformer blocks, to the attention of developers. The
correlation with CodeGen, the largest model, is the highest
among the LLMs studied (r=+0.23), motivating the use of raw
and processed versions of CodeGen’s attention signal for code
exploration. To that end, we experimentally evaluate how well
existing and novel attention post-processing methods align
with the code exploration patterns derived from our dataset’s
chronological sequence of eye-fixation events. To the best of
our knowledge, this work is the first to investigate the attention
signal of these pre-trained models to support code exploration,
a specific code-related task, directly related to code reading
work [22, 23].
We empirically demonstrate that post-processing methods
based on the attention signal can be well aligned with the
way developers explore code. In particular, using the novel
concept of follow-up attention, we achieve the highest overlap
with the developers’ ground truth on which line to explore
next.
Contributions: This paper makes the following contributions:
• Sensemaking Task: A novel task and setup to deepen our understanding of how LLM attention connects to the temporal sequence of shifts in developer focus.
• Eye-Tracking Dataset: A novel dataset of 92 eye-tracking sessions of 25 developers engaged in sensemaking tasks while using a common code editor, with code written in three popular programming languages (Python, C++, and C#).
• Follow-up Attention: The analytical formula for follow-up attention, a novel post-processing approach derived solely from the attention signal, which aligns well with developers' choices of which line to look at next when exploring code.
• Empirical Study: The first comparison of both the effectiveness and the visual attention of LLMs and developers when reasoning on sensemaking questions, together with an empirical evaluation comprising ten post-processing approaches of the attention signal, five heuristics, and an ablation study of follow-up attention against the collected ground truth of developers exploring code.
II. RELATED WORK
This section provides an overview of related work around
the explanatory role of attention and previous studies of the
attention of neural models and developers when reasoning on
code.
Attention as explanation. Preliminary work [24] studying the attention weights of recurrent neural models has
found that the attention weights do not always agree with
other explanation methods and that alternative weights can
be adversarially constructed while still preserving the same
model prediction. However, in response, Wiegreffe and Pinter [25] have shown that such alternative attention weights can only be constructed for a single instance prediction at a time, whereas obtaining a model that is consistently wrong in its explanations is very unlikely. Along the same lines, Tutek and Šnajder [26] have proposed four regularization
methods to mitigate the adversarial exploitation of attention
weights for recurrent models, including the use of residual
connections which are natively embedded into transform-
ers [5], the building blocks of the LLMs studied in this work.
To further corroborate this connection between attention and
explanation, Rabin et al. [27] have shown how even Sivand,
an explainability technique based on program simplification,
pinpoints important tokens that largely overlap with those
reported by the attention mechanism.
Attention studies of neural models of code. Paltenghi
and Pradel [8] have compared the attention weights of neural
models of code and developers’ visual attention when perform-
ing a code summarization task, and found a strong positive
correlation on the copy attention mechanism for an instance of
a pointer network [28]. Further works [9, 11] have then shown
how the attention weights of pre-trained models on source
code capture important properties of the abstract syntax tree
of the program. However, none of them considered the use
of the attention signal for a code-related task, such as code
exploration. Moreover, they are limited to relatively small self-
attention transformer models, whereas we study the attention
of CodeGen [2], InCoder [20] and GPT-J [21], large generative
models with masked self-attention.
Eye-tracking studies. Turner et al. [29] conducted an eye-tracking study involving 38 students fixing or describing five simple Python and C++ programs (5-13 LoC), showing that the
fixation duration is comparable between the two languages.
Beelders [30] has qualitatively observed the eye movement
of 36 students and four lecturers when reading and mentally
executing a short C# program (12 LoC). An eye-tracking
dataset with 216 participants has been collected by [16]; however, it only considers two short code snippets (11-22 LoC), since the setup does not support scrolling. Similarly, Blascheck
and Sharif [22] and Busjahn et al. [23] have studied the reading
order in C++ and Java code comprehension tasks, focusing
on six small programs that could fit into a single screen,
whereas we consider longer snippets and a much larger dataset
of 45 unique tasks. Sharifi et al. [31] have recently studied code navigation strategies on Java code with eye tracking, involving 36 participants and focusing on the bug-fixing process; in contrast, we study the sensemaking task, which might elicit a different kind of reasoning than bug fixing. To more
closely mimic real-world setups in integrated development
environments (IDEs), Guarnera et al. [32] propose iTrace,
an eye-tracking plugin for IDEs that can track developers’
eye movements in more realistic and dynamic coding envi-
ronments beyond a single screen of code. Fakhoury et al. [33] have further proposed Gazel, an IDE plugin that supports eye tracking in the context of source code
editing. Following this latest trend, we also use an IDE plugin
to collect the eye-tracking data, allowing for a more realistic
coding environment.
III. SENSEMAKING TASK
To study developers’ and models’ attention, we prepare a
code understanding task called the sensemaking task because the
developer has to “make sense” of code to answer the question
correctly. One sensemaking task is contained in a single source
code file $p$, composed of four sections: (1) a brief description
of the context of the main code snippet (e.g., The following
code reasons about triangles in the geometrical sense.), (2) the
main code snippet, either sourced from the internet or written
from scratch by the authors, (3) a sensemaking question to
stimulate the reasoning (i.e., Question:), and (4) a final
prompt to trigger the model’s answer (i.e., Answer:). Note
that all the sections except the main snippet are in the form
of code comments. Figure 1 shows an example task, whereas the full list of questions can be seen in Table I.

#************************************************
# The following code reasons about triangles in the geometrical sense.
class point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
def square(x):
    return x*x
def order(a, b, c):
    copy = [a, b, c]
    copy.sort()
    return copy[0], copy[1], copy[2]
...
p1 = point(0, 0)
p2 = point(1, 1)
p3 = point(1, 2)
classifyTriangle(p1, p2, p3)
# Question: What could happen if the call to `order()` were omitted from
# `classifyTriangle`?
# Answer:

Fig. 1: Example of a sensemaking task, with code and the question to be answered in the bottom comment. Completely empty lines have been removed for space reasons.
To source the tasks for our study, we rely on GeeksforGeeks (https://www.geeksforgeeks.org/), a well-known website for programming education
and practice. This website offers a variety of problem state-
ments that are commonly used in typical technical interviews
by modern software companies, as shown by previous re-
search [34]. Therefore, we expect software developers to have some familiarity with these types of programs.
We then create specific sensemaking questions about these
programs, inspired by the kind of questions that an interviewer
might pose, such as asking about the output, complexity, cor-
rectness, or code modification. Indeed, many of our questions
are concrete instances of question templates such as “What
is the purpose of the code?” (nqueens_Q1), “What is the
program supposed to do?” (tree_Q3) or “What code could
have caused this behavior?” (triangle_Q1), which have also been identified as questions that software engineers often ask
themselves in a real working setting [35]. To stimulate code
exploration, many of them are also instances of reachability
questions [36]; namely, they involve the search over all feasible
paths of a program to locate target statements matching
search criteria. Some examples of these are “What are the
implications of this change?” (triangle_Q3) or “How does
application behavior vary in these different situations that
might occur?” (triangle_Q2, tree_Q1, multithread_Q3).
We prepare five main snippets and create three unique ques-
tions for each of them. Then we translate the same task into
three programming languages: Python, C++, and C#. In total,
we have 45 unique tasks. Although the sensemaking task
includes questions that might have also been asked in studies
focused on code comprehension [37], the main difference is that those studies typically restrict the scope of their questions to either bottom-up [37] or top-down [38] comprehension tasks. In our sensemaking task, in contrast, participants receive not only the code snippet and the question but also the file header with some contextual information, creating an unusual blend of bottom-up and top-down comprehension rarely seen in such studies. This decision is motivated by our
goal of stimulating code exploration, where the participants
have to integrate different pieces of information at different
locations and create an integrated mental model.
Neural Model’s Task. We feed the entire source file of
a single task as input, also referred to as the prompt, to the
generative model and query it for three different answers in
the form of text completion. A model processes the input
file $p$ by splitting it into tokens via a deterministic tokenizer ($p = t_1, \dots, t_n$) and then generates a sequence of tokens as
output, as shown on the left of Figure 2. We allow the models
to generate an answer of at most 100 tokens, which is more than enough to answer all the questions. We use
three widely used open-source pre-trained models, namely CodeGen [2] in its language-agnostic variant (CodeGen-16B-multi, available from https://github.com/salesforce/codegen), InCoder [20], and GPT-J [21], all in their largest variants of 16B, 6B, and 6B parameters, respectively. To query the model multiple times
we use the temperature sampling strategy with a temperature
of 0.2.
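As an illustration, the following sketch shows how such a query could be issued. It is a minimal reconstruction using the HuggingFace transformers API, not the exact code used in the study; the smaller codegen-350M-multi checkpoint and the task file name triangle_Q3.py are stand-ins.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in for the CodeGen-16B-multi checkpoint used in the study.
model_name = "Salesforce/codegen-350M-multi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The whole task file (context, snippet, question, "# Answer:") is the prompt.
prompt = open("triangle_Q3.py").read()  # hypothetical task file name
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Query the model three times with temperature sampling (temperature 0.2),
# allowing at most 100 newly generated tokens per answer.
for _ in range(3):
    output = model.generate(
        input_ids,
        do_sample=True,
        temperature=0.2,
        max_new_tokens=100,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(output[0, input_ids.shape[1]:]))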
Developers’ Task. We recruit 25 software developers via di-
rect contacts at a large software company, ranging from interns
to more senior software engineers, covering diverse degrees of familiarity with software development and programming.
We track the eye gaze of each participant during a 19-minute
session (on average) while they answer as many questions as
possible, typically three or four. We ensure they see each main
code snippet only once to avoid bias in answering a question
on a snippet they have already explored in a previous task.
The eye-tracking setup is calibrated at the beginning of each
task to ensure consistent data collection.

Snippet Name    | Content                     | LoC     | Question
nqueens_Q1      | N queens problem            | 78-100  | What does `solveNQ(-13)` return?
nqueens_Q2      | N queens problem            | 78-101  | What are valid dimensions and values for the array `board`?
nqueens_Q3      | N queens problem            | 78-100  | How would you expect the run time of `solveNQ(n)` to scale with `n`?
hannoi_Q1       | Tower of Hanoi problem      | 28-49   | How does the algorithm move disks from the starting rod to the ending rod?
hannoi_Q2       | Tower of Hanoi problem      | 26-47   | Which is the base case of the algorithm?
hannoi_Q3       | Tower of Hanoi problem      | 28-50   | Which is the name of the auxiliary rod in the call TowerOfHanoi(n, 'Mark', 'Mat', 'Luke')?
multithread_Q1  | Consumer-producer threads   | 106-116 | Is it possible that consumer and producer threads end up in a deadlock state; namely, they both wait for each other to finish, but none of them is doing anything?
multithread_Q2  | Consumer-producer threads   | 104-112 | Is there any line of code in the consumer or producer code that will never be executed? If yes, report it below.
multithread_Q3  | Consumer-producer threads   | 104-113 | Will the queue object ever raise an exception in this program? If yes, which condition(s) should be met for the exception to be raised?
tree_Q1         | Recursive tree construction | 87-99   | How many calls to `constructTreeUtil` will `constructTree([1, 2, 3], [1, 2, 3], 2)` make?
tree_Q2         | Recursive tree construction | 87-99   | Under which conditions could the check `if i <= h` in `constructTreeUtil` be false?
tree_Q3         | Recursive tree construction | 89-101  | A part of the code you don't have direct access to has called `constructTree` with unknown parameters. What can you find out about those parameters?
triangle_Q1     | Triangle classification     | 66-112  | Which of the functions have side effects (namely, it modifies some state variable value outside its local environment)?
triangle_Q2     | Triangle classification     | 66-113  | Which output will you get for the three points [1, 2], [1, 3], and [1, 4]?
triangle_Q3     | Triangle classification     | 66-112  | What could happen if the call to `order()` were omitted from `classifyTriangle`?

TABLE I: Code snippets and related questions for each sensemaking task.
IV. PROBLEM FORMULATION
The majority of modern large language models (LLMs)
are based on the architecture of generative pre-trained trans-
formers (GPT) [39], such as Codex [1], CodeGen [2], and
AlphaCode [3]. Self-attention is a mechanism used in these models that allows each processed token to weigh the importance of the other tokens in the same sequence,
enabling the model to capture relationships and dependencies
within the sequence. In particular, the representation of each
token can incorporate information from tokens that come earlier in the sequence, but not from tokens that come later in the sequence. In
this work, when a token $A$ incorporates information from another token $B$, we say that $A$ attends to $B$, or equivalently that token $A$ pays attention to token $B$. This attention is usually
quantified by a scalar value, called attention weight, which is
computed by the model in its attention mechanism.
When the model takes as input a sequence of $x$ tokens, the
attention mechanism is applied to each token in the sequence.
Figure 2 on the left shows a toy example with a model of
three layers and two attention heads, together with the attention
generated by the model. For each token, the attention is computed sequentially through the $L$ layers of the neural model and, at each layer, the attention is computed in parallel $H$ times, once for each sub-network called an attention head. Fixing a combination of layer and head, the attention given by the $i$-th token to the other tokens of the sequence can be represented by a vector of weights $a_i = (a_{i,1}, a_{i,2}, \dots, a_{i,i}, 0, \dots, 0)$, where $a_{i,j}$ is the weight given by the token at position $i$ to the token at position $j$. Note that a token cannot attend to any token that comes later in the sequence; thus the weights $a_{i,j}$ are zero for $j > i$. Stacking the attention vectors one after the other as rows, we obtain an attention matrix $A = (a_1, a_2, \dots, a_x)$ for the specific combination of layer and attention head; note that it is a lower triangular matrix.
Thus, when the input file comprising $n$ tokens $(t_1, \dots, t_n)$ is fed to the model $f$, besides a predicted answer of $m$ newly generated tokens $(t_{n+1}, \dots, t_{n+m})$, the model also computes an attention tensor $A$ of shape $(L, H, n+m, n+m)$, where $L$ is the number of layers and $H$ is the number of attention heads.
In particular, when comparing developers’ and the model’s
attention, we focus on studying the attention weights referring
to the prompt tokens only, even if some post-processing approaches may use the entire tensor $A$.
Note that, by construction, not all tokens can attend to all other tokens; thus we define the notion of the followers of a token $t_i$ as the set of tokens that can pay attention to $t_i$. This set is defined as $F(t_i) = \{t_j \mid j > i\}$, where the subscript represents the position of the token in the sequence.
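To make this formulation concrete, the sketch below extracts the attention tensor from a single forward pass and checks the lower triangular, row-stochastic structure described above. It assumes the HuggingFace transformers API and a small CodeGen checkpoint as a stand-in; it is not part of the paper's artifact.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-350M-multi"  # stand-in for the 16B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

input_ids = tokenizer("def square(x):\n    return x * x",
                      return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_ids, output_attentions=True)

# out.attentions is a tuple with one tensor per layer, each of shape
# (batch, H, n, n) for an n-token prompt; stack them into (L, H, n, n).
A = torch.stack(out.attentions).squeeze(1)
L, H, n, _ = A.shape

# Masked self-attention: a_{i,j} = 0 for j > i, so each slice is lower triangular...
assert torch.allclose(A, torch.tril(A))
# ...and each row a_i is a probability distribution over the positions j <= i.
assert torch.allclose(A.sum(dim=-1), torch.ones(L, H, n), atol=1e-4)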
A. Views of Attention
In our problem formulation, we model an extraction function $g$ that takes as input the attention tensor $A$ and returns either a measure of how much attention the model pays to each part of the prompt or a measure of how much each part is linked to other parts of the prompt. Depending on the case, we refer to the output as a visual attention vector or an interaction matrix, respectively.
Visual Attention Vector. It is a static view telling us which
part of the input is important for the model when solving the
sensemaking task. We define the visual attention of a model as a vector $a = (a_1, \dots, a_c)$ over the $c$ characters of the prompt, where each $a_i$ intuitively tells us how much attention was given to the $i$-th character when solving the task. We use $g_{viz}(A)$ to model a function that takes as input the attention tensor $A$ and returns a visual attention vector $a$.
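One plausible instantiation of $g_{viz}$ is sketched below: it averages the attention each prompt token receives over layers, heads, and source tokens, and then spreads each token's weight uniformly over its characters. This is only an assumed example; the paper evaluates several post-processing variants that are not detailed here.

import torch

def g_viz(A, offsets, num_chars):
    # A: attention tensor of shape (L, H, n, n) over the n prompt tokens.
    # offsets: per-token (start, end) character spans, e.g. obtained from a
    # HuggingFace tokenizer called with return_offsets_mapping=True.
    received = A.mean(dim=(0, 1)).sum(dim=0)  # attention received per token
    a = torch.zeros(num_chars)
    for tok, (start, end) in enumerate(offsets):
        if end > start:
            # Spread the token's weight uniformly over its characters.
            a[start:end] += received[tok] / (end - start)
    return a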
Interaction Matrix. It is a dynamic view that tells us, given
a position in the prompt, which other position of the prompt is
more deeply connected to it. We define an interaction matrix
$S$ as a right stochastic matrix of size $n \times p$, where $n$ is the number of tokens in the prompt and $p$ is the number of admissible target positions in the prompt. We distinguish two kinds of interaction matrices depending on the granularity of the target position, either pointing to another token or to a line in the source code (the latter being of interest primarily for the comparison with the line-level code exploration of developers).
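As an illustration of the line-level case, the sketch below aggregates a single layer/head attention matrix into a right stochastic token-to-line interaction matrix. This is one plausible construction under the definitions above, not one of the specific post-processing approaches evaluated in the study.

import torch

def token_to_line_matrix(A_head, token_line, num_lines):
    # A_head: token-to-token attention matrix of shape (n, n).
    # token_line: for each of the n prompt tokens, the index of the
    # source-code line it belongs to.
    n = A_head.shape[0]
    S = torch.zeros(n, num_lines)
    for j in range(n):
        # Credit the attention every token gives to token j to j's source line.
        S[:, token_line[j]] += A_head[:, j]
    # Normalize rows so that S is right stochastic (each row sums to one).
    return S / S.sum(dim=1, keepdim=True).clamp_min(1e-12)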