Follow-up Attention: An Empirical Study of
Developer and Neural Model Code Exploration
Matteo Paltenghi, Rahul Pandita, Austin Z. Henley, Albert Ziegler
Abstract—Recent neural models of code, such as OpenAI
Codex and AlphaCode, have demonstrated remarkable pro-
ficiency at code generation due to the underlying attention
mechanism. However, it often remains unclear how the models
actually process code, and to what extent their reasoning and
the way their attention mechanism scans the code match the patterns of developers. A poor understanding of the models' reasoning process limits the way current neural models are leveraged today, so far mostly for their raw predictions. To
fill this gap, this work studies how the processed attention signal
of three open large language models (CodeGen, InCoder, and GPT-J) agrees with how developers look at and explore code
when each answers the same sensemaking questions about code.
Furthermore, we contribute an open-source eye-tracking dataset
comprising 92 manually-labeled sessions from 25 developers
engaged in sensemaking tasks. We empirically evaluate five attention-free heuristics and ten post-processing approaches of CodeGen's attention signal against our ground truth of developers exploring code, including the novel concept of follow-up attention, which exhibits the highest agreement between model and human attention. Our follow-up
attention method can predict the next line a developer will look
at with 47% accuracy. This outperforms a baseline that uses the session history of other developers to recommend the next line, which achieves 42.3% accuracy. These results demonstrate
the potential of leveraging the attention signal of pre-trained
models for effective code exploration.

Matteo Paltenghi is with the University of Stuttgart, Stuttgart, Germany. E-mail: mattepalte@live.it. Work done while at GitHub Next for a research internship. Rahul Pandita and Albert Ziegler are with GitHub Inc., San Francisco, CA, USA. E-mail: {rahulpandita, wunderalbert}@github.com. Austin Z. Henley is with Microsoft Research, Redmond, WA, USA. E-mail: azh321@gmail.com.
I. INTRODUCTION
Large language models (LLMs) pre-trained on code such
as Codex [1], CodeGen [2], and AlphaCode [3] have demon-
strated remarkable proficiency at program synthesis and com-
petitive programming tasks. Yet our understanding of why they
produce a particular solution is limited. In large-scale practical
applications, the models are often used for their prediction
alone, i.e., as generative models, and the way they reason about
code internally largely remains untapped.
These models are often based on the attention mechanism
[4], a key component of the transformer architecture [5].
Besides providing substantial performance benefits, attention
weights have been used to provide interpretability of neural
models [6, 7, 8]. Additionally, existing work [9, 10, 11, 12]
also suggests that the attention mechanism reflects or encodes
objective properties of the source code processed by the model.
We argue that just as software developers consider different
locations in the code individually and follow meaningful
connections between them, the self-attention of transformers
connects and creates information flow between similar and
linked code locations. This raises a question:
Are human attention and model attention comparable? And
if so, can the knowledge about source code conveyed by the
attention weights of neural models be leveraged to support
code exploration?
Although there are other observable signals that might cap-
ture the concept of relevance, such as gradient-based methods [13, 14]
or layer-wise relevance propagation [15], this work focuses on
approaches using only the attention signal. There are two reasons for this choice: (1) almost all state-of-the-art models of
code are based on the transformer block [5], and the attention
mechanism is ultimately its fundamental component, so we
expect the corresponding attention weights to carry directly
meaningful information about the models’ decision process;
(2) attention weights can be extracted almost for free during
the generation with little runtime overhead since the attention
is computed automatically during a single forward pass.
Answering the main question of this study requires a dataset
tracking developers’ attention. In this work, we use visual
attention as a proxy for the elements to which developers
are paying mental attention while looking at code. However,
the existing datasets of visual attention are not suitable for
our purposes. Indeed, they either put developers in an unnatural environment where most of the view is blurred [8], requiring participants to move the mouse over tokens to reveal them, an interaction that can bias how developers naturally explore and understand code, or they contain few and very specific code comprehension tasks [16] on code snippets too short to exhibit any interesting code navigation pattern.
and stimulate developers to not only glance at code, but
also to deeply reason about it, we prepare an ad-hoc code
understanding assignment called the sensemaking task. This
involves questions on code, including mental code execution,
side-effect detection, algorithmic complexity, and deadlock
detection. Moreover, using eye-tracking, we collect and share
a dataset of 92 valid sessions with developers.
On the neural model side, motivated by some recent successful applications of few-shot learning in code generation and code summarization [17, 18], and even zero-shot learning in program repair [19], the sensemaking task is designed to be a zero-shot task for the model, with a specific prompt that triggers it to reason about the question at hand. Then we
query three LLMs of code, namely CodeGen [2], InCoder [20]
and GPT-J [21] on the same sensemaking task and compare
their attention signal, i.e., the attention weights produced during a forward pass by the transformer blocks, to the attention of developers. The
correlation with CodeGen, the largest model, is the highest
among the LLMs studied (r=+0.23), motivating the use of raw
and processed versions of CodeGen’s attention signal for code
exploration. To that end, we experimentally evaluate how well
existing and novel attention post-processing methods align
with the code exploration patterns derived from our dataset’s
chronological sequence of eye-fixation events. To the best of
our knowledge, this work is the first to investigate the attention
signal of these pre-trained models to support code exploration,
a specific code-related task, directly related to code reading
work [22, 23].
We empirically demonstrate that post-processing methods
based on the attention signal can be well aligned with the
way developers explore code. In particular, using the novel
concept of follow-up attention, we achieve the highest overlap
with the developers’ ground truth on which line to explore
next.
Contributions: This paper makes the following contributions:
• Sensemaking Task: A novel task and setup to deepen our understanding of how LLM attention connects to the temporal sequence of shifts in developer focus.
• Eye-Tracking Dataset: A novel dataset of 92 eye-tracking sessions of 25 developers engaged in sensemaking tasks while using a common code editor, with code written in three popular programming languages (Python, C++, and C#).
• Follow-up Attention: The analytical formula for follow-up attention, a novel post-processing approach derived solely from the attention signal, which aligns well with developers' choices of which line to look at next when exploring code.
• Empirical Study: The first comparison of both the effectiveness and the visual attention of LLMs and developers when reasoning on sensemaking questions, together with an empirical evaluation comprising ten post-processing approaches of the attention signal, five heuristics, and an ablation study of follow-up attention against the collected ground truth of developers exploring code.
II. RELATED WORK
This section provides an overview of related work around
the explanatory role of attention and previous studies of the
attention of neural models and developers when reasoning on
code.
Attention as explanation. Preliminary work [24] studying the attention weights of recurrent neural models has
found that the attention weights do not always agree with
other explanation methods and that alternative weights can
be adversarially constructed while still preserving the same
model prediction. However, in response, Wiegreffe and Pinter [25] have shown that such alternative attention weights can only be constructed for a single instance prediction at a time, whereas obtaining a model that is consistently wrong in its explanations is very unlikely. Along the same lines, Tutek and Šnajder [26] have proposed four regularization
methods to mitigate the adversarial exploitation of attention
weights for recurrent models, including the use of residual
connections which are natively embedded into transform-
ers [5], the building blocks of the LLMs studied in this work.
To further corroborate this connection between attention and
explanation, Rabin et al. [27] have shown how even Sivand,
an explainability technique based on program simplification,
pinpoints important tokens that largely overlap with those
reported by the attention mechanism.
Attention studies of neural models of code. Paltenghi
and Pradel [8] have compared the attention weights of neural
models of code and developers’ visual attention when perform-
ing a code summarization task, and found a strong positive
correlation on the copy attention mechanism for an instance of
a pointer network [28]. Further works [9, 11] have then shown
how the attention weights of pre-trained models on source
code capture important properties of the abstract syntax tree
of the program. However, none of them considered the use
of the attention signal for a code-related task, such as code
exploration. Moreover, they are limited to relatively small self-
attention transformer models, whereas we study the attention
of CodeGen [2], InCoder [20] and GPT-J [21], large generative
models with masked self-attention.
Eye-tracking studies. Turner et al. [29] conducted an eye-tracking study involving 38 students fixing or describing five simple Python and C++ programs (5-13 LoC), showing that the
fixation duration is comparable between the two languages.
Beelders [30] has qualitatively observed the eye movement
of 36 students and four lecturers when reading and mentally
executing a short C# program (12 LoC). An eye-tracking
dataset with 216 participants has been collected by [16]; however, it only considers two short code snippets (11-22 LoC), since the setup does not support scrolling. Similarly, Blascheck
and Sharif [22] and Busjahn et al. [23] have studied the reading
order in C++ and Java code comprehension tasks, focusing
on six small programs that could fit into a single screen,
whereas we consider longer snippets and a much larger dataset
of 45 unique tasks. Sharifi et al. [31] have recently studied code navigation strategies on Java code with eye tracking, involving 36 participants and focusing on the bug-fixing process; in contrast, we study the sensemaking task, which might elicit a different kind of reasoning than bug fixing. To more
closely mimic real-world setups in integrated development
environments (IDEs), Guarnera et al. [32] propose iTrace,
an eye-tracking plugin for IDEs that can track developers’
eye movements in more realistic and dynamic coding envi-
ronments beyond a single screen of code. Fakhoury et al. [33] have further proposed Gazel, an IDE plugin that supports eye tracking in the context of source code
editing. Following this latest trend, we also use an IDE plugin
to collect the eye-tracking data, allowing for a more realistic
coding environment.
III. SENSEMAKING TASK
To study developers’ and models’ attention, we prepare a
code understanding task called the sensemaking task because the
developer has to “make sense” of code to answer the question
correctly. One sensemaking task is contained in a single source
code file $p$, composed of four sections: (1) a brief description
of the context of the main code snippet (e.g., The following
code reasons about triangles in the geometrical sense.), (2) the
main code snippet, either sourced from the internet or written
from scratch by the authors, (3) a sensemaking question to
stimulate the reasoning (i.e., Question:), and (4) a final
prompt to trigger the model’s answer (i.e., Answer:). Note
that all the sections except the main snippet are in the form
of code comments. Figure 1 shows an example task, whereas the full list of questions can be seen in Table I.

#************************************************
# The following code reasons about triangles in the geometrical sense.
class point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
def square(x):
    return x*x
def order(a, b, c):
    copy = [a, b, c]
    copy.sort()
    return copy[0], copy[1], copy[2]
...
p1 = point(0, 0)
p2 = point(1, 1)
p3 = point(1, 2)
classifyTriangle(p1, p2, p3)
# Question: What could happen if the call to `order()` were omitted from
# `classifyTriangle`?
# Answer:

Fig. 1: Example of a sensemaking task, with code and the question to be answered in the bottom comment. Completely empty lines have been removed for space reasons.
To source the tasks for our study, we rely on GeeksforGeeks (https://www.geeksforgeeks.org/), a well-known website for programming education
and practice. This website offers a variety of problem state-
ments that are commonly used in typical technical interviews
by modern software companies, as shown by previous re-
search [34]. Therefore, we expect software developers to have some familiarity with these types of programs.
We then create specific sensemaking questions about these
programs, inspired by the kind of questions that an interviewer
might pose, such as asking about the output, complexity, cor-
rectness, or code modification. Indeed, many of our questions
are concrete instances of question templates such as “What
is the purpose of the code?” (nqueens_Q1), “What is the
program supposed to do?” (tree_Q3) or “What code could
have caused this behavior?” (triangle_Q1), which have also been identified as questions that software engineers often ask
themselves in a real working setting [35]. To stimulate code
exploration, many of them are also instances of reachability
questions [36]; namely, they involve the search over all feasible
paths of a program to locate target statements matching
search criteria. Some examples of these are “What are the
implications of this change?” (triangle_Q3) or “How does
application behavior vary in these different situations that
might occur?” (triangle_Q2, tree_Q1, multithread_Q3).
We prepare five main snippets and create three unique ques-
tions for each of them. Then we translate the same task into
three programming languages: Python, C++, and C#. In total,
we have 45 unique tasks. Although the sensemaking task
includes questions that might have also been asked in studies
focused on code comprehension [37], the main difference is that those studies typically restrict the scope of their questions to either bottom-up [37] or top-down [38] comprehension tasks. In our sensemaking task, in contrast, participants receive not only the code snippet and the question but also the file header with some contextual information, creating an unusual blend of bottom-up and top-down comprehension rarely seen in such studies. This decision is motivated by our
goal of stimulating code exploration, where the participants
have to integrate different pieces of information at different
locations and create an integrated mental model.
Neural Model’s Task. We feed the entire source file of
a single task as input, also referred to as the prompt, to the
generative model and query it for three different answers in
the form of text completion. A model processes the input
file $p$ by splitting it into tokens via a deterministic tokenizer ($p = t_1, \dots, t_n$) and then generates a sequence of tokens as
output, as shown on the left of Figure 2. We allow the models
to generate an answer of at most 100 tokens, which is more than enough to answer all the questions. We use
three widely used open-source pre-trained models, namely CodeGen [2] in its language-agnostic variant (CodeGen-16B-multi, available from https://github.com/salesforce/codegen), InCoder [20], and GPT-J [21], all in their largest variants of 16B, 6B, and 6B parameters, respectively. To query the model multiple times
we use the temperature sampling strategy with a temperature
of 0.2.
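As an illustration, the following sketch shows how such a query could be issued. It is a minimal reconstruction using the HuggingFace transformers API, not the exact code used in the study; the smaller codegen-350M-multi checkpoint and the task file name triangle_Q3.py are stand-ins.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in for the CodeGen-16B-multi checkpoint used in the study.
model_name = "Salesforce/codegen-350M-multi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The whole task file (context, snippet, question, "# Answer:") is the prompt.
prompt = open("triangle_Q3.py").read()  # hypothetical task file name
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Query the model three times with temperature sampling (temperature 0.2),
# allowing at most 100 newly generated tokens per answer.
for _ in range(3):
    output = model.generate(
        input_ids,
        do_sample=True,
        temperature=0.2,
        max_new_tokens=100,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(output[0, input_ids.shape[1]:]))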
Developers’ Task. We recruit 25 software developers via di-
rect contacts at a large software company, ranging from interns
to more senior software engineers, covering diverse degrees of familiarity with software development and programming.
We track the eye gaze of each participant during a 19-minute
session (on average) while they answer as many questions as
possible, typically three or four. We ensure they see each main
code snippet only once to avoid bias in answering a question
on a snippet they have already explored in a previous task.
The eye-tracking setup is calibrated at the beginning of each
task to ensure consistent data collection.

Snippet Name    | Content                     | LoC     | Question
nqueens_Q1      | N queens problem            | 78-100  | What does `solveNQ(-13)` return?
nqueens_Q2      | N queens problem            | 78-101  | What are valid dimensions and values for the array `board`?
nqueens_Q3      | N queens problem            | 78-100  | How would you expect the run time of `solveNQ(n)` to scale with `n`?
hannoi_Q1       | Tower of Hanoi problem      | 28-49   | How does the algorithm move disks from the starting rod to the ending rod?
hannoi_Q2       | Tower of Hanoi problem      | 26-47   | Which is the base case of the algorithm?
hannoi_Q3       | Tower of Hanoi problem      | 28-50   | Which is the name of the auxiliary rod in the call TowerOfHanoi(n, 'Mark', 'Mat', 'Luke')?
multithread_Q1  | Consumer-producer threads   | 106-116 | Is it possible that consumer and producer threads end up in a deadlock state; namely, they both wait for each other to finish, but none of them is doing anything?
multithread_Q2  | Consumer-producer threads   | 104-112 | Is there any line of code in the consumer or producer code that will never be executed? If yes, report it below.
multithread_Q3  | Consumer-producer threads   | 104-113 | Will the queue object ever raise an exception in this program? If yes, which condition(s) should be met for the exception to be raised?
tree_Q1         | Recursive tree construction | 87-99   | How many calls to `constructTreeUtil` will `constructTree([1, 2, 3], [1, 2, 3], 2)` make?
tree_Q2         | Recursive tree construction | 87-99   | Under which conditions could the check `if i <= h` in `constructTreeUtil` be false?
tree_Q3         | Recursive tree construction | 89-101  | A part of the code you don't have direct access to has called `constructTree` with unknown parameters. What can you find out about those parameters?
triangle_Q1     | Triangle classification     | 66-112  | Which of the functions have side effects (namely, it modifies some state variable value outside its local environment)?
triangle_Q2     | Triangle classification     | 66-113  | Which output will you get for the three points [1, 2], [1, 3], and [1, 4]?
triangle_Q3     | Triangle classification     | 66-112  | What could happen if the call to `order()` were omitted from `classifyTriangle`?

TABLE I: Code snippets and related questions for each sensemaking task.
IV. PROBLEM FORMULATION
The majority of modern large language models (LLMs)
are based on the architecture of generative pre-trained trans-
formers (GPT) [39], such as Codex [1], CodeGen [2], and
AlphaCode [3]. Self-attention is a mechanism used in these models that allows each processed token to weigh the importance of the other tokens in the same sequence,
enabling the model to capture relationships and dependencies
within the sequence. In particular, the representation of each
token can incorporate information from tokens that come earlier in the sequence, but not from tokens that come later in the sequence. In
this work, when a token $A$ incorporates information from another token $B$, we say that $A$ attends to $B$, or equivalently that token $A$ pays attention to token $B$. This attention is usually
quantified by a scalar value, called attention weight, which is
computed by the model in its attention mechanism.
When the model takes as input a sequence of $x$ tokens, the
attention mechanism is applied to each token in the sequence.
Figure 2 on the left shows a toy example with a model of
three layers and two attention heads, together with the attention
generated by the model. For each token, the attention is computed sequentially through the $L$ layers of the neural model and, at each layer, the attention is computed in parallel $H$ times, once for each sub-network called an attention head. Fixing a combination of layer and head, the attention given by the $i$-th token to the other tokens of the sequence can be represented by a vector of weights $a_i = (a_{i,1}, a_{i,2}, \dots, a_{i,i}, 0, \dots, 0)$, where $a_{i,j}$ is the weight given by the token at position $i$ to the token at position $j$. Note that a token cannot attend to any token that comes later in the sequence; thus the weights $a_{i,j}$ are zero for $j > i$. Stacking the attention vectors one after the other as rows, we obtain an attention matrix $A = (a_1, a_2, \dots, a_x)$ for the specific combination of layer and attention head; note that it is a lower triangular matrix.
Thus, when the input file comprising $n$ tokens $(t_1, \dots, t_n)$ is fed to the model $f$, besides a predicted answer of $m$ newly generated tokens $(t_{n+1}, \dots, t_{n+m})$, the model also computes an attention tensor $A$ of shape $(L, H, n+m, n+m)$, where $L$ is the number of layers and $H$ is the number of attention heads.
In particular, when comparing developers’ and the model’s
attention, we focus on studying the attention weights referring
to the prompt tokens only, even if some post-processing approaches may use the entire tensor $A$.
Note that, by construction, not all tokens can attend to all other tokens; thus we define the notion of the followers of a token $t_i$ as the set of tokens that can pay attention to $t_i$. This set is defined as $F(t_i) = \{t_j \mid j > i\}$, where the subscript represents the position of the token in the sequence.
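To make this formulation concrete, the sketch below extracts the attention tensor from a single forward pass and checks the lower triangular, row-stochastic structure described above. It assumes the HuggingFace transformers API and a small CodeGen checkpoint as a stand-in; it is not part of the paper's artifact.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-350M-multi"  # stand-in for the 16B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

input_ids = tokenizer("def square(x):\n    return x * x",
                      return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_ids, output_attentions=True)

# out.attentions is a tuple with one tensor per layer, each of shape
# (batch, H, n, n) for an n-token prompt; stack them into (L, H, n, n).
A = torch.stack(out.attentions).squeeze(1)
L, H, n, _ = A.shape

# Masked self-attention: a_{i,j} = 0 for j > i, so each slice is lower triangular...
assert torch.allclose(A, torch.tril(A))
# ...and each row a_i is a probability distribution over the positions j <= i.
assert torch.allclose(A.sum(dim=-1), torch.ones(L, H, n), atol=1e-4)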
A. Views of Attention
In our problem formulation, we model an extraction function $g$ that takes as input the attention tensor $A$ and returns either a measure of how much attention the model pays to each part of the prompt or a measure of how much each part is linked to other parts of the prompt. Depending on the case, we refer to the output as a visual attention vector or an interaction matrix, respectively.
Visual Attention Vector. It is a static view telling us which
part of the input is important for the model when solving the
sensemaking task. We define the visual attention of a model as a vector $a = (a_1, \dots, a_c)$ over the $c$ characters of the prompt, where each $a_i$ intuitively tells us how much attention was given to the $i$-th character when solving the task. We use $g_{viz}(A)$ to model a function that takes as input the attention tensor $A$ and returns a visual attention vector $a$.
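One plausible instantiation of $g_{viz}$ is sketched below: it averages the attention each prompt token receives over layers, heads, and source tokens, and then spreads each token's weight uniformly over its characters. This is only an assumed example; the paper evaluates several post-processing variants that are not detailed here.

import torch

def g_viz(A, offsets, num_chars):
    # A: attention tensor of shape (L, H, n, n) over the n prompt tokens.
    # offsets: per-token (start, end) character spans, e.g. obtained from a
    # HuggingFace tokenizer called with return_offsets_mapping=True.
    received = A.mean(dim=(0, 1)).sum(dim=0)  # attention received per token
    a = torch.zeros(num_chars)
    for tok, (start, end) in enumerate(offsets):
        if end > start:
            # Spread the token's weight uniformly over its characters.
            a[start:end] += received[tok] / (end - start)
    return a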
Interaction Matrix. It is a dynamic view that tells us, given
a position in the prompt, which other position of the prompt is
more deeply connected to it. We define an interaction matrix
$S$ as a right stochastic matrix of size $n \times p$, where $n$ is the number of tokens in the prompt and $p$ is the number of admissible target positions in the prompt. We distinguish two kinds of interaction matrices depending on the granularity of the target position, either pointing to another token or to a line in the source code (the latter being of interest primarily for the comparison with the line-level code exploration of developers).
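As an illustration of the line-level case, the sketch below aggregates a single layer/head attention matrix into a right stochastic token-to-line interaction matrix. This is one plausible construction under the definitions above, not one of the specific post-processing approaches evaluated in the study.

import torch

def token_to_line_matrix(A_head, token_line, num_lines):
    # A_head: token-to-token attention matrix of shape (n, n).
    # token_line: for each of the n prompt tokens, the index of the
    # source-code line it belongs to.
    n = A_head.shape[0]
    S = torch.zeros(n, num_lines)
    for j in range(n):
        # Credit the attention every token gives to token j to j's source line.
        S[:, token_line[j]] += A_head[:, j]
    # Normalize rows so that S is right stochastic (each row sums to one).
    return S / S.sum(dim=1, keepdim=True).clamp_min(1e-12)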