Reading Between the Lines: Modeling User Behavior and
Costs in AI-Assisted Programming
Hussein Mozannar1, Gagan Bansal2, Adam Fourney2, and Eric Horvitz2
1Massachusetts Institute of Technology, Cambridge, USA
2Microsoft Research, Redmond, USA
Abstract
Code-recommendation systems, such as Copilot and CodeWhisperer, have the potential to
improve programmer productivity by suggesting and auto-completing code. However, to fully
realize their potential, we must understand how programmers interact with these systems and
identify ways to improve that interaction. To seek insights about human-AI collaboration with
code-recommendation systems, we studied GitHub Copilot, a code-recommendation system used
by millions of programmers daily. We developed CUPS, a taxonomy of common programmer
activities when interacting with Copilot. Our study of 21 programmers, who completed coding
tasks and retrospectively labeled their sessions with CUPS, showed that CUPS can help us
understand how programmers interact with code-recommendation systems, revealing inefficiencies
and time costs. Our insights reveal how programmers interact with Copilot and motivate new
interface designs and metrics.
[Figure 1 content: (a) a screenshot of Copilot operating inside Visual Studio Code on a LogisticRegression example, annotated with telemetry events such as prompt, suggestion shown, accepted, and rejected; (b) the CUPS taxonomy with the share of session time spent in each state: Thinking/Verifying Suggestion 22.4%, Writing New Functionality 14.05%, Editing Last Suggestion 11.90%, Prompt crafting 11.56%, Debugging/Testing Code 11.31%, Thinking about New Code to Write 10.91%, Looking up Documentation 7.45%, Editing Written Code 4.28%, Waiting For Suggestion 4.20%, Deferring thought for later 1.39%, Writing Documentation 0.53%, Not Thinking 0.01%; (c) a timeline of the session's state transitions.]
Figure 1: Profiling a coding session with the CodeRec User Programming States (CUPS). In (a) we show the operating mode of CodeRec inside Visual Studio Code. In (b) we show the CUPS taxonomy used to describe CodeRec-related programmer activities. A coding session can be summarized as a timeline in (c), where the programmer transitions between states.
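For readability, here is a cleaned-up, runnable version of the code visible in the Figure 1(a) screenshot. The sigmoid helper, the explicit learning rate, and the predict method are our completions of what the screenshot truncates or omits, not part of the original figure.

```python
import numpy as np

class LogisticRegression:
    def __init__(self):
        self.w = None
        self.b = None

    def sigmoid(self, z):
        # logistic function (assumed helper; not shown in the screenshot)
        return 1.0 / (1.0 + np.exp(-z))

    # implement the fit method
    def fit(self, X, y, lr=0.1, n_iters=100):
        # initialize the parameters
        self.w = np.zeros(X.shape[1])
        self.b = 0.0
        for _ in range(n_iters):
            # predicted probabilities under the current parameters
            p = self.sigmoid(np.dot(X, self.w) + self.b)
            # calculate the gradient of the log-loss
            dw = (1 / X.shape[0]) * np.dot(X.T, p - y)
            db = (1 / X.shape[0]) * np.sum(p - y)
            # update the parameters (learning rate added; the screenshot omits it)
            self.w = self.w - lr * dw
            self.b = self.b - lr * db

    # implement the predict method (suggested but not yet accepted in the screenshot)
    def predict(self, X):
        return (self.sigmoid(np.dot(X, self.w) + self.b) >= 0.5).astype(int)
```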
1 Introduction
Programming-assistance systems that adapt large language models (LLMs) to code recommendation have recently been introduced to the public. Popular systems, including Copilot GitHub [2022], CodeWhisperer Amazon [2022], and AlphaCode Li et al. [2022], signal a potential shift in how software is developed. Though there are differences in specific interaction mechanisms, these programming-assistance systems generally extend existing IDE code-completion mechanisms (e.g., IntelliSense, https://code.visualstudio.com/docs/editor/intellisense) by producing suggestions using neural models trained on billions of lines of code
Chen et al. [2021]. The LLM-based completion models can suggest anything from sentence-level completions to entire functions and classes in a wide array of programming languages. These large neural models
are deployed with the goal of accelerating the efforts of software engineers, reducing their workloads,
and improving their productivity.
Early assessments suggest that programmers do feel more productive when assisted by the code
recommendation models Ziegler et al. [2022] and that they prefer these systems to earlier code
completion engines Vaithilingam et al. [2022]. In fact, a recent study from GitHub found that Copilot could potentially reduce task completion time by a factor of two Peng et al. [2023]. While these studies help us understand the benefits of code-recommendation systems, they do not let us identify avenues for improvement or understand the nature of interaction with these systems.
In particular, the neural models introduce new tasks into a developer’s workflow, such as writing AI
prompts Jiang et al. [2022] and verifying AI suggestions Vaithilingam et al. [2022], which can be
lengthy. Existing interaction metrics, such as suggestion acceptance rates, time to accept (i.e., the
time a suggestion remains onscreen), and reduction of tokens typed, tell only part of this interaction
story. For example, when suggestions are presented in monochrome popups (Figure 1), programmers
may choose to accept them into their codebases so that they can be read with code highlighting
enabled. Likewise, when models suggest only one line of code at a time, programmers may accept
sequences before evaluating them together as a unit. In both scenarios, considerable work verifying
and editing suggestions occurs after the programmer has accepted the recommended code. Prior
interaction metrics also largely miss user effort invested in devising and refining prompts used to
query the models. When code completion tools are evaluated using coarser task-level metrics such
as task completion time Kalliamvakou [2022], we begin to see signals of the benefits of AI-driven
code completion but lack sufficient detail to understand the nature of these gains, as well as possible
remaining inefficiencies. We argue that an ideal approach would be sufficiently low level to support
interaction profiling while sufficiently high level to capture meaningful programmer activities.
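To make this limitation concrete, the sketch below computes two of these existing metrics, acceptance rate and time to accept, from a hypothetical telemetry record. The event schema and field names are our own illustration, not Copilot's actual telemetry format, and neither metric captures verification or editing that happens after a suggestion is accepted.

```python
from dataclasses import dataclass

@dataclass
class SuggestionEvent:
    # hypothetical telemetry record; field names are illustrative only
    shown_at: float      # seconds since session start when the suggestion appeared
    resolved_at: float   # when the suggestion was accepted or rejected
    accepted: bool

def acceptance_rate(events):
    # fraction of shown suggestions that were accepted
    return sum(e.accepted for e in events) / len(events)

def mean_time_to_accept(events):
    # average time an accepted suggestion stayed onscreen before acceptance;
    # says nothing about the verification or editing done afterwards
    durations = [e.resolved_at - e.shown_at for e in events if e.accepted]
    return sum(durations) / len(durations)
```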
Given the nascent nature of these systems, numerous questions exist regarding the behavior of their
users:
- What activities do users undertake in anticipation of, or to trigger, a suggestion?
- What mental processes occur while suggestions are onscreen, and do people double-check suggestions before or after acceptance?
- How costly for users are these various new tasks, and which take the most time?
To answer these and related questions in a systematic manner, we apply a mixed-methods approach
to analyze interactions with a popular code-suggestion model, GitHub Copilot (https://github.com/features/copilot), which has more than a million users. To emphasize that our analysis is not restricted to the specifics of Copilot, we use
the term CodeRec to refer to any instance of code suggestion models, including Copilot. Through
small-scale pilot studies and our first-hand experience using Copilot for development, we develop a
novel taxonomy of common states of a programmer when interacting with CodeRec models (such as
Copilot), which we refer to as CodeRec User Programming States (CUPS). The CUPS taxonomy
serves as the main tool to answer our research questions.
Given the initial taxonomy, we conducted a user study with 21 developers who were asked to
retrospectively review videos of their coding sessions and explicitly label their intents and actions
using this model, with an option to add new states if necessary. The study participants labeled a
total of 3137 coding segments and interacted with 1096 suggestions. The study confirmed that the
taxonomy was sufficiently expressive, and we further learned transition weights and state dwell times, something we could not do without this experimental setting. Together, these data can be assembled
into various instruments, such as the CUPS diagram (Figure 1), to facilitate profiling interactions
and identify inefficiencies. Moreover, we show that such analysis nearly doubles our estimates of how much developer time can be attributed to interactions with code-suggestion systems, as compared
with existing metrics. We believe that identifying the current CUPS state during a programming
session can help serve programmer needs. This can be accomplished using custom keyboard macros
or automated prediction of CUPS states, as discussed in our future work section and the Appendix.
Overall, we leverage the CUPS diagram to identify some opportunities to address inefficiencies in
the current version of Copilot.
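As one concrete way such an instrument can be assembled, the sketch below estimates transition probabilities and mean dwell times from a session labeled as a sequence of (state, duration) segments. The state names follow Figure 1, but the data structures, durations, and function names are our own illustration rather than the study's actual analysis code.

```python
from collections import Counter, defaultdict

# a labeled session: consecutive (CUPS state, seconds spent) segments;
# state names come from Figure 1(b), durations here are made up
session = [
    ("Prompt crafting", 12.0),
    ("Waiting For Suggestion", 2.5),
    ("Thinking/Verifying Suggestion", 8.0),
    ("Editing Last Suggestion", 20.0),
    ("Debugging/Testing Code", 35.0),
]

def transition_probabilities(segments):
    # count state-to-state transitions, then normalize per source state
    pairs = [(a, b) for (a, _), (b, _) in zip(segments, segments[1:])]
    counts = Counter(pairs)
    totals = Counter(src for src, _ in pairs)
    return {(src, dst): c / totals[src] for (src, dst), c in counts.items()}

def mean_dwell_times(segments):
    # average time spent per visit to each state
    sums, visits = defaultdict(float), Counter()
    for state, seconds in segments:
        sums[state] += seconds
        visits[state] += 1
    return {state: sums[state] / visits[state] for state in sums}
```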
In sum, our main contributions are the following:
- A novel taxonomy of common activities of programmers (called CUPS) when interacting with code-recommendation systems (Section 4).
- A dataset of coding sessions annotated with user actions, CUPS, and video recordings of programmers coding with Copilot (Section 5).
- An analysis of which CUPS states programmers spend their time in when completing coding tasks (Subsection 6.1).
- An instrument to analyze programmer behavior (and patterns in behavior) based on a finite-state machine over CUPS states (Subsection 6.2).
- An adjustment formula to properly account for how much time programmers spend verifying CodeRec suggestions (Subsection 6.4), inspired by the CUPS state of deferring thought (Subsection 6.3).
The remainder of this paper is structured as follows: We first review related work on AI-assisted programming (Section 2) and formally describe Copilot, along with a high-level overview of programmer-CodeRec interaction (Section 3). To further understand this interaction, we define our model of CodeRec User Programming States (CUPS) (Section 4) and then describe a user study designed to collect programmer annotations of their states (Section 5). We use the collected data to analyze interactions using the CUPS diagram, revealing new insights into programmer behavior (Section 6). We then discuss limitations and future work, and conclude (Section 7).
2 Background and Related Work
Large language models based on the Transformer network Vaswani et al. [2017], such as GPT-3
Brown et al. [2020], have found numerous applications in natural language processing. Codex Chen
et al. [2021], a GPT model trained on 54 million GitHub repositories, demonstrates that LLMs can
very effectively solve various programming tasks. Specifically, Codex was initially tested on the
HumanEval dataset containing 164 programming problems, where it is asked to write the function
body from a docstring Chen et al. [2021] and achieves 37.7% accuracy with a single generation.
Various metrics and datasets have been proposed to measure the performance of code generation
models Hendrycks et al. [2021], Li et al. [2022], Evtikhiev et al. [2023], Dakhel et al. [2023]. However,
in each case, these metrics test how well the model can complete code in an offline setting without
developer input rather than evaluating how well such recommendations assist programmers in situ.
This issue has also been noted in earlier work on non-LLM-based code completion models, where performance on completion benchmarks overestimates the model's utility to developers Hellendoorn et al. [2019]. Importantly, however, these results may not hold for LLM-based approaches, which are radically different Sarkar et al. [2022].
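For reference, offline accuracy on benchmarks such as HumanEval is typically reported with the pass@k estimator of Chen et al. [2021]; the sketch below implements that standard estimator, with the example numbers chosen purely for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # unbiased pass@k estimator from Chen et al. [2021]: probability that at
    # least one of k samples, drawn from n generations of which c are correct,
    # solves the problem
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the raw fraction of correct generations (illustrative numbers)
print(pass_at_k(n=200, c=75, k=1))  # 0.375
```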
One straightforward approach to understanding the utility of neural code completion services,
including their propensity to deliver incomplete or imperfect suggestions, is to simply ask developers.
To this end, Weisz et al. [2021] interviewed developers and found that they did not require a perfect recommendation model for the model to be useful. Likewise, Ziegler et al. [2022] surveyed over 2,000 Copilot users and asked about perceived productivity gains using a survey instrument based on the SPACE framework Forsgren et al. [2021b]; we incorporate the same survey design for our own study. They found both that developers felt more productive
using Copilot and that these self-reported perceptions were reasonably correlated with suggestion
acceptance rates. Liang et al. [2023] administered a survey to 410 programmers who use various
AI programming assistants, including Copilot, and highlighted why the programmers use the AI
assistants and numerous usability issues. Similarly, Prather et al. [2023] surveyed how introductory
programming students utilize Copilot.
While these self-reported measures of utility and preference are promising, we would expect gains
to be reflected in objective metrics of productivity. Indeed, one ideal method would be to conduct randomized controlled trials in which one set of participants writes code with a recommendation engine while another set codes without it. GitHub performed such an experiment, in which 95 participants were split into two groups and asked to write a web server; the study found that task completion time was reduced by 55.8% in the Copilot condition Peng et al. [2023]. Likewise, a study by Google showed that an internal CodeRec model yielded a 6% reduction in 'coding iteration time' Tabachnyk and Nikolov [2022]. On the other hand, Vaithilingam et al. [2022], in a study of 24 participants, found no significant improvement in task completion time – yet participants
stated a clear preference for Copilot. An interesting comparison to Copilot is Human-Human pair
programming, which Wu et al. [2023] details.
A significant amount of work has tried to understand the behavior of programmers Brooks [1980, 1977], Sheil [1981], Lieberman and Fry [1995] using structured user studies under the name of "psychology of programming." This line of work tries to understand the effect of programming tools on the time to solve a task or the ease of writing code, and how programmers read and write code. Researchers often use telemetry with detailed logging of keystrokes Velart and Šaloun [2006], Ju and Fox [2018] to understand behavior. Moreover, eye-tracking is also used to understand how programmers read code Peitek et al. [2020], Obaidellah et al. [2018]. Our research uses raw telemetry
alongside user-labeled states to understand behavior; future research could also utilize eye-tracking
and raw video to get deeper insights into behavior.
This wide dispersion of results raises interesting questions about the nature of the utility afforded
by neural code completion engines: how, and when, are such systems most helpful; and conversely,
when do they add additional overhead? This is the central question of our work. The related work closest to answering this question is that of Barke et al. [2023], who showed that interaction with Copilot falls into two broad categories: the programmer is either in “acceleration mode”, where they know what they want to do and Copilot serves to make them faster, or in “exploration mode”, where they are unsure what code to write and Copilot helps them explore. The taxonomy we present in this paper, CUPS, enriches this further with granular labels for programmers’ intents. Moreover, the data collected in this work was labeled by the participants themselves rather than by the researchers interpreting their actions, allowing for more faithful intent and activity labeling; the data collected in our study can also be used to build predictive models, as in Sun et al. [2023]. The next section describes the Copilot system formally and describes the data collected when interacting with Copilot.
3 Copilot System Description
To better understand how code recommendation systems influence the effort of programming, we
focus on GitHub Copilot, a popular and representative example of this class of tools. Copilot (this manuscript refers to the version of Copilot available as of August 2022) is based on a Large Language Model (LLM) and assists programmers inside an IDE by recommending code suggestions any time the programmer pauses their typing. Figure 1 shows an example of Copilot