Reading Between the Lines: Modeling User Behavior and
Costs in AI-Assisted Programming
Hussein Mozannar1, Gagan Bansal2, Adam Fourney2, and Eric Horvitz2
1Massachusetts Institute of Technology, Cambridge, USA
2Microsoft Research, Redmond, USA
Abstract
Code-recommendation systems, such as Copilot and CodeWhisperer, have the potential to
improve programmer productivity by suggesting and auto-completing code. However, to fully
realize their potential, we must understand how programmers interact with these systems and
identify ways to improve that interaction. To seek insights about human-AI collaboration with
code-recommendation systems, we studied GitHub Copilot, a code-recommendation system used
by millions of programmers daily. We developed CUPS, a taxonomy of common programmer
activities when interacting with Copilot. Our study of 21 programmers, who completed coding
tasks and retrospectively labeled their sessions with CUPS, showed that CUPS can help us
understand how programmers interact with code-recommendation systems, revealing inefficiencies
and time costs. Our insights reveal how programmers interact with Copilot and motivate new
interface designs and metrics.
[Figure 1 content: (a) a screenshot of Copilot operating inside Visual Studio Code on a LogisticRegression example, annotated with telemetry events such as prompt, suggestion shown, accepted, and rejected; (b) the CUPS taxonomy with the share of session time spent in each state: Thinking/Verifying Suggestion 22.4%, Writing New Functionality 14.05%, Editing Last Suggestion 11.90%, Prompt crafting 11.56%, Debugging/Testing Code 11.31%, Thinking about New Code to Write 10.91%, Looking up Documentation 7.45%, Editing Written Code 4.28%, Waiting For Suggestion 4.20%, Deferring thought for later 1.39%, Writing Documentation 0.53%, Not Thinking 0.01%; (c) a timeline of the session's state transitions.]
Figure 1: Profiling a coding session with the CodeRec User Programming States (CUPS). In (a) we show the operating mode of CodeRec inside Visual Studio Code. In (b) we show the CUPS taxonomy used to describe CodeRec-related programmer activities. A coding session can be summarized as a timeline in (c), where the programmer transitions between states.
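For readability, here is a cleaned-up, runnable version of the code visible in the Figure 1(a) screenshot. The sigmoid helper, the explicit learning rate, and the predict method are our completions of what the screenshot truncates or omits, not part of the original figure.

```python
import numpy as np

class LogisticRegression:
    def __init__(self):
        self.w = None
        self.b = None

    def sigmoid(self, z):
        # logistic function (assumed helper; not shown in the screenshot)
        return 1.0 / (1.0 + np.exp(-z))

    # implement the fit method
    def fit(self, X, y, lr=0.1, n_iters=100):
        # initialize the parameters
        self.w = np.zeros(X.shape[1])
        self.b = 0.0
        for _ in range(n_iters):
            # predicted probabilities under the current parameters
            p = self.sigmoid(np.dot(X, self.w) + self.b)
            # calculate the gradient of the log-loss
            dw = (1 / X.shape[0]) * np.dot(X.T, p - y)
            db = (1 / X.shape[0]) * np.sum(p - y)
            # update the parameters (learning rate added; the screenshot omits it)
            self.w = self.w - lr * dw
            self.b = self.b - lr * db

    # implement the predict method (suggested but not yet accepted in the screenshot)
    def predict(self, X):
        return (self.sigmoid(np.dot(X, self.w) + self.b) >= 0.5).astype(int)
```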
1 Introduction
Programming-assistance systems that adapt large language models (LLMs) to code recommendation have recently been introduced to the public. Popular systems, including Copilot GitHub [2022], CodeWhisperer Amazon [2022], and AlphaCode Li et al. [2022], signal a potential shift in how software is developed. Though there are differences in specific interaction mechanisms, these programming-assistance systems generally extend existing IDE code-completion mechanisms (e.g., IntelliSense, https://code.visualstudio.com/docs/editor/intellisense) by producing suggestions using neural models trained on billions of lines of code
Chen et al. [2021]. The LLM-based completion models can suggest anything from sentence-level completions to entire functions and classes in a wide array of programming languages. These large neural models
are deployed with the goal of accelerating the efforts of software engineers, reducing their workloads,
and improving their productivity.
Early assessments suggest that programmers do feel more productive when assisted by the code
recommendation models Ziegler et al. [2022] and that they prefer these systems to earlier code
completion engines Vaithilingam et al. [2022]. In fact, a recent study from GitHub found that Copilot could potentially reduce task completion time by a factor of two Peng et al. [2023]. While these studies help us understand the benefits of code-recommendation systems, they do not let us identify avenues for improvement or understand the nature of interaction with these systems.
In particular, the neural models introduce new tasks into a developer’s workflow, such as writing AI
prompts Jiang et al. [2022] and verifying AI suggestions Vaithilingam et al. [2022], which can be
lengthy. Existing interaction metrics, such as suggestion acceptance rates, time to accept (i.e., the
time a suggestion remains onscreen), and reduction of tokens typed, tell only part of this interaction
story. For example, when suggestions are presented in monochrome popups (Figure 1), programmers
may choose to accept them into their codebases so that they can be read with code highlighting
enabled. Likewise, when models suggest only one line of code at a time, programmers may accept
sequences before evaluating them together as a unit. In both scenarios, considerable work verifying
and editing suggestions occurs after the programmer has accepted the recommended code. Prior
interaction metrics also largely miss user effort invested in devising and refining prompts used to
query the models. When code completion tools are evaluated using coarser task-level metrics such
as task completion time Kalliamvakou [2022], we begin to see signals of the benefits of AI-driven
code completion but lack sufficient detail to understand the nature of these gains, as well as possible
remaining inefficiencies. We argue that an ideal approach would be sufficiently low level to support
interaction profiling while sufficiently high level to capture meaningful programmer activities.
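To make this limitation concrete, the sketch below computes two of these existing metrics, acceptance rate and time to accept, from a hypothetical telemetry record. The event schema and field names are our own illustration, not Copilot's actual telemetry format, and neither metric captures verification or editing that happens after a suggestion is accepted.

```python
from dataclasses import dataclass

@dataclass
class SuggestionEvent:
    # hypothetical telemetry record; field names are illustrative only
    shown_at: float      # seconds since session start when the suggestion appeared
    resolved_at: float   # when the suggestion was accepted or rejected
    accepted: bool

def acceptance_rate(events):
    # fraction of shown suggestions that were accepted
    return sum(e.accepted for e in events) / len(events)

def mean_time_to_accept(events):
    # average time an accepted suggestion stayed onscreen before acceptance;
    # says nothing about the verification or editing done afterwards
    durations = [e.resolved_at - e.shown_at for e in events if e.accepted]
    return sum(durations) / len(durations)
```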
Given the nascent nature of these systems, numerous questions exist regarding the behavior of their
users:
- What activities do users undertake in anticipation of, or to trigger, a suggestion?
- What mental processes occur while suggestions are onscreen, and do people double-check suggestions before or after acceptance?
- How costly for users are these various new tasks, and which take the most time?
To answer these and related questions in a systematic manner, we apply a mixed-methods approach
to analyze interactions with a popular code-suggestion model, GitHub Copilot (https://github.com/features/copilot), which has more than a million users. To emphasize that our analysis is not restricted to the specifics of Copilot, we use
the term CodeRec to refer to any instance of code suggestion models, including Copilot. Through
small-scale pilot studies and our first-hand experience using Copilot for development, we develop a
novel taxonomy of common states of a programmer when interacting with CodeRec models (such as
Copilot), which we refer to as CodeRec User Programming States (CUPS). The CUPS taxonomy
serves as the main tool to answer our research questions.
Given the initial taxonomy, we conducted a user study with 21 developers who were asked to
retrospectively review videos of their coding sessions and explicitly label their intents and actions
using this model, with an option to add new states if necessary. The study participants labeled a
total of 3137 coding segments and interacted with 1096 suggestions. The study confirmed that the
taxonomy was sufficiently expressive, and we further learned transition weights and state dwell times, something we could not do without this experimental setting. Together, these data can be assembled
into various instruments, such as the CUPS diagram (Figure 1), to facilitate profiling interactions
and identify inefficiencies. Moreover, we show that such analysis nearly doubles our estimates of how much developer time can be attributed to interactions with code-suggestion systems, as compared
with existing metrics. We believe that identifying the current CUPS state during a programming
session can help serve programmer needs. This can be accomplished using custom keyboard macros
or automated prediction of CUPS states, as discussed in our future work section and the Appendix.
Overall, we leverage the CUPS diagram to identify some opportunities to address inefficiencies in
the current version of Copilot.
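As one concrete way such an instrument can be assembled, the sketch below estimates transition probabilities and mean dwell times from a session labeled as a sequence of (state, duration) segments. The state names follow Figure 1, but the data structures, durations, and function names are our own illustration rather than the study's actual analysis code.

```python
from collections import Counter, defaultdict

# a labeled session: consecutive (CUPS state, seconds spent) segments;
# state names come from Figure 1(b), durations here are made up
session = [
    ("Prompt crafting", 12.0),
    ("Waiting For Suggestion", 2.5),
    ("Thinking/Verifying Suggestion", 8.0),
    ("Editing Last Suggestion", 20.0),
    ("Debugging/Testing Code", 35.0),
]

def transition_probabilities(segments):
    # count state-to-state transitions, then normalize per source state
    pairs = [(a, b) for (a, _), (b, _) in zip(segments, segments[1:])]
    counts = Counter(pairs)
    totals = Counter(src for src, _ in pairs)
    return {(src, dst): c / totals[src] for (src, dst), c in counts.items()}

def mean_dwell_times(segments):
    # average time spent per visit to each state
    sums, visits = defaultdict(float), Counter()
    for state, seconds in segments:
        sums[state] += seconds
        visits[state] += 1
    return {state: sums[state] / visits[state] for state in sums}
```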
In sum, our main contributions are the following:
- A novel taxonomy of common activities of programmers (called CUPS) when interacting with code-recommendation systems (Section 4).
- A dataset of coding sessions annotated with user actions, CUPS, and video recordings of programmers coding with Copilot (Section 5).
- An analysis of which CUPS states programmers spend their time in when completing coding tasks (Subsection 6.1).
- An instrument to analyze programmer behavior (and patterns in behavior) based on a finite-state machine over CUPS states (Subsection 6.2).
- An adjustment formula to properly account for how much time programmers spend verifying CodeRec suggestions (Subsection 6.4), inspired by the CUPS state of deferring thought (Subsection 6.3).
The remainder of this paper is structured as follows: We first review related work on AI-assisted programming (Section 2) and formally describe Copilot, along with a high-level overview of programmer-CodeRec interaction (Section 3). To further understand this interaction, we define our model of CodeRec User Programming States (CUPS) (Section 4) and then describe a user study designed to collect programmer annotations of their states (Section 5). We use the collected data to analyze interactions using the CUPS diagram, revealing new insights into programmer behavior (Section 6). We then discuss limitations and future work, and conclude (Section 7).
2 Background and Related Work
Large language models based on the Transformer network Vaswani et al. [2017], such as GPT-3
Brown et al. [2020], have found numerous applications in natural language processing. Codex Chen
et al. [2021], a GPT model trained on 54 million GitHub repositories, demonstrates that LLMs can
very effectively solve various programming tasks. Specifically, Codex was initially tested on the
HumanEval dataset containing 164 programming problems, where it is asked to write the function
body from a docstring Chen et al. [2021] and achieves 37.7% accuracy with a single generation.
Various metrics and datasets have been proposed to measure the performance of code generation
models Hendrycks et al. [2021], Li et al. [2022], Evtikhiev et al. [2023], Dakhel et al. [2023]. However,
in each case, these metrics test how well the model can complete code in an offline setting without
developer input rather than evaluating how well such recommendations assist programmers in situ.
This issue has also been noted in earlier work on non-LLM-based code completion models, where performance on completion benchmarks overestimates the model's utility to developers Hellendoorn et al. [2019]. Importantly, however, these results may not hold for LLM-based approaches, which are radically different Sarkar et al. [2022].
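For reference, offline accuracy on benchmarks such as HumanEval is typically reported with the pass@k estimator of Chen et al. [2021]; the sketch below implements that standard estimator, with the example numbers chosen purely for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # unbiased pass@k estimator from Chen et al. [2021]: probability that at
    # least one of k samples, drawn from n generations of which c are correct,
    # solves the problem
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the raw fraction of correct generations (illustrative numbers)
print(pass_at_k(n=200, c=75, k=1))  # 0.375
```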
One straightforward approach to understanding the utility of neural code completion services,
including their propensity to deliver incomplete or imperfect suggestions, is to simply ask developers.
To this end, Weisz et al. [2021] interviewed developers and found that they did not require a perfect recommendation model for the model to be useful. Likewise, Ziegler et al. [2022] surveyed over 2,000 Copilot users and asked about perceived productivity gains using a survey instrument based on the SPACE framework Forsgren et al. [2021b]; we incorporate the same survey design for our own study. They found both that developers felt more productive
using Copilot and that these self-reported perceptions were reasonably correlated with suggestion
acceptance rates. Liang et al. [2023] administered a survey to 410 programmers who use various
AI programming assistants, including Copilot, and highlighted why the programmers use the AI
assistants and numerous usability issues. Similarly, Prather et al. [2023] surveyed how introductory
programming students utilize Copilot.
While these self-reported measures of utility and preference are promising, we would expect gains
to be reflected in objective metrics of productivity. Indeed, one ideal method would be to conduct randomized controlled trials in which one set of participants writes code with a recommendation engine while another set codes without it. GitHub performed such an experiment, in which 95 participants were split into two groups and asked to write a web server; the study found that task completion time was reduced by 55.8% in the Copilot condition Peng et al. [2023]. Likewise, a study by Google showed that an internal CodeRec model yielded a 6% reduction in 'coding iteration time' Tabachnyk and Nikolov [2022]. On the other hand, Vaithilingam et al. [2022], in a study of 24 participants, found no significant improvement in task completion time – yet participants
stated a clear preference for Copilot. An interesting comparison to Copilot is Human-Human pair
programming, which Wu et al. [2023] details.
A significant amount of work has tried to understand the behavior of programmers Brooks [1980, 1977], Sheil [1981], Lieberman and Fry [1995] using structured user studies under the name of "psychology of programming." This line of work tries to understand the effect of programming tools on the time to solve a task or the ease of writing code, and how programmers read and write code. Researchers often use telemetry with detailed logging of keystrokes Velart and Šaloun [2006], Ju and Fox [2018] to understand behavior. Moreover, eye-tracking is also used to understand how programmers read code Peitek et al. [2020], Obaidellah et al. [2018]. Our research uses raw telemetry
alongside user-labeled states to understand behavior; future research could also utilize eye-tracking
and raw video to get deeper insights into behavior.
This wide dispersion of results raises interesting questions about the nature of the utility afforded
by neural code completion engines: how, and when, are such systems most helpful; and conversely,
when do they add additional overhead? This is the central question of our work. The related work closest to answering this question is that of Barke et al. [2023], who showed that interaction with Copilot falls into two broad categories: the programmer is either in “acceleration mode”, where they know what they want to do and Copilot serves to make them faster, or in “exploration mode”, where they are unsure what code to write and Copilot helps them explore. The taxonomy we present in this paper, CUPS, enriches this further with granular labels for programmers’ intents. Moreover, the data collected in this work was labeled by the participants themselves rather than by the researchers interpreting their actions, allowing for more faithful intent and activity labeling; the data collected in our study can also be used to build predictive models, as in Sun et al. [2023]. The next section describes the Copilot system formally and describes the data collected when interacting with Copilot.
3 Copilot System Description
To better understand how code recommendation systems influence the effort of programming, we
focus on GitHub Copilot, a popular and representative example of this class of tools. Copilot (this manuscript refers to the version of Copilot available as of August 2022) is based on a Large Language Model (LLM) and assists programmers inside an IDE by recommending code suggestions any time the programmer pauses their typing. Figure 1 shows an example of Copilot