
1 Introduction
Programming-assistance systems based on the adaptation of large language models (LLMs) to code
recommendations have recently been introduced to the public. Popular systems, including Copilot
GitHub [2022], CodeWhisperer Amazon [2022], and AlphaCode Li et al. [2022], signal a potential
shift in how software is developed. Though there are differences in specific interaction mechanisms,
the programming-assistance systems generally extend existing IDE code completion mechanisms
(e.g., IntelliSense¹) by producing suggestions using neural models trained on billions of lines of code
Chen et al. [2021]. The LLM-based completion models can suggest completions ranging from single
statements to entire functions and classes in a wide array of programming languages. These large neural models
are deployed with the goal of accelerating the efforts of software engineers, reducing their workloads,
and improving their productivity.
Early assessments suggest that programmers do feel more productive when assisted by the code
recommendation models Ziegler et al. [2022] and that they prefer these systems to earlier code
completion engines Vaithilingam et al. [2022]. In fact, a recent study from GitHub found that
Copilot could potentially reduce task completion time by a factor of two Peng et al. [2023]. While
these studies help us understand the benefits of code-recommendation systems, they offer little
insight into the nature of programmers' interactions with these systems or into avenues for improving them.
In particular, the neural models introduce new tasks into a developer’s workflow, such as writing AI
prompts Jiang et al. [2022] and verifying AI suggestions Vaithilingam et al. [2022], which can be
lengthy. Existing interaction metrics, such as suggestion acceptance rates, time to accept (i.e., the
time a suggestion remains onscreen), and reduction of tokens typed, tell only part of this interaction
story. For example, when suggestions are presented in monochrome popups (Figure 1), programmers
may choose to accept them into their codebases so that the suggested code can be read with syntax highlighting
enabled. Likewise, when models suggest only one line of code at a time, programmers may accept
sequences before evaluating them together as a unit. In both scenarios, considerable work verifying
and editing suggestions occurs after the programmer has accepted the recommended code. Prior
interaction metrics also largely miss user effort invested in devising and refining prompts used to
query the models. When code completion tools are evaluated using coarser task-level metrics such
as task completion time Kalliamvakou [2022], we begin to see signals of the benefits of AI-driven
code completion but lack sufficient detail to understand the nature of these gains, as well as possible
remaining inefficiencies. We argue that an ideal approach would be sufficiently low-level to support
interaction profiling, yet sufficiently high-level to capture meaningful programmer activities.
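To make concrete what the standard interaction metrics do and do not capture, the sketch below computes acceptance rate, mean time to accept, and a rough typing-reduction proxy from a hypothetical per-suggestion telemetry log. The SuggestionEvent schema and its field names are our own illustration, not the logging format of any particular system. Note that nothing in such a log records the verification and editing effort that occurs after a suggestion is accepted, nor the work of composing and refining prompts.

from dataclasses import dataclass
from typing import List, Optional

# Hypothetical telemetry record for one shown suggestion (illustrative only).
@dataclass
class SuggestionEvent:
    shown_at: float          # timestamp when the suggestion appeared onscreen (seconds)
    resolved_at: float       # timestamp when it was accepted or dismissed (seconds)
    accepted: bool           # True if the programmer accepted the suggestion
    suggested_tokens: int    # number of tokens in the suggestion
    typed_tokens: int        # tokens the programmer typed while the suggestion was pending

def acceptance_rate(events: List[SuggestionEvent]) -> float:
    """Fraction of shown suggestions that were accepted."""
    return sum(e.accepted for e in events) / len(events)

def mean_time_to_accept(events: List[SuggestionEvent]) -> Optional[float]:
    """Average time (seconds) that accepted suggestions remained onscreen."""
    durations = [e.resolved_at - e.shown_at for e in events if e.accepted]
    return sum(durations) / len(durations) if durations else None

def tokens_saved(events: List[SuggestionEvent]) -> int:
    """Accepted suggestion tokens minus tokens typed: a crude proxy for typing reduction."""
    return sum(e.suggested_tokens - e.typed_tokens for e in events if e.accepted)

# Example: three logged suggestions, two accepted.
log = [
    SuggestionEvent(0.0, 1.8, True, 12, 2),
    SuggestionEvent(5.0, 6.1, False, 30, 5),
    SuggestionEvent(9.0, 9.7, True, 8, 1),
]
print(acceptance_rate(log), mean_time_to_accept(log), tokens_saved(log))

These quantities summarize what happens while a suggestion is pending, which is precisely why they tell only part of the interaction story described above.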
Given the nascent nature of these systems, numerous questions exist regarding the behavior of their
users:
• What activities do users undertake in anticipation of, or to trigger, a suggestion?
• What mental processes occur while the suggestions are onscreen, and do people double-check
¹ https://code.visualstudio.com/docs/editor/intellisense