
SIGCSE ’23, March 15–18, 2023, Toronto, Ontario, Canada Paul Denny, Viraj Kumar, and Nasser Giacaman
CodeCheck is to assist instructors in offering students exercises
they can use for self-practice [15].
Consider the following problem which appears as Problem 5
under the “Sums, Averages, Products” category of the “Lists (Simple
Exercises)” group in the Python programming problem bank. The
problem is displayed below exactly as it appears on the CodeCheck
website, with the problem description appearing as a comment
within the code editor, the function signature provided, and the
statement “Your code here...” as a comment prompting the user to
enter their solution:
    # Given a list of strings and a string s,
    # return the average length of all strings
    # containing s.

    def averageContainingStringS(strings, s):
        # Your code here...
What might happen if a student used Copilot as their “AI pair
programmer” here? We provided the problem description to Copilot
(in July 2022, shortly after its general availability release), by
pasting the comments and function header, excluding the “Your
code here...” prompt, into a Visual Studio Code editor with the
Copilot extension enabled. The following suggested solution appeared
almost instantly:
    # Given a list of strings and a string s,
    # return the average length of all strings
    # containing s.

    def averageContainingStringS(strings, s):
        count = 0
        for string in strings:
            if s in string:
                count += 1
        return len(s) * count / len(strings)
This is, of course, incorrect: the combined length of the matching
strings should be divided by the number of matching strings. Instead,
the code multiplies the length of s by the number of matches and
divides by the total number of strings, which is somewhat nonsensical.
If the student who used Copilot to generate this solution submitted it
to the CodeCheck website for evaluation, they would see that all of
the provided test cases fail.
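To make the failure concrete, the sketch below runs the suggested
arithmetic next to a straightforward reference computation on one small
input. The input list and both function names are hypothetical, chosen
for illustration; they are not CodeCheck test cases.

```python
def copilot_first_suggestion(strings, s):
    # Arithmetic of the first Copilot suggestion, reproduced unchanged.
    count = 0
    for string in strings:
        if s in string:
            count += 1
    return len(s) * count / len(strings)


def expected_average(strings, s):
    # What the problem asks for: mean length of the strings containing s.
    # Assumes at least one string contains s.
    matching = [string for string in strings if s in string]
    return sum(len(string) for string in matching) / len(matching)


# Hypothetical input: only "banana" contains "an".
words = ["apple", "banana", "cherry"]
print(copilot_first_suggestion(words, "an"))  # computes 2 * 1 / 3
print(expected_average(words, "an"))          # computes 6 / 1
```

On this input the suggestion yields two thirds, whereas the only
matching string has length 6, so the two results disagree.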
One way this error could be resolved is for the student to fix the
arithmetic in the code manually. In this case, Copilot has still served
a useful purpose: it helped the student most of the way to a solution
by producing code that iterates over the list of strings and that
counts the number of matching strings correctly. However, a different
approach, and one that we envisage as becoming increasingly
common, involves modifying the natural language description of
the problem and querying Copilot again. How could the prompt be
engineered to help guide Copilot towards generating a correct
solution? The current (incorrect) solution is not computing the lengths
of the matching strings. Consider the following modification to
the prompt, which adds comment lines 4–6 outlining a simplistic
algorithm for solving the problem. With these modifications in
place, Copilot produced the following solution:
     1  # Given a list of strings and a string s,
     2  # return the average length of all strings
     3  # containing s.
     4  # Count the number of strings that contain s
     5  # Also, sum the length of all strings containing s
     6  # Finally, divide the total length by the count
     7
     8  def averageContainingStringS(strings, s):
     9      count = 0
    10      totalLength = 0
    11      for string in strings:
    12          if s in string:
    13              count += 1
    14              totalLength += len(string)
    15      return totalLength / count
The above solution is correct provided the final value of count
is positive (the CodeCheck website does not specify, or test for, the
return value if none of the strings in the list strings contains s).
The comments added to lines 4–6 of this example illustrate one
potentially valuable pedagogical use of Copilot: describing
computational steps in natural language as a way of guiding code
generation models.
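The unspecified case could be handled explicitly. The following is a
minimal sketch, assuming (our choice, not CodeCheck's) that a fallback
value of 0.0 is returned when no string contains s. The snake_case
function name and the default parameter are hypothetical and do not
appear on the CodeCheck website.

```python
def average_containing_string_s(strings, s, default=0.0):
    """Return the average length of the strings in `strings` that contain `s`.

    Returns `default` when no string matches. This fallback is an
    assumption: the problem statement leaves that case unspecified.
    """
    count = 0
    total_length = 0
    for string in strings:
        if s in string:
            count += 1
            total_length += len(string)
    if count == 0:
        # Avoid dividing by zero when nothing matches.
        return default
    return total_length / count
```

Without the guard, the division in the last line would raise a
ZeroDivisionError whenever no string matches.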
Although prior work in computing education has shown that
tools like Codex (which powers Copilot) perform well on typical
CS1 problems, little is known about the types of problems for which
they tend to fail. In addition, there is currently no work exploring
prompt engineering as a strategy for students to resolve errors. We
explore both of these ideas using a dataset of publicly accessible
problems, thus establishing a baseline for future evaluations of code
generation models, which we expect will rapidly improve.
3 RELATED WORK
Large language models, or foundation models, are deep neural
networks trained with self-supervised learning on broad data sets
at a very large scale [4]. These models can then be adapted, or
fine-tuned, for application to a wide range of tasks including the
generation of natural language, digital images, and source code.
While their ability to generate novel human-like outputs is on
the one hand fascinating, their rapidly increasing deployment has
caused alarm among some researchers and led to calls for better
understanding of their implications and risks [3, 22].
GPT-3, released by OpenAI in May 2020, is a groundbreaking
large language model that is trained to predict the next token in
a text sequence [5]. The Codex model is the result of fine-tuning
GPT-3 with an enormous amount of code samples: 159GB of code
from 54 million GitHub repositories [6]. Copilot is a production
version of Codex that has been released as an extension for
development environments like Visual Studio Code. It became generally
available to all developers in June of 2022, at which time GitHub
announced it would be free for students³. The impact on educational
practice of such technologies is unknown, with arguments on both
sides: highlighting concerns of over-reliance by novices [6], and
suggesting that the ability to synthesize code automatically could
play a revolutionary role in teaching [11].
In the computing education literature, there have been very few
evaluations to date of code generation models. Finnie-Ansley et al.
explored the performance of Codex on a private dataset of CS1 exam
problems and on several common variations of the well-known
³ https://github.blog/2022-06-21-github-copilot-is-generally-available-to-all-developers