Conversing with Copilot: Exploring Prompt Engineering for Solving CS1 Problems Using Natural Language

Paul Denny

p.denny@auckland.ac.nz

University of Auckland

Auckland, New Zealand

Viraj Kumar

viraj@iisc.ac.in

Indian Institute of Science

Bengaluru, India

Nasser Giacaman

n.giacaman@auckland.ac.nz

University of Auckland

Auckland, New Zealand

ABSTRACT

GitHub Copilot is an articial intelligence model for automatically

generating source code from natural language problem descriptions.

Since June 2022, Copilot has ocially been available for free to all

students as a plug-in to development environments like Visual

Studio Code. Prior work exploring OpenAI Codex, the underlying

model that powers Copilot, has shown it performs well on typical

CS1 problems thus raising concerns about the impact it will have

on how introductory programming courses are taught. However,

little is known about the types of problems for which Copilot does

not perform well, or about the natural language interactions that a

student might have with Copilot when resolving errors. We explore

these questions by evaluating the performance of Copilot on a

publicly available dataset of 166 programming problems. We nd

that it successfully solves around half of these problems on its very

rst attempt, and that it solves 60% of the remaining problems

using only natural language changes to the problem description.

We argue that this type of prompt engineering, which we believe

will become a standard interaction between human and Copilot

when it initially fails, is a potentially useful learning activity that

promotes computational thinking skills, and is likely to change the

nature of code writing skill development.

CCS CONCEPTS

• Applied computing → Education; • Social and professional topics; • Software and its engineering → Designing software.

KEYWORDS

OpenAI, GitHub Copilot, foundation models, large language models, CS1, artificial intelligence, introductory programming.

ACM Reference Format:

Paul Denny, Viraj Kumar, and Nasser Giacaman. 2023. Conversing with Copilot: Exploring Prompt Engineering for Solving CS1 Problems Using Natural Language. In SIGCSE ’23: ACM Technical Symposium on Computer Science Education, March 15–18, 2023, Toronto, Ontario, Canada. ACM, New York, NY, USA, 7 pages.

SIGCSE ’23, March 15–18, 2023, Toronto, Ontario, Canada

This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in SIGCSE ’23: ACM Technical Symposium on Computer Science Education, March 15–18, 2023, Toronto, Ontario, Canada.

1 INTRODUCTION

Recent breakthroughs in deep learning have led to the emergence of transformer language models that exhibit extraordinary performance at generating novel human-like content such as text (e.g., GPT-3 [ ]), images (e.g., DALL-E [ ]) and source code (e.g., Codex [ ]). Producing source code automatically from natural language prompts promises to greatly improve the efficiency of professional developers [ ], and is being actively explored by groups such as OpenAI (Codex), Amazon (CodeWhisperer) and Google (AlphaCode). After less than one year in technical preview, a production version of Codex called Copilot has recently been released as an extension for development environments such as Visual Studio Code. This extension is available for free to students, and claims to be their “AI pair programmer”. Just how students will adopt and make use of tools like Copilot is unclear [ ], but it seems certain they will play an increasing role inside and outside the classroom.

Very recent work has shown that these code generation models are good at solving simple programming tasks. For instance, Finnie-Ansley et al. evaluated the performance of OpenAI’s Codex on a private repository of CS1 exam questions, finding that roughly half of the questions were solved by Codex on its very first attempt [ ].

However, very little is known about the types of problems for which these models tend to fail, or about how students will interact with code generation tools when such failures occur. One hypothesized interaction that seems very likely is that students will learn to modify, or engineer, natural language problem descriptions to guide the model into generating solutions that “work” (at least in the sense of passing available test cases). Indeed, it is well known that language model outputs are very sensitive to their inputs [ ]. For example, when using Codex to solve probability and statistics problems, engineering the prompt to include explicit hints on the strategy for solving a problem is extremely effective [ ]. An interesting open question in computing education is how well this kind of prompt engineering [ ] works in practice, and whether it might be a useful activity for students who are learning to program. In this paper, we explore the current performance of Copilot and the efficacy of prompt engineering guided by the following research questions:

RQ1: How well does Copilot perform, at the current time, on a public repository of CS1 programming problems?

RQ2: To what extent do natural language modifications to the problem description lead to the generation of successful solutions after Copilot initially fails?

RQ3: What commonalities exist amongst problems that Copilot cannot solve, despite reasonable prompt engineering effort?

2 MOTIVATING EXAMPLE

CodeCheck is an online autograder that includes a large collection of publicly accessible programming problems. The goal of CodeCheck is to assist instructors in offering students exercises they can use for self-practice [15].

1 https://github.com/features/copilot

2 https://horstmann.com/codecheck/python-questions.html

arXiv:2210.15157v1 [cs.HC] 27 Oct 2022

Consider the following problem which appears as Problem 5 under the “Sums, Averages, Products” category of the “Lists (Simple Exercises)” group in the Python programming problem bank. The problem is displayed below exactly as it appears on the CodeCheck website, with the problem description appearing as a comment within the code editor, the function signature provided, and the statement “Your code here...” as a comment prompting the user to enter their solution:

    # Given a list of strings and a string s,
    # return the average length of all strings
    # containing s.

    def averageContainingStringS(strings, s):
        # Your code here...

What might happen if a student used Copilot as their “AI pair programmer” here? We provided the problem description to Copilot (in July 2022, shortly after its general availability release), by pasting the comments and function header, excluding the “Your code here...” prompt, into a Visual Studio Code editor with the Copilot extension enabled. The following suggested solution appeared almost instantly:

    # Given a list of strings and a string s,
    # return the average length of all strings
    # containing s.

    def averageContainingStringS(strings, s):
        count = 0
        for string in strings:
            if s in string:
                count += 1
        return len(s) * count / len(strings)

This is, of course, incorrect: the combined length of the matching strings should be divided by their frequency. The arithmetic above is somewhat nonsensical. If the student who used Copilot to generate this solution submitted it to the CodeCheck website for evaluation, they would see that all of the provided test cases fail.
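As an illustration only (this is not part of the original evaluation), the following minimal sketch runs Copilot's first suggestion next to a reference implementation on one small input. The reference function `average_containing_reference` and the sample list `words` are hypothetical names and data introduced here; CodeCheck's actual test cases are not reproduced in the text.

```python
# Copilot's first suggestion, reproduced from the listing above.
def averageContainingStringS(strings, s):
    count = 0
    for string in strings:
        if s in string:
            count += 1
    return len(s) * count / len(strings)


# Hypothetical reference implementation (an assumption for illustration):
# total length of the matching strings divided by how many matched.
def average_containing_reference(strings, s):
    matches = [string for string in strings if s in string]
    return sum(len(string) for string in matches) / len(matches)


# Hypothetical input, not taken from CodeCheck.
words = ["apple", "banana", "cherry", "kiwi"]

suggested = averageContainingStringS(words, "a")
expected = average_containing_reference(words, "a")
print(suggested, expected)  # 0.5 5.5
```

Two of the four strings contain "a" ("apple" and "banana"), so the intended answer is (5 + 6) / 2 = 5.5, whereas the suggestion multiplies the length of `s` by the match count and divides by the size of the whole list, giving 1 * 2 / 4 = 0.5.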

One way this error could be resolved is for the student to fix the arithmetic in the code manually. In this case, Copilot has still served a useful purpose: it helped the student most of the way to a solution by producing code that iterates over the list of strings and that counts the number of matching strings correctly. However, a different approach, and one that we envisage as becoming increasingly common, involves modifying the natural language description of the problem and querying Copilot again. How could the prompt be engineered to help guide Copilot towards generating a correct solution? The current (incorrect) solution is not computing the lengths of the matching strings. Consider the following modification to the prompt, which adds comment lines 4–6 outlining a simplistic algorithm for solving the problem. With these modifications in place, Copilot produced the following solution:

    1   # Given a list of strings and a string s,
    2   # return the average length of all strings
    3   # containing s.
    4   # Count the number of strings that contain s
    5   # Also, sum the length of all strings containing s
    6   # Finally, divide the total length by the count
    7
    8   def averageContainingStringS(strings, s):
    9       count = 0
    10      totalLength = 0
    11      for string in strings:
    12          if s in string:
    13              count += 1
    14              totalLength += len(string)
    15      return totalLength / count

The above solution is correct provided the final value of count is positive (the CodeCheck website does not specify, or test for, the return value if none of the strings in strings contain s). The comments added to lines 4–6 of this example illustrate one potentially valuable pedagogical use of Copilot: describing computational steps in natural language as a way of guiding code generation models.
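The caveat about count can be made explicit in code. The sketch below is a minimal variant of the generated solution that guards against the case where no string contains s. Returning 0.0 in that case is purely an assumption made for illustration, since the CodeCheck problem leaves the behaviour unspecified; raising an exception or returning None would be equally defensible.

```python
def averageContainingStringS(strings, s):
    count = 0
    totalLength = 0
    for string in strings:
        if s in string:
            count += 1
            totalLength += len(string)
    # Assumed behaviour: the problem statement does not say what to
    # return when nothing matches, so 0.0 is an arbitrary choice here.
    if count == 0:
        return 0.0
    return totalLength / count


print(averageContainingStringS(["apple", "banana", "kiwi"], "a"))  # 5.5
print(averageContainingStringS(["apple", "banana", "kiwi"], "z"))  # 0.0
```

Without the guard, the second call would raise a ZeroDivisionError, which is the situation the parenthetical remark above describes.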

Although prior work in computing education has shown that tools like Codex (which powers Copilot) perform well on typical CS1 problems, little is known about the types of problems for which they tend to fail. In addition, there is currently no work exploring prompt engineering as a strategy for students to resolve errors. We explore both of these ideas using a dataset of publicly accessible problems, thus establishing a baseline for future evaluations of code generation models, which we expect will rapidly improve.

3 RELATED WORK

Large language models, or foundation models, are deep neural networks trained with self-supervised learning on broad data sets at a very large scale [ ]. These models can then be adapted, or fine-tuned, for application to a wide range of tasks including the generation of natural language, digital images, and source code. While their ability to generate novel human-like outputs is on the one hand fascinating, their rapidly increasing deployment has caused alarm among some researchers and led to calls for better understanding of their implications and risks [3, 22].

GPT-3, released by OpenAI in May 2020, is a groundbreaking large language model that is trained to predict the next token in a text sequence [ ]. The Codex model is the result of fine-tuning GPT-3 with an enormous amount of code samples: 159GB of code from 54 million GitHub repositories [ ]. Copilot is a production version of Codex that has been released as an extension for development environments like Visual Studio Code. It became generally available to all developers in June of 2022, at which time GitHub announced it would be free for students. The impact on educational practice of such technologies is unknown, with arguments on both sides, highlighting concerns of over-reliance by novices [ ], and suggesting that the ability to synthesize code automatically could play a revolutionary role in teaching [11].

In the computing education literature, there have been very few evaluations to date of code generation models. Finnie-Ansley et al. explored the performance of Codex on a private dataset of CS1 exam problems and on several common variations of the well-known

https://github.blog/2022-06-21-github-copilot-is-generally-available-to-all-developers
