In this article, we hypothesize that varying the input parameters of language models (e.g., prompts and temperature) for the same problem can have a significant impact on the quality of the generated programs. These variations can be leveraged to (1) assess and understand the sensitivity (or robustness) of code assistants, and hence their potential and limitations; and (2) envision (automated) strategies for improving performance.
To do so, we first design and develop a set of operators to automatically vary the input parameters. These operators can remove, augment, or simply rewrite an original programming task description, as well as vary the context and other input parameters such as the temperature and the number of expected solutions. The idea is to feed code assistants with different variations, observe the effects on the generated programs, and eventually better understand the impact of the input parameters on the resulting performance of the language model. In this article, performance is defined as whether at least one of the proposed solutions passes all the test cases.
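To make this criterion concrete, the following sketch (our own illustration; the function names below are not part of any released artifact) checks whether a task counts as solved, given a set of generated candidates and a routine that runs the benchmark's test suite:

    from typing import Callable, Iterable

    def task_solved(candidates: Iterable[str],
                    run_tests: Callable[[str], bool]) -> bool:
        # A task is considered solved if at least one generated program
        # passes all the test cases of the benchmark.
        # `run_tests` is assumed to execute the full test suite against a
        # candidate (given as source code) and return True only on success.
        return any(run_tests(program) for program in candidates)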
We conducted a study that considers two code assistants (Copilot and Codex), leverages two datasets (HumanEval and LeetCode) mostly representing algorithmic problems, and relies on our set of operators. Our experiments span numerous programming tasks (446 problems) with different difficulties and six programming languages. We also vary the number k of code samples generated per task, from k = 1 (one shot) to k = 100. In this way, we study the sensitivity of code assistants and the effectiveness of our variations in different settings and usage scenarios.
Our contributions can be summarized as follows.
•The design and development of a set of operators for automatically varying the input parameters of language models. Our work draws inspiration both from software testing techniques (e.g., mutation testing and test amplification, which derive variants of existing artifacts) and from recent advances in prompt tuning for language models [12], [13];
•The design of a study over two code assistants and two
benchmarks. Prior studies considered a limited number
of problems, programming languages, and configurations
of code assistants. We are also unaware of works that
leverage prompt variations and temperature values in
the context of code generation.
•The analysis of results demonstrating that varying input parameters can significantly improve the performance of language models. However, there is a tight interplay between the temperature, the prompt, and the number of generated solutions, which can make it hard for developers to properly control the parameters and obtain an optimal result.
II. BACKGROUND AND MOTIVATION
A. Language models and Code Suggestions
A language model (LM) is a probabilistic model over natural language. In practice, from a given prompt, a language model provides a set of results. In software engineering, language models such as Codex are invoked with a prompt that includes a description of a programming task in natural language, as well as the surrounding context. Such a context is composed of information such as the existing code (e.g., imports), the offset of the cursor, the targeted language, the authors, or the shebang line.
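As an illustration (our own example, with made-up file contents), a Python file surrounding the completion point might contribute the following context, with the cursor placed where code is requested:

    #!/usr/bin/env python3
    # Author: Jane Doe (illustrative)
    from typing import List

    def count_vowels(text: str) -> int:
        """Count the number of vowels in the given text."""
        # <cursor>: the assistant completes the function body from this offset,
        # using the shebang, author line, imports, and signature above as context.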
Beyond the prompt, language models can be tuned through specific parameters. In particular, the temperature hyperparameter controls the creativity of the language model. The temperature value typically varies from 0 to 1. The temperature is used to adjust the distribution of the model's predicted next word: higher values lead to more diverse and unpredictable outputs, while lower values lead to more conservative, predictable outputs. Another parameter is the expected number k of generated solutions. The principle is to sample the k most probable programs. When k equals 1, only one program is generated and is hopefully a valid solution.
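The following sketch illustrates the general mechanism (not any specific model's implementation): the temperature rescales next-token logits before the softmax. The token names and logit values are made up.

    import math

    def apply_temperature(logits, temperature):
        # Rescale the logits by the temperature, then apply a softmax.
        # Lower temperatures concentrate probability mass on the most likely
        # token (more predictable output); higher temperatures flatten the
        # distribution (more diverse output). A temperature of 0 is usually
        # handled separately as a greedy argmax to avoid dividing by zero.
        scaled = {token: value / temperature for token, value in logits.items()}
        normalizer = sum(math.exp(v) for v in scaled.values())
        return {token: math.exp(v) / normalizer for token, v in scaled.items()}

    logits = {"return": 2.0, "print": 1.0, "pass": 0.5}
    print(apply_temperature(logits, 0.2))  # almost all mass on "return"
    print(apply_temperature(logits, 1.0))  # probability remains spread out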
B. Prompts
When using Copilot and Codex to generate code, the user provides a context to the model. This can be text in natural language or pieces of existing code.
The prompt is the text/code that the model needs to complete. It includes the comment (e.g., the docstring of the function in Python) but also the function signature. The function signature consists of the function name, the number of arguments, their names, and also their types in the case of typed languages.
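For instance, a Python prompt in the style of the HumanEval tasks (the exact task and wording below are our own illustration) combines the typed signature and the docstring; the function body is intentionally left for the assistant to complete:

    from typing import List

    def largest_divisor_of(values: List[int], n: int) -> int:
        """Return the largest value in the list that divides n,
        or -1 if no such value exists.
        >>> largest_divisor_of([2, 3, 7], 12)
        3
        """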
Prompt sensitivity: An example. When using language
model-based assistant tools to generate code, there are several
ways to express what should be completed. The most common is to use the comment of the function, but there are multiple ways to phrase the prompt: one can change its wording while keeping the same meaning. Yet, sometimes, a small variation can drastically change the model's performance.
Figure 1 shows an example of a prompt from the HumanEval dataset [14]. The function takes two integers and returns the biggest even number in the range delimited by the two integers, or −1 if no even number is found. With Copilot, the model fails to provide a correct answer (it returns the smallest even number instead). However, the model provides a correct answer when we add the sentence "Write a quick algorithm to solve this problem." at the end of the prompt. This sentence does not give the model any additional information about the problem or the algorithm to solve it, yet it improves the model's output. Furthermore, we observe that if we further modify the prompt by removing the examples, the result is no longer correct.
These examples suggest that code generation is sensitive to prompt variations, and that a modification of the prompt can cause the model to provide a correct or an incorrect answer depending on the context.
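As a sketch of what such variations look like in code (our own illustration; these helpers are not the operators released with this work), the two modifications discussed above can be expressed as simple string transformations over the docstring prompt:

    def add_neutral_sentence(prompt: str) -> str:
        # Append a sentence that carries no information about the problem
        # itself, yet can change the assistant's completion (cf. Figure 1).
        return prompt.rstrip() + " Write a quick algorithm to solve this problem."

    def remove_examples(prompt: str) -> str:
        # Crude heuristic: drop doctest-style examples from the docstring,
        # i.e., each ">>>" line and the expected-output lines that follow it.
        kept, skipping = [], False
        for line in prompt.splitlines():
            stripped = line.lstrip()
            if stripped.startswith(">>>"):
                skipping = True          # start of an example
                continue
            if skipping and stripped and not stripped.startswith('"""'):
                continue                 # expected output of the example
            skipping = False
            kept.append(line)
        return "\n".join(kept)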
III. EXPERIMENTAL SETUP
This section describes our experimental setup to explore the effect of prompt variations and of the temperature parameter on the performance of language model-based code assistants.