Piloting Copilot and Codex: Hot Temperature, Cold
Prompts, or Black Magic?
Jean-Baptiste Döderlein
ENS Rennes
Rennes, France
jean-baptiste.doderlein@ens-rennes.fr
Djamel Eddine Khelladi
CNRS, Univ Rennes, IRISA, Inria
Rennes, France
djamel-eddine.khelladi@irisa.fr
Mathieu Acher
Univ Rennes, IUF, CNRS, Inria, IRISA
Rennes, France
mathieu.acher@irisa.fr
Benoit Combemale
Univ Rennes, CNRS, Inria, IRISA
Rennes, France
benoit.combemale@irisa.fr
Abstract—Language models are promising solutions for tackling increasingly complex problems. In software engineering, they have recently attracted attention in code assistants, with programs automatically written in a given programming language from a programming task description in natural language. They have
the potential to save time and effort when writing code. However,
these systems are currently poorly understood, preventing them
from being used optimally. In this article, we investigate the
various inputs of two configurations of a language model, and
conduct a study to understand if variations of these input
parameters (e.g. programming task description and the sur-
rounding context, creativity of the language model, number
of generated solutions) can have a significant impact on the
quality of the generated programs. We design specific operators
for varying input parameters and apply them over two code
assistants (Copilot and Codex) and two benchmarks representing
algorithmic problems (HumanEval and LeetCode). Our results
showed that varying the input parameters can significantly
improve the performance of language models, with, for example, a one-shot success rate of up to 79.27%, compared to 22.44% for Codex with default settings and 31.1% for Copilot. Acting on this potential in practice is, however, highly challenging due to the complex interplay revealed in our study: the optimal settings of the temperature, the prompt, and the number of generated solutions differ from one problem to another. Our study also yielded
surprising and startling results (e.g. fully removing the prompt
can be an effective strategy), suggesting some brittleness and
room for improving language models. Overall, this work opens
opportunities to envision (automated) strategies for enhancing
performance of language model-based code assistants, but also
questions their robustness.
I. INTRODUCTION
Language models are gaining momentum and are capable of tackling more and more problems in linguistics, mathematics, commonsense reasoning, biology, physics, etc. BERT [1],
GPT-2 [2], GPT-3 [3], PaLM [4], to name a few, are scaling
to support a variety of tasks such as text generation, question-
answering, text classification, arithmetic on numbers, and
many others [5]–[9]. In software engineering, code assistants
based on language models have been proposed and are now
deployed at scale for supporting programmers, such as GitHub
Copilot [10]. Based on prompts, composed of both the descrip-
tion of a programming task written in natural language and the
surrounding context (e.g. existing code, function signatures,
targeted language, cursor, authors, shebang), programs are au-
tomatically written in a given programming language (Python,
Java, C++, etc.). The promise is to provide a comprehensive
working solution (or a set of candidate programs) for a given
programming task. Tools like Copilot or Codex hence have
the potential to save time and effort when writing code.
However, the strengths and weaknesses of these systems
are currently poorly understood, preventing them from be-
ing used optimally. On the one hand, there are impressive
demonstrations, showing the ability to produce programs on
non-trivial programming problems or tasks. But there also is
the nagging assumption that these systems are simply reciting
code that is already on the Internet (e.g. Github). Furthermore,
early studies suggest that the quality of solutions appears
to vary greatly in some problems and targeted programming
languages [11]. These assistants seem particularly sensitive to the way the developer communicates and interacts.
Given a programming task, developers can pilot/drive code
assistants in different directions to achieve their goal. They
can vary the prompt, for example, formulate the programming
task differently or change the context. They can also change
other language model input parameters, such as increasing the creativity of the assistant (through the temperature of the language model), or change the number of expected solutions
that are eventually proposed. With high flexibility, developers
can communicate at a high level of abstraction, in a declarative
way, focusing on the goal rather than the how. The counterpart
is that the specification might be brittle and not properly or
systematically understood by code assistants. There is also
the question of what strategy to choose when developers try
to find a solution: changing some terms in the programming
task description? changing the signature of the function?
increasing or decreasing the temperature? etc. There are anecdotes here and there about "language model engineering"
(e.g., prompt engineering), but this has not been systematically
studied in the context of code assistants.
In this article, we hypothesize that variations of input param-
eters of language models (e.g. prompts and temperatures) on
the same problem can have a significant impact on the quality
of the generated programs. These variations can be leveraged
to (1) assess and understand the sensitivity (or robustness)
of code assistants (hence their potential and limitations); (2)
envision (automated) strategies for improving performance.
To do so, we first design and develop a set of operators to
automatically vary the input parameters. These operators can
remove, augment, or simply rewrite an original programming
task description, as well as vary the context and other input
parameters, such as temperature and the number of expected
solutions. The idea is to feed code assistants with different
variations, observe the effects on generated programs, and
eventually better understand the impact of the input parameters
on the resulting performance of the language model. Performance is defined in this article as whether or not at least one of the proposed solutions passes all the test cases.
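As a concrete illustration, the sketch below shows what such variation operators might look like on a Python prompt; the operator names and string manipulations are hypothetical simplifications, not the exact operators used in the study.

def append_sentence(prompt, sentence):
    # Add an extra sentence just before the closing triple quotes of the docstring.
    pos = prompt.rfind('"""')
    return prompt[:pos] + sentence + "\n    " + prompt[pos:]

def remove_examples(prompt):
    # Drop the worked examples, i.e. everything between "For example:" and the closing quotes.
    start = prompt.find("For example:")
    end = prompt.rfind('"""')
    if start == -1 or end == -1:
        return prompt
    return prompt[:start] + prompt[end:]

def remove_docstring(prompt):
    # Keep only the function signature, removing the task description entirely.
    return prompt.split('"""')[0]

For example, append_sentence(prompt, "Write a quick algorithm to solve this problem.") produces the kind of rephrased prompt illustrated in Figure 1.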
We conducted a study that considers two code assistants
(Copilot and Codex) and leverages two datasets (HumanEval
and LeetCode) mostly representing algorithmic problems, as
well as our set of operators. Our experiments span numerous
programming tasks (446 problems) with different difficulties
and six programming languages. We also vary the number k of code samples generated per task, from k = 1 (one-shot) to k = 100. Similarly, we study the sensitivity of code assistants
and the effectiveness of our variations in different settings and
usage scenarios.
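To make the success criterion explicit, a minimal sketch of the evaluation loop is given below; generate and run_test_cases are hypothetical helpers standing in for the code assistant and the test harness rather than the study's actual tooling.

def problem_solved(problem, candidates, run_test_cases):
    # A problem counts as solved if at least one candidate passes every test case.
    return any(run_test_cases(problem, code) for code in candidates)

def success_rate(problems, generate, run_test_cases, k):
    # Fraction of problems for which at least one of the k generated candidates passes all tests.
    solved = sum(problem_solved(p, generate(p, k), run_test_cases) for p in problems)
    return solved / len(problems)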
Our contributions can be summarized as follows.
• The design and development of a set of operators for automatically varying language model input parameters. The inspiration for our work comes both from software testing techniques (e.g. mutation testing, test amplification as variants of existing ones) and from recent advances in language models for tuning prompts [12], [13];
• The design of a study over two code assistants and two benchmarks. Prior studies considered a limited number of problems, programming languages, and configurations of code assistants. We are also unaware of works that leverage prompt variations and temperature values in the context of code generation;
• The analysis of results that demonstrate that varying input parameters can significantly improve the performance of language models. However, there is a tight dependency when varying the temperature, the prompt, and the number of generated solutions, making it potentially hard for developers to properly control these parameters to obtain an optimal result.
II. BACKGROUND AND MOTIVATION
A. Language models and Code Suggestions
A language model (LM) is a probabilistic model over
natural language. In practice, from a given prompt, a language
model provides a set of results. In software engineering,
language models such as Codex are called from a prompt
including a description of a programming task in natural
language, as well as all the surrounding context. Such a context
is composed of information such as the existing code (e.g.
imports...), the offset of the cursor, the targeted language, the
authors, the shebang...
Beyond the prompt, language models might be tuned re-
garding specific parameters. In particular, the temperature
hyperparameter controls the creativity of the language model.
The temperature value typically varies from 0 to 1. The
temperature is used to adjust the distribution of the model’s
predicted next word: higher values lead to more diverse
and unpredictable outputs, and lower values lead to more
conservative, predictable outputs. Another parameter is the expected number k of generated solutions. The principle is to sample from the k most probable programs. When k equals 1, only one program is generated and is hopefully a valid solution.
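As a rough illustration of how the temperature reshapes the next-token distribution, the following sketch applies the standard softmax-with-temperature formulation to toy logits; it is a generic illustration, not code from Codex or Copilot.

import numpy as np

# Generic softmax-with-temperature sampling on toy logits; an illustrative
# sketch of the usual formulation, not the internals of Codex or Copilot.
def sample_next_token(logits, temperature, rng):
    scaled = logits / max(temperature, 1e-8)  # low T sharpens, high T flattens the distribution
    probs = np.exp(scaled - scaled.max())     # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, 0.1])
# At temperature 0.2 the most likely token dominates; at 1.0 the samples are more diverse.
print(sample_next_token(logits, 0.2, rng), sample_next_token(logits, 1.0, rng))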
B. Prompts
When using Copilot and Codex to generate code, the user
provides a context to the model. This could be a text in natural
language or some pieces of existing code.
The prompt is the text/code that the model needs to
complete. This includes the comment (e.g. the docstring of
the function in Python) but also the function signature. The
function signature consists of the function name, the number of arguments, their names, and also their types in the case of typed languages.
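For instance, a HumanEval-style prompt can be viewed as a function signature followed by a docstring; the hypothetical helper below separates the two parts, which is convenient when only one of them should be varied.

def split_prompt(prompt):
    # Split a Python prompt into the signature line and the docstring that follows it.
    signature, _, rest = prompt.partition("\n")
    return signature, rest.strip()

sig, doc = split_prompt('def choose_num(x, y):\n    """Return the biggest even number in [x, y], or -1."""')
print(sig)  # -> def choose_num(x, y):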
Prompt sensitivity: An example. When using language
model-based assistant tools to generate code, there are several
ways to express what should be completed. The most common
is to use the comment of the function. But there are multiple
ways to phrase the prompt. One can change the prompt while
keeping the same meaning. Yet, sometimes, a small variation
could drastically change the model performance.
In Figure 1, we use an example of a prompt from the HumanEval [14] dataset. The function takes two integers and returns the biggest even number in the range of the two integers, or -1 if no even number is found. With Copilot, the model fails to provide a correct answer (it returns the smallest even number instead). However, the model provides a correct answer when we add the sentence "Write a quick algorithm to solve this problem." at the end of the prompt. This sentence does not provide any
additional information to the model about the problem or the
algorithm to solve the problem, but it improves the model’s
output. Furthermore, we observe that if we further modify the
prompt by removing the examples, the result is not correct.
These examples suggest that code generation is sensitive
to prompt variation, and that modification of the prompt can
cause the model to provide a correct or wrong answer within
different contexts.
III. EXPERIMENTAL SETUP
This section describes our experimental setup to explore the effect of prompt variations and of the temperature parameter on the performance of language model-based code assistants.
def choose_num(x, y):
    """
    This function takes two positive
    numbers x and y and returns the
    biggest even integer number that
    is in the range [x, y] inclusive.
    If there's no such number, then
    the function should return -1.
    For example:
    choose_num(12, 15) = 14
    choose_num(13, 12) = -1
    """
    if x % 2 != 0:
        x += 1
    if y % 2 != 0:
        y -= 1
    if x > y:
        return -1
    if x % 2 == 0:
        x += 1
    while x <= y:
        if x % 2 == 0:
            return x
        x += 2
    return -1
(a) Original prompt
def choose_num(x, y):
    """
    This function takes two positive
    numbers x and y and returns the
    biggest even integer number that
    is in the range [x, y] inclusive.
    If there's no such number, then
    the function should return -1.
    For example:
    choose_num(12, 15) = 14
    choose_num(13, 12) = -1
    Write a quick algorithm to
    solve this problem.
    """
    if x % 2 == 0:
        x = x
    else:
        x = x + 1
    if y % 2 == 0:
        y = y
    else:
        y = y - 1
    if x > y:
        return -1
    else:
        return max(x, y)
(b) Modified prompt 1
def choose_num(x, y):
    """
    This function takes two positive
    numbers x and y and returns the
    biggest even integer number that
    is in the range [x, y] inclusive.
    If there's no such number, then
    the function should return -1.
    """
    if x % 2 == 1:
        x += 1
    if y % 2 == 0:
        y -= 1
    for i in range(x, y + 1, 2):
        if x <= i <= y:
            return i
    return -1
(c) Modified prompt 2
Fig. 1: Code generation for the HumanEval problem choose_num with the original prompt (a) and modified prompts (b and c). Each listing shows the prompt (blue background) followed by the code generated by Copilot (green background).
A. Two code assistants, two configurations of one language
model
Several language models and code assistants exist for code
suggestion. We consider two popular code assistants, Codex
and Copilot. There are several reasons why we chose these
two tools: they provide state-of-the-art performance; they are
capable of targeting different programming languages; they
are rather mature tools, already widely used, and come with
a set of APIs. Another interesting property is that Codex
and Copilot rely on the same language model. This apparent
similarity in fact hides many differences that we wish to ex-
plore. Copilot can be thought of as a thorough engineering of
Codex, providing a chance for research to evaluate sensitivity
in various configuration settings and acquire insights about
how to configure (pilot) language models.
1) Copilot: Copilot has been built on top of one of the Codex language models as a code assistant with fixed parameters (e.g. temperature) and an associated development tool that is available from GitHub as an IDE extension. Copilot is described as an "AI pair programmer" that offers code suggestions in real time [10]. It can produce code in different ways: it can write code from comments, write tests for code that is already written, or complete code that is being written, as tools like IntelliSense [15] do.
The focus here is on generating code from comments (and
the function signature). In regular use, Copilot communicates
in real time about all the changes made on the document
and sends additional information about the context (document
name, path, cursor position, language, etc.). Previous studies
evaluating the performance [11] or security [16] of the code produced by Copilot retrieved the results for each prompt manually from VS Code. In order to increase the speed of the tests and to significantly increase the number of samples evaluated, we directly made calls to the backend of the Visual Studio Code module. The user is authenticated once manually from the Neovim extension; the calls are then made to the Copilot backend, which is executed with NodeJS and accessed through LSP. When used in the IDE, the Copilot extension does not simply make requests, but builds contexts during generation and decides, according to the context, whether a request is needed [17]. However, as we use a primitive in the LSP to request a query, no context other than the name of the file and the programming language is provided.
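To make this workflow more concrete, the sketch below shows roughly how such a query could be sent to a language-server process over JSON-RPC; the agent entry point, the method name getCompletions, and the parameter layout are assumptions made for illustration and do not correspond to a documented Copilot interface.

import json
import subprocess

# Minimal JSON-RPC-over-stdio sketch; the agent path, method name and
# parameter shape below are illustrative assumptions only.
def send_request(proc, request_id, method, params):
    body = json.dumps({"jsonrpc": "2.0", "id": request_id,
                       "method": method, "params": params}).encode()
    header = f"Content-Length: {len(body)}\r\n\r\n".encode()
    proc.stdin.write(header + body)
    proc.stdin.flush()

proc = subprocess.Popen(["node", "copilot-agent.js"],  # hypothetical agent entry point
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
send_request(proc, 1, "getCompletions", {
    "doc": {"path": "solution.py", "languageId": "python",
            "source": 'def choose_num(x, y):\n    """..."""\n',
            "position": {"line": 2, "character": 0}},
})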
In the Copilot evaluation, we restricted ourselves to the default parameters (temperature, top_p). Although these parameters seem to be modifiable [16], the default values are not known, and the default parameters better reflect the code proposed to the end user. Moreover, we did not use the panel mode, which allows generating several code candidates. Indeed, the code retrieved by the backend sometimes corresponds to the code of the whole file and sometimes to the simple completion of the code, thus preventing proper automation of the process. Overall, all Copilot code is evaluated in one-shot mode.
2) Codex: Codex, more precisely code-davinci-002, is a
language model developed by OpenAI and available in private
beta at the time of writing. It is part of the same family as the
model used by Copilot. Contrary to Copilot, it has an API that facilitates the automation of the evaluation of the produced code. The API offers many parameters, such as the temperature, top_p, and the number of programs produced. For the evaluation, OpenAI advises
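As a rough illustration of these parameters, a request through the legacy openai Python client (as it existed during the code-davinci-002 beta) could look like the following sketch; the parameter values are arbitrary examples and do not correspond to the configuration used in the study.

import openai  # legacy (pre-1.0) client interface

# Illustrative request only; parameter values are arbitrary examples.
openai.api_key = "sk-..."  # placeholder
response = openai.Completion.create(
    model="code-davinci-002",
    prompt='def choose_num(x, y):\n    """Return the biggest even number in [x, y], or -1."""\n',
    temperature=0.8,   # creativity of the sampling
    top_p=1.0,         # nucleus sampling threshold
    n=10,              # number of candidate programs requested
    max_tokens=256,
    stop=["\ndef "],   # stop before a new top-level definition starts
)
candidates = [choice["text"] for choice in response["choices"]]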