In this article, we hypothesize that varying the input parameters of language models (e.g., prompts and temperature) for the same problem can have a significant impact on the quality of the generated programs. These variations can be leveraged to (1) assess and understand the sensitivity (or robustness) of code assistants, and hence their potential and limitations; and (2) envision (automated) strategies for improving performance.
To do so, we first design and develop a set of operators to automatically vary the input parameters. These operators can remove, augment, or simply rewrite an original programming task description, as well as vary the context and other input parameters such as the temperature and the number of expected solutions. The idea is to feed code assistants with different variations, observe the effects on the generated programs, and eventually better understand the impact of the input parameters on the resulting performance of the language model. In this article, performance is defined as whether at least one of the proposed solutions passes all the test cases.
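To make this criterion concrete, the following sketch (our own illustration; the function names below are not part of any released artifact) checks whether a task counts as solved, given a set of generated candidates and a routine that runs the benchmark's test suite:

    from typing import Callable, Iterable

    def task_solved(candidates: Iterable[str],
                    run_tests: Callable[[str], bool]) -> bool:
        # A task is considered solved if at least one generated program
        # passes all the test cases of the benchmark.
        # `run_tests` is assumed to execute the full test suite against a
        # candidate (given as source code) and return True only on success.
        return any(run_tests(program) for program in candidates)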
We conducted a study that considers two code assistants (Copilot and Codex), leverages two datasets (HumanEval and LeetCode) mostly representing algorithmic problems, and relies on our set of operators. Our experiments span numerous programming tasks (446 problems) with different difficulties and six programming languages. We also vary the number k of code samples generated per task, from k = 1 (one shot) to k = 100. In this way, we study the sensitivity of code assistants and the effectiveness of our variations in different settings and usage scenarios.
Our contributions can be summarized as follows.
•The design and development of a set of operators for automatically varying the input parameters of language models. Our work draws inspiration both from software testing techniques (e.g., mutation testing and test amplification, which derive variants of existing artifacts) and from recent advances in prompt tuning for language models [12], [13];
•The design of a study over two code assistants and two
benchmarks. Prior studies considered a limited number
of problems, programming languages, and configurations
of code assistants. We are also unaware of works that
leverage prompt variations and temperature values in
the context of code generation.
•The analysis of results demonstrating that varying input parameters can significantly improve the performance of language models. However, there is a tight interplay between the temperature, the prompt, and the number of generated solutions, which can make it hard for developers to properly control the parameters and obtain an optimal result.
II. BACKGROUND AND MOTIVATION
A. Language models and Code Suggestions
A language model (LM) is a probabilistic model over natural language. In practice, from a given prompt, a language model provides a set of results. In software engineering, language models such as Codex are invoked with a prompt that includes a description of a programming task in natural language, as well as the surrounding context. Such a context is composed of information such as the existing code (e.g., imports), the offset of the cursor, the targeted language, the authors, or the shebang line.
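As an illustration (our own example, with made-up file contents), a Python file surrounding the completion point might contribute the following context, with the cursor placed where code is requested:

    #!/usr/bin/env python3
    # Author: Jane Doe (illustrative)
    from typing import List

    def count_vowels(text: str) -> int:
        """Count the number of vowels in the given text."""
        # <cursor>: the assistant completes the function body from this offset,
        # using the shebang, author line, imports, and signature above as context.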
Beyond the prompt, language models can be tuned through specific parameters. In particular, the temperature hyperparameter controls the creativity of the language model. The temperature value typically varies from 0 to 1. The temperature is used to adjust the distribution of the model's predicted next word: higher values lead to more diverse and unpredictable outputs, while lower values lead to more conservative, predictable outputs. Another parameter is the expected number k of generated solutions. The principle is to sample the k most probable programs. When k equals 1, only one program is generated and is hopefully a valid solution.
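The following sketch illustrates the general mechanism (not any specific model's implementation): the temperature rescales next-token logits before the softmax. The token names and logit values are made up.

    import math

    def apply_temperature(logits, temperature):
        # Rescale the logits by the temperature, then apply a softmax.
        # Lower temperatures concentrate probability mass on the most likely
        # token (more predictable output); higher temperatures flatten the
        # distribution (more diverse output). A temperature of 0 is usually
        # handled separately as a greedy argmax to avoid dividing by zero.
        scaled = {token: value / temperature for token, value in logits.items()}
        normalizer = sum(math.exp(v) for v in scaled.values())
        return {token: math.exp(v) / normalizer for token, v in scaled.items()}

    logits = {"return": 2.0, "print": 1.0, "pass": 0.5}
    print(apply_temperature(logits, 0.2))  # almost all mass on "return"
    print(apply_temperature(logits, 1.0))  # probability remains spread out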
B. Prompts
When using Copilot and Codex to generate code, the user provides a context to the model. This can be text in natural language or pieces of existing code.
The prompt is the text/code that the model needs to complete. It includes the comment (e.g., the docstring of the function in Python) but also the function signature. The function signature consists of the function name, the number of arguments, their names, and also their types in the case of typed languages.
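For instance, a Python prompt in the style of the HumanEval tasks (the exact task and wording below are our own illustration) combines the typed signature and the docstring; the function body is intentionally left for the assistant to complete:

    from typing import List

    def largest_divisor_of(values: List[int], n: int) -> int:
        """Return the largest value in the list that divides n,
        or -1 if no such value exists.
        >>> largest_divisor_of([2, 3, 7], 12)
        3
        """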
Prompt sensitivity: An example. When using language
model-based assistant tools to generate code, there are several
ways to express what should be completed. The most common is to use the comment of the function, but there are multiple ways to phrase the prompt: one can change its wording while keeping the same meaning. Yet, sometimes, a small variation can drastically change the model's performance.
Figure 1 shows an example of a prompt from the HumanEval dataset [14]. The function takes two integers and returns the biggest even number in the range delimited by the two integers, or −1 if no even number is found. With Copilot, the model fails to provide a correct answer (it returns the smallest even number instead). However, the model provides a correct answer when we add the sentence "Write a quick algorithm to solve this problem." at the end of the prompt. This sentence does not give the model any additional information about the problem or the algorithm to solve it, yet it improves the model's output. Furthermore, we observe that if we further modify the prompt by removing the examples, the result is no longer correct.
These examples suggest that code generation is sensitive to prompt variations, and that a modification of the prompt can cause the model to provide a correct or an incorrect answer depending on the context.
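As a sketch of what such variations look like in code (our own illustration; these helpers are not the operators released with this work), the two modifications discussed above can be expressed as simple string transformations over the docstring prompt:

    def add_neutral_sentence(prompt: str) -> str:
        # Append a sentence that carries no information about the problem
        # itself, yet can change the assistant's completion (cf. Figure 1).
        return prompt.rstrip() + " Write a quick algorithm to solve this problem."

    def remove_examples(prompt: str) -> str:
        # Crude heuristic: drop doctest-style examples from the docstring,
        # i.e., each ">>>" line and the expected-output lines that follow it.
        kept, skipping = [], False
        for line in prompt.splitlines():
            stripped = line.lstrip()
            if stripped.startswith(">>>"):
                skipping = True          # start of an example
                continue
            if skipping and stripped and not stripped.startswith('"""'):
                continue                 # expected output of the example
            skipping = False
            kept.append(line)
        return "\n".join(kept)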
III. EXPERIMENTAL SETUP
This section describes our experimental setup to explore the effect of prompt variations and of the temperature parameter on the performance of language model-based code assistants.