Toward Trustworthy Neural Program Synthesis
Wen-Ding Li*1, Darren Key*1, Kevin Ellis1
1Department of Computer Science, Cornell University, USA
wl678@cornell.edu, dyk34@cornell.edu, kellis@cornell.edu
*These authors contributed equally.
Abstract
We develop an approach to estimate the probability that a program sampled from a large language model is correct. Given a natural language description of a programming problem, our method samples both candidate programs as well as candidate predicates specifying how the program should behave. This allows learning a model that forms a well-calibrated probabilistic prediction of program correctness. Our system also infers which predicates are useful to explain the behavior of the generated code, and in a human study, humans preferred these over raw language model outputs. Our method is simple, easy to implement, and maintains state-of-the-art generation accuracy results.
Introduction
Picture a future where AI systems attempt to close GitHub
issues by generating source code given only the natural lan-
guage of the GitHub issue. Such systems might not come
out next year, but when or if they ever do, they will likely
leverage large neural network language models for source
code (Chen et al. 2021; Austin et al. 2021). These neural
systems are good, but not perfect. Suppose 75% of the time,
such systems propose a correct fix to the GitHub issue. The
other 25% of the time, they produce plausible looking code
containing subtle bugs. Would you use this system?
Most engineers would be reluctant to use such a sys-
tem, because it fails to build trust with the user. When it
fails, it cannot detect its own failure. When it succeeds, it
doesn’t present a human-understandable explanation of why
its program behaves as intended. In this paper we seek steps
towards rectifying this lack of trust by building natural-
language conditioned program synthesizers that are more
trustworthy in several complementary ways:
• Calibration: We want systems that, when they cannot solve a programming problem, simply return no answer, rather than return a (possibly subtly) incorrect program. We conjecture that it is better to fall back on the human programmer, rather than risk introducing bugs. Contrast the situation with natural language translation: Unlike natural language, programs are brittle, and more time-consuming to look into and understand. And debugging bad code, unlike proofreading language, can be harder than just writing it yourself. We will accomplish this by having a classifier that predicts whether the program is correct, and ensuring that the classifier is well calibrated (Platt et al. 1999; Kuhn, Gal, and Farquhar 2023).
• Explainability: To help humans understand the output of
a neural network that is writing code, we want a system
that can explain its outputs by generating informative and
interpretable checks on program behavior. We propose a
characterization of what makes an explanation of program
behavior ‘good’, and validate in a human user study that
this characterization produces better explanations than a
raw large language model by itself.
• Accuracy: Ideally, trustworthy systems should be more
accurate, solving more programming problems. This goal
might seem to be in tension with the previous two. Sur-
prisingly, we find our methods also boost overall accuracy
on natural language to code generation problems.
Our high-level approach has a neural network propose
candidate program solutions and independently propose
predicates that correct solutions should satisfy, such as In-
put/Output tests, known as specifications (‘specs’, Fig. 1).
Although specs can refer to a broad range of ways to specify
the behavior of programs, here we only consider two kinds:
(1) input-output test cases, and (2) parameterized logical re-
lation tests which execute the program and check some rela-
tion between input and output. In general, a spec can be any
mechanically checkable property. We check the programs
against the specs, and learn to use this checking to predict if
the system knows how to solve the problem at all, and if so,
which program(s) are probably the right solution. Intuitively,
we ask the language model to ‘check its work’ by generat-
ing specs. We call our approach speculyzer, short for
‘Specification Synthesizer’, because in addition to synthe-
sizing programs, it synthesizes specs.
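To make these two kinds of specs concrete, the following minimal sketch (an illustration, not an example drawn from our benchmarks; the task, function name, and predicates are hypothetical) shows what an input-output test and a logical relation might look like for a toy problem such as "return the largest element of a list":

```python
# Candidate program sampled from the language model (hypothetical).
def largest(xs):
    return sorted(xs)[-1]

# Spec kind (1): an input-output test case.
assert largest([3, 1, 4, 1, 5]) == 5

# Spec kind (2): parameterized logical relations. Each executes the program
# on an input and checks a relation between input and output, without
# hard-coding the expected output.
def relation_output_is_member(xs):
    return largest(xs) in xs

def relation_output_is_upper_bound(xs):
    return all(x <= largest(xs) for x in xs)

# Relations are only checked on a few concrete inputs
# (see 'Checking specs against programs'), for example:
assert relation_output_is_member([3, 1, 4, 1, 5])
assert relation_output_is_upper_bound([2, 7, 7, 0])
```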
Our work makes the following contributions:
1. Calibration: A method to give a well-calibrated probabilistic estimate of whether a program is correct, which enables analysis of metrics connected to trust and safety.
2. Explainability: A method for identifying the specifications most likely to be useful to humans as an explanation of program behavior, and a validation of that approach in a human study.
3. Accuracy: A demonstration that the above contributions do not impair the overall accuracy of program synthesizers, and can sometimes let them solve more problems overall.
Related Work
Program synthesis. Automatically constructing software
has been a longstanding goal of computer science (Manna
and Waldinger 1979; Gulwani et al. 2017). Classic program
synthesizers input a formal specification, and then either
search or logically derive a program guaranteed to satisfy
that formal specification (Alur et al. 2013).
Large language models for source code. Our work uses
large language models for source code (Chen et al. 2021;
Austin et al. 2021). These neural networks generate source
code conditioned or ‘prompted’ by a mix of code and natu-
ral language (the natural language is usually represented as
code comments). Such models are typically implemented as
large transformers (Vaswani et al. 2017; Brown et al. 2020).
Following the introduction of large transformer-based
language models for source code, there has been work on
how to boost the accuracy of those models. Here, accuracy
means the probability of sampling a correct program con-
ditioned on a natural-language prompt. Accuracy is often
measured by functional correctness with the pass@k metric,
which considers drawing k IID samples from the language
model and testing if any of those samples pass a set of hold-
out test cases. Toward boosting pass@k, researchers have
considered clustering sampled programs according to the
outputs they produce on test inputs (Shi et al. 2022; Li et al.
2022). For example, AlphaCode prioritizes large ‘clusters’
of samples with the exact same input-output behavior (Li
et al. 2022), effectively reranking the samples from the lan-
guage model according to how likely they are to solve the
task. A complementary reranking strategy is to train a sec-
ond neural network to predict program correctness, as ex-
plored in (Inala et al. 2022).
The closest work to ours is CodeT (Chen et al. 2023),
which also generates programs as well as input-output
test cases, with the goal of boosting pass@k. The quali-
tative difference between our systems is that we designed
speculyzer to build trust in a variety of ways by syn-
thesizing specifications, centered around first forming well-
calibrated probability estimates, only boosting pass@k as a
side effect. We also incorporate input-output test cases as a
special case of specs in general.
Engineering safe, trustworthy language models has re-
ceived considerable attention by the AI safety (Thoppi-
lan et al. 2022) and AI alignment communities (Kadavath
et al. 2022). These works find that one can train classi-
fiers which predict the truthfulness or safety of language
outputs by inspecting the hidden activations of the model
or even by simply ‘asking’ the model if its output is cor-
rect or safe. We see this family of efforts as complemen-
tary: For programs, it is possible to formally specify correct-
ness properties, which is not generally true in NLP, so we
focus on formal properties (specifications) here. Nonethe-
less, one can train statistical predictors of program correct-
ness (Inala et al. 2022). Broadly however, we think that pro-
gram synthesis offers unique opportunities for building trust
through symbolic methods. Although statistically reranking
language model outputs via a second neural network im-
proves raw performance, we believe it is a suboptimal trust-
builder: an inscrutable neural network cannot guarantee the
correctness of another inscrutable network. Here we advo-
cate that properties which are symbolically checkable and
human-comprehensible should play a role, and examine cer-
tain specifications as basic examples of such properties.
Methods
Given a natural-language prompt describing a programming problem, our goal is to produce a well-calibrated estimate of the probability that each program sampled from the language model is correct. Our approach independently samples a set of candidate programs P and a set of candidate specs S. Specs can be either input-output test cases or logical relations (Fig. 1). We write T for the set of test cases and R for the set of logical relations, so S = T ∪ R. Each program p ∈ P is checked against each spec s ∈ S, and basic statistics of program-spec agreement are computed. These statistics are aggregated by a learned model into an estimated probability that the program is correct. Programs whose probability falls below a threshold are discarded. Any remaining programs are sorted by probability and returned to the user as possible solutions, together with certain specs they pass. Returning specifications allows the user to check that the code has the intended behavior. This architecture lets the system learn how to predict when it cannot solve a problem, and also learn to rank candidate solutions and their corresponding specs.
Sampling programs and specs
Given a string prompt describing a programming problem, we sample n = 100 candidate programs (the set P) and candidate specs (the set S). Both sets are sampled using a pretrained language model, which can probabilistically generate program source code conditioned on a prompt. We write P_LM(· | prompt) for the conditional distribution over programs, given prompt. If a program p ∈ P, then p ∼ P_LM(· | prompt). To sample specs, we deterministically transform the prompt as in Fig. 2 and the Appendix, then draw IID samples from the language model to construct relations R and input-output test cases T.
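As a rough sketch of this sampling step (an illustration, not our exact implementation: `complete` stands in for any code language model API, and the spec prompts shown are simplified versions of those in Fig. 2 and the Appendix):

```python
from typing import Callable, List

def sample_programs_and_specs(
    problem_description: str,
    complete: Callable[[str, int, float], List[str]],  # (prompt, n, temperature) -> samples
    n: int = 100,
    temperature: float = 0.8,
):
    # Candidate programs P: sampled directly from the problem description.
    programs = complete(problem_description, n, temperature)

    # Candidate specs S = T ∪ R: sampled from deterministically transformed
    # prompts (illustrative strings, not the exact prompts of Fig. 2).
    io_prompt = problem_description + "\n# Write assert-based test cases for the function above.\n"
    rel_prompt = problem_description + "\n# Write properties that any correct output should satisfy.\n"
    io_tests = complete(io_prompt, n, temperature)    # T
    relations = complete(rel_prompt, n, temperature)  # R
    return programs, io_tests, relations
```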
One motivation for generating logical relations is that large language models are famously bad at predicting the output of a novel program (Nye et al. 2021). However, given a generic input, they can produce a variety of reasonable constraints that the output should obey, if they are prompted in a program-of-thought style (Chen et al. 2022; Gao et al. 2023). Logical relations also resemble unit test harnesses, like those used in property-based testing (Fink and Bishop 1997), which are likely part of the model's training data. We therefore suspected that although input-output tests are the superior form of test for typical programming tasks, logical relations could serve the long tail of novel tasks whose outputs the language model cannot reliably predict.
Figure 1: Our speculyzer system inputs a natural language description of a programming problem. It uses large language models to independently sample candidate programs and candidate specifications of what the program should do. Because natural language is informal, we cannot mechanically check programs against it, but logical relations and input-output tests can be mechanically checked. The result of this validation step is fed to a learned model which predicts whether the problem can be solved; if so, which program is correct; and which specs best explain the behavior of the program and would be useful for judging whether that program is correct or incorrect.
Checking specs against programs
To judge the correctness of a program p, we would like to know whether each specification s ∈ S is actually true of p, notated p |= s. If the specification s is an input-output test, this checking is trivial: the program p can be run on the input and checked to see if it yields the desired output. But if s is a logical relation, checking whether p |= s over the entire input space becomes undecidable. Thus we require an approximate validation approach for logical relations. As a heuristic approximation we ask the language model to generate a few candidate inputs on which to run the relational test, effectively using the language model as a fuzzer. This causes us to overapproximate the set of relations a program satisfies. Notionally we write p ⊢_T s to mean that spec s is inferred to be true of program p according to testing methodology T. If T = IO, then p |= s iff p ⊢_IO s. If s is a relation, we instead have that p |= s implies p ⊢_Rel s. To avoid confusion, we always show the concrete inputs on which a logical relation was tested.
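The checking step can be sketched as follows. This is a simplified illustration: `run_in_sandbox` stands in for a sandboxed, time-limited execution harness, and we assume for exposition that each sampled relation defines a boolean function named `relation`.

```python
def run_in_sandbox(code: str) -> bool:
    """Return True iff `code` executes without raising an exception."""
    try:
        exec(code, {})   # a real harness would isolate and time-limit this
        return True
    except Exception:
        return False

def check_io(program: str, io_test: str) -> bool:
    # p ⊢_IO s: run the program together with the assert-based I/O test.
    return run_in_sandbox(program + "\n" + io_test)

def check_relation(program: str, relation: str, fuzz_inputs: list) -> bool:
    # p ⊢_Rel s: the relation is only exercised on a few language-model-generated
    # inputs, so passing over-approximates p |= s. We assume the sampled
    # relation code defines a boolean function `relation(...)`.
    calls = "\n".join(f"assert relation({inp})" for inp in fuzz_inputs)
    return run_in_sandbox(program + "\n" + relation + "\n" + calls)
```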
Scoring and analyzing test coverage
Given programs P and specs S, we produce an estimated probability for each p ∈ P that p is correct. Assuming that, on average, specs correctly formalize the informal natural-language intention, satisfying more specs should increase our confidence in a program. Additionally, if many sampled programs exhibit identical behavior on the specs, then we should increase our confidence in those programs, because this indicates high marginal probability of that behavior under P_LM(· | prompt). This 'clustering' of candidate solutions according to their execution behavior, and prioritizing large clusters, has been successfully used by AlphaCode (Li et al. 2022), Minerva (Lewkowycz et al. 2022), and others (Shi et al. 2022). It is also related to observational equivalence from classic program synthesis (Udupa et al. 2013), which treats programs as identical if they have the same outputs on test inputs.
Therefore, we estimate the probability of correctness for each program p ∈ P using a logistic regressor over features φ(p, P, S) and learned parameters θ. The features include the i/o pass rate (fraction of input-output specs passed), the relation pass rate (fraction of logical-relation specs passed, under fuzzing), the cluster size (number of other programs satisfying the same specs), the ordinal ranks of the preceding features, and the normalized log probability of the sampled program:
\mathrm{score}_\theta(p \mid \mathcal{P}, \mathcal{S}) = \mathrm{Sigmoid}\big(\theta \cdot \phi(p, \mathcal{P}, \mathcal{S}) + \theta_0\big) \qquad (1)

where the components of \phi(p, \mathcal{P}, \mathcal{S}) include the following and their ordinal ranks:

\mathrm{IOPass}(p, \mathcal{P}, \mathcal{T} \cup \mathcal{R}) = \frac{1}{|\mathcal{T}|} \sum_{s \in \mathcal{T}} \mathbb{1}[p \vdash_{\mathrm{IO}} s]

\mathrm{RelPass}(p, \mathcal{P}, \mathcal{T} \cup \mathcal{R}) = \frac{1}{|\mathcal{R}|} \sum_{s \in \mathcal{R}} \mathbb{1}[p \vdash_{\mathrm{Rel}} s]

\mathrm{ClusterSize}(T, p, \mathcal{P}, \mathcal{S}) = \sum_{p' \in \mathcal{P}} \prod_{s \in \mathcal{S}} \mathbb{1}\big[(p \vdash_T s) = (p' \vdash_T s)\big], \quad T \in \{\mathrm{IO}, \mathrm{Rel}\}

\mathrm{NormLogProb}(p) = \frac{1}{\mathrm{length}(p)} \log P_{\mathrm{LM}}(p \mid \mathrm{prompt})
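The following sketch illustrates how these features could be computed for a single program. It is a simplified rendering of the definitions above: the ordinal-rank features are omitted, and the two per-methodology cluster sizes are collapsed into a single count over joint spec signatures.

```python
import math

# io_results[q][s] and rel_results[q][s] are booleans recording q ⊢_IO s and
# q ⊢_Rel s for every sampled program q (program source strings used as keys).
def features(p, programs, io_results, rel_results, logprob, length):
    io, rel = io_results[p], rel_results[p]
    io_pass = sum(io.values()) / max(len(io), 1)      # IOPass
    rel_pass = sum(rel.values()) / max(len(rel), 1)   # RelPass
    signature = (tuple(sorted(io.items())), tuple(sorted(rel.items())))
    cluster = sum(                                    # ClusterSize (joint, simplified)
        (tuple(sorted(io_results[q].items())),
         tuple(sorted(rel_results[q].items()))) == signature
        for q in programs
    )
    norm_logprob = logprob / max(length, 1)           # NormLogProb
    return [io_pass, rel_pass, cluster, norm_logprob]

def score(phi, theta, theta0):
    # Eq. 1: Sigmoid(θ · φ + θ0)
    z = sum(t * f for t, f in zip(theta, phi)) + theta0
    return 1.0 / (1.0 + math.exp(-z))
```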
Figure 2: Our system uses different prompts to generate programs, input-output tests, and logical relations. Here we also show an example completion from the model in bold.
We fit θ via maximum likelihood on a corpus D containing triples ⟨P, S, G⟩ of programs P and specifications S, both sampled from the same programming problem, and ground-truth test cases G, which serve as a proxy for program correctness. The ground-truth test cases G are assumed to be unavailable at test time, because our goal is synthesis from informal specifications like natural language. We use gradient ascent to maximize the log likelihood L:

\mathcal{L} = \sum_{\langle \mathcal{P}, \mathcal{S}, \mathcal{G} \rangle \in D} \sum_{p \in \mathcal{P}} \Big( \mathbb{1}[p \vdash_{\mathrm{IO}} \mathcal{G}] \log \mathrm{score}_\theta(p \mid \mathcal{P}, \mathcal{S}) + \mathbb{1}[p \nvdash_{\mathrm{IO}} \mathcal{G}] \log\big(1 - \mathrm{score}_\theta(p \mid \mathcal{P}, \mathcal{S})\big) \Big) \qquad (2)
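Because maximizing Eq. 2 by gradient ascent is ordinary maximum-likelihood logistic regression, an off-the-shelf solver provides an equivalent sketch of the fitting step (an illustration, not our exact training code; in our experiments the weights are additionally cross-validated across problems):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_theta(feature_rows, labels):
    # feature_rows: one φ(p, P, S) vector per sampled program in the corpus D.
    # labels: 1 if p ⊢_IO G (the program passes the ground-truth tests), else 0.
    X = np.asarray(feature_rows, dtype=float)
    y = np.asarray(labels, dtype=int)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf   # clf.predict_proba(X)[:, 1] plays the role of score_θ
```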
Test time metrics
Precision-Recall. Ultimately our goal is a trustworthy sys-
tem that proposes program solutions whenever it can, but
avoids proposing buggy code. Toward those ends, we seek
high precision without sacrificing recall. High precision
means that when the system suggests a program, it is prob-
ably correct. Precise systems foster trust because they don’t
propose wrong answers, though they may decline to provide
an answer in the first place. High recall means a correct pro-
gram achieves the top rank: In other words, the system can
solve a lot of programming problems, though it might make
more mistakes in the process.
The tradeoff between precision and recall can be tuned
by a thresholding parameter, τ. A candidate program is dis-
carded if its score falls below the threshold τ. If all programs
are discarded, the system declines to provide an output for
the programming problem, and otherwise the system outputs
a ranked list of programs sorted by score.
We define Precision and Recall as follows; they respectively measure (1) how often a correct program is top-ranked among the problems where any program scores above τ, and (2) how often a correct program scores above τ and is top-ranked among the problems that are solvable at all (i.e., where some sampled program is correct). Note that these definitions are generalizations of precision/recall for binary classification.
\mathrm{Precision@}k = \frac{\mathrm{TruePositives@}k}{\mathrm{PredictedPositives}} \qquad \mathrm{Recall@}k = \frac{\mathrm{TruePositives@}k}{\mathrm{ActualPositives}}

\mathrm{TruePositives@}k = \sum_{\langle \mathcal{P}, \mathcal{S}, \mathcal{G} \rangle \in D} \mathbb{1}\Big[ \exists p \in \mathcal{P} :\; p \vdash_{\mathrm{IO}} \mathcal{G} \,\wedge\, \tau \le \mathrm{score}_\theta(p \mid \mathcal{P}, \mathcal{S}) \,\wedge\, p \in \operatorname{top-}k_{p' \in \mathcal{P}}\, \mathrm{score}_\theta(p' \mid \mathcal{P}, \mathcal{S}) \Big]

\mathrm{PredictedPositives} = \sum_{\langle \mathcal{P}, \mathcal{S}, \mathcal{G} \rangle \in D} \mathbb{1}\big[ \exists p \in \mathcal{P} : \tau \le \mathrm{score}_\theta(p \mid \mathcal{P}, \mathcal{S}) \big]

\mathrm{ActualPositives} = \sum_{\langle \mathcal{P}, \mathcal{S}, \mathcal{G} \rangle \in D} \mathbb{1}\big[ \exists p \in \mathcal{P} : p \vdash_{\mathrm{IO}} \mathcal{G} \big]
We sweep possible values for τto compute a precision-recall
curve. Generically, there is no ‘true’ best trade-off between
these desiderata.
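The sweep can be sketched as follows for k = 1, directly following the definitions above. Here `problems` is assumed to hold, for each benchmark problem, the scores of its sampled programs and whether each passes the ground-truth tests; this layout is an illustrative assumption, not our exact data structures.

```python
def precision_recall_curve(problems, thresholds):
    # problems: list of (scores, is_correct) pairs, one per benchmark problem,
    # where scores[i] is score_θ for the i-th sampled program and
    # is_correct[i] records whether it passes the ground-truth tests G.
    curve = []
    for tau in thresholds:
        true_pos = pred_pos = actual_pos = 0
        for scores, is_correct in problems:
            if any(is_correct):
                actual_pos += 1                           # ActualPositives
            above = [i for i, s in enumerate(scores) if s >= tau]
            if above:
                pred_pos += 1                             # PredictedPositives
                top = max(above, key=lambda i: scores[i])  # top-1 candidate
                if is_correct[top]:
                    true_pos += 1                         # TruePositives@1
        precision = true_pos / pred_pos if pred_pos else 1.0  # convention when nothing is predicted
        recall = true_pos / actual_pos if actual_pos else 0.0
        curve.append((tau, precision, recall))
    return curve
```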
Pass rate. The pass@k metric (Austin et al. 2021; Chen et al. 2021) measures the probability of k samples from P_LM(· | prompt) passing the ground-truth test cases G:

\mathrm{pass@}k = \mathbb{E}_{p_1, \dots, p_k \sim P_{\mathrm{LM}}(\cdot \mid \mathrm{prompt})}\, \mathbb{1}\big[\exists p_i : p_i \vdash_{\mathrm{IO}} \mathcal{G}\big] \qquad (3)

Note that pass@k (Eq. 3) is proportional to ActualPositives: the (fraction of) problems where there is at least one correct answer among the sampled programs.
It is also conventional to combine pass@k with a scoring function that reranks the sampled programs. This generalizes pass@k to pass@k,n, which measures the probability that, after generating n candidate programs, a correct one is in the top-k under our scoring function:

\mathrm{pass@}k,n = \mathbb{E}_{\langle \mathcal{P}, \mathcal{T}, \mathcal{G} \rangle \sim D}\, \mathbb{1}\Big[\exists p \in \operatorname{top-}k_{p' \in \mathcal{P}}\, \mathrm{score}_\theta(p' \mid \mathcal{P}, \mathcal{T}) \text{ where } p \models \mathcal{G}\Big]
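Under the same (assumed) data layout as the previous sketch, pass@k,n is simply the fraction of problems whose top-k programs under the learned score contain a correct one:

```python
def pass_at_k_n(problems, k):
    # problems: list of (scores, is_correct) pairs as in the previous sketch,
    # with n sampled programs per problem.
    hits = 0
    for scores, is_correct in problems:
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if any(is_correct[i] for i in ranked[:k]):
            hits += 1
    return hits / len(problems)
```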
Results
We study our approach on two popular datasets while using
Codex models (Chen et al. 2021) of different sizes, seeking
to answer the following research questions:
• Is the classifier well-calibrated? If so, how trustworthy
can we make the system (precision), and how much does
that require sacrificing coverage (recall)?
• How can we use the synthesized specifications to act as
human-interpretable explanations of the behavior of the
programs constructed by the language model?
• How does our learned reranking impact raw rate of suc-
cess (pass@k,n)?
We evaluate on programming problems from the Mostly Basic Python Problems (MBPP; Austin et al. 2021; sanitized version) and HumanEval datasets (Chen et al. 2021). Each
of these datasets contains natural language descriptions of
programming problems, and holdout tests to judge program
correctness. An important difference between them is that
HumanEval sometimes includes example input-outputs as
part of the natural language description, while MBPP does
not. Having I/O examples in the problem description makes
spec generation easier: some specs are given for free. On the
other hand, humans sometimes spontaneously mix natural
language and examples (Acquaviva et al. 2021). Therefore,
using both MBPP and HumanEval gives a more robust eval-
uation, but we note this qualitative difference between them.
We give further experimental setup details, such as hyperparameters and example prompts, in the Appendix.
Calibration, Precision, and Recall
Trustworthy systems should avoid predicting any programs
at all when they cannot solve a problem. Doing so increases
precision, the fraction of outputs which are correct, but at the
expense of recall, the fraction of solvable problems where
we output a correct solution. Fig. 3 illustrates how one can
adjust this trade-off. For example, we can achieve 100% pre-
cision on HumanEval (zero error rate), in exchange for drop-
ping our recall from 93.4% to 44.4%. Note this zero error
rate does not come from our learned score function memo-
rizing the data: we use cross validation to test each program
using weights trained on other folds. Less extreme tradeoffs
are possible, such as 90% precision in exchange for 90.7%
recall.
Striking favorable points on these tradeoffs is, in the-
ory, a result of having a well-calibrated model: whenever
our probabilistic scoring function assigns probability x%
to a program being correct, then approximately x% of the
time, that program should actually be correct. We confirm
this calibration property in Fig. 4, also finding that raw log
likelihoods from the language model are substantially less
well-calibrated. Calibration allows tuning the threshold τto
achieve a desired precision, because the free parameter τ
acts as a threshold on the probability of correctness needed
to output a program.
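Concretely, the calibration check amounts to a reliability diagram: bucket programs by their predicted probability of correctness and compare each bucket's mean prediction with its empirical accuracy. A minimal sketch (our illustration; the bin count and data layout are assumptions):

```python
import numpy as np

def reliability_bins(pred_probs, is_correct, n_bins=10):
    # pred_probs: score_θ for each sampled program; is_correct: whether it
    # passes the ground-truth tests. Returns one row per non-empty bin.
    pred_probs = np.asarray(pred_probs, dtype=float)
    is_correct = np.asarray(is_correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for b in range(n_bins):
        lo, hi = edges[b], edges[b + 1]
        # include 1.0 in the last bin
        mask = (pred_probs >= lo) & ((pred_probs <= hi) if b == n_bins - 1 else (pred_probs < hi))
        if mask.any():
            rows.append((float(pred_probs[mask].mean()),   # mean predicted probability
                         float(is_correct[mask].mean()),   # empirical accuracy
                         int(mask.sum())))                 # bin size
    return rows
```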
Figure 3: Top: Precision-Recall curves with k = 1. Table: statistics of these curves, measuring Area Under Curve (AUC), max F1 (the harmonic mean of precision and recall), and recall in the high-trust regime (R@P=.9 is recall when precision = 90%). The figure below gives further results.

                                 HumanEval                    MBPP
                                 AUC   max F1  R@P=.9         AUC   max F1  R@P=.9
Davinci (Largest Codex model)
  full                           0.91  0.91    0.91           0.74  0.80    0.43
  IO only                        0.86  0.88    0.83           0.69  0.75    0.41
  rels only                      0.66  0.73    0.27           0.69  0.77    0
  random                         0.15  0.39    0              0.23  0.48    0
Cushman (Smaller Codex model)
  full                           0.67  0.72    0.28           0.57  0.65    0.18
  IO only                        0.66  0.71    0.33           0.55  0.63    0.25
  rels only                      0.34  0.44    0.14           0.51  0.61    0.14
  random                         0.08  0.28    0              0.16  0.40    0
Fig. 3 also shows that our full method typically performs
best on precision/recall statistics in the most realistic set-