Toward Trustworthy Neural Program Synthesis
Wen-Ding Li*1, Darren Key*1, Kevin Ellis1
1Department of Computer Science, Cornell University, USA
wl678@cornell.edu, dyk34@cornell.edu, kellis@cornell.edu
*These authors contributed equally.
Abstract
We develop an approach to estimate the probability that a program sampled from a large language model is correct. Given a natural language description of a programming problem, our method samples both candidate programs as well as candidate predicates specifying how the program should behave. This allows learning a model that forms a well-calibrated probabilistic prediction of program correctness. Our system also infers which predicates are useful to explain the behavior of the generated code, and in a human study, humans preferred these over raw language model outputs. Our method is simple, easy to implement, and maintains state-of-the-art generation accuracy results.
Introduction
Picture a future where AI systems attempt to close GitHub
issues by generating source code given only the natural lan-
guage of the GitHub issue. Such systems might not come
out next year, but when or if they ever do, they will likely
leverage large neural network language models for source
code (Chen et al. 2021; Austin et al. 2021). These neural
systems are good, but not perfect. Suppose 75% of the time,
such systems propose a correct fix to the GitHub issue. The
other 25% of the time, they produce plausible looking code
containing subtle bugs. Would you use this system?
Most engineers would be reluctant to use such a sys-
tem, because it fails to build trust with the user. When it
fails, it cannot detect its own failure. When it succeeds, it
doesn’t present a human-understandable explanation of why
its program behaves as intended. In this paper we seek steps
towards rectifying this lack of trust by building natural-
language conditioned program synthesizers that are more
trustworthy in several complementary ways:
• Calibration: We want systems that, when they cannot solve a programming problem, simply return no answer, rather than return a (possibly subtly) incorrect program. We conjecture that it is better to fall back on the human programmer, rather than risk introducing bugs. Contrast the situation with natural language translation: Unlike natural language, programs are brittle, and more time-consuming to look into and understand. And debugging bad code, unlike proofreading language, can be harder than just writing it yourself. We will accomplish this by having a classifier that predicts whether the program is correct, and ensuring that the classifier is well calibrated (Platt et al. 1999; Kuhn, Gal, and Farquhar 2023).
• Explainability: To help humans understand the output of
a neural network that is writing code, we want a system
that can explain its outputs by generating informative and
interpretable checks on program behavior. We propose a
characterization of what makes an explanation of program
behavior ‘good’, and validate in a human user study that
this characterization produces better explanations than a
raw large language model by itself.
• Accuracy: Ideally, trustworthy systems should be more
accurate, solving more programming problems. This goal
might seem to be in tension with the previous two. Sur-
prisingly, we find our methods also boost overall accuracy
on natural language to code generation problems.
Our high-level approach has a neural network propose
candidate program solutions and independently propose
predicates that correct solutions should satisfy, such as In-
put/Output tests, known as specifications (‘specs’, Fig. 1).
Although specs can refer to a broad range of ways to specify
the behavior of programs, here we only consider two kinds:
(1) input-output test cases, and (2) parameterized logical re-
lation tests which execute the program and check some rela-
tion between input and output. In general, a spec can be any
mechanically checkable property. We check the programs
against the specs, and learn to use this checking to predict if
the system knows how to solve the problem at all, and if so,
which program(s) are probably the right solution. Intuitively,
we ask the language model to ‘check its work’ by generat-
ing specs. We call our approach speculyzer, short for
‘Specification Synthesizer’, because in addition to synthe-
sizing programs, it synthesizes specs.
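To make these two kinds of specs concrete, the following minimal sketch (an illustration, not an example drawn from our benchmarks; the task, function name, and predicates are hypothetical) shows what an input-output test and a logical relation might look like for a toy problem such as "return the largest element of a list":

```python
# Candidate program sampled from the language model (hypothetical).
def largest(xs):
    return sorted(xs)[-1]

# Spec kind (1): an input-output test case.
assert largest([3, 1, 4, 1, 5]) == 5

# Spec kind (2): parameterized logical relations. Each executes the program
# on an input and checks a relation between input and output, without
# hard-coding the expected output.
def relation_output_is_member(xs):
    return largest(xs) in xs

def relation_output_is_upper_bound(xs):
    return all(x <= largest(xs) for x in xs)

# Relations are only checked on a few concrete inputs
# (see 'Checking specs against programs'), for example:
assert relation_output_is_member([3, 1, 4, 1, 5])
assert relation_output_is_upper_bound([2, 7, 7, 0])
```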
Our work makes the following contributions:
1. Calibration: A method to give a well-calibrated probabilistic estimate of whether a program is correct, which enables analysis of metrics connected to trust and safety.
2. Explainability: A method for identifying the specifications most likely to be useful to humans as an explanation of program behavior, and a validation of that approach in a human study.
3. Accuracy: A demonstration that the above contributions do not impair the overall accuracy of program synthesizers, and can sometimes let them solve more problems overall.
Related Work
Program synthesis. Automatically constructing software
has been a longstanding goal of computer science (Manna
and Waldinger 1979; Gulwani et al. 2017). Classic program
synthesizers input a formal specification, and then either
search or logically derive a program guaranteed to satisfy
that formal specification (Alur et al. 2013).
Large language models for source code. Our work uses
large language models for source code (Chen et al. 2021;
Austin et al. 2021). These neural networks generate source
code conditioned or ‘prompted’ by a mix of code and natu-
ral language (the natural language is usually represented as
code comments). Such models are typically implemented as
large transformers (Vaswani et al. 2017; Brown et al. 2020).
Following the introduction of large transformer-based
language models for source code, there has been work on
how to boost the accuracy of those models. Here, accuracy
means the probability of sampling a correct program con-
ditioned on a natural-language prompt. Accuracy is often
measured by functional correctness with the pass@k metric,
which considers drawing k IID samples from the language
model and testing if any of those samples pass a set of hold-
out test cases. Toward boosting pass@k, researchers have
considered clustering sampled programs according to the
outputs they produce on test inputs (Shi et al. 2022; Li et al.
2022). For example, AlphaCode prioritizes large ‘clusters’
of samples with the exact same input-output behavior (Li
et al. 2022), effectively reranking the samples from the lan-
guage model according to how likely they are to solve the
task. A complementary reranking strategy is to train a sec-
ond neural network to predict program correctness, as ex-
plored in (Inala et al. 2022).
The closest work to ours is CodeT (Chen et al. 2023),
which also generates programs as well as input-output
test cases, with the goal of boosting pass@k. The quali-
tative difference between our systems is that we designed
speculyzer to build trust in a variety of ways by syn-
thesizing specifications, centered around first forming well-
calibrated probability estimates, only boosting pass@k as a
side effect. We also incorporate input-output test cases as a
special case of specs in general.
Engineering safe, trustworthy language models has re-
ceived considerable attention by the AI safety (Thoppi-
lan et al. 2022) and AI alignment communities (Kadavath
et al. 2022). These works find that one can train classi-
fiers which predict the truthfulness or safety of language
outputs by inspecting the hidden activations of the model
or even by simply ‘asking’ the model if its output is cor-
rect or safe. We see this family of efforts as complemen-
tary: For programs, it is possible to formally specify correct-
ness properties, which is not generally true in NLP, so we
focus on formal properties (specifications) here. Nonethe-
less, one can train statistical predictors of program correct-
ness (Inala et al. 2022). Broadly however, we think that pro-
gram synthesis offers unique opportunities for building trust
through symbolic methods. Although statistically reranking
language model outputs via a second neural network im-
proves raw performance, we believe it is a suboptimal trust-
builder: an inscrutable neural network cannot guarantee the
correctness of another inscrutable network. Here we advo-
cate that properties which are symbolically checkable and
human-comprehensible should play a role, and examine cer-
tain specifications as basic examples of such properties.
Methods
Given a natural-language prompt describing a programming problem, our goal is to produce a well-calibrated estimate of the probability that each program sampled from the language model is correct. Our approach independently samples a set of candidate programs P and a set of candidate specs S. Specs can be either input-output test cases or logical relations (Fig. 1). We write T for the set of test cases and R for the set of logical relations, so S = T ∪ R. Each program p ∈ P is checked against each spec s ∈ S, and basic statistics of program-spec agreement are computed. These statistics are aggregated by a learned model into an estimated probability that the program is correct. Programs whose probability falls below a threshold are discarded. Any remaining programs are sorted by probability and returned to the user as possible solutions, together with certain specs they pass. Returning specifications allows the user to check that the code has the intended behavior. This architecture lets the system learn how to predict when it cannot solve a problem, and also learn to rank candidate solutions and their corresponding specs.
Sampling programs and specs
Given a string prompt describing a programming problem, we sample n = 100 candidate programs (the set P) and candidate specs (the set S). Both sets are sampled using a pretrained language model, which can probabilistically generate program source code conditioned on a prompt. We write P_LM(· | prompt) for the conditional distribution over programs, given prompt. If a program p ∈ P, then p ∼ P_LM(· | prompt). To sample specs, we deterministically transform the prompt as in Fig. 2 and the Appendix, then draw IID samples from the language model to construct relations R and input-output test cases T.
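As a rough sketch of this sampling step (an illustration, not our exact implementation: `complete` stands in for any code language model API, and the spec prompts shown are simplified versions of those in Fig. 2 and the Appendix):

```python
from typing import Callable, List

def sample_programs_and_specs(
    problem_description: str,
    complete: Callable[[str, int, float], List[str]],  # (prompt, n, temperature) -> samples
    n: int = 100,
    temperature: float = 0.8,
):
    # Candidate programs P: sampled directly from the problem description.
    programs = complete(problem_description, n, temperature)

    # Candidate specs S = T ∪ R: sampled from deterministically transformed
    # prompts (illustrative strings, not the exact prompts of Fig. 2).
    io_prompt = problem_description + "\n# Write assert-based test cases for the function above.\n"
    rel_prompt = problem_description + "\n# Write properties that any correct output should satisfy.\n"
    io_tests = complete(io_prompt, n, temperature)    # T
    relations = complete(rel_prompt, n, temperature)  # R
    return programs, io_tests, relations
```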
One motivation for generating logical relations is that large language models are famously bad at predicting the output of a novel program (Nye et al. 2021). However, given a generic input, they can produce a variety of reasonable constraints that the output should obey, if they are prompted in a program-of-thought style (Chen et al. 2022; Gao et al. 2023). Logical relations also resemble unit test harnesses, like those used in property-based testing (Fink and Bishop 1997), which are likely part of the model's training data. We therefore suspected that although input-output tests are the superior form of test for typical programming tasks, logical relations could serve the long tail of novel tasks whose outputs the language model cannot reliably predict.
Figure 1: Our speculyzer system inputs a natural language description of a programming problem. It uses large language models to independently sample candidate programs and candidate specifications of what the program should do. Because natural language is informal, we cannot mechanically check programs against it, but logical relations and input-output tests can be mechanically checked. The result of this validation step is fed to a learned model which predicts whether the problem can be solved; if so, which program is correct; and which specs best explain the behavior of the program and would be useful for judging whether that program is correct or incorrect.
Checking specs against programs
To judge the correctness of a program p, we would like to know whether each specification s ∈ S is actually true of p, notated p |= s. If the specification s is an input-output test, this checking is trivial: the program p can be run on the input and checked to see if it yields the desired output. But if s is a logical relation, checking whether p |= s over the entire input space becomes undecidable. Thus we require an approximate validation approach for logical relations. As a heuristic approximation we ask the language model to generate a few candidate inputs on which to run the relational test, effectively using the language model as a fuzzer. This causes us to overapproximate the set of relations a program satisfies. Notionally we write p ⊢_T s to mean that spec s is inferred to be true of program p according to testing methodology T. If T = IO, then p |= s iff p ⊢_IO s. If s is a relation, we instead have that p |= s implies p ⊢_Rel s. To avoid confusion, we always show the concrete inputs on which a logical relation was tested.
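The checking step can be sketched as follows. This is a simplified illustration: `run_in_sandbox` stands in for a sandboxed, time-limited execution harness, and we assume for exposition that each sampled relation defines a boolean function named `relation`.

```python
def run_in_sandbox(code: str) -> bool:
    """Return True iff `code` executes without raising an exception."""
    try:
        exec(code, {})   # a real harness would isolate and time-limit this
        return True
    except Exception:
        return False

def check_io(program: str, io_test: str) -> bool:
    # p ⊢_IO s: run the program together with the assert-based I/O test.
    return run_in_sandbox(program + "\n" + io_test)

def check_relation(program: str, relation: str, fuzz_inputs: list) -> bool:
    # p ⊢_Rel s: the relation is only exercised on a few language-model-generated
    # inputs, so passing over-approximates p |= s. We assume the sampled
    # relation code defines a boolean function `relation(...)`.
    calls = "\n".join(f"assert relation({inp})" for inp in fuzz_inputs)
    return run_in_sandbox(program + "\n" + relation + "\n" + calls)
```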
Scoring and analyzing test coverage
Given programs P and specs S, we produce an estimated probability for each p ∈ P that p is correct. Assuming that, on average, specs correctly formalize the informal natural-language intention, satisfying more specs should increase our confidence in a program. Additionally, if many sampled programs exhibit identical behavior on the specs, then we should increase our confidence in those programs, because this indicates high marginal probability of that behavior under P_LM(· | prompt). This 'clustering' of candidate solutions according to their execution behavior, and prioritizing large clusters, has been successfully used by AlphaCode (Li et al. 2022), Minerva (Lewkowycz et al. 2022), and others (Shi et al. 2022). It is also related to observational equivalence from classic program synthesis (Udupa et al. 2013), which treats programs as identical if they have the same outputs on test inputs.
Therefore, we estimate the probability of correctness for each program p ∈ P using a logistic regressor over features φ(p, P, S) and learned parameters θ. The features include the i/o pass rate (fraction of input-output specs passed), the relation pass rate (fraction of logical-relation specs passed, under fuzzing), the cluster size (number of other programs satisfying the same specs), the ordinal ranks of the preceding features, and the normalized log probability of the sampled program:
\mathrm{score}_\theta(p \mid \mathcal{P}, \mathcal{S}) = \mathrm{Sigmoid}\big(\theta \cdot \phi(p, \mathcal{P}, \mathcal{S}) + \theta_0\big) \qquad (1)

where the components of \phi(p, \mathcal{P}, \mathcal{S}) include the following and their ordinal ranks:

\mathrm{IOPass}(p, \mathcal{P}, \mathcal{T} \cup \mathcal{R}) = \frac{1}{|\mathcal{T}|} \sum_{s \in \mathcal{T}} \mathbb{1}[p \vdash_{\mathrm{IO}} s]

\mathrm{RelPass}(p, \mathcal{P}, \mathcal{T} \cup \mathcal{R}) = \frac{1}{|\mathcal{R}|} \sum_{s \in \mathcal{R}} \mathbb{1}[p \vdash_{\mathrm{Rel}} s]

\mathrm{ClusterSize}(T, p, \mathcal{P}, \mathcal{S}) = \sum_{p' \in \mathcal{P}} \prod_{s \in \mathcal{S}} \mathbb{1}\big[(p \vdash_T s) = (p' \vdash_T s)\big], \quad T \in \{\mathrm{IO}, \mathrm{Rel}\}

\mathrm{NormLogProb}(p) = \frac{1}{\mathrm{length}(p)} \log P_{\mathrm{LM}}(p \mid \mathrm{prompt})
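The following sketch illustrates how these features could be computed for a single program. It is a simplified rendering of the definitions above: the ordinal-rank features are omitted, and the two per-methodology cluster sizes are collapsed into a single count over joint spec signatures.

```python
import math

# io_results[q][s] and rel_results[q][s] are booleans recording q ⊢_IO s and
# q ⊢_Rel s for every sampled program q (program source strings used as keys).
def features(p, programs, io_results, rel_results, logprob, length):
    io, rel = io_results[p], rel_results[p]
    io_pass = sum(io.values()) / max(len(io), 1)      # IOPass
    rel_pass = sum(rel.values()) / max(len(rel), 1)   # RelPass
    signature = (tuple(sorted(io.items())), tuple(sorted(rel.items())))
    cluster = sum(                                    # ClusterSize (joint, simplified)
        (tuple(sorted(io_results[q].items())),
         tuple(sorted(rel_results[q].items()))) == signature
        for q in programs
    )
    norm_logprob = logprob / max(length, 1)           # NormLogProb
    return [io_pass, rel_pass, cluster, norm_logprob]

def score(phi, theta, theta0):
    # Eq. 1: Sigmoid(θ · φ + θ0)
    z = sum(t * f for t, f in zip(theta, phi)) + theta0
    return 1.0 / (1.0 + math.exp(-z))
```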
Figure 2: Our system uses different prompts to generate programs, input-output tests, and logical relations. Here we also show an example completion from the model in bold.
We fit θ via maximum likelihood on a corpus D containing triples ⟨P, S, G⟩ of programs P and specifications S, both sampled from the same programming problem, and ground-truth test cases G, which serve as a proxy for program correctness. The ground-truth test cases G are assumed to be unavailable at test time, because our goal is synthesis from informal specifications like natural language. We use gradient ascent to maximize the log likelihood L:

\mathcal{L} = \sum_{\langle \mathcal{P}, \mathcal{S}, \mathcal{G} \rangle \in D} \sum_{p \in \mathcal{P}} \Big( \mathbb{1}[p \vdash_{\mathrm{IO}} \mathcal{G}] \log \mathrm{score}_\theta(p \mid \mathcal{P}, \mathcal{S}) + \mathbb{1}[p \nvdash_{\mathrm{IO}} \mathcal{G}] \log\big(1 - \mathrm{score}_\theta(p \mid \mathcal{P}, \mathcal{S})\big) \Big) \qquad (2)
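Because maximizing Eq. 2 by gradient ascent is ordinary maximum-likelihood logistic regression, an off-the-shelf solver provides an equivalent sketch of the fitting step (an illustration, not our exact training code; in our experiments the weights are additionally cross-validated across problems):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_theta(feature_rows, labels):
    # feature_rows: one φ(p, P, S) vector per sampled program in the corpus D.
    # labels: 1 if p ⊢_IO G (the program passes the ground-truth tests), else 0.
    X = np.asarray(feature_rows, dtype=float)
    y = np.asarray(labels, dtype=int)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf   # clf.predict_proba(X)[:, 1] plays the role of score_θ
```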
Test time metrics
Precision-Recall. Ultimately our goal is a trustworthy sys-
tem that proposes program solutions whenever it can, but
avoids proposing buggy code. Toward those ends, we seek
high precision without sacrificing recall. High precision
means that when the system suggests a program, it is prob-
ably correct. Precise systems foster trust because they don’t
propose wrong answers, though they may decline to provide
an answer in the first place. High recall means a correct pro-
gram achieves the top rank: In other words, the system can
solve a lot of programming problems, though it might make
more mistakes in the process.
The tradeoff between precision and recall can be tuned
by a thresholding parameter, τ. A candidate program is dis-
carded if its score falls below the threshold τ. If all programs
are discarded, the system declines to provide an output for
the programming problem, and otherwise the system outputs
a ranked list of programs sorted by score.
We define Precision and Recall as follows; they respectively measure (1) how often a correct program is top-ranked among the problems where any program scores above τ, and (2) how often a correct program scores above τ and is top-ranked among the problems that are solvable at all (i.e., where some sampled program is correct). Note that these definitions are generalizations of precision/recall for binary classification.
\mathrm{Precision@}k = \frac{\mathrm{TruePositives@}k}{\mathrm{PredictedPositives}} \qquad \mathrm{Recall@}k = \frac{\mathrm{TruePositives@}k}{\mathrm{ActualPositives}}

\mathrm{TruePositives@}k = \sum_{\langle \mathcal{P}, \mathcal{S}, \mathcal{G} \rangle \in D} \mathbb{1}\Big[ \exists p \in \mathcal{P} :\; p \vdash_{\mathrm{IO}} \mathcal{G} \,\wedge\, \tau \le \mathrm{score}_\theta(p \mid \mathcal{P}, \mathcal{S}) \,\wedge\, p \in \operatorname{top-}k_{p' \in \mathcal{P}}\, \mathrm{score}_\theta(p' \mid \mathcal{P}, \mathcal{S}) \Big]

\mathrm{PredictedPositives} = \sum_{\langle \mathcal{P}, \mathcal{S}, \mathcal{G} \rangle \in D} \mathbb{1}\big[ \exists p \in \mathcal{P} : \tau \le \mathrm{score}_\theta(p \mid \mathcal{P}, \mathcal{S}) \big]

\mathrm{ActualPositives} = \sum_{\langle \mathcal{P}, \mathcal{S}, \mathcal{G} \rangle \in D} \mathbb{1}\big[ \exists p \in \mathcal{P} : p \vdash_{\mathrm{IO}} \mathcal{G} \big]
We sweep possible values for τto compute a precision-recall
curve. Generically, there is no ‘true’ best trade-off between
these desiderata.
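The sweep can be sketched as follows for k = 1, directly following the definitions above. Here `problems` is assumed to hold, for each benchmark problem, the scores of its sampled programs and whether each passes the ground-truth tests; this layout is an illustrative assumption, not our exact data structures.

```python
def precision_recall_curve(problems, thresholds):
    # problems: list of (scores, is_correct) pairs, one per benchmark problem,
    # where scores[i] is score_θ for the i-th sampled program and
    # is_correct[i] records whether it passes the ground-truth tests G.
    curve = []
    for tau in thresholds:
        true_pos = pred_pos = actual_pos = 0
        for scores, is_correct in problems:
            if any(is_correct):
                actual_pos += 1                           # ActualPositives
            above = [i for i, s in enumerate(scores) if s >= tau]
            if above:
                pred_pos += 1                             # PredictedPositives
                top = max(above, key=lambda i: scores[i])  # top-1 candidate
                if is_correct[top]:
                    true_pos += 1                         # TruePositives@1
        precision = true_pos / pred_pos if pred_pos else 1.0  # convention when nothing is predicted
        recall = true_pos / actual_pos if actual_pos else 0.0
        curve.append((tau, precision, recall))
    return curve
```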
Pass rate. The pass@k metric (Austin et al. 2021; Chen et al. 2021) measures the probability of k samples from P_LM(· | prompt) passing the ground-truth test cases G:

\mathrm{pass@}k = \mathbb{E}_{p_1, \dots, p_k \sim P_{\mathrm{LM}}(\cdot \mid \mathrm{prompt})}\, \mathbb{1}\big[\exists p_i : p_i \vdash_{\mathrm{IO}} \mathcal{G}\big] \qquad (3)

Note that pass@k (Eq. 3) is proportional to ActualPositives: the (fraction of) problems where there is at least one correct answer among the sampled programs.
It is also conventional to combine pass@k with a scoring function that reranks the sampled programs. This generalizes pass@k to pass@k,n, which measures the probability that, after generating n candidate programs, a correct one is in the top-k under our scoring function:

\mathrm{pass@}k,n = \mathbb{E}_{\langle \mathcal{P}, \mathcal{T}, \mathcal{G} \rangle \sim D}\, \mathbb{1}\Big[\exists p \in \operatorname{top-}k_{p' \in \mathcal{P}}\, \mathrm{score}_\theta(p' \mid \mathcal{P}, \mathcal{T}) \text{ where } p \models \mathcal{G}\Big]
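Under the same (assumed) data layout as the previous sketch, pass@k,n is simply the fraction of problems whose top-k programs under the learned score contain a correct one:

```python
def pass_at_k_n(problems, k):
    # problems: list of (scores, is_correct) pairs as in the previous sketch,
    # with n sampled programs per problem.
    hits = 0
    for scores, is_correct in problems:
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if any(is_correct[i] for i in ranked[:k]):
            hits += 1
    return hits / len(problems)
```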
Results
We study our approach on two popular datasets while using
Codex models (Chen et al. 2021) of different sizes, seeking
to answer the following research questions:
• Is the classifier well-calibrated? If so, how trustworthy
can we make the system (precision), and how much does
that require sacrificing coverage (recall)?
• How can we use the synthesized specifications to act as
human-interpretable explanations of the behavior of the
programs constructed by the language model?
• How does our learned reranking impact raw rate of suc-
cess (pass@k,n)?
We evaluate on programming problems from the Mostly Basic Python Problems (MBPP; Austin et al. 2021; sanitized version) and HumanEval datasets (Chen et al. 2021). Each
of these datasets contains natural language descriptions of
programming problems, and holdout tests to judge program
correctness. An important difference between them is that
HumanEval sometimes includes example input-outputs as
part of the natural language description, while MBPP does
not. Having I/O examples in the problem description makes
spec generation easier: some specs are given for free. On the
other hand, humans sometimes spontaneously mix natural
language and examples (Acquaviva et al. 2021). Therefore,
using both MBPP and HumanEval gives a more robust eval-
uation, but we note this qualitative difference between them.
We give further experimental setup details, such as hyperparameters and example prompts, in the Appendix.
Calibration, Precision, and Recall
Trustworthy systems should avoid predicting any programs
at all when they cannot solve a problem. Doing so increases
precision, the fraction of outputs which are correct, but at the
expense of recall, the fraction of solvable problems where
we output a correct solution. Fig. 3 illustrates how one can
adjust this trade-off. For example, we can achieve 100% pre-
cision on HumanEval (zero error rate), in exchange for drop-
ping our recall from 93.4% to 44.4%. Note this zero error
rate does not come from our learned score function memo-
rizing the data: we use cross validation to test each program
using weights trained on other folds. Less extreme tradeoffs
are possible, such as 90% precision in exchange for 90.7%
recall.
Striking favorable points on these tradeoffs is, in the-
ory, a result of having a well-calibrated model: whenever
our probabilistic scoring function assigns probability x%
to a program being correct, then approximately x% of the
time, that program should actually be correct. We confirm
this calibration property in Fig. 4, also finding that raw log
likelihoods from the language model are substantially less
well-calibrated. Calibration allows tuning the threshold τto
achieve a desired precision, because the free parameter τ
acts as a threshold on the probability of correctness needed
to output a program.
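Concretely, the calibration check amounts to a reliability diagram: bucket programs by their predicted probability of correctness and compare each bucket's mean prediction with its empirical accuracy. A minimal sketch (our illustration; the bin count and data layout are assumptions):

```python
import numpy as np

def reliability_bins(pred_probs, is_correct, n_bins=10):
    # pred_probs: score_θ for each sampled program; is_correct: whether it
    # passes the ground-truth tests. Returns one row per non-empty bin.
    pred_probs = np.asarray(pred_probs, dtype=float)
    is_correct = np.asarray(is_correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for b in range(n_bins):
        lo, hi = edges[b], edges[b + 1]
        # include 1.0 in the last bin
        mask = (pred_probs >= lo) & ((pred_probs <= hi) if b == n_bins - 1 else (pred_probs < hi))
        if mask.any():
            rows.append((float(pred_probs[mask].mean()),   # mean predicted probability
                         float(is_correct[mask].mean()),   # empirical accuracy
                         int(mask.sum())))                 # bin size
    return rows
```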
Figure 3: Top: Precision-Recall curves with k = 1. Table: statistics of these curves, measuring Area Under Curve (AUC), max F1 (the harmonic mean of precision and recall), and recall in the high-trust regime (R@P=.9 is recall when precision = 90%). The figure below gives further results.

                                 HumanEval                    MBPP
                                 AUC   max F1  R@P=.9         AUC   max F1  R@P=.9
Davinci (Largest Codex model)
  full                           0.91  0.91    0.91           0.74  0.80    0.43
  IO only                        0.86  0.88    0.83           0.69  0.75    0.41
  rels only                      0.66  0.73    0.27           0.69  0.77    0
  random                         0.15  0.39    0              0.23  0.48    0
Cushman (Smaller Codex model)
  full                           0.67  0.72    0.28           0.57  0.65    0.18
  IO only                        0.66  0.71    0.33           0.55  0.63    0.25
  rels only                      0.34  0.44    0.14           0.51  0.61    0.14
  random                         0.08  0.28    0              0.16  0.40    0
Fig. 3 also shows that our full method typically performs
best on precision/recall statistics in the most realistic set-