3. Accuracy: a demonstration that the above contributions do not impair the overall accuracy of program synthesizers, and can sometimes let them solve more problems overall.
Related Work
Program synthesis. Automatically constructing software
has been a longstanding goal of computer science (Manna
and Waldinger 1979; Gulwani et al. 2017). Classic program
synthesizers take a formal specification as input, and then either search for or logically derive a program guaranteed to satisfy that specification (Alur et al. 2013).
Large language models for source code. Our work uses
large language models for source code (Chen et al. 2021;
Austin et al. 2021). These neural networks generate source code conditioned on, or ‘prompted’ with, a mix of code and natural language (the natural language usually appearing as code comments). Such models are typically implemented as
large transformers (Vaswani et al. 2017; Brown et al. 2020).
Following the introduction of large transformer-based
language models for source code, there has been work on
how to boost the accuracy of those models. Here, accuracy
means the probability of sampling a correct program con-
ditioned on a natural-language prompt. Accuracy is often
measured by functional correctness with the pass@k metric,
which considers drawing k i.i.d. samples from the language model and testing whether any of those samples pass a set of held-out test cases. Toward boosting pass@k, researchers have
considered clustering sampled programs according to the
outputs they produce on test inputs (Shi et al. 2022; Li et al.
2022). For example, AlphaCode prioritizes large ‘clusters’
of samples with the exact same input-output behavior (Li
et al. 2022), effectively reranking the samples from the lan-
guage model according to how likely they are to solve the
task. A complementary reranking strategy is to train a sec-
ond neural network to predict program correctness, as ex-
plored in (Inala et al. 2022).
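For concreteness, pass@k is commonly computed per problem with the unbiased estimator popularized by Chen et al. (2021), which asks how likely a budget of k draws from n collected samples (c of them correct) is to contain at least one correct program. A minimal Python sketch (the function name is ours):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of pass@k for a single problem:
    #   n = total samples drawn, c = samples passing the tests, k = user budget.
    # Computes 1 - C(n-c, k) / C(n, k) as a numerically stable running product.
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 100 samples per problem, 7 pass the held-out tests, evaluated at k = 10.
print(pass_at_k(n=100, c=7, k=10))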
The closest work to ours is CodeT (Chen et al. 2023),
which also generates programs as well as input-output
test cases, with the goal of boosting pass@k. The qualitative difference between the two systems is that we designed speculyzer to build trust in a variety of ways by synthesizing specifications: it is centered on first forming well-calibrated probability estimates, and boosts pass@k only as a side effect. We also incorporate input-output test cases as a special case of specs in general.
Engineering safe, trustworthy language models has received considerable attention from the AI safety (Thoppilan et al. 2022) and AI alignment communities (Kadavath
et al. 2022). These works find that one can train classi-
fiers which predict the truthfulness or safety of language
outputs by inspecting the hidden activations of the model
or even by simply ‘asking’ the model if its output is cor-
rect or safe. We see this family of efforts as complemen-
tary: For programs, it is possible to formally specify correct-
ness properties, which is not generally true in NLP, so we
focus on formal properties (specifications) here. Nonethe-
less, one can train statistical predictors of program correct-
ness (Inala et al. 2022). Broadly, however, we think that program synthesis offers unique opportunities for building trust
through symbolic methods. Although statistically reranking
language model outputs via a second neural network im-
proves raw performance, we believe it is a suboptimal trust-
builder: an inscrutable neural network cannot guarantee the
correctness of another inscrutable network. Here we advo-
cate that properties which are symbolically checkable and
human-comprehensible should play a role, and examine cer-
tain specifications as basic examples of such properties.
Methods
Given a natural-language prompt describing a programming
problem, our goal is to produce a well-calibrated estimate of the probability that each program sampled from the language model is correct. Our approach independently samples a set of candidate programs P and a set of candidate specs S. Specs can be either input-output test cases or logical relations (Fig. 1). We write T for the set of test cases and R for the set of logical relations, so S = T ∪ R. Each program p ∈ P is checked against each spec s ∈ S, and
basic statistics of program-spec agreement are computed.
These statistics are aggregated by a learned model into an
estimated probability that the program is correct. Programs
whose probability falls below a threshold are discarded. Any
remaining programs are sorted by probability and returned
to the user as possible solutions, together with certain specs
they pass. Returning specifications allows the user to check
that the code has the intended behavior. This architecture
lets the system learn how to predict when it cannot solve a
problem, and also learn to rank candidate solutions and their
corresponding specs.
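The following sketch illustrates the shape of this pipeline; the function names, the feature set, and the thresholding rule are illustrative stand-ins rather than the exact implementation.

def rank_candidate_programs(programs, specs, predictor, threshold=0.5):
    # programs:  candidate programs sampled from the language model
    # specs:     candidate specs (test cases and logical relations)
    # predictor: learned model mapping agreement statistics to P(correct)
    ranked = []
    for p in programs:
        passed = [s for s in specs if check_spec(p, s)]
        # A simple agreement statistic; the learned model may consume richer
        # features of the program-spec agreement matrix than this fraction.
        features = [len(passed) / max(len(specs), 1)]
        prob_correct = predictor(features)
        if prob_correct >= threshold:          # discard low-confidence candidates
            ranked.append((prob_correct, p, passed))
    ranked.sort(key=lambda t: t[0], reverse=True)  # best candidates first
    return ranked                                  # programs plus the specs they pass

def check_spec(program, spec):
    # Execute one spec against one program; a real implementation would also
    # sandbox the execution and enforce a timeout.
    try:
        return bool(spec(program))
    except Exception:
        return False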
Sampling programs and specs
Given a string prompt describing a programming prob-
lem, we sample n = 100 candidate programs (the set P)
and candidate specs (the set S). Both sets are sampled us-
ing a pretrained language model, which can probabilistically
generate program source code conditioned on a prompt.
We write PLM(· | prompt) for the conditional distribution over programs given the prompt; each program p ∈ P is sampled as p ∼ PLM(· | prompt). To sample specs, we deterministically transform the prompt as in Fig. 2 and the Appendix, then draw i.i.d. samples from the language model to construct the logical relations R and the input-output test cases T.
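Schematically, the sampling step looks as follows; complete() is a placeholder for whatever sampling interface the pretrained model exposes, and the spec prompt shown is only a stand-in for the actual templates of Fig. 2 and the Appendix.

def sample_candidates(prompt, complete, n=100, temperature=0.8):
    # Draw i.i.d. program samples p ~ PLM(. | prompt).
    programs = [complete(prompt, temperature) for _ in range(n)]

    # Deterministically rewrite the prompt so the model emits specs
    # (test cases and logical relations) instead of a solution.
    # The string below is a placeholder, not the paper's actual template.
    spec_prompt = prompt + "\n# Write test cases and assertions for the function above.\n"
    specs = [complete(spec_prompt, temperature) for _ in range(n)]

    return programs, specs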
One motivation for generating logical relations is that
large language models are famously bad at predicting the
output of a novel program (Nye et al. 2021). However,
given a generic input, they can produce a variety of reasonable constraints that the output should obey, if they are prompted in a program-of-thought style (Chen et al. 2022; Gao et al. 2023). Logical relations also resemble the unit-test harnesses used in property-based testing (Fink and Bishop 1997), which are likely part of the model's training data. We therefore suspected that although input-output tests are the superior form of spec for typical programming tasks,
logical relations could serve the long tail of novel tasks that
the language model cannot reliably predict outputs for.
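To make the distinction concrete, consider a hypothetical sort_list task (our example, not one drawn from a benchmark). The input-output test requires the model to predict the exact output, whereas the logical relation only constrains the output on a generic input:

from collections import Counter

# Input-output test case: the exact output must be predicted.
def io_test(sort_list):
    assert sort_list([3, 1, 2]) == [1, 2, 3]

# Logical relation, in the style of a property-based test harness:
# the output is constrained without being computed.
def relation_test(sort_list):
    xs = [3, 1, 2, 2]
    ys = sort_list(xs)
    assert all(a <= b for a, b in zip(ys, ys[1:]))  # output is ordered
    assert Counter(ys) == Counter(xs)               # output is a permutation of the input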