
independently. However, high probability sequences are
often generated multiple times. Conversely, search-style
methods inherently avoid generating duplicate samples but
when parallelized require synchronizing across replicas to
sort the candidates at each step.
In response to this, we introduce arithmetic sampling,
a technique for sampling from large language models
that produces a set of non-independent samples from the
model, based on a coding scheme implicitly defined by the
model. With this coding scheme, distant codes in code
space represent different sequences. Further, decoding
each code can be done independently from all other codes.
Arithmetic sampling boasts provable beam diversity under
certain conditions, produces unbiased expectations from the
original model, and is embarrassingly parallel and as simple
to implement as normal sampling.
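To make this concrete, the sketch below (plain NumPy, with a hypothetical `next_token_probs(prefix)` callable standing in for the model's conditional distribution; all names are illustrative, not the reference T5X implementation) shows how a single code value in [0, 1) could be decoded into a sequence by repeatedly locating it within the cumulative intervals of the next-token distribution and renormalizing.

```python
import numpy as np

def decode_code(code, next_token_probs, eos_id, max_len=64):
    """Map one code value in [0, 1) to a token sequence.

    At each step the unit interval is partitioned according to the model's
    conditional token probabilities; the token whose sub-interval contains
    the code is emitted, and the code is renormalized within that
    sub-interval before the next step.
    """
    tokens = []
    for _ in range(max_len):
        probs = next_token_probs(tokens)       # hypothetical model call, shape (vocab,)
        bounds = np.cumsum(probs)              # upper interval edges w_{i+1}
        token = int(np.searchsorted(bounds, code, side="right"))
        token = min(token, len(probs) - 1)     # guard against float rounding
        lower = bounds[token - 1] if token > 0 else 0.0
        width = max(bounds[token] - lower, 1e-12)
        code = (code - lower) / width          # renormalize into [0, 1)
        tokens.append(token)
        if token == eos_id:
            break
    return tokens

def arithmetic_samples(n, next_token_probs, eos_id, rng=np.random):
    """Decode n codes spread evenly over [0, 1) with a shared random offset;
    distant codes yield different sequences and each decode is independent."""
    offset = rng.uniform()
    codes = (offset + np.arange(n) / n) % 1.0
    return [decode_code(c, next_token_probs, eos_id) for c in codes]
```

Because each call to `decode_code` touches only its own code value, the n decodes can run on separate replicas with no synchronization.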
In addition to analyzing bias and consistency of estimators
based on arithmetic sampling, we present results on the
metric properties of the codebook space and the conditions
under which it improves sample diversity, as well as an
analysis of the estimator variance.
Comparing against equivalent hyperparameters for standard
sampling, we show improvements of nearly 1 point of BLEU
in oracle experiments for En/Fr WMT translation, closing
the gap between independent sampling and beam search
by up to 63%, reducing estimation variance by more than
half and improving beam diversity. We see comparable
improvements in En/Ro translation and variance reduction
for ROUGE score on CNN/DailyMail summarization. We
release an open-source implementation of our algorithm
in the popular T5X transformer library (Roberts et al., 2022); code is available at https://github.com/google-research/google-research/tree/master/arithmetic_sampling.
2. Related Work
This paper draws on three main threads of related work: coding theory, diverse sampling, and quasi-Monte Carlo.
The use of latent “codes” to represent data has a long
history in the neural network and representation learning
literature, from autoencoders (LeCun,1987) to sparse
coding algorithms like K-SVD (Aharon et al.,2006). Rather
than using a high dimensional code learned from data using
an iterative algorithm or backpropagation, we design a
simple one dimensional arithmetic code (Cover & Thomas,
2006) post-hoc from a trained language model.
Diverse sampling inference techniques for large language
models fall into two categories: techniques for producing
a diverse beam (sample of sequences), and techniques
for discouraging overlap (n-gram repetition) within a
single sequence. The former encompasses methods like
determinantal beam search (Meister et al.,2021b), parallel
approximate decoding (Cho,2016), stochastic beam search
(Kool et al., 2019), and conditional Poisson stochastic beam search (Meister et al., 2021a). Our method differs from (determinantal) beam search and parallel approximate decoding in that it is designed to faithfully sample from the underlying probability model, so that sample means can be used to compute unbiased expectations. Unlike beam search or sampling-without-replacement based variants, our algorithm is embarrassingly parallel.
Methods such as temperature sampling, top-k sampling (Fan et al., 2018), nucleus sampling (Holtzman et al., 2019), Mirostat (Basu et al., 2020), and typical decoding (Meister et al., 2022) are useful for increasing diversity over standard beam search, both across the beam and within a single long generation. These methods, and any others that modify the conditional logits, are fully compatible with and complementary to our algorithm, and we provide experiments with temperature sampling and top-k sampling demonstrating improvements.
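As an illustration of that compatibility, any such transform can simply be applied to the conditional distribution before its cumulative intervals are formed; the sketch below shows a generic top-k-plus-temperature filter (function name and parameter values are our own, not from the paper) that could wrap the hypothetical `next_token_probs` callable used in the earlier sketch.

```python
import numpy as np

def top_k_with_temperature(logits, k=40, temperature=0.8):
    """Illustrative logit filter: rescale by a temperature, keep only the k
    largest logits, and renormalize. Any transform of this kind can be
    applied before the cumulative intervals are computed."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    k = min(k, logits.size)
    kth = np.sort(logits)[-k]                  # k-th largest (scaled) logit
    logits = np.where(logits >= kth, logits, -np.inf)
    probs = np.exp(logits - logits.max())      # stable softmax over survivors
    return probs / probs.sum()
```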
There is also work on changing the training objective using
unlikelihood (Welleck et al.,2019) or reinforcement learning
(Lagutin et al.,2021) so that standard generation schemes
produce more diverse outputs, which is also orthogonal to
our methods.
The final thread of related work is quasi-Monte Carlo
integration. (Randomized) Quasi-Monte Carlo techniques
(l’Ecuyer,2016) combine the adaptive anytime properties of
Monte Carlo estimation with the reduced variance of lattice
integration methods (L’Ecuyer & Lemieux,2000), and have
been used in machine learning applications such as lowering
the variance of randomized kernel approximations (Yang
et al.,2014) and neural latent variable models (Buchholz
et al.,2018). Quasi-Monte Carlo has not been applied to
the standard neural autoregressive discrete distributions we
describe here, to our knowledge.
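For intuition about why this matters here, the toy comparison below (our illustration, not an experiment from the paper) estimates a one-dimensional integral with plain Monte Carlo and with a randomly shifted, evenly spaced point set: both estimators are unbiased, but the shifted point set has far lower variance, which is the kind of behavior arithmetic sampling seeks over the space of sequence codes.

```python
import numpy as np

def mc_estimate(f, n, rng):
    """Plain Monte Carlo: n i.i.d. uniform samples."""
    return f(rng.uniform(size=n)).mean()

def rqmc_estimate(f, n, rng):
    """Randomized quasi-Monte Carlo: n evenly spaced points with one shared
    random shift. Still unbiased, typically with much lower variance."""
    u = (rng.uniform() + np.arange(n) / n) % 1.0
    return f(u).mean()

rng = np.random.default_rng(0)
f = lambda u: np.sin(2 * np.pi * u) ** 2              # toy integrand on [0, 1)
mc = [mc_estimate(f, 16, rng) for _ in range(1000)]
rq = [rqmc_estimate(f, 16, rng) for _ in range(1000)]
print(np.var(mc), np.var(rq))                         # the second is far smaller
```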
3. Background
3.1. Arithmetic Coding
An arithmetic code is an optimal lossless coding
scheme—that is, a coding scheme with minimal expected
code length—for when the exact joint distribution of the
data is known. Given a total ordering of the items being
encoded and defining
$w_i = \sum_{j < i} P(X = x_j)$, the cumulative probability of item $x_i$, an arithmetic code for $x_i$ is a number in the interval $(w_i, w_{i+1})$. To represent this code as a sequence of bits it is usual to pick a rational number in the interval $(w_i, w_{i+1})$ whose binary fraction representation requires a small number of digits. Larger intervals then tend to contain more numbers with short binary representations.
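To ground the definition, here is a small worked example (a sketch with a toy three-item distribution; the helper names are ours) that computes each cumulative interval $(w_i, w_{i+1})$ and then picks a dyadic rational inside it with as few binary digits as possible; the widest interval, belonging to the most probable item, admits the shortest code.

```python
import math
from fractions import Fraction
from itertools import accumulate

def code_interval(probs, i):
    """Return (w_i, w_{i+1}), the cumulative-probability interval of item i."""
    cum = [Fraction(0)] + list(accumulate(probs))
    return cum[i], cum[i + 1]

def short_binary_fraction(lo, hi):
    """Find a dyadic rational lo < m / 2**k < hi with the fewest binary
    digits; those k fractional bits serve as the arithmetic code."""
    k = 0
    while True:
        k += 1
        m = math.floor(lo * 2**k) + 1          # smallest m with m / 2**k > lo
        if Fraction(m, 2**k) < hi:
            return Fraction(m, 2**k), format(m, "b").zfill(k)

probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]   # toy distribution
for i, p in enumerate(probs):
    lo, hi = code_interval(probs, i)
    print(i, p, (lo, hi), short_binary_fraction(lo, hi))
# item 0 (prob 1/2): interval (0, 1/2)   -> code 1/4, bits "01"
# item 1 (prob 1/4): interval (1/2, 3/4) -> code 5/8, bits "101"
# item 2 (prob 1/4): interval (3/4, 1)   -> code 7/8, bits "111"
```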