ing multiple models with randomized initializations and use as the final model the one which achieved the best performance on the validation set... (Björne and Salakoski, 2018)
The test results are derived from the 1-best random seed on the validation set. (Kuncoro et al., 2020)
2.2 Safe use: Ensemble creation
Ensemble methods are an effective way of combining multiple machine-learning models to make better predictions (Rokach, 2010). A common approach to creating neural network ensembles is to train the same architecture with different random seeds, and have the resulting models vote (Perrone and Cooper, 1995). For example:
In order to improve the stability of the RNNs, we ensemble five distinct models, each initialized with a different random seed. (Nicolai et al., 2017)
Our model is composed of the ensemble of 8 single models. The hyperparameters and the training procedure used in each single model are the same except the random seed. (Yang and Wang, 2019)
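As a concrete illustration (not drawn from any of the papers quoted above), a seed-based ensemble with majority voting might be sketched as follows, where train_fn and predict_fn are hypothetical stand-ins for whatever training and inference code a given architecture uses:

    from collections import Counter

    def train_seed_ensemble(train_fn, train_data, seeds=(13, 42, 271, 1009, 31337)):
        # Train one copy of the same architecture per (arbitrary) random seed.
        return [train_fn(train_data, random_seed=s) for s in seeds]

    def ensemble_predict(predict_fn, models, example):
        # Each ensemble member votes; the most frequent label wins.
        votes = [predict_fn(model, example) for model in models]
        return Counter(votes).most_common(1)[0][0]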
2.3 Safe use: Sensitivity analysis
Sometimes it is useful to demonstrate how sensitive a neural network architecture is to a particular hyperparameter. For example, Santurkar et al. (2018) show that batch normalization makes neural network architectures less sensitive to the learning rate hyperparameter. Similarly, it may be useful to show how sensitive neural network architectures are to their random seed hyperparameter. For example:
We next (§3.3) examine the expected variance in attention-produced weights by initializing multiple training sequences with different random seeds... (Wiegreffe and Pinter, 2019)
Our model shows a lower standard deviation on each task, which means our model is less sensitive to random seeds than other models. (Hua et al., 2021)
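As an illustration of such a sensitivity analysis (a sketch, not the procedure of either paper quoted above), one can train the same architecture under several seeds and report the spread of the resulting validation scores; train_fn and eval_fn are hypothetical stand-ins:

    import statistics

    def seed_sensitivity(train_fn, eval_fn, train_data, val_data, seeds=range(10)):
        # Train the same architecture once per random seed and score each run.
        scores = [eval_fn(train_fn(train_data, random_seed=s), val_data)
                  for s in seeds]
        # A lower standard deviation indicates lower sensitivity to the seed.
        return statistics.mean(scores), statistics.stdev(scores)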
2.4 Risky use: Single fixed seed
NLP articles sometimes pick a single fixed random seed, claiming that this is done to improve consistency or replicability. For example:
An arbitrary but fixed random seed was used for each run to ensure reproducibility... (Le and Fokkens, 2018)
For consistency, we used the same set of hyperparameters and a fixed random seed across all experiments. (Lin et al., 2020)
Why is this risky? First, fixing the random seed does not guarantee replicability. For example, the TensorFlow library has a history of producing different results given the same random seeds, especially on GPUs (Two Sigma, 2017; Kanwar et al., 2021). Second, not optimizing the random seed hyperparameter has the same drawbacks as not optimizing any other hyperparameter: the reported performance will underestimate what the architecture could achieve with an optimized seed.
What should one do instead? The random seed should be optimized like any other hyperparameter. Dodge et al. (2020), for example, show that doing so leads to simpler models exceeding the published results of more complex state-of-the-art models on multiple GLUE tasks (Wang et al., 2018). The space of hyperparameters explored (and thus the number of random seeds explored) can be restricted to match the available compute resources with techniques such as random hyperparameter search (Bergstra and Bengio, 2012), where n hyperparameter settings are sampled from the space of all hyperparameter settings (with random seeds treated the same as all other hyperparameters). In an extremely resource-limited scenario, random search might select only a single value of some hyperparameter (such as the random seed), which might be acceptable given the constraints, but should probably be accompanied by an explicit acknowledgement of the risk of underestimating performance.
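As a sketch of this recipe (with hypothetical train_fn and eval_fn, and a purely illustrative search space), random search over a joint space that includes the random seed might look like:

    import random

    # Illustrative search space; the random seed is just another dimension.
    SEARCH_SPACE = {
        "learning_rate": [1e-5, 3e-5, 1e-4],
        "dropout": [0.1, 0.3, 0.5],
        "random_seed": list(range(10_000)),
    }

    def random_search(train_fn, eval_fn, train_data, val_data, space=SEARCH_SPACE, n=20):
        # Sample n settings uniformly from the joint space and keep the best
        # configuration according to validation performance.
        rng = random.Random()
        best_score, best_config = float("-inf"), None
        for _ in range(n):
            config = {name: rng.choice(values) for name, values in space.items()}
            score = eval_fn(train_fn(train_data, **config), val_data)
            if score > best_score:
                best_score, best_config = score, config
        return best_config, best_score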
2.5 Risky use: Performance comparison
It is a good idea to compare not just the point estimate of a single model's performance, but distributions of model performance, as comparing performance distributions results in more reliable conclusions (Reimers and Gurevych, 2017; Dodge et al., 2019; Radosavovic et al., 2020). However, it has sometimes been suggested that such distributions can be obtained by training the same architecture and varying only the random seed. For example:
We re-ran both implementations multiple times, each time only changing the