
there are classes of problems for which ML is not well-
suited [43,46]. We address this issue not to dissuade
researchers from using these methods, but rather to em-
power them to do so correctly, and to avoid wasting valu-
able time and resources. Understanding the limitations
and application spaces of our tools will help us build bet-
ter ones, and solve larger problems more confidently and
with more consistency.
II. THE DEVIL’S IN THE DISTANCE
Newcomers to the field of ML will find themselves im-
mediately buried under an avalanche of enticing algo-
rithms applicable to their scientific problem [47]. Many of
these choices are so sophisticated that it is unreasonable
to expect any ML non-expert to understand their finer
nuances and how/why they can fail. The steep learning
curve combined with their intrinsic complexity, mythical
“black box” nature and stunning ability to make accurate
predictions can make ML appear almost magical. It may
come as a surprise then that in spite of said complexi-
ties, almost all supervised ML models are paradigmat-
ically the same and are built upon a familiar quantity:
distance.
The supervised ML problem is one of minimizing the
distance between predicted and true values mapped by
an approximate function on the appropriate inputs. A
distance can be a proper metric, such as the Euclidean
or Manhattan norms, or something less pedestrian, such
as a divergence (a distance measure between two dis-
tributions) or a cross-entropy loss function. Regardless,
the principle is the same: consider un-regularized, super-
vised ML, where given a source of ground truth F and
an ML model fθ, the goal is to find parameters θ such
that the distance between F(x) and fθ(x) is as small as
possible for all x in some use case. While this
is only one type of ML, most techniques share this com-
mon theme. For example, Deep Q reinforcement learn-
ing [48] leverages neural networks to map states (inputs)
to decisions (outputs), and unsupervised learning algo-
rithms rely on the same notion of distance to perform
clustering and dimensionality reduction that supervised
learning techniques use to minimize loss functions. Vari-
ational autoencoders [49–51] try not only to minimize
reconstruction loss but also to keep a compressed,
latent representation as close to some target distribution
as possible (usually for use in generative applications).
Numerical optimization is the engine that systematically
tunes model parameters θ in gradient-based ML,¹ and its
only objective is to minimize some measure of distance
between ground truth and model predictions.
¹ The classic numerical optimizer is gradient/stochastic gradient
descent, with more recent developments showing systematic
improvements in deep learning, e.g. Adam [52].
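To make this distance-minimization picture concrete, the following is a minimal sketch, assuming a sine ground truth, a quadratic model form, and an illustrative learning rate (none of which come from the text): it fits the parameters θ by plain gradient descent on a mean-squared-error distance.

```python
import numpy as np

# Hypothetical ground truth F(x); any target function would do here.
def F(x):
    return np.sin(x)

# A simple parametric model f_theta(x) = t0 + t1*x + t2*x^2 (assumed form).
def f(theta, x):
    return theta[0] + theta[1] * x + theta[2] * x**2

# The "distance" to be minimized: mean-squared error between f_theta and F.
def distance(theta, x, y):
    return np.mean((f(theta, x) - y) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = F(x)

theta = np.zeros(3)
lr = 0.1  # illustrative learning rate
for _ in range(2000):
    residual = f(theta, x) - y
    # Analytic gradient of the MSE with respect to each parameter.
    grad = 2.0 * np.array([
        np.mean(residual),
        np.mean(residual * x),
        np.mean(residual * x**2),
    ])
    theta -= lr * grad  # gradient-descent update

print("fitted parameters:", theta)
print("final distance:", distance(theta, x, y))
```

Swapping the mean-squared error for a divergence or a cross-entropy loss changes the distance being minimized, but not the structure of the loop.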
Additionally, in order to increase the confidence that
ML models will be successful for a given task, it helps
if the desired function is smooth: i.e. a small change in
a feature should ideally correspond to a relatively small
change in the target. This idea is more readily defined
for regression than for classification, and the data be-
ing amenable to gradient-based methods is not strictly
required for ML to be successful: Random Forests, for
example, are not generally trained using gradient-based
optimizers. Nevertheless, satisfying this smoothness requirement will
usually help models generalize more effectively. The dis-
tance between the features of any two points of data is
informed entirely by their numerical vector representa-
tion, and while these representations can be intuitive or
human-interpretable, they must be mathematically rig-
orous.
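How much the representation matters is easy to see with a toy calculation. The snippet below uses two made-up feature vectors (the feature names and values are illustrative, not drawn from any dataset) and shows that an innocuous change of units reorders which points look "close".

```python
import numpy as np

# Two hypothetical data points: [bond length (Å), formation energy (eV)].
a = np.array([1.5, -2.0])
b = np.array([1.6, -2.5])

print("Euclidean:", np.linalg.norm(a - b))   # L2 norm
print("Manhattan:", np.abs(a - b).sum())     # L1 norm

# Quoting the bond length in picometers instead of Ångströms rescales one
# axis by 100x, and that axis now dominates the distance entirely.
a_pm = np.array([150.0, -2.0])
b_pm = np.array([160.0, -2.5])
print("Euclidean after unit change:", np.linalg.norm(a_pm - b_pm))
```

Standardizing features before computing distances is the usual remedy.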
The devil here, so to speak, is that what might appear
intuitive to the experimenter may not be to the machine.
For example, consider the problem of discovering novel
molecules with certain properties. Molecules can be first
encoded in string format (e.g. SMILES [53]) and then
mapped to a numerical latent representation. The structure of this
latent space is informed by some target property [16], and
because any point in the latent space is just a numeric
vector living in a vector space, a distance can be easily
defined. This powerful encoding method can be used to
“interpolate between molecules” and thus discover new
ones that perhaps we haven’t previously considered, but
it still relies on the principle of distance, both between
molecules in the compressed latent space and between
their target properties.
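A minimal sketch of the interpolation idea, under the assumption that a trained encoder/decoder pair already exists (the encode/decode calls mentioned in the comments are hypothetical placeholders, not an API from [16] or [53]): new candidates are generated by walking linearly between two latent vectors.

```python
import numpy as np

def interpolate_latent(z_a, z_b, n_steps=5):
    """Linearly interpolate between two latent vectors z_a and z_b."""
    return [(1.0 - t) * z_a + t * z_b for t in np.linspace(0.0, 1.0, n_steps)]

# Hypothetical latent codes for two molecules, as would be produced by a
# trained encoder, e.g. z_a = encode(smiles_a) (encode() is assumed).
z_a = np.array([0.2, -1.1, 0.7])
z_b = np.array([-0.5, 0.3, 1.4])

for z in interpolate_latent(z_a, z_b):
    # In practice each z would be passed to the decoder, e.g. decode(z),
    # to propose a candidate molecule; here we just inspect the vectors.
    print(np.round(z, 3))
```

Linear interpolation is the simplest possible choice; the point is only that "between two molecules" is well-defined because the latent space carries a distance.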
Concretely, the length scales for differentiating be-
tween data points in the feature space are set by the
corresponding targets. Large changes in target values
between data points can cause ML models to “focus” on
the changes in the input space that caused them, possibly
at the expense of failing to capture small changes. This
is often referred to as the bias-variance trade-off. Most
readers may be familiar with the concept of over-fitting:
for instance, essentially any set of observations can be
fit exactly by an arbitrarily high-order polynomial, but
doing so will produce wildly varying results during infer-
ence and be unlikely to have captured anything meaning-
ful about the underlying function. Conversely, a linear fit
will only capture the simplest trends, to the point of
being useless for any nonlinear phenomena. Fig. 1 show-
cases a common middle ground, where the primary trend
of the data is captured by a Gaussian Process [54], and
smaller fluctuations are understood as noisy variations
around that trend.
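The over- and under-fitting extremes are easy to reproduce. The sketch below (synthetic noisy data and illustrative polynomial degrees, not the data behind Fig. 1) fits a line, a mid-order polynomial, and a high-order polynomial to the same samples and compares their errors on held-out points.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(-1.0, 1.0, 20))
y_train = np.sin(np.pi * x_train) + rng.normal(0.0, 0.1, x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = np.sin(np.pi * x_test)  # noiseless ground truth for evaluation

for degree in (1, 4, 12):  # under-fit, reasonable fit, over-fit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The high-order fit drives the training error toward zero while the held-out error grows, which is exactly the trade-off described above.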
Consider a more realistic example: the Materials
Project [55] database contains many geometry-relaxed
structures, each with different compositions, space
groups and local symmetries at 0 Kelvin. Thus, within
this database, changes in e.g. the optical properties of
these materials are primarily due to the aforementioned
structural differences and not due to thermal disorder
(i.e. distortions) one would find when running a molec-