
Bridging Machine Learning and Sciences: Opportunities and Challenges
Taoli Cheng
Mila, University of Montreal
chengtaoli.1990@gmail.com
Abstract
The application of machine learning in the sciences has seen exciting advances in recent years. As a widely applicable technique, anomaly detection has long been studied in the machine learning community. In particular, deep neural network-based out-of-distribution detection has made great progress for high-dimensional data. Recently, these techniques have been showing their potential in scientific disciplines. We take a critical look at their application prospects, including data universality, experimental protocols, and model robustness. We discuss examples that display transferable practices and domain-specific challenges simultaneously, providing a starting point for establishing a novel interdisciplinary research paradigm in the near future.
1. Introduction
The deep learning revolution has been expanding its influence across many domains and accelerating research in a broadly interdisciplinary manner. Neural networks, serving as general function approximators, have been employed in scientific applications including object identification/classification, anomaly/novelty detection, autonomous control, and neural network-based simulation. Great successes have been achieved in multiple scientific disciplines. A typical example is AlphaFold (Jumper et al., 2021) for accurate protein structure prediction. At the same time, much progress has been made in the physical sciences (Baldi et al., 2014; Ribli et al., 2019), biology (Ching et al., 2018), molecule generation and drug discovery (Gottipati et al., 2020), and medical imaging (Zhang et al., 2020).
Despite these successes, there are many challenges and unrecognized pitfalls in transferring machine learning techniques into more traditional science domains. Not being aware of these possible pitfalls could result in wasted effort and sometimes catastrophic consequences in real-world model deployment. In the following, we take a holistic look at the current collaborative scheme for machine learning applications in the sciences in this new cross-disciplinary research era.
The differences between general machine learning (mainly focused on computer vision (CV) and natural language processing (NLP)) and tailored scientific applications reside in all parts of the pipeline. The following aspects form an intertwined picture of modern machine learning-assisted scientific discovery:
• Nature of the data: In the natural sciences, especially the physical sciences, scientists usually strictly design (and often simplify) the experimental environments to probe the phenomena under consideration. The nature of scientific data is thus controlled, largely free of external noise thanks to the laboratory settings. (And normally systematic uncertainties can be well estimated through control datasets.) However, the format of the data can be more complex (and less human-readable) than natural images.
• Inference process: Model inference in real-world settings can come in complicated and varying formats. In scientific applications, the experimental focus usually defines the inference process, and the inference process in turn affects the interpretation of results.
• Benchmarks: In the machine learning community, the benchmark datasets used for model evaluation are usually restricted to a few public datasets such as MNIST (LeCun et al., 1998), ImageNet (Deng et al., 2009), CIFAR (Krizhevsky, 2009), or SVHN (Netzer et al., 2011). This inevitably results in "overfitted" strategies and research focuses. In contrast, domain sciences have not yet built common datasets for model training and evaluation, which sometimes makes model comparison difficult.
• Uncertainty quantification: The uncertainty of deep neural network outputs can be hard to quantify due to the complexity involved. When applied in the sciences, the capability to incorporate uncertainty estimates into the pipeline is important for rigorous result interpretation and robust model prediction.
• Generalization and Robustness: We would like deep models to perform well across various tasks in real-world environments. Furthermore, a common approach