Bridging Machine Learning and Sciences: Opportunities and Challenges
Taoli Cheng
Mila, University of Montreal
chengtaoli.1990@gmail.com
Abstract
The application of machine learning in the sciences has seen exciting advances in recent years. As a widely applicable technique, anomaly detection has long been studied in the machine learning community. In particular, deep neural net-based out-of-distribution detection has made great progress for high-dimensional data. Recently, these techniques have been showing their potential in scientific disciplines. We take a critical look at their applicative prospects, including data universality, experimental protocols, model robustness, etc. We discuss examples that simultaneously display transferable practices and domain-specific challenges, providing a starting point for establishing a novel interdisciplinary research paradigm in the near future.
1. Introduction
The advances of the deep learning revolution have been expanding their influence across many domains and accelerating research in a generalized interdisciplinary manner. Neural nets, serving as general function approximators, have been employed in scientific applications including object identification/classification, anomaly/novelty detection, autonomous control, and neural net-based simulation. Great successes have been achieved in multiple scientific disciplines. A typical example is AlphaFold (Jumper et al., 2021) for accurate protein structure prediction. At the same time, much progress has been made in the physical sciences (Baldi et al., 2014; Ribli et al., 2019), biology (Ching et al., 2018), molecule generation / drug discovery (Gottipati et al., 2020), medical imaging (Zhang et al., 2020), etc.
Despite these successes, there are many challenges and unrecognized pitfalls in transporting machine learning techniques into more traditional science domains. Being unaware of these pitfalls can result in wasted effort and sometimes catastrophic consequences in real-world model deployment. In the following, we take a holistic look at the current collaborative scheme of machine learning applications for the sciences in this new cross-disciplinary research era. The differences between general machine learning (mainly focused on computer vision (CV) and natural language processing (NLP)) and tailored scientific applications reside in all parts of the pipeline. The following aspects build an intertwined picture of modern machine learning-assisted scientific discovery:
Nature of the data: In the natural sciences, especially the physical sciences, scientists usually strictly design (and often simplify) the experimental environment to probe the phenomena under consideration. Thus the nature of scientific data is controlled and largely free of external noise thanks to the laboratory settings. (And systematic uncertainties can normally be well estimated through control datasets.) However, the format of the data can be more complex, and less human-readable, than natural images.
Inference process: Model inference in real-world settings can come in complicated and varying formats. In scientific applications, the experimental focus usually defines the inference process, and the inference process in turn affects the interpretation of results.
Benchmarks: In the machine learning community, model evaluation is usually restricted to a few public benchmark datasets such as MNIST (Lecun et al., 1998), ImageNet (Deng et al., 2009), CIFAR (Krizhevsky, 2009), or SVHN (Netzer et al., 2011). This inevitably results in "overfitted" strategies and research focuses. In contrast, the domain sciences have not yet built common datasets for model training and evaluation, which sometimes makes model comparison difficult.
Uncertainty quantification: The uncertainty of deep neural net outputs can be hard to quantify due to the complexity involved. When models are applied in the sciences, the capability to incorporate uncertainty estimates into the pipeline is important for rigorous result interpretation and robust model prediction.
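One widely used recipe for such uncertainty estimates is a deep ensemble: train several independently initialized models and treat their disagreement as uncertainty. The minimal sketch below shows only the combination step; the per-member softmax outputs are hypothetical placeholders standing in for trained models.

```python
import numpy as np

def ensemble_uncertainty(member_probs):
    """Combine softmax outputs from K independently trained models.

    member_probs: array of shape (K, n_classes), one softmax vector
    per ensemble member for a single input. Returns the predictive
    mean and its entropy; high entropy signals an uncertain input.
    """
    mean = member_probs.mean(axis=0)
    entropy = -np.sum(mean * np.log(mean + 1e-12))
    return mean, entropy

# Hypothetical outputs on a 2-class task: members agreeing vs. disagreeing.
agree = np.array([[0.90, 0.10], [0.92, 0.08], [0.88, 0.12]])
disagree = np.array([[0.90, 0.10], [0.20, 0.80], [0.50, 0.50]])
```

Disagreement between members raises the predictive entropy, giving a single scalar that can be propagated into the downstream statistical interpretation.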
Generalization and Robustness: We would like deep models to perform well at various tasks in real-world environments. Furthermore, a common approach in the sciences is to first train models on labeled simulation data. When such models are applied to real data, the resulting dataset shift calls for remedies to retain accuracy and robustness, as in computer vision or natural language modeling. However, simulation-to-real adaptation in the sciences differs in that the two environments are usually both well understood, while considerably higher precision is normally required.
arXiv:2210.13441v2 [stat.ML] 2 Nov 2023
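The simulation-to-real correction mentioned above can be illustrated in its simplest form. Assuming, hypothetically, that the simulated and real feature densities are both known Gaussians, importance weights p_real(x)/p_sim(x) applied to simulated events recover real-data expectations; this is a sketch of the general reweighting idea, not a specific experiment's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariate shift: the simulated feature is centred at 0,
# while the corresponding real-data feature is centred at 0.5.
sim = rng.normal(0.0, 1.0, 10_000)
sim_mean, real_mean, sigma = 0.0, 0.5, 1.0

def gauss(x, mu, s):
    """Gaussian probability density."""
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# Importance weights p_real(x) / p_sim(x): reweighted simulated events
# reproduce expectations under the real-data distribution.
weights = gauss(sim, real_mean, sigma) / gauss(sim, sim_mean, sigma)
reweighted_mean = np.average(sim, weights=weights)
```

In practice the densities are rarely known in closed form, and the ratio is instead estimated, e.g. with a classifier trained to separate simulated from real events; the weighting step itself stays the same.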
Keeping these differences in mind helps shape research guidelines toward a well-focused and well-suited technology transfer. Adapting the workflow accordingly promotes and secures scientific applications and transforms the research paradigm in a universal manner. Finally, the interplay between machine learning and scientific discovery benefits from a communal understanding of the field's vocabulary, publishing traditions, collaboration schemes, and academic setups (Wagstaff, 2012). This new regime solicits novel community infrastructures for more impactful research in the next few years.
2. Scientific Discovery
Modern scientific discovery depends heavily on the practice of hypothesis testing and the corresponding statistical interpretation, which build on and refine established theories. A recent example of a successful scientific discovery is the Higgs boson (Englert & Brout, 1964; Higgs, 1964), observed at the Large Hadron Collider (Aad et al., 2012; Chatrchyan et al., 2012). The Higgs boson was predicted by physicists about four decades earlier, and its properties have been well studied over the intervening decades. This makes dedicated searches possible at particle colliders.
Figure 1. Invariant mass distribution in the channel of the Higgs boson decaying to two photons. Image taken from Ref. (Aad et al., 2012).
The most typical search strategy includes 1) choosing a search channel (i.e., defining the potential observational space for signals), 2) estimating the background using established simulation tools, 3) comparing the observed events with the estimated background, and 4) calculating the observed confidence level, setting observed exclusion limits or claiming an observation under the statistical interpretation (Read, 2002). Fig. 1 displays a typical discovery histogram for a bump hunt in the dimension of the Higgs boson mass.

Figure 2. A typical pipeline for machine learning applications in Large Hadron Collider physics, composed of event generation (MadGraph for the hard process, Pythia/Herwig for the parton shower, Delphes for fast or GEANT for full detector simulation), data preprocessing (data cleaning, kinematic analysis, jet clustering, image or 4-vector representations, training/test splitting; stored as hdf5 / numpy arrays), model training (TensorFlow/PyTorch, with architecture variations such as FCN, LSTM, GNN, and Transformer, and choices of loss function and latent representation), and model inference (evaluation via ROC/AUC/SIC, visualization, and downstream tasks).
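The statistical interpretation in step 4 is often summarized by an asymptotic significance. A minimal sketch using the standard Asimov formula for s expected signal events over an expected background b (the event counts below are purely illustrative):

```python
import math

def asymptotic_significance(s, b):
    """Median discovery significance for s expected signal events on
    top of an expected background b, in the Asimov approximation with
    no background uncertainty:
        Z = sqrt(2 * ((s + b) * ln(1 + s / b) - s)).
    For s << b this reduces to the familiar s / sqrt(b)."""
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

# Illustrative counts: 100 signal events over 400 background events.
z = asymptotic_significance(s=100.0, b=400.0)
```

The conventional discovery criterion is Z > 5 ("five sigma"); note that the full formula gives a slightly smaller value than the naive s / sqrt(b) when s is not negligible compared to b.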
Translated into the vocabulary of machine learning, the streamlined process consists of 1) defining a (weighted) subset of data classes of interest, 2) training on the datasets under examination, 3) evaluating the trained model on target test datasets, and 4) interpreting the evaluation metrics depending on the context. A typical workflow is depicted in Fig. 2.
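Step 3, producing the ROC/AUC/SIC metrics that appear in the pipeline of Fig. 2, can be sketched in plain NumPy. The Gaussian test scores below are a toy stand-in for a trained model's outputs; SIC denotes the significance improvement characteristic, tpr / sqrt(fpr), commonly used in collider tagging studies.

```python
import numpy as np

def evaluate(scores_sig, scores_bkg):
    """Compute the AUC (via the Mann-Whitney rank statistic) and the
    peak Significance Improvement Characteristic,
    SIC = tpr / sqrt(fpr), scanned over score thresholds."""
    auc = (scores_sig[:, None] > scores_bkg[None, :]).mean()
    thresholds = np.sort(scores_bkg)  # use background scores as cuts
    tpr = np.array([(scores_sig >= t).mean() for t in thresholds])
    fpr = np.array([(scores_bkg >= t).mean() for t in thresholds])
    keep = fpr > 0  # SIC undefined at zero false-positive rate
    sic_max = (tpr[keep] / np.sqrt(fpr[keep])).max()
    return auc, sic_max

# Toy stand-in for a trained tagger: signal scores sit above background.
rng = np.random.default_rng(1)
auc, sic_max = evaluate(rng.normal(1.0, 1.0, 2000),
                        rng.normal(0.0, 1.0, 2000))
```

A peak SIC above 1 indicates that cutting on the model's score improves the statistical significance of a counting experiment relative to no selection at all.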
The interpretation of results depends heavily on the associated scientific approach. Obviously, devoting all effort to increasing class-inclusive metrics (as is widely practiced in the machine learning community) could result in biased research protocols. In the following, we take a careful look at one application area bridging machine learning and the sciences: anomaly detection.
3. An Example: Out-of-Distribution Detection
Thanks to their capacity to process high-dimensional data, deep neural network-based out-of-distribution (OOD) detection methods in computer vision and natural language processing have shown great potential and seen much progress in the past few years (Hendrycks & Gimpel, 2017; Vaze et al., 2021; Ahmed & Courville, 2020; Ren et al., 2019b; Pang et al., 2021). Models trained on in-distribution (ID) data are expected to "know" what they don't know, and are thus used to detect unseen patterns as anomalies via the associated uncertainties or likelihoods. On the other hand, OOD detection also serves as a check for model calibration and
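The simplest instance of this idea is the maximum softmax probability (MSP) baseline of Hendrycks & Gimpel (2017), cited above: inputs whose top class probability is low are flagged as OOD. A minimal sketch (the logit values are illustrative):

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability (MSP) OOD score of Hendrycks &
    Gimpel (2017): a low top-class probability flags a likely
    out-of-distribution input."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)

# Illustrative logits: a confident in-distribution-like input vs. a
# flat, OOD-like one.
id_like = np.array([6.0, 0.5, 0.2])
ood_like = np.array([1.1, 1.0, 0.9])
```

Thresholding this score turns any trained classifier into a rudimentary anomaly detector, which is why MSP remains the standard baseline against which more elaborate OOD methods are compared.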