Bridging Machine Learning and Sciences: Opportunities and Challenges
Taoli Cheng
Mila, University of Montreal
chengtaoli.1990@gmail.com
Abstract
The application of machine learning in the sciences has seen exciting advances in recent years. As a widely applicable technique, anomaly detection has long been studied in the machine learning community. In particular, deep neural net-based out-of-distribution detection has made great progress for high-dimensional data. Recently, these techniques have been showing their potential in scientific disciplines. We take a critical look at their applicative prospects, including data universality, experimental protocols, model robustness, etc. We discuss examples that simultaneously display transferable practices and domain-specific challenges, providing a starting point for establishing a novel interdisciplinary research paradigm in the near future.
1. Introduction
The advances of the deep learning revolution have been expanding their influence across many domains and accelerating research in a generalized interdisciplinary manner. Neural nets, serving as general function approximators, have been employed in scientific applications including object identification/classification, anomaly/novelty detection, autonomous control, and neural net-based simulation. Great successes have been achieved in multiple scientific disciplines. A typical example is AlphaFold (Jumper et al., 2021) for accurate protein structure prediction. At the same time, much progress has been made in the physical sciences (Baldi et al., 2014; Ribli et al., 2019), biology (Ching et al., 2018), molecule generation / drug discovery (Gottipati et al., 2020), medical imaging (Zhang et al., 2020), etc.
Despite these successes, there are many challenges and unrecognized pitfalls in transporting machine learning techniques into more traditional science domains. Being unaware of these pitfalls can result in wasted effort and sometimes catastrophic consequences in real-world model deployment. In the following, we take a holistic look at the current collaborative scheme of machine learning applications for the sciences in this new cross-disciplinary research era. The differences between general machine learning (mainly focused on computer vision (CV) and natural language processing (NLP)) and tailored scientific applications reside in all parts of the pipeline. The following aspects build an intertwined picture of modern machine learning-assisted scientific discovery:
Nature of the data: In the natural sciences, especially the physical sciences, scientists usually strictly design (and often simplify) the experimental environment to probe the phenomena under consideration. Thus the nature of scientific data is controlled and largely free of external noise thanks to the laboratory settings. (And systematic uncertainties can normally be well estimated through control datasets.) However, the format of the data can be more complex, and less human-readable, than natural images.
Inference process: Model inference in real-world settings can come in complicated and varying formats. In scientific applications, the experimental focus usually defines the inference process, and the inference process in turn affects the interpretation of results.
Benchmarks: In the machine learning community, model evaluation is usually restricted to a few public benchmark datasets such as MNIST (Lecun et al., 1998), ImageNet (Deng et al., 2009), CIFAR (Krizhevsky, 2009), or SVHN (Netzer et al., 2011). This inevitably results in "overfitted" strategies and research focuses. In contrast, the domain sciences have not yet built common datasets for model training and evaluation, which sometimes makes model comparison difficult.
Uncertainty quantification: The uncertainty of deep neural net outputs can be hard to quantify due to the complexity involved. When models are applied in the sciences, the capability to incorporate uncertainty estimates into the pipeline is important for rigorous result interpretation and robust model prediction.
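One widely used recipe for such uncertainty estimates is a deep ensemble: train several independently initialized models and treat their disagreement as uncertainty. The minimal sketch below shows only the combination step; the per-member softmax outputs are hypothetical placeholders standing in for trained models.

```python
import numpy as np

def ensemble_uncertainty(member_probs):
    """Combine softmax outputs from K independently trained models.

    member_probs: array of shape (K, n_classes), one softmax vector
    per ensemble member for a single input. Returns the predictive
    mean and its entropy; high entropy signals an uncertain input.
    """
    mean = member_probs.mean(axis=0)
    entropy = -np.sum(mean * np.log(mean + 1e-12))
    return mean, entropy

# Hypothetical outputs on a 2-class task: members agreeing vs. disagreeing.
agree = np.array([[0.90, 0.10], [0.92, 0.08], [0.88, 0.12]])
disagree = np.array([[0.90, 0.10], [0.20, 0.80], [0.50, 0.50]])
```

Disagreement between members raises the predictive entropy, giving a single scalar that can be propagated into the downstream statistical interpretation.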
Generalization and Robustness: We would like deep models to perform well at various tasks in real-world environments. Furthermore, a common approach in the sciences is to first train models on labeled simulation data. When such models are applied to real data, the resulting dataset shift calls for remedies to retain accuracy and robustness, as in computer vision or natural language modeling. However, simulation-to-real adaptation in the sciences differs in that the two environments are usually both well understood, while considerably higher precision is normally required.
arXiv:2210.13441v2 [stat.ML] 2 Nov 2023
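The simulation-to-real correction mentioned above can be illustrated in its simplest form. Assuming, hypothetically, that the simulated and real feature densities are both known Gaussians, importance weights p_real(x)/p_sim(x) applied to simulated events recover real-data expectations; this is a sketch of the general reweighting idea, not a specific experiment's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariate shift: the simulated feature is centred at 0,
# while the corresponding real-data feature is centred at 0.5.
sim = rng.normal(0.0, 1.0, 10_000)
sim_mean, real_mean, sigma = 0.0, 0.5, 1.0

def gauss(x, mu, s):
    """Gaussian probability density."""
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# Importance weights p_real(x) / p_sim(x): reweighted simulated events
# reproduce expectations under the real-data distribution.
weights = gauss(sim, real_mean, sigma) / gauss(sim, sim_mean, sigma)
reweighted_mean = np.average(sim, weights=weights)
```

In practice the densities are rarely known in closed form, and the ratio is instead estimated, e.g. with a classifier trained to separate simulated from real events; the weighting step itself stays the same.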
Keeping these differences in mind helps shape research guidelines toward a well-focused and well-suited technology transfer. Adapting the workflow accordingly promotes and secures scientific applications and transforms the research paradigm in a universal manner. Finally, the interplay between machine learning and scientific discovery benefits from a communal understanding of the field's vocabulary, publishing traditions, collaboration schemes, and academic setups (Wagstaff, 2012). This new regime solicits novel community infrastructures for more impactful research in the next few years.
2. Scientific Discovery
Modern scientific discovery depends heavily on the practice of hypothesis testing and the corresponding statistical interpretation, which build on and refine established theories. A recent example of a successful scientific discovery is the Higgs boson (Englert & Brout, 1964; Higgs, 1964), observed at the Large Hadron Collider (Aad et al., 2012; Chatrchyan et al., 2012). The Higgs boson was predicted by physicists about four decades earlier, and its properties have been well studied over the intervening decades. This makes dedicated searches possible at particle colliders.
Figure 1. Invariant mass distribution in the channel of the Higgs boson decaying to two photons. Image taken from Ref. (Aad et al., 2012).
The most typical search strategy includes 1) choosing a search channel (i.e., defining the potential observational space for signals), 2) estimating the background using established simulation tools, 3) comparing the observed events with the estimated background, and 4) calculating the observed confidence level, setting observed exclusion limits or claiming an observation under the statistical interpretation (Read, 2002). Fig. 1 displays a typical discovery histogram for a bump hunt in the dimension of the Higgs boson mass.

Figure 2. A typical pipeline for machine learning applications in Large Hadron Collider physics, composed of event generation (MadGraph for the hard process, Pythia/Herwig for the parton shower, Delphes for fast or GEANT for full detector simulation), data preprocessing (data cleaning, kinematic analysis, jet clustering, image or 4-vector representations, training/test splitting; stored as hdf5 / numpy arrays), model training (TensorFlow/PyTorch, with architecture variations such as FCN, LSTM, GNN, and Transformer, and choices of loss function and latent representation), and model inference (evaluation via ROC/AUC/SIC, visualization, and downstream tasks).
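The statistical interpretation in step 4 is often summarized by an asymptotic significance. A minimal sketch using the standard Asimov formula for s expected signal events over an expected background b (the event counts below are purely illustrative):

```python
import math

def asymptotic_significance(s, b):
    """Median discovery significance for s expected signal events on
    top of an expected background b, in the Asimov approximation with
    no background uncertainty:
        Z = sqrt(2 * ((s + b) * ln(1 + s / b) - s)).
    For s << b this reduces to the familiar s / sqrt(b)."""
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

# Illustrative counts: 100 signal events over 400 background events.
z = asymptotic_significance(s=100.0, b=400.0)
```

The conventional discovery criterion is Z > 5 ("five sigma"); note that the full formula gives a slightly smaller value than the naive s / sqrt(b) when s is not negligible compared to b.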
Translated into the vocabulary of machine learning, the streamlined process consists of 1) defining a (weighted) subset of data classes of interest, 2) training on the datasets under examination, 3) evaluating the trained model on target test datasets, and 4) interpreting the evaluation metrics depending on the context. A typical workflow is depicted in Fig. 2.
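Step 3, producing the ROC/AUC/SIC metrics that appear in the pipeline of Fig. 2, can be sketched in plain NumPy. The Gaussian test scores below are a toy stand-in for a trained model's outputs; SIC denotes the significance improvement characteristic, tpr / sqrt(fpr), commonly used in collider tagging studies.

```python
import numpy as np

def evaluate(scores_sig, scores_bkg):
    """Compute the AUC (via the Mann-Whitney rank statistic) and the
    peak Significance Improvement Characteristic,
    SIC = tpr / sqrt(fpr), scanned over score thresholds."""
    auc = (scores_sig[:, None] > scores_bkg[None, :]).mean()
    thresholds = np.sort(scores_bkg)  # use background scores as cuts
    tpr = np.array([(scores_sig >= t).mean() for t in thresholds])
    fpr = np.array([(scores_bkg >= t).mean() for t in thresholds])
    keep = fpr > 0  # SIC undefined at zero false-positive rate
    sic_max = (tpr[keep] / np.sqrt(fpr[keep])).max()
    return auc, sic_max

# Toy stand-in for a trained tagger: signal scores sit above background.
rng = np.random.default_rng(1)
auc, sic_max = evaluate(rng.normal(1.0, 1.0, 2000),
                        rng.normal(0.0, 1.0, 2000))
```

A peak SIC above 1 indicates that cutting on the model's score improves the statistical significance of a counting experiment relative to no selection at all.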
The interpretation of results depends heavily on the associated scientific approach. Obviously, devoting all effort to increasing class-inclusive metrics (as is widely practiced in the machine learning community) could result in biased research protocols. In the following, we take a careful look at one application area bridging machine learning and the sciences: anomaly detection.
3. An Example: Out-of-Distribution Detection
Thanks to their capacity to process high-dimensional data, deep neural network-based out-of-distribution (OOD) detection methods in computer vision and natural language processing have shown great potential and seen much progress in the past few years (Hendrycks & Gimpel, 2017; Vaze et al., 2021; Ahmed & Courville, 2020; Ren et al., 2019b; Pang et al., 2021). Models trained on in-distribution (ID) data are expected to "know" what they don't know, and are thus used to detect unseen patterns as anomalies via the associated uncertainties or likelihoods. On the other hand, OOD detection also serves as a check for model calibration and
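The simplest instance of this idea is the maximum softmax probability (MSP) baseline of Hendrycks & Gimpel (2017), cited above: inputs whose top class probability is low are flagged as OOD. A minimal sketch (the logit values are illustrative):

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability (MSP) OOD score of Hendrycks &
    Gimpel (2017): a low top-class probability flags a likely
    out-of-distribution input."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)

# Illustrative logits: a confident in-distribution-like input vs. a
# flat, OOD-like one.
id_like = np.array([6.0, 0.5, 0.2])
ood_like = np.array([1.1, 1.0, 0.9])
```

Thresholding this score turns any trained classifier into a rudimentary anomaly detector, which is why MSP remains the standard baseline against which more elaborate OOD methods are compared.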