Optimisation & Generalisation
in Networks of Neurons
Thesis by
Jeremy Bernstein
In Partial Fulfillment of the Requirements for
the Degree of Doctor of Philosophy
California Institute of Technology
Pasadena, California
2023
Defended September 23, 2022
©2023
Jeremy Bernstein
ORCID: 0000-0001-9110-7476
https://jeremybernste.in/
All rights reserved
ACKNOWLEDGEMENTS
I am grateful to the following people: My dear friends and my dear family,
without whom this thesis would not have been written. My advisor Yisong
Yue. The Yue Crew. My close collaborators Kamyar Azizzadenesheli, Dawna
Bagherian, Alex Farhang, Kevin Huang, Yang Liu, Kushal Tirumala and Jiawei
Zhao. My internship mentors Ming-Yu Liu, Arash Vahdat, Yu-Xiang Wang and
Greg Yang. My thesis committee Ming-Yu Liu, Markus Meister, Matt Thomson
and Joel Tropp. My Computation & Neural Systems cohort Jon Kenny, Matt
Rosenberg, Anish Sarma and Tony Zhang—as well as head honchos Pietro
Perona and Thanos Siapas. Ollie Stephenson and everyone at Caltech Letters.
My co-conspirators David Brown and Tatyana Dobreva. Laura Flower Kim
and Daniel Yoder at International Student Programs. Natalie Gilmore in the
Graduate Studies Office. Claire Ralph in Computing & Mathematical Sciences.
Athena Castro and Greg Fletcher at Caltech Y. Thank you for your presence,
advice, friendship and support, which have enriched my life.
The artwork in this thesis was created by OpenAI’s DALL·E diffusion model.
ABSTRACT
The goal of this thesis is to develop the optimisation- and generalisation-theoretic
foundations of learning in artificial neural networks. The thesis tackles two
central questions. Given training data and a network architecture:
1) Which weight setting will generalise best to unseen data, and why?
2) What optimiser should be used to recover this weight setting?
On optimisation, an essential feature of neural network training is that the
network weights affect the loss function only indirectly through their appearance
in the network architecture. This thesis proposes a three-step framework for
deriving novel “architecture aware” optimisation algorithms. The first step—
termed functional majorisation—is to majorise a series expansion of the loss
function in terms of functional perturbations. The second step is to derive
architectural perturbation bounds that relate the size of functional perturbations
to the size of weight perturbations. The third step is to substitute these
architectural perturbation bounds into the functional majorisation of the loss
and to obtain an optimisation algorithm via minimisation. This constitutes an
application of the majorise-minimise meta-algorithm to neural networks.
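To fix ideas, here is a minimal worked instance of the majorise-minimise idea for
a generic smooth loss, stated in illustrative notation rather than the thesis's own:
assuming the gradient of the loss $\mathcal{L}$ is $\lambda$-Lipschitz in the
weights $w$, the loss admits a quadratic majorisation, and minimising that
majorisation over the weight perturbation $\Delta w$ yields a closed-form update:
\begin{align*}
\mathcal{L}(w + \Delta w) &\leq \mathcal{L}(w) + \nabla\mathcal{L}(w)^\top \Delta w + \tfrac{\lambda}{2}\,\lVert \Delta w \rVert_2^2, && \text{(majorise)} \\
\Delta w^\star &= \operatorname*{arg\,min}_{\Delta w} \Big[ \nabla\mathcal{L}(w)^\top \Delta w + \tfrac{\lambda}{2}\,\lVert \Delta w \rVert_2^2 \Big] = -\tfrac{1}{\lambda}\,\nabla\mathcal{L}(w). && \text{(minimise)}
\end{align*}
Iterating $w \leftarrow w + \Delta w^\star$ is gradient descent with step size
$1/\lambda$. The architecture-aware algorithms of the thesis arise by replacing
this generic quadratic bound, which is blind to the network architecture, with
the functional majorisation and architectural perturbation bounds described above.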
On generalisation, a promising recent line of work has applied PAC-Bayes
theory to derive non-vacuous generalisation guarantees for neural networks.
Since these guarantees control the average risk of ensembles of networks, they
do not address which individual network should generalise best. To close this
gap, the thesis rekindles an old idea from the kernels literature: the Bayes
point machine. A Bayes point machine is a single classifier that approximates
the aggregate prediction of an ensemble of classifiers. Since aggregation reduces
the variance of ensemble predictions, Bayes point machines tend to generalise
better than other ensemble members. The thesis shows that the space of neural
networks consistent with a training set concentrates on a Bayes point machine
if both the network width and normalised margin are sent to infinity. This
motivates the practice of returning a wide network of large normalised margin.
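To make the aggregation idea concrete, here is a schematic statement for binary
classification, in illustrative notation that is not taken from the thesis. Write
$Q$ for the distribution over classifiers consistent with the training set. The
ensemble's aggregate (majority-vote) prediction at an input $x$, and the defining
property of a Bayes point machine $f^\star$, are
\[
\mathrm{agg}_Q(x) \;=\; \operatorname{sign} \, \mathbb{E}_{f \sim Q}\big[\operatorname{sign} f(x)\big],
\qquad
\operatorname{sign} f^\star(x) \;\approx\; \mathrm{agg}_Q(x) \ \text{ for typical inputs } x.
\]
On this reading, the concentration result says that in the limit of large width
and large normalised margin, a typical network consistent with the training set
already satisfies the approximation above, so a single wide network of large
normalised margin inherits the variance reduction enjoyed by the ensemble.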
Potential applications of these ideas include novel methods for uncertainty
quantification, more efficient numerical representations for neural hardware,
and optimisers that transfer hyperparameters across learning problems.
TABLE OF CONTENTS
Acknowledgements
Abstract
Table of Contents
List of Figures & Tables
Notation

I Introduction
Chapter 1: Finding the Foundations
  1.1 Optimisation via perturbation
  1.2 Generalisation via aggregation
Chapter 2: Constructing Spaces of Functions
  2.1 Kernel methods
  2.2 Gaussian processes
  2.3 Neural networks
Chapter 3: Correspondences Between Function Spaces
  3.1 GP posterior mean is a kernel interpolator
  3.2 GP posterior variance bounds the error of kernel interpolation
  3.3 Making the GP posterior concentrate on a kernel interpolator
  3.4 Neural network–Gaussian process correspondence

II Optimisation
Chapter 4: The Majorise-Minimise Meta-Algorithm
  4.1 Perturbation analysis: Expansions and bounds
  4.2 Majorise-minimise
  4.3 Instantiations of the meta-algorithm
  4.4 Trade-off between computation and fidelity
Chapter 5: Majorise-Minimise for Learning Problems
  5.1 Expanding the loss as a series in functional perturbations
  5.2 Functional majorisation of square loss
  5.3 First application: Deriving gradient descent
  5.4 Second application: Deriving the Gauss-Newton method
Chapter 6: Majorise-Minimise for Deep Networks
  6.1 The deep linear network
  6.2 Architectural perturbation bounds for deep linear networks
  6.3 Majorise-minimise for deep linear networks
  6.4 Experimental tests with relu networks