Optimisation & Generalisation
in Networks of Neurons
Thesis by
Jeremy Bernstein
In Partial Fulfillment of the Requirements for
the Degree of Doctor of Philosophy
California Institute of Technology
Pasadena, California
2023
Defended September 23, 2022
©2023
Jeremy Bernstein
ORCID: 0000-0001-9110-7476
https://jeremybernste.in/
All rights reserved
ACKNOWLEDGEMENTS
I am grateful to the following people: My dear friends and my dear family,
without whom this thesis would not have been written. My advisor Yisong
Yue. The Yue Crew. My close collaborators Kamyar Azizzadenesheli, Dawna
Bagherian, Alex Farhang, Kevin Huang, Yang Liu, Kushal Tirumala and Jiawei
Zhao. My internship mentors Ming-Yu Liu, Arash Vahdat, Yu-Xiang Wang and
Greg Yang. My thesis committee Ming-Yu Liu, Markus Meister, Matt Thomson
and Joel Tropp. My Computation & Neural Systems cohort Jon Kenny, Matt
Rosenberg, Anish Sarma and Tony Zhang—as well as head honchos Pietro
Perona and Thanos Siapas. Ollie Stephenson and everyone at Caltech Letters.
My co-conspirators David Brown and Tatyana Dobreva. Laura Flower Kim
and Daniel Yoder at International Student Programs. Natalie Gilmore in the
Graduate Studies Office. Claire Ralph in Computing & Mathematical Sciences.
Athena Castro and Greg Fletcher at Caltech Y. Thank you for your presence,
advice, friendship and support, which have enriched my life.
The artwork in this thesis was created by OpenAI’s DALL·E diffusion model.
ABSTRACT
The goal of this thesis is to develop the optimisation- and generalisation-theoretic
foundations of learning in artificial neural networks. The thesis tackles two
central questions. Given training data and a network architecture:
1) Which weight setting will generalise best to unseen data, and why?
2) What optimiser should be used to recover this weight setting?
On optimisation, an essential feature of neural network training is that the
network weights affect the loss function only indirectly through their appearance
in the network architecture. This thesis proposes a three-step framework for
deriving novel “architecture aware” optimisation algorithms. The first step—
termed functional majorisation—is to majorise a series expansion of the loss
function in terms of functional perturbations. The second step is to derive
architectural perturbation bounds that relate the size of functional perturbations
to the size of weight perturbations. The third step is to substitute these
architectural perturbation bounds into the functional majorisation of the loss
and to obtain an optimisation algorithm via minimisation. This constitutes an
application of the majorise-minimise meta-algorithm to neural networks.
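To fix ideas, here is a minimal worked instance of the majorise-minimise idea for
a generic smooth loss, stated in illustrative notation rather than the thesis's own:
assuming the gradient of the loss $\mathcal{L}$ is $\lambda$-Lipschitz in the
weights $w$, the loss admits a quadratic majorisation, and minimising that
majorisation over the weight perturbation $\Delta w$ yields a closed-form update:
\begin{align*}
\mathcal{L}(w + \Delta w) &\leq \mathcal{L}(w) + \nabla\mathcal{L}(w)^\top \Delta w + \tfrac{\lambda}{2}\,\lVert \Delta w \rVert_2^2, && \text{(majorise)} \\
\Delta w^\star &= \operatorname*{arg\,min}_{\Delta w} \Big[ \nabla\mathcal{L}(w)^\top \Delta w + \tfrac{\lambda}{2}\,\lVert \Delta w \rVert_2^2 \Big] = -\tfrac{1}{\lambda}\,\nabla\mathcal{L}(w). && \text{(minimise)}
\end{align*}
Iterating $w \leftarrow w + \Delta w^\star$ is gradient descent with step size
$1/\lambda$. The architecture-aware algorithms of the thesis arise by replacing
this generic quadratic bound, which is blind to the network architecture, with
the functional majorisation and architectural perturbation bounds described above.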
On generalisation, a promising recent line of work has applied PAC-Bayes
theory to derive non-vacuous generalisation guarantees for neural networks.
Since these guarantees control the average risk of ensembles of networks, they
do not address which individual network should generalise best. To close this
gap, the thesis rekindles an old idea from the kernels literature: the Bayes
point machine. A Bayes point machine is a single classifier that approximates
the aggregate prediction of an ensemble of classifiers. Since aggregation reduces
the variance of ensemble predictions, Bayes point machines tend to generalise
better than other ensemble members. The thesis shows that the space of neural
networks consistent with a training set concentrates on a Bayes point machine
if both the network width and normalised margin are sent to infinity. This
motivates the practice of returning a wide network of large normalised margin.
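To make the aggregation idea concrete, here is a schematic statement for binary
classification, in illustrative notation that is not taken from the thesis. Write
$Q$ for the distribution over classifiers consistent with the training set. The
ensemble's aggregate (majority-vote) prediction at an input $x$, and the defining
property of a Bayes point machine $f^\star$, are
\[
\mathrm{agg}_Q(x) \;=\; \operatorname{sign} \, \mathbb{E}_{f \sim Q}\big[\operatorname{sign} f(x)\big],
\qquad
\operatorname{sign} f^\star(x) \;\approx\; \mathrm{agg}_Q(x) \ \text{ for typical inputs } x.
\]
On this reading, the concentration result says that in the limit of large width
and large normalised margin, a typical network consistent with the training set
already satisfies the approximation above, so a single wide network of large
normalised margin inherits the variance reduction enjoyed by the ensemble.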
Potential applications of these ideas include novel methods for uncertainty
quantification, more efficient numerical representations for neural hardware,
and optimisers that transfer hyperparameters across learning problems.
TABLE OF CONTENTS
Acknowledgements
Abstract
Table of Contents
List of Figures & Tables
Notation

I Introduction
Chapter 1: Finding the Foundations
  1.1 Optimisation via perturbation
  1.2 Generalisation via aggregation
Chapter 2: Constructing Spaces of Functions
  2.1 Kernel methods
  2.2 Gaussian processes
  2.3 Neural networks
Chapter 3: Correspondences Between Function Spaces
  3.1 GP posterior mean is a kernel interpolator
  3.2 GP posterior variance bounds the error of kernel interpolation
  3.3 Making the GP posterior concentrate on a kernel interpolator
  3.4 Neural network–Gaussian process correspondence

II Optimisation
Chapter 4: The Majorise-Minimise Meta-Algorithm
  4.1 Perturbation analysis: Expansions and bounds
  4.2 Majorise-minimise
  4.3 Instantiations of the meta-algorithm
  4.4 Trade-off between computation and fidelity
Chapter 5: Majorise-Minimise for Learning Problems
  5.1 Expanding the loss as a series in functional perturbations
  5.2 Functional majorisation of square loss
  5.3 First application: Deriving gradient descent
  5.4 Second application: Deriving the Gauss-Newton method
Chapter 6: Majorise-Minimise for Deep Networks
  6.1 The deep linear network
  6.2 Architectural perturbation bounds for deep linear networks
  6.3 Majorise-minimise for deep linear networks
  6.4 Experimental tests with relu networks