arXiv:2210.15435v1 [cond-mat.stat-mech] 26 Oct 2022

Grokking phase transitions in learning local rules with gradient descent
Bojan Žunkovič
Faculty of computer and information science, University of Ljubljana, Ljubljana, Slovenia
bojan.zunkovic@fri.uni-lj.si
Enej Ilievski
Faculty of mathematics and physics, University of Ljubljana, Ljubljana, Slovenia
Abstract
We discuss two solvable grokking (generalisation beyond overfitting) models in a rule learning scenario.
We show that grokking is a phase transition and find exact analytic expressions for the critical exponents,
grokking probability, and grokking time distribution. Further, we introduce a tensor-network map that
connects the proposed grokking setup with the standard (perceptron) statistical learning theory and show
that grokking is a consequence of the locality of the teacher model. As an example, we analyse the cellular
automata learning task, numerically determine the critical exponent and the grokking time distributions, and compare them with the predictions of the proposed grokking model. Finally, we numerically analyse
the connection between structure formation and grokking.
Contents
1 Introduction
2 Related work
3 Perceptron grokking
3.1 1D exponential model
3.1.1 Test error dynamics
3.1.2 Grokking probability
3.1.3 Grokking time
3.2 D-dimensional uniform ball model
3.2.1 Test error dynamics
3.2.2 Grokking probability
3.2.3 Grokking time
3.2.4 Critical exponents for a general isotropic data PDF
4 Learning local rules with shallow tensor networks
4.1 Local teacher model
4.2 Uniform tensor-network student model
4.2.1 Short introduction to tensor network methods
4.2.2 Tensor-network attention model
4.2.3 Tensor-network map
4.3 Simulation details and results
4.3.1 Constant attention tensors
4.3.2 Full model training and structure formation
5 Summary and discussion
A Grokking time in the 1D model
B Grokking probability in the D-dimensional ball model
B.1 Case λ1 = 0
B.2 Case 1 ≥ λ1 > 0
C Fixed attention
D Additional results for the 2-local and the 3-local rules
1 Introduction
Despite recent progress in understanding the double descent phenomenon [1,2,3,4], we still do not have a
complete theory of generalisation in over-parameterised models. Two recent empirical observations, neural
collapse [5] and grokking (generalisation beyond over-fitting) [6], can help us understand the training and
generalisation properties of over-parameterised models.
Neural collapse occurs in the terminal phase of training, i.e. the phase with zero train error. It refers to the
collapse of the N-dimensional last-layer features (input to the last/classification layer) [5] to a (C−1)-dimensional equiangular tight frame (ETF) structure, where C is the number of classes. The feature vectors
converge towards the vertices of the ETF structure such that features for each class are close to one vertex.
Also, the distance between the vertices is much larger than all intra-class feature variances. We can partially
understand neural collapse within the unconstrained features and local elasticity models [7]. However, its
role in generalisation, relation to grokking, and appearance of different latent space structures are still not
completely understood.
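For concreteness, the ETF mentioned above can be written down explicitly: the class vertices form a regular simplex with equal pairwise angles. The following minimal numpy sketch (an illustration of this standard construction, not code from [5]) builds the simplex ETF for C classes and checks that every vertex has unit norm and that every pair encloses the same angle.

```python
import numpy as np

def simplex_etf(C):
    """Vertices of the (C-1)-dimensional simplex equiangular tight frame,
    returned as the columns of a C x C matrix (they span a (C-1)-dimensional
    subspace of R^C)."""
    return np.sqrt(C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)

C = 5
M = simplex_etf(C)
gram = M.T @ M
# Every vertex has unit norm and every pair has the same inner product -1/(C-1).
print(np.allclose(np.diag(gram), 1.0))
print(np.allclose(gram[~np.eye(C, dtype=bool)], -1.0 / (C - 1)))
```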
Grokking also occurs during the terminal phase of training. When training on algorithmic datasets past
the zero train error, a sudden decrease of the test error from approximately one to zero is observed [6].
The grokking phenomenon has been discussed within an effective theory approach [8], where an empirical
connection between representation/structure formation and generalisation has been made. An empirical
study [9] established a relation between grokking, training loss spikes, and weight norm increases. However,
no exactly solvable model exhibiting the grokking phenomenon has been discussed so far. Further, it is
not clear how to reconcile grokking with the standard generalisation theory based on statistical learning
methods [10]. The statistical learning theory predicts (in a teacher-student setting) an algebraic decrease of the generalisation error with training time $t$ (or the number of training samples), scaling as $t^{-\nu}$ with $\nu = 1$ for most learning rules [10].
Grokking and neural collapse (or latent-space structure formation in general) have been observed in over-parameterised models. However, we do not know what the minimal framework is within which we can
understand these phenomena or if they are genuinely deep-network phenomena. We aim to formulate a
simple solvable model of grokking and relate it to latent-space structure formation and other common deep-
network training features, e.g. spikes in the training loss.
Main contributions – We have four main contributions:
- We propose a simple learning scenario that exhibits grokking (Section 3). We study two solvable models where grokking is a phase transition to zero test error and calculate exact critical exponents and grokking-time distributions.
- We discuss the teacher-student model within the tensor-network approach and map the standard supervised statistical-learning scenario in the thermodynamic limit to the proposed grokking setup (Section 4).
- We numerically study grokking and structure formation on the example of learning the 1D cellular automaton rule 30 (Section 4). We show that sudden spikes in the training loss correspond to structural changes in the latent-space representation of the data.
- Our analytical results and numerical experiments show a significant difference between L1 and L2 regularisation. The L1-regularised models have a larger grokking probability, shorter grokking time, shorter generalisation time, and smaller effective dimension compared to L2-regularised models.
Broader impact – The proposed exactly solvable grokking models are a step towards a theoretical understanding of the late learning phase and the generalisation benefits of the terminal phase of training. The in-
troduced tensor-network map connects the standard teacher-student setup in the thermodynamic limit with
the proposed grokking setup. It offers a new tool for studying generalisation properties of local rules (local
teacher-student models), which could lead to more complex learning dynamics (compared to the standard
infinite-range rules).
Although based on simple models, our results can also be relevant for deep-learning training practice. We conjecture that good generalisation is more likely in models whose latent-space data distributions have a small effective dimension. Our results hint that L1 regularisation can improve the generalisation properties of deep models compared to L2 regularisation. Further, we show that spikes in the loss (which often occur during
training of deep neural networks) correspond to latent space structural changes that can be beneficial or
detrimental to generalisation. Assuming this also holds in deep networks, we can use the latent-space effective dimension to decide whether to revert the model to a state before the spike or to continue training with the current model.
2 Related work
A sudden transition from zero to 100% accuracy on algorithmic datasets in over-fitted transformer models was first described in [6] and named grokking. In the grokking phase, a formation of simple structures reflecting the properties of the problem has been observed. This finding contradicts the common practice of early stopping and supports recent observations on the benefits of the terminal phase of training [11,1,12] and the double descent phenomenon [1,13,2,14]. In [8], an effective theory of grokking has been proposed. Within this effective theory, one can calculate the critical training set size needed to observe grokking. The authors relate grokking to a good representation (or structure formation) and introduce it as a phase between generalisation and memorisation. We go beyond these findings since we obtain exact solutions for the proposed setup and even calculate the grokking-time probability density function (PDF). A systematic experimental study of the grokking phenomenon has been presented in [9]. A slingshot mechanism (related to the edge of stability [15]) has been proposed as a necessary condition for grokking. The slingshot mechanism refers to the occurrence of cyclic spikes in the training loss and steps in the weight norms during training. The slingshot behaviour is not restricted to algorithmic datasets but is also present in various common classification tasks [9]. We find a similar behaviour, i.e. that grokking coincides with training loss spikes. Moreover, we connect training loss spikes with a discontinuous, step-like evolution of the effective dimension of the latent-space representation of the data, which indicates structural changes of that representation.
A particular structure formation common in deep classification neural networks is neural collapse (NC). It refers to four empirical/numerical observations made when training deep neural-network classifiers [5] (a sketch of how such observations can be quantified follows the list):
(NC1) Variability collapse – within-class feature variations become negligible.
(NC2) Convergence to an equiangular tight frame (ETF) – the class-mean vectors form equal-sized angles between any given pair.
(NC3) Convergence to self-duality – the class means and the linear classifiers converge to each other, up to rescaling.
(NC4) Simplification to nearest class center – the network classifier converges to a classifier that selects the class with the nearest train class mean.
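The sketch below shows one way to quantify the first and last of these observations on a set of last-layer features; the metrics (a within/between-class variance ratio for NC1 and the nearest-class-mean accuracy for NC4) are illustrative choices and not the exact definitions used in [5].

```python
import numpy as np

def nc_metrics(features, labels):
    """Illustrative neural-collapse diagnostics on last-layer features.

    NC1: ratio of within-class to between-class variance (should -> 0).
    NC4: accuracy of the nearest-class-mean classifier (should match the
         network classifier once collapse has occurred)."""
    classes = np.unique(labels)
    global_mean = features.mean(axis=0)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    within = np.mean([np.var(features[labels == c], axis=0).sum() for c in classes])
    between = np.var(means - global_mean, axis=0).sum()
    nc1 = within / between
    # Nearest-class-center predictions.
    dists = np.linalg.norm(features[:, None, :] - means[None, :, :], axis=-1)
    nc4_acc = np.mean(classes[dists.argmin(axis=1)] == labels)
    return nc1, nc4_acc

# Toy check: features clustered tightly around three well-separated means.
rng = np.random.default_rng(0)
true_means = 5.0 * rng.normal(size=(3, 16))
labels = rng.integers(0, 3, size=600)
feats = true_means[labels] + 0.01 * rng.normal(size=(600, 16))
print(nc_metrics(feats, labels))  # nc1 near 0, nearest-mean accuracy near 1
```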
The role of the loss function, the regularisation, the batch normalisation, and the optimiser has been
discussed within the unconstrained features model [16,17,7] and the local elasticity assumption [7]. The
relation of NC to generalisation properties has been discussed in [18,19,7]. However, no relation to grokking
has been discussed so far. Although we do not observe NC as defined above, our findings regarding the spikes
in the training loss and latent-space data structure might also be relevant for the NC dynamics.
Our main technical tools are tensor networks, which are models obtained by contracting many low-dimensional tensors. Tensor networks have been very successful in modelling many-body quantum systems. Recently, they have also been applied to machine learning tasks, in particular to classification problems [20,21,22,23,24,25,26,27], generative modelling [28,29,30,31], sequence and language modelling [32,33,34,35,36], and anomaly detection [37,38]. Besides, tensor networks have been used as tools to advance machine learning theory through the derivation of generalisation bounds [34], information-theoretical insights [39,40,41,42], and new connections between machine learning and physics [43,44,45]. Particularly relevant for latent-space structure formation is the connection between recurrent neural networks (RNNs) and matrix product states [46]. In [12], the benefits of the terminal phase of training for state-automaton extraction from RNNs (and hence matrix-product-state tensor networks) have been discussed. The authors find internal state-space compression and improved automaton extraction in the terminal phase of training. This is similar to our findings of reduced effective
dimension in the latent (internal) space. In contrast to [12], we introduce a new tensor network, similar to
the tensor-network attention model [36], and study grokking and structure formation in a teacher-student
learning setup.
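For readers unfamiliar with tensor networks, the following minimal sketch shows the basic object behind the models discussed above: a matrix product state that assigns a scalar score to a binary sequence by contracting one three-index tensor per site. The uniform (site-independent) tensor, the bond dimension, and the random initialisation are placeholder choices for illustration, not the tensor-network attention model introduced in Section 4.

```python
import numpy as np

def mps_score(tensors, x):
    """Contract a matrix product state with a binary input sequence.

    tensors: list of arrays of shape (bond_left, physical=2, bond_right),
             with the first and last bond dimensions equal to 1.
    x:       sequence of 0/1 values, one per site.
    Returns the scalar obtained by selecting the physical index x[i] at every
    site and multiplying the resulting bond matrices from left to right."""
    v = np.ones((1,))
    for A, xi in zip(tensors, x):
        v = v @ A[:, xi, :]          # (bond_left,) @ (bond_left, bond_right)
    return v.item()

# A random uniform MPS with bond dimension 3 on 8 sites.
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 2, 3)) / np.sqrt(3)
tensors = [A[:1]] + [A] * 6 + [A[:, :, :1]]
print(mps_score(tensors, [0, 1, 1, 0, 1, 0, 0, 1]))
```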
Finally, our work is related to the statistical-mechanics theory of supervised learning [10]. In the supervised
perceptron teacher-student case, an algebraic decrease of the generalisation error with the training set size
(training time) has been predicted [10]. A first-order phase transition has been derived only in a restricted
setting of discrete weights [10]. These results are typically based on the replica method [47], which requires
the thermodynamic limit, where both the number of samples and the dimension are large and their ratio
is fixed. Outstanding recent results in this direction concern the analysis of optimal generalisation errors
of generalised linear models [48,49]. We study the same teacher-student scenario but with a restriction to
a local teacher (still within the thermodynamic limit). The locality of the teacher/rule enables us to map
the problem to a finite-dimensional latent space, where we discuss grokking (a second-order phase transition)
and latent-space structure formation.
3 Perceptron grokking
We consider a simple binary classification problem that exhibits the grokking phenomenon. Let us assume we have a dataset $\mathcal{D}$ consisting of pairs $(\tilde{x}_i, y_i) \in \mathcal{D}$ with two linearly separable classes ($y_i \in \{-1, 1\}$) and $D$-dimensional features $\tilde{x}_i \in \mathbb{R}^D$. More precisely, the probability densities for the positive ($P_+$) and the negative ($P_-$) class are linearly separable in $\mathbb{R}^D$. Our model class is a simple perceptron in $D$ dimensions, namely
$$f(\tilde{x}) = \mathrm{sgn}(\hat{y}), \qquad \hat{y} = w \cdot \tilde{x} + b, \tag{1}$$
where $w, \tilde{x} \in \mathbb{R}^D$ and $b \in \mathbb{R}$. We sample $N$ positive and $N$ negative samples and then train the model with gradient descent
$$\frac{\partial \theta}{\partial t} = -\frac{\partial R}{\partial \theta}, \qquad R = \frac{1}{2N} \sum_{i=1}^{2N} \frac{1}{2} \left|\hat{y}_i - y_i\right|^2 + \lambda_1 \lVert\theta\rVert_1 + \frac{\lambda_2}{2} \lVert\theta\rVert_2^2, \tag{2}$$
where $\theta$ denotes the trainable parameters and $t$ the training time.
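For concreteness, the following is a minimal numerical sketch of this setup: a D-dimensional perceptron trained by full-batch gradient descent on the regularised quadratic risk of Eq. (2), with the train and test errors monitored so that a delayed drop of the test error (grokking) can be looked for. The data distribution, learning rate, and regularisation strengths are placeholder choices, not the ones analysed in the following subsections.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, lr, lam1, lam2 = 20, 50, 0.05, 1e-3, 0.0

def sample(n_per_class):
    """Two linearly separable classes: labels follow the sign of the first
    coordinate, pushed away from the decision boundary by a margin of 0.1."""
    y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    x = rng.normal(size=(2 * n_per_class, D))
    x[:, 0] = y * (np.abs(x[:, 0]) + 0.1)
    return x, y

x_tr, y_tr = sample(N)        # N positive and N negative training samples
x_te, y_te = sample(5000)     # large test set to estimate the test error

w, b = 0.1 * rng.normal(size=D), 0.0
for step in range(20001):
    y_hat = x_tr @ w + b
    err = (y_hat - y_tr) / (2 * N)     # gradient of the 1/(2N) * 1/2 |.|^2 term
    w -= lr * (x_tr.T @ err + lam1 * np.sign(w) + lam2 * w)
    b -= lr * err.sum()
    if step % 2000 == 0:
        train_err = np.mean(np.sign(x_tr @ w + b) != y_tr)
        test_err = np.mean(np.sign(x_te @ w + b) != y_te)
        print(f"step {step:6d}  train error {train_err:.3f}  test error {test_err:.3f}")
```

Whether and when the test error reaches zero depends on the sampled training set and on the regularisation strengths; the solvable models analysed below make this dependence exact.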