arXiv:2210.15435v1 [cond-mat.stat-mech] 26 Oct 2022

Grokking phase transitions in learning local rules with gradient descent
Bojan Žunkovič
Faculty of computer and information science, University of Ljubljana, Ljubljana, Slovenia
bojan.zunkovic@fri.uni-lj.si
Enej Ilievski
Faculty of mathematics and physics, University of Ljubljana, Ljubljana, Slovenia
Abstract
We discuss two solvable grokking (generalisation beyond overfitting) models in a rule learning scenario.
We show that grokking is a phase transition and find exact analytic expressions for the critical exponents,
grokking probability, and grokking time distribution. Further, we introduce a tensor-network map that
connects the proposed grokking setup with the standard (perceptron) statistical learning theory and show
that grokking is a consequence of the locality of the teacher model. As an example, we analyse the cellular
automata learning task, numerically determine the critical exponent and the grokking time distributions, and compare them with the predictions of the proposed grokking model. Finally, we numerically analyse
the connection between structure formation and grokking.
Contents
1 Introduction
2 Related work
3 Perceptron grokking
3.1 1D exponential model
3.1.1 Test error dynamics
3.1.2 Grokking probability
3.1.3 Grokking time
3.2 D-dimensional uniform ball model
3.2.1 Test error dynamics
3.2.2 Grokking probability
3.2.3 Grokking time
3.2.4 Critical exponents for a general isotropic data PDF
4 Learning local rules with shallow tensor networks
4.1 Local teacher model
4.2 Uniform tensor-network student model
4.2.1 Short introduction to tensor network methods
4.2.2 Tensor-network attention model
4.2.3 Tensor-network map
4.3 Simulation details and results
4.3.1 Constant attention tensors
4.3.2 Full model training and structure formation
5 Summary and discussion
A Grokking time in the 1D model
B Grokking probability in the D-dimensional ball model
B.1 Case λ1 = 0
B.2 Case 1 ≥ λ1 > 0
C Fixed attention
D Additional results for the 2-local and the 3-local rules
1 Introduction
Despite recent progress in understanding the double descent phenomenon [1,2,3,4], we still do not have a
complete theory of generalisation in over-parameterised models. Two recent empirical observations, neural
collapse [5] and grokking (generalisation beyond over-fitting) [6], can help us understand the training and
generalisation properties of over-parameterised models.
Neural collapse occurs in the terminal phase of training, i.e. the phase with zero train error. It refers to the
collapse of the N-dimensional last-layer features (input to the last/classification layer) [5] to a (C−1)-dimensional equiangular tight frame (ETF) structure, where C is the number of classes. The feature vectors
converge towards the vertices of the ETF structure such that features for each class are close to one vertex.
Also, the distance between the vertices is much larger than all intra-class feature variances. We can partially
understand neural collapse within the unconstrained features and local elasticity models [7]. However, its
role in generalisation, relation to grokking, and appearance of different latent space structures are still not
completely understood.
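For concreteness, the ETF mentioned above can be written down explicitly: the class vertices form a regular simplex with equal pairwise angles. The following minimal numpy sketch (an illustration of this standard construction, not code from [5]) builds the simplex ETF for C classes and checks that every vertex has unit norm and that every pair encloses the same angle.

```python
import numpy as np

def simplex_etf(C):
    """Vertices of the (C-1)-dimensional simplex equiangular tight frame,
    returned as the columns of a C x C matrix (they span a (C-1)-dimensional
    subspace of R^C)."""
    return np.sqrt(C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)

C = 5
M = simplex_etf(C)
gram = M.T @ M
# Every vertex has unit norm and every pair has the same inner product -1/(C-1).
print(np.allclose(np.diag(gram), 1.0))
print(np.allclose(gram[~np.eye(C, dtype=bool)], -1.0 / (C - 1)))
```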
Grokking also occurs during the terminal phase of training. When training on algorithmic datasets past
the zero train error, a sudden decrease of the test error from approximately one to zero is observed [6].
The grokking phenomenon has been discussed within an effective theory approach [8], where an empirical
connection between representation/structure formation and generalisation has been made. An empirical
study [9] established a relation between grokking, training loss spikes, and weight norm increases. However,
no exactly solvable model exhibiting the grokking phenomenon has been discussed so far. Further, it is
not clear how to reconcile grokking with the standard generalisation theory based on statistical learning
methods [10]. The statistical learning theory predicts (in a teacher-student setting) an algebraic decrease of the generalisation error with training time $t$ (or the number of training samples), scaling as $t^{-\nu}$ with $\nu = 1$ for most learning rules [10].
Grokking and neural collapse (or latent-space structure formation in general) have been observed in over-parameterised models. However, we do not know what the minimal framework is within which we can
understand these phenomena or if they are genuinely deep-network phenomena. We aim to formulate a
simple solvable model of grokking and relate it to latent-space structure formation and other common deep-
network training features, e.g. spikes in the training loss.
Main contributions – We have four main contributions:
- We propose a simple learning scenario that exhibits grokking (Section 3). We study two solvable models where grokking is a phase transition to zero test error and calculate exact critical exponents and grokking-time distributions.
- We discuss the teacher-student model within the tensor-network approach and map the standard supervised statistical-learning scenario in the thermodynamic limit to the proposed grokking setup (Section 4).
- We numerically study grokking and structure formation on the example of learning the 1D cellular automaton rule 30 (Section 4). We show that sudden spikes in the training loss correspond to structural changes in the latent-space representation of the data.
- Our analytical results and numerical experiments show a significant difference between L1 and L2 regularisation. The L1-regularised models have a larger grokking probability, shorter grokking time, shorter generalisation time, and smaller effective dimension compared to L2-regularised models.
Broader impact – The proposed exactly solvable grokking models are a step towards a theoretical understanding of the late learning phase and the generalisation benefits of the terminal phase of training. The in-
troduced tensor-network map connects the standard teacher-student setup in the thermodynamic limit with
the proposed grokking setup. It offers a new tool for studying generalisation properties of local rules (local
teacher-student models), which could lead to more complex learning dynamics (compared to the standard
infinite-range rules).
Although based on simple models, our results can also be relevant for deep-learning training practice. We conjecture that good generalisation is more likely in models whose latent-space data distributions have a small effective dimension. Our results hint that L1 regularisation can improve the generalisation properties of deep models compared to L2 regularisation. Further, we show that spikes in the loss (which often occur during
training of deep neural networks) correspond to latent space structural changes that can be beneficial or
detrimental to generalisation. Assuming this also holds in deep networks, we can use the latent-space effective dimension to decide whether to revert the model to a state before the spike or to continue training with the current model.
2 Related work
A sudden transition from zero to 100% accuracy on algorithmic datasets in over-fitted transformer models was first described in [6] and named grokking. In the grokking phase, a formation of simple structures reflecting the properties of the problem has been observed. This finding contradicts the common practice of early stopping and supports recent observations on the benefits of the terminal phase of training [11,1,12] and the double descent phenomenon [1,13,2,14]. In [8], an effective theory of grokking has been proposed. Within this effective theory, one can calculate the critical training set size needed to observe grokking. The authors relate grokking to a good representation (or structure formation) and introduce it as a phase between generalisation and memorisation. We go beyond these findings since we obtain exact solutions for the proposed setup and even calculate the grokking-time probability density function (PDF). A systematic experimental study of the grokking phenomenon has been presented in [9]. A slingshot mechanism (related to the edge of stability [15]) has been proposed as a necessary condition for grokking. The slingshot mechanism refers to the occurrence of cyclic spikes in the training loss and steps in the weight norms during training. The slingshot behaviour is not restricted to algorithmic datasets but is also present in various common classification tasks [9]. We find a similar behaviour, i.e. that grokking coincides with training loss spikes. Moreover, we connect training loss spikes with a discontinuous, step-like evolution of the effective dimension of the latent-space representation of the data, which indicates structural changes of that representation.
A particular structure formation common in deep classification neural networks is neural collapse (NC). It refers to four empirical/numerical observations made when training deep neural-network classifiers [5] (a sketch of how such observations can be quantified follows the list):
(NC1) Variability collapse – within-class feature variations become negligible.
(NC2) Convergence to an equiangular tight frame (ETF) – the class-mean vectors form equal-sized angles between any given pair.
(NC3) Convergence to self-duality – the class means and the linear classifiers converge to each other, up to rescaling.
(NC4) Simplification to nearest class center – the network classifier converges to a classifier that selects the class with the nearest train class mean.
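The sketch below shows one way to quantify the first and last of these observations on a set of last-layer features; the metrics (a within/between-class variance ratio for NC1 and the nearest-class-mean accuracy for NC4) are illustrative choices and not the exact definitions used in [5].

```python
import numpy as np

def nc_metrics(features, labels):
    """Illustrative neural-collapse diagnostics on last-layer features.

    NC1: ratio of within-class to between-class variance (should -> 0).
    NC4: accuracy of the nearest-class-mean classifier (should match the
         network classifier once collapse has occurred)."""
    classes = np.unique(labels)
    global_mean = features.mean(axis=0)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    within = np.mean([np.var(features[labels == c], axis=0).sum() for c in classes])
    between = np.var(means - global_mean, axis=0).sum()
    nc1 = within / between
    # Nearest-class-center predictions.
    dists = np.linalg.norm(features[:, None, :] - means[None, :, :], axis=-1)
    nc4_acc = np.mean(classes[dists.argmin(axis=1)] == labels)
    return nc1, nc4_acc

# Toy check: features clustered tightly around three well-separated means.
rng = np.random.default_rng(0)
true_means = 5.0 * rng.normal(size=(3, 16))
labels = rng.integers(0, 3, size=600)
feats = true_means[labels] + 0.01 * rng.normal(size=(600, 16))
print(nc_metrics(feats, labels))  # nc1 near 0, nearest-mean accuracy near 1
```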
The role of the loss function, the regularisation, the batch normalisation, and the optimiser has been
discussed within the unconstrained features model [16,17,7] and the local elasticity assumption [7]. The
relation of NC to generalisation properties has been discussed in [18,19,7]. However, no relation to grokking
has been discussed so far. Although we do not observe NC as defined above, our findings regarding the spikes
in the training loss and latent-space data structure might also be relevant for the NC dynamics.
Our main technical tools are tensor networks, which are models obtained by contracting many low-dimensional tensors. Tensor networks have been very successful in modelling many-body quantum systems. Recently, they have also been applied to machine learning tasks, in particular to classification problems [20,21,22,23,24,25,26,27], generative modelling [28,29,30,31], sequence and language modelling [32,33,34,35,36], and anomaly detection [37,38]. Besides, tensor networks have been used as tools to advance machine learning theory through the derivation of generalisation bounds [34], information-theoretical insights [39,40,41,42], and new connections between machine learning and physics [43,44,45]. Particularly relevant for latent-space structure formation is the connection between recurrent neural networks (RNNs) and matrix product states [46]. In [12], the benefits of the terminal phase of training for state-automaton extraction from RNNs (and hence matrix-product-state tensor networks) have been discussed. The authors find internal state-space compression and improved automaton extraction in the terminal phase of training. This is similar to our findings of reduced effective
dimension in the latent (internal) space. In contrast to [12], we introduce a new tensor network, similar to
the tensor-network attention model [36], and study grokking and structure formation in a teacher-student
learning setup.
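For readers unfamiliar with tensor networks, the following minimal sketch shows the basic object behind the models discussed above: a matrix product state that assigns a scalar score to a binary sequence by contracting one three-index tensor per site. The uniform (site-independent) tensor, the bond dimension, and the random initialisation are placeholder choices for illustration, not the tensor-network attention model introduced in Section 4.

```python
import numpy as np

def mps_score(tensors, x):
    """Contract a matrix product state with a binary input sequence.

    tensors: list of arrays of shape (bond_left, physical=2, bond_right),
             with the first and last bond dimensions equal to 1.
    x:       sequence of 0/1 values, one per site.
    Returns the scalar obtained by selecting the physical index x[i] at every
    site and multiplying the resulting bond matrices from left to right."""
    v = np.ones((1,))
    for A, xi in zip(tensors, x):
        v = v @ A[:, xi, :]          # (bond_left,) @ (bond_left, bond_right)
    return v.item()

# A random uniform MPS with bond dimension 3 on 8 sites.
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 2, 3)) / np.sqrt(3)
tensors = [A[:1]] + [A] * 6 + [A[:, :, :1]]
print(mps_score(tensors, [0, 1, 1, 0, 1, 0, 0, 1]))
```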
Finally, our work is related to the statistical-mechanics theory of supervised learning [10]. In the supervised
perceptron teacher-student case, an algebraic decrease of the generalisation error with the training set size
(training time) has been predicted [10]. A first-order phase transition has been derived only in a restricted
setting of discrete weights [10]. These results are typically based on the replica method [47], which requires
the thermodynamic limit, where both the number of samples and the dimension are large and their ratio
is fixed. Outstanding recent results in this direction concern the analysis of optimal generalisation errors
of generalised linear models [48,49]. We study the same teacher-student scenario but with a restriction to
a local teacher (still within the thermodynamic limit). The locality of the teacher/rule enables us to map
the problem to a finite-dimensional latent space, where we discuss grokking (a second-order phase transition)
and latent-space structure formation.
3 Perceptron grokking
We consider a simple binary classification problem that exhibits the grokking phenomenon. Let us assume we have a dataset $\mathcal{D}$ consisting of pairs $(\tilde{x}_i, y_i) \in \mathcal{D}$ with two linearly separable classes ($y_i \in \{-1, 1\}$) and $D$-dimensional features $\tilde{x}_i \in \mathbb{R}^D$. More precisely, the probability densities for the positive ($P_+$) and the negative ($P_-$) class are linearly separable in $\mathbb{R}^D$. Our model class is a simple perceptron in $D$ dimensions, namely
$$f(\tilde{x}) = \mathrm{sgn}(\hat{y}), \qquad \hat{y} = w \cdot \tilde{x} + b, \tag{1}$$
where $w, \tilde{x} \in \mathbb{R}^D$ and $b \in \mathbb{R}$. We sample $N$ positive and $N$ negative samples and then train the model with gradient descent
$$\frac{\partial \theta}{\partial t} = -\frac{\partial R}{\partial \theta}, \qquad R = \frac{1}{2N} \sum_{i=1}^{2N} \frac{1}{2} \left|\hat{y}_i - y_i\right|^2 + \lambda_1 \lVert\theta\rVert_1 + \frac{\lambda_2}{2} \lVert\theta\rVert_2^2, \tag{2}$$
where $\theta$ denotes the trainable parameters and $t$ the training time.
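For concreteness, the following is a minimal numerical sketch of this setup: a D-dimensional perceptron trained by full-batch gradient descent on the regularised quadratic risk of Eq. (2), with the train and test errors monitored so that a delayed drop of the test error (grokking) can be looked for. The data distribution, learning rate, and regularisation strengths are placeholder choices, not the ones analysed in the following subsections.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, lr, lam1, lam2 = 20, 50, 0.05, 1e-3, 0.0

def sample(n_per_class):
    """Two linearly separable classes: labels follow the sign of the first
    coordinate, pushed away from the decision boundary by a margin of 0.1."""
    y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    x = rng.normal(size=(2 * n_per_class, D))
    x[:, 0] = y * (np.abs(x[:, 0]) + 0.1)
    return x, y

x_tr, y_tr = sample(N)        # N positive and N negative training samples
x_te, y_te = sample(5000)     # large test set to estimate the test error

w, b = 0.1 * rng.normal(size=D), 0.0
for step in range(20001):
    y_hat = x_tr @ w + b
    err = (y_hat - y_tr) / (2 * N)     # gradient of the 1/(2N) * 1/2 |.|^2 term
    w -= lr * (x_tr.T @ err + lam1 * np.sign(w) + lam2 * w)
    b -= lr * err.sum()
    if step % 2000 == 0:
        train_err = np.mean(np.sign(x_tr @ w + b) != y_tr)
        test_err = np.mean(np.sign(x_te @ w + b) != y_te)
        print(f"step {step:6d}  train error {train_err:.3f}  test error {test_err:.3f}")
```

Whether and when the test error reaches zero depends on the sampled training set and on the regularisation strengths; the solvable models analysed below make this dependence exact.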