can be dropped iteratively.
Dynamic sparse training, on the other hand, trains a
sparse neural network from scratch and requires only a
fraction of the parameters of the large, dense model, and
therefore also only a fraction of the gradients, optimizer
variables, and FLOPs needed to process them, while achieving
comparable results. Since the set of connections is not known
upfront, however, a well-suited set of connections has to
be found while the weights are being optimized. Randomly
connected networks fall short in accuracy compared
to pruned models, especially at high sparsity levels. Dynamic
sparse training addresses this by training the connectivity
together with the parameters: at set intervals during training,
weights are removed using pruning-like criteria and new
weights are inserted using dedicated heuristics.
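To make the overall procedure concrete, the following Python sketch shows a generic drop-and-grow schedule for a single layer. The toy layer shape, learning rate, rewiring interval, and the random insertion step are illustrative placeholders, not the settings of any specific method.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_sparse_layer(shape, density):
    """Random dense weights masked down to the target density."""
    w = rng.normal(scale=0.1, size=shape)
    mask = (rng.random(shape) < density).astype(w.dtype)
    return w * mask, mask

def drop_and_grow(w, mask, k):
    """Drop the k active weights with the smallest magnitude and re-insert
    k connections elsewhere (randomly here, as a placeholder for the
    insertion heuristics discussed in the text)."""
    active = np.flatnonzero(mask)
    drop = active[np.argsort(np.abs(w.ravel()[active]))[:k]]
    mask.ravel()[drop] = 0.0
    w.ravel()[drop] = 0.0
    inactive = np.flatnonzero(mask == 0)
    grow = rng.choice(inactive, size=k, replace=False)
    mask.ravel()[grow] = 1.0  # newly inserted connections start at zero weight
    return w, mask

# Toy schedule: masked gradient steps with a rewiring step every 100 iterations.
w, mask = init_sparse_layer((256, 128), density=0.1)
for step in range(1000):
    grad = rng.normal(size=w.shape)  # stand-in for a real backward pass
    w -= 0.01 * grad * mask          # only active connections are updated
    if step % 100 == 0:
        w, mask = drop_and_grow(w, mask, k=int(0.1 * mask.sum()))
```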
Finding the best weights to remove and insert is difficult:
it often either causes significant overhead, which negates
the gains of dynamic sparse training, or fails to achieve
results close to already highly optimized pruning techniques.
A common limitation of most approaches is that weights are
only redistributed locally, on a per-layer basis. The target
sparsity of each layer is chosen heuristically at network
initialization, for example by making larger layers sparser
and distributing the culled weights to smaller layers,
which likely need them more. However, not all layers of a
network require the same density, even if their parameter
tensors have the same shape. This problem becomes more
apparent at very high sparsity levels (>90%),
where small differences in the density of a layer can make a
large difference in the accuracy of the network.
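As an illustration of such an initialization-time heuristic, the sketch below assigns each layer a density that shrinks with layer size and then rescales the densities to meet a global parameter budget. The particular scaling rule is a stand-in chosen for illustration, not the exact allocation used by any of the methods discussed.

```python
import numpy as np

def heuristic_layer_densities(shapes, global_density):
    """Assign each layer a density proportional to
    (fan_in + fan_out) / (fan_in * fan_out), so larger layers end up sparser,
    then rescale so the total number of active weights matches the budget."""
    sizes = np.array([np.prod(s) for s in shapes], dtype=float)
    raw = np.array([(s[0] + s[1]) / (s[0] * s[1]) for s in shapes])
    budget = global_density * sizes.sum()           # total active weights
    densities = raw * budget / (raw * sizes).sum()  # sum(d_i * n_i) == budget
    return np.clip(densities, 0.0, 1.0)             # clipping may shift the budget slightly

# Example: three layers of different shapes at 10% overall density.
# Small layers receive a much higher density than large ones.
print(heuristic_layer_densities([(784, 300), (300, 300), (300, 10)], 0.10))
```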
We propose Global Gradient-based Redistribution (GGR),
a method that dynamically redistributes weights across layers
during training by inserting weights into the layers with the
largest gradients; a sketch of this redistribution step follows
the list of contributions below. Our main contributions are:
• We propose a global gradient-based weight redistribution technique that adjusts layer densities dynamically during training.
• We combine existing dynamic sparse training approaches to achieve the most robust and consistent method across a variety of configurations.
• We present ablation studies on various global weight redistribution schemes, comparing our proposed approach to current state-of-the-art metrics.
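The following sketch illustrates the core idea of the global redistribution step: gradient magnitudes of inactive connections are pooled across all layers, and the insertion budget is given to the globally highest-scoring candidates, so per-layer densities can shift during training. The function below is a simplified illustration of this idea, not the complete algorithm.

```python
import numpy as np

def global_gradient_growth(masks, grads, n_grow):
    """Activate the n_grow inactive connections with the globally largest
    gradient magnitude, pooled across all layers (masks and grads are dicts
    of same-shaped arrays keyed by layer name). Illustrative sketch only."""
    scores, owners, positions = [], [], []
    for name, mask in masks.items():
        inactive = np.flatnonzero(mask == 0)
        scores.append(np.abs(grads[name].ravel()[inactive]))
        owners.extend([name] * len(inactive))
        positions.append(inactive)
    scores = np.concatenate(scores)
    positions = np.concatenate(positions)
    for idx in np.argsort(scores)[-n_grow:]:  # globally largest gradients
        masks[owners[idx]].ravel()[positions[idx]] = 1.0
    return masks
```

Selecting insertions from a single global pool is what distinguishes this scheme from per-layer growth: a layer whose inactive connections consistently receive large gradients gradually becomes denser, at the expense of layers whose candidates score poorly.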
2. Related Work
Researchers have proposed many ways to reduce the overpa-
rameterization of neural networks. Typically, a network is
first trained densely until convergence before starting an
iterative pruning process. Pruning removes the weights with
the lowest impact on the loss (e.g., the weights with the
smallest magnitude), and the remaining weights are briefly
retrained to compensate for the loss of parameters (Han et al.,
2016b; Molchanov et al., 2017; Zhu & Gupta, 2018). Training a
small, dense network with fewer neurons instead, however,
does not reach the same accuracy as pruning a converged
network.
The lottery ticket hypothesis suggests that when a working
subnetwork is found (e.g., with pruning), the same subnet-
work can achieve the same accuracy when trained from
scratch (Frankle & Carbin, 2019). According to the authors
of the lottery ticket hypothesis, the advantage of overpa-
rameterization is that during training, the network can search
for working subnetworks among a large number of possible
candidates. Training such a found subnetwork, the winning
ticket, to the same accuracy is difficult, however, and often
requires initializing the subnetwork with weights extracted
from a dense network trained for thousands of optimizer
steps (Zhou et al., 2019). Nevertheless, the hypothesis shows
that at least simpler architectures can be trained as a sparse
neural network from scratch. The challenge remains to find
a combination of connections that achieves high accuracy
without having to train the network densely first to extract
the winning ticket.
The concept of training a randomly initialized sparse neural
network from scratch was introduced in 2018 with two
different approaches. Deep Rewiring (DeepR) (Bellec et al.,
2018) proposes to drop weights that would flip their sign dur-
ing the optimizer update step. The same number of weights
is then inserted randomly in the same layer and initialized
with a random sign. Sparse Evolutionary Training (SET)
(Mocanu et al., 2018a), on the other hand, makes use of
a simple and well-established weight removal approach from
pruning research. Following the intuition that weights with
a smaller magnitude contribute less to the output of a neuron,
SET drops a fixed number of weights with the lowest mag-
nitude at predefined intervals and inserts the same number
randomly. RigL (Evci et al., 2020) improved upon the criteria
introduced in SET: while it also removes the weights with
the lowest magnitude, it inserts the set of weights with the
highest gradient magnitudes instead of selecting them randomly.
This way, the connections that could have the greatest impact
on the result of the current mini-batch are inserted first.
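For concreteness, the sketch below shows one such per-layer drop/grow step in the style of SET and RigL as described above: the smallest-magnitude active weights are removed and, in the RigL-style variant shown here, the inactive positions with the largest gradient magnitudes are activated in their place. It follows the textual description only and omits details such as update schedules and how the dense gradients are obtained.

```python
import numpy as np

def rigl_style_update(w, mask, grad, k):
    """One per-layer drop/grow step: remove the k smallest-magnitude active
    weights, then activate the k inactive positions whose gradients have the
    largest magnitude."""
    active = np.flatnonzero(mask)
    drop = active[np.argsort(np.abs(w.ravel()[active]))[:k]]
    mask.ravel()[drop] = 0.0
    w.ravel()[drop] = 0.0

    inactive = np.flatnonzero(mask == 0)
    grow = inactive[np.argsort(np.abs(grad.ravel()[inactive]))[-k:]]
    mask.ravel()[grow] = 1.0  # newly grown weights start at zero here
    return w, mask
```

Replacing the gradient-based selection of `grow` with a random draw from `inactive` recovers a SET-style update; the drop criterion is the same in both cases.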
The common limitation of all three approaches is that they
use a predefined density per layer. This has the advantage
that the network can be designed to meet a FLOP target
at inference time. However, for deep neural networks, selecting
the right density for each layer is difficult, and performance is
likely left on the table.
Dynamic sparse reparameterization (DSR) (Mostafa &