Gradient-based Weight Density Balancing for Robust Dynamic Sparse Training

Mathias Parger 1, Alexander Ertl 1, Paul Eibensteiner 1, Joerg H. Mueller 1 2, Martin Winter 1 2, Markus Steinberger 1 2

1 Graz University of Technology, Austria; 2 Huawei Technologies. Correspondence to: Mathias Parger <mathias.parger@icg.tugraz.at>, Alexander Ertl <ertl@student.tugraz.at>, Paul Eibensteiner <eibensteiner@student.tugraz.at>, Joerg H. Mueller <joerg.mueller@tugraz.at>, Martin Winter <martin.winter@icg.tugraz.at>, Markus Steinberger <steinberger@icg.tugraz.at>.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).
Abstract

Training a sparse neural network from scratch requires optimizing connections at the same time as the weights themselves. Typically, the weights are redistributed after a predefined number of weight updates, removing a fraction of the parameters of each layer and inserting them at different locations in the same layers. The density of each layer is determined using heuristics, often purely based on the size of the parameter tensor. While the connections per layer are optimized multiple times during training, the density of each layer remains constant. This leaves great unrealized potential, especially in scenarios with a high sparsity of 90% and more. We propose Global Gradient-based Redistribution, a technique which distributes weights across all layers, adding more weights to the layers that need them most. Our evaluation shows that our approach is less prone to unbalanced weight distribution at initialization than previous work and that it is able to find better performing sparse subnetworks at very high sparsity levels.
1. Introduction
With the rise of neural networks, plenty of research has been conducted on optimizing network architectures (Sandler et al., 2018; He et al., 2016), training procedures (Cubuk et al., 2019; Loshchilov & Hutter, 2017) and inference hardware (Chen et al., 2015; Han et al., 2016a) to lower inference latency. These achievements enabled the use of neural networks on low-power mobile devices like mobile phones or virtual reality headsets. Yet, neural network inference remains computationally expensive, and the cost of training can be large, often taking days or weeks even in distributed environments.
In past years, quantization has established itself as a standard approach to reduce the memory footprint of the parameters and to accelerate inference as well as training using lower precision multiplication. Another popular, yet not equally established, way to cut costs is pruning, i.e., weight sparsification (Cun et al., 1990; Molchanov et al., 2017; Zhu & Gupta, 2018). Like quantization, pruning lowers inference cost with little to no reduction in network performance. The typical pruning process consists of training the network densely and then iteratively removing the weights which have the smallest impact on the accuracy of the prediction, while retraining the remaining weights. This way, the number of weights is reduced significantly, lowering FLOPs and memory footprint at the same time. Using a smaller, dense network from the start does lower the training cost, but starting with a large model comes with the advantage of having a large set of potential subnetworks that can be established during early training. Researchers have shown many times that large, sparse models significantly outperform small, dense models with an equal parameter count (Evci et al., 2020; Mostafa & Wang, 2019).
With the requirements of a densely trained network and many iterations of pruning and parameter refinement, the training cost of pruned neural networks is typically much higher than for a dense neural network by itself. Furthermore, memory consumption is often a limiting factor in the training process. For example, the language model GPT-3 (Brown et al., 2020) consists of 175 billion parameters, many times more than what modern GPUs can store in memory. Adding the size of feature maps, gradients and optimizer states, the memory consumption during training is much larger than in pure inference. Recently, early pruning has become a popular research topic, as it can reduce training cost compared to traditional pruning (Zhang et al., 2021; Chen et al., 2021; Liu et al., 2021a). These approaches start with a high density and then gradually reduce parameter density during training, before the network has converged. Once parts of the weights start to settle, less important weights can be dropped iteratively.
Dynamic sparse training, on the other hand, starts with a sparse neural network from scratch. It only requires a fraction of the parameters of the large, dense model, and therefore also only a fraction of the gradients, optimizer variables and FLOPs for processing them, while achieving comparable results. Without knowing the set of connections upfront, however, a well-suited set of connections needs to be found at the same time as the weights are optimized. Randomly connected networks fall short in accuracy compared to pruned models, especially at high sparsity levels. Dynamic sparse training solves this by training the connections at the same time as the parameters. This is accomplished using pruning-like concepts for weight removal, and special heuristics for weight insertion at set intervals during training.
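To make this schedule concrete, the sketch below interleaves ordinary optimizer steps with mask updates at a fixed interval. It is a minimal PyTorch-style illustration under assumed names (`update_layer_mask`, `update_interval`, `masks`), not the implementation of any of the methods discussed here.

```python
import torch

def train_sparse(model, loss_fn, optimizer, data_loader, masks, update_layer_mask,
                 update_interval=1000, total_steps=100_000):
    """Skeleton of a dynamic sparse training loop (illustrative assumptions only).

    `masks` maps parameter name -> binary mask; `update_layer_mask(param, mask)`
    implements some drop-and-grow criterion and returns the new mask.
    """
    step = 0
    for inputs, targets in data_loader:
        # Keep pruned weights at exactly zero before the forward pass.
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name in masks:
                    param.mul_(masks[name])

        loss = loss_fn(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Every `update_interval` steps, redistribute connections within each layer.
        step += 1
        if step % update_interval == 0:
            with torch.no_grad():
                for name, param in model.named_parameters():
                    if name in masks:
                        masks[name] = update_layer_mask(param, masks[name])
        if step >= total_steps:
            break
```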
Finding the best weights for removal and insertion is difficult and often either causes significant overhead which negates the gains of dynamic sparse training, or it fails to achieve results close to already highly optimized pruning techniques. A common limitation of most approaches is that weights are only redistributed locally, on a per-layer basis. The target sparsity of each layer is chosen heuristically at network initialization, for example by making larger layers more sparse and distributing the culled weights to smaller layers which likely need them more. However, not all layers of a network require the same density, even if their parameter tensors have the same shape. This problem becomes more apparent when using very high levels of sparsity (>90%), where small differences in the density of a layer can make a big difference in the accuracy of the network.
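For illustration, the snippet below shows one way such a size-based heuristic could be written: each layer receives a density that shrinks with its parameter count while the global parameter budget is kept fixed. The square-root scaling rule is an assumption for exposition (loosely in the spirit of Erdős–Rényi-style initializations), not the specific heuristic of any cited method.

```python
import math

def size_scaled_densities(layer_sizes, global_density=0.1):
    """Assign per-layer densities so larger layers end up sparser (illustrative heuristic).

    `layer_sizes` is a list of parameter counts. Each layer gets a raw score that
    grows sublinearly with its size; scores are rescaled so the total number of
    active weights matches `global_density * sum(layer_sizes)`.
    """
    budget = global_density * sum(layer_sizes)
    # Sublinear score: a layer 100x larger does not get 100x more active weights.
    scores = [math.sqrt(n) for n in layer_sizes]
    scale = budget / sum(scores)
    # Capping at 1.0 can leave part of the budget unused; real schemes
    # redistribute that excess to the remaining layers.
    return [min(1.0, scale * s / n) for s, n in zip(scores, layer_sizes)]

# Example: three layers of very different sizes at 10% global density.
print(size_scaled_densities([1_000, 100_000, 1_000_000], global_density=0.1))
```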
We propose Global Gradient-based Redistribution (GGR), a method which redistributes weights across layers dynamically during the training process by inserting weights into the layers with the largest gradients (a minimal sketch of the idea follows the contribution list). Our main contributions are:

• We propose a global gradient-based weight redistribution technique that adjusts layer densities dynamically during training.
• We combine existing dynamic sparse training approaches to achieve the most robust and consistent method across a variety of configurations.
• We present ablation studies on various global weight redistribution schemes, comparing our proposed approach to current state-of-the-art metrics.
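The following is a minimal sketch of how such a global, gradient-based redistribution step could look. It is not the method evaluated in this paper: the layer score (mean gradient magnitude over a layer's inactive positions), the proportional allocation rule, and the assumption that a preceding global drop step has already freed `total_regrow` connections are all illustrative choices made to convey the idea.

```python
import torch

def global_gradient_redistribution(weights, masks, grads, total_regrow):
    """Distribute a global regrowth budget across layers by gradient magnitude (sketch).

    Each layer is scored by the mean |gradient| over its currently inactive positions,
    and the `total_regrow` budget is split in proportion to these scores. Layers with
    large gradients on missing connections gain density; the others lose it implicitly,
    because the global budget is fixed.
    """
    scores = []
    for mask, grad in zip(masks, grads):
        inactive = mask == 0
        score = grad.abs()[inactive].mean() if inactive.any() else grad.new_zeros(())
        scores.append(score)
    scores = torch.stack(scores)

    # Proportional allocation of the global budget (rounding handled crudely here).
    alloc = (scores / scores.sum().clamp_min(1e-12) * total_regrow).long()

    for weight, mask, grad, k in zip(weights, masks, grads, alloc.tolist()):
        k = min(k, int((mask == 0).sum()))
        if k == 0:
            continue
        # Grow the k inactive connections with the largest gradient magnitude,
        # initializing the new weights to zero.
        cand = torch.where(mask.view(-1) == 0, grad.view(-1).abs(),
                           torch.full_like(grad.view(-1), -1.0))
        grow_idx = torch.topk(cand, k).indices
        mask.view(-1)[grow_idx] = 1.0
        weight.data.view(-1)[grow_idx] = 0.0
    return masks
```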
2. Related Work

Researchers have proposed many ways to reduce the overparameterization of neural networks. Typically, a network is first trained densely until convergence before starting an iterative pruning process. Pruning removes the weights with the lowest impact on the loss (e.g., weights with the smallest magnitude), and the remaining weights are briefly retrained to compensate for the loss of parameters (Han et al., 2016b; Molchanov et al., 2017; Zhu & Gupta, 2018). Training a small, dense network with fewer neurons instead, however, does not result in the same accuracy as pruning a converged network.
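A compact sketch of this train-prune-retrain cycle is given below. The schedule (remove 20% of the still-active weights per round, then fine-tune) and the `fine_tune` callback are illustrative assumptions, not the procedure of any single cited paper.

```python
import torch

def iterative_magnitude_pruning(model, fine_tune, rounds=5, prune_per_round=0.2):
    """Iteratively remove the smallest-magnitude weights and retrain (illustrative sketch).

    `fine_tune(model)` is assumed to run a short retraining phase. Each round prunes
    `prune_per_round` of the remaining active weights in every weight tensor.
    """
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}
    for _ in range(rounds):
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name not in masks:
                    continue
                mask = masks[name]
                active = mask.bool()
                n_prune = int(prune_per_round * int(active.sum()))
                if n_prune == 0:
                    continue
                # Rank active weights by magnitude; prune the smallest ones.
                scores = torch.where(active, param.data.abs(),
                                     torch.full_like(param, float("inf")))
                idx = torch.topk(scores.view(-1), n_prune, largest=False).indices
                mask.view(-1)[idx] = 0.0
                param.data.view(-1)[idx] = 0.0
        # Retrain the surviving weights to recover accuracy before pruning again.
        fine_tune(model)
    return masks
```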
The lottery ticket hypothesis suggests that when a working subnetwork is found (e.g., with pruning), the same subnetwork can achieve the same accuracy when trained from scratch (Frankle & Carbin, 2019). According to the authors of the lottery ticket hypothesis, the advantage of overparameterization is that during training, the network can find working subnetworks among large amounts of possible candidates. Training a found subnetwork, the winning ticket, to achieve the same accuracy, however, is difficult and often requires initializing the subnetwork with weights extracted from a dense network trained for thousands of optimizer steps (Zhou et al., 2019). However, the hypothesis shows that at least simpler architectures can be trained with a sparse neural network from scratch. The challenge remains to find a combination of connections that achieves a high accuracy, without having to train the network densely first to extract the winning ticket.
The concept of training a randomly initialized sparse neural network from scratch was introduced in 2018 with two different approaches. Deep Rewiring (DeepR) (Bellec et al., 2018) proposes to drop weights that would flip their sign during the optimizer update step. The same number of weights is then inserted randomly in the same layer and initialized with a random sign. Sparse Evolutionary Training (SET) (Mocanu et al., 2018a), on the other hand, makes use of a simple and well-established weight removal approach from pruning research. Following the intuition that weights with a smaller magnitude contribute less to the output of a neuron, SET drops a fixed number of weights with the lowest magnitude at predefined intervals and inserts the same amount randomly. RigL (Evci et al., 2020) improved the criteria introduced in SET. While RigL also removes the weights with the lowest magnitude, it inserts the set of weights with the highest gradients instead of selecting them randomly. This way, connections that could have the greatest impact on the result of the current mini-batch are inserted first.
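To contrast the two insertion criteria, the sketch below performs the shared magnitude-based drop and then regrows connections either at random (SET-style) or at the inactive positions with the largest gradient magnitude (RigL-style). The function and argument names are illustrative, and the simplifications (e.g., dense gradients assumed to be available at the update step) follow the textual description above rather than the original implementations.

```python
import torch

def drop_and_grow(weight, mask, grad=None, drop_fraction=0.3, method="rigl"):
    """One per-layer mask update: magnitude drop, then random (SET) or gradient (RigL) grow."""
    flat_w, flat_m = weight.data.view(-1), mask.view(-1)
    active = flat_m.bool()
    n = int(drop_fraction * int(active.sum()))
    if n == 0:
        return mask

    # Drop: deactivate the n active weights with the smallest magnitude.
    drop_scores = torch.where(active, flat_w.abs(), torch.full_like(flat_w, float("inf")))
    flat_m[torch.topk(drop_scores, n, largest=False).indices] = 0.0

    # Grow: n positions that were inactive before this update.
    inactive_idx = (~active).nonzero(as_tuple=True)[0]
    if method == "set":
        perm = torch.randperm(inactive_idx.numel(), device=flat_m.device)
        grow_idx = inactive_idx[perm[:n]]
    else:  # "rigl": pick the inactive positions with the largest gradient magnitude
        grow_scores = grad.view(-1).abs()[inactive_idx]
        grow_idx = inactive_idx[torch.topk(grow_scores, min(n, inactive_idx.numel())).indices]
    flat_m[grow_idx] = 1.0
    flat_w[grow_idx] = 0.0  # newly inserted weights start at zero
    return mask
```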
The common limitation of all three approaches is that they use a predefined density per layer. This does have the advantage that the network can be designed to reach a FLOP target in inference. However, with deep neural networks, selecting the right density per layer is difficult and performance is likely left on the table.
Dynamic sparse reparameterization (DSR) (Mostafa &