Stimulative Training of Residual Networks: A Social
Psychology Perspective of Loafing
Peng Ye1†, Shengji Tang1†, Baopu Li2, Tao Chen1∗, Wanli Ouyang3
1School of Information Science and Technology, Fudan University, 2Oracle Health and AI, USA,
3The University of Sydney, SenseTime Computer Vision Group, Australia, and Shanghai AI Lab
yepeng20@fudan.edu.cn
Abstract
Residual networks have shown great success and become indispensable in today's deep models. In this work, we aim to re-investigate the training process of residual networks from a novel social psychology perspective of loafing, and further propose a new training strategy to strengthen the performance of residual networks. As residual networks can be viewed as ensembles of relatively shallow networks (i.e., the unraveled view) in prior works, we also start from this view and consider that the final performance of a residual network is co-determined by a group of sub-networks. Inspired by the social loafing problem of social psychology, we find that residual networks invariably suffer from a similar problem, where sub-networks in a residual network are prone to exert less effort when working as part of the group compared to working alone. We define this previously overlooked problem as network loafing. As social loafing ultimately causes low individual productivity and reduced overall performance, network loafing will also hinder the performance of a given residual network and its sub-networks. Referring to the solutions of social psychology, we propose stimulative training, which randomly samples a residual sub-network and calculates the KL-divergence loss between the sampled sub-network and the given residual network, to act as extra supervision for sub-networks and to make the overall goal consistent. Comprehensive empirical results and theoretical analyses verify that stimulative training can well handle the loafing problem and improve the performance of a residual network by improving the performance of its sub-networks. The code is available at https://github.com/Sunshine-Ye/NIPS22-ST.
1 Introduction
Since ResNet [1] won first place in the ILSVRC-2015 competition, simple-but-effective residual connections have been applied in various deep networks, such as CNNs, MLPs, and transformers. To explore the secrets behind the success of residual networks, numerous studies have been proposed. He et al. [1] exploit the residual structure to avoid the performance degradation of deep networks. Further, He et al. [2] consider that such a structure can transfer any low-level features to high-level layers in forward propagation and directly transmit the gradients from deep to shallow layers in backward propagation. Balduzzi et al. [3] find that residual networks can alleviate the shattered gradients problem, in which gradients resemble white noise. In addition, Veit et al. [4] experimentally verify that residual networks can be seen as a collection of numerous networks of different lengths, namely the unraveled view. Following this view, Sun et al. [5] further attribute the success of residual networks to shallow sub-networks, which may correspond to the low-degree terms when regarding the neural network as a polynomial function. Since the unraveled view is supported both experimentally and
∗Corresponding Author (eetchen@fudan.edu.cn). †Equal Contribution.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Figure 1: Different residual networks invariably suffer from the problem of network loafing, and a deeper residual network tends to have a more serious loafing problem. All these networks are trained on the CIFAR-10 dataset. The horizontal axis denotes the different sub-networks sampled from ResNetX.
theoretically, we further investigate some interesting mechanisms behind residual networks based on this inspiring work. Specifically, we treat a residual network as an ensemble of relatively shallow sub-networks and consider its final performance to be co-determined by a group of sub-networks.
In social psychology, working in a group is always a tricky thing. Compared with performing tasks alone, group members tend to exert less effort when working as part of a group, which is defined as the social loafing problem [6; 7]. Moreover, social psychology research finds that increasing the group size may aggravate social loafing because individual visibility decreases [8; 9]. Inspired by these findings, we find that the ensemble-like networks formed by residual connections exhibit a similar problem and behavior. As shown in Fig. 1, different residual networks invariably suffer from the loafing problem, that is, the sub-networks working within a given residual network tend to perform worse than the same sub-networks trained individually. For example, the ResNet32 sub-network within ResNet56 reaches only 46.80% Top-1 accuracy, much lower than the 92.63% accuracy of ResNet32 trained individually. Moreover, the loafing problem of deeper residual networks tends to be more severe than that of shallower ones, that is, the same sub-network consistently performs worse inside a deeper residual network than inside a shallower one. For example, ResNet20 within ResNet32 reaches 55.28% Top-1 accuracy, while ResNet20 within ResNet56 (deeper than ResNet32) reaches only 23.17%. To the best of our knowledge, such problems have not been addressed in the literature. Hereafter, we define this previously overlooked problem as network loafing.² As social psychology research shows that social loafing ultimately causes low productivity of each individual and of the collective [10; 11], we consider that network loafing may also hinder the performance of a given residual network and all of its sub-networks.
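To make the measurement behind Fig. 1 concrete: under the unraveled view, a shallower sub-network can be carved out of a trained residual network by letting its later residual blocks fall back to their identity shortcuts, and its accuracy can then be compared with a counterpart trained alone. The snippet below is only a minimal sketch of this idea, not the released code; the `blocks` attribute, the prefix-keeping rule, and the loader variables are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

def make_sub_network(resnet: nn.Module, keep_blocks: int) -> nn.Module:
    """Return a copy of `resnet` in which every residual block after the first
    `keep_blocks` is replaced by an identity mapping, i.e. only its skip
    connection remains. Assumes (hypothetically) that the model exposes its
    shape-preserving residual blocks as an ordered nn.ModuleList `blocks`."""
    sub = copy.deepcopy(resnet)
    for i in range(len(sub.blocks)):
        if i >= keep_blocks:
            sub.blocks[i] = nn.Identity()  # drop F(x), keep only the skip path
    return sub

@torch.no_grad()
def top1_accuracy(model: nn.Module, loader) -> float:
    """Plain Top-1 accuracy of `model` over a test loader."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        correct += (model(images).argmax(dim=1) == labels).sum().item()
        total += labels.numel()
    return 100.0 * correct / total

# Loafing gap (illustrative): accuracy of a sub-network inside the trained
# whole network vs. the same architecture trained individually.
# gap = top1_accuracy(resnet32_alone, test_loader) \
#     - top1_accuracy(make_sub_network(resnet56, keep_blocks=15), test_loader)
```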
In social psychology, there are two commonly used solutions for preventing social loafing within groups: 1) establishing individual accountability by increasing individual supervision, and 2) making tasks cooperative by setting up an overall goal [8; 9]. Inspired by this, we propose a novel training strategy for improving residual networks, namely stimulative training. In detail, for each mini-batch during stimulative training, besides the main loss of the given residual network used in conventional training, we randomly sample a residual sub-network (individual supervision) and calculate the KL-divergence loss between the sampled sub-network and the given residual network (consistent overall goal). This simple yet effective training strategy relieves the loafing problem of residual networks by strengthening the individual supervision of sub-networks and making the goals of the residual sub-networks and the given residual network more consistent.
Comprehensive empirical analyses verify that stimulative training can solve the loafing problem of residual networks effectively and efficiently, thus improving the performance of a given residual network and all of its residual sub-networks by a large margin. Furthermore, we theoretically show the connection between the proposed stimulative training strategy and the improved performance of a given residual network and all of its residual sub-networks. Besides, experiments on various benchmark datasets using various residual networks demonstrate the effectiveness of the proposed training strategy. The contributions of our work can be summarized as follows:
²Network loafing is just a loose analogy to describe a behavior in neural networks that has no strong connection with biology.
• We understand residual networks from a social psychology perspective, and find that different residual networks invariably suffer from the problem of network loafing.
• We improve residual networks from a social psychology perspective, and propose a simple-but-effective stimulative training strategy to improve the performance of the given residual network and all of its sub-networks.
• Comprehensive empirical and theoretical analyses verify that stimulative training can well solve the loafing problem of residual networks.
2 Related Works
2.1 Unraveled View
As one pioneering work investigating residual networks, [4] experimentally shows that residual networks can be seen as a collection of numerous networks of different lengths, namely the unraveled view. Subsequently, [5] follows this view and is the first to attribute the success of residual networks to shallow sub-networks. Besides, [5] regards the neural network as a polynomial function and maps shallow sub-networks to its low-degree terms to explain the working mechanism of residual networks. Similar to [4] and [5], this paper also investigates residual networks from the unraveled view.
Differently, inspired by social psychology, we further reveal the loafing problem of residual networks
under the unraveled view. Besides, we propose a novel stimulative training method to relieve this
problem and further unleash the potential of residual networks.
2.2 Knowledge Distillation
As a classical method, knowledge distillation [12; 13] transfers the knowledge from a teacher network to a student network by approximating its output logits [12; 14; 15] or features [16; 17; 18; 19]. To avoid the huge cost of training a high-performance teacher, some works abandon the naive teacher-student framework, such as mutual distillation [20], which makes a group of students learn from each other online, and self-distillation [21], which transfers knowledge from deep layers to shallow layers. Generally, all these distillation methods need to introduce additional networks or structures and employ fixed teacher-student pairs. In comparison, our method does not require any additional network or structure, and the student network is a randomly sampled sub-network of the given network. Besides, our method is essentially designed to address the loafing problem of residual networks, which is different from knowledge distillation that aims to obtain a compact network with acceptable accuracy.
2.3 One-shot NAS
One-shot NAS is an important branch of neural architecture search (NAS) [22; 23; 24; 25; 26; 27]. Along this direction, [28] trains a once-for-all (OFA) network with progressive shrinking and knowledge distillation to support diverse architectural settings. Following this work, BigNAS [29] introduces several techniques to train a high-quality single-stage model, whose child models can be directly deployed without extra retraining or post-processing steps. Both OFA and BigNAS aim at simultaneously training and searching various networks with different resolutions, depths, widths and operations. Differently, the proposed method aims at improving a given residual network, and can thus be seamlessly applied to models searched by NAS. As OFA and BigNAS are not designed to solve the loafing problem, their sampling space and supervision signal also differ from those of the proposed method. More importantly, the social-psychology-inspired problem of network loafing may explain why OFA and BigNAS work.
2.4 Stochastic Depth
As a regularization technique, Stochastic Depth [30] randomly disables the convolution layers of residual blocks to substantially reduce training time and test error. In Stochastic Depth, the reduction in test error is attributed to strengthening the gradients of earlier layers and the implicit ensemble of numerous sub-networks of different depths. In fact, its improved performance can also be interpreted as relieving the network loafing problem defined in this work, and the theoretical analysis in this work can also be applied to Stochastic Depth. Besides, for better solving the loafing problem, our method samples sub-networks with ordered depth, and uses an additional KL-divergence loss to provide a more achievable target and make the outputs of a given network and its sub-networks more consistent.
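To make the contrast concrete, the sketch below places the two sampling rules side by side: Stochastic Depth keeps or drops each residual block independently, whereas ordered sampling draws a depth and keeps a contiguous prefix of blocks. This is an illustrative sketch only; the exact ordered residual sampling rule of our method is specified in Section 3.3, and the prefix form shown here is an assumption made for illustration.

```python
import random
from typing import List

def stochastic_depth_mask(num_blocks: int, survival_prob: float = 0.8) -> List[bool]:
    """Stochastic Depth: each residual block survives independently with
    probability `survival_prob`, so the active blocks are scattered."""
    return [random.random() < survival_prob for _ in range(num_blocks)]

def ordered_depth_mask(num_blocks: int) -> List[bool]:
    """Ordered sampling (illustrative prefix form): draw a depth d uniformly
    and keep exactly the first d residual blocks, dropping the rest."""
    depth = random.randint(1, num_blocks)
    return [i < depth for i in range(num_blocks)]

# Example for a 27-block network:
#   stochastic_depth_mask(27) -> e.g. [True, False, True, True, ...] (scattered)
#   ordered_depth_mask(27)    -> True for the first d blocks, then False
```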
(a) Common training scheme suffers from severe network loafing problem. (b) Stimulative training scheme can relieve network loafing and improve the performance.
Figure 2: Illustration of common and stimulative training schemes. Stimulative training can relieve
the network loafing problem, and improve the performance of a given residual network (from 77.3%
to 81.07%) and all of its sub-networks (e.g., from 60% to 79%). 65% and 75% denote the individual
performance of each sub-network.
3 Stimulative Training
3.1 Motivation
Social loafing is a social psychology phenomenon in which individuals lower their productivity when working in a group. Based on the novel perspective that a residual network behaves like an ensemble network [4], we find that various residual networks invariably exhibit loafing-like behaviors, as shown in Fig. 1 and Fig. 4, which we define as network loafing. As social loafing is "a kind of social disease" that harms individual and collective productivity [10], we consider that network loafing may also hinder the performance of a residual network and its sub-networks. To alleviate network loafing, it is intuitive to learn from social psychology. There are two common methods for solving the social loafing problem in sociology, namely, establishing individual accountability (i.e., increasing individual supervision) and making tasks cooperative (i.e., setting up an overall goal) [8; 9]. To increase individual supervision, we sample sub-networks from the whole network and provide extra supervision to train each sub-network sufficiently. For the overall goal, we adopt a KL-divergence loss to constrain the output of each sub-network to stay close to that of the whole network, which aims at reducing the performance gap and driving the sampled sub-networks to develop cooperatively.
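A minimal sketch of this overall-goal constraint is given below, assuming the output of the whole network is used as a fixed soft target for the sampled sub-network; the precise formulation in the paper (e.g., any loss weighting or temperature) may differ from this illustration.

```python
import torch
import torch.nn.functional as F

def overall_goal_kl_loss(sub_logits: torch.Tensor,
                         main_logits: torch.Tensor) -> torch.Tensor:
    """KL divergence pulling the sampled sub-network's output distribution
    towards that of the whole (main) network. The main-network prediction is
    detached so that it only acts as the target of the constraint."""
    log_p_sub = F.log_softmax(sub_logits, dim=1)     # sub-network log-probabilities
    p_main = F.softmax(main_logits.detach(), dim=1)  # whole-network probabilities (target)
    return F.kl_div(log_p_sub, p_main, reduction="batchmean")
```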
3.2 Training Algorithm
In this subsection, we briefly illustrate the working scheme of the proposed stimulative training strategy and show its difference from the common training method. As shown in Fig. 2, common training only focuses on optimizing the main network and thus suffers from severe network loafing, that is, sub-networks lower their performance when working in an ensemble. For example, as shown in Fig. 2(a), the sub-networks within the residual network only reach accuracies of 60% and 70%, much lower than their individual accuracies of 65% and 75%. In comparison, stimulative training optimizes the main network and meanwhile uses it to provide extra supervision for a sampled sub-network at each training iteration, and thus handles the loafing problem well. At test time, our method can adopt the main network or any sub-network as the inference model, thus requiring the same or lower memory and inference cost compared with the given residual network.
Formally, for a given residual network to be optimized, we define the main network as $D_m$ and a sub-network as $D_s$. All the sub-networks share weights with the main network and make up the sampling space $\Theta = \{D_s \mid D_s = \pi(D_m)\}$, where $\pi$ is the sampling operator, usually random sampling. In the training process, we randomly sample a sub-network at each iteration to ensure the whole sampling space can be fully explored. To make the training more efficient and effective, we define a new sampling space obeying the ordered residual sampling rule to be discussed in Section 3.3. Denoting $\theta_{D_m}$ and $\theta_{D_s}$ as the weights of the main network and the sampled sub-network respectively, $x$ as the mini-batch training samples, $y$ as their labels, and $Z$ as the output of a network, the total loss of stimulative training combines the main loss of $D_m$ with the KL-divergence loss between $Z_{D_s}$ and $Z_{D_m}$.
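For concreteness, one training iteration could be organized as in the sketch below. It is an illustrative sketch rather than the released implementation: `sample_sub_network` stands for the sampling operator $\pi$ (the weight-sharing sub-network construction is assumed), and the main cross-entropy loss and the KL term are simply summed here, whereas the actual loss weighting follows the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def stimulative_training_step(main_net: nn.Module,
                              sample_sub_network,  # hypothetical operator for pi(D_m)
                              optimizer: torch.optim.Optimizer,
                              x: torch.Tensor,
                              y: torch.Tensor) -> dict:
    """One mini-batch of stimulative training (illustrative sketch):
    1) main loss: cross-entropy of the whole network D_m on (x, y);
    2) overall goal: KL divergence pulling a randomly sampled, weight-sharing
       sub-network D_s towards the (detached) output of D_m."""
    main_net.train()
    z_main = main_net(x)                       # Z_{D_m}
    main_loss = F.cross_entropy(z_main, y)

    sub_net = sample_sub_network(main_net)     # D_s = pi(D_m), shares weights with D_m
    z_sub = sub_net(x)                         # Z_{D_s}
    kl_loss = F.kl_div(F.log_softmax(z_sub, dim=1),
                       F.softmax(z_main.detach(), dim=1),
                       reduction="batchmean")

    loss = main_loss + kl_loss                 # equal weighting assumed for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return {"main_loss": main_loss.item(), "kl_loss": kl_loss.item()}
```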