• We understand residual networks from a social psychology perspective, and find that different residual networks invariably suffer from the problem of network loafing.
• We improve residual networks from a social psychology perspective, and propose a simple-but-effective stimulative training strategy to improve the performance of the given residual network and all of its sub-networks.
• Comprehensive empirical and theoretical analyses verify that stimulative training can effectively solve the loafing problem of residual networks.
2 Related Work
2.1 Unraveled View
As a pioneering work investigating residual networks, [4] experimentally shows that a residual network can be seen as a collection of numerous networks of different lengths, namely the unraveled view. Subsequently, [5] follows this view and further attributes the success of residual networks primarily to its shallow sub-networks. Besides, [5] considers the neural network as a polynomial function and maps shallow sub-networks to low-degree terms to explain the working mechanism of residual networks. Similar to [4] and [5], this paper also investigates residual networks from the unraveled view. In contrast, inspired by social psychology, we further reveal the loafing problem of residual networks under the unraveled view. Besides, we propose a novel stimulative training method to relieve this problem and further unleash the potential of residual networks.
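For intuition, consider a residual network with two residual branches f_1 and f_2 and input x_0. Unrolling the recursion x_i = x_{i-1} + f_i(x_{i-1}) (a standard derivation following [4], not a result specific to this paper) gives

x_2 = x_1 + f_2(x_1) = x_0 + f_1(x_0) + f_2(x_0 + f_1(x_0)),

so the final output is a sum over paths that traverse zero, one, or both residual branches, i.e., an implicit collection of sub-networks of different lengths.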
2.2 Knowledge Distillation
As a classical method, knowledge distillation [12; 13] transfers knowledge from a teacher network to a student network by approximating the output logits [12; 14; 15] or features [16; 17; 18; 19]. To avoid the huge cost of training a high-performance teacher, some works abandon the naive teacher-student framework: mutual distillation [20] makes a group of students learn from each other online, and self distillation [21] transfers knowledge from deep layers to shallow layers. Generally, all these distillation methods need to introduce additional networks or structures, and employ fixed teacher-student pairs. In comparison, our method does not require any additional network or structure, and the student network is a randomly sampled sub-network of the given network. Besides, our method is essentially designed to address the loafing problem of residual networks, which differs from knowledge distillation, whose goal is to obtain a compact network with acceptable accuracy.
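For reference, the following is a minimal PyTorch-style sketch of logit-based distillation in the spirit of [12]; the temperature T and loss weight alpha are illustrative hyper-parameters, not values used in this paper.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Classical logit-based KD: soften both output distributions with
    temperature T, match them with KL divergence, and mix in the usual
    cross-entropy on the hard labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    # T^2 rescales the soft term so its gradients stay comparable to the hard term.
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce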
2.3 One-shot NAS
One-shot NAS is an important branch of neural architecture search (NAS) [22; 23; 24; 25; 26; 27]. Along this direction, [28] trains a once-for-all (OFA) network with progressive shrinking and knowledge distillation to support various architectural settings. Following this work, BigNAS [29] introduces several techniques to train a high-quality single-stage model whose child models can be directly deployed without extra retraining or post-processing steps. Both OFA and BigNAS aim at simultaneously training and searching various networks with different resolutions, depths, widths and operations. Differently, the proposed method aims at improving a given residual network, and thus can be seamlessly applied to the model searched by NAS. As OFA and BigNAS are not designed to solve the loafing problem, their sampling space and supervision signal also differ from those of the proposed method. More importantly, the social-psychology-inspired problem of network loafing may explain why OFA and BigNAS work.
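To make the weight-sharing idea concrete, the following is a minimal sketch of how a one-shot supernet samples a child architecture at each training step; the search space values are purely illustrative and are not the actual OFA or BigNAS spaces.

import random

# Hypothetical search space of a weight-sharing supernet (illustrative values).
SEARCH_SPACE = {
    "depth": [2, 3, 4],
    "width_mult": [0.5, 0.75, 1.0],
    "resolution": [160, 192, 224],
}

def sample_subnet_config():
    """At every training step, one-shot NAS samples a child architecture and
    updates only the shared weights that this child uses."""
    return {key: random.choice(values) for key, values in SEARCH_SPACE.items()}

# e.g. {'depth': 3, 'width_mult': 0.75, 'resolution': 192}
print(sample_subnet_config())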
2.4 Stochastic Depth
As a regularization technique, Stochastic Depth [30] randomly disables the convolution layers of residual blocks to substantially reduce training time and test error. In Stochastic Depth, the reduction in test error is attributed to the strengthened gradients of earlier layers and the implicit ensemble of numerous sub-networks of different depths. In fact, its improved performance can also be interpreted as relieving the network loafing problem defined in this work, and the theoretical analysis in this work also applies to Stochastic Depth. Besides, to better solve the loafing problem, our method samples sub-networks with ordered depth, and uses an additional KL-divergence loss to provide a more achievable target and make the outputs of a given network and its sub-networks more consistent.
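For reference, the following is a minimal PyTorch-style sketch of a residual block with stochastic depth as described in [30]; the survival probability p_survive is illustrative (the original work decays it linearly with depth).

import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block with stochastic depth: during training the residual
    branch is dropped with probability 1 - p_survive, leaving the identity."""

    def __init__(self, branch: nn.Module, p_survive: float = 0.8):
        super().__init__()
        self.branch = branch
        self.p_survive = p_survive

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.p_survive:
                return x + self.branch(x)
            return x  # block disabled: only the shortcut is used
        # At test time the branch is kept and scaled by its survival probability.
        return x + self.p_survive * self.branch(x)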