• We understand residual networks from a social psychology perspective, and find that different residual networks invariably suffer from the problem of network loafing.
• We improve residual networks from a social psychology perspective, and propose a simple-but-effective stimulative training strategy to improve the performance of the given residual network and all of its sub-networks.
• Comprehensive empirical and theoretical analyses verify that stimulative training can effectively solve the loafing problem of residual networks.
2 Related Work
2.1 Unraveled View
As a pioneering work investigating residual networks, [4] experimentally shows that a residual network can be seen as a collection of numerous networks of different lengths, namely the unraveled view. Subsequently, [5] follows this view and further attributes the success of residual networks primarily to its shallow sub-networks. Besides, [5] considers the neural network as a polynomial function and maps shallow sub-networks to low-degree terms to explain the working mechanism of residual networks. Similar to [4] and [5], this paper also investigates residual networks from the unraveled view. In contrast, inspired by social psychology, we further reveal the loafing problem of residual networks under the unraveled view. Besides, we propose a novel stimulative training method to relieve this problem and further unleash the potential of residual networks.
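For intuition, consider a residual network with two residual branches f_1 and f_2 and input x_0. Unrolling the recursion x_i = x_{i-1} + f_i(x_{i-1}) (a standard derivation following [4], not a result specific to this paper) gives

x_2 = x_1 + f_2(x_1) = x_0 + f_1(x_0) + f_2(x_0 + f_1(x_0)),

so the final output is a sum over paths that traverse zero, one, or both residual branches, i.e., an implicit collection of sub-networks of different lengths.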
2.2 Knowledge Distillation
As a classical method, knowledge distillation [12; 13] transfers knowledge from a teacher network to a student network by approximating the output logits [12; 14; 15] or features [16; 17; 18; 19]. To avoid the huge cost of training a high-performance teacher, some works abandon the naive teacher-student framework: mutual distillation [20] makes a group of students learn from each other online, and self distillation [21] transfers knowledge from deep layers to shallow layers. Generally, all these distillation methods need to introduce additional networks or structures, and employ fixed teacher-student pairs. In comparison, our method does not require any additional network or structure, and the student network is a randomly sampled sub-network of the given network. Besides, our method is essentially designed to address the loafing problem of residual networks, which differs from knowledge distillation, whose goal is to obtain a compact network with acceptable accuracy.
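For reference, the following is a minimal PyTorch-style sketch of logit-based distillation in the spirit of [12]; the temperature T and loss weight alpha are illustrative hyper-parameters, not values used in this paper.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Classical logit-based KD: soften both output distributions with
    temperature T, match them with KL divergence, and mix in the usual
    cross-entropy on the hard labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    # T^2 rescales the soft term so its gradients stay comparable to the hard term.
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce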
2.3 One-shot NAS
One-shot NAS is an important branch of neural architecture search (NAS) [22; 23; 24; 25; 26; 27]. Along this direction, [28] trains a once-for-all (OFA) network with progressive shrinking and knowledge distillation to support various architectural settings. Following this work, BigNAS [29] introduces several techniques to train a high-quality single-stage model whose child models can be directly deployed without extra retraining or post-processing steps. Both OFA and BigNAS aim at simultaneously training and searching various networks with different resolutions, depths, widths and operations. Differently, the proposed method aims at improving a given residual network, and thus can be seamlessly applied to the model searched by NAS. As OFA and BigNAS are not designed to solve the loafing problem, their sampling space and supervision signal also differ from those of the proposed method. More importantly, the social-psychology-inspired problem of network loafing may explain why OFA and BigNAS work.
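To make the weight-sharing idea concrete, the following is a minimal sketch of how a one-shot supernet samples a child architecture at each training step; the search space values are purely illustrative and are not the actual OFA or BigNAS spaces.

import random

# Hypothetical search space of a weight-sharing supernet (illustrative values).
SEARCH_SPACE = {
    "depth": [2, 3, 4],
    "width_mult": [0.5, 0.75, 1.0],
    "resolution": [160, 192, 224],
}

def sample_subnet_config():
    """At every training step, one-shot NAS samples a child architecture and
    updates only the shared weights that this child uses."""
    return {key: random.choice(values) for key, values in SEARCH_SPACE.items()}

# e.g. {'depth': 3, 'width_mult': 0.75, 'resolution': 192}
print(sample_subnet_config())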
2.4 Stochastic Depth
As a regularization technique, Stochastic Depth [30] randomly disables the convolution layers of residual blocks to substantially reduce training time and test error. In Stochastic Depth, the reduction in test error is attributed to the strengthened gradients of earlier layers and the implicit ensemble of numerous sub-networks of different depths. In fact, its improved performance can also be interpreted as relieving the network loafing problem defined in this work, and the theoretical analysis in this work also applies to Stochastic Depth. Besides, to better solve the loafing problem, our method samples sub-networks with ordered depth, and uses an additional KL-divergence loss to provide a more achievable target and make the outputs of a given network and its sub-networks more consistent.
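For reference, the following is a minimal PyTorch-style sketch of a residual block with stochastic depth as described in [30]; the survival probability p_survive is illustrative (the original work decays it linearly with depth).

import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block with stochastic depth: during training the residual
    branch is dropped with probability 1 - p_survive, leaving the identity."""

    def __init__(self, branch: nn.Module, p_survive: float = 0.8):
        super().__init__()
        self.branch = branch
        self.p_survive = p_survive

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.p_survive:
                return x + self.branch(x)
            return x  # block disabled: only the shortcut is used
        # At test time the branch is kept and scaled by its survival probability.
        return x + self.p_survive * self.branch(x)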