• The proposal of a new approach to structured dropout named ProbDropBlock, which improved model performance on both vision and language tasks. ProbDropBlock is adaptive: the blocks dropped depend on the relative per-pixel values. It improves RoBERTa finetuning on MNLI by 0.22% and ResNet50 on ImageNet by 0.28%.
• Further observation of the benefits of the simple linear scheduling of [10] for both structured and unstructured Dropout on a range of vision and language models.
2 Related Work
In this section, we briefly review related works in the areas of structured and unstructured Dropouts used as both a
regularization technique to improve model performance and as an approach to pruning to reduce the model size and
computational requirements. We briefly detail unstructured Dropout and the various structured Dropouts devised for
other network architectures.
2.1 Unstructured Dropout
To help address the problem of overfitting in neural networks, Srivastava et al. [15] proposed Dropout as a simple way of limiting the co-adaptation of the activations of units in the network. By randomly deactivating units during training, they sample from an exponential number of different thinned networks; at test time, an ensemble of these thinned networks is approximated by a single full network with smaller weights. Dropout led to improvements in the performance of neural networks on various tasks and has become widely adopted. In this work, we refer to this form of Dropout as unstructured Dropout, since any combination of units in the network may be randomly dropped/deactivated.
In the following subsection, we consider forms of structured Dropout which extend this idea further for other network
architectures and tasks.
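To make this concrete, the snippet below is a minimal sketch of unstructured Dropout in its inverted form (survivors are rescaled during training so that no rescaling is needed at test time, an equivalent and widely used variant of the original formulation); the function name and the PyTorch framing are our own illustrative choices.

```python
import torch

def unstructured_dropout(x: torch.Tensor, p: float = 0.5, training: bool = True) -> torch.Tensor:
    """Inverted Dropout sketch: during training, zero each unit independently with
    probability p and rescale the survivors by 1/(1 - p); at test time the full
    network is used unchanged, approximating the ensemble of thinned networks."""
    if not training or p == 0.0:
        return x
    keep_prob = 1.0 - p
    mask = (torch.rand_like(x) < keep_prob).to(x.dtype)
    return x * mask / keep_prob
```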
2.2 DropBlock and other structured Dropouts
Ghiasi et al. [10] proposed DropBlock as a way to perform structured Dropout for Convolutional Neural Nets (CNNs). They suggest that unstructured Dropout is less effective for convolutional layers than for fully connected layers because activation units in convolutional layers are spatially correlated, so information can still flow through the network despite Dropout. They therefore devised DropBlock, which drops units in a contiguous area of the feature map collectively.
This approach was inspired by DeVries and Taylor's Cutout [16], a data augmentation method in which parts of the input examples are zeroed out. DropBlock generalizes Cutout by applying it at every feature map in a convolutional network. Ghiasi et al. [10] also found that a schedule of linearly increasing DropBlock's zero-out ratio performed better than a fixed ratio.
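For illustration, the sketch below follows a commonly used simplified formulation of DropBlock: block centres are sampled at a rate gamma derived from the target drop probability, each centre is grown into a block_size x block_size region via max pooling, and the surviving activations are rescaled. A linear schedule for the zero-out ratio is sketched alongside. The function names, default block size, and PyTorch framing are our own assumptions, not the exact implementation of [10].

```python
import torch
import torch.nn.functional as F

def drop_block(x: torch.Tensor, drop_prob: float, block_size: int = 7) -> torch.Tensor:
    """Simplified DropBlock for a feature map x of shape (N, C, H, W):
    sample block centres at rate gamma, grow each centre into a
    block_size x block_size zeroed region (block_size assumed odd here),
    then rescale the surviving activations."""
    if drop_prob == 0.0:
        return x
    _, _, h, w = x.shape
    # gamma converts the target drop probability into a per-centre sampling rate
    gamma = (drop_prob / block_size ** 2) * (h * w) / ((h - block_size + 1) * (w - block_size + 1))
    centres = (torch.rand_like(x) < gamma).to(x.dtype)
    # max pooling grows each sampled centre into a contiguous zeroed block
    block_mask = 1.0 - F.max_pool2d(centres, kernel_size=block_size, stride=1, padding=block_size // 2)
    return x * block_mask * block_mask.numel() / block_mask.sum()

def linear_drop_schedule(step: int, total_steps: int, final_drop_prob: float) -> float:
    """Linearly increase the zero-out ratio from 0 to its final value over training."""
    return final_drop_prob * min(step, total_steps) / total_steps
```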
Dai et al. [11] extended DropBlock to Batch DropBlock. Their network consists of two branches: a global branch and a feature-dropping branch. In the feature-dropping branch, they randomly zero out the same contiguous area from each feature map in the batch used to compute the loss function. They suggest that zeroing out the same block across the batch allows the network to learn a more comprehensive and spatially distributed feature representation.
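A rough sketch of this masking step is given below, assuming rectangular blocks whose relative height and width are hyperparameters; the function name and ratio parameters are illustrative rather than taken from [11].

```python
import torch

def batch_drop_block(x: torch.Tensor, h_ratio: float = 0.3, w_ratio: float = 1.0) -> torch.Tensor:
    """Feature-dropping branch sketch: zero out the *same* contiguous region in
    every feature map of the batch. h_ratio and w_ratio (illustrative
    hyperparameters) set the block's size relative to the feature map."""
    n, c, h, w = x.shape
    dh, dw = max(1, int(h * h_ratio)), max(1, int(w * w_ratio))
    top = int(torch.randint(0, h - dh + 1, (1,)))
    left = int(torch.randint(0, w - dw + 1, (1,)))
    mask = x.new_ones(n, c, h, w)
    mask[:, :, top:top + dh, left:left + dw] = 0.0  # identical block across the whole batch
    return x * mask
```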
Larsson et al. [17] proposed DropPath in their work on FractalNets. Just as Dropout prevents the co-adaptation of activations, DropPath prevents the co-adaptation of parallel paths in networks such as FractalNets by randomly dropping operands of the join layers. DropPath provides at least one such path while sampling a subnetwork with many other paths disabled. During training, DropPath alternates between a global sampling strategy, which returns only a single path, and a local sampling strategy, in which a join drops each input with fixed probability but with a guarantee that at least one survives. This encourages the development of individual columns as performant stand-alone subnetworks.
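The local sampling strategy at a join might be sketched as follows, assuming an element-wise mean join as in FractalNet; the function name and drop probability are illustrative.

```python
import torch
from typing import List

def local_drop_path_join(paths: List[torch.Tensor], drop_prob: float = 0.15) -> torch.Tensor:
    """Local DropPath sampling at a join layer: drop each incoming path with
    probability drop_prob, but guarantee that at least one survives; the
    survivors are combined with an element-wise mean, as in FractalNet joins."""
    kept = [p for p in paths if float(torch.rand(1)) >= drop_prob]
    if not kept:  # guarantee at least one surviving path
        kept = [paths[int(torch.randint(len(paths), (1,)))]]
    return torch.stack(kept, dim=0).mean(dim=0)
```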
Cai et al. [12] proposed DropConv2d, motivated by their suggestion that the failure of standard dropout stems from a conflict between the stochasticity of unstructured dropout and the following Batch Normalization (BN) step. They propose placing dropout operations right before the convolutional operation instead of before BN, or replacing BN with Group Normalization (GN), to reduce this conflict. Additionally, they devised DropConv2d, which draws inspiration from DropPath and DropChannel: each channel connection is treated as a path between input and output channels, and dropout is performed on replicates of each of these paths.
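As an illustration of the re-ordering discussed above (not of DropConv2d itself, whose per-connection path replication we omit), a block that applies dropout to the convolution's input rather than around BN could look like the sketch below; the class name and hyperparameters are our own.

```python
import torch.nn as nn

class PreConvDropoutBlock(nn.Module):
    """Illustrative block: dropout is applied to the convolution's input
    (rather than adjacent to BN), so the statistics seen by the subsequent
    BN layer are less disturbed by the dropped activations."""
    def __init__(self, in_ch: int, out_ch: int, p: float = 0.1):
        super().__init__()
        self.drop = nn.Dropout2d(p)  # channel-wise spatial dropout before the conv
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(self.drop(x))))
```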
DropBlock, Batch DropBlock, DropPath and DropConv2d are forms of structured Dropout designed with a specific architecture in mind. However, as Cai et al. [12] showed with DropConv2d, an approach to structured Dropout designed for one network can still be useful for novel network architectures. Aside from being used to improve generalization, structured Dropout has also been used as an approach to pruning, reducing computational resource requirements at inference time.