REVISITING STRUCTURED DROPOUT
Yiren Zhao
Imperial College London
and University of Cambridge
a.zhao@imperial.ac.uk
Oluwatomisin Dada
University of Cambridge
oluwatomisin.dada@cl.cam.ac.uk
Xitong Gao
SIAT
xt.gao@siat.ac.cn
Robert D Mullins
University of Cambridge
robert.mullins@cl.cam.ac.uk
ABSTRACT
Large neural networks are often overparameterised and prone to overfitting; Dropout is a widely used regularization technique to combat overfitting and improve model generalization. However, unstructured Dropout is not always effective for specific network architectures, and this has led to the development of multiple structured Dropout approaches that improve model performance and, sometimes, reduce the computational resources required for inference. In this work, we revisit structured Dropout, comparing different Dropout approaches on natural language processing and computer vision tasks for multiple state-of-the-art networks. Additionally, we devise an approach to structured Dropout we call ProbDropBlock, which drops contiguous blocks from feature maps with a probability given by the normalized feature salience values. We find that with a simple scheduling strategy the proposed approach to structured Dropout consistently improves model performance compared to baselines and other Dropout approaches on a diverse range of tasks and models. In particular, we show ProbDropBlock improves RoBERTa finetuning on MNLI by 0.22%, and training of ResNet50 on ImageNet by 0.28%.
1 Introduction
In our modern society, Deep Neural Networks have become increasingly ubiquitous, having achieved significant success in many tasks including visual recognition and natural language processing [1, 2, 3]. These networks now play a larger role in our lives and our devices; however, despite their successes they still have notable weaknesses. Deep Neural Networks are often found to be highly overparameterized, and as a result require excessive memory and significant computational resources. Additionally, due to overparameterization, these networks are prone to overfit their training data.
There are several approaches to mitigate overfitting, including reducing model size or complexity, early stopping [4], data augmentation [5] and regularisation [6]. In this paper, we focus on Dropout, a widely used form of regularisation proposed by Srivastava et al. [7]. Standard unstructured Dropout randomly deactivates a subset of neurons in the network for each training iteration and trains the resulting subnetwork; at inference time the full model can then be treated as an approximation of an ensemble of these subnetworks.
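As a point of reference, the sketch below shows the standard inverted-Dropout computation described above: each unit is kept with probability 1 − p and the surviving activations are rescaled so that the full network can be used unchanged at test time. The function name and tensor shapes are ours; this is a minimal illustration rather than the reference implementation.

```python
import torch

def unstructured_dropout(x: torch.Tensor, p: float = 0.5, training: bool = True) -> torch.Tensor:
    """Standard (inverted) Dropout: zero each unit independently with probability p."""
    if not training or p == 0.0:
        return x  # at inference time the full network is used unchanged
    keep_prob = 1.0 - p
    # Bernoulli mask over individual units -- any combination of units may be dropped.
    mask = torch.bernoulli(torch.full_like(x, keep_prob))
    # Rescale so the expected activation matches the test-time network.
    return x * mask / keep_prob

# Example: a batch of feature maps of shape (N, C, H, W)
x = torch.randn(8, 64, 32, 32)
y = unstructured_dropout(x, p=0.3)
```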
Unstructured Dropout was efficient and effective, which led to it being widely adopted. However, when applied to Convolutional Neural Networks (CNNs), unstructured Dropout struggled to achieve notable improvements [8, 9], and this led to the development of several structured Dropout approaches [10, 11, 12], including DropBlock and DropChannel. DropBlock considers the spatial correlations between nearby entries in a feature map of a CNN and attempts to stop that information flow by deactivating larger contiguous areas/blocks, while DropChannel considers the correlation of information within a particular channel and performs Dropout at the channel level. However, since the development of these structured approaches, there have been further strides in network architecture design, with rising spread of and interest in Transformer-based models.
Figure 1: An illustration of applying different Dropouts to an image. (a) Original image and its RGB channels; (b) Dropout; (c) BatchDropBlock; (d) DropBlock; (e) Adaptive DropBlock.
Given the success achieved by block-wise structured Dropout on CNNs, it is only natural to ask whether these approaches apply to Transformer-based models. Existing structured Dropout approaches for Transformers tend to focus on reducing model size and inference time; these works place more emphasis on pruning and reducing computational resources [13, 14] than on combating overfitting, which is the focus of this paper.
In this paper, we revisit the idea of structured Dropout for current state-of-the-art models on language and vision tasks. Additionally, we devise our own form of adaptive structured Dropout, ProbDropBlock, and compare it to preexisting approaches to structured and unstructured Dropout.
In Figure 1 we illustrate the effects of selected structured and unstructured Dropout approaches on an image of a cat. As can be seen in Figure 1a, the original image consists of three channels (RGB) which are aggregated to form the image. Different approaches to Dropout may treat channels differently. Figure 1b illustrates the effect of unstructured Dropout on this image: the many small black squares represent deactivated/dropped weights at a pixel level, and different pixels have been deactivated in each channel. In Figure 1c we see fewer but larger black squares, and the locations of dropped pixels are consistent between channels; this is not the case in Figure 1e and Figure 1d. In this work, we say that BatchDropBlock is channel consistent, i.e. channels do not deactivate blocks independently; rather, the deactivated blocks are consistent between channels.
In Figure 1b, Figure 1c and Figure 1d, for a single channel there is a uniform probability of any pixel or block (depending on the approach) being dropped, so deactivated pixels may not contain any of the key information required to identify this image as a cat (i.e. the probability of deactivating a pixel/block belonging to the cat is the same as that of one belonging to the background). This is not the case for Figure 1e: in our adaptive DropBlock approach the probability of a block being dropped depends on the value of the center pixel of the block. It can be seen that this approach is not channel consistent and that deactivated pixels are concentrated on the cat.
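To make the adaptive behaviour concrete, the sketch below drops a block around each spatial position with a probability that scales with the normalized activation at its centre, independently per channel (so the mask is not channel consistent). This is a minimal illustration under our own choice of salience normalization and rescaling; the exact procedure used by ProbDropBlock may differ.

```python
import torch
import torch.nn.functional as F

def adaptive_block_dropout(x: torch.Tensor, block_size: int = 5, drop_rate: float = 0.1) -> torch.Tensor:
    """Illustrative adaptive block dropout (assumes an odd block_size).

    The chance that a block is dropped is proportional to the normalized activation
    at its centre pixel, so blocks covering salient regions are dropped more often.
    Channels are treated independently, i.e. the mask is not channel consistent.
    """
    # Normalize per-pixel salience to [0, 1] within each feature map (our assumption).
    a = x.abs()
    salience = a / a.amax(dim=(2, 3), keepdim=True).clamp(min=1e-6)
    # Drop a block centred at a pixel with probability scaled by its salience.
    centre_drop = torch.bernoulli((drop_rate * salience).clamp(max=1.0))
    # Grow each dropped centre into a block_size x block_size region via max pooling.
    block_mask = F.max_pool2d(centre_drop, kernel_size=block_size, stride=1,
                              padding=block_size // 2)
    keep_mask = 1.0 - block_mask.clamp(max=1.0)
    # Rescale surviving activations to preserve the expected magnitude.
    return x * keep_mask * keep_mask.numel() / keep_mask.sum().clamp(min=1.0)
```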
Figure 1 is intended to give an intuitive understanding of these techniques; in practice they are applied to feature maps, i.e. the output activations of a preceding layer of the network. The contributions of this paper include:
• The testing of preexisting unstructured and structured Dropout approaches on current state-of-the-art models, including Transformer-based models, on natural language inference and vision tasks. We reveal that structured Dropouts are generally better than unstructured ones on both vision and language tasks.
• The proposal of a new approach to structured Dropout named ProbDropBlock, which improves model performance on both vision and language tasks. ProbDropBlock is adaptive and the blocks dropped depend on the relative per-pixel values. It improves RoBERTa finetuning on MNLI by 0.22% and ResNet50 on ImageNet by 0.28%.
• Further observation of the benefits of the simple linear scheduling proposed in [10] for both structured and unstructured Dropout on a range of vision and language models (a minimal sketch of this schedule follows this list).
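For concreteness, the sketch below shows the linear scheduling referred to above (following [10]): the drop rate is ramped linearly from zero to its target value over the course of training. The function name and arguments are ours.

```python
def linear_drop_rate(step: int, total_steps: int, target_rate: float) -> float:
    """Linearly increase the drop rate from 0 to target_rate over training."""
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return target_rate * progress

# e.g. a quarter of the way through training with a target rate of 0.1:
assert abs(linear_drop_rate(250, 1000, 0.1) - 0.025) < 1e-9
```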
2 Related Work
In this section, we briefly review related work in the areas of structured and unstructured Dropout, used both as a regularization technique to improve model performance and as an approach to pruning that reduces model size and computational requirements. We briefly detail unstructured Dropout and the various structured Dropouts devised for different network architectures.
2.1 Unstructured Dropout
To help address the problem of overfitting in neural networks, Srivastava et al. [15] proposed Dropout as a simple way of limiting the co-adaptation of the activations of units in the network. By randomly deactivating units during training, they sample from an exponential number of different thinned networks, and at test time an ensemble of these thinned networks is approximated by a single full network with smaller weights. Dropout led to improvements in the performance of neural networks on various tasks and has become widely adopted. In this work we refer to this form of Dropout as unstructured Dropout, as any combination of units in the network may be randomly dropped/deactivated. In the following subsection, we consider forms of structured Dropout which extend this idea further for other network architectures and tasks.
2.2 DropBlock and other structured Dropouts
Ghiasi et al. [10] proposed DropBlock as a way to perform structured Dropout for Convolutional Neural Networks (CNNs). They suggest that unstructured Dropout is less effective for convolutional layers than for fully connected layers because activation units in convolutional layers are spatially correlated, so information can still flow through convolutional networks despite Dropout. They therefore devised DropBlock, which drops units in a contiguous area of the feature map collectively. This approach was inspired by DeVries and Taylor's [16] Cutout, a data augmentation method in which parts of the input examples are zeroed out; DropBlock generalizes Cutout by applying it at every feature map in convolutional networks. Ghiasi et al. [10] also found that a scheduling scheme of linearly increasing DropBlock's zero-out ratio performed better than a fixed ratio.
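A simplified sketch of this block-wise masking is given below: block centres are sampled at an adjusted rate gamma and grown into block_size x block_size regions, roughly following the description in [10]. Edge handling and the restriction of seeds to the valid region are simplified, so this should be read as an illustration rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def dropblock(x: torch.Tensor, drop_prob: float = 0.1, block_size: int = 7) -> torch.Tensor:
    """Simplified DropBlock: drop contiguous block_size x block_size regions (odd block_size)."""
    if drop_prob == 0.0:
        return x
    n, c, h, w = x.shape
    # Seed rate chosen so that roughly drop_prob of the units end up dropped.
    gamma = drop_prob / (block_size ** 2) * (h * w) / ((h - block_size + 1) * (w - block_size + 1))
    # Sample block centres independently per channel.
    centres = torch.bernoulli(torch.full((n, c, h, w), gamma, device=x.device))
    # Grow each centre into a block via max pooling.
    block_mask = F.max_pool2d(centres, kernel_size=block_size, stride=1, padding=block_size // 2)
    keep_mask = 1.0 - block_mask.clamp(max=1.0)
    # Rescale so the expected total activation is preserved.
    return x * keep_mask * keep_mask.numel() / keep_mask.sum().clamp(min=1.0)
```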
Dai et al. [11] extended DropBlock to Batch DropBlock. Their network consists of two branches: a global branch and a feature-dropping branch. In the feature-dropping branch they randomly zero out the same contiguous area from each feature map in a batch involved in computing the loss function. They suggest that zeroing out the same block across the batch allows the network to learn a more comprehensive and spatially distributed feature representation.
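By contrast with the DropBlock sketch above, the toy sketch below shares a single randomly placed block across every feature map in the batch (and across channels), matching the behaviour described for the feature-dropping branch; the block placement and sizing here are our own simplification.

```python
import torch

def batch_dropblock(x: torch.Tensor, block_size: int = 8) -> torch.Tensor:
    """Toy Batch DropBlock: zero the same contiguous region in every feature map of the batch."""
    n, c, h, w = x.shape
    # One block location shared by the whole batch and all channels.
    top = int(torch.randint(0, h - block_size + 1, (1,)))
    left = int(torch.randint(0, w - block_size + 1, (1,)))
    mask = torch.ones(1, 1, h, w, device=x.device)
    mask[:, :, top:top + block_size, left:left + block_size] = 0.0
    return x * mask  # the same block is erased for every example and channel
```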
Larsson et al. [17] proposed DropPath in their work on FractalNets. Just as Dropout prevents the co-adaptation of activations, DropPath prevents the co-adaptation of parallel paths in networks such as FractalNets by randomly dropping operands of the join layers. DropPath always provides at least one path while sampling a subnetwork with many other paths disabled. During training, DropPath alternates between a global sampling strategy, which returns only a single path, and a local sampling strategy, in which a join drops each input with a fixed probability but with a guarantee that at least one survives. This encourages the development of individual columns as performant stand-alone subnetworks.
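A small sketch of the local sampling rule for a join layer: each incoming path is dropped with a fixed probability, with a guarantee that at least one survives, and the join averages the surviving inputs (FractalNet joins use an element-wise mean). The function name and the way the guarantee is enforced are our own choices.

```python
import random
import torch

def droppath_join(paths: list, drop_prob: float = 0.5) -> torch.Tensor:
    """Local DropPath sampling: average the surviving inputs of a join layer."""
    keep = [random.random() >= drop_prob for _ in paths]
    if not any(keep):
        # Guarantee that at least one path survives.
        keep[random.randrange(len(paths))] = True
    kept = [p for p, k in zip(paths, keep) if k]
    return torch.stack(kept, dim=0).mean(dim=0)

# Example: three parallel branches producing same-shaped activations
outs = [torch.randn(4, 16, 8, 8) for _ in range(3)]
joined = droppath_join(outs, drop_prob=0.5)
```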
Cai et al. [12] proposed DropConv2d, suggesting that the failure of standard Dropout is due to a conflict between the stochasticity of unstructured Dropout and the following Batch Normalization (BN) step. They propose placing Dropout operations right before the convolutional operation instead of before BN, or replacing BN with Group Normalization (GN), to reduce this conflict. Additionally, they devised DropConv2d, which draws inspiration from DropPath and DropChannel: they treat each channel connection as a path between input and output channels and perform Dropout on replicates of each of these paths.
DropBlock, BatchDropBlock, DropPath and DropConv2d are forms of structured Dropout designed with specific architectures in mind. However, as Cai et al. [12] show with DropConv2d, an approach to structured Dropout designed for one network can still be useful for novel network architectures. Aside from being used to improve generalization, structured Dropout has also been used as an approach to pruning, reducing computational resource requirements at inference time.