• The proposal of a new approach to structured dropout named ProbDropBlock, which improved model performance on both vision and language tasks. ProbDropBlock is adaptive: the blocks dropped depend on the relative per-pixel values. It improves RoBERTa finetuning on MNLI by 0.22% and ResNet50 on ImageNet by 0.28%.
• Further observation of the benefits of the simple linear scheduling of [10] for both structured and unstructured Dropout on a range of vision and language models.
2 Related Work
In this section, we briefly review related works in the areas of structured and unstructured Dropouts used as both a
regularization technique to improve model performance and as an approach to pruning to reduce the model size and
computational requirements. We briefly detail unstructured Dropout and the various structured Dropouts devised for
other network architectures.
2.1 Unstructured Dropout
To help address the problem of overfitting in neural networks, Srivastava et al. [15] proposed Dropout as a simple way of limiting the co-adaptation of the activations of units in the network. By randomly deactivating units during training, they sample from an exponential number of different thinned networks; at test time, an ensemble of these thinned networks is approximated by a single full network with smaller weights. Dropout led to improvements in the performance of neural networks on various tasks and has become widely adopted. In this work, we refer to this form of Dropout as unstructured Dropout, since any combination of units in the network may be randomly dropped/deactivated.
In the following subsection, we consider forms of structured Dropout which extend this idea further for other network
architectures and tasks.
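To make this concrete, the snippet below is a minimal sketch of unstructured Dropout in its inverted form (survivors are rescaled during training so that no rescaling is needed at test time, an equivalent and widely used variant of the original formulation); the function name and the PyTorch framing are our own illustrative choices.

```python
import torch

def unstructured_dropout(x: torch.Tensor, p: float = 0.5, training: bool = True) -> torch.Tensor:
    """Inverted Dropout sketch: during training, zero each unit independently with
    probability p and rescale the survivors by 1/(1 - p); at test time the full
    network is used unchanged, approximating the ensemble of thinned networks."""
    if not training or p == 0.0:
        return x
    keep_prob = 1.0 - p
    mask = (torch.rand_like(x) < keep_prob).to(x.dtype)
    return x * mask / keep_prob
```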
2.2 DropBlock and other structured Dropouts
Ghiasi et al. [10] proposed DropBlock as a way to perform structured Dropout for Convolutional Neural Nets (CNNs). They suggest that unstructured Dropout is less effective for convolutional layers than for fully connected layers because activation units in convolutional layers are spatially correlated, so information can still flow through the network despite Dropout. They therefore devised DropBlock, which drops units in a contiguous area of the feature map collectively.
This approach was inspired by DeVries and Taylor's Cutout [16], a data augmentation method in which parts of the input examples are zeroed out. DropBlock generalizes Cutout by applying it at every feature map in a convolutional network. Ghiasi et al. [10] also found that a schedule of linearly increasing DropBlock's zero-out ratio performed better than a fixed ratio.
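For illustration, the sketch below follows a commonly used simplified formulation of DropBlock: block centres are sampled at a rate gamma derived from the target drop probability, each centre is grown into a block_size x block_size region via max pooling, and the surviving activations are rescaled. A linear schedule for the zero-out ratio is sketched alongside. The function names, default block size, and PyTorch framing are our own assumptions, not the exact implementation of [10].

```python
import torch
import torch.nn.functional as F

def drop_block(x: torch.Tensor, drop_prob: float, block_size: int = 7) -> torch.Tensor:
    """Simplified DropBlock for a feature map x of shape (N, C, H, W):
    sample block centres at rate gamma, grow each centre into a
    block_size x block_size zeroed region (block_size assumed odd here),
    then rescale the surviving activations."""
    if drop_prob == 0.0:
        return x
    _, _, h, w = x.shape
    # gamma converts the target drop probability into a per-centre sampling rate
    gamma = (drop_prob / block_size ** 2) * (h * w) / ((h - block_size + 1) * (w - block_size + 1))
    centres = (torch.rand_like(x) < gamma).to(x.dtype)
    # max pooling grows each sampled centre into a contiguous zeroed block
    block_mask = 1.0 - F.max_pool2d(centres, kernel_size=block_size, stride=1, padding=block_size // 2)
    return x * block_mask * block_mask.numel() / block_mask.sum()

def linear_drop_schedule(step: int, total_steps: int, final_drop_prob: float) -> float:
    """Linearly increase the zero-out ratio from 0 to its final value over training."""
    return final_drop_prob * min(step, total_steps) / total_steps
```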
Dai et al. [11] extended DropBlock to Batch DropBlock. Their network consists of two branches: a global branch and a feature-dropping branch. In the feature-dropping branch, they randomly zero out the same contiguous area from each feature map in the batch used to compute the loss function. They suggest that zeroing out the same block across the batch allows the network to learn a more comprehensive and spatially distributed feature representation.
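A rough sketch of this masking step is given below, assuming rectangular blocks whose relative height and width are hyperparameters; the function name and ratio parameters are illustrative rather than taken from [11].

```python
import torch

def batch_drop_block(x: torch.Tensor, h_ratio: float = 0.3, w_ratio: float = 1.0) -> torch.Tensor:
    """Feature-dropping branch sketch: zero out the *same* contiguous region in
    every feature map of the batch. h_ratio and w_ratio (illustrative
    hyperparameters) set the block's size relative to the feature map."""
    n, c, h, w = x.shape
    dh, dw = max(1, int(h * h_ratio)), max(1, int(w * w_ratio))
    top = int(torch.randint(0, h - dh + 1, (1,)))
    left = int(torch.randint(0, w - dw + 1, (1,)))
    mask = x.new_ones(n, c, h, w)
    mask[:, :, top:top + dh, left:left + dw] = 0.0  # identical block across the whole batch
    return x * mask
```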
Larsson et al. [17] proposed DropPath in their work on FractalNets. Just as Dropout prevents the co-adaptation of activations, DropPath prevents the co-adaptation of parallel paths in networks such as FractalNets by randomly dropping operands of the join layers. DropPath provides at least one such path while sampling a subnetwork with many other paths disabled. During training, DropPath alternates between a global sampling strategy, which returns only a single path, and a local sampling strategy, in which a join drops each input with fixed probability but with a guarantee that at least one survives. This encourages the development of individual columns as performant stand-alone subnetworks.
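The local sampling strategy at a join might be sketched as follows, assuming an element-wise mean join as in FractalNet; the function name and drop probability are illustrative.

```python
import torch
from typing import List

def local_drop_path_join(paths: List[torch.Tensor], drop_prob: float = 0.15) -> torch.Tensor:
    """Local DropPath sampling at a join layer: drop each incoming path with
    probability drop_prob, but guarantee that at least one survives; the
    survivors are combined with an element-wise mean, as in FractalNet joins."""
    kept = [p for p in paths if float(torch.rand(1)) >= drop_prob]
    if not kept:  # guarantee at least one surviving path
        kept = [paths[int(torch.randint(len(paths), (1,)))]]
    return torch.stack(kept, dim=0).mean(dim=0)
```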
Cai et al. [12] proposed DropConv2d, motivated by their suggestion that the failure of standard dropout stems from a conflict between the stochasticity of unstructured dropout and the following Batch Normalization (BN) step. They propose placing dropout operations right before the convolutional operation instead of before BN, or replacing BN with Group Normalization (GN), to reduce this conflict. Additionally, they devised DropConv2d, which draws inspiration from DropPath and DropChannel: each channel connection is treated as a path between input and output channels, and dropout is performed on replicates of each of these paths.
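As an illustration of the re-ordering discussed above (not of DropConv2d itself, whose per-connection path replication we omit), a block that applies dropout to the convolution's input rather than around BN could look like the sketch below; the class name and hyperparameters are our own.

```python
import torch.nn as nn

class PreConvDropoutBlock(nn.Module):
    """Illustrative block: dropout is applied to the convolution's input
    (rather than adjacent to BN), so the statistics seen by the subsequent
    BN layer are less disturbed by the dropped activations."""
    def __init__(self, in_ch: int, out_ch: int, p: float = 0.1):
        super().__init__()
        self.drop = nn.Dropout2d(p)  # channel-wise spatial dropout before the conv
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(self.drop(x))))
```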
DropBlock, Batch DropBlock, DropPath and DropConv2d are forms of structured Dropout designed with a specific architecture in mind. However, as Cai et al. [12] showed with DropConv2d, an approach to structured Dropout designed for one network can still be useful for novel network architectures. Aside from being used to improve generalization, structured Dropout has also been used as an approach to pruning, reducing computational resource requirements at inference time.