
3 Methods
3.1 Preliminaries
Let fθ(x) : X → Y be a classifier with parameters θ, where X ⊆ R^D and Y = {1, . . . , C}. In feed-forward deep neural networks, the classifier fθ usually consists of simpler functions f^(l)(x), l ∈ {1, . . . , L}, composed together such that the network output is computed as ŷ = f^(L)(f^(L−1)(· · · f^(1)(x))). For our function fθ to correctly classify the input x, we wish for it to attain a small risk for (x, y) ∼ D as measured by a loss function L. Additionally, for our classifier to be robust, we also wish fθ to attain a small risk in the vicinity of all x ∈ X, normally defined by a p-norm ball of fixed radius ϵ around the sample points Madry et al. (2017).
Intuitively, a model that has high prediction variance (or similarly, high risk variance) under noisy inputs is more likely to exhibit extremely high risk for data points sampled from the same distribution (i.e. adversarial examples). Indeed, classifiers that generate lower-variance predictions are often expected to generalize better and be more robust to input noise. For example, classic ensemble methods like bagging, boosting, and random
forests operate by combining the decisions of many weak (i.e. high variance) classifiers into a stronger one
with reduced prediction variance and improved generalization performance Hastie et al. (2009).
Given an ensemble of predictor functions f_i, i ∈ {1, . . . , K}, with zero or small biases, the ensemble prediction (normally considered as the mean prediction ȳ = (1/K) Σ_{i=1}^{K} ŷ_i) reduces the expected generalization loss by shrinking the prediction variance. To demonstrate the point, one can consider K i.i.d. random variables with variance σ² and their average value, which has a variance of σ²/K. Based on this logic, one can expect
ensembles of neural network classifiers to be more robust in the presence of noise or input perturbations
in general. However, several prior ensemble models have been shown to remain prone to ensembles of adversarial attacks with large perturbation budgets ϵ Tramer et al. (2020). One reason these ensemble models remain
vulnerable to adversarial attacks is that individual networks participating in these ensembles may still learn
different sets of non-robust representations leaving room for the attackers to find common weak spots across
all individual models within the ensemble. Additionally, while larger ensembles may be effective in that
regard, constructing ever-larger ensemble classifiers might quickly become infeasible, especially in the case
of neural network classifiers.
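The σ²/K variance reduction underlying this argument can be checked numerically. The following sketch is purely illustrative (the ensemble size and noise level are arbitrary choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 25        # illustrative ensemble size
sigma = 1.0   # per-predictor standard deviation

# Simulate K i.i.d. predictors, each with variance sigma^2,
# over many trials, and average their predictions per trial.
preds = rng.normal(loc=0.0, scale=sigma, size=(100_000, K))
ensemble_mean = preds.mean(axis=1)

print(preds[:, 0].var())    # ≈ sigma^2 = 1.0
print(ensemble_mean.var())  # ≈ sigma^2 / K = 0.04
```

The empirical variance of the averaged prediction comes out close to σ²/K, matching the i.i.d. argument above.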
One possible solution could be to focus on learning robust features by forming ensembles of features in
the network. Indeed, learning robust features has been suggested as a way towards robust classification
Bashivan et al. (2021). Consequently, if individual kernels within a single network were made robust through ensembling, it would become much more difficult to find adversaries that can fool the full network. In the
next section, we introduce Kernel Average Pooling for learning ensembles of kernels with better robustness
properties against input perturbations.
3.2 Kernel average pooling (KAP)
Mean filters (a.k.a. average pooling) are widely accepted as simple noise-suppression mechanisms in computer vision. For example, spatial average pooling layers are commonly used in modern deep neural networks Zoph et al. (2018), applying a mean filter along the spatial dimensions of the input to reduce the effect of spatially distributed noise (e.g. noise spread across adjacent pixels of an image).
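As a small illustration of this noise-suppression effect (the image size, noise level, and pooling window below are arbitrary choices for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A constant 8x8 "image" corrupted by i.i.d. per-pixel noise.
clean = np.full((8, 8), 0.5)
noisy = clean + rng.normal(scale=0.2, size=clean.shape)

# 2x2 spatial average pooling with stride 2: average each
# non-overlapping 2x2 patch, suppressing per-pixel noise.
pooled = noisy.reshape(4, 2, 4, 2).mean(axis=(1, 3))

# Pooling shrinks the deviation from the clean value 0.5:
print(np.abs(noisy - 0.5).mean())   # per-pixel error
print(np.abs(pooled - 0.5).mean())  # smaller after pooling
```

Averaging four i.i.d. noisy pixels halves the noise standard deviation, so the pooled output lies closer to the clean signal.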
Here, we wish to substitute each kernel in the neural network model with an ensemble of kernels performing
the same function such that the ensemble output is the average of individual kernel outputs. This can be
conveniently carried out by applying the average pooling operation along the kernel dimension of the input
tensor.
Given an input z = [z_1, . . . , z_{N_k}] ∈ R^{N_k}, the kernel average pooling operation (KAP) with kernel size K and stride S computes the function

z̄_i = (1/K) Σ_{l = Si−(K−1)/2}^{Si+(K−1)/2} z_l    (2)

where z_l is zero-padded (with zero weight in the computation of the average) to match the dimensionality of the z̄ and z variables (see A.1 for the details of padding). Importantly, when z is the output of an operation linear
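Equation (2) can be sketched in NumPy as follows. This is a minimal illustration assuming stride S = 1, an odd kernel size K, and plain zero padding at the borders (the paper's exact padding scheme is described in its Appendix A.1); in a network, this operation would be applied along the kernel/channel dimension of a feature map:

```python
import numpy as np

def kernel_average_pool(z, K):
    """Sketch of KAP (eq. 2) with stride S = 1 and odd K.

    Each output channel i is the mean of the K neighbouring
    channel activations centred on i; plain zero padding is
    assumed at the borders (see the paper's A.1 for details).
    """
    half = (K - 1) // 2
    padded = np.pad(z, half)  # zeros outside the valid range
    # Sliding windows of length K along the channel dimension.
    windows = np.lib.stride_tricks.sliding_window_view(padded, K)
    return windows.mean(axis=-1)

z = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
out = kernel_average_pool(z, K=3)
print(out)  # interior entries are exact 3-channel means,
            # e.g. out[1] = (1 + 2 + 3) / 3 = 2.0
```

Note that because the window is centred and the stride is 1, the output has the same number of channels as the input, so KAP layers can be dropped into a network without changing feature-map shapes.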