
Contributions First, we present two mechanisms towards equivariance-aware architectural optimization. The equivariance relaxation morphism for group convolutional layers partially expands the representation and parameters of the layer to enable less constrained learning with a prior on symmetry. The [G]-mixed equivariant layer parameterizes a layer as a weighted sum of layers equivariant to different groups, permitting the learning of architectural weighting parameters.
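As an illustration of the second mechanism, below is a minimal sketch of a [G]-mixed equivariant layer as a weighted sum of candidate layers. It assumes the candidate modules (hypothetical group-convolutional layers, each equivariant to a different group) produce outputs of identical shape, and the softmax normalization of the architectural weights is an assumption of this sketch, not taken from the text.

```python
import torch
import torch.nn as nn


class GMixedEquivariantLayer(nn.Module):
    """Sketch of a [G]-mixed equivariant layer: a weighted sum of layers
    equivariant to different groups, with learnable architectural weights."""

    def __init__(self, candidate_layers):
        super().__init__()
        # Each candidate is assumed to be equivariant to a different group
        # and to map inputs to outputs of the same shape.
        self.candidates = nn.ModuleList(candidate_layers)
        # One architectural weighting parameter per candidate group.
        self.arch_logits = nn.Parameter(torch.zeros(len(candidate_layers)))

    def forward(self, x):
        # Softmax normalization of the mixture weights is an assumption.
        weights = torch.softmax(self.arch_logits, dim=0)
        return sum(w * layer(x) for w, layer in zip(weights, self.candidates))
```

Because the architectural weights are ordinary parameters, they can be optimized by gradient descent alongside the candidate layers' own weights.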
Second, we implement these concepts within two algorithms for architectural optimization of partially-equivariant networks. Evolutionary Equivariance-Aware NAS (EquiNAS$_E$) utilizes the equivariance relaxation morphism in a greedy evolutionary algorithm, dynamically relaxing constraints throughout the training process. Differentiable Equivariance-Aware NAS (EquiNAS$_D$) implements [G]-mixed equivariant layers throughout a network to learn the appropriate approximate equivariance of each layer, in addition to their optimized weights, during training.
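The following is a hedged sketch of how such joint training could look, assuming a model built from layers like the `GMixedEquivariantLayer` above and assuming both parameter sets are updated from the same task loss; the actual EquiNAS$_D$ update scheme (e.g., its choice of optimizers or learning rates) is not specified in this excerpt.

```python
import torch


def make_optimizers(model, weight_lr=1e-3, arch_lr=1e-2):
    """Separate the architectural weighting parameters (assumed to be named
    `arch_logits`, as in the sketch above) from the ordinary layer weights
    so each set can receive its own learning rate. The split learning rates
    are an assumption, not taken from the text."""
    arch_params, weight_params = [], []
    for name, param in model.named_parameters():
        (arch_params if "arch_logits" in name else weight_params).append(param)
    weight_opt = torch.optim.Adam(weight_params, lr=weight_lr)
    arch_opt = torch.optim.Adam(arch_params, lr=arch_lr)
    return weight_opt, arch_opt


def training_step(model, weight_opt, arch_opt, loss_fn, x, y):
    """One joint gradient step on a batch: both parameter sets are updated
    from the same task loss, so each layer's approximate equivariance is
    learned alongside its weights."""
    weight_opt.zero_grad()
    arch_opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    weight_opt.step()
    arch_opt.step()
```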
Finally, we analyze the proposed mechanisms via their respective NAS approaches in multiple image classification tasks, investigating how the dynamically learned approximate equivariance affects training and performance relative to baseline models and other approaches.
1.1 Related works
Approximate equivariance Although no other works on approximate equivariance explicitly study architectural
optimization, some approaches are architectural in nature. We compare our contributions against the works that are, to our knowledge, most conceptually similar.
The main contributions of Basu et al. (2021) and Agrawal & Ostrowski (2022) are similar to our proposed equivariance
relaxation morphism. Basu et al. (2021) also utilizes subgroup decomposition but instead algorithmically builds up
equivariances from smaller groups, while our work focuses on relaxing existing constraints. Agrawal & Ostrowski
(2022) presents theoretical contributions towards network morphisms for group-invariant shallow neural networks:
in comparison, our work focuses on deep group convolutional architectures and implements the morphism in a NAS
algorithm.
The main contributions of Wang et al. (2022) and Finzi et al. (2021) are similar to our proposed [G]-mixed equivariant
layer. Wang et al. (2022) also uses a weighted sum of kernels, but uses the same group for each kernel and defines the
weights over the domain of group elements. Finzi et al. (2021) uses an equivariant layer in parallel to a linear layer
with weighted regularization, thus only using two layers in parallel and weighting them through regularization rather
than parameterization.
In more diverse approaches, Zhou et al. (2020) and Yeh et al. (2022) represent symmetry-inducing weight sharing
through learnable matrices. Romero & Lohit (2022) and van der Ouderaa et al. (2022) learn partial or soft equivariances for each layer.
Neural architecture search Neural architecture search (NAS) aims to optimize both the architecture and its parameters for a given task. Liu et al. (2018) approaches this difficult bi-level optimization by creating a large super-network containing all possible elements and continuously relaxing the discrete architectural parameters to enable search by gradient descent. Other NAS approaches include evolutionary algorithms (Real et al., 2017; Lu et al., 2019; Elsken et al., 2017) and reinforcement learning (Zoph & Le, 2017), which search over discretely represented architectures.
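For concreteness, here is a minimal sketch of the alternating bi-level optimization in the spirit of Liu et al. (2018): architectural parameters are updated on validation data and ordinary weights on training data. This is a first-order approximation; the exact DARTS update differs in detail.

```python
import torch


def bilevel_search_step(model, weight_opt, arch_opt, loss_fn,
                        train_batch, val_batch):
    """One alternating step of gradient-based NAS: the continuously relaxed
    architectural parameters are updated on a validation batch, then the
    ordinary model weights are updated on a training batch."""
    # Architecture step on a validation batch.
    x_val, y_val = val_batch
    arch_opt.zero_grad()
    loss_fn(model(x_val), y_val).backward()
    arch_opt.step()

    # Weight step on a training batch.
    x_train, y_train = train_batch
    weight_opt.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    weight_opt.step()
```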
2 Background
We assume familiarity with group theory (see Appendix A.1). Let $G$ be a discrete group. The $l$th $G$-equivariant group convolutional layer (Cohen & Welling, 2016) of a group convolutional neural network (G-CNN) convolves the feature map $f: G \to \mathbb{R}^{C_{l-1}}$ output from the previous layer with a filter with kernel size $k$, represented as learnable parameters $\psi: G \to \mathbb{R}^{C_l \times C_{l-1}}$. For each output channel $d \in [C_l]$, where $[C] := \{1, \dots, C\}$, and group element $g \in G$, the layer's output is defined via the convolution operator$^1$:
$$[f \star_G \psi]_d(g) = \sum_{h \in G} \sum_{c=1}^{C_{l-1}} f_c(h)\, \psi_{d,c}(g^{-1}h). \tag{1}$$
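The following is a minimal sketch of Eq. (1) for a finite group whose elements are indexed by integers, assuming a precomputed lookup table `cayley_inv[g, h]` that returns the index of $g^{-1}h$; spatial structure and the kernel size $k$ are omitted, so the feature map is treated purely as a function on the group.

```python
import torch


def group_conv(f, psi, cayley_inv):
    """Direct evaluation of Eq. (1) for a finite group of size |G|.

    f          -- feature map, shape (|G|, C_in), with f[h, c] = f_c(h)
    psi        -- filter, shape (|G|, C_out, C_in), with psi[g, d, c] = psi_{d,c}(g)
    cayley_inv -- integer tensor of shape (|G|, |G|); cayley_inv[g, h] is the
                  index of the group element g^{-1} h
    Returns the output feature map of shape (|G|, C_out).
    """
    num_g, c_out = f.shape[0], psi.shape[1]
    out = torch.zeros(num_g, c_out)
    for g in range(num_g):
        for h in range(num_g):
            # psi[cayley_inv[g, h]] has shape (C_out, C_in); f[h] has shape (C_in,)
            out[g] += psi[cayley_inv[g, h]] @ f[h]
    return out
```

For the cyclic group of order four, for example, the lookup table would be `cayley_inv = torch.tensor([[(h - g) % 4 for h in range(4)] for g in range(4)])`.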
The first layer is a special case: the input to the network needs to be lifted via this operation such that the output feature map of this layer has a domain of $G$. In the case of image data, an image $x$ with $C$ channels may be interpreted
$^1$We identify the correlation and convolution operators, as they only differ in where the inverse group element is placed, and refer to both as "convolution" throughout this work.