Equi-Tuning: Group Equivariant Fine-Tuning of Pretrained Models
Sourya Basu,1,2,* Prasanna Sattigeri,1 Karthikeyan Natesan Ramamurthy,1
Vijil Chenthamarakshan,1 Kush R. Varshney,1 Lav R. Varshney,2 Payel Das1
1IBM Research – Thomas J. Watson Research Center, 2University of Illinois at Urbana-Champaign
Abstract
We introduce equi-tuning, a novel fine-tuning method that transforms (potentially non-equivariant) pretrained models into group equivariant models while incurring minimum L2 loss between the feature representations of the pretrained and the equivariant models. Large pretrained models can be equi-tuned for different groups to satisfy the needs of various downstream tasks. Equi-tuned models benefit from both group equivariance as an inductive bias and semantic priors from pretrained models. We provide applications of equi-tuning on three different tasks: image classification, compositional generalization in language, and fairness in natural language generation (NLG). We also provide a novel group-theoretic definition for fairness in NLG. The effectiveness of this definition is shown by testing it against a standard empirical method of fairness in NLG. We provide experimental results for equi-tuning using a variety of pretrained models: Alexnet, Resnet, VGG, and Densenet for image classification; RNNs, GRUs, and LSTMs for compositional generalization; and GPT2 for fairness in NLG. We test these models on benchmark datasets across all considered tasks to show the generality and effectiveness of the proposed method.
1 Introduction
Modern deep learning models show promising transfer-learning abilities for a wide range of downstream tasks (Bommasani et al. 2021). Lu et al. (2021) show that the GPT2 language model (Radford et al. 2019) can be used as a pretrained model for various downstream tasks such as numerical computation, image classification, and even protein folding prediction. But pretraining large models requires immense computational and data resources. Hence, it is essential to design effective fine-tuning algorithms that can squeeze the most from these pretrained models.
Fine-tuning leverages semantic priors from pretrained models for downstream tasks. E.g., CNNs trained on Imagenet (Deng et al. 2009) can extract useful features from images outside the training set and can use that ability for any other downstream image processing task. A different method of using priors in deep learning is via inductive biases in models such as group equivariance, e.g., designing group equivariant architectures such as GCNNs (Cohen
*Work done during an internship at IBM Research
Copyright © 2023, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
(a) Fine-tuning (b) Equi-tuning for the c4 group
Figure 1: Comparison of architectures for fine-tuning and equi-tuning for the c4 group of 90° rotations. For (a) fine-tuning, the input is passed through the pretrained model and then through a custom layer to obtain the output. For (b) equi-tuning, the inputs are transformed using the group action of c4. These inputs are passed through the pretrained model in parallel to obtain a list of outputs, which are transformed using inverse transformations from the same group and passed through a custom equivariant layer to obtain the output.
and Welling 2016; Kondor and Trivedi 2018). A model is group equivariant if transformations of its input result in a group transformation of its output. Popular examples include CNNs themselves, which are equivariant to translations, and GCNNs, which are equivariant to more general symmetries such as 90° rotations. Thus, fine-tuning and group equivariance leverage different kinds of priors to improve performance in a task. But it is not obvious how to effectively use them together in a single method. Moreover, the same pretrained model may need to be used for downstream tasks in different target domains.
We introduce equi-tuning, a simple fine-tuning method that yields equivariance, even if the pretrained model is not equivariant to any group symmetry. This method solves a simple optimization problem minimizing the distance between the features of a pretrained model and any group equivariant model. One salient feature of equi-tuning is its generality in potential applications. To show this, we experiment with diverse downstream tasks: image classification, language compositionality, and fairness in natural language generation (NLG).
arXiv:2210.06475v2 [cs.LG] 5 Feb 2023
Figure 2: Rotated input images in (a) give unpredictably changing features for pretrained Alexnet in (b), whereas features from equi-tuned Alexnet change equivariantly in (c).
For image classification, we consider classifying the Hymenoptera and CIFAR-10 datasets as downstream tasks using several pretrained models such as Alexnet, Resnet, VGG, and Densenet.1 These pretrained models are not naturally equivariant to groups such as the c4 group of 90° rotations; see Fig. 2. We find that equi-tuning these models using group symmetries such as c4 outperforms fine-tuning.
Lake and Baroni (2018) proposed the SCAN task to benchmark the performance of language models on compositional generalization. Standard models such as RNNs, GRUs, and LSTMs fail miserably on this task, showing their lack of compositional generalization abilities. Later, Gordon et al. (2019) proposed a group-equivariant language model with compositional generalization capabilities that passes the SCAN task. But training group equivariant language models from scratch for different compositionality requirements can be computationally expensive. Here, we simply equi-tune pretrained models using suitable groups to obtain competitive results and sometimes even outperform the group equivariant models of Gordon et al. (2019).
Several empirical studies on fairness in NLG show biases and stereotypes in language models such as GPT2 (Zhao et al. 2017; Sheng et al. 2019; Nadeem, Bethke, and Reddy 2021).2 But the theoretical study of bias mitigation methods in NLG remains largely unexplored. We first provide a group-theoretic framework for fairness in NLG. Then we introduce two different equi-tuning methods for debiasing language models. We use the regard classifier of Sheng et al. (2019) to show that equi-tuned GPT2 reduces bias towards various demographic groups in generated texts compared to the original GPT2 model.
The main contributions of this paper are as follows.
§ 4 derives equi-tuning and discusses its properties.
§ 5.1 and § 5.2 apply equi-tuning to image classification and compositional generalization, respectively.
§ 5.3 first provides a group-theoretic definition of fairness in NLG. Then, it provides two different equi-tuning methods to mitigate bias in language models.
§ 6 provides experimental validation of equi-tuning by testing with several pretrained models and benchmark datasets across all the aforementioned applications.

1We will use Resnet to refer to Resnet18 and VGG to refer to VGG11 throughout this paper.
2Throughout this work, we use GPT2 to refer to the version of the GPT2 model that has 117M parameters.
2 Related Work
Group equivariant networks. Group equivariant networks (Cohen and Welling 2016; Kondor and Trivedi 2018; Ravanbakhsh, Schneider, and Poczos 2017) use equivariance as an inductive prior for efficient learning. They find applications in image classification (Cohen and Welling 2016, 2017), graph processing (Satorras, Hoogeboom, and Welling 2021; Maron et al. 2019; Keriven and Peyré 2019), mesh and 3D point cloud data processing (He et al. 2021; De Haan et al. 2020; Basu et al. 2022a), and reinforcement learning (van der Pol et al. 2020; Wang et al. 2022; Basu et al. 2022b). But these methods do not leverage the recent emergence of powerful pretrained models.
Transfer learning. Transfer learning has gained popularity in deep learning because of the availability of large pretrained models and the gains obtained from their use (Zhuang et al. 2020; Dai et al. 2009; Zamir et al. 2018; Taylor and Stone 2009; Bengio 2012; Ruder et al. 2019). But equivariance in transfer learning remains unexplored.
Compositional generalization. SCAN is a dataset that benchmarks the performance of language models on their compositional generalization ability (Lake and Baroni 2018). Various models such as RNNs, GRUs, and LSTMs fail at the SCAN task (Lake and Baroni 2018). Several methods have been proposed to solve parts of the SCAN task: group equivariance (Gordon et al. 2019), meta learning (Lake 2019), a syntactic attention mechanism (Russin et al. 2019), and data augmentation (GECA) (Andreas 2020). Among these, the group equivariant method of Gordon et al. (2019) is the most systematic and achieves the best results. Also, all methods besides GECA require complex architectures or training methods that are nontrivial to use with transfer learning. Equi-tuning, in contrast, is a systematic method that can be used on top of pretrained models such as RNNs, GRUs, LSTMs, transformers, etc.
Fairness in NLG. Several works have shown bias in language models on the basis of gender, race, sexual orientation, etc. (Sheng et al. 2019; Prates, Avelar, and Lamb 2020; Henderson et al. 2018). Existing work on detecting and mitigating biases in NLG is mainly ad hoc and lacks generality (Sun et al. 2019; Nadeem, Bethke, and Reddy 2021; Abid, Farooqi, and Zou 2021). Moreover, Steed et al. (2022) have shown that mitigating bias in the embedding space does not help reduce bias for downstream tasks. In contrast, our work attempts to define fairness using group theory, which motivates our bias mitigation methods that provide appropriate guarantees on fairness. Recently, Yeo and Chen (2020) provided a theoretical definition of fairness in NLG inspired by Dwork et al. (2012); the idea is that similar prompts from different demographic groups such as “man” and “woman” must generate similar sentences. There, defining the metric to measure similarity is nontrivial since the metric must also preserve the individuality of different demographic groups. In contrast, our framework does not need any such metric and provides a direct method to preserve such individuality while mitigating bias.
3 Background
Here we give background on group equivariance, compositional generalization, and fairness in NLG.
3.1 Group Equivariance
Groups. A set with a binary operator, (G, ·), is called a group if it satisfies the axioms of a group in appendix § A.1. The action of a group on a finite set X is given as Γ : G × X → X, which satisfies the axioms of group action in § A.4. Group actions are used to formally describe transformations acting on a set X; e.g., rotation by multiples of 90° is an action Γ on a set of square images X. A transformation of x ∈ X by group element g ∈ G is written as Γ(g, x).
Group equivariance. Let Γ_X and Γ_Y be the group actions of G on sets X and Y, respectively. A function f : X → Y is called group equivariant to G if f(Γ_X(g, x)) = Γ_Y(g, f(x)) for all g ∈ G, x ∈ X. Hence, if a neural network performing segmentation is equivariant to the group of 90° rotations (the c4 group), then, if the input is rotated by a multiple of 90°, the output also gets rotated by the same angle.
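The equivariance condition above can be checked numerically for the c4 group acting on square images by rotation. The following is an illustrative sketch, not from the paper: the function f and the 4×4 input are hypothetical, and we use the fact that any elementwise map commutes with grid rotations, so it is exactly c4-equivariant.

```python
import numpy as np

# Here Γ_X = Γ_Y = np.rot90 (rotation of the 2D grid by multiples of 90°),
# and f acts pointwise, so f(Γ_X(g, x)) == Γ_Y(g, f(x)) for every g in c4.
def f(x):
    return x ** 2  # any elementwise function is c4-equivariant

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))
for k in range(4):  # group elements e, r, r^2, r^3
    lhs = f(np.rot90(x, k))   # f(Γ_X(g, x))
    rhs = np.rot90(f(x), k)   # Γ_Y(g, f(x))
    assert np.allclose(lhs, rhs)
```

A segmentation network equivariant to c4 would satisfy the same identity with f replaced by the network.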
3.2 Compositional Generalization
Compositionality in languages refers to the ability to understand novel sentences by understanding and algebraically manipulating their components (Chomsky 2009; Montague 1970). Compositionality is key to excellent human understanding of languages, whereas it is hypothesized that neural networks do not possess such capabilities, leading to their extreme sample inefficiency in modeling languages (Lake et al. 2017; Lake and Baroni 2018; Loula, Baroni, and Lake 2018; Dessì and Baroni 2019). E.g., if humans understand the meanings of “walk”, “jump”, and “jump twice”, then they can naturally understand the meaning of “walk twice”. But deep neural networks fail to do so, as shown by tests on the SCAN dataset (Lake and Baroni 2018).
SCAN is a translation dataset where the inputs are commands such as “Jump Twice” and the outputs consist of corresponding actions such as “JUMP JUMP”. There are several data splits in SCAN that test different generalization capabilities of a model. The two of interest to us are the Add jump task and the Around right task. These two tasks test the compositional generalization capabilities of models.
The training set of the Add jump task consists of sentences that do not contain any commands containing the word “Jump” except for the word “Jump” itself. But the training set contains other sentences with verbs that are similar to “Jump”, such as “Walk”, “Run”, “Walk Twice”, “Run Twice”, etc. The test set, on the other hand, contains complicated commands using the word “Jump”, such as “Jump Twice”, “Turn Left After Jump Twice”, etc. Thus, for a model to perform well on the test set, it must infer the meaning of complicated sentences such as “Jump Twice” from the understanding of “Jump” and “Walk Twice”. Similarly, in the training set of the Around right task, the command “Around Right” never appears, but similar commands such as “Around Left” appear. The test set contains the phrase “Around Right”, and for the model to succeed in this task, it must infer that “Right” and “Left” are directions and can be treated in a similar way.
3.3 Fairness: Mitigating Biases in NLG
As discussed in § 2, Sheng et al. (2019) show that language models such as GPT2 exhibit biases towards certain demographic groups in their generated texts. These biases are often subtle and are not easily detectable using sentiment classifiers. Hence, they introduce the concept of regard towards various demographic groups and provide a task to detect bias, in terms of regard, in texts generated by models. They consider three sets of demographic groups for this task: a) [“man”, “woman”], b) [“Black”, “White”], c) [“gay”, “straight”]. These sets correspond to gender, race, and sexual orientation, respectively. The task consists of two types of contexts: respect and occupation. Each type has five context sentences, and models generate texts for each of these sentences. The respect task tests the biases in the model’s respect towards various demographic groups, e.g. ‘The XYZ was known for’, where XYZ is replaced by any demographic group. The occupation task tests the bias in the model’s description of occupation for different demographic groups, e.g. ‘The XYZ worked as’, where XYZ is replaced by any demographic group. Sheng et al. (2019) also develop an automatic regard classifier via transfer learning on BERT, using a dataset created from human annotations. This classifier labels any generated sentence as negative, neutral, positive, or other, and is shown to match human labels of regard for texts with around 80% accuracy. We use this regard classifier in our experiments for fairness in NLG.
4 Equi-Tuning
We motivate equi-tuning as a method that minimizes a distance between the features obtained by a pretrained model and any equivariant model when the dataset contains all the transformations from a discrete group. We show that the solution obtained corresponds to the Reynolds operator (Sturmfels 2008) applied to the pretrained model, which directly implies certain universality properties.
Let M : X ⊆ R^n → Y ⊆ R^m be a pretrained model. Further, let Γ_X and Γ_Y be group actions of the group G on X and Y, respectively. We construct a model M_G that is equivariant to actions of a finite group G and also minimizes the sum of the distances between the features M(Γ_X(g, x)) and M_G(Γ_X(g, x)) for any x, for all g ∈ G. The idea is that M_G loses little pretrained knowledge from M while also being equivariant to G. We assume that the group actions are well defined, which is true for a wide range of cases, including all cases considered in this paper. Formally, for any x ∈ X, we want to solve the following optimization problem:
want to solve the following optimization problem.
min
MG(x)X
gG
kMX(g, x)) MGX(g, x))k2
2
s.t. MGX(g, x)) = ΓY(g, MG(x)) for all gG.
(1)
When clear from context, we write Γ_X(g, x) as gx and Γ_Y(g, y) as gy for simplicity. Now, assuming that ‖g‖₂ = 1, we have the optimization as
min_{M_G(x)} Σ_{g ∈ G} ‖g⁻¹M(gx) − M_G(x)‖²₂
s.t. M_G(gx) = gM_G(x) for all g ∈ G.   (2)
To solve (2), we first remove the constraint of equivariance on M_G and obtain a lower bound to the solution of (2). Then, we show that the obtained solution also satisfies the constraints in (2); hence, it is also a solution to (2). Removing the equivariance constraint from (2), we obtain the optimization problem min_{M_G(x)} Σ_{g ∈ G} ‖g⁻¹M(gx) − M_G(x)‖²₂. This is a convex problem with solution
M_G(x) = (1/|G|) Σ_{g ∈ G} g⁻¹M(gx).   (3)
Note that (3) is the Reynolds operator (Sturmfels 2008) applied to M. Further, Yarotsky (2022) shows that the Reynolds operator for a group G applied to any function makes it equivariant to G. Hence, it satisfies the constraints of (2). Since it minimizes the lower bound, it also minimizes the objective in (2). Sec. C gives an efficient implementation of (3). Sec. D shows that equi-tuning is comparable to parameter sharing (Ravanbakhsh, Schneider, and Poczos 2017; Cohen and Welling 2016) in compute complexity.
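Eq. (3) can be sketched in a few lines for the c4 group. This is a minimal illustration, not the paper's implementation: it assumes M maps square feature maps to same-shape feature maps, with Γ_X and Γ_Y both given by np.rot90; the stand-in model M below is hypothetical.

```python
import numpy as np

# M_G(x) = (1/|G|) Σ_g g⁻¹ M(g x): the Reynolds operator applied to M,
# for the c4 group acting on square arrays by 90° rotations.
def equitune(M, x, group_size=4):
    outputs = [np.rot90(M(np.rot90(x, k)), -k) for k in range(group_size)]
    return sum(outputs) / group_size

# A deliberately non-equivariant "pretrained" stand-in:
# it zeroes out the left half of the feature map.
def M(x):
    out = x.copy()
    out[:, : x.shape[1] // 2] = 0.0
    return out

x = np.arange(16.0).reshape(4, 4)
# Equivariance check: M_G(g x) == g M_G(x) for every g in c4.
for k in range(4):
    assert np.allclose(equitune(M, np.rot90(x, k)),
                       np.rot90(equitune(M, x), k))
```

Even though M itself is not equivariant, the averaged model passes the check exactly, as the Reynolds-operator argument above guarantees.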
Comments and properties. The assumption ‖g‖₂ = 1 is very general and subsumes the entire class of permutation groups, as well as special orthogonal groups such as SO(n), where n is a positive integer. Moreover, our algorithm can be directly extended to groups that have a constant norm, not necessarily just 1. Note that equi-tuning is not useful in cases where M is already equivariant/invariant to a larger group H > G, in which case we get M_G(x) = M(x) in (3).
Under the assumption that M is a universal approximator of all functions f : X → Y as defined in appendix § B.2, it follows from Yarotsky (2022) and Murphy et al. (2018) that M_G is a universal approximator of all functions e : X → Y that are equivariant with respect to G.
Discussion and Example. The features obtained in (3) are called scalar features as described by Cohen et al. (2019). In appendix § H, we extend this solution to obtain outputs that are regular features, represented by M_G^R in Alg. 2. Regular features are considered more expressive than scalar features. As proved in § H, M_G^R is also equivariant. We restrict our experiments in this work to scalar features for simplicity.
Traditional equivariant networks, such as GCNNs (Cohen and Welling 2016), SE(3)-transformers (Fuchs et al. 2020), and LieConv (Finzi et al. 2020), require the group equivariance constraint to hold for each layer of the network. In contrast, for equi-tuning, we only need to ensure that the group actions are defined on the input and output layers of the pretrained model, which is a key reason for the simplicity and generality of our algorithm.
Now we provide an example of equi-tuning for image processing using the c4 = {e, r, r², r³} group, where e is the identity and r denotes rotation by 90°. As shown in Fig. 1b, for constructing the model for equi-tuning, we compute four transformations of the input and compute the features by passing them through the pretrained model in parallel. The outputs are transformed using inverse transformations and are passed through a custom group equivariant layer, where they are averaged and passed through custom equivariant layers to obtain the output. In contrast, for fine-tuning, the input is simply passed through the model and a custom layer to obtain the output; see Fig. 1a. § F gives examples of equi-tuning for language models.
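The Fig. 1b pipeline can be sketched for classification, where the output group action is trivial: the "inverse transformation" on class scores is the identity, so averaging over the four rotated copies makes the prediction exactly invariant to 90° rotations. The feature extractor and head below are hypothetical stand-ins for a real pretrained backbone and custom layer.

```python
import numpy as np

rng = np.random.default_rng(0)
W_head = rng.standard_normal((3, 16))  # custom layer: 16 features -> 3 classes

def pretrained(img):  # non-equivariant stand-in feature extractor
    out = img.copy()
    out[:, : img.shape[1] // 2] = 0.0
    return out.reshape(-1)

def equituned_scores(img):
    # rotate input, extract features in parallel, average, then custom layer
    feats = [pretrained(np.rot90(img, k)) for k in range(4)]
    return W_head @ (sum(feats) / 4)

x = rng.standard_normal((4, 4))
for k in range(4):  # class scores are invariant to any c4 rotation of the input
    assert np.allclose(equituned_scores(np.rot90(x, k)), equituned_scores(x))
```

Wrapping a real backbone (e.g. a torchvision Alexnet) would follow the same pattern, with `pretrained` replaced by the frozen or fine-tunable model.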
5 Applications
Emphasizing the generality of equi-tuning, we apply it to three different tasks: 1) image classification, 2) compositional generalization in language, and 3) fairness in NLG.
5.1 Image Classification
Cohen and Welling (2016) found that equivariant networks using the c4 (90° rotations) and d4 (90° rotations and horizontal flips) groups consistently outperformed non-equivariant networks on the CIFAR-10 dataset. Hence, we choose the same groups for our image classification experiments.
As shown in Fig. 1, equi-tuning supports a custom equivariant layer, which is useful to change the dimension of the output as required by downstream tasks. For our image classification tasks, we use parameter-sharing (Ravanbakhsh, Schneider, and Poczos 2017) to design the custom equivariant layers for the c4 and d4 groups. Parameter-sharing simply takes a fully connected network and introduces a sharing scheme in the weights of the network.
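The weight-sharing idea can be illustrated for a cyclic group acting by shifts. This is a minimal sketch under an assumed regular (cyclic-shift) representation, not the paper's c4/d4 construction: symmetrizing an arbitrary weight matrix over the group's permutation matrices yields a circulant matrix, and the resulting linear layer commutes with shifts.

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)
W = rng.standard_normal((n, n))  # unconstrained fully connected weights

def shift(v, k):  # cyclic-shift action on R^n
    return np.roll(v, k, axis=0)

# Permutation matrices P_k with P_k v == shift(v, k).
P = [np.roll(np.eye(n), k, axis=0) for k in range(n)]

# Share parameters by averaging W over conjugation: (1/|G|) Σ_g P_g W P_gᵀ.
W_shared = sum(p @ W @ p.T for p in P) / n

x = rng.standard_normal(n)
for k in range(n):  # equivariance: layer(shift(x)) == shift(layer(x))
    assert np.allclose(W_shared @ shift(x, k), shift(W_shared @ x, k))
```

The symmetrized weight commutes with every group permutation by construction, which is the constraint a parameter-sharing scheme enforces directly on the weight layout.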
5.2 Compositional Generalization in Language
We consider the SCAN task for testing the compositional generalization of language models. As discussed in § 3.2, Gordon et al. (2019) provide a solution to the Add jump and Around right tasks by training group equivariant recurrent deep neural networks such as G-RNNs, G-GRUs, and G-LSTMs from scratch.
For solving the SCAN task, Gordon et al. (2019) use cyclic groups and apply them on the vocabulary space of the models to achieve local equivariance. The group used for both the Add jump task and the Around right task is the cyclic group of size two, i.e. G = ({e, g}, ·), where g · g = e and e is the identity element. The group acts on the input and output vocabularies of the models considered for the tasks. The identity element makes no transformations to the input or the output. The element g swaps two words in both the input and the output vocabularies simultaneously. The words swapped depend on the task considered.
For the Add jump task, g swaps the words [“Jump”, “Run”] in the input vocabulary and the words [JUMP, RUN] in the output vocabulary. Similarly, for the Around right task, g swaps the words [“Left”, “Right”] in the input vocabulary and the words [LEFT, RIGHT] in the output vocabulary.
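The word-swap group action for the Add jump task can be sketched as a dictionary substitution; the whitespace tokenization below is an assumption for illustration.

```python
# Nonidentity element g of the size-2 cyclic group: swap "Jump"/"Run" in input
# commands and JUMP/RUN in output actions. Applying g twice recovers the
# original sentence, since g · g = e.
SWAPS = {"Jump": "Run", "Run": "Jump", "JUMP": "RUN", "RUN": "JUMP"}

def g(sentence):
    return " ".join(SWAPS.get(w, w) for w in sentence.split())

assert g("Jump Twice") == "Run Twice"
assert g(g("Jump Twice After Run")) == "Jump Twice After Run"  # g is its own inverse
```

Equi-tuning with this group then averages the model's predictions over the original and swapped sentences, exactly as in (3).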
We start with recurrent models such as RNNs, GRUs, and LSTMs, pretrained in-house, treat them as black-box models, and simply use the equi-tune transform from (3) on