FairCLIP: Social Bias Elimination based on Attribute Prototype Learning and
Representation Neutralization
Junyang Wang, Yi Zhang, Jitao Sang
Beijing Jiaotong University
{junyangwang, yi.zhang, jtsang}@bjtu.edu.cn
Abstract
Vision-Language Pre-training (VLP) models like CLIP have gained popularity in recent years. However, many works found that the social biases hidden in CLIP easily manifest in downstream tasks, especially in image retrieval, which can have harmful effects on human society. In this work, we propose FairCLIP to eliminate the social bias in CLIP-based image retrieval without damaging the retrieval performance, thereby achieving compatibility between the debiasing effect and the retrieval performance. FairCLIP is divided into two steps: Attribute Prototype Learning (APL) and Representation Neutralization (RN). In the first step, we extract the concepts needed for debiasing from CLIP, using a query with learnable word vector prefixes as the extraction structure. In the second step, we first divide the attributes into target attributes and bias attributes. Our analysis shows that both types of attributes contribute to the bias. We therefore eliminate the bias with a Re-Representation Matrix (RRM) that neutralizes the representation. We compare the debiasing effect and retrieval performance with other methods, and experiments demonstrate that FairCLIP achieves the best compatibility. Although FairCLIP is applied here to eliminate bias in image retrieval, it neutralizes the representation that is shared by all CLIP downstream tasks. This means that FairCLIP can be applied as a general debiasing method for other fairness issues related to CLIP.
Introduction
The past few years have witnessed the rapid development
of Vision-Language Pre-training (VLP) models (Chen et al.
2020; Cui et al. 2021; Li et al. 2020; Zhang et al. 2020).
Among them, OpenAI’s CLIP (Radford et al. 2021) stands
out. CLIP has strong zero-shot performance and can be used
directly for many downstream tasks without fine-tuning.
One typical task is image retrieval. With the help of the aligned image-text semantic space from CLIP, it is possible to retrieve images by text. Although image retrieval is a relatively benign downstream task (Berg et al. 2022), (Geyik, Ambler, and Kenthapadi 2019) still considered that fairness issues exist in image retrieval. For example, as shown in Figure 1 (top), when using CLIP as the retrieval model and "A photo of a smart person" as the retrieval text to retrieve images from a face dataset, 89 of the first 100 samples in the result are male, even though there is no explicit gender hint.
Figure 1: Example of the bias in CLIP-based image retrieval.
When retrieving images from a face dataset using text with no gender hint, almost all of the first 100 returned samples are male (top). We use the Re-Representation Matrix (RRM) to neutralize the visual representations and achieve debiasing.
After debiasing, the number of male and female samples is
relatively balanced (bottom).
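As a concrete illustration of how such a retrieval-bias statistic can be computed, the following is a minimal sketch rather than the paper's evaluation code; the file names are hypothetical, and the CLIP embeddings are assumed to be already extracted and L2-normalized.

import numpy as np

# Assumed inputs (hypothetical files): L2-normalized CLIP embeddings of a face
# dataset, per-image gender labels, and the embedding of the query
# "A photo of a smart person".
image_embs = np.load("face_image_embs.npy")        # shape (N, D)
gender_labels = np.load("face_gender_labels.npy")  # shape (N,), 1 = male, 0 = female
text_emb = np.load("query_emb.npy")                # shape (D,)

# For normalized vectors, cosine similarity reduces to a dot product.
scores = image_embs @ text_emb

# Rank images by similarity to the query and inspect the top 100 results.
top100 = np.argsort(-scores)[:100]
male_ratio = gender_labels[top100].mean()
print(f"male ratio in top-100: {male_ratio:.2f}")  # 0.89 in the example above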
Prioritizing fairness is of central importance in artificial intelligence (AI) systems; however, very limited work addresses the bias in CLIP-based image retrieval. (Berg et al. 2022) used an adversarial learning method for debiasing: they added learnable word vectors before the retrieval text to interfere with CLIP's capture of gender information from the text. (Wang, Liu, and Wang 2021a) used a dropout-like editing method: they dropped the dimensions that are highly correlated with gender from the visual and text representations. The above works fail to consider the multimodal property of CLIP. Since their methods are migrated from unimodal bias work, they ignore the fact that biases in VLP models are generated by the interaction between the two modalities, making it difficult to guarantee the alignment of the two modalities after debiasing. Strong debiasing can destroy the semantic space, while weak debiasing has a very limited effect. This results in the incompatibility between the debiasing effect and retrieval performance.
To better consider the multimodal property of CLIP for debiasing, two challenges need to be addressed. First, to achieve debiasing for specific attributes, we need to model the concepts of those attributes in CLIP. A natural idea is to use a manual query to extract the concepts. However,
the manual query usually generalizes poorly to downstream datasets (Zhou et al. 2022). This can lead to a deviated debiasing goal, thus reducing the debiasing effectiveness. Second, it is impractical to retrain CLIP because of the significant difference between the pre-training task and downstream tasks. This means that methods based on retraining are difficult to migrate to CLIP.
Inspired by optimization methods for CLIP (Gao et al. 2021; Zhou et al. 2022), we propose FairCLIP, shown in Figure 2, to achieve debiasing. FairCLIP is divided into two steps. (1) Attribute Prototype Learning (APL): to more accurately extract attribute concepts that match the distribution of downstream datasets, we use the structure of a query with learnable word vector prefixes. (2) Representation Neutralization (RN): we use the Re-Representation Matrix (RRM), a square matrix with the same dimensions as the representation layer of CLIP, to achieve the neutralization of the representation. In this step, we first analyze the bias in image retrieval and divide the attributes into target and bias attributes. The bias stems from two parts. First, divergence in the representation of bias attributes directly causes divergence in the representations of groups with different bias attributes. Second, the bias attributes are often more salient than the target attributes, which gives the bias attributes a greater impact on the retrieval results. For target and bias attributes, we set the training constraints of the RRM separately.
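To convey the role of the RRM, the following PyTorch sketch is our own illustration under stated assumptions, not FairCLIP's exact formulation: the CLIP encoders stay frozen, the RRM is a learnable square matrix applied to image representations, the attribute prototypes come from APL, and the two constraints are written as simple similarity-based losses (preserve similarity to target-attribute prototypes, equalize similarity to the bias-attribute prototypes).

import torch
import torch.nn.functional as F

D = 512  # representation dimension of CLIP ViT-B models; an assumption here

# Re-Representation Matrix: a learnable square matrix, initialized to the
# identity so that training starts from the original CLIP representation.
rrm = torch.nn.Parameter(torch.eye(D))
optimizer = torch.optim.Adam([rrm], lr=1e-4)

def neutralization_loss(img_feats, target_protos, bias_protos):
    """Hypothetical constraints: keep similarities to target-attribute
    prototypes unchanged, make similarities to the two bias-attribute
    prototypes (e.g. "male"/"female") equal."""
    z = F.normalize(img_feats @ rrm, dim=-1)   # re-represented image features
    z0 = F.normalize(img_feats, dim=-1)        # original image features

    # Target constraint: the re-representation should not move the images
    # relative to the target-attribute prototypes.
    target_loss = F.mse_loss(z @ target_protos.T, z0 @ target_protos.T)

    # Bias constraint: each image should be equally similar to both
    # bias-attribute prototypes after re-representation (neutralization).
    sims = z @ bias_protos.T                   # shape (batch, 2)
    bias_loss = (sims[:, 0] - sims[:, 1]).pow(2).mean()
    return target_loss + bias_loss

In this sketch only the RRM is optimized; the CLIP image and text encoders are left untouched, which is consistent with the observation above that retraining CLIP itself is impractical.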
We summarize the contributions as follows:
• We propose Attribute Prototype Learning (APL), which models the concepts of target and bias attributes in CLIP. To take advantage of the multimodal property of CLIP, we use a query with learnable word vector prefixes for attribute concept extraction.
• We use the extracted concepts to show that the bias is affected by both target and bias attributes. To eliminate the bias, we set the training constraints of the Re-Representation Matrix (RRM) separately for target and bias attributes to achieve Representation Neutralization (RN).
• We examine the debiasing effect on face datasets and the retrieval performance on image-text retrieval datasets, and compare the results with other debiasing methods. The experiments demonstrate that FairCLIP not only achieves the best debiasing effect but also causes little degradation in retrieval performance.
Background and Related Work
CLIP
Without fine-tuning, a variety of downstream tasks can be performed with CLIP. CLIP encodes images and text separately and calculates the similarities between them. These similarities can be used for tasks such as classification, retrieval, etc. The performance of CLIP can be close to or even better than that of fine-tuned models (Radford et al. 2021). Much research has applied CLIP to application scenarios such as image segmentation (Xu et al. 2021), caption generation (Dai et al. 2022; Mokady, Hertz, and Bermano 2021), image generation (Wang et al. 2022), and object detection (Du et al. 2022). (Shen et al. 2021) applied CLIP's modules to other VLP models to improve their performance. Many developers have built CLIP-based applications for different purposes in Hugging Face Spaces1. What can be confirmed is that CLIP is making multimodal work more efficient.
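As a concrete illustration of this zero-shot usage (our own example, not code from the paper), the snippet below computes image-text similarities with the Hugging Face transformers implementation of CLIP; the model name, texts, and image path are placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a photo of a dog", "a photo of a cat"]
image = Image.open("example.jpg")  # placeholder path

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled image-text similarities; a softmax over
# them gives zero-shot classification scores, and ranking such scores over a
# gallery of images gives text-to-image retrieval.
print(outputs.logits_per_image.softmax(dim=-1))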
Fairness in CLIP and Image Retrieval
Beyond performance, fairness plays a critical role in the trustworthy deployment of VLP models and is strongly emphasized by many VLP model designers (Li et al. 2021). Although CLIP has been widely used, (Schuhmann et al. 2021) argued that, lacking the constraint of high-quality fine-tuning data, the model can only depend on pre-training data that often carries human social biases. These biases are highly susceptible to being captured by the model during the pre-training phase (Steed and Caliskan 2021). Several works have focused on the fairness issue in CLIP. (Agarwal et al. 2021) found that CLIP is more likely to classify Black people as bad. (Dehouche 2021) found that CLIP can extract gender information from neutral text. (Wang, Liu, and Wang 2021b) found that different languages in CLIP exhibit different biases. (Wolfe, Banaji, and Caliskan 2022) demonstrated that CLIP learns stereotypes about gender and race.
The biases in CLIP have the potential to manifest in image retrieval, which can cause serious social implications (Geyik, Ambler, and Kenthapadi 2019), yet limited work addresses this issue. (Berg et al. 2022) found that gender bias exists in image retrieval results on face datasets when certain neutral queries are used. They added a learnable prefix before the query to eliminate the gender information and used adversarial training to optimize it. (Wang, Liu, and Wang 2021a) found that gender bias also exists when retrieving images from datasets such as MSCOCO and Flickr. They used a post-processing method called CLIP-clip, whose idea is to discard the dimensions of the concept representation that are most relevant to gender. In the above works, the debiasing ideas are simple migrations of methods for unimodal bias elimination. They fail to account for the multimodal property of CLIP in the process of debiasing, which results in the incompatibility between the debiasing effect and retrieval performance.
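To make the CLIP-clip idea concrete, here is a rough sketch of dropping bias-correlated dimensions; the relevance measure (absolute correlation with gender labels) is a stand-in for the criterion used in the original work, and all names are ours.

import numpy as np

def drop_gender_dimensions(image_embs, text_embs, gender_labels, m=50):
    """Sketch of the CLIP-clip idea: remove the m embedding dimensions that
    are most strongly associated with gender from both modalities."""
    # Per-dimension association with gender, measured here by the absolute
    # correlation between each dimension and the gender label.
    g = (gender_labels - gender_labels.mean()) / (gender_labels.std() + 1e-8)
    x = (image_embs - image_embs.mean(0)) / (image_embs.std(0) + 1e-8)
    relevance = np.abs(x.T @ g) / len(g)   # shape (D,)

    keep = np.argsort(relevance)[:-m]      # indices of the D - m least relevant dims
    return image_embs[:, keep], text_embs[:, keep]

The trade-off discussed above is visible here: the larger m is, the stronger the debiasing, but the more information is discarded from every representation.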
Two challenges need to be addressed to better use the multimodal property of CLIP for debiasing. First, the concept of a visual attribute is difficult to extract directly from the visual modality of CLIP. A natural idea is to use a manual query that describes the visual attribute. However, a manual query often cannot generalize well to downstream datasets, so it often fails to achieve the best result (Zhou et al. 2022). Second, CLIP cannot be retrained. This is because CLIP uses a large amount of image-text data to construct a large-scale similarity matrix for contrastive learning in the pre-training phase. Such a pre-training task is almost impossible to reproduce, and thus any optimization of this kind would damage the semantic space of CLIP. This means that many traditional fairness methods that are based on retraining are difficult to migrate to CLIP.
1https://huggingface.co/spaces
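Since FairCLIP's answer to the first challenge is a query with learnable word vector prefixes (APL), the following PyTorch sketch shows the general structure of such a prompt in the spirit of prompt learning (Zhou et al. 2022); the prefix length, dimensions, and the way the sequence would be fed through CLIP's frozen text encoder are our assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn

class LearnablePromptPrefix(nn.Module):
    """A hand-written query, e.g. "a photo of a <attribute> person",
    represented as [learnable prefix vectors] + [frozen token embeddings]."""
    def __init__(self, token_embs, n_prefix=8, dim=512):
        super().__init__()
        # token_embs: frozen word embeddings of the manual query, shape (L, dim).
        self.register_buffer("token_embs", token_embs)
        # The prefix vectors are the only trainable parameters.
        self.prefix = nn.Parameter(0.02 * torch.randn(n_prefix, dim))

    def forward(self):
        # The concatenated sequence would be passed through CLIP's frozen text
        # encoder, whose output serves as the attribute prototype.
        return torch.cat([self.prefix, self.token_embs], dim=0)

In this sketch only the prefix vectors are updated, so a downstream-specific attribute concept can be learned without touching the frozen CLIP encoders.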