
the manual query usually generalizes poorly to downstream datasets (Zhou et al. 2022). This can cause the debiasing objective to deviate, reducing the debiasing effectiveness. Second, it is impractical to retrain CLIP because of the significant difference between pre-training and downstream tasks, so methods that rely on retraining are difficult to migrate to CLIP.
Inspired by lightweight optimization methods for CLIP (Gao et al. 2021; Zhou et al. 2022), we propose FairCLIP, shown in Figure 2, to achieve debiasing. FairCLIP consists of two steps. (1) Attribute Prototype Learning (APL): to extract attribute concepts that more accurately match the distribution of downstream datasets, we use a query with learnable word-vector prefixes. (2) Representation Neutralization (RN): we use the Re-Representation Matrix (RRM), a square matrix with the same dimension as the representation layer of CLIP, to neutralize the representation. In this step, we first analyze the bias in image retrieval and divide attributes into target and bias attributes. The bias stems from two sources. First, divergence in the representation of bias attributes directly causes divergence in the representations of groups with different bias attributes. Second, bias attributes are often more salient than target attributes, so they have a greater impact on the retrieval results. We therefore set separate training constraints of the RRM for target and bias attributes.
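For concreteness, the following is a minimal PyTorch-style sketch of the two components. The 512-dimensional representation, the module names, and applying the RRM on the image side are illustrative assumptions; the actual training constraints for target and bias attributes are defined in the remainder of the paper.

```python
import torch
import torch.nn as nn

class AttributePrototype(nn.Module):
    """APL: learnable word-vector prefixes followed by the frozen token
    embeddings of an attribute word (e.g. "young"); the frozen CLIP text
    encoder (not shown) maps this sequence to a prototype vector."""
    def __init__(self, attr_token_embeds, n_prefix=8, dim=512):
        super().__init__()
        self.prefix = nn.Parameter(0.02 * torch.randn(n_prefix, dim))  # learnable prefixes
        self.register_buffer("attr_embeds", attr_token_embeds)         # frozen attribute tokens

    def forward(self):
        return torch.cat([self.prefix, self.attr_embeds], dim=0)       # (n_prefix + n_attr, dim)

class ReRepresentationMatrix(nn.Module):
    """RN: a square matrix with the same dimension as CLIP's representation
    layer, applied to the frozen representation and then re-normalized."""
    def __init__(self, dim=512):
        super().__init__()
        self.rrm = nn.Parameter(torch.eye(dim))   # identity initialization leaves CLIP unchanged at start

    def forward(self, feats):                     # feats: (batch, dim) CLIP representation
        out = feats @ self.rrm
        return out / out.norm(dim=-1, keepdim=True)
```

In this sketch, initializing the RRM as the identity preserves CLIP's original representation before any debiasing constraint is imposed.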
We summarize the contributions as follows:
• We propose Attribute Prototype Learning (APL), which models the concepts of target and bias attributes in CLIP. To exploit the multimodal property of CLIP, we use a query with learnable word-vector prefixes for attribute concept extraction.
• Using the extracted concepts, we show that bias is affected by both target and bias attributes. To eliminate the bias, we set separate training constraints of the Re-Representation Matrix (RRM) for target and bias attributes to achieve Representation Neutralization (RN).
• We evaluate the debiasing effect on face datasets and the retrieval performance on image-text retrieval datasets, comparing against other debiasing methods. The experiments demonstrate that FairCLIP not only achieves the best debiasing effect but also incurs little degradation in retrieval performance.
Background and Related Work
CLIP
Without fine-tuning, CLIP can handle a variety of downstream tasks. CLIP encodes images and text separately and calculates similarities between them; these similarities can be used for tasks such as classification and retrieval. Its performance can be close to, or even better than, fine-tuned models (Radford et al. 2021). Much work has applied CLIP to scenarios such as image segmentation (Xu et al. 2021), caption generation (Dai et al. 2022; Mokady, Hertz, and Bermano 2021), image generation (Wang et al. 2022), and object detection (Du et al. 2022). Shen et al. (2021) applied CLIP's modules to other VLP models to improve their performance, and many developers have built CLIP-based applications in Hugging Face Spaces1. It is clear that CLIP is making multimodal work more efficient.
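As a concrete illustration of this zero-shot usage, the sketch below ranks a gallery of images against a text query by CLIP similarity. It assumes the Hugging Face transformers implementation of CLIP; the checkpoint name and file paths are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["img_0.jpg", "img_1.jpg"]]  # retrieval gallery
query = ["a photo of a doctor"]                               # neutral text query

inputs = processor(text=query, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

scores = out.logits_per_text[0]              # similarity of the query to each image
ranking = scores.argsort(descending=True)    # retrieval order, most similar first
```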
Fairness in CLIP and Image Retrieval
Beyond performance, fairness plays a critical role in the trustworthy deployment of VLP models and is strongly endorsed by many VLP model designers (Li et al. 2021). Although CLIP has been widely used, Schuhmann et al. (2021) argued that, lacking the constraint of high-quality fine-tuning data, the model can only rely on pre-training data that often carries human social biases. Such biases are highly susceptible to being captured by the model during the pre-training phase (Steed and Caliskan 2021). Several works have focused on the fairness issue in CLIP. Agarwal et al. (2021) found that CLIP is more likely to assign harmful labels to images of Black people. Dehouche (2021) found that CLIP can extract gender information from neutral text. Wang, Liu, and Wang (2021b) found that CLIP exhibits different biases across languages. Wolfe, Banaji, and Caliskan (2022) demonstrated that CLIP learns stereotypes about gender and race.
The biases in CLIP can manifest in image retrieval, which may have serious social implications (Geyik, Ambler, and Kenthapadi 2019), yet limited work addresses this issue. Berg et al. (2022) found that image retrieval on face datasets exhibits gender bias when a neutral query is used. They added a learnable prefix before the query to remove the gender information and optimized it with adversarial training. Wang, Liu, and Wang (2021a) found that gender bias also exists when retrieving images from datasets such as MSCOCO and Flickr. They proposed a post-processing method called CLIP-clip, whose idea is to discard the dimensions of the representation that are most relevant to gender. In these works, the debiasing ideas are simple migrations of unimodal debiasing methods. They do not exploit the multimodal property of CLIP during debiasing, which makes the debiasing effect and the retrieval performance hard to reconcile.
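For reference, the dimension-clipping idea behind CLIP-clip can be sketched as follows. This is our simplified reconstruction rather than the original code; the use of a mutual-information estimator over a gender-labeled image set reflects our reading of the method.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def clip_clip(image_feats, text_feats, gender_labels, m=100):
    """Drop the m embedding dimensions most informative about gender.

    image_feats: (n_images, dim) CLIP image features of a gender-labeled set.
    text_feats:  (n_texts, dim)  CLIP text features, clipped consistently.
    """
    mi = mutual_info_classif(image_feats, gender_labels)        # MI of each dimension with gender
    keep = np.sort(np.argsort(mi)[: image_feats.shape[1] - m])  # keep the (dim - m) least-biased dims
    return image_feats[:, keep], text_feats[:, keep]
```

Because whole dimensions are discarded from both modalities, such post-processing tends to trade retrieval performance for debiasing, which is the incompatibility noted above.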
Two challenges need to be addressed to make better use of the multimodal property of CLIP for debiasing. First, the concept of a visual attribute is difficult to extract directly from the visual modality in CLIP. A natural idea is to use a manual query that describes the visual attribute; however, a manual query often does not generalize well to downstream datasets and therefore often fails to achieve the best result (Zhou et al. 2022). Second, CLIP cannot be retrained. This is because CLIP uses a large amount of image-text data to construct a large-scale similarity matrix for contrastive learning in the pre-training phase. Such a pre-training task is almost impossible to reproduce, and thus any optimization would damage the semantic space of CLIP. This means that many traditional fairness methods based on retraining are difficult to migrate to CLIP.
1https://huggingface.co/spaces