
the manual query usually generalizes poorly to downstream datasets (Zhou et al. 2022). This can cause the debiasing objective to deviate, reducing the debiasing effectiveness. Second, it is impractical to retrain CLIP because of the significant difference between pre-training and downstream tasks, so methods that rely on retraining are difficult to migrate to CLIP.
Inspired by lightweight optimization methods for CLIP (Gao et al. 2021; Zhou et al. 2022), we propose FairCLIP, shown in Figure 2, to achieve debiasing. FairCLIP consists of two steps. (1) Attribute Prototype Learning (APL): to extract attribute concepts that more accurately match the distribution of downstream datasets, we use a query with learnable word-vector prefixes. (2) Representation Neutralization (RN): we use the Re-Representation Matrix (RRM), a square matrix with the same dimension as the representation layer of CLIP, to neutralize the representation. In this step, we first analyze the bias in image retrieval and divide attributes into target and bias attributes. The bias stems from two sources. First, divergence in the representation of bias attributes directly causes divergence in the representations of groups with different bias attributes. Second, bias attributes are often more salient than target attributes, so they have a greater impact on the retrieval results. We therefore set separate training constraints of the RRM for target and bias attributes.
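For concreteness, the following is a minimal PyTorch-style sketch of the two components. The 512-dimensional representation, the module names, and applying the RRM on the image side are illustrative assumptions; the actual training constraints for target and bias attributes are defined in the remainder of the paper.

```python
import torch
import torch.nn as nn

class AttributePrototype(nn.Module):
    """APL: learnable word-vector prefixes followed by the frozen token
    embeddings of an attribute word (e.g. "young"); the frozen CLIP text
    encoder (not shown) maps this sequence to a prototype vector."""
    def __init__(self, attr_token_embeds, n_prefix=8, dim=512):
        super().__init__()
        self.prefix = nn.Parameter(0.02 * torch.randn(n_prefix, dim))  # learnable prefixes
        self.register_buffer("attr_embeds", attr_token_embeds)         # frozen attribute tokens

    def forward(self):
        return torch.cat([self.prefix, self.attr_embeds], dim=0)       # (n_prefix + n_attr, dim)

class ReRepresentationMatrix(nn.Module):
    """RN: a square matrix with the same dimension as CLIP's representation
    layer, applied to the frozen representation and then re-normalized."""
    def __init__(self, dim=512):
        super().__init__()
        self.rrm = nn.Parameter(torch.eye(dim))   # identity initialization leaves CLIP unchanged at start

    def forward(self, feats):                     # feats: (batch, dim) CLIP representation
        out = feats @ self.rrm
        return out / out.norm(dim=-1, keepdim=True)
```

In this sketch, initializing the RRM as the identity preserves CLIP's original representation before any debiasing constraint is imposed.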
We summarize the contributions as follows:
• We propose Attribute Prototype Learning (APL), which models the concepts of target and bias attributes in CLIP. To exploit the multimodal property of CLIP, we use a query with learnable word-vector prefixes for attribute concept extraction.
• Using the extracted concepts, we show that bias is affected by both target and bias attributes. To eliminate the bias, we set separate training constraints of the Re-Representation Matrix (RRM) for target and bias attributes to achieve Representation Neutralization (RN).
• We evaluate the debiasing effect on face datasets and the retrieval performance on image-text retrieval datasets, comparing against other debiasing methods. The experiments demonstrate that FairCLIP not only achieves the best debiasing effect but also incurs little degradation in retrieval performance.
Background and Related Work
CLIP
Without fine-tuning, CLIP can handle a variety of downstream tasks. CLIP encodes images and text separately and calculates similarities between them; these similarities can be used for tasks such as classification and retrieval. Its performance can be close to, or even better than, fine-tuned models (Radford et al. 2021). Much work has applied CLIP to scenarios such as image segmentation (Xu et al. 2021), caption generation (Dai et al. 2022; Mokady, Hertz, and Bermano 2021), image generation (Wang et al. 2022), and object detection (Du et al. 2022). Shen et al. (2021) applied CLIP's modules to other VLP models to improve their performance, and many developers have built CLIP-based applications in Hugging Face Spaces1. It is clear that CLIP is making multimodal work more efficient.
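As a concrete illustration of this zero-shot usage, the sketch below ranks a gallery of images against a text query by CLIP similarity. It assumes the Hugging Face transformers implementation of CLIP; the checkpoint name and file paths are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["img_0.jpg", "img_1.jpg"]]  # retrieval gallery
query = ["a photo of a doctor"]                               # neutral text query

inputs = processor(text=query, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

scores = out.logits_per_text[0]              # similarity of the query to each image
ranking = scores.argsort(descending=True)    # retrieval order, most similar first
```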
Fairness in CLIP and Image Retrieval
Beyond performance, fairness plays a critical role in the trustworthy deployment of VLP models and is strongly endorsed by many VLP model designers (Li et al. 2021). Although CLIP has been widely used, Schuhmann et al. (2021) argued that, lacking the constraint of high-quality fine-tuning data, the model can only rely on pre-training data that often carries human social biases. Such biases are highly susceptible to being captured by the model during the pre-training phase (Steed and Caliskan 2021). Several works have focused on the fairness issue in CLIP. Agarwal et al. (2021) found that CLIP is more likely to assign harmful labels to images of Black people. Dehouche (2021) found that CLIP can extract gender information from neutral text. Wang, Liu, and Wang (2021b) found that CLIP exhibits different biases across languages. Wolfe, Banaji, and Caliskan (2022) demonstrated that CLIP learns stereotypes about gender and race.
The biases in CLIP can manifest in image retrieval, which may have serious social implications (Geyik, Ambler, and Kenthapadi 2019), yet limited work addresses this issue. Berg et al. (2022) found that image retrieval on face datasets exhibits gender bias when a neutral query is used. They added a learnable prefix before the query to remove the gender information and optimized it with adversarial training. Wang, Liu, and Wang (2021a) found that gender bias also exists when retrieving images from datasets such as MSCOCO and Flickr. They proposed a post-processing method called CLIP-clip, whose idea is to discard the dimensions of the representation that are most relevant to gender. In these works, the debiasing ideas are simple migrations of unimodal debiasing methods. They do not exploit the multimodal property of CLIP during debiasing, which makes the debiasing effect and the retrieval performance hard to reconcile.
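For reference, the dimension-clipping idea behind CLIP-clip can be sketched as follows. This is our simplified reconstruction rather than the original code; the use of a mutual-information estimator over a gender-labeled image set reflects our reading of the method.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def clip_clip(image_feats, text_feats, gender_labels, m=100):
    """Drop the m embedding dimensions most informative about gender.

    image_feats: (n_images, dim) CLIP image features of a gender-labeled set.
    text_feats:  (n_texts, dim)  CLIP text features, clipped consistently.
    """
    mi = mutual_info_classif(image_feats, gender_labels)        # MI of each dimension with gender
    keep = np.sort(np.argsort(mi)[: image_feats.shape[1] - m])  # keep the (dim - m) least-biased dims
    return image_feats[:, keep], text_feats[:, keep]
```

Because whole dimensions are discarded from both modalities, such post-processing tends to trade retrieval performance for debiasing, which is the incompatibility noted above.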
Two challenges need to be addressed to make better use of the multimodal property of CLIP for debiasing. First, the concept of a visual attribute is difficult to extract directly from the visual modality in CLIP. A natural idea is to use a manual query that describes the visual attribute; however, a manual query often does not generalize well to downstream datasets and therefore often fails to achieve the best result (Zhou et al. 2022). Second, CLIP cannot be retrained. This is because CLIP uses a large amount of image-text data to construct a large-scale similarity matrix for contrastive learning in the pre-training phase. Such a pre-training task is almost impossible to reproduce, and thus any optimization would damage the semantic space of CLIP. This means that many traditional fairness methods based on retraining are difficult to migrate to CLIP.
1https://huggingface.co/spaces