Subspace-Contrastive Multi-View Clustering
Lele Fu, Lei Zhang, Jinghua Yang, Chuan Chen*, Chuanfu Zhang, and Zibin Zheng, Senior Member, IEEE
Abstract—Most multi-view clustering methods are limited by
shallow models that lack the capability to perceive nonlinear
information, or fail to effectively exploit the complementary
information hidden in different views. To tackle these issues, we propose
a novel Subspace-Contrastive Multi-View Clustering (SCMC)
approach. Specifically, SCMC utilizes view-specific auto-encoders
to map the original multi-view data into compact features that
capture its nonlinear structures. Considering the large semantic
gap between data from different modalities, we employ subspace
learning to unify the multi-view data in a joint semantic
space: the embedded compact features are passed through
multiple self-expression layers to learn the respective subspace
representations. In order to enhance the discriminability
and effectively excavate the complementarity of the various subspace
representations, we use a contrastive strategy to maximize
the similarity between positive pairs while separating negative
pairs. A weighted fusion scheme is then developed to initially
learn a consistent affinity matrix. Furthermore, we employ
graph regularization to encode the local geometric structure
within each subspace, further fine-tuning the
affinities between instances. To demonstrate the effectiveness of
the proposed model, we conduct extensive comparative
experiments on eight challenging datasets; the experimental results
show that SCMC outperforms existing shallow and deep multi-
view clustering methods.
Index Terms—Multi-view clustering, subspace clustering,
multi-view fusion, contrastive learning.
I. INTRODUCTION
With the growing popularity of data generation and feature
extraction, multi-view or multimedia data are available in large
quantities. To be specific, multi-view data refer to various
feature representations from multiple aspects of objects. For
instance, an image can be characterized by wavelet texture
(WT), local binary pattern (LBP), histogram of oriented gra-
dient (HOG), etc. A piece of document can be expressed in
numerous languages. Researchers generally believe that multi-
view data contain rich and useful heterogeneous informa-
tion, so technologies related to multi-view analysis are
receiving increasing attention. Multi-view clustering (MVC)
[1], [2], [3] is one of the representative technologies, which
aims to explore the complementary and consistent information
embedded in multi-view data to boost the clustering perfor-
mance.
Numerous multi-view clustering methods currently exist.
For example, graph-based MVC [4], [5], [6] learns
connectivity graph matrices to reveal the relationships among
samples; designed fusion schemes are then developed
Lele Fu and Chuanfu Zhang are with the School of System Sciences
and Engineering, Sun Yat-sen University, Guangzhou, China. Lei Zhang,
Chuan Chen, and Zibin Zheng are with the School of Computer Sci-
ence and Engineering, Sun Yat-sen University, Guangzhou, China. Jinghua
Yang is with Faculty of Information Technology, Macau University of
Science and Technology, Macau, China. (email: lawrencefzu@gmail.com,
chenchuan@mail.sysu.edu.cn). * Corresponding author.
to merge these graph matrices into a global graph. Spectral
embedding-based MVC [7], [8], [9] exploits a low-dimensional
spectral embedding with an orthogonality constraint for each view,
which portrays the important components of the data; a consensus
representation is then learned on the basis of these embeddings. The
goal of nonnegative matrix factorization-based MVC [10], [11], [12] is
to factorize a nonnegative discrete cluster indicator matrix
from the varying representations; the argmax(·) function
is then applied to obtain the data labels. Among the multitudinous
MVC methods, multi-view subspace clustering, which absorbs
and extends the theory of conventional subspace clustering, is a
research hotspot and is widely studied for its superior performance.
The works [13], [14] are classic multi-view subspace clustering
approaches, which aim to explore a uniform underlying feature
space from multiple subspace representations. [15], [16] perform
tensor factorization on the representation tensor to capture the
global correlations between views. These shallow models have
yielded promising clustering results, but most real-world data
are high-dimensional and nonlinear, and shallow models may not be
equipped to capture nonlinear structures.
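For intuition, subspace clustering rests on the self-expression property: each sample can be approximately written as a linear combination of other samples from the same subspace, and the resulting coefficient matrix induces an affinity for spectral clustering. The following is a minimal least-squares sketch of this principle (a generic LSR-style baseline with illustrative toy data, not the exact formulation of any cited work):

```python
import numpy as np

def self_expression_lsr(X, lam=0.1):
    """Least-squares self-expression: min_C ||X - XC||_F^2 + lam*||C||_F^2,
    with the closed-form solution C = (X^T X + lam*I)^{-1} X^T X.
    X has shape (d, n); each column is a sample."""
    n = X.shape[1]
    gram = X.T @ X
    C = np.linalg.solve(gram + lam * np.eye(n), gram)
    np.fill_diagonal(C, 0.0)  # common heuristic: forbid self-representation
    return C

def affinity(C):
    """Symmetric affinity matrix that would be fed to spectral clustering."""
    return 0.5 * (np.abs(C) + np.abs(C).T)

rng = np.random.default_rng(0)
u = rng.normal(size=5)
v = rng.normal(size=5)
v -= (v @ u) / (u @ u) * u            # make the two 1-D subspaces orthogonal
# 10 samples on each subspace, stacked as columns of a 5 x 20 matrix
X = np.hstack([np.outer(u, rng.normal(size=10)),
               np.outer(v, rng.normal(size=10))])
W = affinity(self_expression_lsr(X))
within = W[:10, :10].sum() + W[10:, 10:].sum()
across = W[:10, 10:].sum() + W[10:, :10].sum()
print(within > across)  # affinities concentrate within each subspace
```

Because the coefficients concentrate within each subspace, the affinity matrix is nearly block-diagonal, which is exactly the structure spectral clustering exploits.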
Auto-Encoder (AE) is an effective unsupervised deep rep-
resentation learning paradigm, which nonlinearly maps the
original data features into a compact feature space via the
encoder and then passes the compact representations through
the decoder to reconstruct the data. AE is frequently used
to condense data information in clustering tasks. [17], [18]
are two well-known deep embedding learning methods, which
use Kullback-Leibler (KL) divergence regularization to maximize
the similarity between the soft assignments and the target
distributions. During the past few years, AE has also been
introduced to multi-view subspace clustering. Sun et al. [19]
used a self-supervised strategy to improve the unified subspace
representation learning. Zhu et al. [20] simultaneously learned a set
of view-specific self-expression representations, which were then
combined into a common self-expression representation.
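The auto-encoder paradigm underlying these methods can be illustrated with a minimal linear sketch; the synthetic data, dimensions, and plain full-batch gradient-descent training below are illustrative assumptions, not the architecture or optimizer of any cited method:

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic data lying on an 8-dimensional subspace of R^20
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 20))
X /= np.sqrt((X ** 2).mean())           # normalise the overall scale

d, k = 20, 8                            # ambient / latent dimensions
W_enc = 0.1 * rng.normal(size=(d, k))   # encoder weights
W_dec = 0.1 * rng.normal(size=(k, d))   # decoder weights

def reconstruction_loss(Xb):
    Z = Xb @ W_enc    # encode: compact latent features
    R = Z @ W_dec     # decode: reconstruction of the input
    return ((Xb - R) ** 2).mean()

loss0 = reconstruction_loss(X)
lr = 0.1
for _ in range(500):                        # plain full-batch gradient descent
    Z = X @ W_enc
    G = 2.0 * (Z @ W_dec - X) / X.size      # d(loss)/d(reconstruction)
    W_enc -= lr * (X.T @ (G @ W_dec.T))     # chain rule through the decoder
    W_dec -= lr * (Z.T @ G)
loss1 = reconstruction_loss(X)
print(loss1 < loss0)  # reconstruction improves as training proceeds
```

Real methods replace the linear maps with deep nonlinear networks; the latent features Z here play the role of the compact representations that the self-expression layers operate on.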
Wang et al. [21] learned a unified subspace representation
from multi-view discriminative feature spaces. Cui et al. [22]
proposed a spectral supervisor to guide the learning of the
consensus subspace representation. The clustering performance
of the above deep multi-view subspace clustering approaches
is excellent, but their ability to exploit the associations
between multiple subspace representations still needs to be
improved. For instance, [19], [21] directly learn a consistent
self-expression representation from the multi-view latent features
refined by AEs, which cannot capture the characteristics of
disparate views and thus fails to utilize the complementary
information. [20] applies a Hilbert-Schmidt Independence
Criterion (HSIC) regularization term to reinforce the diversity
of different views; such indiscriminate separation of the views
may make it difficult to reach agreement among them.
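The contrastive strategy invoked above can be sketched with a generic InfoNCE-style loss between two views' representations; the temperature, dimensions, and toy representations below are assumptions for illustration, not SCMC's exact objective:

```python
import numpy as np

def contrastive_loss(Z1, Z2, tau=0.5):
    """InfoNCE-style contrastive loss between two views' representations.
    Row i of Z1 and row i of Z2 form a positive pair; all other rows
    serve as negatives."""
    Z1 = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2 = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    S = (Z1 @ Z2.T) / tau                       # scaled cosine similarities
    log_prob = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))          # cross-entropy on the diagonal

rng = np.random.default_rng(2)
Z = rng.normal(size=(16, 6))
# views whose rows correspond incur a low loss ...
aligned = contrastive_loss(Z, Z + 0.01 * rng.normal(size=(16, 6)))
# ... while misaligned views incur a high one
shuffled = contrastive_loss(Z, np.roll(Z, 1, axis=0))
print(aligned < shuffled)
```

Minimizing such a loss pulls the positive pairs together while pushing negative pairs apart, which is the mechanism SCMC uses to align multiple subspace representations without erasing their view-specific characteristics.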
arXiv:2210.06795v1 [cs.LG] 13 Oct 2022