Towards Practical Explainability with Cluster Descriptors 1stXiaoyuan Liu

2025-05-06 0 0 361.28KB 10 页 10玖币
侵权投诉
Towards Practical Explainability with Cluster
Descriptors
1st Xiaoyuan Liu
Fujitsu Research of America, Inc.
Sunnyvale, CA, USA
xliu@fujitsu.com
2nd Ilya Tyagin
University of Delaware
Newark, DE, USA
tyagin@udel.edu
3rd Hayato Ushijima-Mwesigwa
Fujitsu Research of America, Inc.
Sunnyvale, CA, USA
hayato@fujitsu.com
4th Indradeep Ghosh
Fujitsu Research of America, Inc.
Sunnyvale, CA, USA
ighosh@fujitsu.com
5th Ilya Safro
University of Delaware
Newark, DE, USA
isafro@udel.edu
Abstract—With the rapid development of machine learning,
improving its explainability has become a crucial research goal.
We study the problem of making the clusters more explainable
by investigating the cluster descriptors. Given a set of objects
S, a clustering of these objects π, and a set of tags Tthat
have not participated in the clustering algorithm. Each object
in Sis associated with a subset of T. The goal is to find a
representative set of tags for each cluster, referred to as the
cluster descriptors, with the constraint that these descriptors we
find are pairwise disjoint, and the total size of all the descriptors
is minimized. In general, this problem is NP-hard. We propose a
novel explainability model that reinforces the previous models in
such a way that tags that do not contribute to explainability and
do not sufficiently distinguish between clusters are not added to
the optimal descriptors. The proposed model is formulated as
a quadratic unconstrained binary optimization problem which
makes it suitable for solving on modern optimization hardware
accelerators. We experimentally demonstrate how a proposed
explainability model can be solved on specialized hardware
for accelerating combinatorial optimization, the Fujitsu Digital
Annealer, and use real-life Twitter and PubMed datasets for use
cases. Reproducibility materials: Link
Index Terms—explainability, clustering, machine learning,
Ising model, combinatorial optimization
I. INTRODUCTION
Machine learning (ML) methods and artificial intelligence
(AI) technologies bring growing impact to our daily life, across
all domains, from healthcare to urban planning. As a result, the
need to make the obtained results more explainable is contin-
uously growing, and improving the explainability is extremely
important for many applications. Results of unsupervised ML
techniques are often hard to interpret, especially in a post-hoc
analysis. Unsupervised clustering is not an exception.
While most existing literature focuses on performing clus-
tering and finding an explanation simultaneously using the
same dataset, Davidson et al. [1] first introduced the cluster
description problem in which they explored the idea of taking
a partitioned dataset Xfound by some clustering algorithm,
and explaining its result using another dataset Y. For instance,
This work is supported by Fujitsu Research of America, Inc.
in our experiments with the Twitter dataset [1], Xis a matrix
of user reposting behavior in Twitter, i.e., Xi,j is the number
of times user iwas retweeted by user j. We are given the
result found by spectral clustering, which partitions the users
into clusters. The dataset Yis a matrix of users’ hashtag usage,
i.e., Yi,j is the number of times user itweets with hashtag j.
For another example, we experiment with PubMed, the dataset
of scientific abstracts and metadata, for which matrix Xis
not strictly required for clustering. The clustering is naturally
performed based on the medical subject heading terms (MeSH
terms) that are manually curated and added to the document
metadata. Matrix Yrepresents a document-term matrix i.e.,
Yi,j is the number of times a term joccurs in document
i. Overall, there exist many motivating examples in which
we may need to restrict some features from participating in
clustering but then use them to analyze the clusters. One
important example is sensitive and protected features that
exhibit a major problem in modern data mining. These features
are not supposed to participate in clustering but if they still
clearly distinguish the obtained clusters, it could be a sign of a
wrong clustering model or hyperparameters’ choice. Another
use case is the third party proprietary data that cannot inform
clustering but can be used to analyze its results.
Let Sdenote the set of nobjects, S={si}n
i=1, and we are
given a clustering πthat partitions Sinto kdisjoint clusters
{C`}k
`=1. We are also given a set of tags T. Each object siS
is associated with a subset tiTof tags, 1in. For each
cluster, we define T`T, the descriptor of cluster C`. An
object siin cluster C`, is said to be covered by the descriptor
T`, if T`contains at least one of the tags in ti, the tag set
associated with si. The goal of cluster description problem
is to find kpairwise disjoint descriptors, one descriptor for
each cluster, such that all objects are covered and the total
number of tags used in the descriptors is minimized. This
optimization problem is refered to as the disjoint tag descriptor
minimization (DTDM) problem. Figure 1(a) depicts a toy
example with 6 objects and a partition of 2 clusters.
However, the constraint that requires all objects to be
arXiv:2210.10662v2 [cs.LG] 20 Oct 2022
Tag 1 Tag 2 Tag 3 Tag 4 Tag 1 TAG2 TAG 3 TAG4 Tag 1 Tag 2 Tag 3 Tag 4
(a) (b) (c)
Fig. 1. (a) A toy example with 6 objects {si}6
i=1, and their partition into 2 clusters C1and C2. Tag set Tcontains 4 tags. Each object siis associated a subset
tiT. In this example, t1={TAG 1,TAG 2},t2={TAG 1},t3={TAG 1},t4={TAG 2,TAG 3,TAG 4},t5={TAG 3},t6={TAG 3}. One
solution for DTDM is T1={TAG 1}, T2={TAG 3}. (b) Optimal solution for DTDM and MinConCD is T1={TAG 1}, T2={TAG 3}, however both
tags are generic and do not reveal specific information for each cluster, reducing the usefulness of the descriptors. (c) If select T1={TAG 1,TAG 2}, T2=
{TAG 3,TAG 4}as descriptors, they explain each cluster better albeit using more tags.
covered can sometimes be too strict, making even finding
a feasible solution very challenging. To address this issue,
Davidson et al. also discussed several variants of DTDM.
One such variant is referred to as the “cover-or-forget” ver-
sion, which relaxes the constraint that all objects must be
covered and allows ignoring some of the objects. Recently,
Sambaturu et al. [2], have pointed out that the “cover-or-
forget” variant does resolve the feasibility issue but might
have highly unbalanced coverage, i.e., for some clusters, the
descriptors can cover most of the objects, but for some cluster,
the descriptor might cover very few objects. The authors
proposed to address this issue by adding constraints on the
number of objects needed to be covered for each cluster, i.e.,
introduce additional parameter M`for each cluster C`, such
that the number of objects that are covered must be at least
M`. The extended cluster description formulation with more
balanced coverage guarantees for all clusters is referred to
as the minimum constrained cluster description (MinConCD)
problem. The DTDM is therefore a special case of MinConCD
when M`=|C`|for 1`k. However, we argue that this
model has another issue. Consider the following scenario: in
many datasets, we might have several tags that cover most
of the objects. For example, it has been noticed multiple
times that depending on the partially extracted information,
the Twitter hashtags may follow Zipf or approximate power
laws [3], [4]. If we only focus on minimizing the total number
of tags while satisfying the coverage constraints, the most
frequently used tags will be selected preferably. Figure 1(b)
shows such scenario, while the optimal solution uses only 2
tags. These tags might generally describe the entire dataset,
they do not reveal specific information about the cluster they
are assigned to. The descriptors of the clusters will therefore be
similar to each other, reducing the usefulness of the results. In
contrast, we wish to select the tags that reveal the information
of each cluster albeit potentially using more tags, as in the
scenario displayed in Figure 1(c).
To address this issue, we propose to further reinforce the
explainability formulations of MinConCD by adding an addi-
tional metric to the objective function which will fundamen-
tally change the outcomes. The novel metric, which we call
tag modularity (TM), is inspired by the notion of modularity
[5] in network science. Modularity is a measure of the graph
structure which quantifies the strengths of the division of a
graph into clusters. A clustering with high modularity tends to
have dense connections between the nodes within clusters but
sparse connections between nodes in different clusters. Quanti-
tatively, the modularity on graphs measures a deviation of node
clusters from a random model. Maximizing the modularity
naturally punishes trivial clusterings which are practically
useless. Intuitively, by considering both the total number of
tags used for the descriptors and the connections between the
tags and the objects, we can better identify the cluster-specific
tags and filter out the general tags, making the descriptor
more interpretable and thus improving the explainability of
the descriptors.
Both DTDM and MinConCD are challenging combinatorial
optimization problems. Davidson et al. [1] showed that even
for two clusters, finding only a feasible solution of DTDM
is intractable not to mention the optimality. Sambaturu et
al. [2] showed that if the coverage constraints in MinConCD
for each cluster must be met, then unless P=NP , for
any ρ1, there is no polynomial-time algorithm that can
approximate the objective function value with a factor of ρ.
Typically, in order to tackle such challenging optimization
models, the problems are formulated in terms of integer linear
programming (ILP), as proposed in the works of Davidson
et al. and Sambaturu et al. Then, to tackle the computational
intractability of such problems, relaxations on the constraints
or approximation algorithms are applied.
We explore the way to deal with the hardness of such
problems from the optimization accelerator systems’ angle
whose recent advent opens new opportunities for the data
mining community. As we are approaching the end of Moore’s
law [6] due to the physical limitations, major efforts from the
摘要:

TowardsPracticalExplainabilitywithClusterDescriptors1stXiaoyuanLiuFujitsuResearchofAmerica,Inc.Sunnyvale,CA,USAxliu@fujitsu.com2ndIlyaTyaginUniversityofDelawareNewark,DE,USAtyagin@udel.edu3rdHayatoUshijima-MwesigwaFujitsuResearchofAmerica,Inc.Sunnyvale,CA,USAhayato@fujitsu.com4thIndradeepGhoshFujits...

展开>> 收起<<
Towards Practical Explainability with Cluster Descriptors 1stXiaoyuan Liu.pdf

共10页,预览2页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:10 页 大小:361.28KB 格式:PDF 时间:2025-05-06

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 10
客服
关注