ProSiT! Latent Variable Discovery with PROgressive SImilarity Thresholds

Tommaso Fornaciari, Italian National Police (tommaso.fornaciari@poliziadistato.it)
Dirk Hovy, Bocconi University (dirk.hovy@unibocconi.it)
Federico Bianchi, Stanford University (fede@stanford.edu)

arXiv:2210.14763v1 [cs.CL] 26 Oct 2022
Abstract

The most common ways to explore latent document dimensions are topic models and clustering methods. However, topic models have several drawbacks: e.g., they require us to choose the number of latent dimensions a priori, and the results are stochastic. Most clustering methods have the same issues and lack flexibility in various ways, such as not accounting for the influence of different topics on single documents, forcing word descriptors to belong to a single topic (hard clustering), or necessarily relying on word representations. We propose PROgressive SImilarity Thresholds (ProSiT), a deterministic and interpretable method, agnostic to the input format, that finds the optimal number of latent dimensions and only has two hyper-parameters, which can be set efficiently via grid search. We compare this method with a wide range of topic models and clustering methods on four benchmark data sets. In most settings, ProSiT matches or outperforms the other methods in terms of six metrics of topic coherence and distinctiveness, producing replicable, deterministic results.
1 Introduction

Latent variable models are essential for data exploration and are commonly used in social science and business applications. However, the most common methods used to explore these models, topic models and clustering, come with a set of drawbacks, chief among them their stochastic nature. This makes results hard to replicate precisely, as they require constant interpretation. In this paper, we instead propose a latent variable discovery method that produces deterministic results.
Topic models, such as latent Dirichlet allocation (LDA) (Blei et al., 2003), rely on a number of topics that is provided a priori, iteratively refining their initial random guesses to establish probability distributions. This approach guarantees that the final state improves over the initial one in terms of topic quality, but the method is vulnerable to local optima. This problem can be mitigated by initializing several models and picking the most useful/interpretable solution. Selection usually relies on coherence measures that estimate the topics' reliability. However, the method also expects the user to provide the number of topics as a parameter. Therefore, the probability of finding good results depends on previous domain knowledge about the data set, which is often an unreasonable expectation. In these cases, it is difficult to interpret or justify the resulting topics theoretically, as they depend on an arbitrary starting point. Moreover, since each model's random initialization will produce slightly different results, these results are debatable.
Clustering methods, another popular choice for exploring latent document dimensions, similarly rely on the number of latent dimensions as input and on their random initialization. For example, K-Means (Sculley, 2010) can extract topics from groups of documents identified by minimizing the distance between the documents' representations. The procedure partitions the vector space into Voronoi cells, where each document exclusively contributes to the identification of a single cluster (Burrough et al., 2015). However, in many scenarios, the distinction between clusters/topics is nuanced, and we expect to find borderline texts that can be connected to different topics and should not be artificially forced to belong to a unique cluster.
To overcome these limitations, we propose ProSiT, a deterministic and interpretable latent variable discovery algorithm that is entirely reproducible. ProSiT is agnostic to the input space (embeddings, discrete textual, and even non-textual features), allowing for flexible use and inheriting the properties of the input representations, whatever their nature. For example, with documents encoded by a multilingual model, the identified topics will also be multilingual.
ProSiT is entirely data-driven and does not depend on any guess or random initialization. Instead, as the name suggests, it relies on the similarity between texts, i.e., their distances in the vector space. The only theoretical assumption we make is that documents treating similar topic(s) will contain similar words, and will therefore lie in proximate regions of the vector space. In geometric terms, we assume the topics' convexity in the vector space, which implies that the distance between points represents their degree of similarity.
We evaluate ProSiT on four commonly used data sets against eight common and state-of-the-art latent variable discovery methods. We evaluate the performance with six standard metrics of latent topic coherence and distinctiveness. We find that ProSiT is always comparable and often superior to state-of-the-art methods, while requiring fewer parameters and no prior selection of the number of latent components.
Contributions. We propose ProSiT, a novel algorithm for latent variable discovery. It is interpretable, deterministic, input-agnostic, fuzzy, computationally efficient, and effective. We release ProSiT as a PyPI package; the documentation can be found on GitHub at https://github.com/fornaciari/prosit.
2 ProSiT

ProSiT takes as input a corpus of documents. The algorithm has two main steps: 1) identifying the latent variables (from here on: topics) and 2) extracting words that describe these topics.

ProSiT is efficient for inductive learning: once trained, the topics are represented as points in the same vector space as the training documents. Evaluating unseen documents is as simple as computing their distance from the topic coordinates. The only requirement for new documents is that they are represented in the same vector space as the training data.
2.1 Step 1: Finding topics

The first step is an iterative process that identifies several potential topic sets. The number of topics is data-driven, rather than decided a priori. ProSiT uses different degrees of similarity between groups of documents to determine whether they belong to a topic. These similarity levels are determined by a threshold, which changes at each iteration step and follows a hyperbolic curve. A parameter we call α tunes the slope of this curve.
Formally, we assume a set of documents D. Each document is represented in two ways: as a d-dimensional vector m (continuous or sparse), and as a list of strings s, representing the (pre-processed) document with space-separated words. We use the matrix M over every document vector m to determine the topics, and S over every string representation s to extract the topic descriptors.

We set a similarity threshold α and iterate over M with the following procedure until it stops at the smallest number of topics found:
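For concreteness, the following minimal sketch builds the two representations for a toy corpus. The vectorizer below is only an illustrative choice (a sparse count matrix); since ProSiT is agnostic to the input space, M could equally be a matrix of sentence embeddings. The example documents and variable names are ours, not from the released package.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (already pre-processed: lower-cased, stop words removed).
docs = [
    "oil prices rise opec meeting",
    "stock markets fall rate fears",
    "new smartphone model released week",
]

# M: one d-dimensional vector per document (here, a sparse count matrix
# converted to dense; sentence embeddings would work just as well).
vectorizer = CountVectorizer()
M = vectorizer.fit_transform(docs).toarray().astype(float)

# S: the string representation, i.e., the space-separated tokens of each document.
S = [doc.split() for doc in docs]
```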
1. removing any (possible) duplicated vectors, to prevent topics from being affected by repeated documents. This step could seem unnecessary, but many corpora contain repeated texts. Since ProSiT models topics as the average representation of their associated documents, removing duplicates reduces the impact of repeated documents on topic identification;

2. computing the cosine similarity between all documents in M:

   C = M Mᵀ / (‖M‖ ‖Mᵀ‖)    (1)

   The norms are computed row-wise and multiplied as column and row vectors, respectively. The resulting matrix C is a square, symmetric similarity matrix over the unique input documents;

3. mapping each data point to the set of all its neighbors (including itself), that is, the set of points more similar than the threshold α. The output of this procedure is a list of redundant sets. However, it also identifies isolated instances, too far from any other, which result in singleton sets. Both situations are illustrated in Figure 1;

4. removing the groups that are repeated subsets of other groups;

5. computing the groups' centroids, i.e., the documents' average vectors;

6. assigning isolated instances (outliers) to the group having the closest centroid;

7. recomputing the cluster centroids, including the outliers in the groups to which they were assigned;

8. re-starting the iteration, using the centroids as new data points (keeping track of the original data points associated with these centroids).

Figure 1: Examples of data points (lowercase letters) and the relative neighbourhood areas (uppercase letters), determined by α. The dotted lines represent sets of neighbors that are subsets of other sets.
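The following NumPy sketch illustrates one pass of this procedure over dense row vectors. It is a minimal re-implementation for illustration only, with our own function and variable names, not the code of the released package; the bookkeeping that links centroids back to their original documents is omitted.

```python
import numpy as np

def prosit_iteration(M, threshold):
    """One pass of Step 1 (sketch): neighbor sets above a cosine-similarity
    threshold, subset removal, centroids, and outlier re-assignment."""
    # 1. Drop duplicated vectors.
    M = np.unique(M, axis=0)

    # 2. Cosine similarity between all documents (Eq. 1).
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    C = (M @ M.T) / (norms @ norms.T)

    # 3. Map each point to the set of its neighbors above the threshold.
    groups = [frozenset(np.flatnonzero(row >= threshold)) for row in C]

    # 4. Remove duplicated groups and groups that are subsets of other groups.
    groups = set(groups)
    groups = [g for g in groups if not any(g < other for other in groups)]

    # Remaining singletons are isolated instances (outliers).
    clusters = [g for g in groups if len(g) > 1]
    outliers = [next(iter(g)) for g in groups if len(g) == 1]
    if not clusters:          # nothing can be merged at this threshold
        return M, False

    # 5. Compute the groups' centroids (average document vectors).
    members = [list(g) for g in clusters]
    centroids = np.vstack([M[idx].mean(axis=0) for idx in members])

    # 6.-7. Assign outliers to the closest centroid and recompute centroids.
    for o in outliers:
        sims = centroids @ M[o] / (
            np.linalg.norm(centroids, axis=1) * np.linalg.norm(M[o]))
        members[int(sims.argmax())].append(o)
    centroids = np.vstack([M[idx].mean(axis=0) for idx in members])

    # 8. The centroids become the data points of the next iteration.
    return centroids, True
```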
Similarly to agglomerative hierarchical clustering, each document is initially considered its own topic. The number of topics is reduced by iteratively collapsing documents together. However, ProSiT and agglomerative hierarchical clustering differ substantially. In the latter, pairs of instances are iteratively joined, creating a tree that is representable by dendrograms and in which each instance belongs to one cluster only. In ProSiT, as shown in Figure 1, we create sets of instances lying within a similarity threshold, where overlap between sets is allowed. Therefore, a given instance can contribute to the centroid computation of more than one cluster, leading to smoother topic representations.
However, there is a problem: iteratively averaging document vectors pushes the resulting centroids towards the mean of means, i.e., the global centroid of the entire data set. To prevent the document representations from collapsing into one point, in ProSiT we set an increasingly higher cosine similarity threshold (CST) at every training iteration iter, according to a hyperbolic function

CST = (iter − α) / iter    (2)

whose slope depends on the α parameter.
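As a quick numeric illustration of Eq. 2, the snippet below prints the threshold over the first few iterations for some example α values (these particular values are not taken from the paper):

```python
# Progressive similarity threshold of Eq. 2 for a few illustrative alpha values.
for alpha in (0.05, 0.1, 0.3):
    thresholds = [(it - alpha) / it for it in range(1, 6)]
    print(alpha, [round(t, 3) for t in thresholds])
# alpha = 0.05 -> [0.95, 0.975, 0.983, 0.988, 0.99]
# The threshold rises hyperbolically towards 1, so later iterations only merge
# centroids that are already very similar, preventing the collapse towards the
# global centroid described above.
```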
Figure 2: Some examples of the α curve.
Since the thresholds are cosine similarity values, the maximal value is 1. We will therefore want an α value close to 0. Figure 2 shows the hyperbolic curve for various values of α.

ProSiT does not require the number of topics/clusters as an input: they emerge from the process. The topics correspond to the clusters identified at each iteration, whose number becomes progressively smaller until the algorithm reaches convergence. Convergence is reached when all the computed centroids are farther from each other (i.e., less similar) than the current threshold.

Since the topics are represented as points in the documents' vector space, we can compute the affinity between any (unseen) document and each topic in terms of distance, similarly to LDA, which measures the topics' presence in each document.
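A minimal sketch of this inference step, assuming dense vectors and using our own function names, could look as follows:

```python
import numpy as np

def topic_affinities(doc_vectors, topic_centroids):
    """Cosine similarity between (possibly unseen) document vectors and the
    topic centroids found in Step 1; higher values mean closer topics."""
    d = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    t = topic_centroids / np.linalg.norm(topic_centroids, axis=1, keepdims=True)
    return d @ t.T  # shape: (n_documents, n_topics)
```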
2.2 Step 2: Selecting topic descriptors

Once ProSiT has identified a set of topics, the second step consists of selecting the documents that best represent each topic and extracting the most representative terms from them. While it is possible to extract the descriptors at every iteration (each corresponding to a different number of topics), for ease of exposition, we show the results for topics ranging between 5 and 25.
In hard clustering methods, identifying the documents that provide the descriptors is straightforward, as every document belongs to a unique topic. In our case, though, we consider the continuous distance between documents and topics, so that documents can be associated with several topics at different distance levels. We are interested in selecting the most representative documents, i.e., those closest to the topic centroids. To achieve this, we use another threshold, β, which defines how close to a topic a document must be for it to be used for the descriptors. β gives the fraction of documents closest to each topic centroid, ranging from 0 to 1 (where 1 means the whole data set).
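For illustration, a hypothetical selection step over the document-topic affinity matrix (as computed in the earlier sketch) could look like this; the function and variable names are ours:

```python
import numpy as np

def representative_docs(affinities, beta):
    """For every topic, keep the beta fraction of documents that are closest
    to its centroid (affinities: n_documents x n_topics similarity matrix)."""
    n_docs, n_topics = affinities.shape
    n_keep = max(1, int(round(beta * n_docs)))
    order = np.argsort(-affinities, axis=0)   # most similar documents first
    return [order[:n_keep, k].tolist() for k in range(n_topics)]
```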
Once we have selected the set of documents that are most representative of each topic, we extract the descriptors using the information gain (IG) value (Forman et al., 2003), computed for each topic. IG measures the entropy of features selected from instances belonging to different classes, and it is usually used for feature selection in binary classification tasks. Here, we propose an original use of IG in a multi-class scenario, where the topics represent the classes. We rank the terms according to their IG (i.e., their probability of belonging to a topic) and select the top n words. This procedure, similarly to the document-topic affinity, can measure the affinity between words and topics.
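A sketch of this descriptor-extraction step is shown below. It approximates the per-topic information gain in a one-vs-rest fashion with scikit-learn's discrete mutual information (for discrete variables, mutual information and information gain coincide); the exact IG formulation used in the paper may differ, and all names are ours.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def topic_descriptors(term_counts, doc_topic, vocab, top_n=10):
    """Rank candidate descriptor words per topic by information gain.

    term_counts: (n_docs, n_terms) count matrix of the representative documents
    doc_topic:   (n_docs,) topic index assigned to each representative document
    vocab:       list of n_terms strings aligned with the columns of term_counts
    """
    descriptors = {}
    for topic in np.unique(doc_topic):
        y = (doc_topic == topic).astype(int)    # one topic vs. the rest
        ig = mutual_info_classif(term_counts, y, discrete_features=True)
        top = np.argsort(-ig)[:top_n]
        descriptors[int(topic)] = [vocab[i] for i in top]
    return descriptors
```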
3 Experimental Settings

3.1 Data sets

We test ProSiT on four data sets, two with long documents and two with short documents. For long documents, we use the Reuters and Google News data sets, which have previously been used by Sia et al. (2020) and Qiang et al. (2020), respectively; the Reuters data set is available at https://www.nltk.org/book/ch02.html. For short documents, we use Wikipedia abstracts from DBpedia (available at https://wiki.dbpedia.org/downloads-2016-10), the same data set used by Bianchi et al. (2021b), to which we refer as Wiki20K, and a tweet data set used for topic models by Qiang et al. (2020). Wiki20K (Bianchi et al., 2021b) contains Wikipedia abstracts filtered to contain only the 2,000 most frequent words of the vocabulary. Tweet and Google News are standard data sets in the community and were released by Qiang et al. (2020); both data sets have been preprocessed (e.g., stop words have been removed). Table 1 contains descriptive statistics for the data sets. We use a small vocabulary size, which is desirable for most topic-model scenarios, where including extremely low-frequency terms would result in an excessively fine-grained set of topics.

Data set      Docs    Vocab.  Mean words  Mean words (pre-proc.)
Reuters       10,788    949       130.11                   55.02
Google News   10,950  2,000       191.98                   68.00
Wiki20K       20,000  2,000        49.82                   17.44
Tweets2011     2,472  5,098         8.56                    8.56

Table 1: Corpora statistics, with vocabulary size and mean number of words per document before and after pre-processing.
3.2 Metrics

We evaluate the topics with four metrics for coherence and two for distinctiveness. First, we use standard coherence measures, namely CV and Normalized Pointwise Mutual Information (NPMI) (Röder et al., 2015). We also consider the Rank-Biased Overlap (RBO) (Webber et al., 2010), a discrete measure of overlap between sequences.
In addition, we use the inverted RBO (IRBO) score, that is, 1 − RBO (Bianchi et al., 2021a). This score describes how different the topics are from each other on average. Lastly, similarly to the approach of Ding et al. (2018), we use an external word-embedding-based coherence measure (WECO) to compute coherence on an external domain. This metric computes the average pairwise similarity of the words in each topic and averages the results. We use the standard GoogleNews word embeddings commonly used in the literature (Mikolov et al., 2013).
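As a sketch, a WECO-style score can be computed as follows, given the top words of each topic and a mapping from words to external embedding vectors (e.g., a gensim KeyedVectors object for the GoogleNews model); out-of-vocabulary words are simply skipped here, which may differ from the exact handling in the paper:

```python
import numpy as np
from itertools import combinations

def weco(topics, word_vectors):
    """Average pairwise cosine similarity of each topic's descriptor words,
    averaged over all topics."""
    topic_scores = []
    for words in topics:
        vecs = [word_vectors[w] for w in words if w in word_vectors]
        sims = [np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
                for a, b in combinations(vecs, 2)]
        if sims:
            topic_scores.append(np.mean(sims))
    return float(np.mean(topic_scores))
```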
Concerning distinctiveness, that is, how clearly the topics differentiate from each other, we follow Mimno and Lee (2014) and measure Topic Specificity (TS) and Topic Dissimilarity (TD). The first is the average Kullback-Leibler divergence (Kullback and Leibler, 1951) from each topic's conditional distribution to the corpus distribution; the second is based on the conditional distribution of words across different topics.

While we discuss the outcomes of all metrics, due to space constraints we only show the CV results here and report the full results in the Appendix.
3.3 Baselines

We compare ProSiT with two groups of models, differing by the text representations they require as input: embeddings or sparse count features. ProSiT can use either. All comparison methods require the number of latent topics as an a priori input parameter. We evaluate their performance for inputs of 5, 10, 15, 20, and 25 topics to cover a defined range. However, recall that ProSiT does not take the number of topics as input but instead identifies them automatically. Therefore, we also evaluate the other models on the numbers of topics that ProSiT finds.
In the first group, we consider contextualized topic models (Bianchi et al., 2021a, CTM) and ZeroShot topic models (Bianchi et al., 2021b, ZSTM). They introduce the use of contextual embeddings in neural topic modeling.