
to be used for descriptors. β describes the fraction of documents closest to each topic centroid, ranging from 0 to 1 (where 1 means the whole data set).
Once we have selected the set of documents that are most representative of each topic, we extract the descriptors using the information gain (IG) value (Forman et al., 2003), computed for each topic. IG measures the entropy of features selected from instances belonging to different classes and is usually used for feature selection in binary classification tasks. Here we propose an original use of IG in a multi-class scenario, where the topics represent the classes. We rank the terms according to their IG (i.e., the probability of belonging to a topic) and select the top n words. This procedure, similarly to the document-topic affinity, can measure the affinity between words and topics.
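As a rough illustration of this descriptor-extraction step, the sketch below first selects the β fraction of documents closest to a topic centroid and then ranks vocabulary terms by a one-vs-rest information gain for that topic. The variable names (doc_emb, bow, doc_topics), the use of cosine similarity, and the one-vs-rest formulation are assumptions made for illustration, not details taken from ProSiT itself.

# Illustrative sketch only; dense numpy arrays and cosine similarity are assumptions.
import numpy as np

def entropy(p):
    """Shannon entropy of a Bernoulli variable with success probability p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def select_closest_docs(doc_emb, centroid, beta):
    """Indices of the beta fraction of documents closest to one topic centroid."""
    sims = doc_emb @ centroid / (
        np.linalg.norm(doc_emb, axis=1) * np.linalg.norm(centroid) + 1e-12)
    k = max(1, int(beta * len(doc_emb)))
    return np.argsort(-sims)[:k]

def top_descriptors(bow, doc_topics, topic, n_words=10):
    """Rank terms by information gain for one topic (one-vs-rest over documents)."""
    y = (doc_topics == topic).astype(int)   # positive class: documents of this topic
    present = (bow > 0).astype(int)         # binary term-occurrence features
    ig = np.zeros(bow.shape[1])
    for j in range(bow.shape[1]):
        x = present[:, j]
        p_x = x.mean()
        if p_x == 0.0 or p_x == 1.0:
            continue                        # term occurs in all or no documents
        # IG = H(class) - H(class | term occurrence)
        cond = p_x * entropy(y[x == 1].mean()) + (1 - p_x) * entropy(y[x == 0].mean())
        ig[j] = entropy(y.mean()) - cond
    return np.argsort(-ig)[:n_words]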
3 Experimental Settings
3.1 Data sets
We test ProSiT on four data sets, two with long documents and two with short documents. For long documents, we use the Reuters and Google News data sets,[2] which have previously been used by Sia et al. (2020) and Qiang et al. (2020), respectively. For short documents, we use Wikipedia abstracts from DBpedia,[3] the same data set used by Bianchi et al. (2021b), to which we refer as Wiki20K, and a tweet data set used for topic models by Qiang et al. (2020). Wiki20K (Bianchi et al., 2021b) contains Wikipedia abstracts filtered to contain only the 2,000 most frequent words of the vocabulary. The Tweet and Google News data sets are standard in the community and were released by Qiang et al. (2020); both have been preprocessed (e.g., stop words have been removed).
Table 1 contains descriptive statistics for the data sets. We use a small vocabulary size, which is desirable in most topic modeling scenarios, where including extremely low-frequency terms would result in an overly fine-grained set of topics.
[2] The Reuters data set can be found at https://www.nltk.org/book/ch02.html.
[3] The abstracts can be found at https://wiki.dbpedia.org/downloads-2016-10.

Data set      Docs     Vocab.   Mean words   Mean words (pre-processed)
Reuters       10,788      949       130.11                       55.02
Google News   10,950    2,000       191.98                       68.00
Wiki20K       20,000    2,000        49.82                       17.44
Tweets2011     2,472    5,098         8.56                        8.56
Table 1: Corpora statistics: number of documents, vocabulary size, and mean number of words per document before and after pre-processing.

3.2 Metrics
We evaluate the topics with four metrics for coherence and two for distinctiveness. First, we use standard coherence measures, namely CV and Normalized Pointwise Mutual Information (NPMI) (Röder et al., 2015). We also consider Rank-Biased Overlap (RBO) (Webber et al., 2010), a discrete measure of overlap between ranked sequences, and the inverted RBO (IRBO) score, that is, 1 - RBO (Bianchi et al., 2021a). IRBO describes how different the topics are from one another on average.
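For concreteness, a minimal sketch of RBO and IRBO over topic word lists is given below. It uses the simple finite-depth form of rank-biased overlap with the usual persistence parameter p; the exact variant and parameter value used in published evaluations may differ, so this is an illustrative approximation rather than the reference implementation.

# Illustrative sketch only; finite-depth RBO variant and p=0.9 are assumptions.
def rbo(list1, list2, p=0.9):
    """Rank-biased overlap of two ranked lists (finite-depth approximation)."""
    depth = min(len(list1), len(list2))
    score = 0.0
    for d in range(1, depth + 1):
        agreement = len(set(list1[:d]) & set(list2[:d])) / d  # overlap at depth d
        score += (p ** (d - 1)) * agreement
    return (1 - p) * score

def inverted_rbo(topics, p=0.9):
    """Average 1 - RBO over all topic pairs; higher means more distinct topics."""
    pairs = [(i, j) for i in range(len(topics)) for j in range(i + 1, len(topics))]
    if not pairs:
        return 0.0
    return sum(1 - rbo(topics[i], topics[j], p) for i, j in pairs) / len(pairs)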
Lastly, similarly to the approach of Ding et al. (2018), we use an external word embedding-based coherence measure (WECO) to compute coherence with respect to an external domain. This metric computes the average pairwise similarity of the words in each topic and then averages the results over topics. We use the GoogleNews word embeddings commonly used in the literature (Mikolov et al., 2013).
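A hedged sketch of this embedding-based coherence is shown below: it averages the pairwise similarity of the words within each topic and then averages over topics, using pretrained GoogleNews vectors loaded with gensim. The file path and the decision to simply drop out-of-vocabulary words are illustrative assumptions.

# Illustrative sketch only; OOV handling and the embedding file path are assumptions.
from itertools import combinations
from gensim.models import KeyedVectors

def weco(topics, kv):
    """Mean pairwise embedding similarity within topics, averaged over topics."""
    topic_scores = []
    for words in topics:
        words = [w for w in words if w in kv]      # drop out-of-vocabulary terms
        pairs = list(combinations(words, 2))
        if not pairs:
            continue
        topic_scores.append(sum(kv.similarity(a, b) for a, b in pairs) / len(pairs))
    return sum(topic_scores) / len(topic_scores)

# Usage with the pretrained GoogleNews vectors mentioned above (path is illustrative):
# kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz",
#                                        binary=True)
# print(weco([["game", "team", "player"], ["market", "stock", "trade"]], kv))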
Concerning distinctiveness, that is, how clearly the topics differ from one another, we follow Mimno and Lee (2014) and measure Topic Specificity (TS) and Topic Dissimilarity (TD). The former is the average Kullback-Leibler divergence (Kullback and Leibler, 1951) from each topic's conditional word distribution to the corpus distribution; the latter is based on the conditional distribution of words across the different topics.
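To make the first measure concrete, the sketch below computes Topic Specificity as the mean KL divergence from each topic's word distribution to the empirical corpus word distribution. The smoothing constant and the use of raw corpus counts for the reference distribution are assumptions; Topic Dissimilarity is omitted because the text does not spell out its exact form.

# Illustrative sketch only; smoothing and the corpus reference distribution are assumptions.
import numpy as np

def topic_specificity(topic_word, corpus_counts, eps=1e-12):
    """Mean KL(topic || corpus).
    topic_word: (n_topics, vocab) matrix whose rows sum to 1.
    corpus_counts: (vocab,) raw term counts over the whole corpus."""
    p_corpus = corpus_counts / corpus_counts.sum() + eps
    divergences = []
    for p_topic in topic_word:
        p_t = p_topic + eps
        divergences.append(np.sum(p_t * np.log(p_t / p_corpus)))  # KL divergence
    return float(np.mean(divergences))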
While we discuss the outcomes of all metrics, due to space constraints we only show the CV results here; the full results are reported in the Appendix.
3.3 Baselines
We compare ProSiT with two groups of models, which differ in the text representations they require as input: embeddings or sparse count features. ProSiT can work with either representation.
All comparison methods require the number of latent topics as an a priori input parameter. We evaluate their performance for 5, 10, 15, 20, and 25 topics to cover a fixed range. However, recall that ProSiT does not take the number of topics as input but instead identifies it automatically. Therefore, we also evaluate the other models with the numbers of topics that ProSiT finds.
In the first group, we consider contextualized topic models (Bianchi et al., 2021a, CTM) and ZeroShot topic models (Bianchi et al., 2021b, ZSTM). They introduce the use of contextual embeddings