the cluster structure is refined by means of an improvement
algorithm that reconsiders the sequence of observations and
performs a greedy, local maximization of the log-likelihood
function of the BMC model.
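To illustrate the idea, the greedy improvement step can be sketched as follows. This is our own minimal reimplementation under simplifying assumptions (maximum-likelihood block transition estimates, probability mass spread uniformly within the destination cluster), not the authors' exact algorithm; the function names `bmc_loglik` and `greedy_improve` are ours.

```python
import numpy as np

def bmc_loglik(traj, sigma, K):
    """Log-likelihood of a trajectory under a fitted BMC.

    sigma[s] is the cluster label of state s; transition probabilities
    depend only on cluster memberships, with probability mass spread
    uniformly over the states of the destination cluster.
    """
    N = np.zeros((K, K))  # cluster-level transition counts
    for x, y in zip(traj[:-1], traj[1:]):
        N[sigma[x], sigma[y]] += 1
    sizes = np.bincount(sigma, minlength=K)  # states per cluster
    rows = N.sum(axis=1, keepdims=True)
    P = np.divide(N, rows, out=np.zeros_like(N), where=rows > 0)
    ll = 0.0
    for x, y in zip(traj[:-1], traj[1:]):
        # observed transitions always have positive estimated probability
        ll += np.log(P[sigma[x], sigma[y]] / sizes[sigma[y]])
    return ll

def greedy_improve(traj, sigma, K, max_sweeps=20):
    """Greedily reassign single states to other clusters as long as the
    log-likelihood increases: a local maximization, as described above."""
    sigma = np.array(sigma)
    best = bmc_loglik(traj, sigma, K)
    for _ in range(max_sweeps):
        improved = False
        for s in range(len(sigma)):
            current = sigma[s]
            for k in range(K):
                if k == current:
                    continue
                sigma[s] = k
                ll = bmc_loglik(traj, sigma, K)
                if ll > best + 1e-12:
                    best, current, improved = ll, k, True
                sigma[s] = current  # keep the best label found so far
        if not improved:
            break
    return sigma, best
```

Recomputing the full log-likelihood per candidate move is deliberately naive; an efficient implementation would update the counts incrementally.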
A broad study of this clustering algorithm's performance on sequential data from real-world processes has, however, not been provided. We tackle this problem by investigating the clustering algorithm under a variety of scenarios, thereby demonstrating its applicability to real-world data. This is the main contribution of the present paper.
Evaluating the actual performance of such a clustering algorithm in a real-world scenario is, however, a nontrivial task, since the latent cluster structure is unknown, as opposed to a scenario where the data is generated synthetically. Hence, we also propose a set of tools for evaluating the quality of the clusters and the BMC model. We therefore go a step beyond clustering benchmark evaluation and use statistical tools to assess the validity of the BMC assumption in the data.
Our goal here is not to compare the performance of the BMC clustering algorithm to that of other algorithms. Such a comparison would only be fair if both algorithms had access to the same amount of training data. Note that the BMC algorithm is explicitly designed to operate in sparse regimes where the amount of data is small. Most model-free algorithms, on the other hand, such as those based on deep learning, perform optimally in a regime where one has access to large amounts of training data. A comparison would consequently be vacuous, its outcome being determined mainly
by the choice of the amount of training data. Our goal is rather
to supplement the theoretical understanding of the BMC-based
algorithm with a practical viewpoint.
In order to practically evaluate the clustering algorithm for
BMCs, we investigate the following questions:
• How can the BMC model practically aid in the exploration of nonsynthetic sequential data obtained from real-world processes?
• Can the quality of the found clusters be evaluated when no ground-truth clustering is known?
• How can one statistically decide whether the BMC model is an appropriate model for the sequence of observations? How can it be detected that either a simpler model than a BMC would suffice, or that a richer model is required?
• Can the algorithm be expected to give meaningful results despite the sparsity and complexity of real-world data? Is the clustering algorithm robust to model violations?
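To give a concrete flavor of the model-selection question, one generic approach (our own illustration, not necessarily the paper's exact test) is to compare nested candidate models by the Bayesian information criterion, for instance an i.i.d. model against a first-order Markov chain; richer-versus-simpler comparisons involving the BMC follow the same pattern. The function name `bic_iid_vs_markov` is ours.

```python
import numpy as np

def bic_iid_vs_markov(traj, n_states):
    """Compare an i.i.d. model against a first-order Markov chain by BIC
    (lower is better). Decides whether sequential structure is present."""
    traj = np.asarray(traj)
    T = len(traj)
    # i.i.d. model: multinomial over states, n_states - 1 free parameters
    counts = np.bincount(traj, minlength=n_states)
    p = counts / T
    ll_iid = np.sum(counts[counts > 0] * np.log(p[counts > 0]))
    k_iid = n_states - 1
    # Markov model: empirical transition matrix, n(n - 1) free parameters
    N = np.zeros((n_states, n_states))
    for x, y in zip(traj[:-1], traj[1:]):
        N[x, y] += 1
    rows = N.sum(axis=1, keepdims=True)
    P = np.divide(N, rows, out=np.zeros_like(N), where=rows > 0)
    ll_mc = np.sum(N[N > 0] * np.log(P[N > 0]))
    k_mc = n_states * (n_states - 1)
    pen = np.log(T)  # BIC penalty per parameter
    return k_iid * pen - 2.0 * ll_iid, k_mc * pen - 2.0 * ll_mc
```

On a strongly sequential trajectory the Markov model wins despite its larger penalty, whereas on exchangeable data the i.i.d. model is preferred.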
To obtain insight into these questions, we develop practical
tools for evaluating the merit of a detected clustering, which
we then apply to several data sets. These evaluation tools
are designed to cope with the aforementioned fact that
ground-truth clusters are not known to us. The datasets we consider are diverse: they come from the fields of microbiology, natural language processing, ethology, and finance. Specifically, we investigate sequences of:
a. Codons in human Deoxyribonucleic Acid (DNA).
b. Words in Wikipedia articles.
c. Global Positioning System (GPS) data describing the movement of bison in a field.
d. Companies in the Standard and Poor's 500 (S&P500) with the highest daily return.
On each dataset we apply the BMC clustering algorithm to
uncover underlying clusters, and we evaluate these clusters
using the collection of tools which are developed in the present
paper. It is here found that the BMC model indeed aids in
exploratory data analysis of real-world sequential data.
For example, in DNA the algorithm leads us to rediscover
phenomena which are known in the biological community
as codon–pair bias and dinucleotide bias. In the text-based
sequential data, we find that the BMC-based improvement algorithm improves performance on downstream natural language processing tasks. The model evaluation tools of the present paper prove informative here: they uncover model violations suggestive of future
methodological expansion. Our findings in the GPS data are
particularly striking. There, a scatter plot of the raw data yields a picture that is difficult to interpret. After clustering, a picture can be displayed that provides significantly more insight; compare Figure 3 and Figure 6. It is further notable that the sequential structure of the data gives rise to geographical features that would have been difficult, if not impossible, to extract based solely on a distance-based notion of similarity, such as that employed in K-means. The S&P500 dataset
gives the least clear conclusions out of the four datasets but is a
good illustrative example for our evaluation tools in a difficult
setting. The difficulty of this dataset is due to the combination
of sparsity and a nuisance factor. With these different examples, along with the evaluation tools developed, we answer the questions posed above in the positive, and we conclude that BMC-based clustering can be successfully applied to real-world sequential data.
Let us finally contemplate the practical implication of our
findings as it pertains to using clustering within sequential data
as a method to speed up a subsequent optimization procedure.
Canonical machine learning algorithms that one would like to apply to sequential data, such as Q-learning or SARSA, slow down to an impractical degree if the observations come from a high-dimensional state space.
By solving such learning problems on an accurate lower-
dimensional representation of the state space, the numerical
complexity can be reduced dramatically [3], [4]. The present paper suggests that future integration of clustering in sequential
data applications such as natural language processing and
machine learning is indeed practically feasible.
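To make this concrete, the following sketch runs tabular Q-learning on a toy chain environment in which the Q-table is indexed by cluster labels rather than raw states, shrinking the table from n states to K clusters. The environment, the contiguous cluster map, and the function name are our own illustrative choices, not a construction from the cited works.

```python
import numpy as np

def q_learning_on_clusters(n_states=6, K=3, episodes=300, max_steps=40,
                           alpha=0.2, gamma=0.95, eps=0.3, seed=0):
    """Tabular Q-learning on a chain MDP, with the Q-table indexed by
    cluster labels instead of raw states (n_states rows -> K rows)."""
    rng = np.random.default_rng(seed)
    sigma = np.arange(n_states) * K // n_states  # contiguous cluster map
    goal = n_states - 1                          # reward at the right end
    Q = np.zeros((K, 2))                         # actions: 0 = left, 1 = right
    for _ in range(episodes):
        s = int(rng.integers(n_states - 1))      # random non-goal start
        for _ in range(max_steps):
            c = sigma[s]
            if rng.random() < eps:
                a = int(rng.integers(2))         # explore
            else:
                a = int(np.argmax(Q[c]))         # exploit
            s2 = min(s + 1, goal) if a == 1 else max(s - 1, 0)
            done = s2 == goal
            r = 1.0 if done else 0.0
            # bootstrap from the cluster of the successor state
            target = r + (0.0 if done else gamma * np.max(Q[sigma[s2]]))
            Q[c, a] += alpha * (target - Q[c, a])
            s = s2
            if done:
                break
    return Q, sigma
```

The price of the aggregation is that states sharing a cluster share a value, which is accurate only insofar as the clustering respects the underlying dynamics.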
Structure of this paper
We introduce the problem of clustering in sequential data in
Section 2. We describe the BMC as well as other models that
appear in our experiments in Section 3, and briefly discuss
the advantages of a model-based approach. Next, we give an
overview of related literature in Section 4, and we introduce
the clustering algorithm in Section 5. There, we also describe our C++ implementation of this clustering algorithm, which we made publicly available as a Python library. Section 6 describes
practical tools to evaluate clusters found in datasets in the
absence of knowledge of the underlying ground truth. Section 7 introduces the datasets and explains our preprocessing procedures; Sections 8 and 9 then extensively evaluate the clusters.
Finally, Section 10 concludes with a brief summary.