SampleHST: Efficient On-the-Fly Selection of
Distributed Traces
Alim Ul Gias∗, Yicheng Gao†, Matthew Sheldon†, José A. Perusquía‡, Owen O’Brien§, Giuliano Casale†
∗University of Westminster, Email: a.gias@westminster.ac.uk
†Imperial College London, Email: {y.gao20, matthew.sheldon20, g.casale}@imperial.ac.uk
‡Universidad Nacional Autónoma de México, Email: jose.perusquia@sigma.iimas.unam.mx
§Huawei Technologies (Ireland) Co., Ltd, Email: owen.obrien@huawei.com
Abstract—Since only a small fraction of the traces generated
by distributed tracing is useful for troubleshooting, storage
requirements can be reduced significantly by biasing the selection
towards anomalous traces. To this end, we propose SampleHST,
a novel approach to sample on-the-fly from a stream of traces
in an unsupervised manner. SampleHST adjusts the storage
quota of normal and anomalous traces depending on the size
of its budget. It first scores each trace using a forest of Half
Space Trees (HSTs), based on the distribution of mass scores
across the trees, which characterizes the probability of observing
different traces. The mass distribution from the HSTs is then
used to cluster the traces online, leveraging a variant of the
mean-shift algorithm. This trace-cluster association ultimately
drives the sampling decision. We have compared the performance
of SampleHST with a recently proposed method using data
from a cloud data center and demonstrated that SampleHST
improves sampling performance by up to 9.5×.
Index Terms—Distributed Tracing, Microservices, Anomaly
Detection, Sampling.
I. INTRODUCTION
Distributed tracing is tailored primarily to monitoring and
profiling applications built with a microservice-based archi-
tecture [1]. In a microservice ecosystem, as the number of
services grows, the volume of trace data used for observing
application performance and reliability increases signifi-
cantly [2]. In a typical production setup, each server, hosting
hundreds of microservices, generates several tens of gigabytes
of trace data every day. Across all servers, the total data
generated daily is on the order of several terabytes.
Nevertheless, most of the traces do not report on application
anomalies and thus there is little value in storing them all.
The fraction of traces that can be retained is constrained by
a storage budget [3], and the problem we study is how to
select the most interesting traces to help in monitoring and
diagnosing microservices’ runtime behavior. This entails
sampling a mix of traces that characterizes the overall user
behavior while at the same time retaining a high relative
ratio of anomalous traces.
To accommodate the storage budget, we need to deploy a
sampling strategy. It is a common industry practice to use
uniform sampling [3], also referred to as head-based sampling.
Under this strategy, the sampling decision is taken as soon as
the request for a service is received, leading to a low hit
rate for anomalous traces. To address this issue, it is
increasingly preferred to use a tail-based sampling strategy
[4], which can improve selection accuracy since it takes the
sampling decision after the response is served, i.e., when the
entire trace for the service call chain is available. This allows
one to reason about the information contained in the trace
itself when deciding whether to store it.
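To make this contrast concrete, the two strategies can be sketched as follows. This is a hypothetical illustration, not the method proposed in this paper: the trace format (a list of span dictionaries) and the latency-based anomaly scorer are our own assumptions.

```python
import random

SAMPLE_RATE = 0.01  # uniform 1% head-based sampling


def head_based_decision():
    # Decision taken when the request arrives: no trace content is
    # available yet, so anomalous traces are hit only by chance.
    return random.random() < SAMPLE_RATE


def tail_based_decision(trace, is_anomalous):
    # Decision taken after the response is served: the full span list
    # is available, so any trace-level scoring function can inspect it.
    return is_anomalous(trace)


# Example scorer (an assumption for illustration): flag traces whose
# total latency exceeds a threshold.
slow = lambda trace: sum(span["duration_ms"] for span in trace) > 500

print(tail_based_decision([{"duration_ms": 700}], slow))  # True
print(tail_based_decision([{"duration_ms": 100}], slow))  # False
```

Under head-based sampling, the expected hit rate on anomalies equals the uniform rate; the tail-based decision, by contrast, can condition on any property of the completed trace.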
Ideally, a tail-based sampling strategy should be online and
without any batch processing. This means that we must decide
either to save or discard a trace on-the-fly rather than storing
it temporarily for batch processing. Recently, researchers have
proposed different tail-based sampling strategies based on
unsupervised learning [3], [5], [6]. However, existing research
faces multiple challenges, such as difficulty in clustering
high-dimensional data, the need for batch processing, low-
amplitude scores for anomalous traces, and no explicit
consideration of the budget size. To address these
shortcomings, we propose a novel method, SampleHST.
On the one hand, SampleHST focuses on sampling only
anomalous traces when the storage budget is smaller than
the expected fraction of anomalies. On the other hand, when
the budget is larger, SampleHST samples both normal and
anomalous traces, with a bias towards anomalous ones. Such
a bias is fair because it increases the representation of
anomalous traces, which are rare compared to normal ones,
among the sampled traces. In other words, the bias enables
representative sampling [3], [5].
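As a rough illustration of this budget-dependent behavior, one could split the budget as sketched below. This is a simplification we introduce for exposition; SampleHST's actual quota adjustment is driven by its scoring and clustering machinery, and both arguments here are hypothetical estimates rather than quantities the method computes directly.

```python
def storage_quota(budget, anomaly_fraction):
    """Split a storage budget (a fraction of the trace stream) between
    anomalous and normal traces, biased towards anomalies."""
    if budget <= anomaly_fraction:
        # Tight budget: spend it entirely on anomalous traces.
        return {"anomalous": budget, "normal": 0.0}
    # Larger budget: keep all expected anomalies and fill the rest
    # with normal traces, preserving the bias towards anomalies.
    return {"anomalous": anomaly_fraction,
            "normal": budget - anomaly_fraction}


print(storage_quota(0.01, 0.02))  # {'anomalous': 0.01, 'normal': 0.0}
print(storage_quota(0.10, 0.02))  # anomalies capped, remainder to normal
```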
SampleHST leverages a Bag-of-Words (BoW) model [7]
as a count-based representation for each trace. Taking this
representation as input, we generate a distribution of mass
values from a forest of tree-based classifiers, namely Half
Space Trees (HSTs) [8]. This distribution is then used to
cluster the traces online with an algorithm we have developed
that belongs to the mean-shift clustering family [9]. Once
clustering is complete, we decide whether to sample a trace
based on its cluster association, i.e., a trace is more likely
to be sampled if it is associated with a cluster with low
mass values, as such clusters represent rarely observed traces.
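A minimal, self-contained sketch of the BoW-plus-mass-scoring idea is shown below. This is our own toy illustration, not the authors' implementation: the vocabulary, the single shallow tree, and the deterministic split choices (cycling through dimensions at a fixed midpoint) are all simplifications of a real HST forest, which draws its splits randomly over shifted work ranges.

```python
from collections import Counter


def bow(trace, vocab):
    # Bag-of-Words: count how often each operation name occurs.
    counts = Counter(span["op"] for span in trace)
    return [counts.get(op, 0) for op in vocab]


class HalfSpaceTree:
    """A toy half-space tree: each level halves one dimension of the
    input space, and leaves record the mass (number of points seen)."""

    def __init__(self, dim, depth, lo=0.0, hi=10.0):
        self.depth = depth
        # Simplification: cycle through dimensions and split at the
        # midpoint; real HSTs choose dimensions and splits randomly.
        self.split_dims = [i % dim for i in range(depth)]
        self.splits = [(lo + hi) / 2.0] * depth
        self.mass = Counter()  # leaf id -> points seen so far

    def _leaf(self, x):
        node = 0
        for level in range(self.depth):
            bit = 1 if x[self.split_dims[level]] >= self.splits[level] else 0
            node = node * 2 + bit
        return node

    def fit_one(self, x):
        self.mass[self._leaf(x)] += 1

    def score_one(self, x):
        # Low mass => the point falls in a rarely visited region,
        # i.e., it resembles few previously seen traces.
        return self.mass[self._leaf(x)]


vocab = ["login", "checkout", "db_query"]  # hypothetical operations
tree = HalfSpaceTree(dim=len(vocab), depth=3)

normal = [{"op": "login"}, {"op": "db_query"}]
for _ in range(100):
    tree.fit_one(bow(normal, vocab))

rare = bow([{"op": "checkout"}] * 6, vocab)
print(tree.score_one(bow(normal, vocab)) > tree.score_one(rare))  # True
```

In SampleHST, the mass values from a whole forest of such trees form the distribution used for online clustering, and traces landing in low-mass clusters are preferentially sampled.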
We evaluate the performance of SampleHST using data
provided by a commercial cloud service operator, com-
paring the results with a recently proposed approach for point
anomalies developed in [3]. For this production dataset, we
see that SampleHST yields 2.3× to 9.5× better sampling
performance in terms of precision, recall, and F1-score than
prior work. When we consider representative sampling in a
arXiv:2210.04595v1 [cs.DC] 10 Sep 2022