SampleHST: Efficient On-the-Fly Selection of
Distributed Traces
Alim Ul Gias, Yicheng Gao†, Matthew Sheldon†, José A. Perusquía‡, Owen O’Brien§, Giuliano Casale†
University of Westminster, Email: a.gias@westminster.ac.uk
†Imperial College London, Email: {y.gao20, matthew.sheldon20, g.casale}@imperial.ac.uk
‡Universidad Nacional Autónoma de México, Email: jose.perusquia@sigma.iimas.unam.mx
§Huawei Technologies (Ireland) Co., Ltd, Email: owen.obrien@huawei.com
Abstract—Since only a small number of the traces generated
by distributed tracing help in troubleshooting, their storage
requirement can be significantly reduced by biasing the selection
towards anomalous traces. To aid in this scenario, we propose
SampleHST, a novel approach to sample on-the-fly from a stream
of traces in an unsupervised manner. SampleHST adjusts the
storage quota of normal and anomalous traces depending on the
size of its budget. Initially, it utilizes a forest of Half Space Trees
(HSTs) for trace scoring. This is based on the distribution of the
mass scores across the trees, which characterizes the probability
of observing different traces. The mass distribution from HSTs
is subsequently used to cluster the traces online leveraging a
variant of the mean-shift algorithm. This trace-cluster association
eventually drives the sampling decision. We have compared the
performance of SampleHST with a recently suggested method
using data from a cloud data center and demonstrated that
SampleHST improves sampling performance by up to 9.5×.
Index Terms—Distributed Tracing, Microservices, Anomaly
Detection, Sampling.
I. INTRODUCTION
Distributed tracing is tailored primarily to monitoring and
profiling applications built with the microservice-based
architecture [1]. In a microservice ecosystem, the volume of trace
data used for observability of application performance and
reliability grows significantly as the number of services
increases [2]. In a typical production setup, each server, hosting
hundreds of microservices, generates several tens of gigabytes
of trace data every day. Across all servers, the total data
generated daily is on the order of several terabytes.
Nevertheless, most of the traces do not report on application
anomalies and thus there is little value in storing them all.
The fraction that can be retained is constrained by a storage
budget [3] and the problem we study is how to select the
most interesting traces to help monitoring and diagnostics of
microservices runtime behavior. This entails sampling a mix
of traces that characterizes the overall user behavior but at the
same time retaining a high relative ratio of anomalous traces.
To accommodate the storage budget, we need to deploy a
sampling strategy. It is a common industry practice to use
uniform sampling [3], which is also referred to as head-based
sampling. Under this strategy, the sampling decision is taken
once the request for a service is received, leading to a lower
hit rate of anomalous traces. To address this issue, it is
increasingly preferred to use a tail-based sampling strategy
[4], which can improve the selection accuracy as it takes the
sampling decision after the response is served, i.e., when the
entire trace for the service call chain is available. This allows
the sampler to reason about the information contained in the trace
itself when deciding whether to store it.
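The difference between the two strategies described above can be illustrated with a minimal sketch. The trace layout (a list of `(service, duration_ms)` spans), the `slow` rule and its `limit_ms` threshold are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import random


def head_based_decision(rate, rng):
    """Decide when the request arrives, before any span has been recorded."""
    return rng.random() < rate


def tail_based_decision(trace, is_interesting):
    """Decide after the response is served, with the full trace available."""
    return is_interesting(trace)


def slow(trace, limit_ms=500):
    # Hypothetical rule: a trace is interesting if any span exceeds limit_ms.
    return any(duration > limit_ms for _, duration in trace)


# Hypothetical trace: (service, duration in milliseconds) per span.
trace = [("gateway", 12), ("orders", 950), ("db", 900)]

rng = random.Random(0)
kept_head = head_based_decision(0.05, rng)    # does not look at the trace
kept_tail = tail_based_decision(trace, slow)  # can see the 950 ms span
```

The head-based decision depends only on the sampling rate, so an anomalous trace is kept with the same probability as any other trace, whereas the tail-based decision can use the content of the completed trace.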
Ideally, a tail-based sampling strategy should be online and
without any batch processing. This means that we must decide
either to save or discard a trace on-the-fly rather than storing
it temporarily for batch processing. Recently, researchers have
proposed different tail-based sampling strategies based on
unsupervised learning [3], [5], [6]. However, existing research
faces multiple challenges such as difficulties in performing
clustering due to high dimensionality of data, requirements of
batch processing, low amplitude scores for anomalous traces,
and no explicit consideration of the budget size. To address all
these shortcomings, we propose a novel method, SampleHST.
On the one hand, SampleHST focuses on sampling only
anomalous traces when the storage budget is comparatively
lower than the fraction of expected anomalies. On the other
hand, when the budget is higher, SampleHST samples both the
normal and anomalous traces, with a bias towards anomalous
ones. Such a bias is fair because it increases the representation
of the anomalous traces, which are rare compared to normal
ones, among the sampled traces. In other words, the bias
allows representative sampling [3], [5].
SampleHST leverages a Bag-of-Words (BoW) model [7]
as a count-based representation for each trace. By taking this
representation as an input, we can generate a distribution of the
mass values obtained from a forest of a tree-based classifier,
namely Half Space Trees (HSTs) [8]. This distribution is then
used to perform an online clustering of the traces using an
algorithm we have developed, which belongs to the mean-shift
family of clustering algorithms [9]. Once the clustering is
complete, we decide whether to sample the trace based on its
cluster association, i.e., a trace is more likely to be sampled if
it is associated with a cluster with low mass values, as such
clusters represent rarely observed traces.
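The first two steps above (count-based representation, then per-tree mass values) can be sketched as follows. This is a simplified illustration under stated assumptions, not the SampleHST implementation: the "words" are assumed to be service names, the tree uses random midpoint splits of fixed depth, and the windowing and depth weighting of the original HST algorithm [8] are omitted. All names (`bag_of_words`, `HalfSpaceTree`, `mass_of`) are hypothetical.

```python
import random
from collections import Counter


def bag_of_words(trace, vocabulary):
    """Count how often each vocabulary entry occurs in the trace."""
    counts = Counter(trace)
    return [counts[word] for word in vocabulary]


class HalfSpaceTree:
    """Simplified half-space tree with random axis-aligned midpoint splits."""

    def __init__(self, ranges, depth, rng):
        self.splits = {}  # internal node id -> (dimension, split value)
        self.mass = {}    # leaf node id -> number of points observed
        self._build(1, list(ranges), depth, rng)

    def _build(self, node, ranges, levels_left, rng):
        if levels_left == 0:
            return
        dim = rng.randrange(len(ranges))
        low, high = ranges[dim]
        mid = (low + high) / 2.0
        self.splits[node] = (dim, mid)
        left, right = list(ranges), list(ranges)
        left[dim] = (low, mid)
        right[dim] = (mid, high)
        self._build(2 * node, left, levels_left - 1, rng)
        self._build(2 * node + 1, right, levels_left - 1, rng)

    def _leaf(self, x):
        node = 1
        while node in self.splits:
            dim, mid = self.splits[node]
            node = 2 * node if x[dim] < mid else 2 * node + 1
        return node

    def update(self, x):
        leaf = self._leaf(x)
        self.mass[leaf] = self.mass.get(leaf, 0) + 1

    def mass_of(self, x):
        return self.mass.get(self._leaf(x), 0)


# Hypothetical vocabulary and traces (lists of service names).
vocabulary = ["gateway", "orders", "db", "cache"]
normal = ["gateway", "orders", "db"]
rare = ["gateway"] + ["cache"] * 7

rng = random.Random(7)
forest = [HalfSpaceTree([(0, 8)] * len(vocabulary), 4, rng) for _ in range(10)]
for _ in range(50):
    for tree in forest:
        tree.update(bag_of_words(normal, vocabulary))

# One mass value per tree: the "distribution of mass values" for a trace.
normal_masses = [t.mass_of(bag_of_words(normal, vocabulary)) for t in forest]
rare_masses = [t.mass_of(bag_of_words(rare, vocabulary)) for t in forest]
```

In this sketch the frequently observed trace has a mass of 50 in every tree, while the rare trace has a mass of 0 in every tree whose splits separate the two vectors, so its mass distribution is shifted towards low values.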
arXiv:2210.04595v1 [cs.DC] 10 Sep 2022
We evaluate the performance of SampleHST using data
provided by a commercial cloud service operator, comparing
the results with a recently proposed approach for point
anomalies developed in [3]. For this production dataset,
SampleHST yields 2.3× to 9.5× better sampling
performance in terms of precision, recall and F1-Score than
prior work. When we consider representative sampling in a
high budget scenario, SampleHST is 1.6× fairer with
respect to the Jain fairness index [10]. In summary, the key
contributions are:
• A novel approach to sample distributed traces by forming
  clusters using the mass distribution of the traces obtained
  from Half Space Trees.
• An online clustering method, generalizing the mean-shift
  algorithm [11], that considers non-spherical cluster
  shapes such as hyper-cubes and hyper-rectangles.
• Experiments using real-world data to compare the sampling
  performance of SampleHST with a recent tail-based
  sampling approach [3].
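For reference, the Jain fairness index [10] mentioned above is defined for shares x_1, ..., x_n as (Σ x_i)² / (n · Σ x_i²); it equals 1 when all shares are equal and 1/n when a single share takes everything. The sketch below computes it in Python. Applying it to the number of sampled traces per category is an assumption made here for illustration, since this part of the paper does not state exactly which shares are compared.

```python
def jain_fairness_index(shares):
    """Return (sum x)^2 / (n * sum x^2) for a list of non-negative shares."""
    total = sum(shares)
    squares = sum(s * s for s in shares)
    if squares == 0:
        raise ValueError("at least one share must be non-zero")
    return total * total / (len(shares) * squares)


# Hypothetical counts of sampled traces per category (normal, anomalous).
balanced = jain_fairness_index([50, 50])
skewed = jain_fairness_index([90, 10])
one_sided = jain_fairness_index([100, 0])
```

A sampler that stores only one category scores 0.5 for two categories, and an equal split scores 1.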
The rest of the paper is organized as follows. Section
II presents the related work and motivation for developing
SampleHST. Section III demonstrates how to model traces
and detect anomalies. Section IV discusses how to transform
anomaly detection processes into a sampling method. Sections
V and VI present the SampleHST clustering and sampling
algorithms, respectively. Section VII evaluates the sampling
performance. Section VIII concludes the paper. Proofs are
given in the Appendix.
II. BACKGROUND
A. Related Work
The first step in designing a sampler is to differentiate the
anomalous traces from the normal ones. There have been
many works on anomaly detection for microservices using
their generated traces. The authors in [12], [13] learn from
the patterns of call trees and request execution, respectively, to
detect anomalies. Some studies [14]–[16] also consider deep
learning-based methods focusing on different aspects, e.g.,
response times and causal relationships. However, these works
do not consider our sampling scenario, i.e., they focus only
on anomaly detection and not on transforming the anomaly
detection result into a sampling decision.
To the best of our knowledge, only a few research
papers focus on sampling anomalous traces generated by
microservices. In [3], the authors propose a sampler based
on the hierarchical clustering method PERCH [17]. The authors
demonstrate that their method can achieve representative
sampling, meaning an equal share for both normal and anomalous
traces. Such clustering methods can suffer from the curse of
dimensionality [18], and they often require batch processing,
which is not always acceptable under low-latency requirements.
Sifter [5] avoids batch processing by taking sampling
decisions trace-by-trace. It generates a sampling probability
from the loss incurred when training a neural network on a
particular trace. A potential issue with loss-based methods is that
anomalous traces may still have small probabilities overall,
closer to 0 than to 1, allowing several anomalous traces to
go unsampled. This problem is studied in a recently proposed
sampler, Sieve [6], which uses a threshold to first separate the
anomalous traces and then amplify their sampling probability.
This still leaves an open challenge regarding the optimal and
automated choice of the threshold.
B. Sampling Performance
Since trace sampling can be viewed as a classification problem,
it may be natural to study its performance in terms of F1-Score,
as this strikes a balance between Precision and Recall. We
observe, however, that this is not always an ideal performance
criterion in the presence of budget constraints. For example,
with an abundant storage budget and few constraints, Recall is
the more appropriate metric, while a heavily constrained storage
budget calls for high Precision. To summarize, we
set the following overall performance evaluation principles for
trace sampling methods:
• For infrequent anomalous traces, where the prevalence
  of anomalies is less than the storage budget, the primary
  evaluation metric should be the Recall.
• For low storage budgets, where the prevalence of anomalies
  is greater than the storage budget, the primary
  evaluation metric should be the Precision.
• When sampling N traces from a collection of traces
  containing N anomalies, the primary evaluation metric
  should be the F1-Score.
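The three principles above can be expressed as a small helper, shown here as a minimal sketch. The function names are hypothetical, and both `budget` and `prevalence` are assumed to be given as fractions of the total number of traces.

```python
def precision_recall_f1(sampled, anomalous):
    """Score a set of sampled trace ids against the set of anomalous ids."""
    sampled, anomalous = set(sampled), set(anomalous)
    hits = len(sampled & anomalous)
    precision = hits / len(sampled) if sampled else 0.0
    recall = hits / len(anomalous) if anomalous else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def primary_metric(budget, prevalence):
    """Pick the primary metric from the budget/prevalence relationship."""
    if prevalence < budget:
        return "recall"     # anomalies are rarer than the budget allows
    if prevalence > budget:
        return "precision"  # budget cannot hold all anomalies
    return "f1"             # sampling N traces with N anomalies present


# Hypothetical example: 4 traces sampled, 3 anomalies in the collection.
precision, recall, f1 = precision_recall_f1([1, 2, 3, 4], [1, 2, 9])
```

For instance, with a 10% budget and 5% anomalies the helper selects Recall, while with a 1% budget and 5% anomalies it selects Precision.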
C. Comparing State-of-the-Art Anomaly Detection Methods
Since anomaly detection is a key step for a sampling
process, we here illustrate why off-the-shelf anomaly de-
tection methods are not fit for purpose. We consider the
following popular techniques: 1) local density estimate: K-
Nearest Neighbor (KNN) and Local Outlier Factor (LOF),
2) tree-based classification: Isolation Forest and Half Space
Trees (HST) [8], 3) boosting: Lightweight Online Detec-
tion of Anomalies (LODA) [19], and 4) neural network:
Deep Belief Net and One Class Support Vector Machine
(DBN+OCSVM) [20]. A notable advantage of using the tree-
based methods is that they can work on one trace at a time,
while the other methods, off-the-shelf, require batching.
To evaluate the performance of the above methods, we con-
sider a production dataset from a cloud data center consisting
of trace data spanning a week over a set of 14 microservices.
As the traces are unlabelled, we identify 5% point anomalies
using the popular offline DBSCAN clustering algorithm, and
evaluate the ability of the listed methods to obtain similar
results. DBSCAN, being resource intensive, is not feasible in
an online scenario such as distributed trace sampling, but is
considered a generally reliable technique in industry [21].
We use Matlab’s native implementation of DBSCAN with
ε = 2.5 and minpts = 5, where ε indicates the size of the
local neighborhood of the data points and minpts indicates
the minimum number of points per cluster. Once the traces
are clustered, we regard the smallest clusters as anomalies,
accounting for 5% of the total traces.
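The labelling step described above (smallest clusters first, up to 5% of the traces) can be sketched as follows. This is an illustrative reconstruction, not the authors' Matlab code: it takes the cluster id of each trace as input, treats every id as a cluster (how noise points are handled is not stated in the text), and stops before the anomalous share would exceed the target fraction.

```python
from collections import Counter


def label_smallest_clusters(cluster_ids, target_fraction=0.05):
    """Flag traces in the smallest clusters until the target share is reached."""
    sizes = Counter(cluster_ids)
    limit = target_fraction * len(cluster_ids)
    anomalous, covered = set(), 0
    # Visit clusters from smallest to largest.
    for cluster, size in sorted(sizes.items(), key=lambda item: item[1]):
        if covered + size > limit:
            break
        anomalous.add(cluster)
        covered += size
    return [cid in anomalous for cid in cluster_ids]


# Hypothetical clustering result: one dominant cluster and three small ones.
cluster_ids = [0] * 90 + [1] * 6 + [2] * 3 + [3] * 1
flags = label_smallest_clusters(cluster_ids)
```

In this example the clusters of size 1 and 3 are flagged, while the cluster of size 6 is not, because including it would push the anomalous share above 5%.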
The results of the experiment are presented in Table I. The
dataset contains traces from six consecutive days with 77577
traces. For all the batch methods, we keep a similar batch
size of 2000 traces. We see that HST is the best method with
respect to F1-Score. This motivates further investigation into
HST methods to address the problem under study. In addition,
HST has other benefits from the perspective of a streaming