SampleHST: Efficient On-the-Fly Selection of
Distributed Traces
Alim Ul Gias∗, Yicheng Gao†, Matthew Sheldon†, José A. Perusquía‡, Owen O’Brien§, Giuliano Casale†
∗University of Westminster, Email: a.gias@westminster.ac.uk
†Imperial College London, Email: {y.gao20, matthew.sheldon20, g.casale}@imperial.ac.uk
‡Universidad Nacional Autónoma de México, Email: jose.perusquia@sigma.iimas.unam.mx
§Huawei Technologies (Ireland) Co., Ltd, Email: owen.obrien@huawei.com
Abstract—Since only a small fraction of the traces generated
by distributed tracing is useful for troubleshooting, storage
requirements can be reduced significantly by biasing the selection
towards anomalous traces. To this end, we propose SampleHST,
a novel approach to sample on-the-fly from a stream of traces
in an unsupervised manner. SampleHST adjusts the storage
quota of normal and anomalous traces depending on the size
of its budget. It first scores each trace using a forest of Half
Space Trees (HSTs), based on the distribution of mass scores
across the trees, which characterizes the probability of observing
different traces. The mass distribution from the HSTs is then
used to cluster the traces online, leveraging a variant of the
mean-shift algorithm. This trace-cluster association ultimately
drives the sampling decision. We have compared the performance
of SampleHST with a recently proposed method using data
from a cloud data center and demonstrated that SampleHST
improves sampling performance by up to 9.5×.
Index Terms—Distributed Tracing, Microservices, Anomaly
Detection, Sampling.
I. INTRODUCTION
Distributed tracing is tailored primarily to monitoring and
profiling applications built with a microservice-based archi-
tecture [1]. In a microservice ecosystem, as the number of
services grows, the volume of trace data used for observing
application performance and reliability increases signifi-
cantly [2]. In a typical production setup, each server, hosting
hundreds of microservices, generates several tens of gigabytes
of trace data every day. Across all servers, the total data
generated daily is on the order of several terabytes.
Nevertheless, most of the traces do not report on application
anomalies and thus there is little value in storing them all.
The fraction of traces that can be retained is constrained by
a storage budget [3], and the problem we study is how to
select the most interesting traces to help in monitoring and
diagnosing microservices’ runtime behavior. This entails
sampling a mix of traces that characterizes the overall user
behavior while at the same time retaining a high relative
ratio of anomalous traces.
To accommodate the storage budget, we need to deploy a
sampling strategy. It is a common industry practice to use
uniform sampling [3], also referred to as head-based sampling.
Under this strategy, the sampling decision is taken as soon as
the request for a service is received, leading to a low hit
rate for anomalous traces. To address this issue, it is
increasingly preferred to use a tail-based sampling strategy
[4], which can improve selection accuracy since it takes the
sampling decision after the response is served, i.e., when the
entire trace for the service call chain is available. This allows
one to reason about the information contained in the trace
itself when deciding whether to store it.
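To make this contrast concrete, the two strategies can be sketched as follows. This is a hypothetical illustration, not the method proposed in this paper: the trace format (a list of span dictionaries) and the latency-based anomaly scorer are our own assumptions.

```python
import random

SAMPLE_RATE = 0.01  # uniform 1% head-based sampling


def head_based_decision():
    # Decision taken when the request arrives: no trace content is
    # available yet, so anomalous traces are hit only by chance.
    return random.random() < SAMPLE_RATE


def tail_based_decision(trace, is_anomalous):
    # Decision taken after the response is served: the full span list
    # is available, so any trace-level scoring function can inspect it.
    return is_anomalous(trace)


# Example scorer (an assumption for illustration): flag traces whose
# total latency exceeds a threshold.
slow = lambda trace: sum(span["duration_ms"] for span in trace) > 500

print(tail_based_decision([{"duration_ms": 700}], slow))  # True
print(tail_based_decision([{"duration_ms": 100}], slow))  # False
```

Under head-based sampling, the expected hit rate on anomalies equals the uniform rate; the tail-based decision, by contrast, can condition on any property of the completed trace.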
Ideally, a tail-based sampling strategy should be online and
without any batch processing. This means that we must decide
either to save or discard a trace on-the-fly rather than storing
it temporarily for batch processing. Recently, researchers have
proposed different tail-based sampling strategies based on
unsupervised learning [3], [5], [6]. However, existing research
faces multiple challenges, such as difficulty in clustering
high-dimensional data, the need for batch processing, low-
amplitude scores for anomalous traces, and no explicit
consideration of the budget size. To address these
shortcomings, we propose a novel method, SampleHST.
On the one hand, SampleHST focuses on sampling only
anomalous traces when the storage budget is smaller than
the expected fraction of anomalies. On the other hand, when
the budget is larger, SampleHST samples both normal and
anomalous traces, with a bias towards anomalous ones. Such
a bias is fair because it increases the representation of
anomalous traces, which are rare compared to normal ones,
among the sampled traces. In other words, the bias enables
representative sampling [3], [5].
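As a rough illustration of this budget-dependent behavior, one could split the budget as sketched below. This is a simplification we introduce for exposition; SampleHST's actual quota adjustment is driven by its scoring and clustering machinery, and both arguments here are hypothetical estimates rather than quantities the method computes directly.

```python
def storage_quota(budget, anomaly_fraction):
    """Split a storage budget (a fraction of the trace stream) between
    anomalous and normal traces, biased towards anomalies."""
    if budget <= anomaly_fraction:
        # Tight budget: spend it entirely on anomalous traces.
        return {"anomalous": budget, "normal": 0.0}
    # Larger budget: keep all expected anomalies and fill the rest
    # with normal traces, preserving the bias towards anomalies.
    return {"anomalous": anomaly_fraction,
            "normal": budget - anomaly_fraction}


print(storage_quota(0.01, 0.02))  # {'anomalous': 0.01, 'normal': 0.0}
print(storage_quota(0.10, 0.02))  # anomalies capped, remainder to normal
```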
SampleHST leverages a Bag-of-Words (BoW) model [7]
as a count-based representation for each trace. Taking this
representation as input, we generate a distribution of mass
values from a forest of tree-based classifiers, namely Half
Space Trees (HSTs) [8]. This distribution is then used to
cluster the traces online with an algorithm we have developed
that belongs to the mean-shift clustering family [9]. Once
clustering is complete, we decide whether to sample a trace
based on its cluster association, i.e., a trace is more likely
to be sampled if it is associated with a cluster with low
mass values, as such clusters represent rarely observed traces.
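A minimal, self-contained sketch of the BoW-plus-mass-scoring idea is shown below. This is our own toy illustration, not the authors' implementation: the vocabulary, the single shallow tree, and the deterministic split choices (cycling through dimensions at a fixed midpoint) are all simplifications of a real HST forest, which draws its splits randomly over shifted work ranges.

```python
from collections import Counter


def bow(trace, vocab):
    # Bag-of-Words: count how often each operation name occurs.
    counts = Counter(span["op"] for span in trace)
    return [counts.get(op, 0) for op in vocab]


class HalfSpaceTree:
    """A toy half-space tree: each level halves one dimension of the
    input space, and leaves record the mass (number of points seen)."""

    def __init__(self, dim, depth, lo=0.0, hi=10.0):
        self.depth = depth
        # Simplification: cycle through dimensions and split at the
        # midpoint; real HSTs choose dimensions and splits randomly.
        self.split_dims = [i % dim for i in range(depth)]
        self.splits = [(lo + hi) / 2.0] * depth
        self.mass = Counter()  # leaf id -> points seen so far

    def _leaf(self, x):
        node = 0
        for level in range(self.depth):
            bit = 1 if x[self.split_dims[level]] >= self.splits[level] else 0
            node = node * 2 + bit
        return node

    def fit_one(self, x):
        self.mass[self._leaf(x)] += 1

    def score_one(self, x):
        # Low mass => the point falls in a rarely visited region,
        # i.e., it resembles few previously seen traces.
        return self.mass[self._leaf(x)]


vocab = ["login", "checkout", "db_query"]  # hypothetical operations
tree = HalfSpaceTree(dim=len(vocab), depth=3)

normal = [{"op": "login"}, {"op": "db_query"}]
for _ in range(100):
    tree.fit_one(bow(normal, vocab))

rare = bow([{"op": "checkout"}] * 6, vocab)
print(tree.score_one(bow(normal, vocab)) > tree.score_one(rare))  # True
```

In SampleHST, the mass values from a whole forest of such trees form the distribution used for online clustering, and traces landing in low-mass clusters are preferentially sampled.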
We evaluate the performance of SampleHST using data
provided by a commercial cloud service operator, com-
paring the results with a recently proposed approach for point
anomalies developed in [3]. For this production dataset, we
see that SampleHST yields 2.3× to 9.5× better sampling
performance in terms of precision, recall, and F1-score than
prior work. When we consider representative sampling in a
arXiv:2210.04595v1 [cs.DC] 10 Sep 2022