

Client Error Clustering Approaches in Content Delivery Networks (CDN)

Ermiyas Birihanu1, Jiyan Mahmud1, Péter Kiss1, Adolf Kamuzora1, Wadie Skaf1, Tomáš Horváth1, Tamás Jursonovics2, Peter Pogrzeba2, and Imre Lendák1

1Telekom Innovation Laboratories, Data Science and Engineering Department, Faculty of Informatics, Eötvös Loránd University, Pázmány Péter str. 1/A, 1117 Budapest, Hungary

{ermiyasbirihanu, jiyan, axx6v4, adolfnfsp, skaf, tomas.horvath, lendak}@inf.elte.hu,

home page: http://t-labs.elte.hu/

2Deutsche Telekom, Berlin, Germany

Abstract. Content delivery networks (CDNs) are the backbone of the Internet and are key in delivering high quality video on demand (VoD), web content and file services to billions of users. CDNs usually consist of hierarchically organized content servers positioned as close to the customers as possible. CDN operators face a significant challenge when analyzing billions of web server and proxy logs generated by their systems. The main objective of this study was to analyze the applicability of various clustering methods in CDN error log analysis. We worked with real-life CDN proxy logs, identified key features included in the logs (e.g., content type, HTTP status code, time-of-day, host) and clustered the log lines corresponding to different host types offering live TV, video on demand, file caching and web content. Our experiments were run on a dataset consisting of proxy logs collected over a 7-day period from a single, physical CDN server running multiple types of services (VoD, live TV, file). The dataset consisted of 2.2 billion log lines. Our analysis showed that CDN error clustering is a viable approach towards identifying recurring errors and improving overall quality of service.

Keywords: Content Delivery Network, Error Clustering, HTTP proxy logs, HTTP status codes.

1 Introduction and problem definition

Customers consuming live TV, video on demand (VoD) or file services are accustomed to a high quality of service, which is made possible by content delivery networks (CDNs). Modern CDNs usually offer their services via protocols built on top of the HyperText Transfer Protocol (HTTP). The key infrastructure components of CDNs are origin and edge/surrogate servers as well as the communication infrastructure connecting them to their customers. Origin servers contain large volumes of movies, series and other multimedia content and are usually limited in number. Edge servers are more numerous, are positioned closer to the customers and cache only the most relevant content due to their limited storage capacity. Various HTTP proxy solutions are used on edge/surrogate servers to cache and serve content to thousands of concurrent users. Software and infrastructure errors occur both on the customer and CDN side. Customer software in set-top boxes, smart televisions and mobile devices might contain bugs or issue erroneous commands. The CDN infrastructure might contain bottlenecks, be under cyber attack or just have partial hardware or software failures. Considering the large number of customers and the ever-increasing number of nodes in modern CDNs, the number of such errors is continuously increasing.

arXiv:2210.05314v1 [cs.NI] 11 Oct 2022

Log files can provide important insights about the condition of a system or device. These files provide information about whether or not a system is working properly, as well as how actions or services perform. Various types of information are stored in log files on different web servers, such as username, timestamp, last page accessed, success rate, user agent, Uniform Resource Locator (URL), etc. By analysing these log files, one can better understand user and system behaviour. Since complex systems generate large quantities of log information, manual analysis is not practical; thus, automated approaches are warranted [1].

The goal of this research is to contribute to the state of the art by analyzing edge server logs collected at a European CDN provider. The error logs analyzed were clustered on the host (i.e., CDN node) level. The ultimate goal of this research is to identify the causes of the diverse errors, provide valuable insights to the operators and developers of the analyzed CDN, and thereby contribute towards solving recurring problems and proposing potential improvements.

2 Related works

The growing number of Internet users and their increasing demand for the delivery of low-latency content led to the emergence of content delivery networks (CDNs) [11]. A CDN is composed of a variety of points of presence, which are essentially proxies containing the most popular content offered on the CDN [2] [13]. The core aspect of a CDN is the hierarchical design in which top-level origin servers contain all available content and lower-level nodes contain only the most frequently requested content in the last mile, i.e., close to customers in a single geographic region [3].

As in other large-scale enterprise systems, the daily number of log lines produced by CDNs is measured in the tens or hundreds of millions, or even billions. As the manual analysis of such datasets is impractical and often impossible, it is a viable approach to rely on machine learning (ML) algorithms which automatically process log lines and discover interesting patterns. One such ML-based approach is error log clustering [5].

Neha et al. [4] used a web log analyzer tool called Web Log Expert to identify users' behavior in terms of number of hits, page views, visitors and bandwidth. An in-depth analysis of user behaviour on the NASA website was performed in [15]. The goal of that research was to obtain information about top errors, websites and potential visitors of the site. The study showed that machine learning techniques such as association, clustering, and classification can be used to identify regular users of a website. The goal of study [7] was to determine whether quantitative or qualitative methods are the better approach for identifying user behavior. The researchers claimed that clustering users faces two challenges: reasoning and surrounding behavior. They came to the conclusion that combining qualitative and quantitative methodologies is the best way to understand user behavior. Haifei Xiang in [18] used weblog data to cluster user behavior and analyze user access patterns. The study explained how to analyze user behavior from weblog data and apply the K-means clustering algorithm. It considered a known disadvantage of K-means, namely that it may converge to locally optimal solutions. Methods such as selecting the initial centers based on data sparsity were proposed, which may effectively reduce the algorithm's iteration time and increase clustering quality. In [16] the researchers clustered users into networks according to their browsing behaviour. Their methodology consisted of web log pre-processing and clustering with the K-means algorithm, with the ultimate goal of grouping users into different categories and analyzing their behaviour based on the category of the web sites with which they interact.
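The sensitivity of K-means to its initial centers, mentioned above, can be illustrated with a minimal sketch. The code below is plain Lloyd's algorithm written for illustration only; it is not the sparsity-based initialization proposed in [18], and the four toy points and both sets of starting centers are assumptions chosen to make the effect visible.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, labels


def inertia(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


# Toy data (hypothetical): two tight pairs, far apart on the x axis.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
good = kmeans(points, centers=[(0, 0.5), (4, 0.5)])
bad = kmeans(points, centers=[(2, 0), (2, 1)])
good_inertia = inertia(points, *good)
bad_inertia = inertia(points, *bad)
```

With the first pair of starting centers the algorithm separates the left and right pairs (inertia 1.0). With the second pair it stops immediately at a fixed point that splits the data into top and bottom rows (inertia 16.0), which is a local optimum of the kind that motivates more careful initialization.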

Log parsing is a technique for converting unstructured content from log messages into a format appropriate for data mining. The aim of the study [16] was to analyze how log parsing strategies using natural language processing affected log mining performance. The researchers utilized two datasets: the first consisted of log data collected in an aviation system which included over 4,500,000 messages gathered over the period of a year. The second dataset was log data from public benchmarks acquired from a Hadoop distributed file system (HDFS) cluster.
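As a minimal sketch of what log parsing means in practice, the snippet below turns one raw log line into named fields with a regular expression. The log layout, field names and sample line are hypothetical and serve only as an illustration; they are neither the format of the datasets used in the cited study nor that of the CDN logs analyzed here.

```python
import re

# Hypothetical proxy log layout (an assumption for illustration):
#   <ISO 8601 UTC timestamp> <host> <METHOD> <url> <status> <content-type>
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z) "
    r"(?P<host>\S+) (?P<method>[A-Z]+) (?P<url>\S+) "
    r"(?P<status>\d{3}) (?P<content_type>\S+)"
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Hour of day, read from the fixed HH position of the timestamp.
    record["hour"] = int(record["timestamp"][11:13])
    return record


sample = "2022-10-11T08:15:30Z vod.example.com GET /movies/42/seg7.ts 404 video/mp2t"
parsed = parse_line(sample)
```

Returning None for lines that do not match keeps malformed entries out of the structured dataset without stopping the parse, which is one reasonable choice when processing very large log volumes.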

A novel textual clustering approach for automatically finding the syntactic structures of log messages collected on supercomputing systems was proposed in [6]. The researchers managed to utilize their approach to extract meaningful structural and temporal message patterns. The goal of study [14] was to investigate user sessions via a frequent pattern mining approach in web logs.

3 Data exploration

The CDN service provider delivered logs in multiple batches, incrementally improving the data collection and anonymization methods to better fit the research goals. This work builds on multiple experimental data sets extracted from the base data set, which contains more than 2.2 billion CDN proxy log entries collected during a 7-day period. This dataset has 29 different features, most of which are categorical; Table 1 contains the list of features in the dataset. In the experimentation and prototyping phase of this research work we extracted different sample datasets from the logs.
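Because most of the features are categorical, distance-based clustering methods need them in numeric form first. One common option is one-hot encoding, sketched below in plain Python. This is an illustrative assumption rather than a description of the preprocessing used in this work, and the feature names and values in the example records are hypothetical.

```python
def one_hot_encode(records, features):
    """Map categorical records to 0/1 vectors; returns (columns, vectors)."""
    # One column per (feature, value) pair seen in the data, in sorted order.
    columns = sorted({(f, str(r[f])) for r in records for f in features})
    index = {col: i for i, col in enumerate(columns)}
    vectors = []
    for r in records:
        vec = [0] * len(columns)
        for f in features:
            vec[index[(f, str(r[f]))]] = 1
        vectors.append(vec)
    return columns, vectors


# Hypothetical records; names and values are illustrative only.
records = [
    {"host_type": "vod", "status": 404, "content_type": "video/mp4"},
    {"host_type": "live", "status": 503, "content_type": "video/mp2t"},
    {"host_type": "vod", "status": 404, "content_type": "video/mp4"},
]
columns, vectors = one_hot_encode(records, ["host_type", "status", "content_type"])
```

Note that the HTTP status code is treated as a category here, not as a number, since the numeric difference between two codes carries no meaning for clustering.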

Web and web proxy servers send a set of known HTTP status codes when processing a client request results in an error. These errors can be grouped
