Client Error Clustering Approaches in Content Delivery Networks (CDN)
Ermiyas Birihanu1, Jiyan Mahmud1, Péter Kiss1, Adolf Kamuzora1, Wadie Skaf1, Tomáš Horváth1, Tamás Jursonovics2, Peter Pogrzeba2, and Imre Lendák1
1Telekom Innovation Laboratories, Data Science and Engineering Department, Faculty of Informatics, Eötvös Loránd University, Pázmány Péter str. 1/A, 1117 Budapest, Hungary
{ermiyasbirihanu, jiyan, axx6v4, adolfnfsp, skaf, tomas.horvath, lendak}@inf.elte.hu,
home page: http://t-labs.elte.hu/
2Deutsche Telekom, Berlin, Germany
Abstract. Content delivery networks (CDNs) are the backbone of the Internet and are key in delivering high-quality video on demand (VoD), web content and file services to billions of users. CDNs usually consist of hierarchically organized content servers positioned as close to the customers as possible. CDN operators face a significant challenge when analyzing the billions of web server and proxy logs generated by their systems. The main objective of this study was to analyze the applicability of various clustering methods in CDN error log analysis. We worked with real-life CDN proxy logs, identified key features included in the logs (e.g., content type, HTTP status code, time-of-day, host) and clustered the log lines corresponding to different host types offering live TV, video on demand, file caching and web content. Our experiments were run on a dataset consisting of proxy logs collected over a 7-day period from a single, physical CDN server running multiple types of services (VoD, live TV, file). The dataset consisted of 2.2 billion log lines. Our analysis showed that CDN error clustering is a viable approach towards identifying recurring errors and improving overall quality of service.
Keywords: Content Delivery Network, Error Clustering, HTTP proxy logs, HTTP status codes.
1 Introduction and problem definition
Customers consuming live TV, video on demand (VoD) or file services are accustomed to a high quality of service which is made possible by content delivery networks (CDN). Modern CDNs usually offer their services via protocols built on top of the HyperText Transfer Protocol (HTTP). The key infrastructure components of CDNs are origin and edge/surrogate servers as well as the communication infrastructure connecting them to their customers. Origin servers contain
large volumes of movies, series and other multimedia content and are usually limited in number. Edge servers are more numerous, they are positioned closer to the customers, and they cache only the most relevant content due to their limited storage capacity. Various HTTP proxy solutions are used on edge/surrogate servers to cache and serve content to thousands of concurrent users. Software and infrastructure errors occur both on the customer and the CDN side. Customer software in set-top boxes, smart televisions and mobile devices might contain bugs or issue erroneous commands. The CDN infrastructure might contain bottlenecks, be under cyber attack or just have partial hardware or software failures. Considering the large number of customers and the ever-increasing number of nodes in modern CDNs, the number of such errors is continuously increasing.
Log files can provide important insights about the condition of a system or device. These files indicate whether or not a system is working properly, as well as how actions or services perform. Various types of information are stored in log files on different web servers, such as username, timestamp, last page accessed, success rate, user agent, Uniform Resource Locator (URL), etc. By analysing these log files, one can better understand user and system behaviour. Since complex systems generate large quantities of log information, manual analysis is not practical; thus, automated approaches are warranted [1].
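To illustrate the kind of structured fields that can be recovered from such logs, the short Python sketch below parses a single, hypothetical proxy access-log line into named fields. The log layout, field names and sample line are assumptions made for illustration only; they do not reproduce the 29-feature format of the CDN logs analyzed in this study.

import re

# Hypothetical combined-log-style proxy line (illustrative only; the real CDN
# log format and its features are not reproduced here).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = ('edge01.example.net - - [11/Oct/2022:10:15:32 +0000] '
          '"GET /vod/movie.m3u8 HTTP/1.1" 404 512 "-" "SmartTV/1.0"')
print(parse_line(sample))

Once every log line is reduced to such a field dictionary, downstream steps (filtering, feature extraction, clustering) can operate on structured data instead of raw text.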
The goal of this research is to contribute to the state of the art by analyzing edge server logs collected at a European CDN provider. The analyzed error logs were clustered at the host (i.e., CDN node) level. The ultimate goal of this research was to identify the causes of the diverse errors, provide valuable insights to the operators and developers of the analyzed CDN, and thereby contribute towards solving recurring problems and proposing potential improvements.
2 Related works
The growing number of Internet users and their increasing demand for low-latency content delivery led to the emergence of content delivery networks (CDN) [11]. A CDN is composed of a variety of points of presence, which are essentially proxies containing the most popular content offered on the CDN [2] [13]. The core aspect of a CDN is its hierarchical design, in which top-level origin servers contain all available content and lower-level nodes contain only the most frequently requested content in the last mile, i.e., close to customers in a single geographic region [3].
Similarly to other large-scale enterprise systems, the daily amount of log lines produced by CDNs is measured in the tens or hundreds of millions, or even billions. As the manual analysis of such datasets is impractical and often impossible, a viable approach is to rely on machine learning algorithms which automatically process log lines and discover interesting patterns. One such ML-based approach is error log clustering [5].
Neha et al. [4] used a web log analyzer tool called Web Log Expert to identify users' behavior in terms of the number of hits, page views, visitors and bandwidth. An in-depth analysis of user behaviour of the NASA website was performed in
[15]. The goal of that research was to obtain information about top errors, websites and potential visitors of the site. The study showed that machine learning techniques such as association, clustering, and classification can be used to identify regular users of a website. The goal of study [7] was to determine whether quantitative or qualitative methods are better suited for identifying user behavior. The researchers claimed that clustering users faces two challenges: reasoning and surrounding behavior. They came to the conclusion that combining qualitative and quantitative methodologies is the best way to understand user behavior. Haifei Xiang in [18] used weblog data to cluster user behavior and analyze user access patterns. The study explained how to analyze user behavior from weblog data with the K-means clustering algorithm, and it addressed the tendency of K-means to converge to local optima. Methods such as selecting the initial centers based on data sparsity were proposed, which may reduce the number of iterations and improve clustering quality. In [16] the researchers clustered users into networks according to their browsing behaviour. Their methodology consisted of web log pre-processing and clustering with the K-means algorithm, with the ultimate goal of grouping users into different categories and analyzing their behaviour based on the category of the web sites with which they interact.
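To make this clustering step concrete, the following minimal sketch applies K-means (via scikit-learn) to a small, entirely hypothetical feature matrix derived from log lines. The chosen features (status class, hour of day, response size) and the number of clusters are assumptions for illustration, not the setup used in the cited studies or in this paper.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features extracted from log lines:
# [HTTP status class (4 or 5), hour of day, response size in KB]
X = np.array([
    [4, 10, 12.0],
    [4, 11, 9.5],
    [5, 2, 300.0],
    [5, 3, 280.0],
    [4, 23, 11.2],
    [5, 2, 310.5],
])

# Scale features so no single dimension dominates the Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

# n_clusters=2 is an assumption for this toy example; in practice the number
# of clusters would be chosen e.g. via the elbow method or silhouette scores.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_)           # cluster assignment per log line
print(kmeans.cluster_centers_)  # centroids in the scaled feature space

Scaling before clustering matters because K-means relies on Euclidean distance, and raw log features typically have very different ranges.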
Log parsing is a technique for converting unstructured content from log messages into a format appropriate for data mining. The aim of the study [16] was to analyze how log parsing strategies using natural language processing affected log mining performance. The researchers utilized two datasets: the first consisted of log data collected in an aviation system, which included over 4,500,000 messages gathered over the period of a year. The second dataset was log data from public benchmarks acquired from a Hadoop distributed file system (HDFS) cluster.
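A very simple form of such parsing is template extraction, where the variable parts of a message (numbers, IP addresses, paths) are masked so that structurally identical messages collapse to the same template. The masking rules and sample messages below are illustrative assumptions, not the method used in the cited study.

import re
from collections import Counter

def to_template(message):
    """Mask variable tokens so structurally similar messages map to one template."""
    message = re.sub(r'\b\d{1,3}(\.\d{1,3}){3}\b', '<IP>', message)  # IPv4 addresses
    message = re.sub(r'/[\w./-]+', '<PATH>', message)                # URL paths
    message = re.sub(r'\b\d+\b', '<NUM>', message)                   # remaining numbers
    return message

messages = [
    "GET /vod/movie123.m3u8 failed with status 404 from 10.0.0.5",
    "GET /vod/series77.m3u8 failed with status 404 from 10.0.0.9",
    "GET /live/channel1.ts failed with status 503 from 10.1.2.3",
]

# Count how many raw messages fall under each extracted template.
templates = Counter(to_template(m) for m in messages)
for template, count in templates.most_common():
    print(count, template)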
A novel textual clustering approach for automatically discovering the syntactic structures of log messages collected on supercomputing systems was proposed in [6]. The researchers managed to utilize their approach to extract meaningful structural and temporal message patterns. The goal of study [14] was to investigate user sessions via a frequent pattern mining approach in web logs.
3 Data exploration
The CDN service provider delivered logs in multiple batches, incrementally improving the data collection and anonymization methods to better fit the research goals. This work builds on multiple experimental datasets extracted from the base dataset, which contains more than 2.2 billion CDN proxy log entries collected during a 7-day period. The dataset has 29 different features, most of them categorical - Table 1 contains the list of features in the dataset. In the experimentation and prototyping phase of this research work we extracted different sample datasets from the logs.
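As a rough illustration of this extraction step, the sketch below filters client and server error lines (4xx/5xx status codes) from a chunked CSV export using pandas. The file name, column names and chunk size are assumptions made for illustration; the actual schema of the 29-feature dataset is the one listed in Table 1.

import pandas as pd

# Hypothetical file and column names; the real dataset is far larger than
# what fits in memory at once.
LOG_FILE = "cdn_proxy_logs.csv"
USECOLS = ["timestamp", "host", "content_type", "status_code"]

error_chunks = []
# Stream the log export in chunks so a multi-billion-line file never has to
# be loaded into memory in full.
for chunk in pd.read_csv(LOG_FILE, usecols=USECOLS, chunksize=1_000_000):
    errors = chunk[chunk["status_code"] >= 400]  # keep 4xx and 5xx lines only
    error_chunks.append(errors)

error_sample = pd.concat(error_chunks, ignore_index=True)
print(error_sample["status_code"].value_counts())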
Web and web proxy servers send a set of known HTTP status codes when
processing a client request results in an error. These errors can be grouped