2 E. Birihanu et al.
large volumes of movies, series and other multimedia content and are usually lim-
ited in number. Edge servers are more numerous and they are positioned closer
to the customers and cache only the most relevant content due to their limited
storage capacity. Various HTTP proxy solutions are used on edge/surrogate
servers to cache and serve content to thousands of concurrent users. Software
and infrastructure errors occur both on the customer and CDN side. Customer
software in set-top boxes, smart televisions and mobile devices might contain
bugs or issue erroneous commands. The CDN infrastructure might contain bottlenecks,
be under cyber attack or just have partial hardware or software failures.
Considering the large number of customers and ever increasing numbers of nodes
in modern CDNs, the number of such errors is continuously increasing.
Log files can provide important insights about the condition of a system or
device. These files provide information about whether or not a system is working
properly, as well as how actions or services perform. Various types of information
are stored in log files on different web servers such as username, timestamp,
last page accessed, success rate, user agent, Universal Resource Locator (URL),
etc. By analysing these log files, one can better understand user and system
behaviour. Since complex systems generate large quantities of log information,
manual analysis is not practical; thus, automated approaches are warranted [1].
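As a minimal sketch of what automated processing of such fields can look like, the snippet below extracts the username, timestamp, URL, status code and user agent from a single access-log line. It assumes the widely used NCSA "combined" log layout; the actual format of the logs analyzed in this work is not specified here, and the names `COMBINED`, `parse_line` and the sample line are hypothetical.

```python
import re
from typing import Optional

# Assumption: NCSA "combined" access-log layout. The CDN studied in this
# paper may use a different format; adapt the pattern accordingly.
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[dict]:
    """Return the named fields of one log line, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])  # numeric status eases filtering
    return record


# Hypothetical sample line, made up for illustration only.
sample = (
    '203.0.113.7 - alice [10/Oct/2021:13:55:36 +0200] '
    '"GET /vod/movie.m3u8 HTTP/1.1" 404 512 "-" "SmartTV/1.0"'
)
record = parse_line(sample)
```

Lines that do not match the assumed layout yield `None`, so malformed entries can be counted or inspected separately rather than silently dropped.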
The goal of this research is to contribute to the state of the art by analyzing
edge server logs collected at a European CDN provider. The error logs analyzed
were clustered on the host (i.e., CDN node) level. The ultimate goal of this
research is to identify the causes of the diverse errors, provide valuable insights
to the operators and developers of the analyzed CDN, and thereby help solve
recurring problems and propose potential improvements.
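To illustrate what host-level aggregation might involve, the following sketch counts error responses per CDN node, producing one error profile per host that a clustering algorithm could later consume. It is only an assumed preprocessing step: the features and the clustering method used in this work are not described in this section, and the function `errors_per_host`, the host names and the records are hypothetical.

```python
from collections import Counter, defaultdict


def errors_per_host(records):
    """Count HTTP error statuses (>= 400) separately for each host.

    `records` is an iterable of (host, status) pairs; this input shape is
    an assumption made for the sketch.
    """
    profile = defaultdict(Counter)
    for host, status in records:
        if status >= 400:  # keep only client and server errors
            profile[host][status] += 1
    return profile


# Hypothetical records, made up for illustration only.
records = [
    ("edge-01", 404),
    ("edge-01", 404),
    ("edge-01", 200),
    ("edge-02", 503),
    ("edge-02", 404),
]
profile = errors_per_host(records)
```

Each host's counter can be turned into a fixed-length vector (one position per observed status code) before clustering.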
2 Related works
The growing number of Internet users and their increasing demand for the delivery
of low-latency content led to the emergence of content delivery
networks (CDNs) [11]. A CDN is composed of a variety of points of presence which
are essentially proxies containing the most popular content offered on the CDN
[2] [13]. The core aspect of a CDN is the hierarchical design in which top-level
origin servers contain all available content and lower-level nodes contain only
the most frequently requested content in the last mile, i.e., close to customers
in a single geographic region [3].
Similarly to other large-scale enterprise systems, the daily amount of log lines
produced by CDNs is measured in the tens or hundreds of millions, or
even billions. As the manual analysis of such datasets is impractical and often
impossible, it is a viable approach to rely on machine learning algorithms which
automatically process log lines and discover interesting patterns. One such ML-
based approach is error log clustering [5].
Neha et al. [4] used a web log analyzer tool called Web Log Expert to identify
users’ behavior in terms of number of hits, page views, visitors and bandwidth.
An in-depth analysis of user behaviour of the NASA website was performed in