annotation. In section 4, we provide an overview of the methods
utilized in this paper. In section 5, we present the results and
discussion. Finally, in section 6, we provide the conclusion and
recommendations for future work.
2 RELATED WORK
In the following, we review a sample of the prior work most closely
related to this study.
Online Website Analysis.
Researchers have held that diverse
constituents might be subject to increased risks when using free
content websites, given the evolution of online services and web
applications. These risks have been examined across various website
features, including digital certificates, content, and addressing
infrastructure [4]. In another study, component- and website-level
analyses were conducted to understand vulnerabilities utilizing
two main off-the-shelf tools, VirusTotal and Sucuri [3], linking free
content websites to significant threats.
Privacy Practices Reporting.
Mindful of the implicit security
cost, another work has looked into the interplay between privacy
policies and the quality of those websites. Namely, the prior work
examined user comprehension of risks linked to service use through
privacy policy understanding [5]. The researchers passed several
filtered privacy policies into a custom pipeline that annotates the
policies against various categories (e.g., first- and third-party usage,
data retention) [14]. The authors found that the privacy policies
of free content websites are vague, lack essential policy elements,
or are lax in specifying the responsibilities of the service provider
(website owner) against possible compromise and exposure of user
data. On the other hand, they found that the privacy policies of
the premium content websites are more transparent and elaborate
about reporting their practices on data gathering, sharing, and
retention [5].
Tracking and Website Structure.
Another study has contributed
to this field by revealing the tracking mechanisms of corporate
ownership [17]. To comprehend the web tracking phenomenon
and subsequently craft material policies to regulate it, the authors
argued that it is imperative to know the actual degree and reach
of corporations that may be subject to the increased regulations.
The most significant finding in this research was that 78.07 percent
of websites within Alexa's top million instigated third-party
HTTP requests to a domain owned by Google. Furthermore, the
researchers observed that the overall trend shown by past surveys
is not only that many website users value privacy but also
that the present state of online privacy is an area of material
concern. Concerning measurement, the same study highlights that
the level of tracking on the web is on the rise and shows no signs
of abating.
3 DATASET AND DATA ANNOTATION
Websites.
For this study, we compiled a dataset that contains 1,562
websites, with 834 free content websites and 728 premium websites,
which have been used in prior work [3-5]. In selecting those
websites, we consider their popularity while maintaining a balance
per website sub-category. To determine the popularity of a
website, we used the results of the search engines Bing, DuckDuckGo,
and Google as a proxy, where highly ranked websites are considered
popular. To balance the dataset, we undertook a manual verification
approach to vet each website across the sub-categories (see below).
Namely, we sorted the websites into five categories based on the
content they predominantly serve: software, music, movies, games,
or books. The following are the free and premium content website
counts per category: books (154 free, 195 premium), games (80 free,
113 premium), movies (331 free, 152 premium), music (83 free, 86
premium), and software (186 free, 182 premium).
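The per-category counts above determine the overall dataset totals. The following minimal sketch, which simply re-encodes the reported numbers and is not part of our measurement pipeline, verifies that the per-category breakdown is consistent with the totals:

```python
# Per-category website counts as reported above (illustrative re-encoding only).
counts = {
    "books":    {"free": 154, "premium": 195},
    "games":    {"free": 80,  "premium": 113},
    "movies":   {"free": 331, "premium": 152},
    "music":    {"free": 83,  "premium": 86},
    "software": {"free": 186, "premium": 182},
}

total_free = sum(c["free"] for c in counts.values())        # 834 free content websites
total_premium = sum(c["premium"] for c in counts.values())  # 728 premium websites
assert total_free + total_premium == 1562                   # matches the dataset size
```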
Dataset annotation.
For our analysis, we augment the dataset in
various ways. We primarily focused on information reflecting
users' exposure to risk [4]. We determine whether a website
is malicious or benign using the VirusTotal API [24]. VirusTotal
is a framework that offers cyber threat detection, which helps us
analyze, detect, and correlate threats while reducing the required
effort through automation. Specifically, the API allowed us to identify
malicious IP addresses, domains, or URLs associated with the
websites we use for augmentation.
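The sketch below illustrates how such an annotation step can be automated against the VirusTotal v3 domains endpoint. It is a minimal illustration, not our production pipeline; the API key placeholder and the one-engine labeling threshold are assumptions for exposition only.

```python
import requests

VT_API_KEY = "YOUR_API_KEY"  # placeholder; a valid VirusTotal key is required


def label_domain(domain: str) -> str:
    """Query VirusTotal for a domain and label it malicious or benign.

    The >= 1 malicious-engine threshold is an illustrative assumption,
    not necessarily the labeling rule used in this study.
    """
    resp = requests.get(
        f"https://www.virustotal.com/api/v3/domains/{domain}",
        headers={"x-apikey": VT_API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    stats = resp.json()["data"]["attributes"]["last_analysis_stats"]
    return "malicious" if stats.get("malicious", 0) >= 1 else "benign"
```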
CMS’s.
Since this work aims to understand the role of software
(CMS, in particular) used across websites and its contribution to
threat exposure, we follow a two-step approach: (1) website crawl-
ing and (2) manual inspection and annotation. First, we crawl each
of the websites and inspect its elements to find the source folder for
the website. From the source folder, we list the source and content
for each website to identify the CMS used to develop this website.
This approach requires us to build a database of the different available
CMS's to allow automation of the annotation through regular
expression matching. We cross-validate our annotation utilizing existing
online tools used for CMS detection. We use CMS-detector [9]
and w3techs [25], two popular tools, to extract the CMS's used for
the list of websites. For automation, we build a wrapper that prepares
the query with the website, retrieves the response of the CMS
used from the corresponding tool, and compares it to the manually
identified set in the previous step. Among the CMS's identified,
WordPress is the most popular, followed by Drupal, Django, Next.js,
Laravel, CodeIgniter, and DataLife. In total, we find 77 unique CMS's
used across the different websites, not including websites that rely
on a custom-coded CMS.
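The regular-expression matching step can be sketched as follows. The signature database shown here is a hypothetical miniature stand-in for the much larger set we compiled covering 77 CMS's, and the individual fingerprints are illustrative assumptions rather than our exact rules.

```python
import re
import requests

# Hypothetical miniature signature database; the study's database covers
# 77 unique CMS's and many more fingerprints per CMS.
CMS_SIGNATURES = {
    "WordPress":   re.compile(r"/wp-content/|/wp-includes/", re.I),
    "Drupal":      re.compile(r"/sites/default/files/|Drupal\.settings", re.I),
    "Django":      re.compile(r"csrfmiddlewaretoken", re.I),
    "Next.js":     re.compile(r"/_next/static/", re.I),
    "Laravel":     re.compile(r"laravel_session|/vendor/laravel/", re.I),
    "CodeIgniter": re.compile(r"ci_session", re.I),
    "DataLife":    re.compile(r"DataLife Engine", re.I),
}


def detect_cms(url: str) -> str:
    """Fetch a page and match its HTML against known CMS fingerprints."""
    html = requests.get(url, timeout=30).text
    for cms, pattern in CMS_SIGNATURES.items():
        if pattern.search(html):
            return cms
    return "custom/unknown"
```

In our pipeline, the label produced by this matching step is then compared against the labels returned by CMS-detector and w3techs for cross-validation.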
Vulnerabilities.
Our dataset's final augmentation and annotation
are the vulnerability count and patching patterns. For each CMS, we
crawl the results available in various portals concerning the current
version of the CMS to identify the associated vulnerabilities. Namely,
we crawl such information from cvedetails [11], snyk.io [22],
openbugbounty [19], and Wordfence [12]. Finally, to determine whether
a vulnerability is patched or not (thus counting the number of
unpatched vulnerabilities), we query cybersecurity-help [10].
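Once the per-portal crawls are merged, the per-CMS counts reduce to a simple aggregation. The sketch below assumes a flattened record format (the field names are hypothetical) in which each crawled vulnerability carries the CMS it affects and a patched flag derived from the cybersecurity-help lookup.

```python
from collections import Counter
from typing import Iterable, Mapping


def summarize_vulnerabilities(records: Iterable[Mapping]) -> dict:
    """Aggregate crawled vulnerability records into per-CMS counts.

    Each record is assumed to look like (hypothetical format):
      {"cms": "WordPress", "cve": "CVE-2023-XXXX", "patched": False}
    where the 'patched' flag reflects the cybersecurity-help query.
    """
    total, unpatched = Counter(), Counter()
    for rec in records:
        total[rec["cms"]] += 1
        if not rec.get("patched", False):
            unpatched[rec["cms"]] += 1
    return {cms: {"total": total[cms], "unpatched": unpatched[cms]} for cms in total}
```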
4 ANALYSIS METHODS
The key motivation behind our analysis is to understand the po-
tential contribution of CMS’s to the (in)security of free content
websites, which has already been established in prior work, as
highlighted in section 2. To achieve this goal, we pursue two directions.
The first is a holistic analysis geared toward understanding
the distribution of various features associated with free content
and premium websites (combined). The second is a fine-grained