Common Vulnerability Scoring System Prediction based on
Open Source Intelligence Information Sources
Philipp Kühn*
David N. Relke
Christian Reuter
kuehn@peasec.tu-darmstadt.de
david.relke@stud.tu-darmstadt.de
reuter@peasec.tu-darmstadt.de
Science and Technology for Peace and Security (PEASEC), Technical University of Darmstadt
Darmstadt, Germany
ABSTRACT
The number of newly published vulnerabilities is constantly increasing. Until now, the information available when a new vulnerability is published is manually assessed by experts using a Common Vulnerability Scoring System (CVSS) vector and score. This assessment is time-consuming and requires expertise. Various works already try to predict CVSS vectors or scores using machine learning based on the textual descriptions of the vulnerability to enable faster assessment. However, for this purpose, previous works only use the texts available in databases such as the National Vulnerability Database. With this work, the publicly available web pages referenced in the National Vulnerability Database are analyzed and made available as sources of texts through web scraping. A Deep Learning based method for predicting the CVSS vector is implemented and evaluated. The present work provides a classification of the National Vulnerability Database's reference texts based on the suitability and crawlability of their texts. While we found that the overall influence of the additional texts is negligible, our Deep Learning prediction models outperform the state of the art.
1 INTRODUCTION
IT systems are now ubiquitous and fundamental to society, businesses, and individuals. Failures and disruptions can have catastrophic consequences for those affected. In 2017, for example, two waves of ransomware attacks occurred, each resulting in major outages to businesses and infrastructure [11, 16]. The vulnerability that enabled these attacks had been known and fixed a month before the first attack. In other attacks, such as the one on Microsoft Exchange Server in early 2021, only a few days passed between the discovery of the vulnerability and the start of attacks [5].
It is therefore important for researchers or system administrators to learn about vulnerabilities as early as possible, analyze them, and initiate countermeasures. Various publicly accessible databases, such as the National Vulnerability Database (NVD)¹ and the Common Vulnerabilities and Exposures (CVE)², collect, structure, and prepare the published vulnerabilities for this purpose. However, relevant information can also be found on many other platforms, such as social media (especially Twitter), blogs, news portals, and company websites.
* Corresponding author
¹ nvd.nist.gov
² cve.mitre.org
The Common Vulnerability Scoring System (CVSS) is used to categorize different aspects of vulnerabilities. The result of this categorization is a vector whose elements are a machine-readable representation of the vulnerability's properties³. Based on the components of the CVSS vector, a numerical vulnerability score (CVSS severity score) is calculated. The vulnerability assessment is usually performed by IT security experts based on the available Open Source Intelligence (OSINT) information. OSINT refers to the structured collection and analysis of information that is freely available to the public.
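To make the step from vector to score concrete, the following is a minimal sketch of the CVSS v3.1 base score calculation. The metric weights and formulas are reproduced as we understand them from the v3.1 specification and should be checked against that document before any real use. The function names `parse_vector`, `roundup`, and `base_score` are illustrative and not part of this work, and temporal and environmental metrics are ignored.

```python
import math

# Base metric weights (assumed to match the CVSS v3.1 specification).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
# The weight of Privileges Required depends on the Scope metric.
PR_WEIGHTS = {
    "U": {"N": 0.85, "L": 0.62, "H": 0.27},
    "C": {"N": 0.85, "L": 0.68, "H": 0.5},
}


def parse_vector(vector: str) -> dict:
    """Split 'CVSS:3.1/AV:N/AC:L/...' into {'AV': 'N', 'AC': 'L', ...}."""
    parts = vector.split("/")
    if not parts[0].startswith("CVSS:3"):
        raise ValueError("expected a CVSS v3.x vector string")
    return dict(part.split(":", 1) for part in parts[1:])


def roundup(value: float) -> float:
    """Round up to one decimal place while avoiding float artifacts."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the base score from the eight base metrics of a vector."""
    m = parse_vector(vector)
    scope_changed = m["S"] == "C"
    # Impact sub-score base: combined loss of confidentiality, integrity, availability.
    iss = 1 - ((1 - WEIGHTS["C"][m["C"]])
               * (1 - WEIGHTS["I"][m["I"]])
               * (1 - WEIGHTS["A"][m["A"]]))
    if scope_changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    exploitability = (8.22 * WEIGHTS["AV"][m["AV"]] * WEIGHTS["AC"][m["AC"]]
                      * PR_WEIGHTS[m["S"]][m["PR"]] * WEIGHTS["UI"][m["UI"]])
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if scope_changed:
        total *= 1.08
    return roundup(min(total, 10.0))


if __name__ == "__main__":
    print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))
```

Because the score is a deterministic function of the vector, predicting the vector is sufficient to obtain the score, whereas the reverse does not hold.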
There is a period during which the information about a new vulnerability has been published, but the assessment made by experts is not yet available [12, 28]. Due to the large number of published vulnerabilities, it is difficult for researchers or, e.g., responsible persons in companies to assess each new vulnerability themselves. They are therefore dependent on the assessments of experts. Accordingly, the longer it takes for the assessment to become available, the longer it takes for countermeasures to be taken to mitigate the vulnerability. During this period, the affected systems are exposed to attack without the responsible parties knowing about it. It is therefore important that the assessment is available as soon as possible.
Various works [12, 17, 32] try to perform this assessment automatically based on the textual information available about a vulnerability using Machine Learning (ML). This would allow for a much faster assessment. A vulnerability could be assessed automatically as soon as it is published, keeping the time window without at least a preliminary assessment small. It would also allow experts to prioritize vulnerabilities and would offer them recommendations for the assessment.
Previous work largely uses only the short descriptions of vulnerabilities from NVD and CVE, with some exceptions [1, 7]. Han et al. [17], for instance, present a system for classifying vulnerabilities into different severity levels based on CVSS. Khazaei et al. [21] present a work on predicting the numerical CVSS severity score. In addition, there are methods that automatically predict the entire CVSS vector [12]. Another work by Kuehn et al. [22] describes a system that uses Deep Learning to predict the CVSS vector. However, the system requires labels created by experts to train, which significantly increases the required effort for larger datasets. Further, Deep Learning (DL) profits from large training datasets, to which the reference texts could contribute; this is currently not leveraged by related work.
³ https://www.first.org/cvss/specification-document
arXiv:2210.02143v1 [cs.CR] 5 Oct 2022
Goal. This work aims to use as much textual data as possible to predict the CVSS vector of a vulnerability, in order to achieve the most accurate estimation of the CVSS vector possible. It should be possible to use not only the short description of the vulnerability, but also other types of texts, such as Twitter posts and news articles, for prediction in case of a new vulnerability. Possible sources of textual information about vulnerabilities should be found and categorized. We aim to answer the following research questions: "Where can relevant textual information on vulnerabilities be found outside vulnerability databases?" (RQ1) and "To which degree are public data sources beyond vulnerability databases suitable for predicting the CVSS vector?" (RQ2). This will clarify whether there are typical sources that regularly report on current vulnerabilities and whether these are suitable as a basis for building a dataset for training an ML system.
To this end, a first impression is gained through a rough manual search; the sources referenced in the databases are then analyzed automatically with regard to the type and scope of the references (e.g., blog posts, patch notes, GitHub issues). With the help of the texts, an ML model for predicting the CVSS vector is trained. The data must be filtered and cleaned for this purpose. The ML model uses Deep Learning and builds on state-of-the-art models. The model is evaluated and compared to previous work.
Contributions. The contribution to current research is an analysis of the references contained in the databases. This analysis categorizes the references in terms of certain characteristics and their suitability for ML models and can serve as a starting point for further work on the use of the references (C1). A method that collects and processes the text contained on the referenced web pages is presented. In addition, a system is implemented and evaluated that, unlike previous work, such as Elbaz et al. [12] and Kuehn et al. [22], uses more extensive text from the references in addition to descriptions of vulnerabilities from the databases (C2). This method for predicting CVSS vectors surpasses the current state of the art. Further, we present an extensive explainability analysis of our trained models as part of our evaluation (C3).
Outline. The state of the art in research is considered in §2, followed by a preliminary analysis of the references included in the NVD (§3). Requirements for references and the texts contained in them are defined, and the individual references are evaluated accordingly, resulting in a selection of references. §4 explains the procedure for collecting the texts from the references and presents a system for retrieving, processing, and storing the texts. §5 evaluates the ML system, while §6 discusses and compares the results with other work. Finally, a conclusion is drawn in §7.
2 RELATED WORK
This section gives an overview of the state of the art in research. We focus on literature dealing with the prediction of CVSS vectors, scores, or levels. In addition, work that uses sources other than the NVD in this context is considered. Automated assessment should provide a time advantage over the assessment by human experts. In this regard, different papers come to different conclusions regarding the duration of the assessment, and the exact methodology is not always clear. Elbaz et al. [12] state for the observed period from 2007 to 2019 that 90% of vulnerabilities were assessed within just under 30 days, with a median of only one day, while Chen et al. [7] indicate an average of 132 days between publication and assessment for an observed period of 23 months in 2018 and 2019.
NVD, CVSS, Information Sources. Johnson et al. [20] perform a statistical analysis of CVSS vectors in different databases containing vulnerabilities. In doing so, they show that, despite different sources, the CVSS vector is always comparable and, consequently, seems to be robust. They state that the NVD is the most robust information source for CVSS information. On the other hand, Dong et al. [10] show that information in the NVD itself is sometimes inconsistent and propose a system that relies on external sources to find, for example, missing versions of the software in question in the NVD. Accordingly, Kuehn et al. [22] present an information quality metric for vulnerability databases and address several drawbacks in the NVD. In addition to vulnerability databases, other sources of information are used in vulnerability management. Sabottke et al. [29] use Twitter to predict whether a vulnerability will actually be exploited. Almukaynizi et al. [1] go a step further and use other data sources, such as ExploitDB⁴ and Zero Day Initiative⁵. However, they do not use the text itself; the mere existence of an article about a vulnerability serves as a feature for the ML model.
CVSS Prediction. A large number of works deal with the prediction of CVSS vectors, scores, or levels starting from text. As one of the first works, Yamamoto et al. [37] use sLDA [26] to predict the CVSS vector based on the descriptions. For predicting the score, Khazaei et al. [21] use Support Vector Machines (SVMs), random forests [4], and fuzzy logic. Spanos and Angelis [33] predict the CVSS vector using random forests and boosting [13]. DL is first used in this context by Han et al. [17]. By using a Convolutional Neural Network (CNN), no feature engineering is required. However, in doing so, the model only determines the CVSS severity level from the options Critical, High, Medium, and Low. Gawron et al. [14] use DL in addition to Naive Bayes, but here the result is a CVSS vector. Twitter serves as the data source for Chen et al. [6]. The ML model is based on Long Short-Term Memory (LSTM) [18] and predicts the CVSS score. Sahin and Tosun [30] also improve on the Han et al. [17] approach by using an LSTM. Gong et al. [15] show a multi-task learning method that sets up multiple classifiers on a single Neural Network (NN), making it more efficient. Liu et al. [25] use the Chinese equivalent, the China National Vulnerability Database of Information Security (CNNVD), as the data source rather than the NVD. Jiang and Atif [19] take scores not only from the NVD but also from other sources as a basis for their prediction of the score. The work of Elbaz et al. [12] focuses on a particularly tractable classification of the CVSS vector. Therefore, they do not use dimension reduction techniques. Kuehn et al. [22] use DL to predict the CVSS vector, based on the NVD's descriptions, with the goal of aiding security experts in their final decision. The most recent approach was proposed by Shahid and Debar [32]; it uses a separate classifier, based on a Bidirectional Encoder Representations from Transformers (BERT) model [9], to determine each component of the CVSS vector. Several proposals rely solely on the textual data from the NVD. Some use text from Twitter or simple binary features, such as the existence of an article about a particular vulnerability. Other vulnerability-related tasks also use only a few different data sources. Yitagesu et al. [38] also use Twitter as a source for a model for Part-of-Speech (POS) tagging. Liao et al. [24] propose a system which draws on several sources to filter Indicators of Compromise (IoC) from natural text.
⁴ https://www.exploit-db.com/
⁵ https://www.zerodayinitiative.com/
Research Gap. OSINT is widely used in IT security [8, 24, 27, 29]. Various works exist on the prediction of CVSS vectors based on descriptions. However, as research shows, few OSINT vulnerability sources are used [23], especially in the context of CVSS score, level, or vector prediction, and where they are, only very simple features from other sources are used [1]. Furthermore, there is no systematic analysis of the suitability of NVD references for CVSS vector prediction approaches.
3 PRELIMINARY ANALYSIS
We performed an exploratory analysis of the available data, i.e., vulnerability descriptions and outgoing references from the NVD, to identify data suitability criteria and requirements for the web scraping process. Suitable in the sense of the present work are texts that describe a vulnerability and can be directly assigned to a vulnerability via the CVE identification number. In the following, we list the assumptions we considered.
• Each text shall be uniquely assignable to one and only one vulnerability via the CVE identification number. Without this criterion, a text could be used as a training example for two different values of one of the components of the CVSS vector. This makes it difficult for the ML algorithm to identify the relevant properties of the vulnerability. The vulnerabilities covered in a text may be very different, so it does not make sense to use the same text for multiple vulnerabilities. It is even possible that only one vulnerability is described, although several with different target vectors are mentioned.
• The texts should not contain the target variable, i.e., the CVSS vector. Otherwise, the ML model could predict the target parameter based on the variable present in the input, without any actual meaningful learning effect.
• There should be as little noise as possible. This ensures a high quality of the prediction. As stated in §2, the data otherwise contain patterns that could negatively affect the ML model.
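The first two criteria above lend themselves to a simple automated check. The following is a minimal sketch under our own assumptions, not the filtering code used in this work: a text counts as uniquely assignable if it mentions exactly one CVE identifier, and as leaking the target if it contains something that looks like a CVSS vector fragment. The function names and the regular expressions are illustrative.

```python
import re

# CVE identifiers have the form CVE-<year>-<at least four digits>.
CVE_ID = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)
# Heuristic for an embedded CVSS vector: the leading "AV:x/AC:y/" fragment.
VECTOR_FRAGMENT = re.compile(r"\bAV:[NALP]/AC:[LH]/", re.IGNORECASE)


def mentioned_cves(text: str) -> set:
    """Return the set of distinct CVE identifiers mentioned in the text."""
    return {match.upper() for match in CVE_ID.findall(text)}


def is_suitable(text: str, cve_id: str) -> bool:
    """Check unique assignability and absence of the target variable."""
    if mentioned_cves(text) != {cve_id.upper()}:
        return False  # no CVE, a different CVE, or several CVEs mentioned
    if VECTOR_FRAGMENT.search(text):
        return False  # the text would leak the CVSS vector to the model
    return True


if __name__ == "__main__":
    sample = "Buffer overflow in foo (CVE-2021-12345) allows remote code execution."
    print(is_suitable(sample, "CVE-2021-12345"))
```

The noise criterion is deliberately left out here, since it cannot be captured by a pattern match and depends on the structure of each referenced page.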
Our secondary goal with this exploratory analysis is to identify where to find usable data, assess the data quality, and determine how the data can be used. These questions correspond to our research questions (see §1).
3.1 Descriptions in the NVD
The first and most important starting point for finding texts about vulnerabilities is the NVD. We consider NVD entries from 2016 to 2021, based on the introduction of the current CVSS standard version 3. Entries without CVSS version 3 information are excluded. This is the case for vulnerabilities in 2016, when CVSS v3 was still in the process of wide adoption, and in 2021, where the CVSS v3 vector was not yet available at the time the entries were retrieved. In total, we collected 88 979 entries.
Figure 1: Distribution of National Vulnerability Database description lengths.
Individual entries in the NVD contain a short, expert-curated⁶ description of the vulnerability. The length of the descriptions for our collected entries ranges between 23 and 3 835 characters, with an average of 310 and a median of 249. Fig. 1 shows the distribution of the length of the descriptions. Descriptions longer than 1 000 characters are very rare, with the 95th percentile already at 746 characters. The information content of texts correlates with the pure length of the texts, apart from some exceptions⁷. Likewise, a single, short sentence cannot describe all aspects of the vulnerability. As Fig. 1 illustrates, there are a large number of vulnerabilities in the NVD with very short descriptions.
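The summary statistics reported above (minimum, maximum, mean, median, and 95th percentile of the description length) can be computed with the Python standard library. This is a minimal sketch on placeholder data. The function name `length_summary` is illustrative, and the choice of the inclusive quantile method is an assumption, since the text does not say how the percentile was computed.

```python
import statistics


def length_summary(descriptions: list) -> dict:
    """Summarize the character lengths of a list of description strings."""
    lengths = sorted(len(text) for text in descriptions)
    if len(lengths) < 2:
        raise ValueError("need at least two descriptions")
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cut_points = statistics.quantiles(lengths, n=100, method="inclusive")
    return {
        "min": lengths[0],
        "max": lengths[-1],
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "p95": cut_points[94],
    }


if __name__ == "__main__":
    # Placeholder strings of length 1..100, not real NVD descriptions.
    toy_descriptions = ["x" * n for n in range(1, 101)]
    print(length_summary(toy_descriptions))
```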
Literature shows that the quality of vulnerability descriptions in the NVD varies [22], and the quality can only be assessed to a limited extent without a deeper analysis. A random sample shows that many descriptions contain little information about the actual vulnerability and instead list, e.g., affected products and version numbers. Such information is unrelated to the characteristics of the vulnerability and is therefore of little use for predicting the vulnerability severity. Nevertheless, Shahid and Debar [32] show that good results in the prediction of the CVSS vector are possible based only on NVD descriptions. Their method of CVSS score prediction achieves a Mean Squared Error (MSE) of 1.79 and a correctly predicted score in 53% of all cases.
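For reference, the two metrics quoted above can be computed as in the following minimal sketch. The function names and the toy scores are illustrative. Treating a score as "correctly predicted" when it matches the reference after rounding to one decimal place is our assumption about the criterion, not a definition taken from [32].

```python
def mean_squared_error(true_scores: list, predicted_scores: list) -> float:
    """Average squared difference between reference and predicted scores."""
    if not true_scores or len(true_scores) != len(predicted_scores):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(true_scores, predicted_scores)]
    return sum(squared) / len(squared)


def exact_match_rate(true_scores: list, predicted_scores: list) -> float:
    """Share of predictions equal to the reference at one decimal place."""
    if not true_scores or len(true_scores) != len(predicted_scores):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(round(t, 1) == round(p, 1)
               for t, p in zip(true_scores, predicted_scores))
    return hits / len(true_scores)


if __name__ == "__main__":
    # Toy values for illustration only, not results from any paper.
    reference = [9.8, 7.5, 5.0, 4.0]
    predicted = [9.8, 6.5, 5.0, 6.0]
    print(mean_squared_error(reference, predicted))
    print(exact_match_rate(reference, predicted))
```

The two metrics answer different questions: the MSE penalizes large deviations heavily, while the match rate ignores how far off a wrong prediction is.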
3.2 Reference Analysis
Each NVD entry references websites. To identify which websites are suitable to be crawled, we first analyze what kind of references are involved and, based on these insights, build categories for reference domains. Second, we rate these groups based on their crawlability and potential text quality.
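A first step toward such categories is to count how often each domain occurs among the reference URLs. The following is a minimal sketch of that counting step, assuming the reference URLs are already available as a list of strings. The function names and the example URLs are hypothetical, and stripping a leading "www." is a simplification that does not merge other subdomains.

```python
from collections import Counter
from urllib.parse import urlsplit


def reference_domain(url: str) -> str:
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def count_domains(urls: list) -> Counter:
    """Count how many reference URLs point to each domain."""
    return Counter(reference_domain(url) for url in urls)


if __name__ == "__main__":
    # Hypothetical reference URLs for illustration only.
    example_urls = [
        "https://www.exploit-db.com/exploits/1",
        "https://exploit-db.com/exploits/2",
        "https://github.com/example/project/issues/3",
    ]
    for domain, count in count_domains(example_urls).most_common():
        print(domain, count)
```

Sorting domains by frequency shows which few sites account for most references, which is where a manual rating of crawlability and text quality is most worthwhile.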
⁶ https://www.cve.org/ResourcesSupport/FAQs#pc_cve_recordscve_record_descriptions_created
⁷ Some descriptions list other, non-identical, vulnerabilities, which artificially increases the length of the description without giving further content.