
Common Vulnerability Scoring System Prediction based on Open Source Intelligence Information Sources

Philipp Kühn∗

David N. Relke

Christian Reuter

kuehn@peasec.tu-darmstadt.de

david.relke@stud.tu-darmstadt.de

reuter@peasec.tu-darmstadt.de

Science and Technology for Peace and Security (PEASEC), Technical University of Darmstadt

Darmstadt, Germany

ABSTRACT

The number of newly published vulnerabilities is constantly increasing. Until now, the information available when a new vulnerability is published is manually assessed by experts using a Common Vulnerability Scoring System (CVSS) vector and score. This assessment is time-consuming and requires expertise. Various works already try to predict CVSS vectors or scores using machine learning based on the textual descriptions of the vulnerability to enable faster assessment. However, for this purpose, previous works only use the texts available in databases such as the National Vulnerability Database. With this work, the publicly available web pages referenced in the National Vulnerability Database are analyzed and made available as sources of texts through web scraping. A Deep Learning based method for predicting the CVSS vector is implemented and evaluated. The present work provides a classification of the National Vulnerability Database’s reference texts based on the suitability and crawlability of their texts. While we identified that the overall influence of the additional texts is negligible, we outperformed the state-of-the-art with our Deep Learning prediction models.

1 INTRODUCTION

IT systems are now ubiquitous and fundamental to society, businesses, and individuals. Failures and disruptions can have catastrophic consequences for those affected. In 2017, for example, two waves of ransomware attacks occurred, each resulting in major outages to businesses and infrastructure [ ]. The vulnerability that enabled these attacks had been known and fixed a month before the first attack. In other attacks, such as the one on Microsoft Exchange Server in early 2021, only a few days passed between the discovery of the vulnerability and the start of attacks [5].

It is therefore important for researchers or system administrators to learn about vulnerabilities as early as possible, analyze them, and initiate countermeasures. Various publicly accessible databases, such as the National Vulnerability Database (NVD) and the Common Vulnerabilities and Exposures (CVE), collect, structure, and prepare the published vulnerabilities for this purpose. However, relevant information can also be found on many other platforms, such as social media (especially Twitter), blogs, news portals, and company websites.

∗Corresponding author

1 nvd.nist.gov

2 cve.mitre.org

The Common Vulnerability Scoring System (CVSS) is used to categorize different aspects of vulnerabilities. The result of this categorization is a vector whose elements are a machine-readable representation of the vulnerability’s properties. Based on the components of the CVSS vector, a numerical vulnerability score (CVSS severity score) is calculated. The vulnerability assessment is usually performed by IT security experts based on the available Open Source Intelligence (OSINT) information. OSINT refers to the structured collection and analysis of information that is freely available to the public.
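To make the relation between vector and score concrete, the following is a minimal sketch of how a base score can be derived from a CVSS vector string. It assumes the base-metric equations and weights of the CVSS v3.1 specification published by FIRST; the function names (`parse_vector`, `base_score`, `severity`) are illustrative and not taken from the paper, and temporal and environmental metrics are ignored.

```python
import math

# Base-metric weights, assumed from the CVSS v3.1 specification.
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
# Privileges Required depends on whether the scope is changed.
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}


def roundup(value):
    """Round up to one decimal place as defined for CVSS v3.1."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def parse_vector(vector):
    """Turn 'CVSS:3.1/AV:N/AC:L/...' into {'AV': 'N', 'AC': 'L', ...}."""
    parts = vector.split("/")
    return dict(p.split(":") for p in parts if not p.startswith("CVSS"))


def base_score(vector):
    m = parse_vector(vector)
    changed = m["S"] == "C"
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[m["PR"]]
    # Impact sub-score from confidentiality, integrity, availability.
    iss = 1 - ((1 - WEIGHTS["C"][m["C"]])
               * (1 - WEIGHTS["I"][m["I"]])
               * (1 - WEIGHTS["A"][m["A"]]))
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    exploitability = (8.22 * WEIGHTS["AV"][m["AV"]] * WEIGHTS["AC"][m["AC"]]
                      * pr * WEIGHTS["UI"][m["UI"]])
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if changed:
        total *= 1.08
    return roundup(min(total, 10))


def severity(score):
    """Map a score to the qualitative severity level."""
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))  # 9.8
```

Because the score is a deterministic function of the vector, a model that predicts every vector component also yields the score and the severity level.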

There is a certain period of time in which the information about a new vulnerability is published, but the assessment made by experts is not yet available [ ]. Due to the large number of published vulnerabilities, it is difficult for researchers or, e.g., responsible persons in companies to assess each new vulnerability themselves. They are therefore dependent on the assessments of experts. Accordingly, the longer it takes for the assessment to become available, the longer it takes for countermeasures to be taken to mitigate the vulnerability. During this period, the affected systems are open to attack without the responsible parties knowing about it. It is therefore important that the assessment is available as soon as possible.

Various works [ ] try to perform this assessment automatically based on the textual information available about a vulnerability using Machine Learning (ML). This would allow for a much faster assessment. The vulnerability could already be assessed in an automated way when it is published, and the time window in which not even a preliminary assessment is available would be kept small. It would also allow experts to prioritize and make recommendations for the assessment.

Previous work largely uses only the short descriptions of vulnerabilities from NVD and CVE, with some exceptions [ ]. Han et al. [17], for instance, present a system for classifying vulnerabilities into different severity levels based on CVSS. Khazaei et al. [21] present a work on predicting the numerical CVSS severity score. In addition, there are methods that automatically predict the entire CVSS vector [ ]. Another work by Kuehn et al. [22] describes a system that uses Deep Learning to predict the CVSS vector. However, the system requires labels created by experts to train, which significantly increases the required effort for larger datasets. Further, Deep Learning (DL) profits from large training datasets to which the reference texts could contribute, which is currently not leveraged by related work.

3 https://www.first.org/cvss/specification-document

arXiv:2210.02143v1 [cs.CR] 5 Oct 2022

Goal. This work aims to use as much textual data as possible to predict the CVSS vector of a vulnerability, in order to achieve the most accurate estimation of the CVSS vector possible. It should be possible to use not only the short description of the vulnerability, but also other types of texts, such as Twitter posts and news articles, for prediction in case of a new vulnerability. Possible sources of textual information about vulnerabilities should be found and categorized. We aim to answer the following research questions: Where can relevant textual information on vulnerabilities be found outside vulnerability databases (RQ1)? and To which degree are public data sources beyond vulnerability databases suitable for predicting the CVSS vector (RQ2)? This will clarify whether there are typical sources that regularly report on current vulnerabilities and whether these are suitable as a basis for building a dataset for training a system. Here, a first impression shall be gained by a rough manual search, and then the sources referenced in the databases shall be analyzed automatically with regard to the type and scope of the references (e.g., blog posts, patch notes, GitHub issues). With the help of the texts, a model for predicting the CVSS vector is to be trained. The data must be filtered and cleaned for this purpose. The model shall use Deep Learning and use state-of-the-art models as a basis. The model is evaluated and compared to previous work.

Contributions. The contribution to current research is an analysis of the references contained in the databases. This categorizes the references in terms of certain characteristics and their suitability for ML models and can serve as a starting point for further work on the use of the references (C1). A method that collects and processes the text contained on the referenced web pages is presented. In addition, a system is implemented and evaluated that, unlike previous work, such as Elbaz et al. [12] and Kuehn et al. [22], uses more extensive text from the references in addition to descriptions of vulnerabilities from the databases (C2). This method for predicting CVSS vectors surpasses the current state-of-the-art. Further, we present an extensive explainability analysis of our trained models as part of our evaluation (C3).

Outline. The state of the art in research is considered in §2, followed by a preliminary analysis of the references included in NVD in §3. Requirements for references and the texts contained in them are defined and, consequently, the individual references are evaluated, resulting in a selection of references. §4 explains the procedure for collecting the texts from the references and presents a system for retrieving, processing, and storing the texts. §5 evaluates the system, while §6 discusses and compares the results with other work. Finally, a conclusion is drawn in §7.

2 RELATED WORK

This section gives an overview of the state of the art in research. We focus on literature dealing with the prediction of CVSS vectors, scores, or levels. In addition, work that uses sources other than NVD in this context is considered. Automated assessment should provide a time advantage over the assessment by human experts. In this regard, different papers come to different conclusions regarding the duration of the assessment, and the exact methodology is not always clear. Elbaz et al. [12] state for the observed period from 2007 to 2019 that 90% of vulnerabilities were assessed within just under 30 days, with a median of only one day, while Chen et al. [7] indicate an average of 132 days between publication and assessment for an observed period of 23 months in 2018 and 2019.

NVD, CVSS, Information Sources. Johnson et al. [20] perform a statistical analysis of CVSS vectors in different databases containing vulnerabilities. In doing so, they show that despite different sources, the CVSS vector is always comparable and, consequently, seems to be robust. They state that the NVD is the most robust information source for CVSS information. On the other hand, Dong et al. [10] show that information in the NVD itself is sometimes inconsistent and propose a system that relies on external sources to find, for example, missing versions of the software in question in the NVD. Accordingly, Kuehn et al. [22] present an information quality metric for vulnerability databases and address several drawbacks in the NVD. In addition to vulnerability databases, other sources of information are used in vulnerability management. Sabottke et al. [29] use Twitter to predict whether a vulnerability will actually be exploited. Almukaynizi et al. [1] go a step further and use other data sources, such as ExploitDB and Zero Day Initiative. However, no text is used; instead, the mere existence of an article about a vulnerability is used as a feature for the ML model.

CVSS Prediction. A large number of works deal with the prediction of CVSS vectors, scores, or levels starting from text. As one of the first works, Yamamoto et al. [37] use sLDA [ ] to predict the CVSS vector based on the descriptions. For predicting the score, Khazaei et al. [21] use Support Vector Machines (SVMs), random forests [ ], and fuzzy logic. Spanos and Angelis [33] predict the CVSS vector using random forests and boosting [ ]. Deep Learning is first used in this context by Han et al. [17]. By using a Convolutional Neural Network (CNN), no feature engineering is required. However, in doing so, the model only determines the CVSS severity level from the options Critical, High, Medium, and Low. Gawron et al. [14] use neural networks in addition to Naive Bayes, but here the result is the CVSS vector. Twitter serves as the data source for Chen et al. [6]. The model is based on Long Short-Term Memory (LSTM) [ ] and predicts the CVSS score. Sahin and Tosun [30] also improve on the Han et al. [17] approach by using an LSTM. Gong et al. [15] show a multi-task learning method that sets up multiple classifiers on a single Neural Network (NN), making it more efficient. Liu et al. [25] use the Chinese equivalent, the China National Vulnerability Database of Information Security (CNNVD), as the data source rather than the NVD. Jiang and Atif [19] take scores not only from the NVD but also from other sources as a basis for their prediction of the score. The work of Elbaz et al. [12] focuses on a particularly tractable classification of the CVSS vector. Therefore, they do not use dimension reduction techniques. Kuehn et al. [22] use Deep Learning to predict the CVSS vector, based on the NVD’s descriptions, with the goal to aid security experts in their final decision. The most recent approach, proposed by Shahid and Debar [32], uses a separate classifier, based on a Bidirectional Encoder Representations from Transformers (BERT) model [ ], for each component of the CVSS vector. Several proposals rely solely on the textual data from the NVD. Some use text from Twitter or simple binary features, such as the existence of an article about a particular vulnerability. Other vulnerability context tasks also use few different data sources. Yitagesu et al. [38] also use Twitter as a source for a model for Part-of-Speech (POS) tagging. Liao et al. [24] propose a system which draws on several sources to filter Indicators of Compromise (IoC) from natural text.

4 https://www.exploit-db.com/

5 https://www.zerodayinitiative.com/

Research Gap. OSINT is widely used in IT security [ ]. Various works exist on the prediction of CVSS vectors based on descriptions. However, as research shows, few OSINT vulnerability sources are used [ ], especially in the context of CVSS score, level, or vector prediction, and if they are, only very simple features from other sources are used [ ]. Furthermore, there is no systematic analysis of the suitability of NVD references for CVSS vector prediction approaches.

3 PRELIMINARY ANALYSIS

The authors performed an exploratory analysis of the available data, i.e., vulnerability descriptions and outgoing references from the NVD, to identify data suitability criteria and requirements for the web scraping process. Suitable in the sense of the present work are texts that describe a vulnerability and can be directly assigned to a vulnerability via the CVE identification number. In the following, we list the assumptions we considered.

• Each text shall be uniquely assignable to one and only one vulnerability via the CVE identification number. Without this criterion, a text could be used as a training example for two different permutations of one of the components of the CVSS vector. This makes it difficult for the ML algorithm to identify the relevant properties of the vulnerability. The vulnerabilities covered in a text may be very different, so it does not make sense to use the same text for multiple vulnerabilities. It is even possible that only one vulnerability is described, although several with different target vectors are mentioned.

• The texts should not contain the target variable, i.e., the CVSS vector. Otherwise, the model could predict the target parameter based on the variable present in the input, without any actual meaningful learning effect.

• There should be as little noise as possible. This ensures a high quality of the prediction. As stated in §2, the data otherwise contain patterns that could negatively affect the ML model.
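The first two criteria can be checked mechanically. The following is a minimal sketch, not the authors' implementation: it assumes that CVE identifiers follow the pattern `CVE-YYYY-NNNN...` and that a leaked target appears as a CVSS v3 base vector string. The function names and the sample texts (including the CVE identifiers in them) are made up for illustration.

```python
import re

# CVE identifiers: "CVE-", a four-digit year, and at least four digits.
CVE_ID = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)

# The eight base metrics of a CVSS v3 vector in their canonical order.
CVSS_V3_VECTOR = re.compile(
    r"AV:[NALP]/AC:[LH]/PR:[NLH]/UI:[NR]/S:[UC]/C:[HLN]/I:[HLN]/A:[HLN]"
)


def mentioned_cves(text):
    """Return the set of distinct CVE identifiers mentioned in the text."""
    return {match.upper() for match in CVE_ID.findall(text)}


def is_suitable(text, cve_id):
    """Apply the first two criteria to one reference text."""
    # Criterion 1: the text refers to exactly this vulnerability.
    if mentioned_cves(text) != {cve_id.upper()}:
        return False
    # Criterion 2: the text does not leak the target CVSS vector.
    if CVSS_V3_VECTOR.search(text):
        return False
    return True


advisory = "A heap overflow in the parser (CVE-2021-0001) allows code execution."
digest = "This release fixes CVE-2021-0001 and CVE-2021-0002."
leaky = "CVE-2021-0001: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"

print(is_suitable(advisory, "CVE-2021-0001"))  # True
print(is_suitable(digest, "CVE-2021-0001"))    # False
print(is_suitable(leaky, "CVE-2021-0001"))     # False
```

The third criterion (noise) is not covered here, since it depends on the structure of the individual web pages.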

Our secondary goal with this exploratory analysis is to identify where to find usable data, assess the data quality, and determine how it can be used. Those questions correlate with our research questions (see §1).

3.1 Descriptions in the NVD

The first and most important starting point for finding texts about vulnerabilities is the NVD. We consider NVD entries from 2016 to 2021, based on the introduction of the current CVSS standard version 3. Entries without CVSS version 3 information are excluded. This is the case for vulnerabilities in 2016, when CVSS v3 was still in the process of wide adoption, and in 2021, where the CVSS vector was not yet available at the time the entries were retrieved. In total, we collected 88 979 entries.

Figure 1: Distribution of National Vulnerability Database description lengths.

Individual entries in the NVD contain a short, expert-curated description of the vulnerability. The length of the descriptions for our collected entries ranges up to 3 835 characters, with an average of 310 and a median of 249. Fig. 1 shows the distribution of the length of the descriptions. Descriptions longer than 1 000 characters are very rare, with the 95th percentile already at 746 characters. The information content of texts correlates with the pure length of the texts, apart from some exceptions. Likewise, a single, short sentence cannot describe all aspects of the vulnerability. As Fig. 1 illustrates, there are a large number of vulnerabilities in NVD with very short descriptions.
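Statistics of this kind can be computed as in the following sketch. The function name `length_stats` and the sample data are illustrative assumptions; the percentile uses the nearest-rank definition, which may differ slightly from the method used for the figures reported above.

```python
from statistics import mean, median


def length_stats(descriptions):
    """Summarize description lengths in characters."""
    lengths = sorted(len(text) for text in descriptions)
    n = len(lengths)
    # Nearest-rank 95th percentile: smallest value with at least
    # 95% of the data at or below it (integer ceiling of 0.95 * n).
    rank = (95 * n + 99) // 100
    return {
        "min": lengths[0],
        "max": lengths[-1],
        "mean": mean(lengths),
        "median": median(lengths),
        "p95": lengths[rank - 1],
    }


# Toy data: 100 placeholder descriptions with lengths 1 to 100.
sample = ["x" * n for n in range(1, 101)]
print(length_stats(sample))
```

Since mean and maximum are sensitive to a few very long descriptions, the median and the 95th percentile give the more representative picture of a skewed distribution.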

Literature shows that the quality of vulnerability descriptions in the NVD differs [ ] and the quality can only be assessed to a limited extent without a deeper analysis. A random sample shows that many descriptions contain little information about the actual vulnerability, but list, e.g., affected products and version numbers. Such information is unrelated to the characteristics of the vulnerability and is therefore of little use for predicting the vulnerability severity. Nevertheless, Shahid and Debar [32] show that good results in the prediction of the CVSS vector are possible based only on NVD descriptions. Their method of CVSS score prediction achieves a Mean Squared Error (MSE) of 1.79 and a correctly predicted score in 53% of all cases.
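For reference, the two metrics can be computed as in the following sketch. The function names and the sample scores are made up for illustration; comparing scores after rounding to one decimal place is an assumption about how an exact match is counted.

```python
def mean_squared_error(predicted, actual):
    """Average of the squared differences between paired scores."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def exact_match_rate(predicted, actual):
    """Share of predictions equal to the true score at one decimal place."""
    hits = sum(round(p, 1) == round(a, 1) for p, a in zip(predicted, actual))
    return hits / len(actual)


# Made-up scores, not results from any paper.
predicted = [7.5, 9.8, 5.0, 4.0]
actual = [7.5, 8.8, 5.0, 6.0]

print(mean_squared_error(predicted, actual))  # approximately 1.25
print(exact_match_rate(predicted, actual))    # 0.5
```

The two numbers complement each other: the exact-match rate ignores how far off a wrong prediction is, while the MSE penalizes large deviations quadratically.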

3.2 Reference Analysis

Each NVD entry references websites. To identify which websites are suitable to be crawled, we first analyze what kind of references are involved and, based on these insights, build categories for reference domains. Second, we rate these groups based on their crawlability and potential text quality.
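A first step towards such categories is to count how often each domain is referenced. The following sketch illustrates this under simple assumptions: the URLs are placeholders, the helper names are hypothetical, and a leading `www.` is dropped so that both spellings of a host are counted together.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def count_domains(urls):
    """Count how many references point to each domain."""
    return Counter(domain_of(url) for url in urls)


# Placeholder reference URLs, not taken from the NVD.
references = [
    "https://github.com/example/project/issues/1",
    "https://www.github.com/example/project/pull/2",
    "https://blog.example.org/post",
]

for domain, count in count_domains(references).most_common():
    print(domain, count)
```

Sorting by frequency shows which domains cover most references and are therefore worth categorizing and rating first.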

https://www.cve.org/ResourcesSupport/FAQs#pc_cve_recordscve_record_descriptions_created

Some descriptions list other, non-identical, vulnerabilities, which artificially increases the length of the description without giving further content.
