Enriching Vulnerability Reports Through Automated and Augmented Description Summarization

Hattan Althebeiti and David Mohaisen

University of Central Florida, Orlando, USA

{hattan.althebeiti,mohaisen}@ucf.edu

Abstract. Security incidents and data breaches are increasing rapidly, and only a fraction of them are reported. Public vulnerability databases, e.g., the National Vulnerability Database (NVD) and Common Vulnerabilities and Exposures (CVE), have been leading the effort in documenting vulnerabilities and sharing them to aid defenses. Both are known to have many issues, including brief vulnerability descriptions. Those descriptions play an important role in communicating vulnerability information to security analysts so that they can develop appropriate countermeasures. Many resources provide additional information about vulnerabilities; however, they are not utilized to enrich public repositories. In this paper, we devise a pipeline to augment vulnerability descriptions through third-party reference (hyperlink) scraping. To normalize the description, we build a natural language summarization pipeline utilizing a pretrained language model that is fine-tuned using labeled instances, and we evaluate its performance using both human evaluation (the gold standard) and computational metrics, showing promising initial results in terms of summary fluency, completeness, correctness, and understanding.

Keywords: Vulnerability · NVD · CVE · Natural Language Processing · Summarization · Sentence Encoder · Transformer.

1 Introduction

Vulnerabilities are weaknesses in systems that expose them to threats or exploitation. They are prevalent in software and are constantly being discovered and patched. However, given the rapid development of technologies, discovering a vulnerability and developing a mitigation technique have become challenging. Moreover, documenting vulnerabilities and keeping track of their development have become cumbersome.

Common Vulnerabilities and Exposures (CVE), managed by MITRE, and the National Vulnerability Database (NVD), managed by NIST, are two key resources for reporting and sharing vulnerabilities. The content of each resource may differ slightly according to [8], but they are mostly synchronized, and any update to the CVE should eventually appear in the NVD. However, NVD/CVE descriptions have several shortcomings. For example, a description might be incomplete, outdated, or even contain inaccurate information, which could delay the development and deployment of patches. In 2017, Risk Based Security, also known as VulnDB, reported 7,900 more vulnerabilities than were reported by CVE [9,10]. Another concern with the existing framework is that the description provided for vulnerabilities is often incomplete, brief, or does not carry sufficient contextual information [3,5].

arXiv:2210.01260v1 [cs.CR] 3 Oct 2022

To address some of these gaps, this work focuses on the linguistic aspects of vulnerability descriptions and attempts to improve them by formulating the problem as a summarization task over an augmented initial description. We exploit the existence of third-party reports associated with vulnerabilities, which include more detailed information that goes beyond the basic description in the CVE. We leverage these additional resources through a natural language processing (NLP) pipeline, providing informative summaries that cover more details and perform well on both computational and human metrics.

Contributions. The main contributions of this work are as follows. (1) We present a pipeline that enriches the description of vulnerabilities by considering semantically similar content from various third-party resources (reference URLs). (2) To normalize the enriched description and alleviate some of the drawbacks of the augmentation (e.g., redundancy and repetition, highly variable description length), we build an NLP pipeline that exploits advances in representation and pretrained language models that are fine-tuned using the original (short) description as a label, and that generates semantically similar summaries of vulnerabilities. (3) We evaluate the performance of the proposed NLP pipeline on NVD, a popular vulnerability database, with both computational and human metric evaluations.

2 Related Work

Vulnerabilities are constantly being exploited due to the wide spread of malware and viruses along with the improper deployment of countermeasures or missing security updates. Mohaisen et al. [16] proposed AMAL, an automated system to analyze and classify malware based on its behaviour. AMAL is composed of two components, AutoMal and MaLabel. AutoMal collects information about malware samples based on their behaviours for monitoring and profiling. MaLabel, on the other hand, utilizes the artifacts generated by AutoMal to build a feature vector representation for malware samples. Moreover, MaLabel builds multiple classifiers to classify unlabeled malware samples and to cluster them into separate groups such that each group has malware samples with similar profiles.

Public repositories provide comprehensive information about vulnerabilities; however, they still suffer from quality and consistency issues, as demonstrated in previous works [3,8]. Anwar et al. [3] identified and quantified multiple quality issues with the NVD and addressed their implications and ramifications. The authors present a method for each issue to remedy the discovered deficiency and improve the NVD. Similarly, Anwar et al. [4] studied the impact of vulnerability disclosure on the stock market and how it affects different industries. They were able to cluster industries into three categories based on the vulnerabilities' impact on the vendor’s return.

Few prior works have studied the characteristics of vulnerabilities using NLP-based approaches, although NLP has been utilized extensively for other security and privacy applications. Alabduljabbar et al. [1] conducted a comprehensive study to classify privacy policies established by a third party. A pipeline was developed to classify text segments into a high-level category that corresponds to the content of that segment. Likewise, Alabduljabbar et al. [2] used NLP to conduct a comparative analysis of privacy policies presented by free and premium content websites. The study highlighted that premium content websites are more transparent in reporting their practices with respect to data collection and tracking.

Dong et al. [8] built VIEM, a system to capture inconsistency between CVE/NVD and third-party reports utilizing a Named Entity Recognition (NER) model and a Relation Extractor (RE) model. The NER model is responsible for identifying the name and version of vulnerable software based on their semantics and structure within the description and labeling each of them accordingly. The RE component utilizes the identified labels and pairs the appropriate software name and version to predict which software is vulnerable.

Other research focused on studying the relationship between CVE and Common Attack Pattern Enumeration and Classification (CAPEC) and whether it is possible to trace a CVE description to a particular CAPEC using NLP, as in Kanakogi et al. [12]. Similarly, Kanakogi et al. [11] tested a new method for the same task using Doc2Vec. Wareus and Hell [22] proposed a method to automatically assign Common Platform Enumeration (CPE) identifiers to CVEs from their descriptions using NLP.

This work. We propose a pipeline for enriching the vulnerability description, and a pipeline for normalizing the description through summarization, with an associated evaluation.

3 Dataset: Baseline and Data Augmentation

Data Source and Scraping. Our data source is NVD because it is a well-known standard accepted across the globe, in both industry and academia, with many strengths: (1) detailed structured information, including the severity score and publication date, (2) human-readable descriptions, (3) capabilities for reanalysis with updated information, and (4) a powerful API for vulnerability information retrieval.

In our data collection, we limit our timeframe to vulnerabilities reported between 2019 and 2021 (inclusive). Based on our analysis, CVEs reported before 2019 do not include sufficient hyperlinks with additional text, which is our main source for augmentation. First, we list all the vulnerabilities reported in this period and scrape them. For each vulnerability, we scrape the URL pointing to the NVD page that hosts that vulnerability. As a result, we obtain 35,657 vulnerabilities with their unique URLs. Second, we iterate through every URL to retrieve various data elements. After retrieving the page at each URL, we scrape the description and the hyperlinks for that vulnerability.
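The two scraping steps above can be illustrated with a minimal sketch using only the Python standard library. It is not the authors' scraper: the sample page, the `id="description"` marker, and the names `in_timeframe`, `VulnPageParser`, and `parse_vuln_page` are hypothetical, and the real NVD page layout is not reproduced here.

```python
import re
from html.parser import HTMLParser

# CVE identifiers carry the year they were assigned: CVE-YYYY-NNNN...
CVE_ID = re.compile(r"^CVE-(\d{4})-\d{4,}$")


def in_timeframe(cve_id, first=2019, last=2021):
    """Return True if the CVE identifier falls in the inclusive year range."""
    match = CVE_ID.match(cve_id)
    return bool(match) and first <= int(match.group(1)) <= last


class VulnPageParser(HTMLParser):
    """Collect the description text and the external hyperlinks of one page.

    Assumption: the description sits in <p id="description">; a real page
    would need its own selector.
    """

    def __init__(self):
        super().__init__()
        self.description = ""
        self.hyperlinks = []
        self._in_description = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p" and attrs.get("id") == "description":
            self._in_description = True
        elif tag == "a" and (attrs.get("href") or "").startswith("http"):
            # Keep absolute links only; relative links are site navigation.
            self.hyperlinks.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_description = False

    def handle_data(self, data):
        if self._in_description:
            self.description += data


def parse_vuln_page(html):
    """Return (description, hyperlinks) extracted from one page's HTML."""
    parser = VulnPageParser()
    parser.feed(html)
    parser.close()
    return parser.description.strip(), parser.hyperlinks


# Hypothetical page used only for illustration.
SAMPLE_PAGE = (
    '<html><body><p id="description">Buffer overflow in the hypothetical '
    'foo parser.</p><a href="https://vendor.example/advisory/1">Advisory</a>'
    '<a href="/internal">Home</a></body></html>'
)
```

Calling `parse_vuln_page(SAMPLE_PAGE)` yields the description string and the single external link; the year filter would be applied to the list of CVE identifiers before any page is fetched.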

Description Augmentation. To augment the description, we iterate through the scraped hyperlinks. Each hyperlink directs us to a page hosted by a third party, which could be an official page belonging to the vendor or the developer, or an unofficial page, e.g., a GitHub issue-tracking page. We scrape every paragraph tag in each page separately and apply various preprocessing steps to the extracted paragraph to clean up the text. This preprocessing includes removing web links, special characters, redundant white spaces, phone numbers, and email addresses. We also check the length of the paragraph and ensure it is more than 20 words after preprocessing. We conjecture that paragraphs shorter than 20 words will not contribute to our goal.
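The cleaning and length filter described above could look roughly like the following sketch. The regular expressions are illustrative assumptions rather than the patterns used in the paper: the phone pattern only covers North American style numbers (so dotted version strings such as 2.10.15 are left alone), and the set of characters treated as "special" is a guess.

```python
import re

# All patterns below are assumptions made for illustration.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE_RE = re.compile(
    r"(?<!\w)(?:\+?\d{1,3}[\s-])?\(?\d{3}\)?[\s-]\d{3}[\s-]\d{4}(?!\w)"
)
SPECIAL_RE = re.compile(r"[^A-Za-z0-9\s.,;:!?'()/-]")
SPACE_RE = re.compile(r"\s+")

MIN_WORDS = 20  # paragraphs must be longer than this after cleaning


def clean_paragraph(text):
    """Strip web links, emails, phone numbers, special characters, extra spaces."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = PHONE_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def keep_paragraph(text, min_words=MIN_WORDS):
    """Return True only if the paragraph has more than min_words words."""
    return len(text.split()) > min_words


def preprocess(paragraphs):
    """Clean every scraped paragraph and drop the ones that end up too short."""
    cleaned = (clean_paragraph(p) for p in paragraphs)
    return [p for p in cleaned if keep_paragraph(p)]
```

Links are removed before email addresses so that an address embedded in a URL is not processed twice, and the length check runs on the cleaned text, matching the "after preprocessing" condition in the paragraph above.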
