Enriching Vulnerability Reports Through Automated and Augmented Description Summarization Hattan Althebeiti and David Mohaisen

Enriching Vulnerability Reports Through Automated
and Augmented Description Summarization
Hattan Althebeiti and David Mohaisen
University of Central Florida, Orlando, USA
{hattan.althebeiti,mohaisen}@ucf.edu
Abstract. Security incidents and data breaches are increasing rapidly, and only
a fraction of them are being reported. Public vulnerability databases, e.g., the
National Vulnerability Database (NVD) and Common Vulnerabilities and Exposures
(CVE), have been leading the effort in documenting vulnerabilities and sharing
them to aid defenses. Both are known to have many issues, including brief
vulnerability descriptions. Those descriptions play an important role in
communicating the vulnerability information to security analysts so that they can
develop the appropriate countermeasure. Many resources provide additional
information about vulnerabilities; however, they are not utilized to boost public
repositories. In this paper, we devise a pipeline to augment vulnerability
descriptions through third-party reference (hyperlink) scraping. To normalize the
description, we build a natural language summarization pipeline utilizing a
pretrained language model that is fine-tuned using labeled instances, and evaluate
its performance against both human evaluation (gold standard) and computational
metrics, showing initial promising results in terms of summary fluency,
completeness, correctness, and understanding.
Keywords: Vulnerability · NVD · CVE · Natural Language Processing · Sum-
marization · Sentence Encoder · Transformer.
1 Introduction
Vulnerabilities are weaknesses in systems that leave them exposed to threats or
exploitation. They are prevalent in software and are constantly being discovered and
patched. However, given the rapid development of technologies, discovering a
vulnerability and developing a mitigation technique becomes challenging. Moreover,
documenting vulnerabilities and keeping track of their development becomes cumbersome.
The Common Vulnerabilities and Exposures (CVE) list managed by MITRE and the
National Vulnerability Database (NVD) managed by NIST are two key resources for
reporting and sharing vulnerabilities. The content of each resource may differ slightly
according to [8], but they are mostly synchronized, and any update to the CVE should
eventually appear in the NVD. However, NVD/CVE descriptions have several shortcomings.
For example, the description might be incomplete, outdated, or even contain inaccurate
information, which could delay the development and deployment of patches. In 2017,
Risk Based Security, also known as VulnDB, reported 7,900 more vulnerabilities than
CVE did [9,10]. Another concern with the existing framework is that
the description provided for vulnerabilities is often incomplete, brief, or does not carry
sufficient contextual information [3,5].
arXiv:2210.01260v1 [cs.CR] 3 Oct 2022
To address some of these gaps, this work focuses on the linguistic aspects of
vulnerability descriptions and attempts to improve them by formulating the problem as a
summarization task over an augmented initial description. We exploit the existence of
third-party reports associated with vulnerabilities, which include more detailed
information about the vulnerabilities that goes beyond the basic description in the CVE. Therefore,
we leverage these additional resources employing a natural language processing (NLP)
pipeline towards that goal, providing informative summaries that cover more details and
perform well on both computational and human metrics.
Contributions. The main contributions of this work are as follows. (1) We present a
pipeline that enriches the description of vulnerabilities by considering semantically
similar content from various third-party resources (reference URLs). (2) In order to
normalize the enriched description and alleviate some of the drawbacks of the augmen-
tation (e.g., redundancy and repetition, largely variable length of description), we build
an NLP pipeline that exploits advances in representation, pretrained language models
that are fine-tuned using the original (short description) as a label, and generate se-
mantically similar summaries of vulnerabilities. (3) We evaluate the performance of the
proposed NLP pipeline on NVD, a popular vulnerability database, with both computa-
tional and human metric evaluations.
2 Related Work
Vulnerabilities are constantly being exploited due to the widespread presence of malware
and viruses, along with the improper deployment of countermeasures or missing security
updates. Mohaisen et al. [16] proposed AMAL, an automated system to analyze and
classify malware based on its behavior. AMAL is composed of two components:
AutoMal and MaLabel. AutoMal collects information about malware samples based on
their behaviors for monitoring and profiling. MaLabel, on the other hand, utilizes the
artifacts generated by AutoMal to build a feature vector representation for malware
samples. Moreover, MaLabel builds multiple classifiers to classify unlabeled malware
samples and to cluster them into separate groups such that each group has malware
samples with similar profiles.
Public repositories provide comprehensive information about vulnerabilities;
however, they still suffer from quality and consistency issues, as demonstrated in
previous works [3,8]. Anwar et al. [3] identified and quantified multiple quality issues
with the NVD and addressed their implications and ramifications. The authors present
a method to remedy each discovered deficiency and improve the NVD.
Similarly, Anwar et al. [4] studied the impact of vulnerability disclosure on the stock
market and how it affects different industries. They were able to cluster industries into
three categories based on the vulnerabilities' impact on the vendor's return.
Few prior works have studied the characteristics of vulnerabilities using NLP-based
approaches, although NLP has been utilized extensively for other security
and privacy applications. Alabduljabbar et al. [1] conducted a comprehensive
study to classify privacy policies established by a third party. A pipeline was developed
to classify each text segment into a high-level category that corresponds to the content of
that segment. Likewise, Alabduljabbar et al. [2] used NLP to conduct a comparative
analysis of privacy policies presented by free and premium content websites. The study
highlighted that premium content websites are more transparent in terms of reporting
their practices with respect to data collection and tracking.
Dong et al. [8] built VIEM, a system to capture inconsistencies between CVE/NVD
and third-party reports utilizing a Named Entity Recognition (NER) model and a Relation
Extraction (RE) model. The NER model is responsible for identifying the name and version of
vulnerable software based on their semantics and structure within the description and
labeling each of them accordingly. The RE component utilizes the identified labels and
pairs the appropriate software name and version to predict which software is vulnerable.
Other research focused on studying the relationship between CVE and Common
Attack Pattern Enumeration and Classification (CAPEC), and on whether it is possible to
trace a CVE description to a particular CAPEC using NLP, as in Kanakogi et al. [12].
Similarly, Kanakogi et al. [11] tested a new method for the same task using Doc2Vec.
Wareus and Hell [22] proposed a method to automatically assign Common Platform
Enumeration (CPE) entries to CVEs from their descriptions using NLP.
This work. We propose a pipeline for enriching the vulnerability description, and a
pipeline for normalizing the description through summarization, with an associated evaluation.
3 Dataset: Baseline and Data Augmentation
Data Source and Scraping. Our data source is NVD because it is a well-known stan-
dard accepted across the globe, in both industry and academia, with many strengths: (1)
detailed structured information, including the severity score and publication date, (2)
human-readable descriptions, (3) capabilities for reanalysis with updated information,
and (4) powerful API for vulnerability information retrieval.
In our data collection, we limit our timeframe to vulnerabilities reported between
2019 and 2021 (inclusive). Based on our analysis, CVEs reported before 2019 do not
include sufficient hyperlinks with additional text, which is our main source for
augmentation. First, we list all the vulnerabilities reported in this period and scrape
them. For each vulnerability, we scrape the URL pointing to the NVD page that hosts that
particular vulnerability. As a result, we obtain 35,657 vulnerabilities with their unique URLs.
Second, we iterate through every URL to retrieve various data elements. After retrieving
the URL, we scrape the description and the hyperlinks for that vulnerability.
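The second step can be illustrated with a minimal sketch that pulls the description and the reference hyperlinks out of one already-downloaded NVD detail page. This is not the authors' scraper: the `data-testid` values, the `VulnRecord` type, and the function names are assumptions made for illustration, and the page markup should be checked before relying on them. Only the Python standard library is used.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Public NVD detail-page URL pattern, keyed by CVE identifier.
NVD_DETAIL_URL = "https://nvd.nist.gov/vuln/detail/{cve_id}"


@dataclass
class VulnRecord:
    """Hypothetical container for one scraped vulnerability."""
    cve_id: str
    url: str
    description: str = ""
    hyperlinks: list = field(default_factory=list)


class DetailPageParser(HTMLParser):
    """Collects the description text and reference links from one page.

    The data-testid values below are ASSUMPTIONS about the page markup.
    """

    def __init__(self, desc_testid="vuln-description",
                 link_prefix="vuln-hyperlinks-link-"):
        super().__init__()
        self.desc_testid = desc_testid
        self.link_prefix = link_prefix
        self.description_parts = []
        self.hyperlinks = []
        self._in_desc = False
        self._in_link_cell = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        testid = attrs.get("data-testid") or ""
        if tag == "p" and testid == self.desc_testid:
            self._in_desc = True
        elif tag == "td" and testid.startswith(self.link_prefix):
            self._in_link_cell = True
        elif tag == "a" and self._in_link_cell and attrs.get("href"):
            # Only anchors inside a reference cell count as hyperlinks.
            self.hyperlinks.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_desc = False
        elif tag == "td":
            self._in_link_cell = False

    def handle_data(self, data):
        if self._in_desc:
            self.description_parts.append(data)


def parse_detail_page(cve_id, html):
    """Build a VulnRecord from the HTML of one detail page."""
    parser = DetailPageParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the page layout.
    description = " ".join("".join(parser.description_parts).split())
    return VulnRecord(cve_id, NVD_DETAIL_URL.format(cve_id=cve_id),
                      description, parser.hyperlinks)
```

Fetching the page itself (for example with `urllib.request`) is left out so that the parsing logic can be exercised offline on a saved page.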
Description Augmentation. To augment the description, we iterate through the scraped
hyperlinks. Each hyperlink directs us to a page hosted by a third party, which could be
an official page belonging to the vendor or the developer, or an unofficial page, e.g., a
GitHub issue tracking page. We scrape every paragraph tag in each page separately and
apply various preprocessing steps to the extracted paragraph to clean up the text. This
preprocessing includes removing web links, special characters, redundant white spaces,
phone numbers, and email addresses. We also check the length of the paragraph and
ensure it is more than 20 words after preprocessing. We conjecture that paragraphs shorter
than 20 words will not contribute to our goal.
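The preprocessing described above can be sketched as follows. The text does not give the exact patterns, so every regular expression here is an assumption: the phone pattern only covers North American style numbers, and "special characters" is taken to mean anything other than word characters, whitespace, and basic punctuation. The names `ParagraphExtractor`, `clean_paragraph`, and `extract_paragraphs` are hypothetical.

```python
import re
from html.parser import HTMLParser

# All patterns are illustrative assumptions, not the paper's rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
# Anchored so that dates and the numeric part of CVE IDs are not matched.
PHONE_RE = re.compile(
    r"(?<![\w-])(?:\+?\d{1,3}[\s.-]?)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}(?![\w-])"
)
SPECIAL_RE = re.compile(r"[^\w\s.,;:()'/-]")
MIN_WORDS = 20  # paragraphs must have MORE than this many words


class ParagraphExtractor(HTMLParser):
    """Collects the text of every <p> element separately."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means "not inside a <p>"

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer))
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def clean_paragraph(text):
    """Remove links, emails, phone numbers, and special characters."""
    text = URL_RE.sub(" ", text)    # links first: they may contain '@'
    text = EMAIL_RE.sub(" ", text)
    text = PHONE_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    return " ".join(text.split())   # collapse redundant white space


def extract_paragraphs(html, min_words=MIN_WORDS):
    """Return cleaned paragraphs longer than min_words words."""
    extractor = ParagraphExtractor()
    extractor.feed(html)
    extractor.close()
    cleaned = (clean_paragraph(p) for p in extractor.paragraphs)
    return [p for p in cleaned if len(p.split()) > min_words]
```

The length check runs after cleaning, matching the order stated in the text, so a paragraph made up mostly of links or contact details is dropped even if its raw form exceeds 20 words.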