Reducing Information Overload: Because Even Security Experts Need to Blink

Philipp Kuehn* (kuehn@peasec.tu-darmstadt.de), Markus Bayer (bayer@peasec.tu-darmstadt.de), Tobias Frey (tobiasjonathan.frey@stud.tu-darmstadt.de), Moritz Kerk (kerkmoritz1@gmail.com), and Christian Reuter (reuter@peasec.tu-darmstadt.de)
Science and Technology for Peace and Security (PEASEC), Technical University of Darmstadt, Darmstadt, Germany
ABSTRACT
Computer Emergency Response Teams (CERTs) face increasing challenges processing the growing volume of security-related information. Daily manual analysis of threat reports, security advisories, and vulnerability announcements leads to information overload, contributing to burnout and attrition among security professionals. This work evaluates 196 combinations of clustering algorithms and embedding models across five security-related datasets to identify optimal approaches for automated information consolidation. We demonstrate that clustering can reduce information processing requirements by over 90% while maintaining semantic coherence, with deep clustering achieving a homogeneity of 0.88 for security bug reports (SBRs) and partition-based clustering reaching 0.51 for advisory data. Our solution requires minimal configuration, preserves all data points, and processes new information within five minutes on consumer hardware. The findings suggest that clustering approaches can significantly enhance CERT operational efficiency, potentially saving over 3 750 work hours annually per analyst while maintaining analytical integrity. However, complex threat reports require careful parameter tuning to achieve acceptable performance, indicating areas for future optimization. The code is made available at https://github.com/PEASEC/reducing-information-overload.
CCS CONCEPTS
• Computing methodologies → Cluster analysis; Supervised learning; • Security and privacy → Usability in security and privacy.
KEYWORDS
Clustering, Security, Machine Learning, Computer Emergency Response Teams

*Corresponding author
1 INTRODUCTION
The cybersecurity threat landscape continuously evolves, with attackers deploying increasingly sophisticated tactics while security findings proliferate across multiple channels. Security personnel struggle to process high volumes of textual reports [12], impeding their primary mission of threat identification and infrastructure protection. Despite existing frameworks like the Cyber Threat Intelligence (CTI) cycle [42] and automation methods [18], information processing challenges persist. While CTI – the process of collecting and analyzing security data to derive actionable recommendations – can be aggregated in Threat Intelligence Platforms (TIPs) [35], the diversity of sources and evolving threats create significant information overload [16].
Computer Emergency Response Teams (CERTs), as organizational security incident coordinators [44], require current threat intelligence for effective response. Studies reveal that 45% of CERT teams process only critical reports due to understaffing [9], while 13% lack capacity for new information and 11% cannot manage existing volumes. Recent research [12, 17] reinforces these challenges, with 47.6% of analysts reporting burnout and 46.6% identifying threat monitoring as their most time-consuming task. For 19.2% of analysts, automating threat alert enrichment through incident correlation represents a critical priority [12]. Kaufhold et al. [17] highlight persistent manual processes in technical information exchange, redundancy checks, and general automation needs, underscoring the urgency for enhanced information processing solutions.
Goal. This research evaluates clustering algorithms' efficacy in supporting CERT threat information processing. Clustering enables efficient threat analysis by allowing a rapid overview of related data points before detailed investigation. We assess various embedding-clustering algorithm combinations against derived requirements, with particular emphasis on threat messages and security advisories from both commercial vendors and security researchers. This investigation addresses our primary research question: Which cluster algorithm and embedding combination is suitable to reduce CERT personnel's information overload (RQ)?
Contributions. This work advances current research through two primary contributions: (i) the introduction of ThreatReport, a novel labeled threat report corpus (C1), and (ii) a comprehensive performance comparison of 14 clustering algorithms on the created embeddings across the five diverse datasets (C2).
Outline. The remainder of this paper is structured as follows: Section 2 examines related work and identifies research gaps. Section 3 details our methodology, followed by our comprehensive evaluation results in Section 4. Section 5 discusses findings and limitations, while Section 6 summarizes our contributions.
2 RELATED WORK
We present related work in embeddings, clustering, and evaluation, culminating in the identification of our research gap.
Embeddings. Embedding methods transform data points into vector representations where similarity is preserved through spatial proximity. These range from simple word frequency approaches to sophisticated language models encoding semantic relationships [36, 49]. Document-level encoding presents unique challenges for threat intelligence processing. Traditional approaches include Bag of Words (BoW), which records absolute term frequencies using a global vocabulary, and Term Frequency-Inverse Document Frequency (TF-IDF), which weights terms by their document frequency [49]. Recent approaches use BERT [7], with Sentence-BERT (SBERT) specifically optimized for embedding longer text units [36]. The MTEB benchmark provides comprehensive performance comparisons of differently-sized large language models (LLMs), including clustering efficacy [33].
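To make the contrast between frequency-based and transformer-based embeddings concrete, the following minimal sketch encodes a few advisory-style sentences with both TF-IDF and an SBERT model. The example texts are invented, and the model choice (all-mpnet-base-v2, one of the models later listed in Table 2) merely stands in for whichever embedding model is used in practice.

# Minimal sketch: sparse TF-IDF vs. dense SBERT document embeddings.
# Assumes scikit-learn and sentence-transformers are installed; texts are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

docs = [
    "Critical RCE vulnerability patched in Exchange Server.",
    "Microsoft releases fix for a remote code execution flaw in Exchange.",
    "New phishing campaign targets banking customers.",
]

# TF-IDF: one sparse vector per document, weighted by term rarity.
tfidf = TfidfVectorizer().fit_transform(docs)
print("TF-IDF matrix shape:", tfidf.shape)          # (3, vocabulary size)

# SBERT: one dense vector per document, capturing semantic similarity.
sbert = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
dense = sbert.encode(docs, normalize_embeddings=True)  # (3, 768)

# Cosine similarity of the two near-duplicate advisories (vectors are normalized).
print("similarity of docs 0 and 1:", float(dense[0] @ dense[1]))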
Clustering. Clustering algorithms group data points based on similarity metrics such as cosine distance or silhouette scores [13, 41, 49]. Traditional methods range from centroid-based K-Means [25], which requires a predefined cluster count, to density-based Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [8], which supports arbitrary cluster shapes. Recent research explores deep learning approaches that leverage intermediate representations [20, 21, 27, 28, 30, 37, 46]. In security contexts, clustering facilitates log summarization [10], Android permission analysis [29], and cybersecurity event detection in social media [39], using techniques ranging from locality-sensitive hashing to neural networks. Vulnerability management benefits from clustering through alternative vulnerability classification [2].
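As a rough illustration of the two traditional algorithm families, the sketch below clusters a matrix of document embeddings with centroid-based K-Means and density-based DBSCAN. It assumes scikit-learn and NumPy; the random embeddings and all parameters (n_clusters, eps, min_samples) are placeholders, not the tuned values from our evaluation.

# Minimal sketch: centroid-based vs. density-based clustering of embeddings.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(200, 384))   # stand-in for SBERT document vectors

# K-Means: requires the number of clusters up front and assigns every point.
kmeans_labels = KMeans(n_clusters=5, random_state=0).fit_predict(embeddings)

# DBSCAN: no cluster count needed, supports arbitrary cluster shapes,
# and marks points in low-density regions as outliers (label -1).
dbscan_labels = DBSCAN(eps=0.5, min_samples=5, metric="cosine").fit_predict(embeddings)

print("K-Means clusters:", len(set(kmeans_labels)))
print("DBSCAN outliers :", int(np.sum(dbscan_labels == -1)))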
Evaluation. Text clustering evaluation employs both internal metrics (assessing compactness and separability without ground truth) [38] and external metrics (requiring labeled data) [49]. Rosenberg and Hirschberg [40] highlight limitations of traditional metrics like purity and entropy, particularly for edge cases. The V-measure framework [40] combines homogeneity (cluster label consistency) and completeness (label distribution) metrics, providing a comprehensive clustering quality assessment. Recent frameworks [22] integrate multiple algorithms, datasets, and metrics for systematic evaluation.
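A minimal sketch of this external evaluation, using scikit-learn's implementation of the V-measure family on invented toy labels, illustrates what homogeneity and completeness reward.

# Minimal sketch: external clustering evaluation against ground-truth labels.
# Homogeneity rewards clusters that contain a single class; completeness rewards
# classes that end up in a single cluster. Label arrays below are toy examples.
from sklearn.metrics import homogeneity_completeness_v_measure

ground_truth = [0, 0, 0, 1, 1, 2, 2, 2]   # gold labels of eight documents
predicted    = [0, 0, 1, 1, 1, 2, 2, 2]   # cluster assignments from an algorithm

h, c, v = homogeneity_completeness_v_measure(ground_truth, predicted)
print(f"homogeneity={h:.2f} completeness={c:.2f} v_measure={v:.2f}")
# With equal weighting, the V-measure is the harmonic mean: v = 2 * h * c / (h + c).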
Table 1: This table outlines the structural information of the datasets c ∈ {CySecAlert, MSE, ThreatReport, SBR, SMS}. L_c is the sequence length len(dp_i) in characters for all dp_i ∈ c. It shows the size |L_c|, the average length mean(L_c), the median median(L_c), the minimum min(L_c) and maximum max(L_c) data point length, and the number of ground-truth clusters (#L_c) of c.

c              |L_c|    mean(L_c)  median(L_c)  min(L_c)  max(L_c)  #L_c
CySecAlert     13 306   136        119          6         486       2
MSE            3 001    284        277          57        686       2
ThreatReport   461      4 370      3 366        7         26 853    39
SBR            5 000    887        458          29        32 785    5
SMS            5 574    80         61           2         910       2
Research Gap. While existing research addresses clustering of security information, it primarily focuses on short-form content (e.g., social media posts) [39] or traditional embedding methods [19, 39]. No comprehensive evaluation exists that compares modern embedding-clustering combinations for longer security texts, such as security advisories or threat reports. This gap is particularly significant given the increasing volume and complexity of security documentation requiring efficient processing by CERT personnel.
3 METHODOLOGY
We present the data used in this work and the requirements for document embeddings, clustering algorithms, and evaluation metrics.
3.1 Text Corpora
This study employs multiple datasets to evaluate the selected clustering algorithms across three distinct use cases: (I) effectiveness in processing threat-related short messages and threat reports, (II) performance in handling security bug reports (SBRs) across diverse products, and (III) comparative analysis on non-security short messages. Exemplar texts from each corpus are presented in Listing 1, while Table 1 provides a comprehensive overview of the datasets' structural characteristics.
For security-centric analysis, we utilize three primary datasets: CySecAlert [39], Microsoft Exchange (MSE) [5], and ThreatReport (self-labeled). The CySecAlert and MSE datasets comprise security-related short messages extracted from X (formerly Twitter). The ThreatReport dataset encompasses security-related content aggregated from news outlets and security feeds. While the former two datasets are representative of CERT data aggregation in crises, the third represents data from the daily work of CERTs. In both areas the volume of information has increased tremendously in recent years, while understaffing has remained at a high level [14]. For product-specific analysis, the SBR dataset contains security-related messages from issue trackers spanning five distinct products [43]. To establish a baseline for general text classification, we incorporate the UCI SMS Spam Collection [1], which features characteristics common to security domain texts, including abbreviations, non-standard nomenclature, and spam content.
CyberRange: The Open-Source AWS Cyber Range [...]
(a) Example text for the CySecAlert dataset (use-case I).

SMBs need to take immediate action on #microsoft #exchange #vulnerabilities [URL] [...]
(b) Example text for the Microsoft Exchange dataset (use-case I).

New CacheWarp AMD CPU attack lets hackers gain root in Linux VMs - November 14, 2023 - 03:34 PM - 2 A new software-based fault injection attack, CacheWarp, can let threat actors hack into AMD SEV-protected [...]
(c) Example text for the ThreatReport dataset (use-case I).

SYSCS_UTIL.SYSCS_COMPRESS_TABLE should create statistics if they do not exist There must be an entry in the SYSSTATISTICS table in order for the cardinality statistics in SYSSTATISTICS to be created with SYSCS_UTIL.SYSCS_COMPRESS_TABLE SYSCS_UTIL.SYSCS_COMPRESS_TABLE should create statistics if they don't exist. [...]
(d) Example text for the security bug report dataset (use-case II).

Auction round 4. The highest bid is now £54. Next maximum bid is £71. To bid, send BIDS e.g. 10 (to bid £10) to 83383. Good luck
(e) Example text for the SMS dataset (use-case III).

Listing 1: Example texts from the different evaluation datasets (CySecAlert, MSE, ThreatReport, SBR, and SMS).

The labeling of ThreatReport was done by two researchers in the field of information security. After a first independent labeling of 10 data points, both researchers discussed their labeling process and aligned differences. Afterward, both researchers continued labeling independently, each on one half of the dataset.
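The structural statistics reported in Table 1 can be reproduced along the following lines; the corpus here is a hypothetical list of (text, label) pairs, and the actual loading step depends on how each dataset is distributed.

# Minimal sketch: per-corpus structural statistics as reported in Table 1.
# `corpus` is a hypothetical list of (text, label) pairs; loading is not shown.
import statistics

corpus = [
    ("New CacheWarp AMD CPU attack lets hackers gain root ...", "cachewarp"),
    ("SMBs need to take immediate action on #microsoft #exchange ...", "exchange"),
]

lengths = [len(text) for text, _ in corpus]                  # len(dp_i) in characters
print("size      :", len(corpus))                            # |L_c|
print("mean      :", statistics.mean(lengths))               # mean(L_c)
print("median    :", statistics.median(lengths))             # median(L_c)
print("min / max :", min(lengths), max(lengths))              # min(L_c), max(L_c)
print("# clusters:", len({label for _, label in corpus}))    # ground-truth clusters, #L_c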
3.2 Operational Requirements for Automated CERT Information Processing
The exponential growth in security-related data requires CERTs to conduct extensive manual analysis daily, leading to significant operational strain [12, 17]. This sustained cognitive load frequently results in professional burnout and potential workforce attrition. The automation of routine analytical tasks presents a critical opportunity for operational improvement, particularly in the identification of duplicate and related information within incoming data streams. Research indicates that security personnel estimate "half of their tasks to all of their tasks could be automated today" [12]. Drawing from multiple empirical studies [4, 12, 17], we establish the following core requirements for an effective CERT clustering system.
Table 2: Overview of the LLMs used in combination with SBERT (sorted alphabetically).

Huggingface Model ID                        Params
Alibaba-NLP/gte-base-en-v1.5                137M
jxm/cde-small-v1                            281M
markusbayer/cysecbert                       110M
meta-llama/Llama-3.2-1B                     1.24B
meta-llama/Llama-3.2-3B                     3.21B
mistralai/Mistral-7B-v0.1                   7.24B
mistralai/Mistral-7B-v0.3                   7.25B
NovaSearch/stella_en_1.5b_v5                1.54B
NovaSearch/stella_en_400M_v5                435M
nvidia/NV-Embed-v2                          7.85B
sentence-transformers/all-MiniLM-L12-v1     33.4M
sentence-transformers/all-mpnet-base-v2     109M
thenlper/gte-large                          335M
sentence-transformers/gtr-t5-xxl            4.86B

R1 Reducing information overload for CERTs: The clustering system must demonstrably reduce the volume of information requiring manual review through effective cluster consolidation. Cluster homogeneity must be maximized through rigorous outlier management. The presence of misclassified data points would significantly compromise cluster integrity and negate the intended benefits of information reduction. Therefore, the system must prioritize classification accuracy over cluster completeness.
R2 Unburden CERTs: The proposed algorithms must operate with minimal configuration requirements, eliminating the need for continuous model adjustment as new vulnerabilities or technologies emerge. Model fine-tuning represents a significant operational overhead. While potentially more engaging than routine document review, such tasks divert resources from core responsibilities: threat identification, analysis, and stakeholder communication.
R3 Retention of data: All alerts must remain accessible, regardless of their cluster assignability. The system must preserve outliers during analysis rather than forcing them into inappropriate clusters. Otherwise, important information might either be missed due to being assigned to an outlier cluster, or confuse personnel if found in the wrong cluster. Either case would diminish the benefits of clustering the data due to lost trust in the system. This requirement aligns with information overload reduction by enabling a discrete outlier cluster for manual review, rather than discarding or misclassifying these data points.
R4 Runtime performance: While not the primary optimization target, the system must complete clustering operations on new data and present results within an operationally acceptable timeframe, defined here as several minutes. The system might be run on demand by CERT personnel in preparation for the daily review of inbound information. A minimal sketch illustrating these requirements follows below.
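The following minimal sketch shows how these requirements could translate into a clustering workflow: little configuration (R2), an explicit outlier bucket instead of forced assignments (R1, R3), and a measured runtime (R4). The embedding model, the DBSCAN parameters, and the example documents are assumptions for illustration, not the configuration evaluated in this paper.

# Minimal sketch of a requirement-oriented clustering run.
import time
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

documents = [
    "Critical RCE vulnerability patched in Exchange Server.",
    "Microsoft releases fix for a remote code execution flaw in Exchange.",
    "New CacheWarp attack targets AMD SEV-protected virtual machines.",
    "Reminder: quarterly all-hands meeting moved to Friday.",
]

start = time.perf_counter()

# R2: a fixed embedding model and two clustering parameters, no fine-tuning.
embeddings = SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode(
    documents, normalize_embeddings=True
)
labels = DBSCAN(eps=0.4, min_samples=2, metric="cosine").fit_predict(embeddings)

# R1, R3: unclusterable items (label -1) go into a discrete outlier bucket;
# no data point is discarded.
groups = defaultdict(list)
for doc, label in zip(documents, labels):
    key = "outliers (manual review)" if label == -1 else f"cluster {label}"
    groups[key].append(doc)

# R4: report wall-clock runtime; it should stay within minutes on consumer hardware.
print(f"runtime: {time.perf_counter() - start:.1f}s")
for key, docs in groups.items():
    print(key, "->", len(docs), "documents")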
3.3 Embeddings, Clustering, and Evaluation
For the embedding process, we evaluate a diverse range of locally deployable LLMs. The selected models span from lightweight architectures with 33.4M parameters (all-MiniLM-L12-v1) to large-scale models with 7.85B parameters (NV-Embed-v2) [6, 15, 23, 3