
algorithm and embedding combination is suitable to reduce CERT personnel's information overload (RQ)?
Contributions. This work advances current research through two primary contributions: (i) the introduction of ThreatReport, a novel labeled threat report corpus (C1), and (ii) a comprehensive performance comparison of 14 clustering algorithms on the created embeddings across the five diverse datasets (C2).
Outline. The remainder of this paper is structured as follows: Section 2 examines related work and identifies research gaps. Section 3 details our methodology, followed by our comprehensive evaluation results in Section 4. Section 5 discusses findings and limitations, while Section 6 summarizes our contributions.
2 RELATED WORK
We present related work in embeddings, clustering, and evaluation, culminating in the identification of our research gap.
Embeddings. Embedding methods transform data points into vector representations where similarity is preserved through spatial proximity. These range from simple word frequency approaches to sophisticated language models encoding semantic relationships [36, 49]. Document-level encoding presents unique challenges for threat intelligence processing. Traditional approaches include Bag of Words (BoW), which records absolute term frequencies using a global vocabulary, and Term Frequency-Inverse Document Frequency (TF-IDF), which weights terms by their document frequency [49]. Recent approaches use BERT [7], with Sentence-BERT (SBERT) specifically optimized for embedding longer text units [36]. The MTEB benchmark provides comprehensive performance comparisons of differently-sized large language models (LLMs), including clustering efficacy [33].
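To illustrate the two embedding families, the following minimal sketch contrasts a sparse TF-IDF representation with dense SBERT embeddings. The example texts and the model name all-MiniLM-L6-v2 are illustrative assumptions, not the configuration evaluated in this paper.

from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

docs = [
    "Critical remote code execution flaw reported in mail server.",
    "Vendor releases patch for mail server RCE vulnerability.",
    "New phishing campaign targets banking customers.",
]

# Sparse lexical representation: terms weighted by document frequency.
tfidf_vectors = TfidfVectorizer().fit_transform(docs)  # shape: (3, vocabulary size)

# Dense semantic representation via a Sentence-BERT model (assumed example model).
sbert = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sbert.encode(docs)                         # shape: (3, 384)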
Clustering. Clustering algorithms group data points based on similarity metrics such as cosine distance or silhouette scores [13, 41, 49]. Traditional methods range from centroid-based K-Means [25], which requires a predefined cluster count, to density-based Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [8], which supports arbitrary cluster shapes. Recent research explores deep learning approaches that leverage intermediate representations [20, 21, 27, 28, 30, 37, 46]. In security contexts, clustering facilitates log summarization [10], Android permission analysis [29], and cybersecurity event detection in social media [39], using techniques ranging from locality-sensitive hashing to neural networks. Vulnerability management benefits from clustering through alternative vulnerability classification [2].
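As a minimal sketch of the two traditional algorithm families named above, the scikit-learn snippet below applies K-Means (cluster count fixed in advance) and DBSCAN (cluster count inferred, noise allowed). The embeddings variable is assumed to come from the previous sketch, and the parameter values are illustrative assumptions rather than the settings used in our evaluation.

from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import normalize

# Unit-normalize so Euclidean distance approximates cosine distance.
X = normalize(embeddings)

# Centroid-based: the number of clusters k must be specified up front.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density-based: k is inferred; low-density points are labeled -1 (noise).
dbscan_labels = DBSCAN(eps=0.6, min_samples=2).fit_predict(X)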
Evaluation. Text clustering evaluation employs both internal metrics (assessing compactness and separability without ground truth) [38] and external metrics (requiring labeled data) [49]. Rosenberg and Hirschberg [40] highlight limitations of traditional metrics like purity and entropy, particularly for edge cases. The V-measure framework [40] combines homogeneity (cluster label consistency) and completeness (label distribution) metrics, providing a comprehensive clustering quality assessment. Recent frameworks [22] integrate multiple algorithms, datasets, and metrics for systematic evaluation.
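A minimal sketch of these external metrics, assuming scikit-learn's reference implementations and toy ground-truth and predicted label assignments, is shown below.

from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

truth = [0, 0, 1, 1, 2, 2]   # ground-truth cluster labels (toy example)
pred  = [0, 0, 1, 1, 1, 2]   # labels produced by some clustering algorithm

h = homogeneity_score(truth, pred)    # does each cluster contain only one class?
c = completeness_score(truth, pred)   # is each class confined to one cluster?
v = v_measure_score(truth, pred)      # harmonic mean of homogeneity and completeness
print(f"homogeneity={h:.2f}, completeness={c:.2f}, v-measure={v:.2f}")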
Table 1: This table outlines the structural information of the datasets c ∈ [CySecAlert, MSE, ThreatReport, SBR, SMS]. L_c is the sequence length len(dp_i) in characters for all dp_i ∈ c. It shows the size |L_c|, the average length L̄_c, the median L̃_c, the minimum ⌊L_c⌋ and maximum ⌈L_c⌉ data point length, and the number of ground truth clusters (#L_c) of c.

c              |L_c|    L̄_c     L̃_c    ⌊L_c⌋   ⌈L_c⌉   #L_c
CySecAlert    13 306     136     119       6      486      2
MSE            3 001     284     277      57      686      2
ThreatReport     461   4 370   3 366       7   26 853     39
SBR            5 000     887     458      29   32 785      5
SMS            5 574      80      61       2      910      2
Research Gap. While existing research addresses clustering of security information, it primarily focuses on short-form content (e.g., social media posts) [39] or traditional embedding methods [19, 39]. No comprehensive evaluation exists that compares modern embedding-clustering combinations for longer security texts, such as security advisories or threat reports. This gap is particularly significant given the increasing volume and complexity of security documentation requiring efficient processing by CERT personnel.
3 METHODOLOGY
We present the data used in this work and the requirements for document embeddings, clustering algorithms, and evaluation metrics.
3.1 Text Corpora
This study employs multiple datasets to evaluate the selected clus-
tering algorithms across three distinct use cases: (I) eectiveness in
processing threat-related short messages and threat reports, (II) per-
formance in handling security bug report (
SBR
) across diverse prod-
ucts, and (III) comparative analysis on non-security short messages.
Exemplar texts from each corpus are presented in Listing 1, while
Table 1 provides a comprehensive overview of the datasets’ struc-
tural characteristics.
For security-centric analysis, we utilize three primary datasets: CySecAlert [39], Microsoft Exchange (MSE) [5], and ThreatReport (self-labeled). The CySecAlert and MSE datasets comprise security-related short messages extracted from X (formerly Twitter). The ThreatReport dataset encompasses security-related content aggregated from news outlets and security feeds. While the former two datasets are representative of CERT data aggregations during crises, the third represents data from the daily work of CERTs. In both areas, the volume of information has increased tremendously in recent years, while understaffing has remained at a high level [14]. For product-specific analysis, the SBR dataset contains security-related messages from issue trackers spanning five distinct products [43]. To establish a baseline for general text classification, we incorporate the UCI SMS Spam Collection [1], which features characteristics common to security domain texts, including abbreviations, non-standard nomenclature, and spam content.
The labeling of ThreatReport was done by two researchers in the field of information security. After the first independent labeling