
algorithm and embedding combination is suitable to reduce CERT personnel's information overload (RQ)?
Contributions. This work advances current research through two primary contributions: (i) the introduction of ThreatReport, a novel labeled threat report corpus (C1), and (ii) a comprehensive performance comparison of 14 clustering algorithms on the created embeddings across the five diverse datasets (C2).
Outline. The remainder of this paper is structured as follows: Section 2 examines related work and identifies research gaps. Section 3 details our methodology, followed by our comprehensive evaluation results in Section 4. Section 5 discusses findings and limitations, while Section 6 summarizes our contributions.
2 RELATED WORK
We present related work in embeddings, clustering, and evaluation, culminating in the identification of our research gap.
Embeddings. Embedding methods transform data points into vector representations where similarity is preserved through spatial proximity. These range from simple word frequency approaches to sophisticated language models encoding semantic relationships [36, 49]. Document-level encoding presents unique challenges for threat intelligence processing. Traditional approaches include Bag of Words (BoW), which records absolute term frequencies using a global vocabulary, and Term Frequency-Inverse Document Frequency (TF-IDF), which weights terms by their document frequency [49]. Recent approaches use BERT [7], with Sentence-BERT (SBERT) specifically optimized for embedding longer text units [36]. The MTEB benchmark provides comprehensive performance comparisons of differently-sized large language models (LLMs), including clustering efficacy [33].
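To illustrate the two embedding families, the following minimal sketch contrasts a sparse TF-IDF representation with dense SBERT embeddings. The example texts and the model name all-MiniLM-L6-v2 are illustrative assumptions, not the configuration evaluated in this paper.

from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

docs = [
    "Critical remote code execution flaw reported in mail server.",
    "Vendor releases patch for mail server RCE vulnerability.",
    "New phishing campaign targets banking customers.",
]

# Sparse lexical representation: terms weighted by document frequency.
tfidf_vectors = TfidfVectorizer().fit_transform(docs)  # shape: (3, vocabulary size)

# Dense semantic representation via a Sentence-BERT model (assumed example model).
sbert = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sbert.encode(docs)                         # shape: (3, 384)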
Clustering. Clustering algorithms group data points based on similarity metrics such as cosine distance or silhouette scores [13, 41, 49]. Traditional methods range from centroid-based K-Means [25], which requires a predefined cluster count, to density-based Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [8], which supports arbitrary cluster shapes. Recent research explores deep learning approaches that leverage intermediate representations [20, 21, 27, 28, 30, 37, 46]. In security contexts, clustering facilitates log summarization [10], Android permission analysis [29], and cybersecurity event detection in social media [39], using techniques ranging from locality-sensitive hashing to neural networks. Vulnerability management benefits from clustering through alternative vulnerability classification [2].
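As a minimal sketch of the two traditional algorithm families named above, the scikit-learn snippet below applies K-Means (cluster count fixed in advance) and DBSCAN (cluster count inferred, noise allowed). The embeddings variable is assumed to come from the previous sketch, and the parameter values are illustrative assumptions rather than the settings used in our evaluation.

from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import normalize

# Unit-normalize so Euclidean distance approximates cosine distance.
X = normalize(embeddings)

# Centroid-based: the number of clusters k must be specified up front.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density-based: k is inferred; low-density points are labeled -1 (noise).
dbscan_labels = DBSCAN(eps=0.6, min_samples=2).fit_predict(X)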
Evaluation. Text clustering evaluation employs both internal metrics (assessing compactness and separability without ground truth) [38] and external metrics (requiring labeled data) [49]. Rosenberg and Hirschberg [40] highlight limitations of traditional metrics like purity and entropy, particularly for edge cases. The V-measure framework [40] combines homogeneity (cluster label consistency) and completeness (label distribution) metrics, providing a comprehensive clustering quality assessment. Recent frameworks [22] integrate multiple algorithms, datasets, and metrics for systematic evaluation.
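A minimal sketch of these external metrics, assuming scikit-learn's reference implementations and toy ground-truth and predicted label assignments, is shown below.

from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

truth = [0, 0, 1, 1, 2, 2]   # ground-truth cluster labels (toy example)
pred  = [0, 0, 1, 1, 1, 2]   # labels produced by some clustering algorithm

h = homogeneity_score(truth, pred)    # does each cluster contain only one class?
c = completeness_score(truth, pred)   # is each class confined to one cluster?
v = v_measure_score(truth, pred)      # harmonic mean of homogeneity and completeness
print(f"homogeneity={h:.2f}, completeness={c:.2f}, v-measure={v:.2f}")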
Table 1: This table outlines the structural information of the datasets c ∈ [CySecAlert, MSE, ThreatReport, SBR, SMS]. L_c is the sequence length len(dp_i) in characters for all dp_i ∈ c. It shows the size |L_c|, the average length L̄_c, the median L̃_c, the minimum ⌊L_c⌋ and maximum ⌈L_c⌉ data point length, and the number of ground truth clusters (#L_c) of c.

c              |L_c|    L̄_c     L̃_c    ⌊L_c⌋   ⌈L_c⌉   #L_c
CySecAlert    13 306     136     119       6      486      2
MSE            3 001     284     277      57      686      2
ThreatReport     461   4 370   3 366       7   26 853     39
SBR            5 000     887     458      29   32 785      5
SMS            5 574      80      61       2      910      2
Research Gap. While existing research addresses clustering of security information, it primarily focuses on short-form content (e.g., social media posts) [39] or traditional embedding methods [19, 39]. No comprehensive evaluation exists that compares modern embedding-clustering combinations for longer security texts, such as security advisories or threat reports. This gap is particularly significant given the increasing volume and complexity of security documentation requiring efficient processing by CERT personnel.
3 METHODOLOGY
We present the data used in this work and the requirements for document embeddings, clustering algorithms, and evaluation metrics.
3.1 Text Corpora
This study employs multiple datasets to evaluate the selected clus-
tering algorithms across three distinct use cases: (I) eectiveness in
processing threat-related short messages and threat reports, (II) per-
formance in handling security bug report (
SBR
) across diverse prod-
ucts, and (III) comparative analysis on non-security short messages.
Exemplar texts from each corpus are presented in Listing 1, while
Table 1 provides a comprehensive overview of the datasets’ struc-
tural characteristics.
For security-centric analysis, we utilize three primary datasets: CySecAlert [39], Microsoft Exchange (MSE) [5], and ThreatReport (self-labeled). The CySecAlert and MSE datasets comprise security-related short messages extracted from X (formerly Twitter). The ThreatReport dataset encompasses security-related content aggregated from news outlets and security feeds. While the former two datasets are representative of CERT data aggregations during crises, the third represents data from the daily work of CERTs. In both areas, the volume of information has increased tremendously in recent years, while understaffing has remained at a high level [14]. For product-specific analysis, the SBR dataset contains security-related messages from issue trackers spanning five distinct products [43]. To establish a baseline for general text classification, we incorporate the UCI SMS Spam Collection [1], which features characteristics common to security domain texts, including abbreviations, non-standard nomenclature, and spam content.
The labeling of ThreatReport was done by two researchers in the field of information security. After the first independent labeling