An Empirical Analysis of SMS Scam Detection Systems

Muhammad Salman
Macquarie University
Sydney, Australia
muhammad.salman2@students.mq.edu.au
Muhammad Ikram
Macquarie University
Sydney, Australia
muhammad.ikram@mq.edu.au
Mohamed Ali Kaafar
Macquarie University
Sydney, Australia
dali.kaafar@mq.edu.au
ABSTRACT
The short message service (SMS) was introduced to mobile phone users
a generation ago. Its users make up the world's oldest large-scale
network, with billions of users, and the service therefore attracts a
lot of fraud. Due to the convergence of mobile networks with the
internet, SMS-based scams can potentially compromise the security of
internet services as well. In this study, we present a new SMS scam
dataset consisting of 153,551 SMSes. This dataset, which we will
release publicly for research purposes, represents the largest
publicly available SMS scam dataset. We evaluate and compare the
performance achieved by several established machine learning methods
on the new dataset, ranging from shallow machine learning approaches
to deep neural networks to syntactic and semantic feature models. We
then study the existing models from an adversarial viewpoint by
assessing their robustness against different levels of adversarial
manipulation. This perspective consolidates the current state of the
art in SMS Spam filtering and highlights the limitations of, and the
opportunities to improve, the existing approaches.
1 INTRODUCTION
SMS scams have reached alarming volumes in recent years. In the US in
2021, an estimated USD 86 million was lost to SMS scammers (reported
by the Federal Trade Commission¹). Similarly, the Australian
Competition and Consumer Commission (ACCC)'s ScamWatch body² reported
a near-doubling of annual losses between 2020 (AUD 175 million) and
2021 (AUD 323 million). In 2021, there were 67,180 SMS fraud reports,
up from 32,337 reported in 2020; more than 8,835 SMS scams were
reported in February 2022 alone, the highest among all scam delivery
methods. Despite almost two decades of research on SMS Spam detection
[1–8], it continues to be a challenging and serious issue for our
modern digital societies.
Researchers have developed various techniques to combat SMS Spam, and
there are several surveys of these attempts [8–10]. However, unlike
email spam detection, it is difficult to identify the appropriate
system and to directly compare the systems proposed in the literature,
since most of them were evaluated on different and highly outdated
datasets. Although traditional ML and deep learning (DL) methods have
substantially improved Spam detection in SMS, scammers' dynamic
behavior frequently defeats security barriers. Recent studies have
revealed different adversarial attacks in the text domain that can
effectively evade various ML-based text analyzers [8, 11–13]. We found
certain methodological issues in current SMS Spam detection research
that call for considerably more work. For example, we observe:
• The use of highly outdated and imbalanced datasets with only a few
  hundred Spam messages.
• The lack of a benchmark, which has resulted in partial and fragmented
  evaluation across the literature.
• The lack of comprehensive quantification and comparison of different
  classes of ML, such as DL, positive and unlabeled (PU) learning, and
  traditional two-class and one-class ML. The same is observed for
  different sets of syntactic and semantic features.
• Importantly, the literature overlooks new challenges that arise from
  the convergence of Internet and Telecom, especially scammers' new
  adversarial attempts to evade detection.
1 https://www.fcc.gov/covid-19-text-scams
2 https://www.scamwatch.gov.au/scam-statistics
The focus of this paper is to investigate the landscape of fraudulent
SMS messages, the existing methods for combating SMS scams, and
state-of-the-art research in SMS Spam defense; to point out the open
challenges; to analyse how current tactics may be abused by scammers;
and subsequently to identify the requirements that are crucial to an
acceptable solution. Therefore, we assess the performance and
robustness of the SMS Spam models proposed in the literature on the
new super dataset (discussed in Section 3), based on the requirements
of the security domain, which we elaborate upon in Section 6, to
determine the appropriateness of the proposed solutions. To our
knowledge, a comprehensive experimental evaluation of the
appropriateness of previous research on SMS Spam from the security
perspective has not been done before. By filling this important gap,
this study will help drive the development of effective SMS Spam
defenses, as well as provide a framework for evaluating SMS Spam
methods. The major contributions of our paper are:
Large scale Spam Dataset. We make available a new, real, public, and
large SMS Spam corpus collected from various sources over the previous
decade, which is the largest labeled dataset as far as we are aware.
We aggregate and characterise publicly available Spam datasets. Given
that the benchmark dataset [14] used for SMS Spam research is outdated
and incomplete, consisting of only 5,574 messages with 13% Spam and
87% Ham (or legitimate) messages, it is not suitable for detecting the
most recent, stealthy SMS Spam, even using advanced machine learning
or DL techniques. We propose data collection methods to crawl 55,686
SMS messages from Spam observatories such as ScamWatch³ and Action
Fraud⁴. We labelled the dataset into Spam and non-Spam messages,
constituting 15,209 (27.31%) and 40,477 (72.69%) SMSes, respectively.
3 https://www.scamwatch.gov.au/
4 https://www.actionfraud.police.uk/
arXiv:2210.10451v1 [cs.CR] 19 Oct 2022
Evaluation of machine learning models. Usually, supervised text
classification methods (like binary classification) or DL have been
used to identify SMS Spam. However, given the imbalanced nature of SMS
Spam datasets, we propose one-class learning (unsupervised learning)
and positive unlabeled (PU) learning (semi-supervised learning)
algorithms alongside other ML models. To the best of our knowledge,
one-class learning and PU learning have never before been explored for
SMS Spam detection. We evaluate and compare the performance of PU
learning and one-class learning models with both traditional two-class
ML and state-of-the-art neural network and DL models, such as the
latest transformer-based architecture [15]. Furthermore, we analysed
and compared the performance of several well-known ML methods in order
to establish a good baseline for future comparisons. Additionally, we
evaluated the ML models over the new dataset with different sets of
features and word embeddings, from non-semantic count-based vector
space models (Count Vectorization, TF-IDF), to semantic
non-context-based vector space models (Word2Vec, fastText, GloVe) in
static and dynamic modes, to semantic context-based vector space
models (BERT, ELMo, etc.).
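To make the difference between the two count-based feature models
concrete, the following is a minimal sketch, not the pipeline used in
this study. It assumes lowercase whitespace tokenization and the
unsmoothed textbook weighting idf = log(N / df); library
implementations typically use smoothed variants. The three messages
and all function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real pipelines do more.
    return text.lower().split()

def count_vectors(messages):
    """Return the sorted vocabulary and one raw term-count vector per message."""
    tokenized = [tokenize(m) for m in messages]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

def tfidf_vectors(messages):
    """Weight each count by idf = log(N / df), an unsmoothed textbook variant."""
    vocab, counts = count_vectors(messages)
    n_docs = len(messages)
    # Document frequency: number of messages containing each vocabulary term.
    df = [sum(1 for vec in counts if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return vocab, [[c * w for c, w in zip(vec, idf)] for vec in counts]

# Hypothetical toy messages, not taken from the dataset.
messages = [
    "claim your free prize now",
    "are you free for lunch",
    "your parcel is waiting claim now",
]
vocab, counts = count_vectors(messages)
_, tfidf = tfidf_vectors(messages)
```

In this toy corpus "free" occurs in two of the three messages and
"prize" in only one, so TF-IDF gives "prize" the larger weight even
though both have the same raw count in the first message.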
Robustness analysis of Spam models in the face of new challenges. To
evade detection, Spammers may leverage obfuscation or perturbation
methods [16] to Smish [17] or Punycode [18] Spam SMS texts and
embedded URLs. We evaluate the robustness of ML models against four
levels of adversarial perturbation: character-level, word-level,
sentence-level, and multi-level attacks. To the best of our knowledge,
such an extensive adversarial assessment of ML models on Spam SMS has
not been performed before.
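As an illustration of the two simplest perturbation levels, the sketch
below applies a character-level and a word-level perturbation to a
message and checks them against a toy keyword filter. It is only a toy
under stated assumptions: the look-alike map, the synonym table, the
blocklist, and the filter itself are all hypothetical and are not the
attacks or models evaluated in this paper.

```python
# Hypothetical look-alike map; real attacks use much larger homoglyph tables.
LOOKALIKES = {"a": "@", "e": "3", "i": "1", "o": "0", "s": "$"}

# Hypothetical synonym table for the word-level perturbation.
SYNONYMS = {"free": "complimentary", "prize": "reward", "claim": "collect"}

def char_level(text, targets):
    """Obfuscate characters, but only inside the targeted trigger words."""
    out = []
    for word in text.split():
        if word.lower() in targets:
            word = "".join(LOOKALIKES.get(ch.lower(), ch) for ch in word)
        out.append(word)
    return " ".join(out)

def word_level(text):
    """Swap trigger words for synonyms that a keyword filter may not know."""
    return " ".join(SYNONYMS.get(w.lower(), w) for w in text.split())

def keyword_filter(text, blocklist):
    """Toy detector: flags a message if any blocklisted word appears."""
    return any(w.lower() in blocklist for w in text.split())

blocklist = {"free", "prize", "claim"}
original = "Claim your free prize today"
char_attack = char_level(original, blocklist)   # "Cl@1m your fr33 pr1z3 today"
word_attack = word_level(original)              # "collect your complimentary reward today"
```

The toy filter flags the original message but neither perturbed
variant, while a human reader can still recover the intent of both.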
2 RESEARCH CHALLENGES AND RELATED WORK
In the following, we discuss the challenges and give an overview of
state-of-the-art SMS Spam filtering and detection methods in the
literature.
2.1 Challenges
Availability of data. The lack of an up-to-date, genuine, large, and
publicly available SMS Spam dataset is a major concern. SMS Spam
filtering lacks good dataset sources, and the data used to evaluate
detection systems lacks diversity. Recency of the data is another
challenge. The datasets used by researchers in the literature are
extremely outdated, which raises questions about their suitability for
the present SMS scam landscape, since they do not represent the latest
attacks (see Table 1). Other concerns are message ambiguity (due to
short length), limited header information, and the presence of emojis
and abbreviations in the text; as a result, established email Spam
filters may have their performance seriously degraded when directly
applied to mobile Spam.
Lack of benchmark. Different methods have been proposed by researchers
to study SMS Spam detection and adversarial attacks on text, but there
is no benchmark [19, 20]. Moreover, researchers have used different
datasets in their work, making it difficult to compare these methods.
This also affects the selection of evaluation metrics: there is no
clear statement about which metric is better in a given situation and
why it is more useful than others.
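A small worked example shows why the choice of metric matters on
imbalanced data. The sketch below is illustrative only: it assumes a
hypothetical test set of 100 messages with the 13% Spam / 87% Ham
ratio of the benchmark dataset mentioned earlier, and a degenerate
classifier that labels every message as Ham.

```python
def confusion(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the Spam class."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when nothing is predicted or labelled Spam.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical test set mirroring a 13% Spam / 87% Ham split.
y_true = ["spam"] * 13 + ["ham"] * 87
always_ham = ["ham"] * 100
result = metrics(y_true, always_ham)
```

The degenerate classifier reaches 87% accuracy while its Spam recall
and F1 are both zero, which is why accuracy figures reported on
imbalanced datasets are hard to compare without class-specific
metrics.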
Robustness against new challenges and attacks. Another challenge in
Spam detection is the dynamic behavior of scammers, who constantly
look for ways to deceive Spam filters using different adversarial
attacks. Moreover, due to the convergence of mobile networks with the
internet, Spammers are likely to adopt the techniques used for
tricking users on the internet.
Lack of strategy and collaboration. One of the less highlighted issues
in effectively tackling Spam SMS is the lack of collaboration between
researchers, end users, network providers, and industry, as well as
the absence of a robust strategy for dealing with threats to the
security of mobile users and for Spam SMS detection.
2.2 Related Work
To eectively handle the threat posed by SMS Spam, several tech-
niques have been proposed in the academic literature and white
papers in the industry. However, none of these studies provides a
complete picture of the SMS Spam problem, though there are re-
sources that handle part of the problem or try to reduce the problem
to a single dimension. Importantly, there is no previous systematic
empirical analysis of SMS scam detection systems.
Abdulhamid et al. [19] presented a review of the currently available
methods, challenges, and future research directions on SMS Spam
filtering and detection techniques. Wang et al. [21] conducted an
empirical survey on word embeddings. Buchanan and Grant [1] offered a
brief description of Nigerian scam schemes, emphasizing that the
growth of the Internet has aided the proliferation of cyber-crime. To
classify Spam texts, Almeida et al. [5] used the SVM classifier. They
used word frequency as a feature and found that SVM performed very
well. Another study [22] used SVM classifiers and k-Nearest Neighbor
(kNN) to classify Spam text messages. The experimental results
confirmed that a combination of BoW features and structural features
performed better for classifying Spam messages. Some researchers have
also proposed evolutionary methods for SMS Spam detection by
assimilating the byte-level features of SMS [23]. Researchers have
recently started to use deep neural networks. To filter SMS Spam,
Popovac et al. [24] proposed a CNN-based architecture with one layer
of convolution and pooling and achieved a 98.4% accuracy rate. Jain et
al. [25] used a long short-term memory (LSTM) model for SMS Spam
filtering. Their model attained 99.01% accuracy with the help of 6000
features and 200 LSTM nodes. They also employed various word embedding
techniques, including ConceptNet, WordNet, and Word2Vec. Another work
[26] proposed several deep neural network-based models and attained an
accuracy of 98.51%.
Despite the preceding algorithms' success in identifying Spam
messages, the existence of adversaries significantly degrades Spam
filters' performance [27]. Graham-Cumming presented an attack against
an individual user's Spam filter in a talk at the 2004 MIT Spam
Conference [28]. In this attack, random words were added to Spam
mailings. A prior paper [3] attempted to broaden the scope of this
attack by substituting regularly used English words with random
phrases. Since then, many adversarial attacks and countermeasures have
been documented in a range of applications. Attacks on text
classification tasks are typically carried out by altering
characteristics or changing the content of the text sequence [20].
Several strategies have been identified in the field of adversarial
attacks on Spam filters [4, 29]: injection of Ham words, obfuscation
of Spam words, poisoning, alteration of labels, and synonym
replacement. The impact of dictionary-based attacks and well-informed
concentrated attacks, on the other hand, can be mitigated by using
classifier weights [30].
In many ways, our research differs from that of our predecessors. No
previous work, to our knowledge, has attempted to systematize and
experimentally evaluate the SMS Spam literature from the perspective
of security challenges.
3 INTRODUCING THE SUPER SMS SPAM CORPUS
The Super SMS Spam corpus is a dataset of labeled SMS messages
(reported over the last decade) that we collected for Spam research
from public, free-for-research sources. In any scientific study,
having the most up-to-date, reliable, and representative data is
critical. The most widely used SMS datasets for Spam research,
including the NUS SMS Corpus⁵ and the UCI SMS Spam Collection⁶, are
extremely outdated. Due to the exponential growth of online services
and the ongoing COVID-19 epidemic, the threat landscape has evolved
significantly in recent years, necessitating an updated dataset
covering all the latest scams.
3.1 Data Collection
A comprehensive survey was conducted in order to identify, collect,
and aggregate all of the public SMS datasets. As a result, we
consolidated a corpus of 153,551 SMS instances from public and
free-for-research sources. Details of the different SMS datasets
aggregated in the consolidated dataset are given in Table 1.
Importantly, we manually crawled images of the latest scam messages
publicly shared on Twitter and those reported to scam observatories,
including Scamwatch (the Australian Competition and Consumer
Commission's website for public education against scams) and Action
Fraud (the UK's national reporting center for fraud and cyber-crime)⁷,
to cover the recent landscape of SMS scams (categories and campaigns
[31]). In this way, 71 scam messages from Twitter and 141 Spam SMS
messages from scam observatories were added to the corpus.
3.2 Data Augmentation
All SMS messages in the consolidated dataset and the newly gathered
Spam messages from various volunteers and observatories were manually
labelled using a set of carefully designed rules (the list of rules
and details of their derivation are given in Appendix A.1). We
eliminated discrepancies (duplicate and non-English messages) in the
consolidated dataset and refined the data into a format suitable for
further analysis by performing a series of processing steps. To this
end, we first filter out SMS messages in non-English languages, using
a two-pass filtering mechanism to remove the large number of
non-English SMS messages from the consolidated dataset. In the first
pass, we use langdetect (an open source Python library for language
detection) [32] to determine the language of each SMS and filter out
SMSes in non-English languages. We then pass the SMS messages retained
by langdetect to Googletrans (Google API for language detection and
translation) [33] to further filter out non-English SMS messages. The
Googletrans API was kept for the second pass because of the limit on
API calls for free users.
5 https://github.com/kite1988/nus-sms-corpus
6 https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
7 https://www.actionfraud.police.uk/
Next, we removed duplicate messages by importing the SMS corpus into a
MySQL table and deleting the duplicate rows. Lastly, we manually
labelled all of the remaining unlabeled SMS messages (more than
47,022) as Spam or Ham (legitimate). Overall, we obtained a dataset of
52,814 unique English-language SMSes from the consolidated dataset.
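For readers without a database at hand, the deduplication step can be
approximated in memory. The sketch below is a substitute for, not a
reproduction of, the MySQL-based procedure described above. The
optional normalization (lowercasing and collapsing whitespace) is an
assumption of this sketch; the text does not state whether duplicates
were matched exactly or after normalization.

```python
def deduplicate(messages, normalize=True):
    """Return messages with duplicates removed, keeping first occurrences.

    With normalize=True, two messages count as duplicates when they are
    equal after lowercasing and collapsing whitespace (an assumption of
    this sketch). With normalize=False, only exact matches are removed.
    """
    seen = set()
    unique = []
    for msg in messages:
        key = " ".join(msg.lower().split()) if normalize else msg
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique

# Hypothetical sample messages.
sample = ["Win a prize", "win  a PRIZE", "Hello", "Hello"]
loose = deduplicate(sample)                    # ["Win a prize", "Hello"]
strict = deduplicate(sample, normalize=False)  # ["Win a prize", "win  a PRIZE", "Hello"]
```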
Finally, the images crawled from Twitter and scam observatories were
converted to text using an online OCR tool⁸. Table 2 presents the
"Super Dataset", which aggregates all SMSes from the augmented
dataset, scam observatories, Twitter, and volunteers. The Super
Dataset consists of 53,286 SMSes, with 72.69% Ham and 27.31% Spam
SMSes.
Table 1: Overview of SMS Spam datasets consolidated to generate an
augmented dataset.
Dataset                   # of SMSes  Language               Labeling    Year
UCI [5, 6]                5,574       English                Labelled    2012
NUS [34]                  67,063      English, Chinese       Unlabelled  2015
Github1 [35]              77,039      English, Roman, Hindi  Unlabelled  2019
Github2 [36]              557         English                Labelled    2018
Gupta [7]                 3,318       English, Roman Hindi   Labelled    2018
Consolidated              153,551     Multi                  Partial     -
Consolidated [Augmented]  52,814      English                Partial     -
Table 2: Characterisation of Super Dataset.
Dataset                                # of SMSes  Language  Labeling
Consolidated [Augmented] (cf Table 1)  52,814      English   Partial
DS7 [Volunteers]                       260         English   Labelled
DS8 [Scamwatch, ActionFraud]           141         English   Labelled
DS9 [Twitter]                          71          English   Labelled
Super Dataset                          53,286      English   Labelled
4 ANALYSIS METHODOLOGY
To identify the state of the art in preventing SMS scams, we gathered
existing techniques and countermeasures from academia, industry, and
the internet domain, and systematically categorized them. Figure 1
depicts our experimental methodology for the comparative analysis of
various feature models and machine learning techniques.
4.1 Data Split
Splitting the dataset is essential to build a reliable ML model and
for an unbiased evaluation of classification performance. Common
techniques for training and testing involve k-fold cross validation or
splitting the original dataset into training (usually 70%-80%) and
testing (usually 20%-30%) data. In this study, we divided the dataset
into three subsets: train (80%), test (20%), and hold-out. The
hold-out split was created by randomly selecting 225 Spam SMS messages
from across the dataset. The train set was used to fit the ML model,
whereas the test set was used to estimate the performance of the model
on data not used to train the model.
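The split described above can be sketched as follows. This is a
minimal illustration under stated assumptions: the hold-out Spam
messages are assumed to be removed before the 80/20 split, the split
is a plain random shuffle with no stratification, and the seed, the
function name, and the toy corpus are hypothetical.

```python
import random

def split_dataset(records, holdout_spam=225, train_frac=0.8, seed=0):
    """Split (text, label) pairs into train, test, and a Spam-only hold-out.

    Assumption of this sketch: the hold-out messages are drawn first and
    excluded from both the train and test sets.
    """
    rng = random.Random(seed)
    spam_idx = [i for i, (_, label) in enumerate(records) if label == "spam"]
    holdout_idx = set(rng.sample(spam_idx, holdout_spam))
    holdout = [records[i] for i in sorted(holdout_idx)]
    rest = [r for i, r in enumerate(records) if i not in holdout_idx]
    rng.shuffle(rest)
    cut = int(len(rest) * train_frac)
    return rest[:cut], rest[cut:], holdout

# Hypothetical toy corpus: 300 Spam and 700 Ham messages.
records = [(f"spam message {i}", "spam") for i in range(300)]
records += [(f"ham message {i}", "ham") for i in range(700)]
train, test, holdout = split_dataset(records)
```

Fixing the seed makes the three subsets reproducible, so every model
is compared on the same test and hold-out messages.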
8 https://www.onlineocr.net/