An Empirical Analysis of SMS Scam Detection Systems

Muhammad Salman

Macquarie University

Sydney, Australia

muhammad.salman2@students.mq.edu.au

Muhammad Ikram

Macquarie University

Sydney, Australia

muhammad.ikram@mq.edu.au

Mohamed Ali Kaafar

Macquarie University

Sydney, Australia

dali.kaafar@mq.edu.au

ABSTRACT

The short message service (SMS) was introduced to mobile phone users a generation ago. SMS forms the world’s oldest large-scale network, with billions of users, and it therefore attracts a lot of fraud. Due to the convergence of mobile networks with the Internet, SMS-based scams can potentially compromise the security of Internet services as well. In this study, we present a new SMS scam dataset consisting of 153,551 SMSes. This dataset, which we will release publicly for research purposes, represents the largest publicly available SMS scam dataset. We evaluate and compare the performance achieved by several established machine learning methods on the new dataset, ranging from shallow machine learning approaches to deep neural networks to syntactic and semantic feature models. We then study the existing models from an adversarial viewpoint by assessing their robustness against different levels of adversarial manipulation. This perspective consolidates the current state of the art in SMS Spam filtering and highlights the limitations of, and the opportunities to improve, the existing approaches.

1 INTRODUCTION

SMS scams have reached alarming volumes in recent years. In the US in 2021, an estimated USD 86 million was lost to SMS scammers (reported by the Federal Trade Commission¹). Similarly, the Australian Competition and Consumer Commission (ACCC)’s ScamWatch body² reported a near-doubling of annual losses between 2020 (AUD 175 million) and 2021 (AUD 323 million). In 2021, there were 67,180 SMS fraud reports, up from 32,337 in 2020; more than 8,835 SMS scams were reported in February 2022 alone, the highest among all scam delivery methods. Despite almost two decades of research on SMS Spam detection [–], it continues to be a challenging and serious issue for our modern digital societies.

Researchers have developed various techniques to combat SMS Spam, and there are several surveys of these attempts [–]. However, unlike email spam detection, it is difficult to identify the appropriate system and to directly compare the systems proposed in the literature, since most of them used highly outdated and different datasets for evaluation. Although traditional ML and deep learning (DL) methods have substantially improved Spam detection in SMS, scammers’ dynamic nature frequently defeats security barriers. Recently, studies have revealed different adversarial attacks in the text domain, which can effectively evade various ML-based text analyzers [–]. We found that current SMS Spam detection research has certain methodological issues and that considerably more research is required. For example, we observe:

• The use of highly outdated and imbalanced datasets with only a few hundred Spam messages

• The lack of a benchmark, which has resulted in partial and fragmented evaluation in the literature

• The lack of comprehensive quantification and comparison of different classes of ML, such as DL, positive and unlabeled (PU) learning, and traditional two-class and one-class ML; the same holds for different sets of syntactic and semantic features

• Importantly, the literature overlooks new challenges that arise from the convergence of the Internet and telecom, especially scammers’ new adversarial attempts to evade detection

¹ https://www.fcc.gov/covid-19-text-scams

² https://www.scamwatch.gov.au/scam-statistics

The focus of this paper is to investigate the landscape of fraudulent SMS messages, the existing methods for combating SMS scams, and state-of-the-art research in SMS Spam defense; to point out the open challenges; and to analyse how current tactics may be abused by scammers, thereby identifying the requirements that are crucial to an acceptable solution. We therefore assess the performance and robustness of the SMS Spam models proposed in the literature on the new super dataset (discussed in Section 3), based on the requirements of the security domain, which we elaborate upon in Section 6, to determine the appropriateness of the proposed solutions. To our knowledge, a comprehensive experimental evaluation of the appropriateness of previous research on SMS Spam from the security perspective has not been done before. By filling this important gap, this study will help drive the development of effective SMS Spam defenses, as well as provide a framework for evaluating SMS Spam methods. The major contributions of our paper are:

Large-scale Spam dataset. We make available a new, real, public, and large SMS Spam corpus gathered from various sources over the previous decade; as far as we are aware, it is the largest labeled dataset. We aggregate and characterise publicly available Spam datasets. Given that the benchmark dataset [ ] used for SMS Spam research is outdated and incomplete, consisting of only 5,574 messages with 13% Spam and 87% Ham (or legitimate) messages, it is not suitable for detecting the most recent, stealthy SMS Spam, even using advanced machine learning or DL techniques. We propose data collection methods to crawl 55,686 SMS messages from Spam observatories such as ScamWatch³ and Action Fraud⁴. We labelled the dataset into Spam and non-Spam messages, constituting 15,209 (27.31%) and 40,477 (72.69%) SMSes, respectively.

³ https://www.scamwatch.gov.au/

⁴ https://www.actionfraud.police.uk/

arXiv:2210.10451v1 [cs.CR] 19 Oct 2022

Oct, 2022, Muhammad Salman, Muhammad Ikram, and Mohamed Ali Kaafar

Evaluation of machine learning models. Usually, supervised text classification methods (such as binary classification) or DL have been used to identify SMS Spam. However, given the imbalanced nature of SMS Spam datasets, we propose one-class learning (unsupervised learning) and positive unlabeled (PU) learning (semi-supervised learning) algorithms alongside other ML models. To the best of our knowledge, the notion of one-class learning and
PU learning for SMS Spam detection has not been explored before. We evaluate and compare the performance of PU learning and one-class learning models with both traditional two-class ML and state-of-the-art neural network and DL models, such as the latest transformer-based architectures [ ]. Furthermore, we analysed and compared the performance of several well-known ML methods in order to establish a good baseline for future comparisons. Additionally, we evaluated the ML models on the new dataset with different sets of features and word embeddings, ranging from non-semantic count-based vector space models (Count Vectorization, TF-IDF), to semantic non-context-based vector space models (Word2Vec, fastText, GloVe) in static and dynamic modes, to semantic context-based vector space models (BERT, ELMo, etc.).
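The two count-based feature models named above can be illustrated with a minimal sketch. It assumes whitespace tokenization and the textbook weighting tf × ln(N/df); the function names and the three toy messages are hypothetical and are not taken from the study, whose actual feature pipeline is not shown in this section.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def count_vectors(docs):
    """Return the sorted vocabulary and one raw term-count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf_vectors(docs):
    """Weight each count by idf = ln(N / df), one common textbook variant."""
    vocab, counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for vec in counts if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return vocab, [[c * w for c, w in zip(vec, idf)] for vec in counts]


# Toy messages, invented for illustration only.
docs = [
    "win a free prize now",
    "are you free for lunch",
    "claim your prize now",
]
vocab, counts = count_vectors(docs)
_, weights = tfidf_vectors(docs)
```

Under this weighting, a term that occurs in a single message ("win") receives a larger weight than one shared by two messages ("free"), whereas raw counts treat both identically. Both representations ignore word order and meaning, which is the gap the embedding-based models are meant to address.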

Robustness analysis of Spam models in the face of new challenges. To evade detection, Spammers may leverage obfuscation or perturbation methods [ ] to Smish [ ] or Punycode [ ] Spam SMS texts and embedded URLs. We evaluate the robustness of ML models against four levels of adversarial perturbation: char-level, word-level, sentence-level, and multi-level attacks. To the best of our knowledge, such an extensive adversarial assessment of ML models on Spam SMS has not been performed before.
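The four perturbation levels can be pictured with a small sketch. Everything in it is an assumption made for illustration: the substitution tables, the benign sentence, and the function names are hypothetical, and the attack implementations actually evaluated in the study are not described in this section.

```python
# Hypothetical substitution tables, chosen only for illustration.
CHAR_MAP = {"a": "@", "e": "3", "i": "1", "o": "0", "s": "$"}
SYNONYMS = {"free": "complimentary", "win": "receive", "prize": "reward"}
BENIGN_SENTENCE = "See you at lunch tomorrow."


def char_level(text, targets):
    """Replace look-alike characters, but only inside the targeted words."""
    out = []
    for word in text.split():
        if word.lower() in targets:
            word = "".join(CHAR_MAP.get(ch.lower(), ch) for ch in word)
        out.append(word)
    return " ".join(out)


def word_level(text):
    """Swap words that have an entry in the toy synonym table."""
    return " ".join(SYNONYMS.get(w.lower(), w) for w in text.split())


def sentence_level(text):
    """Append a benign-looking sentence to dilute the message."""
    return f"{text} {BENIGN_SENTENCE}"


def multi_level(text, targets):
    """Compose the three levels: word swap, then character swap, then padding."""
    return sentence_level(char_level(word_level(text), targets))


message = "Claim your free prize"
char_attack = char_level(message, {"free", "prize"})
word_attack = word_level(message)
sentence_attack = sentence_level(message)
multi_attack = multi_level(message, {"claim"})
```

Each variant stays readable to a human recipient while changing the tokens a classifier sees. Words with attached punctuation are not matched by this simple whitespace split, which a fuller implementation would need to handle.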

2 RESEARCH CHALLENGES AND RELATED WORK

In the following, we discuss the challenges and give an overview of state-of-the-art SMS Spam filtering and detection methods in the literature.

2.1 Challenges

Availability of data. The lack of an up-to-date, genuine, large, and public SMS Spam dataset is a major concern. SMS Spam filtering lacks good dataset sources, and the data used to evaluate detection systems lacks diversity. Recency of the data is another challenge. The datasets used by researchers in the literature are extremely outdated, which raises questions about their suitability for the present SMS scam landscape, since they are not representative of the latest attacks (see Table 1). Another concern is that SMS messages are ambiguous due to their short length, carry limited header information, and contain emojis and abbreviations; established email Spam filters may therefore have their performance seriously degraded when applied directly to mobile Spam.

Lack of benchmark. Researchers have proposed different methods to study SMS Spam detection and adversarial attacks on text, but there is no benchmark [ ]. Moreover, researchers have used different datasets in their work, making it difficult to compare these methods. This also affects the selection of evaluation metrics: there is no clear statement about which metric is better in a given situation and why it is more useful than the others.
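Why the choice of metric matters on imbalanced data can be seen in a short worked example. It assumes a hypothetical set of 1,000 messages with the 13%/87% Spam/Ham ratio quoted for the benchmark dataset; the confusion-matrix counts and the function name are invented for illustration and are not results from the study.

```python
def metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 for the Spam class."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when no Spam is predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical set: 130 Spam and 870 Ham messages.
# A trivial filter that labels every message as Ham:
all_ham = metrics(tp=0, fp=0, tn=870, fn=130)

# A filter that catches 100 of the 130 Spam messages with 20 false alarms:
real_filter = metrics(tp=100, fp=20, tn=850, fn=30)
```

The trivial filter reaches 87% accuracy while detecting no Spam at all, so accuracy alone says little on such data; recall and F1 on the Spam class expose the difference between the two filters.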

Robustness against new challenges and attacks. Another challenge in Spam detection is the dynamic behavior of scammers, who constantly look for ways to deceive Spam filters using different adversarial attacks. Moreover, due to the convergence of mobile networks with the Internet, Spammers are likely to adapt the techniques used for tricking users on the Internet.

Lack of strategy and collaboration. One of the less highlighted issues in effectively tackling Spam SMS is the lack of collaboration between researchers, end users, network providers, and industry, as well as the absence of a robust strategy for dealing with threats to the security of mobile users and for Spam SMS detection.

2.2 Related Work

To effectively handle the threat posed by SMS Spam, several techniques have been proposed in the academic literature and in industry white papers. However, none of these studies provides a complete picture of the SMS Spam problem, though there are resources that handle part of the problem or try to reduce the problem to a single dimension. Importantly, there is no previous systematic empirical analysis of SMS scam detection systems.

Abdulhamid et al. [ ] presented a review of the currently available methods, challenges, and future research directions on SMS Spam filtering and detection techniques. Wang et al. [ ] conducted an empirical survey on word embeddings. Buchanan and Grant [ ] offered a brief description of Nigerian scam schemes, emphasizing that the growth of the Internet has aided the proliferation of cyber-crime. To classify Spam texts, Almeida et al. [ ] used the SVM classifier. They used word frequency as a feature and found that SVM performed very well. Another study [ ] used SVM and k-Nearest Neighbor (kNN) classifiers to classify Spam text messages. The experimental results confirmed that combining BoW features and structural features to classify Spam messages performed better. Some researchers have also proposed evolutionary methods for SMS Spam detection that assimilate the byte-level features of SMS [ ]. Researchers have recently started to use deep neural networks. To filter SMS Spam, Popovac et al. [ ] proposed a CNN-based architecture with one layer of convolution and pooling and achieved an accuracy of 98.4%. Jain et al. [ ] used a long short-term memory (LSTM) model for SMS Spam filtering. Their model attained an accuracy of 99.01% with the help of 6000 features and 200 LSTM nodes. They also employed various word embedding techniques, including ConceptNet, WordNet, and Word2Vec. The authors of [ ] proposed several deep neural network-based models and attained an accuracy of 98.51%.

Despite the preceding algorithms’ success in identifying Spam messages, the existence of adversaries significantly degrades Spam filters’ performance [ ]. Graham-Cumming presented an attack against an individual user’s Spam filter in a talk at the 2004 MIT Spam Conference [ ]. In this attack, random words were added to Spam emails. A prior paper [ ] attempted to broaden the scope of this attack by substituting regularly used English words with random phrases. Since then, many adversarial attacks and countermeasures have been documented in a range of applications. Attacks on text classification tasks are typically carried out by altering features or changing the content of the text sequence [ ]. Several strategies have been identified in the field of adversarial attacks on Spam filters [ ]: injection of Ham words, obfuscation of Spam words, poisoning, alteration of labels, and synonym replacement. The impact of dictionary-based attacks and well-informed focused attacks, on the other hand, can be mitigated by using classifier weights [30].
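The effect of Ham-word injection on a word-based filter can be sketched with a toy example. It assumes a multinomial naive Bayes score with Laplace smoothing trained on five invented messages; the training data, function names, and padding words are hypothetical, and the sketch does not reproduce any of the filters or attacks cited above.

```python
import math
from collections import Counter

# Invented training messages, for illustration only.
SPAM = ["win free prize now", "free cash prize"]
HAM = ["see you at lunch", "lunch at noon tomorrow", "call me tomorrow"]


def train(spam_docs, ham_docs):
    """Collect per-class token counts, the vocabulary, and the log prior odds."""
    spam_counts = Counter(tok for d in spam_docs for tok in d.split())
    ham_counts = Counter(tok for d in ham_docs for tok in d.split())
    vocab = set(spam_counts) | set(ham_counts)
    prior = math.log(len(spam_docs) / len(ham_docs))
    return spam_counts, ham_counts, vocab, prior


def log_odds(message, model):
    """Return log P(spam | msg) - log P(ham | msg); positive means Spam."""
    spam_counts, ham_counts, vocab, prior = model
    spam_total = sum(spam_counts.values()) + len(vocab)  # Laplace smoothing
    ham_total = sum(ham_counts.values()) + len(vocab)
    score = prior
    for tok in message.lower().split():
        if tok not in vocab:
            continue  # unseen tokens carry no evidence in this toy model
        p_spam = (spam_counts[tok] + 1) / spam_total
        p_ham = (ham_counts[tok] + 1) / ham_total
        score += math.log(p_spam / p_ham)
    return score


model = train(SPAM, HAM)
original = "free prize now"
# Attack: pad the Spam message with words that are frequent in Ham.
attacked = original + " at lunch tomorrow at lunch tomorrow"

original_score = log_odds(original, model)
attacked_score = log_odds(attacked, model)
```

Every injected Ham word contributes a negative term to the score, so enough padding pushes the message across the decision boundary even though the Spam content is unchanged.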

Our research differs from previous work in many ways. To our knowledge, no previous work has attempted to systematize and experimentally evaluate the SMS Spam literature from the perspective of security challenges.

3 INTRODUCING THE SUPER SMS SPAM CORPUS

The Super SMS Spam corpus is a dataset of labeled SMS messages (reported over the last decade) that we collected for Spam research from public and free-for-research sources. In any scientific study, having the most up-to-date, reliable, and representative data is critical. The most widely used SMS datasets for Spam research, including the NUS SMS Corpus⁵ and the UCI SMS Spam Collection⁶, are extremely outdated. Due to the exponential growth of online services and the ongoing COVID-19 epidemic, the threat landscape has evolved significantly in recent years, necessitating an updated dataset covering all the latest scams.

3.1 Data Collection

A comprehensive survey was conducted in order to identify, collect, and aggregate all of the public SMS datasets. As a result, we consolidated a corpus of 153,551 SMS instances from public and free-for-research sources. Details of the different SMS datasets aggregated in the consolidated dataset are given in Table 1. Importantly, we manually crawled images of the latest scam messages publicly shared on Twitter and those reported to scam observatories, including Scamwatch (the Australian Competition and Consumer Commission’s website for public education against scams) and Action Fraud (the UK’s national reporting center for fraud and cyber-crime)⁷, to cover the recent landscape of SMS scams (categories and campaigns [ ]). In this way, 71 scam messages from Twitter and 141 Spam SMS messages from scam observatories were added to the corpus.

3.2 Data Augmentation

All SMS messages in the consolidated dataset and the newly gathered Spam messages from various volunteers and observatories were manually labelled using a set of carefully designed rules (the list of rules and details of their derivation are given in Appendix A.1). We eliminated discrepancies (duplicate and non-English messages) in the consolidated dataset and refined the data into a format suitable for further analysis by performing a series of processing steps. First, we filter out SMS messages in non-English languages using a two-pass filtering mechanism, since the consolidated dataset contains a large number of non-English SMS messages. In the first pass, we use langdetect (an open-source Python library for language detection) [ ] to determine the language of each SMS and filter out SMSes in non-English languages. We then pass the filtered SMS messages returned by langdetect to Googletrans (a Google API for language detection and translation) [ ] to further filter out non-English SMS messages. The Googletrans API was kept for the second phase of filtering due to the limit on API calls for free users.
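The two-pass design can be sketched as follows. To keep the example self-contained, the two detectors are hypothetical stand-ins (an ASCII check and a tiny word list) rather than langdetect and Googletrans themselves; only the control flow, in which the rate-limited second detector sees just the survivors of the first pass, reflects the description above.

```python
def two_pass_english_filter(messages, first_detect, second_detect):
    """Keep messages that both detectors label as English ("en").

    The second (rate-limited) detector is only called on messages that
    survive the first pass, which limits the number of API calls.
    """
    survivors = [m for m in messages if first_detect(m) == "en"]
    return [m for m in survivors if second_detect(m) == "en"]


# Hypothetical stand-in detectors, for illustration only.
ROMANIZED_HINTS = {"kya", "hai", "nahi"}
second_pass_calls = []


def toy_first_detect(text):
    """Cheap first pass: treat any non-ASCII message as non-English."""
    return "en" if text.isascii() else "other"


def toy_second_detect(text):
    """Costlier second pass: flag messages containing romanized hint words."""
    second_pass_calls.append(text)  # record how often this pass is invoked
    words = set(text.lower().split())
    return "other" if words & ROMANIZED_HINTS else "en"


messages = [
    "Your parcel is waiting",
    "你好，今天见",
    "kya haal hai",
    "Call me tomorrow",
]
english = two_pass_english_filter(messages, toy_first_detect, toy_second_detect)
```

In a real pipeline the two callables would wrap the actual libraries; ordering the unrestricted detector first is what keeps the number of calls to the quota-limited service small.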

⁵ https://github.com/kite1988/nus-sms-corpus

⁶ https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

⁷ https://www.actionfraud.police.uk/

Next, we removed duplicate messages by importing the SMS corpus into a MySQL table and deleting the duplicate rows. Lastly, we manually labelled all of the remaining unlabeled SMS messages (more than 47,022) as Spam or Ham (legitimate). Overall, we obtain a dataset of 52,814 unique English-language SMSes from the consolidated dataset.
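A table-based de-duplication step of this kind can be sketched as below. The sketch substitutes Python's built-in sqlite3 module for MySQL so that it runs without a database server, and it assumes that duplicates are exact, case-sensitive matches; the table name, column names, and function name are hypothetical, and the authors' actual schema and matching rule are not given here.

```python
import sqlite3


def deduplicate(messages):
    """Load messages into a table and delete rows whose body already occurred."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sms (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")
    conn.executemany("INSERT INTO sms (body) VALUES (?)", [(m,) for m in messages])
    # Keep the earliest row for each distinct body and delete the rest.
    conn.execute(
        "DELETE FROM sms WHERE id NOT IN "
        "(SELECT MIN(id) FROM sms GROUP BY body)"
    )
    rows = conn.execute("SELECT body FROM sms ORDER BY id").fetchall()
    conn.close()
    return [body for (body,) in rows]


unique = deduplicate(["Free prize", "Call me", "Free prize", "free prize"])
```

Because the comparison is exact, "Free prize" and "free prize" are kept as separate messages; normalizing case or whitespace before insertion would be a stricter policy, at the risk of merging messages that a labeller would consider distinct.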

Finally, the images crawled from Twitter and the scam observatories were converted to text using an online OCR tool⁸. Table 2 presents the “Super Dataset”, which aggregates all SMSes from the augmented dataset, scam observatories, Twitter, and volunteers. The Super Dataset consists of 53,286 SMSes, with 72.69% Ham and 27.31% Spam SMSes.

Table 1: Overview of SMS Spam datasets consolidated to generate an augmented dataset.

Dataset                   # of SMSes  Language               Labeling    Year
UCI [5, 6]                5,574       English                Labelled    2012
NUS [34]                  67,063      English, Chinese       Unlabelled  2015
Github1 [35]              77,039      English, Roman, Hindi  Unlabelled  2019
Github2 [36]              557         English                Labelled    2018
Gupta [7]                 3,318       English, Roman Hindi   Labelled    2018
Consolidated              153,551     Multi                  Partial     -
Consolidated [Augmented]  52,814      English                Partial     -

Table 2: Characterisation of Super Dataset.

Dataset                                # of SMSes  Language  Labeling
Consolidated [Augmented] (cf Table 1)  52,814      English   Partial
DS7 [Volunteers]                       260         English   Labelled
DS8 [Scamwatch, ActionFraud]           141         English   Labelled
DS9 [Twitter]                          71          English   Labelled
Super Dataset                          53,286      English   Labelled

4 ANALYSIS METHODOLOGY

To identify the state of the art in preventing SMS scams, we gathered existing techniques and countermeasures from academia, industry, and the Internet domain, and systematically categorized them. Figure 1 depicts our experimental methodology for the comparative analysis of various feature models and machine learning techniques.

4.1 Data Split

Splitting the dataset is essential to build a reliable ML model and to obtain an unbiased evaluation of classification performance. Common techniques for training and testing involve k-fold cross-validation or splitting the original dataset into training (usually 70%-80%) and testing (usually 20%-30%) data. In this study, we divided the dataset into three subsets: train (80%), test (20%), and hold-out. The hold-out split was created by randomly selecting 225 Spam SMS messages from across the dataset. The train set was used to fit the ML model, whereas the test set was used to estimate the performance of the model on data not used to train it.
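The three-way split can be sketched as follows. The order of operations is an assumption: the sketch first draws the 225 hold-out Spam messages and then divides the remainder 80/20, which is one reading of the description above. The function name, the fixed seed, and the synthetic records are hypothetical.

```python
import random


def split_dataset(records, holdout_spam=225, train_frac=0.8, seed=0):
    """Split (text, label) records into train, test, and a Spam-only hold-out."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    spam_idx = [i for i, (_, label) in enumerate(records) if label == "spam"]
    # Assumption: the hold-out is drawn first, from Spam messages only.
    holdout_idx = set(rng.sample(spam_idx, holdout_spam))
    holdout = [records[i] for i in sorted(holdout_idx)]
    rest = [r for i, r in enumerate(records) if i not in holdout_idx]
    rng.shuffle(rest)
    cut = int(len(rest) * train_frac)
    return rest[:cut], rest[cut:], holdout


# Synthetic stand-in data: 725 Spam and 3,500 Ham records.
records = [(f"spam message {i}", "spam") for i in range(725)]
records += [(f"ham message {i}", "ham") for i in range(3500)]
train, test, holdout = split_dataset(records)
```

Keeping the hold-out disjoint from both train and test means it can later serve as unseen Spam, for example as the starting point for adversarial variants. The plain shuffle used here does not preserve the class ratio exactly in each part; a stratified split would.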

⁸ https://www.onlineocr.net/
