Oct, 2022, Muhammad Salman, Muhammad Ikram, and Mohamed Ali Kaafar
PU learning for SMS Spam detection has not been explored before.
We evaluate and compare the performance of PU learning and one-class
learning models against both traditional two-class ML models and
state-of-the-art neural network and DL models, such as the latest
transformer-based architectures [15]. Furthermore, we analysed and
compared the performance of several well-known ML methods in order to
establish a good baseline for future comparisons. Additionally, we
evaluated the ML models on the new dataset with different sets of
features and word embeddings, ranging from non-semantic count-based
vector space models (Count Vectorization, TF-IDF), to semantic
non-context-based vector space models (Word2Vec, fastText, GloVe) in
static and dynamic modes, to semantic context-based vector space
models (BERT, ELMo, etc.).
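To make the simplest of these feature families concrete, the following is a minimal pure-Python sketch of count vectorization and TF-IDF weighting. The toy messages, the whitespace tokenizer, and the smoothed IDF formula ln((1 + N) / (1 + df)) + 1 are illustrative assumptions; the sketch is not the feature pipeline used in the evaluation.

```python
import math
from collections import Counter

# Toy messages, invented purely for illustration.
corpus = [
    "win a free prize now",
    "free entry win cash",
    "are we meeting for lunch",
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def count_vectors(docs):
    """Return the sorted vocabulary and one raw term-count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

def tfidf_vectors(docs):
    """Weight raw counts by a smoothed inverse document frequency (assumed form)."""
    vocab, counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for vec in counts if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    return vocab, [[c * w for c, w in zip(vec, idf)] for vec in counts]

vocab, counts = count_vectors(corpus)
_, weights = tfidf_vectors(corpus)
```

Under this weighting, a term that occurs in a single message (such as "lunch") receives a larger weight than one shared by several messages (such as "free"), which is the property that distinguishes TF-IDF from raw counts.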
Robustness analysis of Spam models in the face of new challenges.
To evade detection, Spammers may leverage obfuscation or perturbation
methods [16] to Smish [17] or Punycode-encode [18] Spam SMS texts and
embedded URLs. We evaluate the robustness of ML models against four
levels of adversarial perturbations: char-level, word-level,
sentence-level, and multi-level attacks. To the best of our knowledge,
such an extensive adversarial assessment of ML models on Spam SMS has
not been performed before.
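The four perturbation levels can be illustrated with a short sketch. The look-alike character map, the synonym table, the benign suffix, and the reading of "multi-level" as a chain of the other three are all hypothetical choices made for illustration; they are not the attack implementations evaluated here.

```python
import random

# Hypothetical look-alike substitutions; real attacks use far richer maps.
HOMOGLYPHS = {"a": "@", "e": "3", "i": "1", "o": "0", "s": "$"}

def char_level(text, rate=0.3, seed=0):
    """Replace a random fraction of substitutable characters with look-alikes."""
    rng = random.Random(seed)
    return "".join(
        HOMOGLYPHS[ch] if ch in HOMOGLYPHS and rng.random() < rate else ch
        for ch in text
    )

def word_level(text, synonyms):
    """Swap whole words using a caller-supplied synonym table."""
    return " ".join(synonyms.get(word, word) for word in text.split())

def sentence_level(text, benign_suffix):
    """Append a benign-looking sentence to dilute Spam signals."""
    return text + " " + benign_suffix

def multi_level(text, synonyms, benign_suffix, rate=0.3, seed=0):
    """Chain word-, sentence-, and char-level perturbations (assumed composition)."""
    out = word_level(text, synonyms)
    out = sentence_level(out, benign_suffix)
    return char_level(out, rate=rate, seed=seed)
```

For example, `char_level("free prize", rate=1.0)` substitutes every mapped character, while `rate=0.0` leaves the message unchanged; intermediate rates depend on the seed.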
2 RESEARCH CHALLENGES AND RELATED WORK
In the following, we discuss challenges and give an overview of
state-of-the-art SMS Spam filtering and detection methods in the
literature.
2.1 Challenges
Availability of data.
The lack of an up-to-date, genuine, large, and publicly available SMS
Spam dataset is a major concern. SMS Spam filtering research lacks
good dataset sources, and the data used to evaluate detection systems
lacks diversity. Recency of the data is another challenge. The
datasets used by researchers in the literature are extremely outdated,
which raises questions about their suitability for the present SMS
scam landscape, since they are not representative of the latest
attacks (see Table 1). Another concern is message ambiguity due to
short length, limited header information, and the presence of emojis
and abbreviations in the text; as a result, established email Spam
filters may have their performance seriously degraded when applied
directly to mobile Spam.
Lack of benchmark.
Different methods have been proposed by researchers to study SMS Spam
detection and adversarial attacks on text, but there is no benchmark
[19, 20]. Moreover, researchers have used different datasets in their
work, making it difficult to compare these methods. This also affects
the selection of evaluation metrics: there is no clear statement about
which metric is better suited to a given situation and why it is more
useful than others.
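To illustrate why the choice of metric matters, the sketch below computes accuracy, precision, recall, and F1 from binary labels. The label encoding (1 for Spam, 0 for Ham) and the toy predictions are assumptions chosen for illustration, not results from any dataset.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy imbalanced sample: 2 Spam (1) and 8 Ham (0) messages.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
all_ham = [0] * 10                          # never flags anything
mixed = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]      # one hit, one miss, one false alarm

scores_all_ham = binary_metrics(y_true, all_ham)
scores_mixed = binary_metrics(y_true, mixed)
```

On this toy sample both classifiers reach the same accuracy, yet the one that never flags Spam has zero recall and F1, which is why accuracy alone can be misleading on imbalanced Spam data.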
Robustness against new challenges and attacks.
Another challenge in Spam detection is the dynamic behavior of
scammers. They constantly try to find ways to deceive Spam filters
using different adversarial attacks. Moreover, due to the convergence
of mobile networks with the internet, Spammers are likely to adapt the
techniques used for tricking users on the internet.
Lack of strategy and collaboration.
One of the less highlighted issues in effectively tackling Spam SMS is
the lack of collaboration between researchers, end users, network
providers, and industry, as well as the absence of a robust strategy
for dealing with threats to the security of mobile users and for Spam
SMS detection.
2.2 Related Work
To effectively handle the threat posed by SMS Spam, several techniques
have been proposed in the academic literature and in industry white
papers. However, none of these studies provides a complete picture of
the SMS Spam problem, though there are resources that handle part of
the problem or try to reduce the problem to a single dimension.
Importantly, there is no previous systematic empirical analysis of SMS
scam detection systems.
Abdulhamid et al. [19] presented a review of the currently available
methods, challenges, and future research directions on SMS Spam
filtering and detection techniques. Wang et al. [21] conducted an
empirical survey on word embeddings. Buchanan and Grant [1] offered a
brief description of Nigerian scam schemes, emphasizing that the
growth of the Internet has aided the proliferation of cyber-crime. To
classify Spam texts, Almeida et al. [5] used the SVM classifier. They
used word frequency as a feature and discovered that SVM performed
very well. Another study [22] used SVM classifiers and k-Nearest
Neighbor (kNN) to classify Spam text messages. The experimental
results confirmed that using a combination of BoW features and
structural features to classify Spam messages performed better. Some
researchers have also proposed evolutionary methods for SMS Spam
detection by assimilating the byte-level features of SMS [23].
Researchers have recently started to use deep neural networks. To
filter SMS Spam, Popovac et al. [24] proposed a CNN-based architecture
with one layer of convolution and pooling and achieved an accuracy of
98.4%. Jain et al. [25] used a long short-term memory (LSTM) model for
SMS Spam filtering. Their model attained an accuracy of 99.01% with
the help of 6000 features and 200 LSTM nodes. They also employed
various word embedding techniques, including ConceptNet, WordNet, and
Word2Vec. The authors of [26] proposed several deep neural
network-based models and attained an accuracy of 98.51%.
Despite the success of the preceding algorithms in identifying Spam
messages, the existence of adversaries significantly degrades the
performance of Spam filters [27]. Graham-Cumming presented an attack
against an individual user’s Spam filter in a talk at the 2004 MIT
Spam Conference [28]. In this attack, random words were added to Spam
mailings. A prior paper [3] attempted to broaden the scope of this
attack by substituting regularly used English words with random
phrases. Since then, many adversarial attacks and countermeasures have
been documented in a range of applications. Attacks on text
classification tasks are typically carried out by altering
characteristics or changing the content of the text sequence [20].
Several strategies have been identified in the field of adversarial
attacks on Spam filters [4, 29]: injection of Ham words, obfuscation
of Spam words, poisoning, alteration of labels, and synonym
replacement. The impact of dictionary-based attacks and well-informed
concentrated attacks, on the other hand, can be mitigated by using
classifier weights [30].