Active Countermeasures for Email Fraud
Wentao Chen
University of Bristol
Bristol, United Kingdom
wentao.chen.uob@gmail.com
Fuzhou Wang
City University of Hong Kong
Kowloon, Hong Kong SAR
wang.fuzhou@my.cityu.edu.hk
Matthew Edwards
University of Bristol
Bristol, United Kingdom
matthew.john.edwards@bristol.ac.uk

arXiv:2210.15043v2 [cs.CR] 1 Jun 2023
Abstract—As a major component of online crime, email-
based fraud is a threat that causes substantial economic
losses every year. To counteract these scammers, volun-
teers called scam-baiters play the roles of victims, reply to
scammers, and try to waste their time and attention with
long and unproductive conversations. To curb email fraud
and magnify the effectiveness of scam-baiting, we developed
and deployed an expandable scam-baiting mailserver that
can conduct scam-baiting activities automatically. We im-
plemented three reply strategies using three different models
and conducted a one-month-long experiment during which
we elicited 150 messages from 130 different scammers. We
compare the performance of each strategy at attracting and
holding the attention of scammers, finding tradeoffs between
human-written and automatically-generated response strate-
gies. We also demonstrate that scammers can be engaged
concurrently by multiple servers deploying these strategies
in a second experiment, which used two server instances to
contact 92 different scammers over 12 days. We release both
our platform and a dataset containing conversations between
our automatic scam-baiters and real human scammers, to
support future work in preventing online fraud.
1. Introduction
According to the Internet Crime Report, the FBI’s
Internet Crime Complaint Center (IC3) received 847,376
reported complaints in 2021, corresponding to over $6.9
billion in potential losses. Email, as one of the primary
online communication media, is linked to a significant
proportion of these losses. The IC3 records 19,954 reports
of email account spoofing and compromise fraud in 2021,
accounting for over $2.4 billion in losses, over a third of
all damages [38]. Online confidence tricks and romance
fraud schemes, also often carried out via email exchanges,
account for a further 24,299 victims and $956 million
in losses. While email scammers have been known to
use social engineering techniques that pre-date even the
Industrial Revolution [57, p. 58], the regular innovations
of scammers in their style and content, and the mass-
market targeting of email scamming make this threat a
constant problem that causes severe financial loss and
impacts lives worldwide.
Traditional approaches to combatting this threat have
focused on identifying which Internet users may be most
vulnerable [25], [37], educating Internet users about the
existence and risks of such schemes [9]–[11], [29], [41],
building classifiers that can automatically filter out mes-
sages containing fraud [3], [14], [34], [59], or building
and maintaining blacklists of email senders known to be
untrustworthy [8], [54]. A common thread between all
these approaches is that they are fundamentally inwardly-
directed and defensive, aiming to increase the resilience
of Internet users (or their inboxes) against the social-
engineering-based attacks they are receiving. Recent work
suggests that this approach is not working well enough,
and a fundamentally more active set of countermeasures
should be explored [6].
As an example of what active countermeasures might
look like in this domain, we turn to an existing initiative
in the voluntary anti-fraud community, known as scam-
baiting. To protect victims from contacting scammers,
some volunteers reply to fraudsters, adopting the guise of
possible victims, and engage them in long and unproduc-
tive conversations, which distract fraudsters into wasting
time and attention they would otherwise have spent on
real victims. These volunteers are called scam-baiters,
and there is reason to believe that scam-baiting activities
could be particularly efficient at disrupting the operations
of scammers, as they can decrease the density of real
victims in replies to scammers to the point where the email
scam business model becomes unprofitable [22]. However,
scam-baiting is currently a small-scale hobbyist activity,
and scam-baiters expend significant time and energy in
playing their roles. As a general countermeasure, human
scam-baiters could not respond at the current scale of
global email-based fraud. As such, we turn our attention
to methods of automating their approach.
There is a limited literature that precedes us in au-
tomated conversational interactions with scammers. Some
work has approached this topic as an extension of the hon-
eypot, as in the Honey Phish Project, in which automated
mailboxes would reply to phishing emails with links that,
when clicked, reported identifiable information about the
phishers back to the honeypot operators [17]. A more
conversational email agent was used for similar purposes
by McCoy et al. in a study gathering information on
rental scams [35]. While these could be considered a form
of active countermeasure, this approach is fundamentally
about gathering information on the attackers, rather than
disrupting their operations. More similar to our intent
is the ‘Jolly Roger Bot’ developed by Anderson, which
answers telemarketing calls by responding with audio
clips of random statements [2], [6]. The simpler of the
methods we explore resembles a textual version of this
approach. However, the time wasted in a phone call is
capped at the order of minutes, whereas email scam-baiter
conversations can lead fraudsters on for weeks. Edwards et
al. [14] described some of the persuasive techniques used
by human scam-baiters to achieve these results, noting
that they often mirrored the tactics used by the scammers
they targeted. It is the automatic deployment of these
techniques against cybercriminals that Canham & Tuthill
advocate [6], and that we explore.
In this paper, we describe our implementation of
a mailserver that can engage in scam-baiting activi-
ties automatically. We develop three alternative response-
generation strategies using publicly-available email cor-
pora as inputs to deep learning models, and perform a
one-month comparative evaluation and 12-day concurrent-
engagement experiment in which our system interacted
with real fraudsters in the wild. In short, our contributions
are:
• We demonstrate that automated scam-baiting is possible, with 15% of our replies to known scam emails attracting a response and 42% of conversations primarily involving a human fraudster. Some conversations lasted up to 21 days.
• Further, we compare different approaches to automated scam-baiting in a naturalistic experiment using randomised assignment. We find that human-designed lures work best at attracting scammer responses, but text generation methods informed by the methods of human scam-baiters are more effective at prolonging conversations.
• We also engage the same group of scammers with two identical scam-baiting instances simultaneously. We find that the two concurrent instances attracted responses from 25% of the targeted scammers, and that 29% of these scammers engaged with both scam-baiting instances simultaneously.
• We release our code as a platform which can be deployed to test alternative response strategies and iterate on our findings. We also release both full transcripts of our automated system's conversations and a collection of human scam-baiter conversations, to guide the development of new active countermeasures and provide insight into scammer operations.
The rest of this paper proceeds as follows. In Section 2
we provide some background on scam-baiting as a human
activity, as well as describing the fundamental models used
within our work. Section 3 outlines our deployment plat-
form. Section 4 describes the different corpora we make
use of for finetuning and preparatory evaluation. Section 5
describes said finetuning and classifier evaluation, while
Section 6 describes our main experiments, including the
results from our comparison of the different response
strategies. Section 7 reflects on our findings, their limi-
tations, and our suggestions for future improvements, as
well as considering misuse concerns. We conclude with a
summary of our key results and recommendations.
2. Background
To provide essential background for active defenses against email scamming, this section describes the significance of scam-baiting activity and the recent advances in text generation that enable adaptive conversational AI.
2.1. Scam-baiting protects potential scam victims
Scam-baiting is a kind of vigilante activity, in which
scam-baiters reply to the solicitation emails sent by scam-
mers and enter into conversation with them, in order to
waste scammers’ time and prevent them from scamming
other potential victims. This activity has become an In-
ternet subculture with various scam-baiter communities
across the Internet. Past research on scam-baiting has
explored the various motivations of scam-baiters [53],
[62], the strategies they use in conversations [13], [14]
and the ethics of their activities [51].
We attach importance to scam-baiting activity because,
by wasting scammers’ time, scam-baiting can help to
protect other vulnerable people from being scammed.
Herley [22] argued that scam-baiting activity can sharply
reduce the number of victims found by scammers by
decreasing the density of viable targets (i.e., the targets
that can lead to financial gain), making scammers less
likely to reach potential victims. This argument seems to
be upheld in practice, as well. Scam-baiting exchanges
generally end in frustrated invectives from scammers
once they have understood what is taking place [14],
and prominent scam-baiter and comedian James Veitch
reports pointedly about scammers pleading with him to
stop emailing them [6], [56]. Some scam-baiters also use
their activities as a means to prod scammers into reflecting
on what they are doing (e.g., [47]), but the effectiveness
of this last technique is unknown.
There are existing industrial email conversation prod-
ucts that make use of NLP techniques to craft re-
sponses [33], [39], [46]. These products may conceivably
be adapted for automatic scam-baiting; however, their
focus is providing automatic customer service, and they
are tested only on responding appropriately to business
email communications. As a result, the performance of
these products at scam-baiting, which may involve intentionally drawing out conversations to better waste scammer resources, has not been evaluated. Previous anti-spam
conversation systems, such as RE:Scam [60], have demon-
strated the basic viability of automatic responses for dis-
rupting scammer operations, using random selection from
a series of canned template responses. This is a promising
start, but we suspect that an automatic scam-baiter that
can respond to the content of a scam message could be
significantly more effective at prolonging conversations.
Our system is the first open-source email conversation
system specialized for automatic scam-baiting. The system
aims to apply general NLP models to the task of
consuming scammers' resources.
2.2. Text generation for email conversations
The recent successes in natural language processing
(NLP) have given rise to the prosperity of automatic
dialogue systems [7]. The most prominent architectures
include the Transformer [58] and its variants BERT [12]
and the GPT family [44]. The emergence of transformers
has enabled the pretraining-finetuning approach in NLP,
which was not possible in the era of RNN [48] and
LSTM [23]. We briefly introduce these models in this
section.
Transformers. The Transformer architecture is based
solely on attention mechanisms, without recurrence. The
model is more parallelizable and requires significantly less
time to train [55].
The core of Transformer models is the attention mechanism. Each embedded word vector is multiplied by three different matrices to obtain three feature matrices; the features of each word are then combined with the features of the other words to compute attention scores, which capture the relationship between that word and the whole sentence. In practice, we usually calculate multiple attention matrices for a single word, an approach known as multi-head attention. These attention matrices are concatenated into a larger matrix, which is then multiplied by a weight matrix, returning it to its original size for the feed-forward step.
\[
Q_i = X \cdot W^{q}_{i}, \qquad K_i = X \cdot W^{k}_{i}, \qquad V_i = X \cdot W^{v}_{i}
\]
\[
Z_i = \mathrm{softmax}\!\left(\frac{Q_i \cdot K_i^{T}}{\sqrt{d_k}}\right) V_i
\]
\[
Z = \mathrm{Concat}(Z_1, Z_2, \ldots, Z_h) \cdot W^{O}
\]

Here $Q_i$, $K_i$ and $V_i$ denote the feature values of the input sentence $X$; $Z_i$ is the $i$-th head's attention output, while $Z$ denotes the final output of the multi-head attention process. The feature matrices ($W^{q}_{i}$, $W^{k}_{i}$ and $W^{v}_{i}$) and the weight matrix $W^{O}$ are obtained from training as the parameters of the model. The term $d_k$ denotes the dimension of the $K_i$ feature vectors.
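To make this computation concrete, multi-head attention can be sketched in a few lines of NumPy. This is an illustrative toy with random weights and arbitrarily small dimensions, not the implementation used in our system:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the chosen axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """X: (seq_len, d_model); Wq/Wk/Wv: per-head projection matrices; Wo: output weights."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i      # feature matrices Q_i, K_i, V_i
        d_k = K.shape[-1]
        Z_i = softmax(Q @ K.T / np.sqrt(d_k)) @ V   # scaled dot-product attention
        heads.append(Z_i)
    return np.concatenate(heads, axis=-1) @ Wo      # Concat(Z_1..Z_h) . W^O

# Toy example: 4 tokens, d_model = 8, h = 2 heads of size d_k = 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq = [rng.normal(size=(8, 4)) for _ in range(2)]
Wk = [rng.normal(size=(8, 4)) for _ in range(2)]
Wv = [rng.normal(size=(8, 4)) for _ in range(2)]
Wo = rng.normal(size=(8, 8))
Z = multi_head_attention(X, Wq, Wk, Wv, Wo)
print(Z.shape)  # (4, 8): one d_model-sized output per input token
```

Each head's attention weights form a row-stochastic matrix over the sequence, which is what lets every output position mix information from the whole sentence.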
Since attention is computed from the word vectors alone, it carries no information about word order or position. To solve this problem, a position embedding is added to represent the position features of the word vectors. The position vectors are added to the word vectors before they are input into the multi-head attention layers.
\[
X = X_{\mathrm{raw}} + E
\]
In the following sections, we will briefly introduce
two specialized Transformer models that are widely used
in the field of NLP.
BERT. Bidirectional Encoder Representations from Trans-
former (BERT) is a language representation model based
on the Transformer architecture, described by Devlin et
al. in 2018 [12]. Unlike traditional language models, the
pre-training process of the model is split into two tasks. In the first, a Masked Language Modelling task, the training data generator masks 15% of word tokens at random, which the model learns to predict. The loss function for this task considers only the difference between the original masked tokens and the predicted tokens, ignoring the unmasked positions. In the second task, Next Sentence Prediction, the model is trained to understand the coherence and relationship of sentences: when choosing a sentence A and a sentence B for a training example, the data generator replaces sentence B with a random sentence from the corpus 50% of the time. During pre-training, the two masked sentences are given as input together, and the model must determine whether they have been randomly concatenated.
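The masking step can be illustrated with a short sketch. This is a simplification: full BERT pre-training additionally replaces some selected tokens with random tokens or leaves them unchanged, which this toy omits, and the `mask_tokens` helper here is illustrative only:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly mask ~15% of tokens; return the masked input and prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok      # the model must recover the original token here
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, seed=42)
print(masked)
print(targets)  # positions the MLM loss is computed over
```

The loss is then evaluated only at the positions recorded in `targets`, matching the description above.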
After the pre-training process, the model can be put
into a finetuning process for specific training on a task.
The finetuning methodology can take a straightforward
supervised learning structure, and as a result the BERT
model has been applied to a variety of different tasks.
In [12], the model is finetuned for 11 NLP tasks, in-
cluding question answering, single sentence tagging and
sentence classification. BERT has also been used in machine translation [61], target-dependent sentiment classification [19] and sentence similarity [45]. For the classification tasks in this study, a fully connected layer is applied after the output of the transformer to predict the category of the input text.
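The classification head described here amounts to a fully connected layer plus softmax over the encoder output. The sketch below uses random NumPy arrays as a stand-in for BERT's pooled output; in the real system these features would come from the pretrained encoder:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def classify(pooled_output, W, b):
    """Fully connected layer over the encoder's pooled output, then softmax.
    pooled_output: (batch, hidden); W: (hidden, n_classes); b: (n_classes,)."""
    return softmax(pooled_output @ W + b)

# Stand-in for BERT's pooled [CLS] output: batch of 3, hidden size 16, 2 classes.
rng = np.random.default_rng(1)
pooled = rng.normal(size=(3, 16))
W = rng.normal(size=(16, 2))
b = np.zeros(2)
probs = classify(pooled, W, b)
print(probs.shape)  # (3, 2); each row is a probability distribution over classes
```

During finetuning, only this head and the encoder weights are updated with a standard supervised cross-entropy loss.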
GPT by OpenAI. GPT [42] is one of the major mile-
stones of language modelling that precipitated the rapid
development of self-supervised natural language genera-
tion. In a self-supervised manner, the model is trained to
generate conditional synthetic text by predicting the next
word based on the given context. The variants of GPT,
including GPT-2 [43], GPT-3 [5], and GPT-Neo [4], are
direct scaling-ups on top of the GPT algorithm. At the
time of this study, the latest versions of GPT were GPT-3
and GPT-Neo1.
The architecture of GPT is composed of multiple
layers of transformer decoder. Similar to BERT and most
transformer-based NLP models, the training framework
of GPT models is composed of two steps – the pre-
training step and the finetuning step. At the pre-training
step, given a series of tokens encoded from the context,
the GPT model tries to maximize the likelihood of the
corpus tokens $V = \{v_1, \ldots, v_n\}$:

\[
\Theta^{*} = \arg\max_{\Theta} \sum_{i=1}^{n} \log P(v_i \mid v_{i-k}, \ldots, v_{i-1})
\]

where $\Theta$ denotes the parameters of the neural network, and $k$ represents the size of the context window.
For each iteration of training, the context matrix is
forwarded through the multi-layer transformer decoder,
which can be formulated as below:
\[
z_0 = V W_e + W_p
\]
\[
z_l = \mathrm{decoder\_block}(z_{l-1}) \quad \forall\, l \in [1, n]
\]
\[
P(v) = \mathrm{softmax}\!\left(z_n W_e^{T}\right)
\]

where $W_e$ denotes the learnable token embedding matrix, $W_p$ is the position embedding matrix, and $V$ denotes the context matrix.
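This forward pass can be sketched as follows, with toy dimensions, random weights, and a tanh-transformed linear layer standing in for each decoder block; the point is the embedding and unembedding structure, not a faithful decoder implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

vocab, d_model, seq_len, n_layers = 50, 16, 6, 2
rng = np.random.default_rng(0)
We = rng.normal(size=(vocab, d_model)) * 0.1    # token embedding matrix W_e
Wp = rng.normal(size=(seq_len, d_model)) * 0.1  # position embedding matrix W_p
blocks = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(n_layers)]

tokens = rng.integers(0, vocab, size=seq_len)
V = np.eye(vocab)[tokens]        # one-hot context matrix V
z = V @ We + Wp                  # z_0 = V W_e + W_p
for W in blocks:                 # stand-in for z_l = decoder_block(z_{l-1})
    z = np.tanh(z @ W)
P = softmax(z @ We.T)            # P(v) = softmax(z_n W_e^T)
print(P.shape)  # (6, 50): a next-token distribution at each position
```

Note that the same embedding matrix $W_e$ is reused (transposed) to project hidden states back onto the vocabulary, as in the equations above.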
In the finetuning step, the model accepts additional
supervision information for other downstream tasks or
adjusts itself to generate text in specific domains.
By training on extremely large corpora, the GPT
models achieved impressive performance on a variety of
tasks. In particular, the GPT models achieved state-of-the-
art performance in terms of conditioning on long-range
contexts [43]. In this study, the contexts inputted to the
model were email text, which are usually lengthy. The
long-range context dependency of GPT models enabled
1. An open-source implementation of GPT-3-like models, available at https://github.com/EleutherAI/gpt-neo.
a longer-term memory of linguistic information, and thus
helped address the problem of losing conversational con-
text when interacting with scammers.
Conversational Artificial Intelligence. With the afore-
mentioned developments in NLP, conversational artificial
intelligence has been widely deployed in the real world.
Conversational AI is classified into three categories by
Gao et al. [18], including question-answering (QA) sys-
tems, task-oriented dialogue systems, and fully data-driven
chat systems. QA systems involve a pipeline of reading
comprehension, knowledge base extraction, and answer
generation, and task-oriented dialogue systems understand
the users’ instructions or queries to decide the back-end
operations, which are usually deployed as AI assistants
(e.g., Alexa by Amazon and Assistant by Google).
The task in our study, however, is most relevant to
the fully data-driven chat systems, but with longer prior
contexts (i.e., email messages). This type of neural system
consists of end-to-end models trained on open-domain
text. For example, Li et al. [30] used a diversity-promoting
loss function to make the model generate diversified and
meaningful responses, and in [31], reinforcement learning
was adopted to improve the quality of the text generated
by the model. In recent years, after the introduction of the
GPT family, more advances have been made in the field
of conversational AI. In previous work [21], [36], GPT-2
and GPT-3 were examined for the purpose of constructing
conversational AIs, and both of them delivered remarkable
performance, although in one evaluation [36] GPT-3 was criticized for its lower language variability compared to human-written text.
3. Scam-baiting Mailserver
In this section, we describe a mailserver that is capable
of conducting scam-baiting experiments automatically. We
used this architecture for the experiments described later
in this paper and implemented three different response
strategies, using an AI natural language classifier and
two text generators (see Section 4 and Section 5). It is
important to mention that this mailserver design is not
particular to our specific response strategies, and could
easily be deployed to trial new automatic scam-baiting
techniques2.
3.1. Server Structure
We designed a modular server structure, as shown in
Figure 1. Scam emails are crawled automatically from
online reporting platforms using a Crawler module, and
placed into a queue. The application server then regularly
polls the queue and distributes work evenly between the
registered Responders. Responders will use unique mail-
names for each conversation. When a scammer responds
to an email, their response is routed back through the
queue for scheduling purposes, but is always assigned
to the Responder which they originally addressed. It is
important to note that the server is designed to run on
a unique domain independently, so it is not capable of
2. The source code can be found at https://github.com/scambaitermailbox/scambaiter_backend
Figure 1: The modular mailserver architecture
spoofing an existing email address, nor connecting to the
inboxes of others.
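The routing behaviour described above, even distribution of fresh scam emails and sticky assignment of replies via unique mailnames, can be modelled in a few lines. The class and method names here are illustrative, not taken from our codebase:

```python
from collections import deque
from itertools import cycle

class Dispatcher:
    """Assigns new scam emails round-robin; replies stick to the original Responder."""
    def __init__(self, responders):
        self._rr = cycle(responders)   # even distribution of fresh emails
        self.queue = deque()
        self.assignment = {}           # unique mailname -> responder

    def enqueue(self, sender, to_mailname=None):
        self.queue.append((sender, to_mailname))

    def dispatch(self):
        handled = []
        while self.queue:
            sender, to_mailname = self.queue.popleft()
            if to_mailname in self.assignment:
                # Reply from a scammer: route back to the original Responder.
                responder = self.assignment[to_mailname]
            else:
                # Fresh scam email: assign the next Responder in rotation.
                responder = next(self._rr)
                to_mailname = f"{responder}-{len(self.assignment)}"
                self.assignment[to_mailname] = responder
            handled.append((sender, responder))
        return handled

d = Dispatcher(["classifier", "gpt_neo", "gpt3"])
d.enqueue("scammer1@example.com")
d.enqueue("scammer2@example.com")
first = d.dispatch()
d.enqueue("scammer1@example.com", to_mailname="classifier-0")
second = d.dispatch()
print(first)
print(second)  # scammer1's reply routes back to "classifier"
```

The per-conversation mailname is what makes the sticky routing possible without inspecting message content.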
3.2. Server Modules
Crawler. The Crawler module is used to fetch scam
emails from online sources at regular intervals. The
emails we crawl are public copies of solicitation emails
sent by scammers which have been captured by spam traps
or reported by anti-fraud volunteers. The Crawler module
stores cleaned copies of these emails in a queue structure,
along with the email address to which replies should be
addressed.
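A minimal sketch of the Crawler's contract, cleaning a reported scam email and queueing it with its reply address, might look as follows. The regex-based cleaning and the raw email source are illustrative placeholders; the real module scrapes specific reporting platforms:

```python
import re
from collections import deque

queue = deque()

def clean(body):
    """Strip HTML tags and collapse whitespace from a crawled scam email body."""
    text = re.sub(r"<[^>]+>", " ", body)
    return re.sub(r"\s+", " ", text).strip()

def enqueue_scam_email(raw_body, reply_to):
    """Store a cleaned copy of a reported scam email plus its reply address."""
    queue.append({"body": clean(raw_body), "reply_to": reply_to})

enqueue_scam_email("<p>Dear friend,</p><p>I am a prince...</p>",
                   "scammer@example.com")
print(queue[0])
```

Keeping the reply address alongside the cleaned body is what lets the application server later hand the item to a Responder as a complete unit of work.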
Email Sender & Receiver. Due to the port restrictions
imposed by most VPS providers, we chose to use a relay email
service provider for receiving and sending mail. The Email
Sender module is an intermediate layer for transforming
email arguments into the API-required JSON content,
and sending the HTTP request to submit an outbound
message. To receive responses from scammers, our server
listens for POST requests from the relay service. Once
the Email Receiver receives the request, it extracts the
response content, and submits it to the email queue. These
relay email servers are correctly configured with SPF
and DKIM, with different RSA-SHA256 public keys and
unrelated domain names. In a production mail service
deployment, the reliance on a relay provider would not
be necessary, and a traditional mail agent could replace
these modules.
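The Email Sender's job of transforming email arguments into the API-required JSON can be sketched as below. The endpoint and field names are placeholders, as relay providers differ; no specific provider's schema is implied:

```python
import json
import urllib.request

API_URL = "https://api.example-relay.invalid/v3/mail/send"  # placeholder endpoint

def build_payload(sender, recipient, subject, body):
    """Transform email arguments into the JSON structure a relay API expects."""
    return {
        "from": {"email": sender},
        "to": [{"email": recipient}],
        "subject": subject,
        "content": [{"type": "text/plain", "value": body}],
    }

def send(payload, api_key):
    """POST the payload to the relay provider (not executed in this sketch)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)  # the provider's HTTP response

payload = build_payload("baiter@ourdomain.example", "scammer@example.com",
                        "Re: Urgent business proposal", "Dear friend, ...")
print(json.dumps(payload, indent=2))
```

The inbound path is the mirror image: the relay POSTs the scammer's reply to our server, and the Email Receiver extracts the content and submits it to the queue.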
Responders. The Responders are used to process the
incoming messages from scammers and generate text
replies. They play the role of the automatic scam-baiters
in the server. In this study, we implemented three Re-
sponders using two AI text generation models and one
AI text classification model. We further describe how we
prepared these models in Section 5, and how we used three