Application of Explainable Machine Learning in Detecting and Classifying Ransomware Families Based on API Call Analysis

Rawshan Ara Mowri
Department of Computer Science
North Carolina A&T State University
Greensboro, USA
rmowri@aggies.ncat.edu
Madhuri Siddula
Department of Computer Science
North Carolina A&T State University
Greensboro, USA
msiddula@ncat.edu
Kaushik Roy
Department of Computer Science
North Carolina A&T State University
Greensboro, USA
kroy@ncat.edu
Abstract—Ransomware has emerged as one of the major
global threats in recent years. The alarming growth in
ransomware attacks and new ransomware variants drives
researchers to continually examine the distinguishing traits of
ransomware and refine their detection strategies. An Application
Programming Interface (API) is a way for one program to
interact with another; API calls are the medium by which
they communicate. Ransomware uses API calls to interact
with the OS and makes a significantly higher number of calls,
in different sequences, to request actions. This research
work utilizes the frequencies of different API calls to detect and
classify ransomware families. First, a Web-Crawler is developed
to automate collecting the Windows Portable Executable (PE)
files of 15 different ransomware families. By extracting the
frequencies of 68 API calls, we develop our dataset in the
first phase of the two-phase feature engineering process. After
selecting the most significant features in the second phase of the
feature engineering process, we deploy six Supervised Machine
Learning models: Naïve Bayes, Logistic Regression, Random
Forest, Stochastic Gradient Descent, K-Nearest Neighbor, and
Support Vector Machine. Then, the performances of all the
classifiers are compared to select the best model. The results reveal
that Logistic Regression can efficiently classify ransomware into
their corresponding families, achieving 99.15% overall accuracy.
Finally, instead of relying on the ‘black box’ nature of the
Machine Learning models, we present a post-hoc analysis of our
best-performing model using ‘SHapley Additive exPlanations’
(SHAP) values to establish the transparency and trustworthiness
of the model’s predictions.
Index Terms—Ransomware Classification, Machine Learning,
Explainable AI, Cyber Security
I. INTRODUCTION
Recently, ransomware has become one of the biggest global
challenges disrupting people’s daily lives. As malicious
software, it applies symmetric and asymmetric cryptography
to encrypt user data and mounts a Denial-of-Service (DoS)
attack on the intended user [1]. The unique functional
process of a ransomware attack makes it more harmful than
other malware attacks and causes irreversible losses.
Fig. 1, based on [2], illustrates the number of publicized
ransomware attacks in 2021, a 25% increase over the same
period of the previous year. Although the report does not
include the number of supply chain attacks, such attacks are
seriously disrupting healthcare delivery, grocery purchases,
and even vehicle refueling. Examples of these attacks include
the Kaseya attack and the Colonial Pipeline attack. In
addition, in the first six months of 2021, the FBI’s Internet
Crime Complaint Center documented 2,084 ransomware
attacks [3], and the U.S. Treasury’s Financial Crimes
Enforcement Network (FinCEN) recorded costs of around
$590 million related to ransomware activities during that
period [4]. Moreover, distinct ransomware variants are being
detected regularly, and more than 130 different ransomware
variants have been identified from 2020 till this year, causing
an inevitable disturbance in day-to-day lives [5].
This work is funded by NetApp.
Fig. 1. Number of worldwide ransomware attacks in different sectors in 2021
(*Till November 2021)
arXiv:2210.11235v3 [cs.CR] 13 Nov 2022
Due to the increasing number of ransomware variants and
ransomware attacks, researchers have been actively looking
for efficient ways to improve the situation. Some researchers
analyze the distinctive behaviors of ransomware by executing
it in a secure environment, an approach called Dynamic
Analysis [6]-[11], while others analyze the ransomware
without any execution, referred to as Static Analysis [12],
[13]. A good number of researchers combine these two
approaches and adopt a Hybrid Analysis Approach [14], [15].
In this research, we have opted for the dynamic analysis
approach for its ability to detect and classify ransomware
based on behavioral patterns regardless of the code
obfuscation techniques deployed by the ransomware
programmers [16], [17]. The main contributions of this paper
are:
• Develop a Web-Crawler, ‘GetRansomware’, to automate
collecting the Windows Portable Executable (PE) files of
15 different ransomware families from the ransomware
repository. The Web-Crawler automates searching for and
downloading the samples and cuts down the manual
workload; no prior works targeted this scenario.
• Develop our dataset and conduct feature selection through
a two-phase feature engineering process that includes
‘Feature Extraction’ from the sample binaries and ‘Feature
Selection’ to select the most important features for
each ML classifier.
• Develop, evaluate, and compare the performance of six
state-of-the-art Supervised Machine Learning models.
Our approach utilizes Recursive Feature Elimination
with Cross-Validation (RFECV) for selecting the
significant features and RandomSearchCV for selecting
the optimum hyperparameter values for each ML
classifier. Thereby we attempt to optimize each model’s
performance before the comparison is made.
• Present the post-hoc analysis of the best-performing
model using ‘SHapley Additive exPlanations’ (SHAP)
values to establish the transparency and trustworthiness of
the model’s predictions. This insight gives a better idea
of which features are more dominant in detecting and
classifying the ransomware families. While explainability
has been widely presented in malware detection scenarios,
to the best of the authors’ knowledge, no prior works
have presented the explainability of models that
consider only the ransomware families.
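As a rough illustration of the ‘Feature Extraction’ phase named in the contributions above, the sketch below turns one sample's API call trace into a frequency vector over a fixed vocabulary. The API names, the three-entry vocabulary (the paper uses 68 API calls, which are not listed in this section), and the helper `api_frequency_vector` are assumptions made for illustration; this is not the authors' code.

```python
from collections import Counter

# Hypothetical vocabulary; the paper tracks 68 API calls, not listed here.
API_VOCABULARY = ["NtCreateFile", "NtWriteFile", "CryptEncrypt"]


def api_frequency_vector(trace, vocabulary=API_VOCABULARY):
    """Count how often each vocabulary API call occurs in one sample's trace."""
    counts = Counter(trace)
    # Calls outside the vocabulary are ignored; absent calls count as 0.
    return [counts[name] for name in vocabulary]


# Made-up trace for a single sample, for illustration only.
trace = ["NtCreateFile", "CryptEncrypt", "NtWriteFile", "CryptEncrypt", "NtClose"]
vector = api_frequency_vector(trace)
print(vector)
```

Stacking one such vector per sample, with the family name as the label, would give a tabular dataset of the kind a feature selection step and the supervised classifiers could consume.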
The rest of this paper is structured as follows: Section II
discusses the related works. Section III presents our proposed
method. Section IV presents the experimental results and
discussion. Section V presents our model’s explainability.
Section VI concludes the paper with directions for future
work.
II. RELATED WORKS
Most researchers prefer the dynamic analysis approach
because it reveals the behaviors of the ransomware more
explicitly. Maniath et al. [6] analyzed the API call sequences
of 157 ransomware samples and presented an LSTM-based
ransomware detection method. Despite achieving 96.67%
accuracy, this work does not fully report the ransomware
families/variants or the number of benign samples used
in the experiment. Vinayakumar et al. [7] proposed an
MLP-based ransomware detection method focusing on API
call frequency and achieved 100% and 98% accuracy for
binary and multi-class classification, respectively. However,
they deployed a simple MLP network that failed to distinguish
CryptoWall and Cryptolocker ransomware. Z. Chen et al. [8]
used the API Call Flow Graph (CFG) generated from the
extracted API sequences of 83 ransomware and 83 benign
samples. Despite achieving 98.2% accuracy using the
Logistic Regression model, the work is based on a small
dataset that includes only four ransomware families. Also,
graph-similarity analysis requires higher computational power
that some systems may fail to provide. Takeuchi et al. [9] used
API call sequences extracted from 276 ransomware and 312
benign files to identify zero-day ransomware attacks. Although
the work achieved 97.48% accuracy by deploying the Support
Vector Machine, its accuracy decreases when using a
standardized vector representation because of the less
diverse dataset. Using the Intel Pin Tool, Bae et al. [10]
extracted the API call sequences from 1000 ransomware, 900
malware, and 300 benign files. Their sequential process
includes generating an n-gram sequence, an input vector, and
the Class Frequency Non-Class Frequency (CF-NCF) for every
sample before fitting their model. Although they obtained
98.65% accuracy using the Random Forest classifier, the
model’s performance could be improved with the help of
deception-based techniques. Hwang et al. [11] analyzed the
API call sequences of 2507 ransomware and 3886 benign files.
They used two Markov chains, one for ransomware and another
for benign software, to capture the API call sequence patterns.
By using Random Forest to complement the Markov chains,
they control the FPR and FNR to achieve better performance.
Despite achieving 97.3% accuracy, their model produces a high
FPR that could be improved with the help of signature-based
techniques.
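To make the two-Markov-chain idea concrete, the sketch below fits one first-order chain on ransomware traces and one on benign traces, then labels a new trace by comparing its log-likelihood under each chain. This is only an illustration of the general technique: the toy traces, the add-one smoothing, the zero decision threshold, and all function names are assumptions, and the Random Forest stage described in [11] is not reproduced.

```python
import math
from collections import defaultdict


def fit_markov_chain(sequences, states, alpha=1.0):
    """Estimate first-order transition probabilities with add-alpha smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1.0
    probs = {}
    for prev in states:
        total = sum(counts[prev].values()) + alpha * len(states)
        probs[prev] = {nxt: (counts[prev][nxt] + alpha) / total for nxt in states}
    return probs


def log_likelihood(seq, probs):
    """Sum of log transition probabilities; every call must be in `states`."""
    return sum(math.log(probs[prev][nxt]) for prev, nxt in zip(seq, seq[1:]))


def classify(seq, ransomware_chain, benign_chain):
    """Label a trace by the sign of the log-likelihood ratio (assumed threshold 0)."""
    score = log_likelihood(seq, ransomware_chain) - log_likelihood(seq, benign_chain)
    return "ransomware" if score > 0 else "benign"


# Toy, made-up traces over a simplified set of call names.
states = ["open", "read", "encrypt", "write"]
ransomware_traces = [
    ["open", "read", "encrypt", "write"],
    ["open", "read", "encrypt", "write", "open", "read", "encrypt", "write"],
]
benign_traces = [["open", "read", "read", "write"], ["open", "read", "write"]]

ransomware_chain = fit_markov_chain(ransomware_traces, states)
benign_chain = fit_markov_chain(benign_traces, states)
print(classify(["open", "read", "encrypt", "write"], ransomware_chain, benign_chain))
print(classify(["open", "read", "write"], ransomware_chain, benign_chain))
```

Smoothing keeps unseen transitions from producing a zero probability, which would otherwise make the logarithm undefined for any trace containing a transition absent from the training data.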
A good number of researchers chose the static analysis
approach to detect ransomware. Baldwin and Dehghantanha [12]
analyzed the opcode characteristics of 5 crypto-ransomware
families and 350 benign samples. Their experiment involved
the WEKA AI toolset, and the experimental results showed an
accuracy of 96.5% in recognizing the five crypto-ransomware
families and benign software using the Support Vector
Machine classifier. However, their work could be improved by
extending the dataset and extracting those groups of opcodes
identified during the evaluation of attribute selection. Zhang et
al. [13] analyzed the opcode-based characteristics of 1787
ransomware samples from 8 different ransomware families and
100 benign samples. Their technique transformed opcode
sequences into N-gram sequences and then computed Term
Frequency Inverse Document Frequency (TF-IDF). Five ML
classifiers were used with 10-fold cross-validation, among which
the Random Forest classifier achieved the highest accuracy,
91.43%. However, their model could not distinguish Reveton,
CryptoWall, and Locky.
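The opcode N-gram and TF-IDF pipeline mentioned above can be sketched in a few lines. The example below uses bigrams and the plain weighting tf × log(N / df); the opcode sequences are made up, and both the choice of n = 2 and this particular TF-IDF variant are assumptions, since the text does not say which settings [13] used.

```python
import math
from collections import Counter


def ngrams(opcodes, n=2):
    """Slide a window of length n over an opcode sequence."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def tf_idf(documents, n=2):
    """Return one {ngram: weight} dict per document, using tf * log(N / df)."""
    grams_per_doc = [Counter(ngrams(doc, n)) for doc in documents]
    doc_freq = Counter()
    for grams in grams_per_doc:
        doc_freq.update(grams.keys())  # each document counts once per n-gram
    num_docs = len(documents)
    weights = []
    for grams in grams_per_doc:
        total = sum(grams.values())
        weights.append({
            gram: (count / total) * math.log(num_docs / doc_freq[gram])
            for gram, count in grams.items()
        })
    return weights


# Made-up opcode sequences standing in for two disassembled samples.
samples = [["mov", "xor", "mov", "xor"], ["mov", "push", "call"]]
weights = tf_idf(samples)
print(weights[0])
```

An N-gram that appears in every sample receives weight zero under this variant, so the resulting features emphasize opcode patterns that separate samples rather than patterns common to all of them.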
Some researchers adopted a hybrid analysis approach that
combines the features extracted from the dynamic and static
analyses. Subedi et al. [14] applied both dynamic and static
analysis to library, assembly, and function calls. Moreover,
they developed a new analysis tool, CRSTATIC, which was
used to build signatures that could classify ransomware
families with the help of reverse engineering.
However, they analyzed only 450 samples of ransomware