Application of Explainable Machine Learning in Detecting and Classifying Ransomware Families Based on API Call Analysis

Rawshan Ara Mowri

Department of Computer Science

North Carolina A&T State University

Greensboro, USA

rmowri@aggies.ncat.edu

Madhuri Siddula

Department of Computer Science

North Carolina A&T State University

Greensboro, USA

msiddula@ncat.edu

Kaushik Roy

Department of Computer Science

North Carolina A&T State University

Greensboro, USA

kroy@ncat.edu

Abstract—Ransomware has emerged as one of the major global threats in recent years. The alarming growth in ransomware attacks and new ransomware variants drives researchers to continually examine the distinguishing traits of ransomware and refine their detection strategies. An Application Programming Interface (API) is a way for one program to interact with another; API calls are the medium through which they communicate. Ransomware uses API calls to interact with the OS and makes a significantly higher number of calls, in different sequences, to request actions. This research work utilizes the frequencies of different API calls to detect and classify ransomware families. First, a Web-Crawler is developed to automate collecting the Windows Portable Executable (PE) files of 15 different ransomware families. By extracting the frequencies of 68 API calls, we develop our dataset in the first phase of the two-phase feature engineering process. After selecting the most significant features in the second phase of the feature engineering process, we deploy six Supervised Machine Learning models: Naïve Bayes, Logistic Regression, Random Forest, Stochastic Gradient Descent, K-Nearest Neighbor, and Support Vector Machine. Then, the performances of all the classifiers are compared to select the best model. The results reveal that Logistic Regression can efficiently classify ransomware into their corresponding families, securing 99.15% overall accuracy. Finally, instead of relying on the ‘black box’ characteristic of the Machine Learning models, we present the post-hoc analysis of our best-performing model using ‘SHapley Additive exPlanations’ (SHAP) values to ascertain the transparency and trustworthiness of the model’s predictions.

Index Terms—Ransomware Classification, Machine Learning, Explainable AI, Cyber Security

I. INTRODUCTION

Recently, ransomware has become one of the biggest global challenges disrupting people’s everyday lives. As malicious software, it applies symmetric and asymmetric cryptography to encrypt user information and imposes a Denial-of-Service (DoS) attack on the intended user [1]. The unique functional process of a ransomware attack makes it more harmful than other malware attacks and causes irreversible losses. Fig. 1, based on [2], illustrates the number of publicized ransomware attacks in 2021, an increase of 25% over the same period in the previous year. Although the report does not include the number of supply chain attacks, such attacks cause major disruption in providing healthcare, purchasing groceries, and even fueling vehicles. Examples of these attacks include the Kaseya attack and the Colonial Pipeline attack. In addition, in the first six months of 2021, the FBI’s Internet Crime Complaint Center documented 2,084 ransomware attacks [3], and the U.S. Treasury’s Financial Crimes Enforcement Network (FinCEN) recorded costs of around $590 million related to ransomware activities during that period [4]. Moreover, distinct ransomware variants are detected regularly: more than 130 different ransomware variants have been identified since 2020, causing unavoidable disruption to day-to-day life [5].

This work is funded by NetApp.

Fig. 1. Number of worldwide ransomware attacks in different sectors in 2021 (*through November 2021)

arXiv:2210.11235v3 [cs.CR] 13 Nov 2022

Due to the increasing number of ransomware variants and attacks, researchers have been actively looking for efficient ways to improve the situation. Some researchers analyze the distinctive behaviors of ransomware by executing it in a secure environment, an approach called Dynamic Analysis [6]-[11], while others analyze the ransomware without executing it, referred to as Static Analysis [12], [13]. A good number of researchers combine these two approaches in a Hybrid Analysis approach [14], [15]. In this research, we have opted for the dynamic analysis approach for its ability to detect and classify ransomware based on behavioral patterns regardless of the code obfuscation techniques deployed by the ransomware programmers [16], [17]. The main contributions of this paper are:

• Develop a Web-Crawler, ‘GetRansomware’, to automate collecting the Windows Portable Executable (PE) files of 15 different ransomware families from the ransomware repository. The Web-Crawler is essential for automating the search for and download of samples and for cutting down the manual workload, yet no prior works targeted this scenario.

• Develop our dataset and conduct feature selection through a two-phase feature engineering process that includes ‘Feature Extraction’ from the sample binaries and ‘Feature Selection’ to select the most important features for each ML classifier.

• Develop, evaluate, and compare the performance of six state-of-the-art Supervised Machine Learning models. Our approach utilizes Recursive Feature Elimination with Cross-Validation (RFECV) for selecting the significant features and RandomSearchCV for selecting the optimum hyperparameter values for each ML classifier. We thereby attempt to optimize each model’s performance before the comparison is made.

• Present the post-hoc analysis of the best-performing model using ‘SHapley Additive exPlanations’ (SHAP) values to ascertain the transparency and trustworthiness of the model’s predictions. This analysis shows which features are more dominant in detecting and classifying the ransomware families. While explainability has been widely presented in malware detection scenarios, to the best of the authors’ knowledge, no prior work has presented the explainability of a model that considers only ransomware families.
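To make the frequency features and the Shapley attribution idea concrete, the following is a minimal sketch and not the pipeline used in this paper. It assumes a hypothetical four-API vocabulary, a made-up call trace, a hand-picked linear score standing in for a trained classifier, and an assumed baseline vector. It computes exact Shapley values by enumerating feature coalitions, which is only practical for toy inputs; SHAP tooling typically relies on faster approximations.

```python
from collections import Counter
from itertools import combinations
from math import factorial

# Hypothetical vocabulary for illustration; the paper uses 68 API calls.
API_VOCAB = ["CreateFileW", "WriteFile", "CryptEncrypt", "RegSetValueExW"]


def frequency_vector(trace, vocab=API_VOCAB):
    """Count how often each API in `vocab` appears in one call trace."""
    counts = Counter(trace)
    return [counts[name] for name in vocab]


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating all feature coalitions.

    Features outside a coalition take their baseline value.
    The cost is exponential in the number of features.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Assumed weights: a stand-in for a trained model's decision score.
WEIGHTS = [0.1, 0.2, 1.5, 0.4]


def linear_score(features):
    return sum(w * f for w, f in zip(WEIGHTS, features))


# Made-up trace and baseline, for illustration only.
trace = ["CreateFileW", "CryptEncrypt", "WriteFile", "CryptEncrypt", "CryptEncrypt"]
x = frequency_vector(trace)
baseline = [1, 1, 0, 0]
phi = shapley_values(linear_score, x, baseline)

for name, contribution in zip(API_VOCAB, phi):
    print(f"{name}: {contribution:+.2f}")
```

In this toy setup the attributions sum to the difference between the score of the sample and the score of the baseline, which is the additivity property that makes Shapley-based explanations easy to read feature by feature.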

The rest of this paper is structured as follows: Section II discusses the related works. Section III presents our proposed method. Section IV presents the experimental results and discussion. Section V presents our model’s explainability. Section VI concludes the paper with directions for future work.

II. RELATED WORKS

Most researchers prefer the dynamic analysis approach because it can delineate the behaviors of ransomware more explicitly. Maniath et al. [6] analyzed the API call sequences of 157 ransomware samples and presented an LSTM-based ransomware detection method. Despite securing 96.67% accuracy, this work lacks complete information about the ransomware families/variants and the number of benign software samples used for the experiment. Vinayakumar et al. [7] proposed an MLP-based ransomware detection method focusing on API call frequency and secured 100% and 98% accuracy for binary and multi-class classification, respectively. However, they deployed a simple MLP network that failed to distinguish CryptoWall and Cryptolocker ransomware. Z. Chen et al. [8] used the API Call Flow Graph (CFG) generated from the extracted API sequences of 83 ransomware and 83 benign software samples. Despite securing 98.2% accuracy using the Logistic Regression model, the work is based on a smaller dataset that includes only four ransomware families. Also, graph-similarity analysis requires higher computational power that some systems may fail to provide. Takeuchi et al. [9] used API call sequences extracted from 276 ransomware and 312 benign files to identify zero-day ransomware attacks. Although the work secured 97.48% accuracy by deploying the Support Vector Machine, its accuracy decreases when using standardized vector representation because of the less diverse dataset. Using the Intel Pin Tool, Bae et al. [10] extracted the API call sequences from 1000 ransomware, 900 malware, and 300 benign files. Their sequential process includes generating an n-gram sequence, an input vector, and Class Frequency Non-Class Frequency (CF-NCF) for every sample before fitting their model. Despite obtaining 98.65% accuracy using the Random Forest classifier, the model’s performance can be improved with the help of deception-based techniques. Hwang et al. [11] analyzed the API call sequences of 2507 ransomware and 3886 benign files. They used two Markov chains, one for ransomware and another for benign software, to capture the API call sequence patterns. By using Random Forest, they compensate for the Markov chains and control the FPR and FNR to achieve better performance. Despite securing 97.3% accuracy, their model produces a high FPR that can be improved with the help of signature-based techniques.
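The two-Markov-chain idea described above can be sketched as follows. This is an illustrative reading, not the implementation of Hwang et al. [11]: the abstract call names, the toy training sequences, the add-one smoothing, and the log-likelihood-ratio decision rule are all assumptions, and the Random Forest stage is omitted.

```python
from collections import defaultdict
from math import log

# Hypothetical, simplified call names for illustration.
VOCAB = ["open", "read", "encrypt", "write", "close"]


def fit_markov_chain(sequences, vocab=VOCAB, alpha=1.0):
    """Estimate first-order transition probabilities with add-alpha smoothing."""
    counts = {a: defaultdict(float) for a in vocab}
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1.0
    probs = {}
    for a in vocab:
        total = sum(counts[a].values()) + alpha * len(vocab)
        probs[a] = {b: (counts[a].get(b, 0.0) + alpha) / total for b in vocab}
    return probs


def log_likelihood(seq, probs):
    """Sum of log transition probabilities along one call sequence."""
    return sum(log(probs[prev][nxt]) for prev, nxt in zip(seq, seq[1:]))


def classify(seq, ransom_chain, benign_chain):
    """Label a sequence by which chain explains its transitions better."""
    score = log_likelihood(seq, ransom_chain) - log_likelihood(seq, benign_chain)
    return ("ransomware" if score > 0 else "benign"), score


# Made-up training sequences, for illustration only.
ransom_train = [
    ["open", "read", "encrypt", "write", "close"],
    ["open", "read", "encrypt", "write", "open", "read", "encrypt", "write", "close"],
]
benign_train = [
    ["open", "read", "close"],
    ["open", "write", "close"],
    ["open", "read", "read", "close"],
]

ransom_chain = fit_markov_chain(ransom_train)
benign_chain = fit_markov_chain(benign_train)

label_a, score_a = classify(["open", "read", "encrypt", "write", "close"], ransom_chain, benign_chain)
label_b, score_b = classify(["open", "read", "close"], ransom_chain, benign_chain)
print(label_a, round(score_a, 3))
print(label_b, round(score_b, 3))
```

Smoothing matters here because a transition never seen in one class would otherwise have probability zero and make the log-likelihood undefined.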

A good number of researchers chose the static analysis approach to detect ransomware. Baldwin and Dehghantanha [12] analyzed the opcode characteristics of 5 crypto-ransomware families and 350 benign samples. Their experiment involved the WEKA AI toolset, and the experimental results showed an accuracy of 96.5% in recognizing five crypto-ransomware families and benign software using the Support Vector Machine classifier. However, their work could be improved by extending the dataset and extracting those groups of opcodes identified during the evaluation of attribute selection. Zhang et al. [13] analyzed the opcode-based characteristics of 1787 ransomware samples from 8 different ransomware families and 100 benign software samples. Their technique included converting opcode sequences into N-gram sequences and then computing Term Frequency Inverse Document Frequency (TF-IDF). Five ML classifiers were used with 10-fold cross-validation, among which the Random Forest classifier achieved the highest accuracy of 91.43%. However, their model could not distinguish Reveton, CryptoWall, and Locky.
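The N-gram plus TF-IDF representation mentioned above can be sketched as follows. This is a generic illustration, not the exact feature pipeline of Zhang et al. [13]: the opcode sequences are made up, bigrams are an arbitrary choice of N, and the weighting is the plain form tf × log(N / df), one of several TF-IDF variants in use.

```python
from collections import Counter
from math import log


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(documents, n=2):
    """Map each document to {n-gram: tf * log(num_docs / doc_freq)}."""
    grams_per_doc = [Counter(ngrams(doc, n)) for doc in documents]

    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter()
    for grams in grams_per_doc:
        doc_freq.update(grams.keys())

    num_docs = len(documents)
    vectors = []
    for grams in grams_per_doc:
        total = sum(grams.values())
        vectors.append({
            gram: (count / total) * log(num_docs / doc_freq[gram])
            for gram, count in grams.items()
        })
    return vectors


# Made-up opcode sequences, one per sample, for illustration only.
samples = [
    ["mov", "push", "call", "mov", "push"],
    ["mov", "push", "xor", "xor"],
    ["push", "call", "ret"],
]

vectors = tfidf(samples, n=2)
for index, vector in enumerate(vectors):
    top = max(vector, key=vector.get)
    print(index, top, round(vector[top], 4))
```

An n-gram that appears in every sample gets a weight of zero under this formula, so the representation emphasizes opcode patterns that help separate samples rather than patterns common to all of them.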

Some researchers adopted a hybrid analysis approach that combines the features extracted from the dynamic and static analyses. Subedi et al. [14] used both dynamic and static analysis on the library, assembly, and function calls. Moreover, they introduced a new analysis tool, CRSTATIC, which was deployed to build signatures that could classify ransomware families with the help of reverse engineering. However, they analyzed only 450 samples of ransomware
