enrich the dataset of unknown malware samples, resist active attacks by malware, and improve the detection efficiency of malware.
The main contributions of this paper are as follows:
Firstly, a novel method for segmenting self-growing malware APKs into texture image features is proposed. We map binary malicious code segments into images and analyze the malicious code segments based on image texture. We design a texture cutting algorithm based on Locality Sensitive Hashing (LSH) to extract significant feature texture segments from the malicious code texture segments and enhance the texture features of malware.
Secondly, Singular Value Decomposition (SVD) based on the low-tubal rank is used to strengthen the characteristics of malicious code. Images of different sizes are unified into a fixed-size third-order tensor that serves as the input of the neural network model.
Thirdly, a flexible malware detection framework, MTFD-GANs, based on the generative adversarial network is proposed. New malicious code features are generated during training; they enrich the diversity of samples and enhance the robustness of the model. We extracted 2000 samples with distinct feature types from the Drebin dataset for testing. The experimental results show that the proposed model outperforms traditional malware detection models, with a maximum efficiency improvement of 41.6%.
The remainder of this paper is organized as follows. Section 2
presents the background. Section 3 details the preprocessing
for binary code and the structure of MTFD-GANs. Section 4
introduces the training of MTFD-GANs. Section 5 verifies the
validity of our proposed model through experiments. Finally,
Section 6 concludes this paper.
II. BACKGROUND
This section first introduces the Locality Sensitive Hashing algorithm used for significant feature segment extraction, then details the principle of tensor singular value decomposition. Finally, the Black-Bone prediction model is described.
A. Locality Sensitive Hashing
Locality Sensitive Hashing (LSH) [9] is based on the idea that multiple hash functions are used to project large-scale high-dimensional data points, so that the closer two points are, the more likely they are to remain close together after projection, and vice versa. Let $x$ and $y$ be two different high-dimensional feature vectors. In the LSH index algorithm, the probability of remaining close is usually related to the similarity, that is:

$$\Pr_{h_j \in \mathcal{H}}\left[h_j(x) = h_j(y)\right] = sim(x, y) \qquad (1)$$

where $\mathcal{H}$ is the hash function family, $h_j$ is a hash function randomly selected from $\mathcal{H}$, and $sim(\cdot)$ is the similarity function.
Obviously, the LSH algorithm depends on a locality-sensitive hash function family. Let $\mathcal{H}$ be a family of hash functions mapping $\mathbb{R}^d$ to a set $U$. For any two points $p$ and $q$, a hash function $h$ is randomly selected from the family $\mathcal{H}$. If the following two properties are satisfied, the function family $\mathcal{H} = \{h : \mathbb{R}^d \rightarrow U\}$ is called $(r_1, r_2, p_1, p_2)$-locally sensitive:
• if $D(p, q) \leq r_1$, then $\Pr_{\mathcal{H}}[h(p) = h(q)] \geq p_1$,
• if $D(p, q) \geq r_2$, then $\Pr_{\mathcal{H}}[h(p) = h(q)] \leq p_2$,
where $r_1 < r_2$ and $p_1 > p_2$. Hashing with functions from such a family ensures that the collision probability of close points is greater than that of distant points.
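As a concrete illustration of Eq. (1), the following Python sketch (our example, not code from the paper) uses the classical random-hyperplane family, for which the collision probability equals the angular similarity $1 - \theta(x, y)/\pi$:

```python
import numpy as np

rng = np.random.default_rng(0)

def hyperplane_family(d, k, rng):
    """Draw k random-hyperplane hash functions h_j(x) = sign(<w_j, x>)."""
    W = rng.standard_normal((k, d))
    return lambda x: (W @ x >= 0).astype(np.int8)

d, k = 64, 1000
h = hyperplane_family(d, k, rng)

x = rng.standard_normal(d)
y = x + 0.3 * rng.standard_normal(d)   # a nearby point

# Empirical collision rate over the k sampled hash functions ...
collision_rate = np.mean(h(x) == h(y))
# ... should match the angular similarity predicted by Eq. (1).
cos_xy = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.arccos(np.clip(cos_xy, -1.0, 1.0))
print(collision_rate, 1 - theta / np.pi)
```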
B. Low-tubal-rank Tensor
We use lowercase letters to represent scalar variables, e.g.,
x,y, and bold lowercase letters to indicate vectors, e.g., x,y.
The matrix is represented by bold uppercase letters, e.g., X,Y,
and higher-order tensor is represented by calligraphic letters,
e.g., X,Y. The transposition of high-order tensor is indicated
by the superscript †, e.g., X†,Y†, which first transposes the
elements of all the previous slice matrices and then reverses
the order of the slices, from the 2-th slice to the I3-th slice. In
order to calculate the clarity of the description, we define the
tensor e
Tmapped by the frequency domain space to represent
the original tensor Tto perform Fourier transform along the
third dimension.
Tubes/fibers and slices of a tensor: The higher-order analogue of a matrix column is called a tube, which is defined by fixing all indices but one. $\mathcal{T}(:, j, k)$, $\mathcal{T}(i, :, k)$ and $\mathcal{T}(i, j, :)$ are used to represent mode-1, mode-2, and mode-3 tubes, respectively, which are vectors. A slice is defined by a two-dimensional matrix: $\mathcal{T}(:, :, k)$, $\mathcal{T}(:, j, :)$ and $\mathcal{T}(i, :, :)$ represent the frontal, lateral, and horizontal slices, respectively. In addition, if all the frontal slice matrices of a tensor are diagonal, it is called an f-diagonal tensor.
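To make this notation concrete, the following NumPy sketch (our illustration; the tensor and its shape are arbitrary) indexes tubes and slices of a third-order tensor and forms $\widetilde{\mathcal{T}}$ by an FFT along the third dimension:

```python
import numpy as np

T = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)  # I1=2, I2=3, I3=4

# Tubes (vectors): fix all indices but one.
mode1_tube = T[:, 1, 2]   # T(:, j, k)
mode2_tube = T[0, :, 2]   # T(i, :, k)
mode3_tube = T[0, 1, :]   # T(i, j, :)

# Slices (matrices): fix exactly one index.
frontal    = T[:, :, 0]   # T(:, :, k)
lateral    = T[:, 1, :]   # T(:, j, :)
horizontal = T[0, :, :]   # T(i, :, :)

# tilde-T: Fourier transform along the third dimension.
T_tilde = np.fft.fft(T, axis=2)
```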
t-product [10], [11]: Let $\mathcal{A}$ be $I_1 \times I_2 \times I_0$ and $\mathcal{B}$ be $I_2 \times I_3 \times I_0$; the t-product of $\mathcal{A}$ and $\mathcal{B}$ can be expressed as

$$\mathcal{A} * \mathcal{B} = \text{fold}\left(\text{circ}(\mathcal{A}) \cdot \text{MatVec}(\mathcal{B})\right), \qquad (2)$$

where $\text{circ}(\mathcal{A})$ is the block circulant matrix of tensor $\mathcal{A}$, and $\text{MatVec}(\mathcal{B})$ is the block $I_2 I_0 \times I_3$ matrix obtained by stacking the frontal slices of tensor $\mathcal{B}$. In this paper, the product of two tensors is also called the tensor circular convolution operation.
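The following NumPy sketch (our illustration; the helper names are ours) implements Eq. (2) literally and checks it against the equivalent slice-wise matrix products in the Fourier domain:

```python
import numpy as np

def matvec(B):
    """Stack the frontal slices of B (I2 x I3 x I0) into an (I2*I0) x I3 matrix."""
    return np.concatenate([B[:, :, p] for p in range(B.shape[2])], axis=0)

def fold(M, I0):
    """Inverse of matvec: rebuild an I1 x I3 x I0 tensor from stacked slices."""
    I1 = M.shape[0] // I0
    return np.stack([M[p * I1:(p + 1) * I1, :] for p in range(I0)], axis=2)

def circ(A):
    """Block circulant matrix of the frontal slices of A (I1 x I2 x I0)."""
    I0 = A.shape[2]
    return np.block([[A[:, :, (p - q) % I0] for q in range(I0)]
                     for p in range(I0)])

def t_product(A, B):
    """Eq. (2): A * B = fold(circ(A) . MatVec(B))."""
    return fold(circ(A) @ matvec(B), A.shape[2])

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 4))   # I1 x I2 x I0
B = rng.standard_normal((3, 5, 4))   # I2 x I3 x I0
C = t_product(A, B)                  # I1 x I3 x I0

# Equivalent view: independent matrix products per frontal slice
# in the Fourier domain (tensor circular convolution).
C_f = np.fft.ifft(np.einsum('ijk,jlk->ilk',
                            np.fft.fft(A, axis=2),
                            np.fft.fft(B, axis=2)), axis=2).real
assert np.allclose(C, C_f)
```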
Third-order tensor block diagonal and circulant matrix [10], [11]: For a third-order tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, we denote by $A_p \in \mathbb{R}^{I_1 \times I_2}$ the block obtained by holding the third index of $\mathcal{A}$ fixed at $p$, $p \in [I_3]$, and by $\widetilde{A}_p$ the corresponding block in the Fourier domain. The block diagonal form of the third-order tensor $\mathcal{A}$ can be expressed as

$$\text{blkdiag}(\widetilde{\mathcal{A}}) = \begin{bmatrix} \widetilde{A}_1 & & & \\ & \widetilde{A}_2 & & \\ & & \ddots & \\ & & & \widetilde{A}_{I_3} \end{bmatrix} \in \mathbb{C}^{I_1 I_3 \times I_2 I_3}, \qquad (3)$$
where $\mathbb{C}$ denotes the set of complex numbers. We use the $\text{MatVec}(\cdot)$ function to stack the frontal slices of the tensor:

$$\text{MatVec}(\mathcal{A}) = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_{I_3} \end{bmatrix} \in \mathbb{R}^{I_1 I_3 \times I_2}. \qquad (4)$$
The fold operation takes $\text{MatVec}(\mathcal{A})$ back to the form of the original tensor:

$$\text{fold}(\text{MatVec}(\mathcal{A})) = \mathcal{A}. \qquad (5)$$
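A brief sketch of Eqs. (3)-(5) in NumPy (our illustration; it assumes SciPy's block_diag for assembling the block diagonal matrix):

```python
import numpy as np
from scipy.linalg import block_diag

I1, I2, I3 = 2, 3, 4
A = np.random.default_rng(1).standard_normal((I1, I2, I3))

# Eq. (3): block diagonal matrix of the Fourier-domain frontal slices.
A_tilde = np.fft.fft(A, axis=2)
D = block_diag(*[A_tilde[:, :, p] for p in range(I3)])
print(D.shape)  # (I1*I3, I2*I3) = (8, 12), complex entries

# Eq. (4): MatVec stacks the frontal slices vertically.
M = np.concatenate([A[:, :, p] for p in range(I3)], axis=0)  # (I1*I3) x I2

# Eq. (5): fold undoes MatVec, recovering the original tensor.
A_back = np.stack([M[p * I1:(p + 1) * I1, :] for p in range(I3)], axis=2)
assert np.allclose(A, A_back)
```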