Flexible Android Malware Detection Model based on
Generative Adversarial Networks with Code Tensor
Zhao Yang
Alibaba Group
Shenzhen, China
Email: lingxi.yz@alibaba-inc.com
Fengyang Deng
Huazhong University of Science and Technology
Wuhan, China
Email: fengyang deng@hust.edu.cn
Linxi Han
Xi’an International Studies University
Xi’an, Shaanxi
Email: hanlinxi@sina.cn
Abstract—Malware threats are steadily increasing, heightening the need for
malware detection. However, existing malware detection methods target only
known malicious samples, so their ability to detect new malicious code and
variants of existing malicious code is limited. In this paper, we propose a
novel scheme that detects malware and its variants efficiently. Based on the
idea of generative adversarial networks (GANs), we obtain a 'true' sample
distribution that satisfies the characteristics of real malware and use it to
deceive the discriminator, thereby defending against malicious code attacks
and improving malware detection. Firstly, we propose a new method for
segmenting an Android malware APK into image texture features, called the
segment self-growing texture segmentation algorithm. Secondly, tensor singular
value decomposition (tSVD) based on the low-tubal rank uniformly transforms
malicious features of different sizes into a fixed third-order tensor, which
is fed into the neural network for training. Finally, we propose a flexible
Android malware detection model based on GANs with code tensor (MTFD-GANs).
Experiments show that the proposed model generally surpasses traditional
malware detection models, with a maximum improvement in efficiency of 41.6%.
In addition, the samples newly generated by the GAN generator greatly enrich
sample diversity, and retraining the malware detector on them effectively
improves the detection efficiency and robustness of traditional models.
Index Terms—component, formatting, style, styling, insert
I. INTRODUCTION
The mobile Internet and intelligent mobile devices have developed rapidly over
the past decade; however, they have also brought security risks from malware.
About 80% of smartphone users worldwide use the Android operating system.
According to a report by G Data [1], the total number of mobile malware
samples increased by 40% in 2018, and 3.2 million new Android malware samples
had been detected by the end of the third quarter of 2018. The threat to the
Android system has reached a new level.
Traditionally, malicious code analysis methods include static analysis [2] and
dynamic analysis [3]. More recently, researchers have extracted features of
malware, such as API call sequences and requested permissions, and analyzed
them using machine learning methods. Ye et al. [4] extracted the API call
sequences of malware, the control flow graph, and other features, and analyzed
them with information chains, word frequency statistics, and other methods.
Cui et al. [5] detected malicious code using CNNs and a multi-objective
algorithm: they converted the binary executable files of malware into
grayscale images and then applied CNNs to detect malware.
is the corresponding author. Email: lingxi.yz@alibaba-inc.com
contribute equally to this paper
However, static analysis is susceptible to code obfuscation techniques, and
dynamic analysis results may be affected by missing key execution paths. Ye et
al. and Cui et al. analyzed the entire executable file of the malware, which
tends to dilute the malicious features of the code. Moreover, these methods
rely on existing malware samples, so detection lags behind the emergence of
new malware.
These problems raise several challenges:
Firstly, how to enhance the malicious features. Because irrelevant features
mixed in with malicious ones weaken the latter, it is necessary to separate a
complete executable file's code into feature fragments that exhibit malicious
behavior.
Secondly, how to unify the size of the features after segmentation while
minimizing the loss of behavior patterns when the feature size is adjusted.
Given current feature extraction methods, an effective code feature
decomposition or mapping needs to be designed.
Thirdly, how to detect new malware and variants of existing malware. If new
malware features can be generated from existing malware samples, this not only
enriches the malicious samples but also improves the efficiency of active
defense.
To address these challenges, we first map binary malicious code to an image,
then segment the image using the self-growing segmentation algorithm, and then
use tensor singular value decomposition to transform the segments of different
sizes into third-order tensors. Finally, based on the GAN idea, we generate
new malicious code samples and improve the detector's performance. GAN
(Generative Adversarial Network), proposed by Goodfellow et al. [6], builds a
network composed of a generator and a discriminator and trains the model by
adversarial learning; it fits the idea of active defense and our goal of
upgrading the original malicious code detector model. In the field of
information security, work on GANs mainly focuses on obtaining adversarial
samples [7] and generating adversarial virus samples [8]. Based on these
theories, we use the generative adversarial network to generate adversarial
samples under semi-white-box and black-box attack settings.
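The first step of the pipeline described above, mapping binary code to an image, can be illustrated with a minimal sketch. The function name `bytes_to_grayscale`, the fixed row width of 64 pixels, and the zero padding of the last row are assumptions made for illustration only; the paper's own mapping may differ.

```python
import numpy as np

def bytes_to_grayscale(data: bytes, width: int = 64) -> np.ndarray:
    """Map a byte string to a 2-D uint8 array, one byte per pixel.

    The row width and the zero padding are illustrative assumptions.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = np.frombuffer(data, dtype=np.uint8)
    height = -(-len(pixels) // width)          # ceiling division
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[:len(pixels)] = pixels              # pad the tail with zeros
    return padded.reshape(height, width)

# Hypothetical input: 770 bytes standing in for a code segment.
sample = bytes(range(256)) * 3 + b"\x01\x02"
image = bytes_to_grayscale(sample, width=64)
print(image.shape)
```

Each byte value (0 to 255) becomes one gray level, so repeated byte patterns in the code show up as visible texture in the image.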
In this paper, we propose a novel malware detection method that uses the idea
of GANs to generate 'true' samples satisfying the distribution characteristics
of malware data and repeatedly retrains the malware detector model. It can
effectively enrich the dataset of unknown malware samples, resist active
attacks by malware, and improve the efficiency of malware detection.
arXiv:2210.14225v1 [cs.CR] 25 Oct 2022
The main contributions of this paper are as follows:
Firstly, a novel self-growing method for segmenting a malware APK into texture
image features is proposed. We map the binary malicious code segments into
images and analyze the malicious code segments based on the image texture. We
design a texture cutting algorithm based on the Locality Sensitive Hashing
(LSH) algorithm to extract significant feature texture segments from the
malicious code texture segments and enhance the texture features of malware.
Secondly, Singular Value Decomposition (SVD) based on the low-tubal rank is
used to strengthen the characteristics of malicious code. Images of different
sizes are unified into a fixed-size third-order tensor that serves as the
input of the neural network model.
Thirdly, a flexible malware detection framework, MTFD-GANs, based on the
generative adversarial network is proposed. New malicious code features are
generated while the model is trained; they enrich the diversity of samples and
enhance the robustness of the model. We extracted 2000 samples with distinct
feature types from the Drebin dataset for testing. The experimental results
show that the proposed model outperforms the traditional malware detection
model, with a maximum improvement in efficiency of 41.6%.
The main structure of this paper is as follows. Section 2
presents the background. Section 3 details the preprocessing
for binary code and the structure of MTFD-GANs. Section 4
introduces the training of MTFD-GANs. Section 5 verifies the
validity of our proposed model through experiments. Finally,
Section 6 concludes this paper.
II. BACKGROUND
This section first introduces the Locality Sensitive Hashing algorithm used
for significant feature segment extraction. It then details the principle of
tensor singular value decomposition. Finally, the Black-Bone prediction model
is described.
A. Locality Sensitive Hashing
Locality Sensitive Hashing (LSH) [9] is based on the idea of projecting
large-scale, high-dimensional data points with multiple hash functions so that
the closer two points are, the more likely they are to remain close together
after hashing, and vice versa.
Let $x$ and $y$ be two different high-dimensional feature vectors. In the LSH
index algorithm, the probability of remaining close is usually related to the
similarity, that is:
$$\Pr_{h_j \in H}[h_j(x) = h_j(y)] = \mathrm{sim}(x, y) \tag{1}$$
where $H$ is the hash function family, $h_j$ is a hash function randomly
selected from the family, and $\mathrm{sim}(\cdot)$ is the similarity function.
The LSH algorithm therefore depends on a locality-sensitive hash function
family. Let $H$ be a family of hash functions mapping $\mathbb{R}^d$ to a set
$U$, and let $D(p, q)$ denote the distance between two points. For any two
points $p$ and $q$, a hash function $h$ is randomly selected from the family
$H$. If the following two properties are satisfied, the function family
$H = \{h : \mathbb{R}^d \to U\}$ is called $(r_1, r_2, p_1, p_2)$-locally
sensitive:
if $D(p, q) \le r_1$, then $\Pr_H[h(p) = h(q)] \ge p_1$,
if $D(p, q) \ge r_2$, then $\Pr_H[h(p) = h(q)] \le p_2$,
where $r_1 < r_2$ and $p_1 > p_2$. Hashing with such a function family ensures
that the collision probability of close points is greater than that of distant
points.
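The locality-sensitive property can be checked empirically with a small sketch. It assumes one common family for Euclidean distance, $h(v) = \lfloor (a \cdot v + b) / w \rfloor$ with Gaussian $a$ and uniform $b$; this choice, the bucket width `w`, and all function names are illustrative assumptions, not necessarily the family used in the paper.

```python
import numpy as np

def make_hash_family(dim, n_hashes, bucket_width, rng):
    """Draw n_hashes random functions h(v) = floor((a.v + b) / w)."""
    a = rng.standard_normal((n_hashes, dim))           # random projections
    b = rng.uniform(0.0, bucket_width, size=n_hashes)  # random offsets
    return a, b

def hash_point(v, a, b, bucket_width):
    """Apply every hash function in the family to the point v."""
    return np.floor((a @ v + b) / bucket_width).astype(np.int64)

def collision_rate(p, q, a, b, bucket_width):
    """Fraction of hash functions for which h(p) == h(q)."""
    hp = hash_point(p, a, b, bucket_width)
    hq = hash_point(q, a, b, bucket_width)
    return float(np.mean(hp == hq))

rng = np.random.default_rng(0)
dim, w = 16, 4.0
a, b = make_hash_family(dim, 5000, w, rng)

p = rng.standard_normal(dim)
direction = rng.standard_normal(dim)
direction /= np.linalg.norm(direction)
q_near = p + 0.5 * direction   # D(p, q_near) = 0.5, playing the role of r1
q_far = p + 8.0 * direction    # D(p, q_far) = 8.0, playing the role of r2

near_rate = collision_rate(p, q_near, a, b, w)
far_rate = collision_rate(p, q_far, a, b, w)
print(near_rate, far_rate)
```

The empirical collision rate of the near pair estimates a lower bound of the kind $p_1$ describes, and that of the far pair an upper bound of the kind $p_2$ describes; the near rate should be clearly larger.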
B. Low-tubal-rank Tensor
We use lowercase letters to represent scalar variables, e.g., $x$, $y$, and
bold lowercase letters to indicate vectors, e.g., $\mathbf{x}$, $\mathbf{y}$.
Matrices are represented by bold uppercase letters, e.g., $\mathbf{X}$,
$\mathbf{Y}$, and higher-order tensors by calligraphic letters, e.g.,
$\mathcal{X}$, $\mathcal{Y}$. The transpose of a higher-order tensor is
indicated by a superscript, e.g., $\mathcal{X}^{\top}$, $\mathcal{Y}^{\top}$;
it first transposes all the frontal slice matrices and then reverses the order
of the slices from the 2nd slice to the $I_3$-th slice. For clarity of
description, we denote by $\tilde{\mathcal{T}}$ the frequency-domain
representation of the original tensor $\mathcal{T}$, obtained by taking the
Fourier transform along the third dimension.
Tubes/fibers and slices of a tensor: The higher-order analogue of a matrix's
column is called a tube, which is obtained by varying one index while fixing
the others. $\mathcal{T}(:, j, k)$, $\mathcal{T}(i, :, k)$, and
$\mathcal{T}(i, j, :)$ represent the mode-1, mode-2, and mode-3 tubes,
respectively, which are vectors. A slice is a two-dimensional matrix:
$\mathcal{T}(:, :, k)$, $\mathcal{T}(:, j, :)$, and $\mathcal{T}(i, :, :)$
represent the frontal, lateral, and horizontal slices, respectively. In
addition, if all the frontal slice matrices of a tensor are diagonal, the
tensor is called an f-diagonal tensor.
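The tube and slice notation maps directly onto array indexing. The sketch below uses NumPy with zero-based indices (the text uses one-based indices); the helper name `is_f_diagonal` is an assumption introduced for illustration.

```python
import numpy as np

# A small third-order tensor with I1 = 2, I2 = 3, I3 = 4.
T = np.arange(24).reshape(2, 3, 4)
i, j, k = 1, 2, 3  # zero-based indices

mode1_tube = T[:, j, k]   # T(:, j, k), a vector of length I1
mode2_tube = T[i, :, k]   # T(i, :, k), a vector of length I2
mode3_tube = T[i, j, :]   # T(i, j, :), a vector of length I3

frontal = T[:, :, k]      # T(:, :, k), an I1 x I2 matrix
lateral = T[:, j, :]      # T(:, j, :), an I1 x I3 matrix
horizontal = T[i, :, :]   # T(i, :, :), an I2 x I3 matrix

def is_f_diagonal(tensor):
    """Return True if every frontal slice of the tensor is diagonal."""
    rows, cols = np.indices(tensor.shape[:2])
    off_diagonal = rows != cols
    # Selects all off-diagonal entries of all frontal slices at once.
    return bool(np.all(tensor[off_diagonal] == 0))

D = np.zeros((3, 3, 2))
D[0, 0, :] = 1.0
D[1, 1, :] = [2.0, 3.0]
print(is_f_diagonal(D), is_f_diagonal(T))
```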
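t-product [10], [11]: Let $\mathcal{A}$ be of size $I_1 \times I_2 \times I'$
and $\mathcal{B}$ be of size $I_2 \times I_3 \times I'$. The t-product of
$\mathcal{A}$ and $\mathcal{B}$ can be expressed as
$$\mathcal{A} * \mathcal{B} = \mathrm{fold}(\mathrm{circ}(\mathcal{A}) \cdot \mathrm{MatVec}(\mathcal{B})), \tag{2}$$
where $\mathrm{circ}(\mathcal{A})$ is the block circulant matrix of tensor
$\mathcal{A}$, and $\mathrm{MatVec}(\mathcal{B})$ is the block
$I_2 I' \times I_3$ matrix obtained from tensor $\mathcal{B}$. In this paper,
the product of two tensors is also called the tensor circular convolution
operation.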
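The t-product can be written out directly from its definition. The sketch below is a minimal NumPy version under stated assumptions: frontal slices are stored along the last axis, `fold` takes the number of slices as an explicit argument, and the helper names are illustrative. It also compares the result with the well-known equivalent computation in the Fourier domain, where the t-product reduces to slice-wise matrix products.

```python
import numpy as np

def mat_vec(T):
    """Stack the frontal slices vertically: (I1, I2, n) -> (I1*n, I2)."""
    return np.concatenate([T[:, :, k] for k in range(T.shape[2])], axis=0)

def fold(M, n):
    """Inverse of mat_vec: (I1*n, I2) -> (I1, I2, n)."""
    rows = M.shape[0] // n
    return np.stack([M[k * rows:(k + 1) * rows, :] for k in range(n)], axis=2)

def circ(T):
    """Block circulant matrix; block (r, c) is frontal slice (r - c) mod n."""
    n = T.shape[2]
    return np.block([[T[:, :, (r - c) % n] for c in range(n)]
                     for r in range(n)])

def t_product(A, B):
    """t-product by definition: fold(circ(A) @ MatVec(B))."""
    return fold(circ(A) @ mat_vec(B), A.shape[2])

def t_product_fft(A, B):
    """Same product computed slice by slice in the Fourier domain."""
    A_hat = np.fft.fft(A, axis=2)
    B_hat = np.fft.fft(B, axis=2)
    C_hat = np.einsum("ijk,jlk->ilk", A_hat, B_hat)
    return np.real(np.fft.ifft(C_hat, axis=2))

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 4))   # I1 x I2 x I'
B = rng.standard_normal((3, 5, 4))   # I2 x I3 x I'
C = t_product(A, B)                  # I1 x I3 x I'
print(C.shape, np.allclose(C, t_product_fft(A, B)))
```

The block circulant form is convenient for reasoning, while the Fourier-domain form avoids building the large `circ` matrix and is usually preferred in practice.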
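Third-order tensor block diagonal and circulant matrix [10], [11]: For a
third-order tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, we
denote by $\mathbf{A}_p \in \mathbb{R}^{I_1 \times I_2}$ the matrix obtained by
holding the third index of $\mathcal{A}$ fixed at $p$, $p \in [I_3]$, and by
$\tilde{\mathbf{A}}_p$ the corresponding block in the Fourier domain. The block
diagonal form of the third-order tensor $\mathcal{A}$ can be expressed as
$$\mathrm{blkdiag}(\tilde{\mathcal{A}}) \triangleq \begin{bmatrix} \tilde{\mathbf{A}}_1 & & & \\ & \tilde{\mathbf{A}}_2 & & \\ & & \ddots & \\ & & & \tilde{\mathbf{A}}_{I_3} \end{bmatrix} \in \mathbb{C}^{I_1 I_3 \times I_2 I_3}, \tag{3}$$
where $\mathbb{C}$ denotes the set of complex numbers. We use the
$\mathrm{MatVec}(\cdot)$ function to expand the frontal slices of the tensor:
$$\mathrm{MatVec}(\mathcal{A}) = \begin{bmatrix} \mathbf{A}_1 \\ \mathbf{A}_2 \\ \vdots \\ \mathbf{A}_{I_3} \end{bmatrix} \in \mathbb{R}^{I_1 I_3 \times I_2}. \tag{4}$$
The $\mathrm{fold}(\cdot)$ operation takes $\mathrm{MatVec}(\mathcal{A})$ back
to the form of the original tensor by
$$\mathrm{fold}(\mathrm{MatVec}(\mathcal{A})) = \mathcal{A}, \tag{5}$$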
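The operators in (3) to (5) can be sketched in a few lines. This is an illustrative NumPy version, assuming frontal slices are stored along the last axis; the helper names and the explicit slice-count argument of `fold` are assumptions, not the paper's implementation.

```python
import numpy as np

def mat_vec(T):
    """Equation (4): stack frontal slices, (I1, I2, I3) -> (I1*I3, I2)."""
    return np.concatenate([T[:, :, p] for p in range(T.shape[2])], axis=0)

def fold(M, n_slices):
    """Equation (5): undo mat_vec, (I1*I3, I2) -> (I1, I2, I3)."""
    rows = M.shape[0] // n_slices
    return np.stack([M[p * rows:(p + 1) * rows, :] for p in range(n_slices)],
                    axis=2)

def blkdiag_fourier(T):
    """Equation (3): block diagonal matrix of the Fourier-domain slices."""
    I1, I2, I3 = T.shape
    T_hat = np.fft.fft(T, axis=2)            # transform along the third mode
    out = np.zeros((I1 * I3, I2 * I3), dtype=complex)
    for p in range(I3):
        out[p * I1:(p + 1) * I1, p * I2:(p + 1) * I2] = T_hat[:, :, p]
    return out

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3, 4))           # I1 = 2, I2 = 3, I3 = 4

stacked = mat_vec(A)                         # shape (8, 3)
restored = fold(stacked, A.shape[2])         # shape (2, 3, 4)
block_diag = blkdiag_fourier(A)              # shape (8, 12), complex
print(stacked.shape, block_diag.shape, np.allclose(restored, A))
```

The first diagonal block equals the sum of all frontal slices, because the zero-frequency component of a discrete Fourier transform is the sum of its inputs.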