enrich the dataset of unknown malware samples, resist active attacks by malware, and improve the detection efficiency of malware.
The main contributions of this paper are as follows:
Firstly, a novel method for segmenting self-growing malware APKs into texture image features is proposed. We map binary malicious code segments into images and analyze the malicious code segments based on image texture. We design a texture cutting algorithm based on Locality Sensitive Hashing (LSH) to extract significant feature texture segments from the malicious code texture segments and enhance the texture features of malware.
Secondly, Singular Value Decomposition (SVD) based on the low-tubal rank is used to strengthen the characteristics of malicious code. Images of different sizes are unified into a fixed-size third-order tensor that serves as the input of the neural network model.
Thirdly, a flexible malware detection framework, MTFD-GANs, based on the generative adversarial network is proposed. New malicious code features are generated during training; they enrich the diversity of samples and enhance the robustness of the model. We extracted 2000 samples with distinct feature types from the Drebin dataset for testing. The experimental results show that the proposed model outperforms traditional malware detection models, with a maximum efficiency improvement of 41.6%.
The remainder of this paper is organized as follows. Section 2
presents the background. Section 3 details the preprocessing
for binary code and the structure of MTFD-GANs. Section 4
introduces the training of MTFD-GANs. Section 5 verifies the
validity of our proposed model through experiments. Finally,
Section 6 concludes this paper.
II. BACKGROUND
This section first introduces the Locality Sensitive Hashing algorithm used for significant feature segment extraction, then details the principle of tensor singular value decomposition. Finally, the Black-Bone prediction model is described.
A. Locality Sensitive Hashing
Locality Sensitive Hashing (LSH) [9] is based on the idea that multiple hash functions are used to project large-scale high-dimensional data points, so that the closer two points are, the more likely they are to remain close together after projection, and vice versa. Let $x$ and $y$ be two different high-dimensional feature vectors. In the LSH index algorithm, the probability of remaining close is usually related to the similarity, that is:

$$\Pr_{h_j \in \mathcal{H}}\left[h_j(x) = h_j(y)\right] = sim(x, y) \qquad (1)$$

where $\mathcal{H}$ is the hash function family, $h_j$ is a hash function randomly selected from $\mathcal{H}$, and $sim(\cdot)$ is the similarity function.
Obviously, the LSH algorithm depends on a locality-sensitive hash function family. Let $\mathcal{H}$ be a family of hash functions mapping $\mathbb{R}^d$ to a set $U$. For any two points $p$ and $q$, a hash function $h$ is randomly selected from the family $\mathcal{H}$. If the following two properties are satisfied, the function family $\mathcal{H} = \{h : \mathbb{R}^d \rightarrow U\}$ is called $(r_1, r_2, p_1, p_2)$-locally sensitive:
• if $D(p, q) \leq r_1$, then $\Pr_{\mathcal{H}}[h(p) = h(q)] \geq p_1$,
• if $D(p, q) \geq r_2$, then $\Pr_{\mathcal{H}}[h(p) = h(q)] \leq p_2$,
where $r_1 < r_2$ and $p_1 > p_2$. Hashing with functions from such a family ensures that the collision probability of close points is greater than that of distant points.
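As a concrete illustration of Eq. (1), the following Python sketch (our example, not code from the paper) uses the classical random-hyperplane family, for which the collision probability equals the angular similarity $1 - \theta(x, y)/\pi$:

```python
import numpy as np

rng = np.random.default_rng(0)

def hyperplane_family(d, k, rng):
    """Draw k random-hyperplane hash functions h_j(x) = sign(<w_j, x>)."""
    W = rng.standard_normal((k, d))
    return lambda x: (W @ x >= 0).astype(np.int8)

d, k = 64, 1000
h = hyperplane_family(d, k, rng)

x = rng.standard_normal(d)
y = x + 0.3 * rng.standard_normal(d)   # a nearby point

# Empirical collision rate over the k sampled hash functions ...
collision_rate = np.mean(h(x) == h(y))
# ... should match the angular similarity predicted by Eq. (1).
cos_xy = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.arccos(np.clip(cos_xy, -1.0, 1.0))
print(collision_rate, 1 - theta / np.pi)
```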
B. Low-tubal-rank Tensor
We use lowercase letters to represent scalar variables, e.g.,
x,y, and bold lowercase letters to indicate vectors, e.g., x,y.
The matrix is represented by bold uppercase letters, e.g., X,Y,
and higher-order tensor is represented by calligraphic letters,
e.g., X,Y. The transposition of high-order tensor is indicated
by the superscript †, e.g., X†,Y†, which first transposes the
elements of all the previous slice matrices and then reverses
the order of the slices, from the 2-th slice to the I3-th slice. In
order to calculate the clarity of the description, we define the
tensor e
Tmapped by the frequency domain space to represent
the original tensor Tto perform Fourier transform along the
third dimension.
Tubes/fibers and slices of a tensor: The higher-order analogue of a matrix column is called a tube, which is defined by fixing all indices but one. $\mathcal{T}(:, j, k)$, $\mathcal{T}(i, :, k)$ and $\mathcal{T}(i, j, :)$ are used to represent mode-1, mode-2, and mode-3 tubes, respectively, which are vectors. A slice is defined by a two-dimensional matrix: $\mathcal{T}(:, :, k)$, $\mathcal{T}(:, j, :)$ and $\mathcal{T}(i, :, :)$ represent the frontal, lateral, and horizontal slices, respectively. In addition, if all the frontal slice matrices of a tensor are diagonal, it is called an f-diagonal tensor.
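To make this notation concrete, the following NumPy sketch (our illustration; the tensor and its shape are arbitrary) indexes tubes and slices of a third-order tensor and forms $\widetilde{\mathcal{T}}$ by an FFT along the third dimension:

```python
import numpy as np

T = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)  # I1=2, I2=3, I3=4

# Tubes (vectors): fix all indices but one.
mode1_tube = T[:, 1, 2]   # T(:, j, k)
mode2_tube = T[0, :, 2]   # T(i, :, k)
mode3_tube = T[0, 1, :]   # T(i, j, :)

# Slices (matrices): fix exactly one index.
frontal    = T[:, :, 0]   # T(:, :, k)
lateral    = T[:, 1, :]   # T(:, j, :)
horizontal = T[0, :, :]   # T(i, :, :)

# tilde-T: Fourier transform along the third dimension.
T_tilde = np.fft.fft(T, axis=2)
```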
t-product [10], [11]: Let $\mathcal{A}$ be $I_1 \times I_2 \times I_0$ and $\mathcal{B}$ be $I_2 \times I_3 \times I_0$; the t-product of $\mathcal{A}$ and $\mathcal{B}$ can be expressed as

$$\mathcal{A} * \mathcal{B} = \text{fold}\left(\text{circ}(\mathcal{A}) \cdot \text{MatVec}(\mathcal{B})\right), \qquad (2)$$

where $\text{circ}(\mathcal{A})$ is the block circulant matrix of tensor $\mathcal{A}$, and $\text{MatVec}(\mathcal{B})$ is the block $I_2 I_0 \times I_3$ matrix obtained by stacking the frontal slices of tensor $\mathcal{B}$. In this paper, the product of two tensors is also called the tensor circular convolution operation.
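The following NumPy sketch (our illustration; the helper names are ours) implements Eq. (2) literally and checks it against the equivalent slice-wise matrix products in the Fourier domain:

```python
import numpy as np

def matvec(B):
    """Stack the frontal slices of B (I2 x I3 x I0) into an (I2*I0) x I3 matrix."""
    return np.concatenate([B[:, :, p] for p in range(B.shape[2])], axis=0)

def fold(M, I0):
    """Inverse of matvec: rebuild an I1 x I3 x I0 tensor from stacked slices."""
    I1 = M.shape[0] // I0
    return np.stack([M[p * I1:(p + 1) * I1, :] for p in range(I0)], axis=2)

def circ(A):
    """Block circulant matrix of the frontal slices of A (I1 x I2 x I0)."""
    I0 = A.shape[2]
    return np.block([[A[:, :, (p - q) % I0] for q in range(I0)]
                     for p in range(I0)])

def t_product(A, B):
    """Eq. (2): A * B = fold(circ(A) . MatVec(B))."""
    return fold(circ(A) @ matvec(B), A.shape[2])

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 4))   # I1 x I2 x I0
B = rng.standard_normal((3, 5, 4))   # I2 x I3 x I0
C = t_product(A, B)                  # I1 x I3 x I0

# Equivalent view: independent matrix products per frontal slice
# in the Fourier domain (tensor circular convolution).
C_f = np.fft.ifft(np.einsum('ijk,jlk->ilk',
                            np.fft.fft(A, axis=2),
                            np.fft.fft(B, axis=2)), axis=2).real
assert np.allclose(C, C_f)
```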
Third-order tensor block diagonal and circulant matrix [10], [11]: For a third-order tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, we denote by $A_p \in \mathbb{R}^{I_1 \times I_2}$ the block obtained by holding the third index of $\mathcal{A}$ fixed at $p$, $p \in [I_3]$, and by $\widetilde{A}_p$ the corresponding block in the Fourier domain. The block diagonal form of the third-order tensor $\mathcal{A}$ can be expressed as

$$\text{blkdiag}(\widetilde{\mathcal{A}}) = \begin{bmatrix} \widetilde{A}_1 & & & \\ & \widetilde{A}_2 & & \\ & & \ddots & \\ & & & \widetilde{A}_{I_3} \end{bmatrix} \in \mathbb{C}^{I_1 I_3 \times I_2 I_3}, \qquad (3)$$
where $\mathbb{C}$ denotes the set of complex numbers. We use the $\text{MatVec}(\cdot)$ function to stack the frontal slices of the tensor:

$$\text{MatVec}(\mathcal{A}) = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_{I_3} \end{bmatrix} \in \mathbb{R}^{I_1 I_3 \times I_2}. \qquad (4)$$
The fold operation takes $\text{MatVec}(\mathcal{A})$ back to the form of the original tensor:

$$\text{fold}(\text{MatVec}(\mathcal{A})) = \mathcal{A}. \qquad (5)$$
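A brief sketch of Eqs. (3)-(5) in NumPy (our illustration; it assumes SciPy's block_diag for assembling the block diagonal matrix):

```python
import numpy as np
from scipy.linalg import block_diag

I1, I2, I3 = 2, 3, 4
A = np.random.default_rng(1).standard_normal((I1, I2, I3))

# Eq. (3): block diagonal matrix of the Fourier-domain frontal slices.
A_tilde = np.fft.fft(A, axis=2)
D = block_diag(*[A_tilde[:, :, p] for p in range(I3)])
print(D.shape)  # (I1*I3, I2*I3) = (8, 12), complex entries

# Eq. (4): MatVec stacks the frontal slices vertically.
M = np.concatenate([A[:, :, p] for p in range(I3)], axis=0)  # (I1*I3) x I2

# Eq. (5): fold undoes MatVec, recovering the original tensor.
A_back = np.stack([M[p * I1:(p + 1) * I1, :] for p in range(I3)], axis=2)
assert np.allclose(A, A_back)
```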