Scalable Program Clone Search Through Spectral Analysis

2025-05-03 0 0 820.55KB 13 页 10玖币
侵权投诉
Scalable Program Clone Search Through Spectral Analysis
Tristan Benoit
benoit.tristan.info@gmail.com
Université de Lorraine, CNRS, LORIA
Nancy, France
Jean-Yves Marion
jean-yves.marion@loria.fr
Université de Lorraine, CNRS, LORIA
Nancy, France
Sébastien Bardin
sebastien.bardin@cea.fr
CEA LIST, Université Paris-Saclay
Saclay, France
ABSTRACT
We consider the problem of program clone search, i.e. given a target
program and a repository of known programs (all in executable
format), the goal is to nd the program in the repository most
similar to the target program – with potential applications in terms
of reverse engineering, program clustering, malware lineage and
software theft detection. Recent years have witnessed a blooming
in code similarity techniques, yet most of them focus on function-
level similarity and function clone search, while we are interested
in program-level similarity and program clone search. Actually, our
study shows that prior similarity approaches are either too slow
to handle large program repositories, or not precise enough, or yet
not robust against slight variations introduced by compilers, source
code versions or light obfuscations. We propose a novel spectral
analysis method for program-level similarity and program clone
search called Programs Spectral Similarity (PSS). In a nutshell, PSS
one-time spectral feature extraction is tailored for large repositories,
making it a perfect t for program clone search. We have compared
the dierent approaches with extensive benchmarks, showing that
PSS reaches a sweet spot in terms of precision, speed and robustness.
CCS CONCEPTS
Security and privacy
Software reverse engineering;Mal-
ware and its mitigation.
KEYWORDS
binary code analysis, clone search, spectral analysis
1 INTRODUCTION
Binary code similarity approaches identify similarities or dier-
ences [
31
] between pieces of assembly code (e.g., basic blocks, bi-
nary functions or whole programs). We focus on program-level
similarities (coined program similarity in the following), that is,
computing a similarity index between whole programs which is
capable of telling at which degree two programs are similar – with
potential applications in terms of reverse engineering, program
clustering, malware lineage and software theft detection.
Program clone search. Given a query composed of a target pro-
gram and a repository, the program clone search ranks repository
programs by their program similarity to the target program. The
search is successful if the most similar program is a clone of the
target program. These clones may be (i) compiled with slightly
dierent compiler chains, or (ii) produced from a slightly dierent
version of the source code, or (iii) altered by slight obfuscations.
Applications. Searching program clones between x86 or ARM bi-
naries over a large program repository is necessary when the origi-
nal program written in source code is unavailable, which happens
with commercial o-the-shelf (COTS), legacy programs, rmware
or malware. For example, detecting malware clones is a major is-
sue [
4
,
18
,
57
,
73
], as most malware are actually variants of a few
major families active for more than ve years
1
. Another applica-
tion is the identication of libraries [
3
,
20
,
32
,
36
,
69
,
70
], which is
both a software engineering issue and a cybersecurity issue due to
vulnerabilities inside dynamically linked libraries. The problem of
library identication, while in between programs and functions in
terms of size, is much closer to the case of program clones by its
nature, as libraries are not arbitrary collections of functions and
require inter-procedural analysis. The situation is similar for patch
and rmware analysis [
75
], or software theft detection [
20
,
32
,
58
],
which also need to consider a global view of the code.
In all these cases, we see function clone search as only a proxy to a
problem that is by nature at the level of programs.
Prior work. Given its potential applications and challenges, the
eld of similarity detection has been extremely active over the
last two decades, starting from the pioneering work of Dullien
in 2004 [
22
,
23
] on call-graph isomorphisms and the popular Bin-
Di tool for recognizing similar binary functions among two re-
lated executables. Other approaches include for example symbolic
methods [
28
], graph edit distances [
34
,
44
] and matching tech-
niques [
4
,
73
]. Interestingly, the last ve years have seen a strong
trend toward machine learning based approaches to binary function
similarity [
19
,
52
,
55
,
74
,
77
]. Overall, most prior work focuses on
function clone search and function-level similarity.
The challenges. Program clone search presents specic challenges
compared to standard function similarity. (1) As already stated, it re-
quires comparing programs, i.e. much larger objects than functions,
hence similarity checks must be scalable in typical program sizes;
(2) We do not consider two programs taken in isolation, but a target
program and a (possibly large) program repository, hence the need
for very ecient similarity checks that will be iterated over all the
programs in the repository; (3) The repository could contain similar
but slightly dierent programs, due to variations in compilers or
code versions. Clone search must be robust to such variations; (4)
Finally, the technique must work equally well on stripped binary
codes (where symbols have been removed at compile time), han-
dle the case where external function names are unavailable (for
example IoT device rmware), and handle lightweight obfuscations
(such as adding deadcode, or hiding literal identiers).
All these constraints do not t well with prior work on similarity,
as state-of-the-art is increasingly focused on function-level simi-
larities
2
, with unclear scalability toward the program-level case.
For example, we found in our experiments that SMIT [
34
] takes
more than 43 hours to compute a similarity index between the main
library of Geany and the cp command, while DeepBinDi [
21
] is
1https://www.cisa.gov/uscert/ncas/alerts/aa22-216a
2
According to Haq and Caballero [
31
], since 2014, among 40 binary code similarity
approaches, only 7 approaches have taken programs as input.
1
arXiv:2210.13063v3 [cs.CR] 31 Aug 2023
Tristan Benoit, Jean-Yves Marion, and Sébastien Bardin
reported to take 10 minutes to compute basic bloc matching on
small binaries from the Coreutils package.
Goal.From the program clone search point of view, there is a strong
need for a binary-level program-level similarity technique that is
precise, robust to slight variation, and fast enough to operate over
large code bases. This is exactly what we want to address in this paper.
Our proposal. We explore the application of spectral graph analy-
sis [
14
] to the problem of program clone search. It seems a very good
starting point as, on graphs, it is both aordable and competitive
against graph edit distances (GED) [
66
] in terms of precision, while
GED is arguably a very good (but expensive to compute) notion of
graph similarity. Yet, programs are not standard graphs: on the one
hand programs seen as graphs can be very large (especially at the
binary level), while on the other hand they are highly structured
due to their function hierarchy.
We take advantage of this specicity and propose Program
Spectral Similarity (PSS), the rst spectral analysis tailored to
program similarity. The techniques extract eigenvalues related fea-
tures from both function call graphs and control ow graphs, and
take advantage of a preprocessing step (done once for the whole
program repository) to achieve similarity checks in time linear in
the number of functions of the program (done for each program in
the repository), making it a perfect t for program clone search –
most prior works have at least a quadratic runtime.
We experimentally show that PSS outperforms state-of-the-art
approaches and is resilient to code variations as well as lightweight
obfuscations (e.g., instruction substitution, bogus control ow, con-
trol ow attening). Moreover, PSS does not rely on literal identi-
ers (e.g., function names, constant string values), making it robust
against a range of basic obfuscations. In our experiments, a program
clone search with PSS (optimized version) takes on average less
than 3s (0.3s and 0.4s for Linux and IoT benchmarks) where, as a
comparison, the function embedding Gemini [
74
] requires roughly
2 minutes per clone search.
We set up a strong comprehensive evaluation framework (14 com-
petitors and 3 baselines) to systematically compare PSS with state-of-
the-art methods, covering string based methods [
69
,
70
], graph edit
distance [
27
,
34
], N-grams [
33
], vector embedding [
19
,
52
,
55
,
74
],
standard spectral methods [
27
] and matching algorithms [
4
,
73
].
Our experiments cover our own dataset of diverse open-source
projects along with classical Coreutils, Diutils, Findutils, and Binu-
tils packages along two dimensions (optimization levels and code
versions) for a total of 950 programs. Moreover, we consider part of
the BinKit dataset [
43
] (98K samples), covering four optimization
levels, 9 compilers, 8 architectures and 4 obfuscations. Finally, we
gather 19,959 IoT malware and 84,992 Windows goodware.
Contribution. As a summary, we claim the following:
A novel technique named PSS (together with its optimiza-
tion
𝑃𝑆𝑆𝑂
) for code similarity (Section 4), tailored to pro-
gram clone search over large repositories. PSS is the rst
spectral technique tailored to program-level similarity. Es-
pecially, PSS takes advantage of a preprocessing step to
perform latter similarity checks in time linear w.r.t. the
number of functions in the program, making it a perfect t
for program clone search over large repositories;
A comprehensive evaluation framework for program clone
search (Section 5), encompassing (1) 97,760 programs from
BinKit [
43
], 19,959 IoT malware, 84,992 Windows programs
and a smaller Linux dataset of 950 programs, and (2) three
baselines and 14 state-of-the-art methods – 10 of them being
reimplemented. The complete framework is available online,
which is rare in this eld [54];
Experimental evidence (Sections 5) that PSS reaches a sweet
spot in terms of speed, precision and robustness, making it
a perfect t for program clone search, where prior works
in the eld are more specialized to function-level similarity
evaluation. Especially, PSS appears to scale well and to
retain good precision in demanding clone search scenarios
(cross-compilers, cross-architecture or obfuscation);
Finally, as another notable result, we show that prior work
targeting function clones cannot cope with program clones
due to scalability issues.
Besides providing a novel and ecient method for program
clone search, our results also shed new light on prior work on code
similarity. First, we make the case for the program clone search
application scenario and show that it behaves dierently enough
than the well-studied pairwise function similarity setting, requiring
dedicated methods. Second, we are the rst to pinpoint the sepa-
ration in prior work between techniques using literal identiers
and those that do not. As a side result, during our experiments,
we identify two simple methods based on literal identiers (string
values and external function names), which despite their simplic-
ity, appear to perform well when these identiers are available.
These methods came from the simplication of ideas coming from
the state-of-the-art in library identication by using literal iden-
tiers [
20
,
32
,
69
,
73
]. Third, we show the potential of dedicated
spectral methods for program clone search. Overall, we believe that
these results pave the way for novel research directions in the eld.
Research artifacts are available on Zenodo [9].
2 PROBLEM STATEMENT
2.1 Program Clone Search Procedure
Given an unknown target program
𝑃
and a program repository
𝑅
,
the goal is to identify a clone of 𝑃in 𝑅.
A clone of a program 𝑃is dened as follows:
A program
𝑄
compiled from the same source code
𝑆
as
𝑃
,
but with a dierent compiler toolchain is a clone of
𝑃
. For
example,
𝑃
has been compiled with GCC v9.1 using the
optimization level -O0 from the source code
𝑆
, and
𝑄
has
also been built from
𝑆
using the same compiler but another
optimization level, say -O3;
A program
𝑄
compiled from another version of
𝑃
source
code is a clone of
𝑃
. For example, both instances of the git
application compiled from two source code versions, say
v2.35.2 and v2.37.1, are clones.
2
Scalable Program Clone Search Through Spectral Analysis
In the last case, we have to be a bit careful. Indeed, we can
only consider incremental versions of an application or library, not
major revisions that completely change the source code. In our
experiments, the newest and oldest versions of most packages are
usually separated by 4 years. However, it goes up to 15 years for
the most standard packages: Coreutils, Diutils, and Findutils.
Unknown
Program
Similarity
Checks
Query Preprocessing Features
Repository
1: svn
2: git
3: cmp
........
0.82
0.65
0.49
....
Similarity Metric
svn
Figure 1: Architecture of a program clone search procedure
Figure 1 illustrates a clone search procedure architecture. Note
that all along, we suppose that there is no exact copy of
𝑃
in the
repository
𝑅
. The repository is a database containing enough in-
formation for a clone search procedure. As a result, in practice, a
repository is quite an extensive program database w.r.t. the applica-
tion domain (rmware, plagiarism, malware, etc.).
An evaluation of clone search procedures should take into con-
sideration the three criteria below in order to be realistic:
The eciency w.r.t. both the size of the unknown target
program and the size of the repository,
The robustness not only to compiler toolchains but also
to slight program variations coming from dierent source
code versions,
The ability to deal with stripped programs. Moreover, exter-
nal symbols are not necessarily available when dealing with
rmware, lightweight obfuscations, or yet from payload
extracted from packers[13].
As we said previously, the main dierence between program
clone search and function clone search is the size of the binary
codes, which is much larger in the case of programs.
At a high level, all program clone search procedures work in a
similar way. The repository is already built, and the query process
is divided into three steps:
(1)
Query preprocessing. Upon query, we receive the target
program
𝑃
. We can perform some preprocessing at this step,
extracting relevant features for the rest of the procedure;
(2)
Similarity checks. For each program
𝑄𝑅
, we perform
a similarity check with a similarity metric
𝑀
on
(𝑃, 𝑄)
possibly taking advantage of the preprocessing – and record
the computed similarity index 𝑀(𝑃, 𝑄);
(3)
Decision. The program
𝑄𝑏𝑒𝑠𝑡
with the highest similarity
index is considered the most similar. The program clone
search succeeds if 𝑄𝑏𝑒𝑠𝑡 is a clone of 𝑃, otherwise it fails.
2.2 Motivating Example
Let us consider a repository containing 1420 libraries obtained from
the compilation of 20 libraries
3
with four optimization levels, ve
versions of GCC, four versions of clang, and to the 32 and 64 bits x86
3
From packages libiconv, coreutils, libtool, gss, gdbm, libtasn1, gsl, libmicrohttpd, osip,
readline, gsasl, lightning, recutils, gmp, libunistring, and glpk.
Table 1: Clone searches results
Framework Average Total runtime
precision@1 (preprocess. time included)
Asm2Vec [19] 0.7 35h
Gemini [74] 1 17h
SAFE [55] 0.95 160h
𝛼Di [52] 1 140h
LibDB [70] 1 2h
PSS 126s
(includ. 26s of preprocess)
learning time not included
platforms. Next, let us imagine we have the 20 libraries as targets
(compiled for x86 32 bits with gcc 6.4 and the -O2 optimization
level).
Lifting function-level clone searches in order to detect program-
level clones is attractive. However, to obtain a similarity index
between two programs from function embedding methods, we
need to nd a distance between two sets of function embeddings.
Let
𝑒𝑚𝑏𝑒𝑑𝑠 (𝑃)
be the set of function embeddings of a program
𝑃
. A
rst solution is to perform a matching between the two sets. Such
matching could be an instance of the assignment problem where
assigning a function embedding
𝑥
of
𝑃
to a function embedding
𝑦
of
𝑃
has a cost
𝑥𝑦2
. However, this problem has complexity
𝑂(𝑛3)
where
𝑛
is the number of functions. We relax the matching
so that a function embedding of a program
𝑃
can be assigned to
multiple function embeddings of a program 𝑃.
We dene 𝐹as the similarity metric for an embedding 𝑒𝑚𝑏𝑒𝑑𝑠:
𝐹(𝑃, 𝑃):=
𝑥𝑒𝑚𝑏𝑒𝑑𝑠 (𝑃)
min
𝑦𝑒𝑚𝑏𝑒𝑑𝑠 (𝑃)𝑥𝑦2(1)
We consider the following function-level methods and lift them
to programs as just explained: Asm2Vec [
19
], Gemini [
74
], SAFE [
55
],
𝛼
Di [
52
]. We also consider LibDB [
70
], which is directly designed
for libraries (i.e., large pieces of code).
Results. We report in Table 1 the average precision@1, equivalent
to the proportion of successful clone searches, as well as clone
searches total runtime. PSS is precise and successful in all clone
searches. Most function-level methods can also nd a clone in all
clone searches. However, PSS takes only 26s in total, while pure
function embedding methods take from 17h with Gemini to 160h
with SAFE. Even with pre-ltering, LibDB is close to 2h. Moreover,
PSS runtime is due to its preprocessing; the total similarity checks
runtime is negligible. As a result, PSS scales up to large repositories
with good precision.
3
摘要:

ScalableProgramCloneSearchThroughSpectralAnalysisTristanBenoitbenoit.tristan.info@gmail.comUniversitédeLorraine,CNRS,LORIANancy,FranceJean-YvesMarionjean-yves.marion@loria.frUniversitédeLorraine,CNRS,LORIANancy,FranceSébastienBardinsebastien.bardin@cea.frCEALIST,UniversitéParis-SaclaySaclay,FranceAB...

展开>> 收起<<
Scalable Program Clone Search Through Spectral Analysis.pdf

共13页,预览3页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:13 页 大小:820.55KB 格式:PDF 时间:2025-05-03

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 13
客服
关注