Tristan Benoit, Jean-Yves Marion, and Sébastien Bardin
reported to take 10 minutes to compute basic bloc matching on
small binaries from the Coreutils package.
Goal.From the program clone search point of view, there is a strong
need for a binary-level program-level similarity technique that is
precise, robust to slight variation, and fast enough to operate over
large code bases. This is exactly what we want to address in this paper.
Our proposal. We explore the application of spectral graph analy-
sis [
14
] to the problem of program clone search. It seems a very good
starting point as, on graphs, it is both aordable and competitive
against graph edit distances (GED) [
66
] in terms of precision, while
GED is arguably a very good (but expensive to compute) notion of
graph similarity. Yet, programs are not standard graphs: on the one
hand programs seen as graphs can be very large (especially at the
binary level), while on the other hand they are highly structured
due to their function hierarchy.
We take advantage of this specicity and propose Program
Spectral Similarity (PSS), the rst spectral analysis tailored to
program similarity. The techniques extract eigenvalues related fea-
tures from both function call graphs and control ow graphs, and
take advantage of a preprocessing step (done once for the whole
program repository) to achieve similarity checks in time linear in
the number of functions of the program (done for each program in
the repository), making it a perfect t for program clone search –
most prior works have at least a quadratic runtime.
We experimentally show that PSS outperforms state-of-the-art
approaches and is resilient to code variations as well as lightweight
obfuscations (e.g., instruction substitution, bogus control ow, con-
trol ow attening). Moreover, PSS does not rely on literal identi-
ers (e.g., function names, constant string values), making it robust
against a range of basic obfuscations. In our experiments, a program
clone search with PSS (optimized version) takes on average less
than 3s (0.3s and 0.4s for Linux and IoT benchmarks) where, as a
comparison, the function embedding Gemini [
74
] requires roughly
2 minutes per clone search.
We set up a strong comprehensive evaluation framework (14 com-
petitors and 3 baselines) to systematically compare PSS with state-of-
the-art methods, covering string based methods [
69
,
70
], graph edit
distance [
27
,
34
], N-grams [
33
], vector embedding [
19
,
52
,
55
,
74
],
standard spectral methods [
27
] and matching algorithms [
4
,
73
].
Our experiments cover our own dataset of diverse open-source
projects along with classical Coreutils, Diutils, Findutils, and Binu-
tils packages along two dimensions (optimization levels and code
versions) for a total of 950 programs. Moreover, we consider part of
the BinKit dataset [
43
] (98K samples), covering four optimization
levels, 9 compilers, 8 architectures and 4 obfuscations. Finally, we
gather 19,959 IoT malware and 84,992 Windows goodware.
Contribution. As a summary, we claim the following:
•
A novel technique named PSS (together with its optimiza-
tion
𝑃𝑆𝑆𝑂
) for code similarity (Section 4), tailored to pro-
gram clone search over large repositories. PSS is the rst
spectral technique tailored to program-level similarity. Es-
pecially, PSS takes advantage of a preprocessing step to
perform latter similarity checks in time linear w.r.t. the
number of functions in the program, making it a perfect t
for program clone search over large repositories;
•
A comprehensive evaluation framework for program clone
search (Section 5), encompassing (1) 97,760 programs from
BinKit [
43
], 19,959 IoT malware, 84,992 Windows programs
and a smaller Linux dataset of 950 programs, and (2) three
baselines and 14 state-of-the-art methods – 10 of them being
reimplemented. The complete framework is available online,
which is rare in this eld [54];
•
Experimental evidence (Sections 5) that PSS reaches a sweet
spot in terms of speed, precision and robustness, making it
a perfect t for program clone search, where prior works
in the eld are more specialized to function-level similarity
evaluation. Especially, PSS appears to scale well and to
retain good precision in demanding clone search scenarios
(cross-compilers, cross-architecture or obfuscation);
•
Finally, as another notable result, we show that prior work
targeting function clones cannot cope with program clones
due to scalability issues.
Besides providing a novel and ecient method for program
clone search, our results also shed new light on prior work on code
similarity. First, we make the case for the program clone search
application scenario and show that it behaves dierently enough
than the well-studied pairwise function similarity setting, requiring
dedicated methods. Second, we are the rst to pinpoint the sepa-
ration in prior work between techniques using literal identiers
and those that do not. As a side result, during our experiments,
we identify two simple methods based on literal identiers (string
values and external function names), which despite their simplic-
ity, appear to perform well when these identiers are available.
These methods came from the simplication of ideas coming from
the state-of-the-art in library identication by using literal iden-
tiers [
20
,
32
,
69
,
73
]. Third, we show the potential of dedicated
spectral methods for program clone search. Overall, we believe that
these results pave the way for novel research directions in the eld.
Research artifacts are available on Zenodo [9].
2 PROBLEM STATEMENT
2.1 Program Clone Search Procedure
Given an unknown target program
𝑃
and a program repository
𝑅
,
the goal is to identify a clone of 𝑃in 𝑅.
A clone of a program 𝑃is dened as follows:
•
A program
𝑄
compiled from the same source code
𝑆
as
𝑃
,
but with a dierent compiler toolchain is a clone of
𝑃
. For
example,
𝑃
has been compiled with GCC v9.1 using the
optimization level -O0 from the source code
𝑆
, and
𝑄
has
also been built from
𝑆
using the same compiler but another
optimization level, say -O3;
•
A program
𝑄
compiled from another version of
𝑃
source
code is a clone of
𝑃
. For example, both instances of the git
application compiled from two source code versions, say
v2.35.2 and v2.37.1, are clones.
2