Scalable Program Clone Search Through Spectral Analysis

2025-05-03 0 0 820.55KB 13 页 10玖币

侵权投诉

Tristan Benoit

benoit.tristan.info@gmail.com

Université de Lorraine, CNRS, LORIA

Nancy, France

Jean-Yves Marion

jean-yves.marion@loria.fr

Université de Lorraine, CNRS, LORIA

Nancy, France

Sébastien Bardin

sebastien.bardin@cea.fr

CEA LIST, Université Paris-Saclay

Saclay, France

ABSTRACT

We consider the problem of program clone search, i.e. given a target

program and a repository of known programs (all in executable

format), the goal is to nd the program in the repository most

similar to the target program – with potential applications in terms

of reverse engineering, program clustering, malware lineage and

software theft detection. Recent years have witnessed a blooming

in code similarity techniques, yet most of them focus on function-

level similarity and function clone search, while we are interested

in program-level similarity and program clone search. Actually, our

study shows that prior similarity approaches are either too slow

to handle large program repositories, or not precise enough, or yet

not robust against slight variations introduced by compilers, source

code versions or light obfuscations. We propose a novel spectral

analysis method for program-level similarity and program clone

search called Programs Spectral Similarity (PSS). In a nutshell, PSS

one-time spectral feature extraction is tailored for large repositories,

making it a perfect t for program clone search. We have compared

the dierent approaches with extensive benchmarks, showing that

PSS reaches a sweet spot in terms of precision, speed and robustness.

CCS CONCEPTS

•Security and privacy

→

Software reverse engineering;Mal-

ware and its mitigation.

KEYWORDS

binary code analysis, clone search, spectral analysis

1 INTRODUCTION

Binary code similarity approaches identify similarities or dier-

ences [

] between pieces of assembly code (e.g., basic blocks, bi-

nary functions or whole programs). We focus on program-level

similarities (coined program similarity in the following), that is,

computing a similarity index between whole programs which is

capable of telling at which degree two programs are similar – with

potential applications in terms of reverse engineering, program

clustering, malware lineage and software theft detection.

Program clone search. Given a query composed of a target pro-

gram and a repository, the program clone search ranks repository

programs by their program similarity to the target program. The

search is successful if the most similar program is a clone of the

target program. These clones may be (i) compiled with slightly

dierent compiler chains, or (ii) produced from a slightly dierent

version of the source code, or (iii) altered by slight obfuscations.

Applications. Searching program clones between x86 or ARM bi-

naries over a large program repository is necessary when the origi-

nal program written in source code is unavailable, which happens

with commercial o-the-shelf (COTS), legacy programs, rmware

or malware. For example, detecting malware clones is a major is-

sue [

], as most malware are actually variants of a few

major families active for more than ve years

. Another applica-

tion is the identication of libraries [

], which is

both a software engineering issue and a cybersecurity issue due to

vulnerabilities inside dynamically linked libraries. The problem of

library identication, while in between programs and functions in

terms of size, is much closer to the case of program clones by its

nature, as libraries are not arbitrary collections of functions and

require inter-procedural analysis. The situation is similar for patch

and rmware analysis [

], or software theft detection [

which also need to consider a global view of the code.

In all these cases, we see function clone search as only a proxy to a

problem that is by nature at the level of programs.

Prior work. Given its potential applications and challenges, the

eld of similarity detection has been extremely active over the

last two decades, starting from the pioneering work of Dullien

in 2004 [

] on call-graph isomorphisms and the popular Bin-

Di tool for recognizing similar binary functions among two re-

lated executables. Other approaches include for example symbolic

methods [

], graph edit distances [

] and matching tech-

niques [

]. Interestingly, the last ve years have seen a strong

trend toward machine learning based approaches to binary function

similarity [

]. Overall, most prior work focuses on

function clone search and function-level similarity.

The challenges. Program clone search presents specic challenges

compared to standard function similarity. (1) As already stated, it re-

quires comparing programs, i.e. much larger objects than functions,

hence similarity checks must be scalable in typical program sizes;

(2) We do not consider two programs taken in isolation, but a target

program and a (possibly large) program repository, hence the need

for very ecient similarity checks that will be iterated over all the

programs in the repository; (3) The repository could contain similar

but slightly dierent programs, due to variations in compilers or

code versions. Clone search must be robust to such variations; (4)

Finally, the technique must work equally well on stripped binary

codes (where symbols have been removed at compile time), han-

dle the case where external function names are unavailable (for

example IoT device rmware), and handle lightweight obfuscations

(such as adding deadcode, or hiding literal identiers).

All these constraints do not t well with prior work on similarity,

as state-of-the-art is increasingly focused on function-level simi-

larities

, with unclear scalability toward the program-level case.

For example, we found in our experiments that SMIT [

] takes

more than 43 hours to compute a similarity index between the main

library of Geany and the cp command, while DeepBinDi [

] is

1https://www.cisa.gov/uscert/ncas/alerts/aa22-216a

According to Haq and Caballero [

], since 2014, among 40 binary code similarity

approaches, only 7 approaches have taken programs as input.

arXiv:2210.13063v3 [cs.CR] 31 Aug 2023

Tristan Benoit, Jean-Yves Marion, and Sébastien Bardin

reported to take 10 minutes to compute basic bloc matching on

small binaries from the Coreutils package.

Goal.From the program clone search point of view, there is a strong

need for a binary-level program-level similarity technique that is

precise, robust to slight variation, and fast enough to operate over

large code bases. This is exactly what we want to address in this paper.

Our proposal. We explore the application of spectral graph analy-

sis [

] to the problem of program clone search. It seems a very good

starting point as, on graphs, it is both aordable and competitive

against graph edit distances (GED) [

] in terms of precision, while

GED is arguably a very good (but expensive to compute) notion of

graph similarity. Yet, programs are not standard graphs: on the one

hand programs seen as graphs can be very large (especially at the

binary level), while on the other hand they are highly structured

due to their function hierarchy.

We take advantage of this specicity and propose Program

Spectral Similarity (PSS), the rst spectral analysis tailored to

program similarity. The techniques extract eigenvalues related fea-

tures from both function call graphs and control ow graphs, and

take advantage of a preprocessing step (done once for the whole

program repository) to achieve similarity checks in time linear in

the number of functions of the program (done for each program in

the repository), making it a perfect t for program clone search –

most prior works have at least a quadratic runtime.

We experimentally show that PSS outperforms state-of-the-art

approaches and is resilient to code variations as well as lightweight

obfuscations (e.g., instruction substitution, bogus control ow, con-

trol ow attening). Moreover, PSS does not rely on literal identi-

ers (e.g., function names, constant string values), making it robust

against a range of basic obfuscations. In our experiments, a program

clone search with PSS (optimized version) takes on average less

than 3s (0.3s and 0.4s for Linux and IoT benchmarks) where, as a

comparison, the function embedding Gemini [

] requires roughly

2 minutes per clone search.

We set up a strong comprehensive evaluation framework (14 com-

petitors and 3 baselines) to systematically compare PSS with state-of-

the-art methods, covering string based methods [

], graph edit

distance [

], N-grams [

], vector embedding [

standard spectral methods [

] and matching algorithms [

Our experiments cover our own dataset of diverse open-source

projects along with classical Coreutils, Diutils, Findutils, and Binu-

tils packages along two dimensions (optimization levels and code

versions) for a total of 950 programs. Moreover, we consider part of

the BinKit dataset [

] (98K samples), covering four optimization

levels, 9 compilers, 8 architectures and 4 obfuscations. Finally, we

gather 19,959 IoT malware and 84,992 Windows goodware.

Contribution. As a summary, we claim the following:

•

A novel technique named PSS (together with its optimiza-

tion

𝑃𝑆𝑆𝑂

) for code similarity (Section 4), tailored to pro-

gram clone search over large repositories. PSS is the rst

spectral technique tailored to program-level similarity. Es-

pecially, PSS takes advantage of a preprocessing step to

perform latter similarity checks in time linear w.r.t. the

number of functions in the program, making it a perfect t

for program clone search over large repositories;

•

A comprehensive evaluation framework for program clone

search (Section 5), encompassing (1) 97,760 programs from

BinKit [

], 19,959 IoT malware, 84,992 Windows programs

and a smaller Linux dataset of 950 programs, and (2) three

baselines and 14 state-of-the-art methods – 10 of them being

reimplemented. The complete framework is available online,

which is rare in this eld [54];

•

Experimental evidence (Sections 5) that PSS reaches a sweet

spot in terms of speed, precision and robustness, making it

a perfect t for program clone search, where prior works

in the eld are more specialized to function-level similarity

evaluation. Especially, PSS appears to scale well and to

retain good precision in demanding clone search scenarios

(cross-compilers, cross-architecture or obfuscation);

•

Finally, as another notable result, we show that prior work

targeting function clones cannot cope with program clones

due to scalability issues.

Besides providing a novel and ecient method for program

clone search, our results also shed new light on prior work on code

similarity. First, we make the case for the program clone search

application scenario and show that it behaves dierently enough

than the well-studied pairwise function similarity setting, requiring

dedicated methods. Second, we are the rst to pinpoint the sepa-

ration in prior work between techniques using literal identiers

and those that do not. As a side result, during our experiments,

we identify two simple methods based on literal identiers (string

values and external function names), which despite their simplic-

ity, appear to perform well when these identiers are available.

These methods came from the simplication of ideas coming from

the state-of-the-art in library identication by using literal iden-

tiers [

]. Third, we show the potential of dedicated

spectral methods for program clone search. Overall, we believe that

these results pave the way for novel research directions in the eld.

Research artifacts are available on Zenodo [9].

2 PROBLEM STATEMENT

2.1 Program Clone Search Procedure

Given an unknown target program

𝑃

and a program repository

𝑅

the goal is to identify a clone of 𝑃in 𝑅.

A clone of a program 𝑃is dened as follows:

•

A program

𝑄

compiled from the same source code

𝑆

𝑃

but with a dierent compiler toolchain is a clone of

𝑃

. For

example,

𝑃

has been compiled with GCC v9.1 using the

optimization level -O0 from the source code

𝑆

, and

𝑄

has

also been built from

𝑆

using the same compiler but another

optimization level, say -O3;

•

A program

𝑄

compiled from another version of

𝑃

source

code is a clone of

𝑃

. For example, both instances of the git

application compiled from two source code versions, say

v2.35.2 and v2.37.1, are clones.

Scalable Program Clone Search Through Spectral Analysis

In the last case, we have to be a bit careful. Indeed, we can

only consider incremental versions of an application or library, not

major revisions that completely change the source code. In our

experiments, the newest and oldest versions of most packages are

usually separated by 4 years. However, it goes up to 15 years for

the most standard packages: Coreutils, Diutils, and Findutils.

Unknown

Program

Similarity

Checks

Query Preprocessing Features

Repository

1: svn

2: git

3: cmp

........

0.82

0.65

0.49

....

Similarity Metric

svn

Figure 1: Architecture of a program clone search procedure

Figure 1 illustrates a clone search procedure architecture. Note

that all along, we suppose that there is no exact copy of

𝑃

in the

repository

𝑅

. The repository is a database containing enough in-

formation for a clone search procedure. As a result, in practice, a

repository is quite an extensive program database w.r.t. the applica-

tion domain (rmware, plagiarism, malware, etc.).

An evaluation of clone search procedures should take into con-

sideration the three criteria below in order to be realistic:

•

The eciency w.r.t. both the size of the unknown target

program and the size of the repository,

•

The robustness not only to compiler toolchains but also

to slight program variations coming from dierent source

code versions,

•

The ability to deal with stripped programs. Moreover, exter-

nal symbols are not necessarily available when dealing with

rmware, lightweight obfuscations, or yet from payload

extracted from packers[13].

As we said previously, the main dierence between program

clone search and function clone search is the size of the binary

codes, which is much larger in the case of programs.

At a high level, all program clone search procedures work in a

similar way. The repository is already built, and the query process

is divided into three steps:

(1)

Query preprocessing. Upon query, we receive the target

program

𝑃

. We can perform some preprocessing at this step,

extracting relevant features for the rest of the procedure;

(2)

Similarity checks. For each program

𝑄∈𝑅

, we perform

a similarity check with a similarity metric

𝑀

(𝑃, 𝑄)

–

possibly taking advantage of the preprocessing – and record

the computed similarity index 𝑀(𝑃, 𝑄);

(3)

Decision. The program

𝑄𝑏𝑒𝑠𝑡

with the highest similarity

index is considered the most similar. The program clone

search succeeds if 𝑄𝑏𝑒𝑠𝑡 is a clone of 𝑃, otherwise it fails.

2.2 Motivating Example

Let us consider a repository containing 1420 libraries obtained from

the compilation of 20 libraries

with four optimization levels, ve

versions of GCC, four versions of clang, and to the 32 and 64 bits x86

From packages libiconv, coreutils, libtool, gss, gdbm, libtasn1, gsl, libmicrohttpd, osip,

readline, gsasl, lightning, recutils, gmp, libunistring, and glpk.

Table 1: Clone searches results

Framework Average Total runtime

precision@1 (preprocess. time included)

Asm2Vec [19] †0.7 35h

Gemini [74] †1 17h

SAFE [55] †0.95 160h

𝛼Di [52] †1 140h

LibDB [70] †1 2h

PSS 126s

(includ. 26s of preprocess)

†learning time not included

platforms. Next, let us imagine we have the 20 libraries as targets

(compiled for x86 32 bits with gcc 6.4 and the -O2 optimization

level).

Lifting function-level clone searches in order to detect program-

level clones is attractive. However, to obtain a similarity index

between two programs from function embedding methods, we

need to nd a distance between two sets of function embeddings.

Let

𝑒𝑚𝑏𝑒𝑑𝑠 (𝑃)

be the set of function embeddings of a program

𝑃

. A

rst solution is to perform a matching between the two sets. Such

matching could be an instance of the assignment problem where

assigning a function embedding

𝑥

𝑃

to a function embedding

𝑦

𝑃′

has a cost

∥𝑥−𝑦∥2

. However, this problem has complexity

𝑂(𝑛3)

where

𝑛

is the number of functions. We relax the matching

so that a function embedding of a program

𝑃

can be assigned to

multiple function embeddings of a program 𝑃′.

We dene 𝐹as the similarity metric for an embedding 𝑒𝑚𝑏𝑒𝑑𝑠:

𝐹(𝑃, 𝑃′):=−

𝑥∈𝑒𝑚𝑏𝑒𝑑𝑠 (𝑃)

min

𝑦∈𝑒𝑚𝑏𝑒𝑑𝑠 (𝑃′)∥𝑥−𝑦∥2(1)

We consider the following function-level methods and lift them

to programs as just explained: Asm2Vec [

], Gemini [

], SAFE [

𝛼

Di [

]. We also consider LibDB [

], which is directly designed

for libraries (i.e., large pieces of code).

Results. We report in Table 1 the average precision@1, equivalent

to the proportion of successful clone searches, as well as clone

searches total runtime. PSS is precise and successful in all clone

searches. Most function-level methods can also nd a clone in all

clone searches. However, PSS takes only 26s in total, while pure

function embedding methods take from 17h with Gemini to 160h

with SAFE. Even with pre-ltering, LibDB is close to 2h. Moreover,

PSS runtime is due to its preprocessing; the total similarity checks

runtime is negligible. As a result, PSS scales up to large repositories

with good precision.

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

ScalableProgramCloneSearchThroughSpectralAnalysisTristanBenoitbenoit.tristan.info@gmail.comUniversitédeLorraine,CNRS,LORIANancy,FranceJean-YvesMarionjean-yves.marion@loria.frUniversitédeLorraine,CNRS,LORIANancy,FranceSébastienBardinsebastien.bardin@cea.frCEALIST,UniversitéParis-SaclaySaclay,FranceAB...

展开>> 收起<<

Scalable Program Clone Search Through Spectral Analysis.pdf

共13页,预览3页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Scalable Program Clone Search Through Spectral Analysis

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: