Comparing One with Many Solving Binary2source Function Matching Under Function Inlining

2025-04-29 0 0 1.19MB 22 页 10玖币

侵权投诉

Comparing One with Many — Solving Binary2source Function Matching Under

Function Inlining

ANG JIA, MING FAN, XI XU, WUXIA JIN, and HAIJUN WANG, Xi’an Jiaotong University, China

QIYI TANG, SEN NIE, and SHI WU, Tencent Security Keen Lab, China

TING LIU, Xi’an Jiaotong University, China

Binary2source function matching is a fundamental task for many security applications, including Software Component Analysis

(SCA).

“1-to-1”

mechanism has been applied in existing binary2source matching works, in which one binary function is matched

against one source function. However, we discovered that such mapping could be

“1-to-n”

(one query binary function maps multiple

source functions), due to the existence of function inlining.

To help conduct binary2source function matching under function inlining, we propose a method named

O2NMatcher

to generate

Source Function Sets (SFSs) as the matching target for binary functions with inlining. We rst propose a model named

ECOCCJ48

for

inlined call site prediction. To train this model, we leverage the compilable OSS to generate a dataset with labeled call sites (inlined

or not), extract several features from the call sites, and design a compiler-opt-based multi-label classier by inspecting the inlining

correlations between dierent compilations. Then, we use this model to predict the labels of call sites in the uncompilable OSS projects

without compilation and obtain the labeled function call graphs of these projects. Next, we regard the construction of SFSs as a sub-tree

generation problem and design root node selection and edge extension rules to construct SFSs automatically. Finally, these SFSs will

be added to the corpus of source functions and compared with binary functions with inlining. We conduct several experiments to

evaluate the eectiveness of O2NMatcher and results show our method increases the performance of existing works by 6% and exceeds

all the state-of-the-art works.1

CCS Concepts:

•Software and its engineering →Search-based software engineering

;

Maintaining software

;

•Security and

privacy →Software and application security.

Additional Key Words and Phrases: Binary2source Matching, Function Inlining, “1-to-n”, Source Function Sets

1 INTRODUCTION

Most software today is not developed entirely from scratch. Instead, developers rely on a range of open-source

components to create their applications[

]. According to a report published by Gartner [

], over 90% of the development

organizations stated that they rely on open-source components. Although using open-source components helps to

nish projects quicker and reduce costs, dependence on risky open-source components brings software supply chain

security risks[

]. For example, due to code reuse, a single vulnerability (e.g., the Heartbleed [

] in OpenSSL [

]) may

spread across thousands of software, causing 17% (around half a million) of the Internet’s secure web servers vulnerable.

To avoid vulnerable dependence, Software Component Analysis (SCA) [

] is proposed to discover software’s

dependence on Open Source Software (OSS) projects. Usually, the SCA service provider maintains a large OSS codebase.

When commercial software companies send their released binary executables, the SCA service provider compares these

binaries with the OSS projects and returns a report of the OSS components that the queried executables contain.

1New Paper

Authors’ addresses: Ang Jia, jiaang@stu.xjtu.edu.cn; Ming Fan, mingfan@mail.xjtu.edu.cn; Xi Xu, xx19960325@stu.xjtu.edu.cn; Wuxia Jin, jinwuxia@

mail.xjtu.edu.cn; Haijun Wang, hjwang.china@gmail.com, Xi’an Jiaotong University, Shaanxi, Xi’an, China, 710049; Qiyi Tang, dodgetang@tencent.com;

Sen Nie, snie@tencent.com; Shi Wu, shiwu@tencent.com, Tencent Security Keen Lab, Shanghai, China, 710049; Ting Liu, tingliu@mail.xjtu.edu.cn, Xi’an

Jiaotong University, Xi’an, China.

Manuscript submitted to ACM 1

arXiv:2210.15159v1 [cs.SE] 27 Oct 2022

2 Ang Jia et al.

The comparison procedure in the SCA is accomplished by binary2source function matching. When a query binary

is provided, this query binary will be rst dissembled into binary functions and then these binary functions will be

compared with the source functions of the OSS. Existing binary2source matching works, such as B2SMatcher [

] and

CodeCMR [

], conduct an

“1-to-1”

matching between the query binary function and the source function. However,

we recover that such mapping could be

“1-to-n”

(one query binary function maps multiple source functions), due to

the existence of function inlining.

Figure 1shows an “1-to-n” matching case when comparing the binary function dtls1_get_record with the source

function dtls1_get_record. Source function dtls1_get_record is from OpenSSL in version 1.0.1j which is associated with

the vulnerability CVE-2014-3571[

] and binary function dtls1_get_record is compiled from it using gcc-8.2.0 with O2.

But when we use CodeCMR to search the corresponding source function for binary function dtls1_get_record, we nd

that the similarity between the binary function dtls1_get_record and the source function dtls1_get_record is less than

60%, leading a failure in fetching the corresponding source function in the top 10 returns. This mismatch causes existing

works to fail to detect this vulnerability and leaves this binary function risky.

(a) Binary function dtls1_get_record

File: ssl/d1_pkt.c

Line 551: int dtls1_get_record(SSL *s)

{

…

Line 565: dtls1_process_buffered_records(s);

…

Line 660: bitmap = dtls1_get_bitmap(s, rr, &is_next_epoch);

…

Line 679: if (!(s->d1->listen && rr->type == SSL3_RT_HANDSHAKE &&

*p == SSL3_MT_CLIENT_HELLO) &&

!dtls1_record_replay_check(s, bitmap))

…}

File: ssl/d1_pkt.c

Line 299: static int

dtls1_process_buffered_records(SSL *s)

…

File: ssl/d1_pkt.c

Line 1792: static DTLS1_BITMAP *

dtls1_get_bitmap(SSL *s, SSL3_RECORD

*rr, unsigned int *is_next_epoch)

…

File: ssl/d1_pkt.c

Line 1680: static int

dtls1_record_replay_check(SSL *s,

DTLS1_BITMAP *bitmap)

…

(b) Source functions that binary function dtls1_get_record maps

Fig. 1. “1-to-n” matching cases brought by function inlining

This is the issue that

function inlining

has brought. When we look deeper into the compilation, we nd that source

function dtls1_get_record inlines the contents of several other source functions, including dtls1_process_buered_records,

dtls1_get_bitmap and dtls1_record_replay_check, into its body. Function inlining, which replaces a function call with the

embedded function body, greatly changes the content of the function, thus aecting the progress of function matching.

As a result, source function dtls1_get_record and binary function dtls1_get_record become less similar, and existing

“1-to-1” matching mechanism suers when conducting binary2source function matching under function inlining.

Manuscript submitted to ACM

Comparing One with Many — Solving Binary2source Function Matching Under Function Inlining 3

As function inlining widely exists in binaries [

], It is important to resolve the issues function inlining brings.

However, currently, there are still three challenges for conducting binary2source matching under function inlining.

C1: “Out-of-domain” (OOD) issue in OSS projects.

The binary functions with inlining are generated by more

than one source function. However, there does not exist a source function in the OSS projects that has the fully same

semantics as the queried binary function. As a result, the query of binary function with inlining is experiencing an

OOD issue, where an exact match of the queried binary function does not exist in the OSS corpus.

C2: Opaque inlining in stripped binaries.

The queried binaries are often stripped which makes it harder to infer

which binary functions are generated with inlining and which source functions are inlined into the binary function.

The opaque inlining makes it hard to decide when and how to apply a strategy to tackle function inlining.

C3: Various inlining decisions in dierent binaries.

The queried binaries are generated through enormous

compilation settings and thus result in various inlining decisions in the release binaries. It is dicult to cover all the

inlining cases without a comprehensive study of the correlation between inlining under various compilation settings.

To tackle the aforementioned three challenges, we propose a matching method named

O2NMatcher

that uses the

“1-to-n” matching

mechanism to match binary functions with source functions. Instead of developing an end-to-

end method for binary2source matching, our method works on the input source functions and aims to help existing

works to promote their eectiveness under function inlining. Figure 2shows the key idea of O2NMatcher. Generally,

existing works directly compare binary functions and source functions to obtain their similarities. However, this “1-to-1”

matching mechanism misses the semantics of the source functions that are inlined into the binary functions. Our method

aims to generate SFSs as a supplement to the source functions. Comparing source functions with binary functions

solves the “1-to-1” matching tasks, while comparing SFSs with binary functions solves the “1-to-n” matching tasks.

Binary2source

Function Matching

OSS

Binaries Binary

Functions

Source

Functions

Source

Functions Sets

O2NMatcher

Traditional 1-to-1 matching process

Fig. 2. The key idea of O2NMatcher

To tackle

, we focus on generating Source Function Sets (SFSs) to complement the corpus for queries of binary func-

tions with inlining. For example, the source functions dtls1_get_record,dtls1_process_buered_records,dtls1_get_bitmap

and dtls1_record_replay_check together compose the SFS for the queried binary function dtls1_get_record shown in

Figure 1. To tackle

, a classier for Inlined Call Site (ICS) prediction is trained on the compilable OSS projects and is

used to predict ICSs in the uncompilable OSS projects. To support the training process, we propose a dataset labeling

method to automatically identify the inlined call sites in the binaries. To tackle

, the classier is trained towards

enormous compilation setting combinations. We model it as a Multi-Label Classication (MLC) problem and propose a

compiler-opt-based classier based on the correlations of inlining decisions under dierent compilation settings.

Manuscript submitted to ACM

4 Ang Jia et al.

Through our method, SFSs are generated for the given OSS without compilation and then these SFSs are compared

with binary functions using existing “1-to-1” methods. We conduct several experiments to evaluate the eectiveness of

our method. Results show our method can achieve a 6% improvement in recall when detecting the unseen OSS and

exceeds all state-of-the-art methods.

The major contributions are as follows:

•

To the best of our knowledge, we are the rst to investigate the solution for binary2source matching under

function inlining.

•

We conduct several studies to analyze the inlining correlations between dierent compilation settings, which

help us model the prediction of inlined call sites as a multi-label classication problem and propose a compiler-

opt-based classier named ECOCCJ48.

•Based on ECOCCJ48, we propose a method named O2NMatcher that uses “1-to-n” matching to tackle function

inlining. Our generation method can automatically generate SFSs for binary functions with inlining.

•

We evaluate O2NMatcher through several experiments and results show our method can achieve a 6% improve-

ment in detecting inlined functions for existing binary2source matching works.

To facilitate further research, we have made the source code and dataset publicly available [15].

2 OVERVIEW

In this section, we will introduce the workow of O2NMatcher.

Source Project Source Function FeaturesNCS ICS ClassifierBinaries Function Call Binary Function

Compilable

OSS

Classifier Training

Call Site

Feature

Extraction

Model

Training Classifier

Dataset Labeling

Function

Mapping

Call Site

Mapping

Dataset Generation

Compiling in

Debug Mode

FCG

Construction

Model Training

OSS

ICS Prediction

Call Site

Feature

Extraction

Model

Prediction

FCG Construction

Call Site

Extraction

FCG

Construction

SFS Generation

Root Node

Selection

Edge

Extension SFS

SFS Generation

Fig. 3. Workflow of O2NMatcher

Figure 3shows the overall workow of O2NMatcher. The workow of O2NMatcher is divided into two parts: Model

Training and SFS Generation.

The target of model training is to generate a classier to predict ICSs in uncompilable OSS projects. The model is

trained on the compilable OSS on the source level and can be applied to uncompilable OSS projects without compilation.

The model training process is divided into three steps: Dataset Generation, Dataset Labeling, and Classier Training.

Firstly, a dataset is generated by compiling OSS projects to binaries enabled with function inlining. Then, the inlined

Manuscript submitted to ACM

Comparing One with Many — Solving Binary2source Function Matching Under Function Inlining 5

call sites in the binaries are labeled to serve as the training set for the classier. Finally, a classier is designed and

trained on the labeled dataset, which produces the classier used in the SFS generation process.

The SFS generation is based on the trained model and leverages it to generate SFSs for uncompilable OSS projects.

The SFS generation process is divided into three steps: FCG (Function Call Graph) Construction, ICS Prediction, and

SFS Generation. Firstly, the FCG Construction process identies call sites in the OSS and constructs FCGs for projects.

Then, the ICS Prediction process extracts features for all call sites and uses the trained model to predict which call sites

will be inlined during compilation. Finally, the SFS Generation process generates the SFSs based on the inlined call sites.

The following two sections will illustrate the full progress of our O2NMatcher. In detail, Section 3presents the

process of Model Training and Section 4presents the process of SFS Generation.

3 MODEL TRAINING PROCESS OF O2NMATCHER

3.1 Dataset Generation and Labeling

Dataset generation and labeling are the bases for training a model. Generally, we propose a method, which leverages the

debug information generated by Dwarf [

] during the compilation, to label the dataset. It will generate the binary2source

line-level and function-level mappings together with the labels (inlined or not) of the call sites.

In detail, when compiling the source projects with the “-g” option, a .debug_line section, which contains the mapping

from the binary address to the source le line (line-level mapping), will be produced. Similar to the idea in [

], we

extend the line-level mapping to the function-level mapping by using existing tools, such as Understand [

] and IDA

Pro [7] or tree-sitter [9] and Ghidra [5].

However, though function-level mapping directly identies inlined source functions (function mapping that a binary

function is mapped to multiple source functions will be identied as an inlining case), it is not suitable for training a

classier that directly predicts which functions will be inlined. That is because the compilers decide whether to inlining

on the features of each call site. Callees that are inlined into one of their callers may not be inlined into another. Thus,

for the accuracy of ICS prediction, our labeling process further identies which call site is inlined.

B C

(a) Easily inferred call sites

B C

(b) Confused call sites

Fig. 4. Inferring inlined call sites from function-level mapping

Labels of some call sites sometimes can be inferred directly by their function mapping. Figure 4shows two examples

shown in FCG. In the source and binary FCGs, the circle represents the source function while the square represents the

binary function. A red square and several red circles represent the binary function and its mapped source function. In

Figure 4(a), binary function A is mapped to source functions A and C. As A only has one call to C, thus the call site

from A to C is inlined and other call sites are not. However, Figure 4(b) presents an example that may have multiple

inlining decisions. For example, as A calls B twice, it is hard to infer which call site is inlined. Moreover, A calls C and B

calls C, it is also dicult to decide which call site inlines the function C.

Manuscript submitted to ACM

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

ComparingOnewithMany—SolvingBinary2sourceFunctionMatchingUnderFunctionInliningANGJIA,MINGFAN,XIXU,WUXIAJIN,andHAIJUNWANG,Xi’anJiaotongUniversity,ChinaQIYITANG,SENNIE,andSHIWU,TencentSecurityKeenLab,ChinaTINGLIU,Xi’anJiaotongUniversity,ChinaBinary2sourcefunctionmatchingisafundamentaltaskformanysecuri...

展开>> 收起<<

Comparing One with Many Solving Binary2source Function Matching Under Function Inlining.pdf

共22页,预览5页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Comparing One with Many Solving Binary2source Function Matching Under Function Inlining

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: