Comparing One with Many Solving Binary2source Function Matching Under Function Inlining

2025-04-29 0 0 1.19MB 22 页 10玖币
侵权投诉
Comparing One with Many — Solving Binary2source Function Matching Under
Function Inlining
ANG JIA, MING FAN, XI XU, WUXIA JIN, and HAIJUN WANG, Xi’an Jiaotong University, China
QIYI TANG, SEN NIE, and SHI WU, Tencent Security Keen Lab, China
TING LIU, Xi’an Jiaotong University, China
Binary2source function matching is a fundamental task for many security applications, including Software Component Analysis
(SCA).
“1-to-1”
mechanism has been applied in existing binary2source matching works, in which one binary function is matched
against one source function. However, we discovered that such mapping could be
“1-to-n”
(one query binary function maps multiple
source functions), due to the existence of function inlining.
To help conduct binary2source function matching under function inlining, we propose a method named
O2NMatcher
to generate
Source Function Sets (SFSs) as the matching target for binary functions with inlining. We rst propose a model named
ECOCCJ48
for
inlined call site prediction. To train this model, we leverage the compilable OSS to generate a dataset with labeled call sites (inlined
or not), extract several features from the call sites, and design a compiler-opt-based multi-label classier by inspecting the inlining
correlations between dierent compilations. Then, we use this model to predict the labels of call sites in the uncompilable OSS projects
without compilation and obtain the labeled function call graphs of these projects. Next, we regard the construction of SFSs as a sub-tree
generation problem and design root node selection and edge extension rules to construct SFSs automatically. Finally, these SFSs will
be added to the corpus of source functions and compared with binary functions with inlining. We conduct several experiments to
evaluate the eectiveness of O2NMatcher and results show our method increases the performance of existing works by 6% and exceeds
all the state-of-the-art works.1
CCS Concepts:
Software and its engineering Search-based software engineering
;
Maintaining software
;
Security and
privacy Software and application security.
Additional Key Words and Phrases: Binary2source Matching, Function Inlining, “1-to-n”, Source Function Sets
1 INTRODUCTION
Most software today is not developed entirely from scratch. Instead, developers rely on a range of open-source
components to create their applications[
38
]. According to a report published by Gartner [
29
], over 90% of the development
organizations stated that they rely on open-source components. Although using open-source components helps to
nish projects quicker and reduce costs, dependence on risky open-source components brings software supply chain
security risks[
31
]. For example, due to code reuse, a single vulnerability (e.g., the Heartbleed [
1
] in OpenSSL [
13
]) may
spread across thousands of software, causing 17% (around half a million) of the Internet’s secure web servers vulnerable.
To avoid vulnerable dependence, Software Component Analysis (SCA) [
20
] is proposed to discover software’s
dependence on Open Source Software (OSS) projects. Usually, the SCA service provider maintains a large OSS codebase.
When commercial software companies send their released binary executables, the SCA service provider compares these
binaries with the OSS projects and returns a report of the OSS components that the queried executables contain.
1New Paper
Authors’ addresses: Ang Jia, jiaang@stu.xjtu.edu.cn; Ming Fan, mingfan@mail.xjtu.edu.cn; Xi Xu, xx19960325@stu.xjtu.edu.cn; Wuxia Jin, jinwuxia@
mail.xjtu.edu.cn; Haijun Wang, hjwang.china@gmail.com, Xi’an Jiaotong University, Shaanxi, Xi’an, China, 710049; Qiyi Tang, dodgetang@tencent.com;
Sen Nie, snie@tencent.com; Shi Wu, shiwu@tencent.com, Tencent Security Keen Lab, Shanghai, China, 710049; Ting Liu, tingliu@mail.xjtu.edu.cn, Xi’an
Jiaotong University, Xi’an, China.
Manuscript submitted to ACM 1
arXiv:2210.15159v1 [cs.SE] 27 Oct 2022
2 Ang Jia et al.
The comparison procedure in the SCA is accomplished by binary2source function matching. When a query binary
is provided, this query binary will be rst dissembled into binary functions and then these binary functions will be
compared with the source functions of the OSS. Existing binary2source matching works, such as B2SMatcher [
16
] and
CodeCMR [
51
], conduct an
“1-to-1”
matching between the query binary function and the source function. However,
we recover that such mapping could be
“1-to-n”
(one query binary function maps multiple source functions), due to
the existence of function inlining.
Figure 1shows an “1-to-n” matching case when comparing the binary function dtls1_get_record with the source
function dtls1_get_record. Source function dtls1_get_record is from OpenSSL in version 1.0.1j which is associated with
the vulnerability CVE-2014-3571[
10
] and binary function dtls1_get_record is compiled from it using gcc-8.2.0 with O2.
But when we use CodeCMR to search the corresponding source function for binary function dtls1_get_record, we nd
that the similarity between the binary function dtls1_get_record and the source function dtls1_get_record is less than
60%, leading a failure in fetching the corresponding source function in the top 10 returns. This mismatch causes existing
works to fail to detect this vulnerability and leaves this binary function risky.
(a) Binary function dtls1_get_record
File: ssl/d1_pkt.c
Line 551: int dtls1_get_record(SSL *s)
{
Line 565: dtls1_process_buffered_records(s);
Line 660: bitmap = dtls1_get_bitmap(s, rr, &is_next_epoch);
Line 679: if (!(s->d1->listen && rr->type == SSL3_RT_HANDSHAKE &&
*p == SSL3_MT_CLIENT_HELLO) &&
!dtls1_record_replay_check(s, bitmap))
}
File: ssl/d1_pkt.c
Line 299: static int
dtls1_process_buffered_records(SSL *s)
File: ssl/d1_pkt.c
Line 1792: static DTLS1_BITMAP *
dtls1_get_bitmap(SSL *s, SSL3_RECORD
*rr, unsigned int *is_next_epoch)
File: ssl/d1_pkt.c
Line 1680: static int
dtls1_record_replay_check(SSL *s,
DTLS1_BITMAP *bitmap)
(b) Source functions that binary function dtls1_get_record maps
Fig. 1. “1-to-n” matching cases brought by function inlining
This is the issue that
function inlining
has brought. When we look deeper into the compilation, we nd that source
function dtls1_get_record inlines the contents of several other source functions, including dtls1_process_buered_records,
dtls1_get_bitmap and dtls1_record_replay_check, into its body. Function inlining, which replaces a function call with the
embedded function body, greatly changes the content of the function, thus aecting the progress of function matching.
As a result, source function dtls1_get_record and binary function dtls1_get_record become less similar, and existing
“1-to-1” matching mechanism suers when conducting binary2source function matching under function inlining.
Manuscript submitted to ACM
Comparing One with Many — Solving Binary2source Function Matching Under Function Inlining 3
As function inlining widely exists in binaries [
34
], It is important to resolve the issues function inlining brings.
However, currently, there are still three challenges for conducting binary2source matching under function inlining.
C1: “Out-of-domain” (OOD) issue in OSS projects.
The binary functions with inlining are generated by more
than one source function. However, there does not exist a source function in the OSS projects that has the fully same
semantics as the queried binary function. As a result, the query of binary function with inlining is experiencing an
OOD issue, where an exact match of the queried binary function does not exist in the OSS corpus.
C2: Opaque inlining in stripped binaries.
The queried binaries are often stripped which makes it harder to infer
which binary functions are generated with inlining and which source functions are inlined into the binary function.
The opaque inlining makes it hard to decide when and how to apply a strategy to tackle function inlining.
C3: Various inlining decisions in dierent binaries.
The queried binaries are generated through enormous
compilation settings and thus result in various inlining decisions in the release binaries. It is dicult to cover all the
inlining cases without a comprehensive study of the correlation between inlining under various compilation settings.
To tackle the aforementioned three challenges, we propose a matching method named
O2NMatcher
that uses the
“1-to-n” matching
mechanism to match binary functions with source functions. Instead of developing an end-to-
end method for binary2source matching, our method works on the input source functions and aims to help existing
works to promote their eectiveness under function inlining. Figure 2shows the key idea of O2NMatcher. Generally,
existing works directly compare binary functions and source functions to obtain their similarities. However, this “1-to-1”
matching mechanism misses the semantics of the source functions that are inlined into the binary functions. Our method
aims to generate SFSs as a supplement to the source functions. Comparing source functions with binary functions
solves the “1-to-1” matching tasks, while comparing SFSs with binary functions solves the “1-to-n” matching tasks.
Binary2source
Function Matching
OSS
Binaries Binary
Functions
Source
Functions
Source
Functions Sets
O2NMatcher
Traditional 1-to-1 matching process
Fig. 2. The key idea of O2NMatcher
To tackle
C1
, we focus on generating Source Function Sets (SFSs) to complement the corpus for queries of binary func-
tions with inlining. For example, the source functions dtls1_get_record,dtls1_process_buered_records,dtls1_get_bitmap
and dtls1_record_replay_check together compose the SFS for the queried binary function dtls1_get_record shown in
Figure 1. To tackle
C2
, a classier for Inlined Call Site (ICS) prediction is trained on the compilable OSS projects and is
used to predict ICSs in the uncompilable OSS projects. To support the training process, we propose a dataset labeling
method to automatically identify the inlined call sites in the binaries. To tackle
C3
, the classier is trained towards
enormous compilation setting combinations. We model it as a Multi-Label Classication (MLC) problem and propose a
compiler-opt-based classier based on the correlations of inlining decisions under dierent compilation settings.
Manuscript submitted to ACM
4 Ang Jia et al.
Through our method, SFSs are generated for the given OSS without compilation and then these SFSs are compared
with binary functions using existing “1-to-1” methods. We conduct several experiments to evaluate the eectiveness of
our method. Results show our method can achieve a 6% improvement in recall when detecting the unseen OSS and
exceeds all state-of-the-art methods.
The major contributions are as follows:
To the best of our knowledge, we are the rst to investigate the solution for binary2source matching under
function inlining.
We conduct several studies to analyze the inlining correlations between dierent compilation settings, which
help us model the prediction of inlined call sites as a multi-label classication problem and propose a compiler-
opt-based classier named ECOCCJ48.
Based on ECOCCJ48, we propose a method named O2NMatcher that uses “1-to-n” matching to tackle function
inlining. Our generation method can automatically generate SFSs for binary functions with inlining.
We evaluate O2NMatcher through several experiments and results show our method can achieve a 6% improve-
ment in detecting inlined functions for existing binary2source matching works.
To facilitate further research, we have made the source code and dataset publicly available [15].
2 OVERVIEW
In this section, we will introduce the workow of O2NMatcher.
Source Project Source Function FeaturesNCS ICS ClassifierBinaries Function Call Binary Function
Compilable
OSS
Classifier Training
Call Site
Feature
Extraction
Model
Training Classifier
Dataset Labeling
Function
Mapping
Call Site
Mapping
Dataset Generation
Compiling in
Debug Mode
FCG
Construction
Model Training
OSS
ICS Prediction
Call Site
Feature
Extraction
Model
Prediction
FCG Construction
Call Site
Extraction
FCG
Construction
SFS Generation
Root Node
Selection
Edge
Extension SFS
SFS Generation
Fig. 3. Workflow of O2NMatcher
Figure 3shows the overall workow of O2NMatcher. The workow of O2NMatcher is divided into two parts: Model
Training and SFS Generation.
The target of model training is to generate a classier to predict ICSs in uncompilable OSS projects. The model is
trained on the compilable OSS on the source level and can be applied to uncompilable OSS projects without compilation.
The model training process is divided into three steps: Dataset Generation, Dataset Labeling, and Classier Training.
Firstly, a dataset is generated by compiling OSS projects to binaries enabled with function inlining. Then, the inlined
Manuscript submitted to ACM
Comparing One with Many — Solving Binary2source Function Matching Under Function Inlining 5
call sites in the binaries are labeled to serve as the training set for the classier. Finally, a classier is designed and
trained on the labeled dataset, which produces the classier used in the SFS generation process.
The SFS generation is based on the trained model and leverages it to generate SFSs for uncompilable OSS projects.
The SFS generation process is divided into three steps: FCG (Function Call Graph) Construction, ICS Prediction, and
SFS Generation. Firstly, the FCG Construction process identies call sites in the OSS and constructs FCGs for projects.
Then, the ICS Prediction process extracts features for all call sites and uses the trained model to predict which call sites
will be inlined during compilation. Finally, the SFS Generation process generates the SFSs based on the inlined call sites.
The following two sections will illustrate the full progress of our O2NMatcher. In detail, Section 3presents the
process of Model Training and Section 4presents the process of SFS Generation.
3 MODEL TRAINING PROCESS OF O2NMATCHER
3.1 Dataset Generation and Labeling
Dataset generation and labeling are the bases for training a model. Generally, we propose a method, which leverages the
debug information generated by Dwarf [
4
] during the compilation, to label the dataset. It will generate the binary2source
line-level and function-level mappings together with the labels (inlined or not) of the call sites.
In detail, when compiling the source projects with the “-g” option, a .debug_line section, which contains the mapping
from the binary address to the source le line (line-level mapping), will be produced. Similar to the idea in [
34
], we
extend the line-level mapping to the function-level mapping by using existing tools, such as Understand [
8
] and IDA
Pro [7] or tree-sitter [9] and Ghidra [5].
However, though function-level mapping directly identies inlined source functions (function mapping that a binary
function is mapped to multiple source functions will be identied as an inlining case), it is not suitable for training a
classier that directly predicts which functions will be inlined. That is because the compilers decide whether to inlining
on the features of each call site. Callees that are inlined into one of their callers may not be inlined into another. Thus,
for the accuracy of ICS prediction, our labeling process further identies which call site is inlined.
A
B
A
B C
(a) Easily inferred call sites
(b) Confused call sites
Fig. 4. Inferring inlined call sites from function-level mapping
Labels of some call sites sometimes can be inferred directly by their function mapping. Figure 4shows two examples
shown in FCG. In the source and binary FCGs, the circle represents the source function while the square represents the
binary function. A red square and several red circles represent the binary function and its mapped source function. In
Figure 4(a), binary function A is mapped to source functions A and C. As A only has one call to C, thus the call site
from A to C is inlined and other call sites are not. However, Figure 4(b) presents an example that may have multiple
inlining decisions. For example, as A calls B twice, it is hard to infer which call site is inlined. Moreover, A calls C and B
calls C, it is also dicult to decide which call site inlines the function C.
Manuscript submitted to ACM
摘要:

ComparingOnewithMany—SolvingBinary2sourceFunctionMatchingUnderFunctionInliningANGJIA,MINGFAN,XIXU,WUXIAJIN,andHAIJUNWANG,Xi’anJiaotongUniversity,ChinaQIYITANG,SENNIE,andSHIWU,TencentSecurityKeenLab,ChinaTINGLIU,Xi’anJiaotongUniversity,ChinaBinary2sourcefunctionmatchingisafundamentaltaskformanysecuri...

展开>> 收起<<
Comparing One with Many Solving Binary2source Function Matching Under Function Inlining.pdf

共22页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:22 页 大小:1.19MB 格式:PDF 时间:2025-04-29

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 22
客服
关注