NEUDEP: Neural Binary Memory Dependence Analysis
Kexin Pei (kpei@cs.columbia.edu), Columbia University, New York, USA
Dongdong She* (dongdong@cs.columbia.edu), Columbia University, New York, USA
Michael Wang* (mi27950@mit.edu), MIT, Cambridge, USA
Scott Geng* (scott.geng@columbia.edu), Columbia University, New York, USA
Zhou Xuan (xuan1@purdue.edu), Purdue University, West Lafayette, USA
Yaniv David (yaniv.david@columbia.edu), Columbia University, New York, USA
Junfeng Yang (junfeng@cs.columbia.edu), Columbia University, New York, USA
Suman Jana (suman@cs.columbia.edu), Columbia University, New York, USA
Baishakhi Ray (rayb@cs.columbia.edu), Columbia University, New York, USA
ABSTRACT
Determining whether multiple instructions can access the same mem-
ory location is a critical task in binary analysis. It is challenging as
statically computing precise alias information is undecidable in the-
ory. The problem is aggravated at the binary level due to the presence
of compiler optimizations and the absence of symbols and types. Ex-
isting approaches either produce significant spurious dependencies
due to conservative analysis or scale poorly to complex binaries.
We present a new machine-learning-based approach to predict
memory dependencies by exploiting the model’s learned knowledge
about how binary programs execute. Our approach features (i) a
self-supervised procedure that pretrains a neural net to reason over
binary code and its dynamic value flows through memory addresses,
followed by (ii) supervised finetuning to infer the memory dependen-
cies statically. To facilitate efficient learning, we develop dedicated
neural architectures to encode the heterogeneous inputs (i.e., code,
data values, and memory addresses from traces) with specific mod-
ules and fuse them with a composition learning strategy.
We implement our approach in NEUDEP and evaluate it on 41
popular software projects compiled by 2 compilers, 4 optimizations,
and 4 obfuscation passes. We demonstrate that NEUDEP is more
precise (1.5×) and faster (3.5×) than the current state-of-the-art. Ex-
tensive probing studies on security-critical reverse engineering tasks
suggest that NEUDEP understands memory access patterns, learns
function signatures, and is able to match indirect calls. All these
tasks either assist or benefit from inferring memory dependencies.
Notably, NEUDEP also outperforms the current state-of-the-art on
these tasks.
*These authors contributed equally to this work.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
ESEC/FSE ’22, November 14–18, 2022, Singapore, Singapore
© 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-9413-0/22/11. . . $15.00
https://doi.org/10.1145/3540250.3549147
CCS CONCEPTS
• Security and privacy → Software reverse engineering; • Computing methodologies → Machine learning.
KEYWORDS
Memory Dependence Analysis, Reverse Engineering, Large Lan-
guage Models, Machine Learning for Program Analysis
ACM Reference Format:
Kexin Pei, Dongdong She, Michael Wang, Scott Geng, Zhou Xuan, Yaniv
David, Junfeng Yang, Suman Jana, and Baishakhi Ray. 2022. NEUDEP:
Neural Binary Memory Dependence Analysis. In Proceedings of the 30th
ACM Joint European Software Engineering Conference and Symposium
on the Foundations of Software Engineering (ESEC/FSE ’22), November
14–18, 2022, Singapore, Singapore. ACM, New York, NY, USA, 13 pages.
https://doi.org/10.1145/3540250.3549147
1 INTRODUCTION
Binary memory dependence analysis, which determines whether
two machine instructions in an executable can access the same mem-
ory location, is critical for many security-sensitive tasks, including
detecting vulnerabilities [18, 37, 86], analyzing malware [39, 95], hardening binaries [4, 29, 45, 92], and forensics [19, 35, 58, 93]. The
key challenge behind memory dependence analysis is that machine
instructions often leverage indirect addressing or indirect control-
flow transfer (i.e., involving dynamically computed targets) to access
the memory. Furthermore, most commercial software is stripped of
source-level information such as variables, arguments, types, data
structures, etc. Without this information, the problem of memory
dependence analysis becomes even harder, forcing the analysis to
reason about values flowing through generic registers and memory
addresses. Consider the following code snippet where we show two
instructions (within the same function) at different program locations.
The function is executed twice, resulting in two different traces.
Address  Instruction      Trace 1            Trace 2
......
0x06:    mov [rax],rbx    rax=0x3;rbx=0x1    rax=0x5;rbx=0x1
......
0x1f:    mov rdi,[0x3]    rdi=0x1            rdi=0x0
......

In Intel x86 syntax [77], mov [rax],rbx writes register rbx to the memory pointed to by register rax; [] denotes dereferencing a memory address.
The two instructions are memory-dependent (read-after-write) when rax = 0x3 (Trace 1). Detecting this dependency statically requires a precise value flow analysis to determine which values can flow to rax from different program contexts.
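To see what the analysis must establish, consider a minimal dynamic check (our illustration, not NEUDEP's algorithm): given concrete traces, the store and the load are dependent in a run exactly when their effective addresses coincide.

# A minimal sketch (our illustration, not NeuDep's code): deciding
# read-after-write dependence from concrete traces by address overlap.
def may_depend(traces):
    """traces: list of (write_addr, read_addr) pairs, one per run."""
    return any(w == r for w, r in traces)

print(may_depend([(0x3, 0x3)]))  # True:  Trace 1, rax=0x3 hits the load at [0x3]
print(may_depend([(0x5, 0x3)]))  # False: Trace 2 alone shows no overlap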
Over the last two decades, researchers have made numerous at-
tempts to improve the accuracy and performance of binary memory
dependence analysis [5, 6, 11, 16, 22, 34, 71]. The most common
approach often involves statically computing and propagating an
over-approximated set of values that each register and memory ad-
dress can contain at each program point using abstract interpretation.
For example, a seminal paper by Balakrishnan and Reps on value set
analysis (VSA) [5] adopts strided intervals as the abstract domain
and propagates the interval bounds for the operands (e.g., registers
and memory locations) along each instruction. VSA reports two instructions as dependent if their intervals intersect. Unfortunately,
these static approaches have been shown to be highly imprecise
in practice [98]. Composing abstract domains along multiple in-
structions and merging them across a large number of paths quickly
accumulate prohibitive amounts of over-approximation error. As
a result, the computed set of accessed memory addresses by such
approaches often ends up covering almost the entire memory space,
leading to a large number of false positives (i.e., instructions with
no dependencies are incorrectly detected as dependent).
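To make the strided-interval domain concrete, a strided interval s[lo, hi] denotes the set {lo, lo+s, lo+2s, ..., hi}. The sketch below (our illustration, not code from [5] or NEUDEP) performs a conservative may-intersect test: the concrete ranges must overlap, and the two congruence constraints must be simultaneously solvable, which holds iff gcd(s1, s2) divides lo2 - lo1.

from math import gcd

# Conservative "may intersect" check for two strided intervals
# (our illustration of the abstract domain, not VSA's actual code).
def may_intersect(lo1, hi1, s1, lo2, hi2, s2):
    # No intersection if the concrete ranges are disjoint.
    if max(lo1, lo2) > min(hi1, hi2):
        return False
    # x ≡ lo1 (mod s1) and x ≡ lo2 (mod s2) are simultaneously solvable
    # iff gcd(s1, s2) divides lo2 - lo1; we do not check whether the
    # solution lies inside the overlap, so this over-approximates (sound).
    g = gcd(s1, s2)
    return g == 0 or (lo2 - lo1) % g == 0

# Accesses at 8-byte stride from 0x10 vs. 4-byte stride from 0x12:
print(may_intersect(0x10, 0x30, 8, 0x12, 0x32, 4))  # False: offsets never align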
With the advent of data-driven approaches to program analy-
ses [3, 69, 91], state-of-the-art memory dependence analysis is increasingly using statistical or machine learning (ML) based methods to improve the analysis precision [35, 58, 88, 98], but they still suffer
from serious limitations. For example, DeepVSA [35] trains a neural
network on static code to classify the memory locations accessed
by each instruction into a more coarse-grained abstract domain such
as stack, heap, and global, and uses the predicted memory region
to instantiate the value set in VSA. However, such coarse-grained
prediction results in high false positives as any two instructions
accessing the same region (e.g., stack) will always be detected as
dependent even when the instructions access two completely differ-
ent addresses. To avoid the precision loss of purely static approaches, BDA [98] uses a dynamic approach that leverages probabilistic
analysis to sample program paths and performs per-path abstract
interpretation. However, as real-world programs often have many
paths, the cost of performing per-path abstract interpretation for even
a small subset of paths adds prohibitive runtime overhead, e.g.,
taking more than 12 hours to finish analyzing a single program. It
is perhaps not surprising: while a dynamic approach can be more accurate than static approaches, it can incur extremely high runtime overhead, especially while trying to achieve good code coverage.
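Returning to the granularity problem of region-level predictors like DeepVSA [35], the trivial check below (our illustration, not DeepVSA's code) shows why any two same-region accesses must be flagged as dependent.

# Region-granularity dependence check (our illustration of the coarse
# scheme, not DeepVSA's code): same region implies "dependent".
def region_dependent(region_a, region_b):
    return region_a == region_b

# [rsp+0x8] and [rsp+0x10] touch different addresses, yet both map to
# "stack", so the region-level check reports a (spurious) dependency.
print(region_dependent("stack", "stack"))  # True -> false positive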
To achieve higher accuracy in a reasonably faster time, we pro-
pose an ML-based hybrid approach. Our key strategy is to learn to
reason about approximate memory dependencies from the execution
behavior of generic binary code during training. We then apply the
learned knowledge to static code during inference without any extra runtime overhead (see Figure 1).

Figure 1: The workflow of our approach. We first pretrain the model to predict code based on its traces and predict traces based on its code. We then finetune the model to statically infer memory dependencies.

Such a hybrid approach, i.e., learning from both code and traces, has shown promise in several Software Engineering applications, including clone detection, type inference, and program fixing and synthesis [60, 63, 64, 90, 91].
However, none of these works can reason about fine-grained value flows
through different memory addresses as they do not explicitly model
memory. To bridge this gap, we aim to model the memory addresses
in the ML-based hybrid framework and try to make fine-grained
predictions differentiating the memory contents of different data
pointers. Modeling memory addresses is, however, challenging as it
requires the model to (i) distinguish between different memory ad-
dresses, (ii) learn to reason about indirect address references and
memory contents, and (iii) learn the compositional effects of multi-
ple instructions that involve memory operations.
To this end, we propose a new learning framework comprising
pretraining and finetuning steps inspired by masked language model
(MLM) [25], as shown in Figure 1. Unlike traditional MLM, where
the input is restricted to a single input modality (e.g., text), our
model learns from multi-modal information: instructions (static
code), traces (dynamic values), and memory addresses (code spatial
layout). We deploy a novel fusion module to simultaneously capture
the interactions of these modalities for predicting memory depen-
dencies. During pretraining, we mask random tokens from these
modalities. While predicting masked opcode teaches the model to
synthesize the instruction, predicting masked values in traces and
memory addresses forces the model to learn to interpret instructions
and their effect on registers and memory contents. For instance, if we
mask the value of rax in mov [rax],rbx in the above example and train the model to predict it, the model is forced to interpret the previous instructions in the context and reason about how they compute their trace values that flow into rax. We hypothesize that
such pretraining helps the model gain a general understanding of the
value flow behavior involving memory operations.
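As a concrete illustration of the multi-modal masking objective, the sketch below is our simplification (the mask sentinel, mask rate, and token layout are assumptions, not NEUDEP's exact implementation): it masks random positions independently in the aligned code, value, and address sequences, and records what the model must recover.

import random

MASK = "<MASK>"  # hypothetical mask sentinel

# Aligned modalities for one instruction window (toy example): each
# position holds a code token, its trace value, and its address.
code   = ["mov", "[rax]", "rbx"]
values = ["--",  "0x3",   "0x1"]
addrs  = ["0x06", "0x06", "0x06"]

def mask_modalities(seqs, rate=0.15, rng=random):
    """Independently mask random tokens in each modality; return the
    masked sequences plus (modality, position, original) labels."""
    masked, labels = [], []
    for m, seq in enumerate(seqs):
        out = list(seq)
        for i, tok in enumerate(seq):
            if rng.random() < rate:
                out[i] = MASK
                labels.append((m, i, tok))  # model must recover tok
        masked.append(out)
    return masked, labels

masked, labels = mask_modalities([code, values, addrs], rate=0.3)
# e.g., masking values[1] ("0x3") forces the model to infer what value
# flowed into rax from the surrounding instructions and trace entries.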
After pretraining, the model is finetuned to statically (without
the trace values) reason about the value flows (based on its learned
knowledge from pretraining) across memory along multiple paths
and predict the memory-dependent instruction pairs. Both pretrain-
ing and finetuning steps are automated and data-driven without
manually defining any propagation rules for value flows. As a re-
sult, we show that our model is faster and more precise than the
state-of-the-art systems (§5).
We implement our approach in NEUDEP by carefully designing
a new neural architecture specialized for fine-grained modeling of
pointers to distinguish between unique memory addresses (Chal-
lenge i). We develop a novel fusion module to facilitate efficient
training on the multi-modal bi-directional masking task, which helps
the model to understand memory address content and thus indirect
memory references (Challenge ii). Finally, to teach the composi-
tional effects of instructions on memory values (Challenge iii), we
leverage the principle of curriculum learning [8], i.e., expose short
training examples in the initial learning phase, and gradually increase
the sample difficulties as the training progresses.
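A minimal sketch of such a schedule (our illustration; the length bounds and linear pacing are assumptions, not NEUDEP's exact settings) orders samples by instruction count and widens the admitted length as epochs progress.

# A minimal curriculum-learning schedule (assumed thresholds/pacing).
def curriculum(samples, epochs, min_len=4, max_len=64):
    """samples: list of instruction sequences; yield per-epoch subsets
    whose maximum sequence length grows linearly with the epoch."""
    for epoch in range(epochs):
        # Admit longer (harder) samples as training progresses.
        limit = min_len + (max_len - min_len) * epoch // max(epochs - 1, 1)
        yield epoch, [s for s in samples if len(s) <= limit]

samples = [["mov"] * n for n in (2, 8, 16, 48)]
for epoch, batch in curriculum(samples, epochs=4):
    print(epoch, [len(s) for s in batch])
# epoch 0 admits only the shortest sequence; the last epoch admits all.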
We evaluate NEUDEP on a wide range of popular software
projects compiled with diverse optimizations and obfuscation passes.
We demonstrate that NEUDEP is significantly more precise than
state-of-the-art binary dependence analysis approaches, widely-used
reverse engineering frameworks, and even a source-level pointer
analysis tool that has access to much richer program properties. We
also show that NEUDEP generalizes to unseen binaries, optimiza-
tions, and obfuscations, and is drastically faster than existing ap-
proaches. We perform extensive ablation studies to justify our design
choices over other alternatives studied in previous works [64, 66].
Moreover, NEUDEP is surprisingly accurate at many additional
security-critical reverse engineering tasks, which either support or
benefit from inferring memory dependencies, such as predicting
memory-access regions, function signatures, and indirect procedure
calls. NEUDEP also outperforms the state of the art on all these
tasks.
We make the following contributions:
(1) We propose a new neural architecture that can jointly learn memory value flows from code and the corresponding traces for predicting binary memory dependencies.
(2) We implement our approach in NEUDEP, which contains a dedicated fusion module for learning encodings of memory addresses/trace values, and a composition learning strategy.
(3) Our experimental results demonstrate that NEUDEP is 3.5× faster and 1.5× more accurate than the state-of-the-art.
(4) Our extensive ablation studies and analysis on downstream tasks suggest that our pretraining substantially improves the prediction performance and helps the model learn value flow through different instructions.
2 OVERVIEW
2.1 Motivating Example
Figure 2 shows that two instructions 𝐼2: mov rdi,[rax+0x8] and 𝐼7: mov [rbp+rbx],rdi access the same memory location (and are thus memory-dependent) but via different addressing registers. To detect the dependency, the model first needs to understand the behavior of mov: both line 1 (𝐼1) and line 4 (𝐼4) set rax and rbx to the same value. It then needs to understand that xor in line 5 sets rbp to 0, and add in line 6 performs addition and sets rbp to 0x8. Finally, the model needs to compose these facts and conclude that rax+0x8 is semantically equivalent to rbp+rbx in such a context, i.e., they both evaluate to 0x123c.
Figure 2: Motivating example of predicting memory dependencies in the function ngx_init_setproctitle from nginx-1.21.1. 𝐼2 and 𝐼7 access the same heap variable; 𝐼3 and 𝐼8 access two different global variables. We find that existing ML-based approaches (learned on static code only) fail to detect that 𝐼2 and 𝐼7 are dependent, predicting that they access different memory regions, and cannot distinguish the memory accessed by 𝐼3 and 𝐼8 due to the prediction granularity.

Instructions:
I1 mov rax,0x1234
I2 mov rdi,[rax+0x8]
I3 mov rdx,[rip+0xbf]
I4 mov rbx,0x1234
I5 xor rbp,rbp
I6 add rbp,0x8
I7 mov [rbp+rbx],rdi
I8 mov [rip+0x81],rax

Execution trace:
I1 rax=0x1234
I2 rdi=[0x123c]
I3 rdx=[rip+0xbf]
I4 rbx=0x1234
I5 rbp=0x0
I6 rbp=0x8
I7 [0x123c]=rdi
I8 [rip+0x81]=rax
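This composition can be checked mechanically. The tiny interpreter below (our illustration only, covering just the register/immediate mov, xor, and add forms used in Figure 2) confirms that both effective addresses evaluate to 0x123c.

# A tiny interpreter for the instruction forms in Figure 2 (illustration
# only): it confirms rax+0x8 and rbp+rbx evaluate to the same address.
regs = {}

def run(program):
    for op, dst, src in program:
        if op == "mov":
            regs[dst] = src if isinstance(src, int) else regs[src]
        elif op == "xor" and dst == src:  # xor r,r zeroes the register
            regs[dst] = 0
        elif op == "add":
            regs[dst] += src if isinstance(src, int) else regs[src]

run([
    ("mov", "rax", 0x1234),   # I1
    ("mov", "rbx", 0x1234),   # I4
    ("xor", "rbp", "rbp"),    # I5
    ("add", "rbp", 0x8),      # I6
])
print(hex(regs["rax"] + 0x8))          # 0x123c (address read by I2)
print(hex(regs["rbp"] + regs["rbx"]))  # 0x123c (address written by I7)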
Gap in Existing Solutions. We find that when running the ML model trained only on static code for this task [35], it mispredicts that 𝐼2 and 𝐼7 are not dependent as their memory-access regions (𝐼2 accesses heap while 𝐼7 is mispredicted to access stack) do not intersect, possibly because its inference depends on the spurious pattern that the stack base pointer rbp is used at line 7. Such mispredictions [35] might lead to a false negative by flagging two instructions as accessing non-overlapping memory regions.
Proposed Solution.
The above observation underscores the impor-
tance of encoding the knowledge about each instruction’s contri-
bution to value flows through memory and their compositions as
part of the ML model. However, integrating a memory model as
part of the encoded knowledge is challenging due to the presence of
potentially complex flows involving indirect address references and
their compositions. We address these challenges by designing
(1) A novel training objective to distinguish between unique memory addresses (§2.2).
(2) A dedicated fusion module specialized to capture the interaction between instructions, traces, and memory addresses (§2.3.2); a minimal sketch follows this list. Our new tracing and sampling strategies (§2.3.1) help the ML model to learn value flows across memory addresses.
(3) Curriculum learning [8] in the training process to incrementally learn the compositional effects (§2.3.3).
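As a rough sketch of what such a fusion step can look like, the PyTorch snippet below is our own illustration under assumed vocabulary sizes and dimensions; NEUDEP's actual architecture is described in §3. Each modality gets its own embedding, and the per-position embeddings are summed before a shared transformer encoder.

import torch
import torch.nn as nn

# A rough sketch of modality fusion (our illustration; vocabulary sizes,
# dimensions, and summation-based fusion are assumptions, not NeuDep's
# exact architecture).
class FusionEncoder(nn.Module):
    def __init__(self, vocab=4096, dim=128):
        super().__init__()
        self.code_emb = nn.Embedding(vocab, dim)   # instruction tokens
        self.value_emb = nn.Embedding(vocab, dim)  # trace-value tokens
        self.addr_emb = nn.Embedding(vocab, dim)   # memory-address tokens
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, code, value, addr):
        # Fuse the three modalities position-wise, then contextualize.
        x = self.code_emb(code) + self.value_emb(value) + self.addr_emb(addr)
        return self.encoder(x)

model = FusionEncoder()
code = torch.randint(0, 4096, (1, 16))   # one window of 16 positions
out = model(code, code.clone(), code.clone())
print(out.shape)  # torch.Size([1, 16, 128])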
Table 1 shows some examples of how the pretraining task works
and how it teaches the model to reason about value flows.
2.2 Problem Formulation
Let 𝑓 denote an ML model parameterized by 𝜃. Before directly training 𝑓 towards predicting memory dependencies, we pretrain 𝑓 to reason about the value flows (see §3.4 for details). Now consider