
Address   Instruction       Trace 1              Trace 2
...
0x06:     mov [rax],rbx∗    rax=0x3; rbx=0x1     rax=0x5; rbx=0x1
...
0x1f:     mov rdi,[0x3]     rdi=0x1              rdi=0x0
...
∗ In Intel x86 syntax [77], mov [rax],rbx means writing register rbx to the memory pointed to by register rax; [] means dereferencing a memory address.
The two instructions are memory-dependent (read-after-write) when rax=0x3 (Trace 1). Determining this statically requires precise value flow analysis to establish which values can flow into rax from different program contexts.
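The contrast between the two traces suggests a simple dynamic criterion: a write and a later read are dependent in a given trace exactly when they touch the same concrete address. The minimal sketch below illustrates that check; the trace record format (pc/addr dictionaries) is an illustrative assumption, not the representation used in this paper.

```python
# Minimal sketch: confirming the read-after-write dependence in the example
# above from concrete traces. The trace record layout is an assumption made
# for illustration only.

def is_raw_dependent(write_rec, read_rec):
    """A write and a later read are memory-dependent (read-after-write)
    in a trace when they access the same concrete address."""
    return write_rec["addr"] == read_rec["addr"]

# Trace 1: rax=0x3, so `mov [rax],rbx` writes 0x3, which `mov rdi,[0x3]` reads.
trace1 = {"write": {"pc": 0x06, "addr": 0x3}, "read": {"pc": 0x1F, "addr": 0x3}}
# Trace 2: rax=0x5, so the write goes to 0x5 and the read of 0x3 is unrelated.
trace2 = {"write": {"pc": 0x06, "addr": 0x5}, "read": {"pc": 0x1F, "addr": 0x3}}

print(is_raw_dependent(trace1["write"], trace1["read"]))  # True
print(is_raw_dependent(trace2["write"], trace2["read"]))  # False
```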
Over the last two decades, researchers have made numerous attempts to improve the accuracy and performance of binary memory dependence analysis [5, 6, 11, 16, 22, 34, 71]. The most common approach statically computes and propagates an over-approximated set of values that each register and memory address can contain at each program point using abstract interpretation. For example, the seminal work by Balakrishnan and Reps on value set analysis (VSA) [5] adopts strided intervals as the abstract domain and propagates the interval bounds of the operands (e.g., registers and memory locations) along each instruction. VSA reports two instructions as dependent if their intervals intersect. Unfortunately, these static approaches have been shown to be highly imprecise in practice [98]. Composing abstract domains along multiple instructions and merging them across a large number of paths quickly accumulates prohibitive over-approximation error. As a result, the set of accessed memory addresses computed by such approaches often ends up covering almost the entire memory space, leading to a large number of false positives (i.e., instructions with no dependencies are incorrectly detected as dependent).
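To make the source of this over-approximation concrete, below is a small sketch of a VSA-style intersection test over strided intervals; the StridedInterval class and may_alias rule are simplified assumptions for illustration, not the implementation of [5].

```python
# Hedged sketch of a VSA-style dependence check over strided intervals.
# A strided interval stride[lo, hi] abstracts the set {lo, lo+stride, ..., hi}.
# This is a simplified illustration, not the abstract domain from [5].
from dataclasses import dataclass

@dataclass(frozen=True)
class StridedInterval:
    stride: int
    lo: int
    hi: int

    def concretize(self):
        if self.stride == 0:
            return {self.lo}
        return set(range(self.lo, self.hi + 1, self.stride))

def may_alias(a: StridedInterval, b: StridedInterval) -> bool:
    """VSA flags two accesses as dependent if their abstract address sets
    may intersect (checked here by concretizing; real analyses use a
    symbolic intersection test)."""
    return bool(a.concretize() & b.concretize())

# Precise singleton addresses {0x3} vs {0x3}: truly dependent.
print(may_alias(StridedInterval(0, 0x3, 0x3), StridedInterval(0, 0x3, 0x3)))  # True
# Once an operand is widened to "any 8-byte-aligned slot in a large range",
# it intersects almost any other access, producing spurious dependencies.
widened = StridedInterval(8, 0x0, 0xFFF8)
print(may_alias(widened, StridedInterval(0, 0x100, 0x100)))  # True (likely spurious)
```

Once an interval is widened to cover a large, coarsely strided range, it intersects almost every other access, which is exactly the false-positive pattern described above.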
With the advent of data-driven approaches to program analysis [3, 69, 91], state-of-the-art memory dependence analysis increasingly uses statistical or machine learning (ML) based methods to improve precision [35, 58, 88, 98], but these methods still suffer from serious limitations. For example, DeepVSA [35] trains a neural network on static code to classify the memory locations accessed by each instruction into a coarse-grained abstract domain (stack, heap, and global) and uses the predicted memory region to instantiate the value set in VSA. However, such coarse-grained prediction results in high false positive rates: any two instructions accessing the same region (e.g., the stack) are always detected as dependent, even when they access two completely different addresses. To avoid the precision loss of static approaches, BDA [98] uses a dynamic approach that leverages probabilistic analysis to sample program paths and performs per-path abstract interpretation. However, as real-world programs often have many paths, performing per-path abstract interpretation even for a small subset of paths adds prohibitive runtime overhead, e.g., more than 12 hours to analyze a single program. This is perhaps not surprising: while a dynamic approach can be more accurate than static approaches, it can incur extremely high runtime overhead, especially when trying to achieve good code coverage.
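The precision problem with region-level predictions can be seen in a one-line rule: if dependence is declared whenever two accesses fall in the same predicted region, any two stack accesses collide. The sketch below is an assumed simplification of that rule, not DeepVSA's actual algorithm.

```python
# Hedged sketch: why coarse-grained region prediction (stack/heap/global)
# over-approximates memory dependence. The rule below is an illustrative
# assumption, not DeepVSA's algorithm.

def region_level_dependent(region_a: str, region_b: str) -> bool:
    # Two accesses are conservatively flagged as dependent whenever they
    # are predicted to touch the same memory region.
    return region_a == region_b

# Both accesses are on the stack but at different offsets (e.g., rsp-0x8 vs
# rsp-0x40): there is no true dependence, yet the region-level rule reports one.
print(region_level_dependent("stack", "stack"))   # True  -> false positive
print(region_level_dependent("stack", "global"))  # False
```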
To achieve higher accuracy at a much lower runtime cost, we propose an ML-based hybrid approach. Our key strategy is to learn to reason about approximate memory dependencies from the execution behavior of generic binary code during training. We then apply the learned knowledge to static code during inference without any extra runtime overhead (see Figure 1).
Figure 1: The workflow of our approach. We first pretrain the model to predict code based on its traces and predict traces based on its code. We then finetune the model to statically infer memory dependencies.
Such a hybrid approach, i.e., learning from both code and traces, has shown promise in several Software Engineering applications, including clone detection, type inference, and program fixing and synthesis [60, 63, 64, 90, 91].
However, none of these works can reason about fine-grained value flows through different memory addresses, as they do not explicitly model memory. To bridge this gap, we aim to model memory addresses within the ML-based hybrid framework and to make fine-grained predictions that differentiate the memory contents of different data pointers. Modeling memory addresses is, however, challenging, as it requires the model to (i) distinguish between different memory addresses, (ii) learn to reason about indirect address references and memory contents, and (iii) learn the compositional effects of multiple instructions that involve memory operations.
To this end, we propose a new learning framework comprising pretraining and finetuning steps inspired by masked language modeling (MLM) [25], as shown in Figure 1. Unlike traditional MLM, where the input is restricted to a single modality (e.g., text), our model learns from multi-modal information: instructions (static code), traces (dynamic values), and memory addresses (code spatial layout). We deploy a novel fusion module to simultaneously capture the interactions of these modalities for predicting memory dependencies. During pretraining, we mask random tokens from these modalities. While predicting masked opcodes teaches the model to synthesize instructions, predicting masked values in traces and memory addresses forces the model to learn to interpret instructions and their effects on registers and memory contents. For instance, if we mask the value of rax in mov [rax],rbx in the above example and train the model to predict it, the model is forced to interpret the previous instructions in the context and reason about how they compute the trace values that flow into rax. We hypothesize that such pretraining helps the model gain a general understanding of value flow behavior involving memory operations.
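As a rough illustration of the masked-value objective, the PyTorch-style sketch below masks the trace value of rax for mov [rax],rbx and trains a small encoder to recover it; the toy vocabulary, tokenization, and TraceEncoder module are hypothetical placeholders rather than the fusion architecture described later.

```python
# Hedged sketch of the masked-value pretraining objective in PyTorch.
# The toy vocabulary, tokenization, and TraceEncoder are illustrative
# assumptions; the paper's fusion architecture is not reproduced here.
import torch
import torch.nn as nn

VOCAB = {"[MASK]": 0, "mov": 1, "[rax]": 2, "rbx": 3, "0x1": 4, "0x3": 5, "0x5": 6}

class TraceEncoder(nn.Module):
    """Toy encoder standing in for the multi-modal fusion model."""
    def __init__(self, vocab_size=len(VOCAB), dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))

# Code tokens for `mov [rax],rbx` followed by trace tokens, with the trace
# value of rax replaced by [MASK]; rbx's value (0x1) is left visible.
tokens = torch.tensor([[VOCAB["mov"], VOCAB["[rax]"], VOCAB["rbx"],
                        VOCAB["[MASK]"], VOCAB["0x1"]]])
masked_position = 3
target = torch.tensor([VOCAB["0x3"]])          # ground-truth value of rax in Trace 1

model = TraceEncoder()
logits = model(tokens)[:, masked_position, :]  # prediction for the masked slot only
loss = nn.functional.cross_entropy(logits, target)
loss.backward()                                # gradient signal that teaches value-flow reasoning
```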
After pretraining, the model is finetuned to statically (without
the trace values) reason about the value flows (based on its learned