
Address   Instruction       Trace 1              Trace 2
...
0x06:     mov [rax],rbx∗    rax=0x3; rbx=0x1     rax=0x5; rbx=0x1
...
0x1f:     mov rdi,[0x3]     rdi=0x1              rdi=0x0
...
∗ In Intel x86 syntax [77], mov [rax],rbx means writing register rbx to the memory pointed to by register rax; [] means dereferencing a memory address.
The two instructions are memory-dependent (read-after-write) when rax=0x3 (Trace 1). Determining this statically requires precise value flow analysis to establish which values can flow into rax from different program contexts.
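The contrast between the two traces suggests a simple dynamic criterion: a write and a later read are dependent in a given trace exactly when they touch the same concrete address. The minimal sketch below illustrates that check; the trace record format (pc/addr dictionaries) is an illustrative assumption, not the representation used in this paper.

```python
# Minimal sketch: confirming the read-after-write dependence in the example
# above from concrete traces. The trace record layout is an assumption made
# for illustration only.

def is_raw_dependent(write_rec, read_rec):
    """A write and a later read are memory-dependent (read-after-write)
    in a trace when they access the same concrete address."""
    return write_rec["addr"] == read_rec["addr"]

# Trace 1: rax=0x3, so `mov [rax],rbx` writes 0x3, which `mov rdi,[0x3]` reads.
trace1 = {"write": {"pc": 0x06, "addr": 0x3}, "read": {"pc": 0x1F, "addr": 0x3}}
# Trace 2: rax=0x5, so the write goes to 0x5 and the read of 0x3 is unrelated.
trace2 = {"write": {"pc": 0x06, "addr": 0x5}, "read": {"pc": 0x1F, "addr": 0x3}}

print(is_raw_dependent(trace1["write"], trace1["read"]))  # True
print(is_raw_dependent(trace2["write"], trace2["read"]))  # False
```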
Over the last two decades, researchers have made numerous attempts to improve the accuracy and performance of binary memory dependence analysis [5, 6, 11, 16, 22, 34, 71]. The most common approach statically computes and propagates an over-approximated set of values that each register and memory address can contain at each program point using abstract interpretation. For example, the seminal work by Balakrishnan and Reps on value set analysis (VSA) [5] adopts strided intervals as the abstract domain and propagates the interval bounds of the operands (e.g., registers and memory locations) along each instruction. VSA reports two instructions as dependent if their intervals intersect. Unfortunately, these static approaches have been shown to be highly imprecise in practice [98]. Composing abstract domains along multiple instructions and merging them across a large number of paths quickly accumulates prohibitive over-approximation error. As a result, the set of accessed memory addresses computed by such approaches often ends up covering almost the entire memory space, leading to a large number of false positives (i.e., instructions with no dependencies are incorrectly detected as dependent).
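To make the source of this over-approximation concrete, below is a small sketch of a VSA-style intersection test over strided intervals; the StridedInterval class and may_alias rule are simplified assumptions for illustration, not the implementation of [5].

```python
# Hedged sketch of a VSA-style dependence check over strided intervals.
# A strided interval stride[lo, hi] abstracts the set {lo, lo+stride, ..., hi}.
# This is a simplified illustration, not the abstract domain from [5].
from dataclasses import dataclass

@dataclass(frozen=True)
class StridedInterval:
    stride: int
    lo: int
    hi: int

    def concretize(self):
        if self.stride == 0:
            return {self.lo}
        return set(range(self.lo, self.hi + 1, self.stride))

def may_alias(a: StridedInterval, b: StridedInterval) -> bool:
    """VSA flags two accesses as dependent if their abstract address sets
    may intersect (checked here by concretizing; real analyses use a
    symbolic intersection test)."""
    return bool(a.concretize() & b.concretize())

# Precise singleton addresses {0x3} vs {0x3}: truly dependent.
print(may_alias(StridedInterval(0, 0x3, 0x3), StridedInterval(0, 0x3, 0x3)))  # True
# Once an operand is widened to "any 8-byte-aligned slot in a large range",
# it intersects almost any other access, producing spurious dependencies.
widened = StridedInterval(8, 0x0, 0xFFF8)
print(may_alias(widened, StridedInterval(0, 0x100, 0x100)))  # True (likely spurious)
```

Once an interval is widened to cover a large, coarsely strided range, it intersects almost every other access, which is exactly the false-positive pattern described above.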
With the advent of data-driven approaches to program analysis [3, 69, 91], state-of-the-art memory dependence analysis increasingly uses statistical or machine learning (ML) based methods to improve precision [35, 58, 88, 98], but these methods still suffer from serious limitations. For example, DeepVSA [35] trains a neural network on static code to classify the memory locations accessed by each instruction into a coarse-grained abstract domain (stack, heap, and global) and uses the predicted memory region to instantiate the value set in VSA. However, such coarse-grained prediction results in high false positive rates: any two instructions accessing the same region (e.g., the stack) are always detected as dependent, even when they access two completely different addresses. To avoid the precision loss of static approaches, BDA [98] uses a dynamic approach that leverages probabilistic analysis to sample program paths and performs per-path abstract interpretation. However, as real-world programs often have many paths, performing per-path abstract interpretation even for a small subset of paths adds prohibitive runtime overhead, e.g., more than 12 hours to analyze a single program. This is perhaps not surprising: while a dynamic approach can be more accurate than static approaches, it can incur extremely high runtime overhead, especially when trying to achieve good code coverage.
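The precision problem with region-level predictions can be seen in a one-line rule: if dependence is declared whenever two accesses fall in the same predicted region, any two stack accesses collide. The sketch below is an assumed simplification of that rule, not DeepVSA's actual algorithm.

```python
# Hedged sketch: why coarse-grained region prediction (stack/heap/global)
# over-approximates memory dependence. The rule below is an illustrative
# assumption, not DeepVSA's algorithm.

def region_level_dependent(region_a: str, region_b: str) -> bool:
    # Two accesses are conservatively flagged as dependent whenever they
    # are predicted to touch the same memory region.
    return region_a == region_b

# Both accesses are on the stack but at different offsets (e.g., rsp-0x8 vs
# rsp-0x40): there is no true dependence, yet the region-level rule reports one.
print(region_level_dependent("stack", "stack"))   # True  -> false positive
print(region_level_dependent("stack", "global"))  # False
```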
To achieve higher accuracy at a much lower runtime cost, we propose an ML-based hybrid approach. Our key strategy is to learn to reason about approximate memory dependencies from the execution behavior of generic binary code during training. We then apply the learned knowledge to static code during inference without any extra runtime overhead (see Figure 1).
Figure 1: The workflow of our approach. We first pretrain the model to predict code based on its traces and predict traces based on its code. We then finetune the model to statically infer memory dependencies.
Such a hybrid approach, i.e., learning from both code and traces, has shown promise in several Software Engineering applications, including clone detection, type inference, and program fixing and synthesis [60, 63, 64, 90, 91].
However, none of these works can reason about fine-grained value flows through different memory addresses, as they do not explicitly model memory. To bridge this gap, we aim to model memory addresses within the ML-based hybrid framework and to make fine-grained predictions that differentiate the memory contents of different data pointers. Modeling memory addresses is, however, challenging, as it requires the model to (i) distinguish between different memory addresses, (ii) learn to reason about indirect address references and memory contents, and (iii) learn the compositional effects of multiple instructions that involve memory operations.
To this end, we propose a new learning framework comprising pretraining and finetuning steps inspired by masked language modeling (MLM) [25], as shown in Figure 1. Unlike traditional MLM, where the input is restricted to a single modality (e.g., text), our model learns from multi-modal information: instructions (static code), traces (dynamic values), and memory addresses (code spatial layout). We deploy a novel fusion module to simultaneously capture the interactions of these modalities for predicting memory dependencies. During pretraining, we mask random tokens from these modalities. While predicting masked opcodes teaches the model to synthesize instructions, predicting masked values in traces and memory addresses forces the model to learn to interpret instructions and their effects on registers and memory contents. For instance, if we mask the value of rax in mov [rax],rbx in the above example and train the model to predict it, the model is forced to interpret the previous instructions in the context and reason about how they compute the trace values that flow into rax. We hypothesize that such pretraining helps the model gain a general understanding of value flow behavior involving memory operations.
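As a rough illustration of the masked-value objective, the PyTorch-style sketch below masks the trace value of rax for mov [rax],rbx and trains a small encoder to recover it; the toy vocabulary, tokenization, and TraceEncoder module are hypothetical placeholders rather than the fusion architecture described later.

```python
# Hedged sketch of the masked-value pretraining objective in PyTorch.
# The toy vocabulary, tokenization, and TraceEncoder are illustrative
# assumptions; the paper's fusion architecture is not reproduced here.
import torch
import torch.nn as nn

VOCAB = {"[MASK]": 0, "mov": 1, "[rax]": 2, "rbx": 3, "0x1": 4, "0x3": 5, "0x5": 6}

class TraceEncoder(nn.Module):
    """Toy encoder standing in for the multi-modal fusion model."""
    def __init__(self, vocab_size=len(VOCAB), dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))

# Code tokens for `mov [rax],rbx` followed by trace tokens, with the trace
# value of rax replaced by [MASK]; rbx's value (0x1) is left visible.
tokens = torch.tensor([[VOCAB["mov"], VOCAB["[rax]"], VOCAB["rbx"],
                        VOCAB["[MASK]"], VOCAB["0x1"]]])
masked_position = 3
target = torch.tensor([VOCAB["0x3"]])          # ground-truth value of rax in Trace 1

model = TraceEncoder()
logits = model(tokens)[:, masked_position, :]  # prediction for the masked slot only
loss = nn.functional.cross_entropy(logits, target)
loss.backward()                                # gradient signal that teaches value-flow reasoning
```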
After pretraining, the model is finetuned to statically (without
the trace values) reason about the value flows (based on its learned