in parallel (section IV).
Finally, this paper presents another interesting finding:
the state-of-the-art CPU-based automata processing engine,
Hyperscan [20] sometimes outperforms the state-of-the-art
FPGA-based automata processing engine, Grapefruit, in terms
of throughput (pattern matching rate). Even though FPGAs
offer massive parallelism, FPGAs’ clock rate is considerably
lower than CPUs. On top of that, applications face perfor-
mance degradation if the computation model does not fit into
the CPU cache. Prior works show that in the worst case, the
FPGA-based pattern matching engine has equal throughput
compared to the CPU-based pattern matching engine [2, 5].
Contrary to that, this paper shows that for some real-world
benchmarks, the CPU gets 4.5x speedup over the FPGA
(Section V). However, if a workload is such that it matches
many distinct patterns, CPU throughput degrades. So, it can
be affirmed that neither CPU nor FPGA is a clear winner for
the FA-based pattern matching engine.
In summary, this paper makes the following contributions:
•It presents a detailed analysis of the state count of the
optimized NFA vs. minimized DFA across various real-
world automata benchmarks (section III).
•It shows that on FPGA, the choice of NFA over DFA
is less important than the need to keep individual DFA
separate and the ability to support parallel processing of
a large number of DFAs (section IV).
•This paper presents an in-depth performance analysis
of the state-of-the-art CPU-based automata processing
engine vs. the state-of-the-art FPGA-based automata pro-
cessing engine (section V).
II. RELATED WORK
The state count of the NFA and DFA has been studied
previously from the perspective of network-intrusion detection,
and synthetic benchmarks [13, 21, 22]. The studies on the
network-intrusion detection benchmarks showed that gener-
ating a DFA for a specified pattern set can exponentially
increase the DFA state count. They demonstrate that regular
expressions containing constraints such as dot-star (pq.*rs)
and repeated sub-patterns (xyz.{100}abc) induce exponential
growth in DFA states. Contrary to the prior works, this paper
performs state count analysis for a diverse state of real-world
benchmarks and shows that if separate DFAs are generated for
each pattern, the DFA state count is equal to the equivalent
optimized NFA. However, in the worst-case DFA state count
shows a polynomial growth rate.
Tabakov et al. [17] evaluated Hopcroft’s [15] and Brzo-
zowski’s [16] DFA minimization algorithm using a synthetic
automata type called random automata. Their analysis illus-
trates that the state count of a DFA depends on the automata
graph’s transition density1. For smaller transition densities, the
DFA state grows super-polynomially but sub-exponentially.
Additionally, their result shows that minimized DFA has a
1For each symbol ratio of the total number of transitions to the total number
of states. Details can be found at [17].
smaller state count than the equivalent NFAs with greater tran-
sition densities. One limitation of this analysis is that the NFAs
used are not optimized. Even though NFA minimization is a
PSPACE-complete problem [23], a heuristics-based approach
can generate optimized NFA, as shown in this paper.
Since Floyd et al. [24] introduced hardware implementation
of NFA, there has been a substantial amount of work on high-
performance automata processors using either DFA [4, 13,
25] or NFA [1]–[3, 26] as a computation model. The state-
of-the-art CPU-based automata processing engine Hyperscan
[20] uses regex decomposition to separate the pattern into
disjoint strings and FA. Moreover, to process FA, Hyperscan
exploits SIMD vector operation to perform the maximum
possible state transition for each memory access. Hyperscan
encounters throughput degradation due to the CPU’s high
memory latency if the working set size does not fit into
the CPU cache. Grapefruit [5], a state-of-the-art FPGA-based
automata processing engine, maps a cluster of automata into
the FPGA logic fabric. Even though this approach helps in
accelerating Grapefruit’s throughput, if the automata graph
does not fit into the FPGA logic fabric, Grapefruit fails to
process it. This paper does not discuss GPU-based automata
processing engine because prior works [27, 28] have presented
that GPU is not promising for processing FA.
Xie et al. [2] examined throughput comparison between the
state-of-the-art CPU-based automata processing engine, Hy-
perscan [20] and an FPGA-based automata processing engine,
REAPR. Their analysis shows that in the worst case, REAPR
performs the same as Hyperscan, whereas, in the best case,
REAPR has a 2,000x speedup over Hyperscan. However, this
paper shows that, based on the workload and the benchmark,
neither Hyperscan nor Grapefruit is a clear winner for the FA-
based pattern matching engine.
Becchi et al. [14] presented a performance evaluation of
a network processor and a general-purpose processor using
NFA, DFA, and hybrid-FA. Hybrid-FA is a variant of automata
formed by halting the NFA→DFA conversion in places where
DFA blowup occurs. The evaluation at [14] shows that hybrid-
FA has the overall best throughput across the network and
general-purpose processors. The reason is that the DFA part
of the hybrid-FA is less susceptible to high memory latency,
and the NFA part of the hybrid-FA keeps a check on the
state explosion issue of the DFA. This performance evaluation
is limited to von Neumann architecture, whereas this paper
does the performance evaluation for von Neumann vs. spatial
architecture.
III. NFA VS. DFA STATE COUNT ANALYSIS
By definition, if an NFA requires nstates to express a
regular language Lor pattern P, the equivalent DFA will
require at most 2nstates [8]. As a result, DFAs are disregarded
for designing FA-based pattern matching engines on FPGA
[2, 21]. The state count study of DFA vs. NFA across diverse
real-world benchmarks is extremely important for automata
processing research. Because it would resolve if, for real-
world benchmarks, minimized DFAs’ state count exhibits an
2