
Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Espada et al.
1 Introduction
Genetic Programming (GP) has been applied to several do-
mains [
1
], including test-case generation [
6
], program syn-
thesis [
10
] and automatic optimization of programs [
18
].
Being a search-based method, GP often requires the user to
encode domain-specic restrictions to the possible solutions.
Throughout the Genetic Programming procedure, many
solutions (also called programs or trees) are generated, mod-
ied and combined in order to search the one that has the
best performance or solves the problem. One of the main
challenges of using GP is to dene the boundaries of the
search-space in a way that is performant and ecient. En-
coding domain-knowledge in the denition of the search
space is desirable, so many candidate, but invalid solutions
are removed from the search.Grammar-Guided and Strongly-
Typed Genetic Programming (GGGP [
33
] and STGP [
21
])
are two popular families of approaches to guide the random
solution generation by the user. To restrict valid solutions,
GGGP relies on user-dened grammars while STGP relies
on user-dened type annotations.
Within each family of approaches, there are dierent trade-
os between expressive power and computational complex-
ity. Context-Free Grammars [
33
] can be used with little over-
head in processing, while Attribute Grammars [
5
] can en-
code more complex restrictions. Similarly, polymorphic type
systems [
37
] are more expressive than monomorphic ones,
but they have a higher overhead during the evolutionary
process.
However, in practice, popular implementations of these
systems are limited to Context-Free grammars (CFG) (e.g.,
PonyGE2 [
7
]) and base types (e.g., DEAP [
11
] and Jenet-
ics [36]).
Problem Statement
GGGP requires users to express the search-space of valid pro-
grams in the Backus Normal Form (BNF) or one of its deriva-
tions, such as the Extended Backus Normal Form (EBNF).
We consider BNF and its derivatives to be a textual editing
medium of the target language, as concepts in the target
language (like types, functions, literals) have no meaning in
the grammar language. As such, users are treating the target
language as character streams, instead of treating them as
semantic code units, as they do in code editors.
The BNF approach is not ergonomic nor error-prone due
to the interleaving of two languages: the BNF meta-syntax
and the syntax of the target language, i.e., the language of
the generated programs. As an example, Listing 1 depicts
two common errors that both novices and experts do while
writing grammars. These are not detected immediately, and
are typically only revealed during execution.
Ideally, these two bugs would be detected immediately in
most development environments: the type-checker would
identify a missing import, and the type mismatch in the
1<S> ::= <S> "+" <S> | "x" | 1 | np.array([<A>])
2<A> ::= [0-9]+,<A> | [0-9]+
Listing 1.
Example of a grammar with two possible bugs:
the
np
module may not be imported, and the
+
operator may
be applied to a string and integer, causing a crash.
1<Real> ::= <RealLit> | DATASET[GE_RANGE:3] | <
RealLit> + <RealLit> | np.average(<Vector>)
2<RealLit> ::= <d>.<d>
3<d> ::= GE_RANGE:10<d> |
4<Vector> ::= DATASET[3 + GE_RANGE:3]
Listing 2. PonyGE2 grammar for a subset of VectorialGP
operator. Is it really necessary that, to implement a grammar,
developers have to lose access to an array of tooling that
has shown to be useful in software engineering? Or is there
a more ergonomic interface that is integrated into the host
language ecosystem, providing access to code editors, syntax
highlighting, linting, formatting, test suites, etc...?
If there is, such an approach could reduce the barrier to
adoption of GGGP, STGP, and similar systems, increasing
the productivity of practitioners and reducing the number of
possible bugs that are not caught earlier due the mismatch
between the host language and the grammar language.
It is important to notice that usability has been identied
as a relevant open challenge for GP both in 2010 [
25
], and
again in 2020, when it was identied as one of the 10 major
challenges for the following 15 years [
30
]. This work ad-
dresses exactly that problem, especially in the case where
practitioners encode domain information to improve the
performance of GP.
Approach
Our rst contribution is an object-oriented encoding of gram-
mars to be used in GGGP, regardless of representation. In
short, non-terminals are represented as abstract classes (or
interfaces, or traits) and productions are represented as con-
crete classes that implement the corresponding abstract class.
As a running example, we will use grammars to describe
the VectorialGP example in Azzali et al
. [2]
. Listing 2 de-
scribes that subset using a PonyGE2’s avor of BNF, while
Listing 3 encodes the same grammar using the proposed
approach, both implemented in Python.
There is a direct correspondence between the abstract and
concrete types, and the non-terminals and productions of
the grammar. This direct correspondence is an advantage of
this approach, that does not add more cognitive overhead
when using the Object-Oriented encoding.