
MODELS ’22 Companion, October 23–28, 2022, Montreal, QC, Canada Yang et al.
pipeline that performs the extraction of UML class diagrams. Figure
1 summarizes the process.
Our approach combines machine learning with pattern-based
diagram generation. To perform machine learning, we start by creating
a dataset of UML class diagrams and their corresponding
specifications in natural language (top part of Figure 1). We select
pre-existing UML class diagrams from the AtlanMod Zoo repository¹.
The selected diagrams are decomposed into fragments and
manually labeled by volunteer participants. After postprocessing,
the labeled diagrams are stored in a repository.
The bottom part of Figure 1 shows the actual diagram generation
process, which starts as soon as a user submits a software
specification. The submitted natural language specification is
preprocessed and decomposed into sentences. Using a classifier
built from the above-mentioned dataset, the sentences are labeled
according to the nature of the UML construct they refer to, i.e., a
class or a relationship. According to this label, specific procedures of
parsing and extraction are performed on the sentence to generate
a UML fragment. In the end, all UML fragments are composed back
together into one UML class diagram.
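The runtime stages described above can be sketched as a small pipeline. This is only an illustrative skeleton, not the authors' implementation: the function `generate_diagram`, its parameters, and the toy stages below are all hypothetical names, and fragments are represented as plain dictionaries purely for the sake of the example.

```python
from typing import Callable, Dict, List


def generate_diagram(
    specification: str,
    preprocess: Callable[[str], str],
    split: Callable[[str], List[str]],
    classify: Callable[[str], str],
    extractors: Dict[str, Callable[[str], dict]],
    compose: Callable[[List[dict]], dict],
) -> dict:
    """Run the runtime stages in order and return the composed diagram."""
    text = preprocess(specification)   # e.g. pronoun substitution
    sentences = split(text)            # decompose the text into sentences
    fragments = []
    for sentence in sentences:
        label = classify(sentence)     # "class" or "relationship"
        # The label selects which extraction procedure handles the sentence.
        fragments.append(extractors[label](sentence))
    return compose(fragments)          # merge all fragments into one diagram


# Toy placeholder stages (assumptions, for illustration only).
def toy_split(text: str) -> List[str]:
    return [s.strip() + "." for s in text.split(".") if s.strip()]


def toy_classify(sentence: str) -> str:
    return "relationship" if " is " in sentence else "class"


toy_extractors = {
    "class": lambda s: {"kind": "class", "source": s},
    "relationship": lambda s: {"kind": "relationship", "source": s},
}

diagram = generate_diagram(
    "A teacher has a name. A course is taught by a teacher.",
    preprocess=lambda text: text,
    split=toy_split,
    classify=toy_classify,
    extractors=toy_extractors,
    compose=lambda fragments: {"fragments": fragments},
)
```

Passing each stage as a callable keeps the sketch independent of any particular NLP library; real stages would replace the toy ones.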
2.2 Dataset Creation
We create a new dataset for both the operation and the evaluation of
our approach. In particular, we use this dataset to learn a classifier
for the Classification step in Figure 1.
To build the dataset, we start from an existing set of UML class
diagrams from the AtlanMod Zoo. The AtlanMod Zoo has a repos-
itory of 305 high-quality UML class diagrams that model various
domains. The size of the diagrams varies from a few to hundreds
of classes. We fragment each diagram into simple classes (Figure 2)
and relationships (Figure 3). Table 1 shows the size of the initial
set of diagrams and the fragments, as well as the portion that we
labeled.
Dataset         UML models    UML fragments
AtlanMod Zoo    305           8706
Labeled         62            649
Table 1: UML datasets and their sizes by version
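To make the fragmentation step concrete, the following sketch splits a diagram into one fragment per class and one fragment per relationship. The data classes and the `fragment` function are hypothetical; the exact content of a fragment in the dataset (for example, how much of the neighboring classes a relationship fragment includes) is an assumption here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class UmlClass:
    name: str
    attributes: List[str] = field(default_factory=list)


@dataclass
class Relationship:
    source: str
    target: str
    kind: str  # e.g. "association" or "inheritance"


@dataclass
class ClassDiagram:
    classes: List[UmlClass]
    relationships: List[Relationship]


def fragment(diagram: ClassDiagram) -> List[Tuple[str, object]]:
    """Return one ("class", ...) fragment per class and
    one ("relationship", ...) fragment per relationship."""
    fragments: List[Tuple[str, object]] = [("class", c) for c in diagram.classes]
    fragments += [("relationship", r) for r in diagram.relationships]
    return fragments


example = ClassDiagram(
    classes=[UmlClass("Course", ["title"]), UmlClass("Teacher", ["name"])],
    relationships=[Relationship("Course", "Teacher", "association")],
)
example_fragments = fragment(example)
```

Recording the fragment type at this point means every fragment already carries a "class" or "relationship" tag before any English specification is attached to it.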
Since we are interested in the translation of specifications into
diagrams, each UML class diagram needs to be paired with an English
specification. To achieve that goal, we set up a website where
we crowdsource the labeling of fragments. The website offers all
305 diagrams, containing 8706 fragments, for labeling. We present
the diagrams in ascending order of complexity. The website first
shows a complete diagram, then iterates over its fragments for
labeling while keeping the whole diagram in view. The volunteer
participants write an English specification for each fragment. We
give examples of labels to help the participants write at the right
level of abstraction.
We send the labeling invitation to different MDE mailing lists
and specific large research groups active in the MDE field. Volunteer
participants are mostly university students and faculty members
across the world. To ensure that the labeling is done in good faith,
we do not offer monetary compensation for participation. However,
since participation was low, we did not impose a contribution limit.
¹ https://web.imt-atlantique.fr/x-info/atlanmod/index.php?title=Zoos
After about two months of crowdsourcing, we receive labels for
649 fragments across 62 UML class diagrams. The produced dataset
is available on a public repository². To ensure quality, labels are
reviewed and some are rejected. We replace the rejected labels by
labeling the corresponding fragments again ourselves. Figure 4 shows
example labels.
2.3 Preprocessing and Fragmentation
Preprocessing is the first step after receiving an input specification
from the user, as shown in Figure 1. We substitute pronouns throughout
the text, such as it and him, by their reference nouns. This is
done using coreferee [5], a tool written in Python that
performs coreference resolution, including pronoun substitution.
A course is taught by a teacher. A classroom is assigned to it.
⟹
A course is taught by a teacher. A classroom is assigned to a course.
Pronoun substitution makes the sentences of the English specification
less dependent on each other semantically, so that each sentence
can later be processed on its own. The
accuracy of coreferee for general English text is 81%.
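The substitution step itself can be illustrated with a minimal sketch. It deliberately does not call coreferee: the `resolutions` mapping is a hypothetical stand-in for the output of a coreference resolver, and the function name `substitute_pronouns` is an assumption made for this example.

```python
import re


def substitute_pronouns(text: str, resolutions: dict) -> str:
    """Replace resolved pronouns by their referents.

    `resolutions` maps the 0-based index of a word in `text` to the
    noun phrase it refers to (assumed to come from a coreference resolver).
    """
    # Split into alternating word and non-word chunks so that spacing
    # and punctuation are preserved exactly.
    chunks = re.findall(r"\w+|\W+", text)
    word_index = 0
    output = []
    for chunk in chunks:
        if re.fullmatch(r"\w+", chunk):
            output.append(resolutions.get(word_index, chunk))
            word_index += 1
        else:
            output.append(chunk)
    return "".join(output)


specification = "A course is taught by a teacher. A classroom is assigned to it."
# Word 12 is "it"; a resolver is assumed to have linked it to "a course".
resolved = substitute_pronouns(specification, {12: "a course"})
```

In practice the mapping would be derived from the coreference chains produced by the resolver rather than written by hand.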
Sentence fragmentation is the second step in the runtime operations
in Figure 1. We split the preprocessed text into individual
sentences using spaCy [3]. spaCy is an NLP library in Python that
can be used for various NLP tasks, such as sentence splitting. spaCy
splits text into sentences by looking at punctuation and special
cases like abbreviations. Its decisions are powered by pre-trained
statistical models. We use the small English model, which offers
good speed and respectable performance. For instance, in the
following example, the first two dots are not considered for splitting
the sentences but the third dot is.
An employee has a level of studies, i.e., a degree. An employee is
affiliated to a department.
⟹
𝑠1: An employee has a level of studies, i.e., a degree.
𝑠2: An employee is affiliated to a department.
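The role of abbreviations in sentence splitting can be illustrated with a simplified, purely rule-based splitter. This is not how spaCy works internally (spaCy relies on pre-trained statistical models); the abbreviation list and the function `split_sentences` are assumptions made for this sketch.

```python
# Small, hand-picked abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"i.e.", "e.g.", "etc.", "cf.", "vs."}
SENTENCE_END = (".", "!", "?")


def split_sentences(text: str) -> list:
    """Split on whitespace-separated tokens that end a sentence,
    ignoring tokens that are known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith(SENTENCE_END)
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences


text = (
    "An employee has a level of studies, i.e., a degree. "
    "An employee is affiliated to a department."
)
sentences = split_sentences(text)
```

A rule-based splitter like this one fails on abbreviations missing from its list, which is one reason to prefer a trained model for real specifications.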
2.4 Sentence Classification
Sentence classification is the third step in the runtime operations of
Figure 1. Classification provides additional information on the English
specification that can be used later to better generate the related
UML diagram fragment. Each sentence is classified as describing
either a "class" or a "relationship".
The training data for the classifier comes from the dataset
described in Section 2.2. Each data point is structured as a pair
<English specification, UML fragment> and carries a "class"
or "relationship" label, assigned when the dataset was processed from
the AtlanMod Zoo. The pairing means that the English specification
belongs to that specific UML fragment. Our classifier is trained to
predict the "class/relationship" label from an English specification.
To evaluate the accuracy of the classifier, we use 80% of the data
for training and the remaining 20% for testing.
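The 80/20 evaluation protocol can be sketched as follows. The classifier family is not named in this section, so a small bag-of-words naive Bayes model is swapped in purely for illustration; the toy sentences, the function names, and the fixed random seed are all assumptions. The UML fragment of each pair is omitted because the classifier predicts the label from the English text alone.

```python
import math
import random
from collections import Counter, defaultdict


def train_test_split(pairs, train_fraction=0.8, seed=0):
    """Shuffle the data points and cut them into training and test sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def tokenize(sentence):
    return sentence.lower().replace(".", " ").split()


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, pairs):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        for sentence, label in pairs:
            self.label_counts[label] += 1
            self.word_counts[label].update(tokenize(sentence))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, sentence):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for word in tokenize(sentence):
                score += math.log((self.word_counts[label][word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, pairs):
    return sum(model.predict(s) == y for s, y in pairs) / len(pairs)


# Hypothetical toy data points: (English specification, label).
TOY_DATASET = [
    ("A teacher has a name.", "class"),
    ("A course has a title.", "class"),
    ("A student has an identifier.", "class"),
    ("A classroom has a capacity.", "class"),
    ("An employee has a salary.", "class"),
    ("A course is taught by a teacher.", "relationship"),
    ("A classroom is assigned to a course.", "relationship"),
    ("An employee is affiliated to a department.", "relationship"),
    ("A student is enrolled in a course.", "relationship"),
    ("A department is managed by an employee.", "relationship"),
]

train_set, test_set = train_test_split(TOY_DATASET)
model = NaiveBayes().fit(train_set)
test_accuracy = accuracy(model, test_set)
```

With only ten toy sentences the resulting accuracy is not meaningful; the sketch only shows how the split and the evaluation fit together.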
² https://github.com/XsongyangX/uml-classes-and-specs