Towards Automatically Extracting UML Class Diagrams from Natural Language Specifications

Song Yang
Université de Montréal
Montreal, Canada
song.yang.1@umontreal.ca
Houari Sahraoui
Université de Montréal
Montreal, Canada
houari.sahraoui@umontreal.ca
ABSTRACT
In model-driven engineering (MDE), UML class diagrams serve
as a way to plan and communicate between developers. However,
creating them is complex and resource-consuming. We propose an
automated approach for the extraction of UML class diagrams from
natural language software specifications. To develop our approach,
we create a dataset of UML class diagrams and their English
specifications with the help of volunteers. Our approach is a
pipeline of steps consisting of the segmentation of the input into
sentences, the classification of the sentences, the generation of
UML class diagram fragments from sentences, and the composition of
these fragments into one UML class diagram. We develop a
quantitative testing framework specific to UML class diagram
extraction. Our approach yields low precision and recall but serves
as a benchmark for future research.
CCS CONCEPTS
• Software and its engineering → Software design engineering;
• Computing methodologies → Information extraction; Classification
and regression trees.
KEYWORDS
Model-driven engineering, Machine learning, Natural language
processing, Domain modeling
ACM Reference Format:
Song Yang and Houari Sahraoui. 2022. Towards Automatically Extracting
UML Class Diagrams from Natural Language Specifications. In ACM/IEEE
25th International Conference on Model Driven Engineering Languages and
Systems (MODELS ’22 Companion), October 23–28, 2022, Montreal, QC, Canada.
ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3550356.3561592
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
MODELS ’22 Companion, October 23–28, 2022, Montreal, QC, Canada
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9467-3/22/10...$15.00
https://doi.org/10.1145/3550356.3561592
1 INTRODUCTION
Software development is a complex and error-prone process. Part
of this complexity comes from the gap between domain experts
who are familiar with the domain knowledge but have limited
expertise with development tools, and software specialists who
master the development environments but are unfamiliar with
the target application domain. To fill that gap, the model-driven
engineering paradigm aims at raising the level of abstraction in
development activities by considering domain models, such as UML
class diagrams, as first-class development artifacts.
Though smaller, a gap still exists between the domain concepts
and the tools and languages that are produced to model them [8].
For a domain specialist, creating UML models from scratch is a
time-consuming and error-prone process that requires various
technical skills. To address that problem, various approaches target
the generation of models from different structured information such
as user stories [4]. However, little work has been done on the
extraction of models from natural language specifications. In the
specific case of UML class diagrams, existing work relies either on
techniques that use machine learning in a semi-automated process
[12] or on rule-based techniques that are fully automated but
require a restricted input [1, 7]. In this paper, we propose an
approach that combines both machine learning and rules while
accepting free-flowing text.
Our approach uses natural language patterns and machine learning
to fully automate the generation process. We first decompose a
specification into sentences. Then, using a trained classifier, we
tag each sentence as describing either a class or a relationship.
Next, using grammar patterns, we map each sentence into a UML
fragment. Finally, we assemble the fragments into a complete UML
diagram using a composition algorithm. In addition to our approach,
we build a dataset thanks to the effort of the modeling community.
This dataset is used to train the classifier and to evaluate the
approach.
We evaluate our approach on a dataset of 62 diagrams containing
624 fragments. Although our approach does not reach the accuracy
needed for practical use, our work explores the benefits of mixing
machine learning with natural language patterns for a fully
automated process. Our approach can serve as a baseline for future
research on generating UML diagrams from English specifications,
and the dataset created together with the defined quantitative
metrics can serve as a benchmark for this problem.
The rest of the paper is structured as follows. Section 2 gives
an overview of the proposed generation pipeline and the details
of each step. The setup and the results of evaluating the approach
are provided in Section 3.1. Section 4 lists some threats to
validity. Section 5 discusses the related work and positions our
contribution relative to it. Finally, we conclude this paper in
Section 6.
2 APPROACH
2.1 Overview
The goal is to design a method to translate English specifications
to UML diagrams. To do this, we implement a tool pipeline that
generates UML class diagrams from natural language specifications.
First, we create a dataset. Second, we implement an NLP pipeline
that performs the extraction of UML class diagrams. Figure 1
summarizes the process.
arXiv:2210.14441v2 [cs.SE] 27 Oct 2022
Our approach combines machine learning with pattern-based
diagram generation. To perform machine learning, we start by
creating a dataset of UML class diagrams and their corresponding
specifications in natural language (top part of Figure 1). We select
pre-existing UML class diagrams from the AtlanMod Zoo repository
(footnote 1). The selected diagrams are decomposed into fragments
and manually labeled by volunteer participants. After
postprocessing, the labeled diagrams are stored in a repository.
The bottom part of Figure 1 consists of the actual diagram
generation process, which takes place right after a user submits a
software specification. The submitted natural language specification
is then preprocessed and decomposed into sentences. Using a
classifier built from the above-mentioned dataset, the sentences are
labeled according to the nature of the UML construct they refer to,
i.e., a class or a relation. According to this label, specific
procedures of parsing and extraction are performed on the sentence
to generate a UML fragment. In the end, all UML fragments are
composed back together into one UML class diagram.
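To make the data flow of these runtime steps concrete, the following is a minimal Python sketch under stated assumptions, not the implementation behind the paper. The names `Fragment`, `compose`, and `run_pipeline` are hypothetical, each stage is passed in as a plain function, and composition is assumed to merge classes that share a name, which this section does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Relation = Tuple[str, str, str]  # (source class, label, target class)


@dataclass
class Fragment:
    """A partial class diagram produced from one sentence (hypothetical shape)."""
    classes: Dict[str, Set[str]] = field(default_factory=dict)  # name -> attributes
    relations: Set[Relation] = field(default_factory=set)


def compose(fragments: List[Fragment]) -> Fragment:
    """Merge fragments into one diagram.

    Assumption: two classes with the same name are the same class.
    """
    diagram = Fragment()
    for frag in fragments:
        for name, attributes in frag.classes.items():
            diagram.classes.setdefault(name, set()).update(attributes)
        for source, label, target in frag.relations:
            # Both ends of a relation must exist as classes in the result.
            diagram.classes.setdefault(source, set())
            diagram.classes.setdefault(target, set())
            diagram.relations.add((source, label, target))
    return diagram


def run_pipeline(
    text: str,
    preprocess: Callable[[str], str],
    split: Callable[[str], List[str]],
    classify: Callable[[str], str],
    generators: Dict[str, Callable[[str], Fragment]],
) -> Fragment:
    """Preprocess, split, classify, generate one fragment per sentence, compose."""
    sentences = split(preprocess(text))
    fragments = [generators[classify(sentence)](sentence) for sentence in sentences]
    return compose(fragments)
```

Passing the stages as functions keeps the sketch independent of any particular NLP library; the `generators` dictionary maps each label ("class" or "relationship") to the extraction procedure for that kind of sentence.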
2.2 Dataset Creation
We create a new dataset for both the operation and the evaluation of
our approach. In particular, we use this dataset to learn a classifier
for the Classification step in Figure 1.
To build the dataset, we start from an existing set of UML class
diagrams from the AtlanMod Zoo. The AtlanMod Zoo has a repos-
itory of 305 high-quality UML class diagrams that model various
domains. The size of the diagrams varies from a few to hundreds
of classes. We fragment each diagram into simple classes (Figure 2)
and relationships (Figure 3). Table 1 shows the size of the initial
set of diagrams and the fragments, as well as the portion that we
labeled.
Dataset        UML models   UML fragments
AtlanMod Zoo   305          8706
Labeled        62           649
Table 1: UML datasets and their sizes by version
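As an illustration of the fragmentation described above, the sketch below splits a diagram into one fragment per class and one fragment per relationship. It is a hedged example: the dictionary layout and the function name `fragment_diagram` are assumptions made for illustration, and the actual fragment format of the dataset may differ.

```python
from typing import Dict, List, Tuple


def fragment_diagram(
    classes: Dict[str, List[str]],
    relations: List[Tuple[str, str, str]],
) -> List[dict]:
    """Split a diagram into class fragments and relationship fragments.

    `classes` maps a class name to its attribute names; `relations` holds
    (source, label, target) triples. Both shapes are hypothetical.
    """
    fragments: List[dict] = []
    for name, attributes in classes.items():
        # One fragment per class, carrying only that class and its attributes.
        fragments.append({"kind": "class", "name": name, "attributes": list(attributes)})
    for source, label, target in relations:
        # One fragment per relationship, naming only the two classes it links.
        fragments.append(
            {"kind": "relationship", "source": source, "label": label, "target": target}
        )
    return fragments
```

With this layout, the number of fragments of a diagram is simply its number of classes plus its number of relationships.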
Since we are interested in the translation of specifications into
diagrams, each UML class diagram needs to be paired with an English
specification. To achieve that goal, we set up a website where
we crowdsource the labeling of fragments. The website proposes
the labeling of 305 diagrams containing 8706 fragments. We present
the diagrams in ascending order of complexity. The website first
shows a complete diagram, then iterates on its fragments for
labeling while keeping the whole diagram in view. The volunteer
participants write an English specification for each fragment. We
give examples of labels to help the participants write at the right
level of abstraction.
We send the labeling invitation to different MDE mailing lists
and specific large research groups active in the MDE field. Volunteer
participants are mostly university students and faculty members
across the world. To ensure that the labeling is done in good faith,
we do not offer monetary compensation for participation. However,
since participation was low, we did not impose a contribution limit.
After about two months of crowdsourcing, we receive labels for
649 fragments across 62 UML class diagrams. The produced dataset
is available on a public repository (footnote 2). To ensure quality,
labels are reviewed and some are rejected. We replace the rejected
labels by labeling the corresponding fragments again ourselves.
Figure 4 shows example labels.
1 https://web.imt-atlantique.fr/x-info/atlanmod/index.php?title=Zoos
2.3 Preprocessing and Fragmentation
Preprocessing is the first step after receiving an input specification
from the user as shown in Figure 1. We substitute pronouns
throughout the text, such as "it" and "him", by their reference
nouns. This is done using coreferee [5], which is a tool written in
Python that performs coreference resolution, including pronoun
substitution.
A course is taught by a teacher. A classroom is assigned to it.
⇒
A course is taught by a teacher. A classroom is assigned to a course.
Pronoun substitution allows sentences in the English specification
to be less dependent on each other for semantic purposes. The
accuracy of coreferee for general English text is 81%.
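The substitution step itself can be sketched in a few lines of Python. This is only an illustration and does not call coreferee: it assumes that a coreference resolver has already reported, for each pronoun, its word position and its antecedent. The function name `substitute_pronouns` and the `antecedents` input format are hypothetical.

```python
import re
from typing import Dict


def substitute_pronouns(text: str, antecedents: Dict[int, str]) -> str:
    """Replace the words at the given positions with their antecedents.

    `antecedents` maps a 0-based word index to the noun phrase the word
    refers to, as a coreference resolver might report it (assumed format).
    """
    # Alternate between word tokens and separator tokens so that the
    # original spacing and punctuation are preserved on output.
    tokens = re.findall(r"\w+|\W+", text)
    word_index = 0
    output = []
    for token in tokens:
        if re.match(r"\w", token):
            output.append(antecedents.get(word_index, token))
            word_index += 1
        else:
            output.append(token)
    return "".join(output)


example = "A course is taught by a teacher. A classroom is assigned to it."
resolved = substitute_pronouns(example, {12: "a course"})
```

In the example, word 12 is the pronoun "it", so `resolved` reproduces the rewritten specification shown above.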
Sentence fragmentation is the second step in the runtime operations
in Figure 1. We split the preprocessed text into individual
sentences, using spaCy [3]. spaCy is an NLP library in Python that
can be used for various NLP tasks, such as sentence splitting. spaCy
splits text into sentences by looking at punctuation and special
cases like abbreviations. Its decisions are powered by pre-trained
statistical models. We use the small English model, which has a
good speed and respectable performance. For instance, in the
following example, the first two dots are not considered for
splitting the sentences but the third dot is.
An employee has a level of studies, i.e., a degree. An employee is
affiliated to a department.
⇒
s1: An employee has a level of studies, i.e., a degree.
s2: An employee is affiliated to a department.
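To illustrate why abbreviations need special handling, here is a deliberately simple rule-based splitter written with the Python standard library only. It is a stand-in for illustration and not spaCy's statistical sentence segmentation; the abbreviation list and the function name `split_sentences` are assumptions.

```python
from typing import List

# Small, hand-picked list of abbreviations that end with a dot (assumption).
ABBREVIATIONS = {"i.e.", "e.g.", "etc.", "cf.", "vs."}


def split_sentences(text: str) -> List[str]:
    """Split on sentence-final punctuation, skipping known abbreviations."""
    sentences: List[str] = []
    current: List[str] = []
    for word in text.split():
        current.append(word)
        ends_with_stop = word.endswith((".", "!", "?"))
        if ends_with_stop and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that has no final punctuation.
        sentences.append(" ".join(current))
    return sentences


spec = (
    "An employee has a level of studies, i.e., a degree. "
    "An employee is affiliated to a department."
)
parts = split_sentences(spec)
```

On the example specification, `parts` contains the two sentences s1 and s2. A fixed abbreviation list cannot tell an abbreviation that ends a sentence from one in the middle of it, which is one reason to prefer a trained model for this step.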
2.4 Sentence Classification
Sentence classification is the third step in the runtime operations of
Figure 1. Classification provides additional information on the
English specification that can be used later to better generate the
related UML diagram fragment. Each sentence is classified as
describing either a "class" or a "relationship".
The training data for the classifier comes from the dataset
described in Section 2.2. Each data point is structured as a pair
<English specification, UML fragment> and carries a "class" or
"relationship" label, which is assigned when the fragments are
extracted from the AtlanMod Zoo. The pairing means that the English
specification belongs to that specific UML fragment. Our classifier
is trained to predict the "class/relationship" label from an English
specification. To evaluate the accuracy of the classifier, we use 80%
of the data for the training, and the remaining 20% for testing.
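The 80/20 evaluation protocol can be sketched as follows. This is a minimal illustration with hypothetical names (`train_test_split`, `accuracy`, `Example`); the shuffling, the fixed seed, and the way the cut is rounded are assumptions, since the text does not say how the split is drawn or which classifier is used.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (English specification, "class" or "relationship")


def train_test_split(
    data: Sequence[Example], train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Shuffle the labeled sentences and cut them into training and test sets."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict: Callable[[str], str], test_set: Sequence[Example]) -> float:
    """Fraction of test sentences whose predicted label matches the true label."""
    if not test_set:
        return 0.0
    correct = sum(1 for text, label in test_set if predict(text) == label)
    return correct / len(test_set)
```

Any classifier that exposes a `predict(text)` function returning "class" or "relationship" can be scored with `accuracy` on the held-out 20%.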
2 https://github.com/XsongyangX/uml-classes-and-specs