Towards Automatically Extracting UML Class Diagrams from Natural Language Specifications


Song Yang

Université de Montréal

Montreal, Canada

song.yang.1@umontreal.ca

Houari Sahraoui

Université de Montréal

Montreal, Canada

houari.sahraoui@umontreal.ca

ABSTRACT

In model-driven engineering (MDE), UML class diagrams serve as a way for developers to plan and communicate. However, creating them is complex and resource-consuming. We propose an automated approach for the extraction of UML class diagrams from natural language software specifications. To develop our approach, we create a dataset of UML class diagrams and their English specifications with the help of volunteers. Our approach is a pipeline of steps consisting of the segmentation of the input into sentences, the classification of the sentences, the generation of UML class diagram fragments from sentences, and the composition of these fragments into one UML class diagram. We develop a quantitative testing framework specific to UML class diagram extraction. Our approach yields low precision and recall but serves as a benchmark for future research.

CCS CONCEPTS

• Software and its engineering → Software design engineering; • Computing methodologies → Information extraction; Classification and regression trees.

KEYWORDS

Model-driven engineering, Machine learning, Natural language

processing, Domain modeling

ACM Reference Format:

Song Yang and Houari Sahraoui. 2022. Towards Automatically Extracting UML Class Diagrams from Natural Language Specifications. In ACM/IEEE 25th International Conference on Model Driven Engineering Languages and Systems (MODELS ’22 Companion), October 23–28, 2022, Montreal, QC, Canada. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3550356.3561592

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

MODELS ’22 Companion, October 23–28, 2022, Montreal, QC, Canada

ACM ISBN 978-1-4503-9467-3/22/10...$15.00

https://doi.org/10.1145/3550356.3561592

1 INTRODUCTION

Software development is a complex and error-prone process. Part of this complexity comes from the gap between domain experts who are familiar with the domain knowledge but have limited expertise with development tools, and software specialists who master the development environments but are unfamiliar with the target application domain. To fill that gap, the model-driven engineering paradigm aims at raising the level of abstraction in development activities by considering domain models, such as UML class diagrams, as first-class development artifacts.

Though smaller, a gap still exists between the domain concepts and the tools and languages that are produced to model them [ ]. For a domain specialist, creating UML models from scratch is a time-consuming and error-prone process that requires various technical skills. To address that problem, various approaches target the generation of models from different kinds of structured information, such as user stories [ ]. However, little work has been done on extraction from natural language specifications. In the specific case of UML class diagrams, existing work relies either on techniques that use machine learning in a semi-automated process [ ] or on rule-based techniques that are fully automated but require a restricted input [ ]. In this paper, we propose an approach that combines both machine learning and rules while accepting free-flowing text.

Our approach uses natural language patterns and machine learning to fully automate the generation process. We first decompose a specification into sentences. Then, using a trained classifier, we tag each sentence as describing either a class or a relationship. Next, using grammar patterns, we map each sentence into a UML fragment. Finally, we assemble the fragments into a complete UML diagram using a composition algorithm. In addition to our approach, we build a dataset thanks to the effort of the modeling community. This dataset is used to train the classifier and to evaluate the approach.

We evaluate our approach on a dataset of 62 diagrams containing 624 fragments. Although the accuracy of our approach does not reach the level needed for practical use, our work explores the benefits of mixing machine learning with natural language patterns for a fully automated process. Our approach can serve as a baseline for future research on generating UML diagrams from English specifications, and the dataset created together with the defined quantitative metrics can serve as a benchmark for this problem.

The rest of the paper is structured as follows. Section 2 gives an overview of the proposed generation pipeline and the details of each step. The setup and the results of evaluating the approach are provided in Section 3.1. Section 4 lists some threats to validity. Section 5 discusses the related work and positions our contribution with respect to it. Finally, we conclude this paper in Section 6.

2 APPROACH

2.1 Overview

The goal is to design a method to translate English specifications to UML diagrams. To do this, we implement a tool pipeline that generates UML class diagrams from natural language specifications. First, we create a dataset. Second, we implement an NLP pipeline that performs the extraction of UML class diagrams. Figure 1 summarizes the process.

arXiv:2210.14441v2 [cs.SE] 27 Oct 2022

Our approach combines machine learning with pattern-based diagram generation. To perform machine learning, we start by creating a dataset of UML class diagrams and their corresponding specifications in natural language (top part of Figure 1). We select pre-existing UML class diagrams from the AtlanMod Zoo repository¹. The selected diagrams are decomposed into fragments and manually labeled by volunteer participants. After postprocessing, the labeled diagrams are stored in a repository.

The bottom part of Figure 1 consists of the actual diagram generation process, which takes place right after a user submits a software specification. The submitted natural language specification is then preprocessed and decomposed into sentences. Using a classifier built from the above-mentioned dataset, the sentences are labeled according to the nature of the UML construct they refer to, i.e., a class or a relation. According to this label, specific procedures of parsing and extraction are performed on the sentence to generate a UML fragment. In the end, all UML fragments are composed back together into one UML class diagram.
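As a rough illustration of how these stages chain together, the flow can be sketched in Python as below. This is only a skeleton under stated assumptions: `Fragment`, `run_pipeline`, `compose_by_union`, and the toy stand-in stages inside `demo` are hypothetical names introduced here, and the union-based composition is a simplification rather than the composition algorithm of the approach.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """A small piece of a class diagram (hypothetical representation)."""
    classes: set = field(default_factory=set)    # class names
    relations: set = field(default_factory=set)  # (source, target) pairs


def run_pipeline(text, preprocess, split, classify, extract, compose):
    """Chain the runtime stages: preprocess, split, classify, extract, compose."""
    resolved = preprocess(text)
    sentences = split(resolved)
    fragments = [extract(s, classify(s)) for s in sentences]
    return compose(fragments)


def compose_by_union(fragments):
    """Toy composition: merge fragments by taking the union of their parts."""
    diagram = Fragment()
    for fragment in fragments:
        diagram.classes |= fragment.classes
        diagram.relations |= fragment.relations
    return diagram


def demo():
    """Run the skeleton with deliberately naive stand-ins for each stage."""
    text = "A course is taught by a teacher. A teacher has a name."

    def split(t):
        return [s.strip() for s in t.split(".") if s.strip()]

    def classify(sentence):
        # Stand-in for the trained classifier: a single keyword test.
        return "relationship" if " by " in sentence else "class"

    def extract(sentence, label):
        # Stand-in for the grammar patterns: pick words by position.
        words = sentence.split()
        if label == "relationship":
            src, dst = words[1].capitalize(), words[-1].capitalize()
            return Fragment(classes={src, dst}, relations={(src, dst)})
        return Fragment(classes={words[1].capitalize()})

    return run_pipeline(text, lambda t: t, split, classify, extract,
                        compose_by_union)


diagram = demo()
print(sorted(diagram.classes), sorted(diagram.relations))
```

Passing each stage as a callable keeps the skeleton independent of any particular NLP library, so a stage can be swapped without touching the others.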

2.2 Dataset Creation

We create a new dataset for both the operation and the evaluation of our approach. In particular, we use this dataset to learn a classifier for the Classification step in Figure 1.

To build the dataset, we start from an existing set of UML class diagrams from the AtlanMod Zoo. The AtlanMod Zoo has a repository of 305 high-quality UML class diagrams that model various domains. The size of the diagrams varies from a few to hundreds of classes. We fragment each diagram into simple classes (Figure 2) and relationships (Figure 3). Table 1 shows the size of the initial set of diagrams and the fragments, as well as the portion that we labeled.

Dataset         UML models    UML fragments
AtlanMod Zoo    305           8706
Labeled         62            649

Table 1: UML datasets and their sizes by version
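To make the proportions in Table 1 concrete, the labeled share can be computed directly from the four counts. The short calculation below uses only the numbers from the table; the variable names are illustrative.

```python
# Counts taken from Table 1.
zoo_models, zoo_fragments = 305, 8706
labeled_models, labeled_fragments = 62, 649

# Share of the AtlanMod Zoo that received labels.
model_coverage = labeled_models / zoo_models
fragment_coverage = labeled_fragments / zoo_fragments

# Average diagram size, in fragments, for each set.
zoo_avg = zoo_fragments / zoo_models
labeled_avg = labeled_fragments / labeled_models

print(f"labeled models:    {model_coverage:.1%}")
print(f"labeled fragments: {fragment_coverage:.1%}")
print(f"fragments per diagram: zoo {zoo_avg:.1f}, labeled {labeled_avg:.1f}")
```

Roughly a fifth of the diagrams but under a tenth of the fragments are labeled, so the labeled diagrams are smaller than the repository average.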

Since we are interested in the translation of specifications into diagrams, each UML class diagram needs to be paired with an English specification. To achieve that goal, we set up a website where we crowdsource the labeling of fragments. The website proposes the labeling of 305 diagrams containing 8706 fragments. We present the diagrams in ascending order of complexity. The website first shows a complete diagram, then iterates on its fragments for labeling while keeping the whole diagram in view. The volunteer participants write an English specification for each fragment. We give examples of labels to help the participants write at the right level of abstraction.

We send the labeling invitation to different MDE mailing lists and specific large research groups active in the MDE field. Volunteer participants are mostly university students and faculty members across the world. To ensure that the labeling is done in good faith, we do not offer monetary compensation for participation. However, since participation is low, we do not impose a contribution limit.

After about two months of crowdsourcing, we receive labels for 649 fragments across 62 UML class diagrams. The produced dataset is available on a public repository². To ensure quality, labels are reviewed and some are rejected. We replace the rejected labels by relabeling the corresponding fragments ourselves. Figure 4 shows example labels.

¹ https://web.imt-atlantique.fr/x-info/atlanmod/index.php?title=Zoos

2.3 Preprocessing and Fragmentation

Preprocessing is the first step after receiving an input specification from the user, as shown in Figure 1. We substitute pronouns throughout the text, such as "it" and "him", by their reference nouns. This is done using coreferee [ ], which is a tool written in Python that performs coreference resolution, including pronoun substitution.

A course is taught by a teacher. A classroom is assigned to it.

⟹

A course is taught by a teacher. A classroom is assigned to a course.

Pronoun substitution allows sentences in the English specification to be less dependent on each other for semantic purposes. The accuracy of coreferee for general English text is 81%.
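The substitution step itself can be illustrated without a resolver. The sketch below assumes the coreference resolution has already been done and is handed over as a mapping from a pronoun's token position to the words of its antecedent; `substitute_pronouns` and `detokenize` are hypothetical helpers written for this example, not part of coreferee.

```python
def substitute_pronouns(tokens, antecedents):
    """Replace each resolved pronoun by the words of its antecedent.

    tokens      -- list of word and punctuation strings
    antecedents -- dict mapping a pronoun's index in `tokens` to the list of
                   words it refers to (assumed output of a coreference resolver)
    """
    out = []
    for index, token in enumerate(tokens):
        # Tokens without an entry in the mapping are copied unchanged.
        out.extend(antecedents.get(index, [token]))
    return out


def detokenize(tokens):
    """Join tokens with spaces, except before punctuation marks."""
    text = ""
    for token in tokens:
        if text and token not in {".", ",", ";", ":", "!", "?"}:
            text += " "
        text += token
    return text


tokens = ["A", "course", "is", "taught", "by", "a", "teacher", ".",
          "A", "classroom", "is", "assigned", "to", "it", "."]
# Token 13 is the pronoun "it"; here we assume it resolves to "a course".
resolved = substitute_pronouns(tokens, {13: ["a", "course"]})
print(detokenize(resolved))
```

Because the mapping is keyed by position, two occurrences of the same pronoun can resolve to different nouns.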

Sentence fragmentation is the second step in the runtime operations in Figure 1. We split the preprocessed text into individual sentences, using spaCy [ ]. spaCy is an NLP library in Python that can be used for various NLP tasks, such as sentence splitting. spaCy splits text into sentences by looking at punctuation and special cases like abbreviations. Its decisions are powered by pre-trained statistical models. We use the small English model, which has a good speed and respectable performance. For instance, in the following example, the first two dots are not considered for splitting the sentences but the third dot is.

An employee has a level of studies, i.e., a degree. An employee is affiliated to a department.

⟹

𝑠1: An employee has a level of studies, i.e., a degree.

𝑠2: An employee is affiliated to a department.
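To show why punctuation alone is not enough for splitting, here is a small rule-based splitter. It is a simplified stand-in, not spaCy's statistical segmentation: the regular expression, the `ABBREVIATIONS` set, and the `split_sentences` function are all assumptions made for this sketch.

```python
import re

# Hypothetical, deliberately short list of abbreviations that end with a dot.
ABBREVIATIONS = {"i.e.", "e.g.", "etc.", "cf."}

# A candidate boundary is a terminator followed by whitespace and a capital
# letter, or by the end of the text.
BOUNDARY = re.compile(r"[.!?](?=\s+[A-Z]|\s*$)")


def split_sentences(text):
    """Split text at candidate boundaries, skipping known abbreviations."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the dot belongs to an abbreviation, not a sentence end
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())  # trailing text without a dot
    return sentences


spec = ("An employee has a level of studies, i.e., a degree. "
        "An employee is affiliated to a department.")
for number, sentence in enumerate(split_sentences(spec), start=1):
    print(f"s{number}: {sentence}")
```

The dots inside "i.e." are never candidates because no whitespace and capital letter follow them; the abbreviation list only matters for cases such as "e.g. The", where the lookahead alone would split wrongly.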

2.4 Sentence Classification

Sentence classication is the third step in the runtime operations of

Figure 1. Classication provides additional information on the Eng-

lish specication that can be used later to better generate the related

UML diagram fragment. Each sentence is classied as describing

either a "class" or a "relationship".

The training data for the classifier comes from the dataset described in Section 2.2. Each data point is structured as a pair <English specification, UML fragment> and is labeled "class" or "relationship" when the dataset is derived from the AtlanMod Zoo. The pairing means that the English specification belongs to that specific UML fragment. Our classifier is trained to predict the "class/relationship" label from an English specification. To evaluate the accuracy of the classifier, we use 80% of the data for training and the remaining 20% for testing.
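The data layout and the 80/20 evaluation protocol can be sketched as follows. The `LabeledPair` fields, the string serialization of fragments, the seeded shuffle, and the placeholder data are assumptions for illustration; the source does not specify how the split is drawn.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledPair:
    specification: str  # English text written for one fragment
    fragment: str       # serialized UML fragment (format is an assumption)
    label: str          # "class" or "relationship"


def train_test_split(pairs, test_ratio=0.2, seed=0):
    """Shuffle deterministically, then hold out `test_ratio` of the pairs."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, test_pairs):
    """Fraction of test pairs whose predicted label matches the true label."""
    correct = sum(predict(p.specification) == p.label for p in test_pairs)
    return correct / len(test_pairs)


# Placeholder data with the same size as the labeled dataset (649 pairs).
pairs = [
    LabeledPair(f"spec {i}", f"fragment {i}",
                "class" if i % 2 == 0 else "relationship")
    for i in range(649)
]
train, test = train_test_split(pairs)
print(len(train), len(test))
```

With 649 labeled pairs, this split holds out 130 pairs for testing and keeps 519 for training.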

² https://github.com/XsongyangX/uml-classes-and-specs
