
PcMSP: A Dataset for Scientific Action Graphs Extraction from
Polycrystalline Materials Synthesis Procedure Text
Xianjun Yang1, Ya Zhuo2, Julia Zuo2, Xinlu Zhang1, Stephen Wilson2, Linda Petzold1
1Department of Computer Science 2Department of Materials Science and Engineering
University of California, Santa Barbara
{xianjunyang, yzhuo, jlzuo, xinluzhang, stephendwilson, petzold}@ucsb.edu
Abstract
Scientific action graph extraction from materials synthesis procedures is important for reproducible research, machine automation, and material prediction, but the lack of annotated data has hindered progress in this field. We present an effort to annotate Polycrystalline Materials Synthesis Procedures (PcMSP) from 305 open access scientific articles for the construction of synthesis action graphs. This is a new dataset for materials science information extraction that simultaneously contains the synthesis sentences extracted from the experimental paragraphs, as well as the entity mentions and intra-sentence relations. A two-step human annotation and an inter-annotator agreement study ensure the high quality of the PcMSP corpus. We introduce four natural language processing tasks: sentence classification, named entity recognition, relation classification, and joint extraction of entities and relations. Comprehensive experiments validate the effectiveness of several state-of-the-art models on these tasks while leaving considerable room for improvement. We also perform an error analysis and point out some unique challenges that require further investigation. We will release our annotation scheme, the corpus, and code to the research community to alleviate the scarcity of labeled data in this domain.¹
1 Introduction
Synthesis procedural texts are written in instructional language (Grishman, 2001; Grishman and Kittredge, 2014) to describe step-by-step reactions, but they also carry features specific to their domains, such as domain notation, writing style, and journal requirements. The synthesis procedures in materials science articles contain valuable information for new materials prediction (Raccuglia et al., 2016), laboratory automation (Coley et al., 2019), and knowledge graph construction (Mrdjenovich et al., 2020). However, available datasets are extremely limited, despite the notable work by Mysore et al. (2017, 2019), Friedrich et al. (2020), and O’Gorman et al. (2021).
¹ https://github.com/Xianjun-Yang/PcMSP

Synthesis Paragraph
Polycrystalline[Descriptor] sample of composition Sr2CoO4[Material_target] was synthesized[operation] under high pressure[Property_pressure] at high temperature[Property_temperature]. Starting materials of SrO2[Material_recipe] and Co[Material_recipe] were well[Descriptor] mixed[operation] in a molar ratio[Descriptor] of SrO2[Material_recipe] : Co[Material_recipe] = 2 : 1[Value]. The mixture[Material-intermedium] was sealed[operation] into a[Value] gold[Descriptor] capsule[Device]. ...
The crystal structure of the polycrystalline sample was identified by the powder X-ray diffraction (XRD, Rigaku Smart-lab3), using Cu-Kα radiation (λ=1.54184Å). ...
Table 1: An example of a synthesis paragraph from our dataset with index srep27712 (Li et al., 2016).
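As a rough illustration of how the entity annotations in Table 1 could be turned into training labels for named entity recognition, the sketch below converts token-level spans into BIO tags. The span representation (start token, exclusive end token, label) and the helper name `spans_to_bio` are assumptions made for this example and are not necessarily the format of the released corpus; only the label names and the example sentence come from Table 1.

```python
from typing import List, Tuple

# (start_token, end_token_exclusive, label). Token-index spans are an assumed
# representation for this sketch, not necessarily the corpus file format.
Span = Tuple[int, int, str]


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Convert non-overlapping token spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first token of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the mention
    return tags


# Third sentence of the Table 1 example, tokenized by hand.
tokens = ["The", "mixture", "was", "sealed", "into", "a", "gold", "capsule", "."]
spans = [
    (1, 2, "Material-intermedium"),
    (3, 4, "operation"),
    (5, 6, "Value"),
    (6, 7, "Descriptor"),
    (7, 8, "Device"),
]

bio_tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

Multi-token mentions such as "high pressure" receive a `B-` tag on the first token and `I-` tags on the rest, which is why the spans carry an explicit end index.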
The goal of information extraction from procedures is to construct action graphs, in which all the steps of a synthesis make up a Directed Acyclic Graph (DAG) (Mysore et al., 2019; Kulkarni et al., 2018); Figure 1 shows one example. This goal can be further broken down into three tasks: sentence classification, named entity recognition (NER), and relation extraction (RE). Previous research either annotates the whole synthesis paragraph in the general inorganic domain, ignoring non-synthesis sentences and subdomain discrepancies (Mysore et al., 2017, 2019), or focuses only on entity mentions (Friedrich et al., 2020; O’Gorman et al., 2021).
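To make the DAG view of an action graph concrete, here is a minimal sketch that stores extracted relations as directed edges and recovers a valid step ordering with a topological sort. The edges are invented for illustration from the Table 1 example, including a cross-sentence link from "mixed" to "mixture" that goes beyond the intra-sentence relations annotated in the corpus; the function names are likewise hypothetical.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (source, target), read as "source feeds into target"

# Hypothetical edges for the Table 1 example; not taken from the corpus.
edges: List[Edge] = [
    ("SrO2", "mixed"),
    ("Co", "mixed"),
    ("mixed", "mixture"),
    ("mixture", "sealed"),
    ("capsule", "sealed"),
]


def build_predecessors(edge_list: List[Edge]) -> Dict[str, Set[str]]:
    """Map every node to the set of nodes that must come before it."""
    preds: Dict[str, Set[str]] = {}
    for src, dst in edge_list:
        preds.setdefault(src, set())
        preds.setdefault(dst, set()).add(src)
    return preds


def topological_order(edge_list: List[Edge]) -> List[str]:
    """Return nodes so that each edge's source precedes its target.

    Raises graphlib.CycleError if the edges do not form a DAG.
    """
    return list(TopologicalSorter(build_predecessors(edge_list)).static_order())


order = topological_order(edges)
print(order)
```

Several orderings can be valid for the same graph (for example, "SrO2" and "Co" may appear in either order), so the check that matters is that every source node precedes its target, and that a cyclic edge set is rejected with `CycleError`.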
To fill this gap, we focus on one important category of polycrystalline materials and simultane-
arXiv:2210.12401v1 [cs.CL] 22 Oct 2022