PcMSP: A Dataset for Scientific Action Graphs Extraction from
Polycrystalline Materials Synthesis Procedure Text
Xianjun Yang1, Ya Zhuo2, Julia Zuo2, Xinlu Zhang1, Stephen Wilson2, Linda Petzold1
1Department of Computer Science 2Department of Materials Science and Engineering
University of California, Santa Barbara
{xianjunyang, yzhuo, jlzuo, xinluzhang, stephendwilson, petzold}@ucsb.edu
Abstract
Scientific action graph extraction from materials synthesis procedures is important for reproducible research, machine automation, and material prediction. However, the lack of annotated data has hindered progress in this field. We demonstrate an effort to annotate Polycrystalline Materials Synthesis Procedures (PcMSP) from 305 open access scientific articles for the construction of synthesis action graphs. This is a new dataset for materials science information extraction that simultaneously contains the synthesis sentences extracted from the experimental paragraphs, as well as the entity mentions and intra-sentence relations. A two-step human annotation and an inter-annotator agreement study guarantee the high quality of the PcMSP corpus. We introduce four natural language processing tasks: sentence classification, named entity recognition, relation classification, and joint extraction of entities and relations. Comprehensive experiments validate the effectiveness of several state-of-the-art models for these challenges while leaving large room for improvement. We also perform error analysis and point out some unique challenges that require further investigation. We will release our annotation scheme, the corpus, and code to the research community to alleviate the scarcity of labeled data in this domain.¹
1 Introduction
Synthesis procedural texts are written in instructional languages (Grishman, 2001; Grishman and Kittredge, 2014) to represent step-by-step reactions, but they also contain features distinct to specific domains, such as domain notations, writing styles, and journal requirements. The synthesis procedures of materials science articles include valuable information for new materials prediction (Raccuglia et al., 2016), laboratory automation (Coley et al., 2019), and knowledge graph construction (Mrdjenovich et al., 2020). However, available datasets are extremely limited, despite the notable work by Mysore et al. (2017, 2019), Friedrich et al. (2020), and O’Gorman et al. (2021).
¹ https://github.com/Xianjun-Yang/PcMSP
Synthesis Paragraph
Polycrystalline[Descriptor] sample of composition Sr2CoO4[Material_target] was synthesized[operation] under high pressure[Property_pressure] at high temperature[Property_temperature]. Starting materials of SrO2[Material_recipe] and Co[Material_recipe] were well[Descriptor] mixed[operation] in a molar ratio[Descriptor] of SrO2[Material_recipe] : Co[Material_recipe] = 2 : 1[Value]. The mixture[Material-intermedium] was sealed[operation] into a[Value] gold[Descriptor] capsule[Device]. ...
The crystal structure of the polycrystalline sample was identified by the powder X-ray diffraction (XRD, Rigaku Smart-lab3), using Cu-Kα radiation (λ=1.54184Å). ...
Table 1: An example of a synthesis paragraph from our dataset with index srep27712 (Li et al., 2016).
The goal of information extraction from procedures is to construct action graphs, in which all the steps of a synthesis make up a Directed Acyclic Graph (DAG) (Mysore et al., 2019; Kulkarni et al., 2018); Figure 1 shows one example. This can be further broken down into three tasks: sentence classification, named entity recognition (NER), and relation extraction (RE). Previous research either annotates the whole synthesis paragraph in the general inorganic domain, ignoring the non-synthesis sentences and subdomain discrepancy (Mysore et al., 2017, 2019), or focuses only on entity mentions (Friedrich et al., 2020; O’Gorman et al., 2021).
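To make the action-graph idea concrete, the following minimal Python sketch stores a few entity mentions from Table 1 as nodes and links them with directed, labeled edges, then checks that the result is acyclic. The entity texts and labels come from Table 1; the relation names (`acts_on`, `uses_device`, `described_by`), the entity ids, and the helper functions are hypothetical and are not taken from the PcMSP annotation scheme.

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+

# A subset of the entity mentions in the third sentence of Table 1.
# Ids (e1, e2, ...) are illustrative only.
entities = {
    "e1": ("mixture", "Material-intermedium"),
    "e2": ("sealed", "operation"),
    "e3": ("gold", "Descriptor"),
    "e4": ("capsule", "Device"),
}

# Directed edges as (head, tail, relation).
# ASSUMPTION: these relation names are invented for illustration.
relations = [
    ("e2", "e1", "acts_on"),
    ("e2", "e4", "uses_device"),
    ("e4", "e3", "described_by"),
]


def build_action_graph(entities, relations):
    """Return an adjacency list mapping head id -> [(tail id, relation)]."""
    graph = {entity_id: [] for entity_id in entities}
    for head, tail, relation in relations:
        if head not in entities or tail not in entities:
            raise KeyError(f"unknown entity in relation: {head} -> {tail}")
        graph[head].append((tail, relation))
    return graph


def is_dag(graph):
    """Return True if the directed graph contains no cycle."""
    sorter = TopologicalSorter()
    for head, edges in graph.items():
        sorter.add(head, *(tail for tail, _ in edges))
    try:
        sorter.prepare()  # raises CycleError when a cycle exists
    except CycleError:
        return False
    return True


graph = build_action_graph(entities, relations)
print(is_dag(graph))
```

Keeping relations as plain triples mirrors the way entity and relation annotations are usually exported, and the acyclicity check is a cheap sanity test on any extracted graph.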
Figure 1: A synthesis action graph constructed from Table 1.
arXiv:2210.12401v1 [cs.CL] 22 Oct 2022
To fill this gap, we focus on one important category of polycrystalline materials and simultaneously include all three tasks. The annotation guidelines are designed by materials experts after comprehensive discussion, and the new dataset is subsequently labeled with a two-round annotation.
The key contributions of this paper include:
• We contribute a new large-scale dataset, as well as a high-quality annotation scheme, for information extraction in materials science.
• We conduct comprehensive experiments on four tasks (sentence classification, named entity recognition, relation extraction, and joint extraction) to provide baselines.
• We perform error analysis and point out unique challenges and potential uses of this dataset for future research.
2 Related Work
Scientific information extraction
With the fast-growing volume of scholarly publications, there is high demand for extracting structured information from large-scale scientific literature in many domains (Augenstein et al., 2017; Luan et al., 2018; Jiang et al., 2019; Beltagy et al., 2019; Buscaldi et al., 2019), such as the biomedical domain (Shah et al., 2003; Lai et al., 2021; Zhang et al., 2021; Lewis et al., 2020; Kulkarni et al., 2018) and the chemistry domain (Rocktäschel et al., 2012; He et al., 2020). In the field of materials science, there have been few attempts in this direction, leaving many unexplored challenges for research (Hong et al., 2021). Recent research mainly focuses on knowledge base construction (Jiang et al., 2019; Luan et al., 2018), new materials discovery (Isayev, 2019), and automation of lab procedures (Vaucher et al., 2020; Tamari et al., 2021; Steiner et al., 2019). Beltagy et al. (2019) trained a Bidirectional Encoder Representations from Transformers model (SciBERT) on 1.14M scientific papers from Semantic Scholar for scientific information extraction.
Materials procedures information extraction
In the area of annotation of materials synthesis procedures, Mysore et al. (2019) annotate 230 general materials synthesis paragraphs for NER and RE tasks. Similar work is also undertaken by Friedrich et al. (2020), in which 45 open access scholarly articles are labeled for experiment-describing sentence classification, NER, and slot filling tasks. However, in contrast to our work, their annotation scheme focuses on the full text rather than the experimental section. Kuniyoshi et al. (2020) annotate the synthesis process of all-solid-state batteries from the scientific literature, but their corpus is not publicly available. Walker et al. (2021) release MatBERT, trained on 50 million materials science paragraphs, to explore the impact of domain-specific pre-training on the NER task. Also of interest, O’Gorman et al. (2021) recently created the largest corpus for entity mention extraction in both the general domain and subdomains from materials synthesis text, but the relations between entities are still missing.
Named entity recognition and relation extraction
Many neural network-based models have been proposed for named entity recognition, for example, (Huang et al., 2015; Lample et al., 2016; Panchendrarajan and Amaresan, 2018). The core idea uses one encoding layer (e.g., Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) or BERT) for representation and one additional conditional random field (CRF) (Lafferty et al., 2001) layer for sequence labeling. Relations are then predicted based on either gold entities or predicted entities, and PURE (Zhong and Chen, 2021) designs two separate encoders for joint extraction of entities and relations. We adopt their model for our tasks due to its superior performance.
Figure 2: An annotated PcMSP example on the INCEpTION platform, taken from srep15507 (Man et al., 2015).
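The sequence-labeling view of NER described above assigns one tag to every token. As a rough illustration, the sketch below converts the entity spans of the first sentence of Table 1 into per-token tags. It assumes the common BIO tagging scheme and simple pre-split tokens; the section does not state which tagging scheme or tokenizer the PcMSP baselines use, and `to_bio` is a hypothetical helper, not part of any released code.

```python
def to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    spans: list of (start, end_exclusive, label) over token indices.
    ASSUMPTION: spans do not overlap (flat, non-nested entities).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # continuation tokens
    return tags


# First sentence of Table 1, pre-split into tokens for illustration.
tokens = ["Polycrystalline", "sample", "of", "composition", "Sr2CoO4",
          "was", "synthesized", "under", "high", "pressure",
          "at", "high", "temperature", "."]

# Entity labels exactly as they appear in Table 1.
spans = [
    (0, 1, "Descriptor"),
    (4, 5, "Material_target"),
    (6, 7, "operation"),
    (8, 10, "Property_pressure"),
    (11, 13, "Property_temperature"),
]

tags = to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A tagger built from an encoder and a CRF layer would be trained to predict such a tag sequence, and entity spans are recovered by grouping each `B-` tag with the `I-` tags that follow it.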
3 The Selection of Our Dataset
Here we discuss the importance of our selection and how it differs from other materials procedural text corpora.
Why do we choose inorganic polycrystalline materials? There are a number of sub-categories within solid-state inorganic materials. For example, materials can be divided based on function and properties, such as battery or thermoelectric materials. Synthesis within both categories largely falls within the broader category of solid-state synthesis and even then, there is a high degree of overlap with other function categories, such as quantum and magnetic materials. More importantly, those materials are usually in polycrystalline form. Other subcategories relate to form factors; for instance, single-crystalline synthesis often starts with a polycrystalline synthesis and therefore has a high degree of overlap with solid-state synthesis.
Inorganic polycrystalline compounds span combinations of the entire periodic table and different chemical bonding schemes, such that their synthesis typically takes place under extreme conditions, such as high temperature and pressure. Reaction pathways are therefore difficult to characterize without specialized equipment and are not well established for any given material. In particular, solid-state reactions, which are the main techniques to synthesize inorganic polycrystalline materials, resemble a “black box”, where materials scientists can only make educated guesses about the procedure or stability of a new reaction. This presents a prime opportunity (Mysore et al., 2017, 2019) for compiling published inorganic synthesis data in order to demystify the black box of solid-state inorganic materials synthesis and create datasets for future text mining endeavors. While there have been efforts within general solid-state materials (Mysore et al., 2017, 2019; O’Gorman et al., 2021) and the battery materials subcategory (Friedrich et al., 2020), this work aims to extend the subcategory of inorganic solid-state synthesis methods in order to address the frequent overlap and “borrowing” of materials between subdisciplines of materials science.
Why do we discard characterization sentences? Inorganic reactions typically involve relatively few reactions from a set of precursors, and there are very few purification pathways for solid materials compared to organic materials or liquids. Therefore, characterizations of solid-state inorganic reactions are seldom reported in the literature unless they proceed to complete purity within standard measurement fidelity. This is in contrast to organic materials, where there are a number of important characterization metrics in a compound, such as molecular weight in polymers or reaction yield. Therefore, these standard characterization measurements do not add valuable information for a researcher attempting to recreate the reported synthesis method, and we decide to discard these characterization sentences.
Why do we annotate sentence, entity, and relation simultaneously? A full action graph consists of both entities and relations extracted from experiment-describing sentences. However, most previous research either ignores the annotation of sentence or relation information, making them in-