Improving Graph-Based Text Representations with Character and Word Level N-grams
Wenzhe Li and Nikolaos Aletras
Computer Science Department, University of Sheffield, UK
{wli90, n.aletras}@sheffield.ac.uk
Abstract

Graph-based text representation focuses on how text documents are represented as graphs for exploiting dependency information between tokens and documents within a corpus. Despite the increasing interest in graph representation learning, there is limited research in exploring new ways for graph-based text representation, which is important in downstream natural language processing tasks. In this paper, we first propose a new heterogeneous word-character text graph that combines word and character n-gram nodes together with document nodes, allowing us to better learn dependencies among these entities. Additionally, we propose two new graph-based neural models, WCTextGCN and WCTextGAT, for modeling our proposed text graph. Extensive experiments in text classification and automatic text summarization benchmarks demonstrate that our proposed models consistently outperform competitive baselines and state-of-the-art graph-based models.¹

¹ Code is available here: https://github.com/GraphForAI/TextGraph
1 Introduction

State-of-the-art graph neural network (GNN) architectures (Scarselli et al., 2008) such as graph convolutional networks (GCNs) (Kipf and Welling, 2016) and graph attention networks (GATs) (Veličković et al., 2017) have been successfully applied to various natural language processing (NLP) tasks such as text classification (Yao et al., 2019; Liang et al., 2022; Ragesh et al., 2021; Yao et al., 2021) and automatic summarization (Wang et al., 2020; An et al., 2021).
The success of GNNs in NLP tasks highly depends on how effectively the text is represented as a graph. A simple and widely adopted way to construct a graph from text is to represent documents and words as graph nodes and encode their dependencies as edges (i.e., a word-document graph). A given text is converted into a heterogeneous graph where nodes representing documents are connected to nodes representing words if the document contains that particular word (Minaee et al., 2021; Wang et al., 2020). Edges among words are typically weighted using word co-occurrence statistics that quantify the association between two words, as shown in Figure 1 (left).
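For concreteness, a common co-occurrence weighting, used by TextGCN-style models (Yao et al., 2019), is pointwise mutual information (PMI) computed over fixed-size sliding windows. The following is a minimal sketch of such a scheme; the window size, whitespace tokenization, and the positive-PMI filter are standard choices assumed here, not details specified in this paper:

```python
import math
from collections import Counter
from itertools import combinations

def pmi_word_edges(docs, window=10):
    """Weight word-word edges by PMI over fixed-size sliding windows."""
    word_count, pair_count, n_windows = Counter(), Counter(), 0
    for doc in docs:
        tokens = doc.split()
        for i in range(max(1, len(tokens) - window + 1)):
            win = set(tokens[i:i + window])  # unique words in this window
            n_windows += 1
            word_count.update(win)
            pair_count.update(combinations(sorted(win), 2))
    edges = {}
    for (w1, w2), c in pair_count.items():
        # PMI = log( p(w1, w2) / (p(w1) p(w2)) ), estimated from window counts
        pmi = math.log(c * n_windows / (word_count[w1] * word_count[w2]))
        if pmi > 0:  # keep only positively associated word pairs
            edges[(w1, w2)] = pmi
    return edges

# e.g. pmi_word_edges(["the cat sat on the mat", "the dog sat"], window=5)
```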
However, word-document graphs have several drawbacks. Simply connecting individual word nodes to document nodes ignores the ordering of words in the document, which is important in understanding the semantic meaning of text. Moreover, such graphs cannot deal effectively with word sparsity. Most of the words in a corpus appear only a few times, which results in inaccurate representations of word nodes when using GNNs. This limitation is especially pronounced for languages with large vocabularies and many rare words, as noted by Bojanowski et al. (2017). Current word-document graphs also ignore explicit document relations, i.e., connections created from pairwise document similarity, that may play an important role in learning better document representations (Li et al., 2020).
Contributions: In this paper, we propose a new simple yet effective way of constructing graphs from text for GNNs. First, we assume that word ordering plays an important role in semantic understanding, which can be captured by higher-order n-gram nodes. Second, we introduce character n-gram nodes as an effective way of mitigating sparsity (Bojanowski et al., 2017). Third, we take document similarity into account, allowing the model to learn better associations between documents. Figure 1 (right) shows our proposed Word-Character Heterogeneous text graph compared to a standard word-document graph (left). Finally, we propose two variants of GNNs, WCTextGCN and WCTextGAT, that extend GCN and GAT respectively, for modeling our proposed text graph.
Figure 1: A simple word-document graph (left); and our proposed Word-Character Heterogeneous graph (right). For the right figure, the edge types are defined as follows: (1) word-document edges if a document contains a word (weighted by tf-idf); (2) word-word edges based on co-occurrence statistics (PMI); (3) document-document edges weighted by a similarity score (cosine similarity); (4) word n-gram-word edges if a word is part of an n-gram (0/1); (5) word n-gram-document edges if a document contains an n-gram (0/1); and (6) character n-gram-word edges if a character n-gram is part of a word (0/1).
2 Methodology

Given a corpus as a list of text documents $C = \{D_1, \dots, D_n\}$, our goal is to learn an embedding $h_i$ for each document $D_i$ using GNNs. This representation can subsequently be used in different downstream tasks such as text classification and summarization.
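As a minimal illustration of this downstream use, a linear softmax classifier can be applied directly on top of the learned document embeddings; the function below is a hypothetical sketch, not part of the paper's method (all names and shapes are illustrative):

```python
import numpy as np

def classify_documents(H_d, W_out, b_out):
    """Softmax classification over learned document embeddings H_d (n_d x k)."""
    logits = H_d @ W_out + b_out                              # (n_d, num_classes)
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
    return exp / exp.sum(axis=1, keepdims=True)               # class probabilities per document
```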
2.1 Word-Character Heterogeneous Graph

The Word-Character Heterogeneous graph $G = (V, E)$ consists of the node set $V = V_d \cup V_w \cup V_g \cup V_c$, where $V_d = \{d_1, \dots, d_n\}$ corresponds to a set of documents, $V_w = \{w_1, \dots, w_m\}$ denotes a set of unique words, $V_g = \{g_1, \dots, g_l\}$ denotes a set of unique word n-gram tokens, and finally $V_c = \{c_1, \dots, c_p\}$ denotes a set of unique character n-grams. The edge types among different nodes vary depending on the types of the connected nodes. In addition, we also add edges between two documents if their cosine similarity is larger than a pre-defined threshold.
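To make the construction concrete, the sketch below enumerates the node vocabularies beyond documents and the thresholded document-document edges; the n-gram orders, the choice of document vectors, and the threshold value are illustrative assumptions, not values specified in the paper:

```python
import numpy as np

def word_ngrams(tokens, n=2):
    """Unique contiguous word n-grams of a token sequence (nodes in V_g)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def char_ngrams(word, n=3):
    """Unique character n-grams of a word (nodes in V_c), e.g. 'where' -> {'whe', 'her', 'ere'}."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def document_edges(doc_vectors, threshold=0.5):
    """Connect document pairs whose cosine similarity exceeds a pre-defined threshold."""
    normed = doc_vectors / (np.linalg.norm(doc_vectors, axis=1, keepdims=True) + 1e-12)
    sim = normed @ normed.T
    n = len(sim)
    return [(i, j, sim[i, j]) for i in range(n) for j in range(i + 1, n) if sim[i, j] > threshold]
```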
2.2 Word and Character N-grams Enhanced Text GNNs

The goal of GNN models is to learn a representation for each node. We use $H_d \in \mathbb{R}^{n_d \times k}$, $H_w \in \mathbb{R}^{n_w \times k}$, $H_g \in \mathbb{R}^{n_g \times k}$, and $H_c \in \mathbb{R}^{n_c \times k}$ to denote the representations of document nodes, word nodes, word n-gram nodes and character n-gram nodes, where $k$ is the hidden dimension size and $n_d$, $n_w$, $n_g$, $n_c$ represent the number of documents, words, word n-grams and character n-grams in the graph, respectively. We use $e^{dw}_{ij}$ to denote the edge weight between the $i$-th document and the $j$-th word. Similarly, $e^{cw}_{kj}$ denotes the edge weight between the $k$-th character n-gram and the $j$-th word.
The original GCN and GAT models only consider simple graphs that contain a single type of nodes and edges. Since we are now dealing with our Word-Character Heterogeneous graph, we introduce appropriate modifications.
Word and Character N-grams Enhanced Text GCN (WCTextGCN) In order to support our new graph type for GCNs, we need to modify the adjacency matrix $A$. The update equation of the original GCN is:

$$H^{(L+1)} = f(\hat{A} H^{(L)} W^{(L)})$$

where $W^{(L)}$ is the free parameter to be learned for layer $L$. We assume $H$ is simply the concatenation of $H_d$, $H_w$, $H_g$, $H_c$. For WCTextGCN, the adjacency matrix $A$ is re-defined as:

$$A = \begin{pmatrix}
A^{dd}_{\mathrm{sim}} & A^{dw}_{\mathrm{tfidf}} & A^{dg}_{\mathrm{tfidf}} & - \\
A^{wd}_{\mathrm{tfidf}} & A^{ww}_{\mathrm{pmi}} & A^{wg}_{0/1} & A^{wc}_{0/1} \\
A^{gd}_{\mathrm{tfidf}} & A^{gw}_{0/1} & - & - \\
- & A^{cw}_{0/1} & - & -
\end{pmatrix}$$

where dashes denote absent (zero) blocks.
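The following numpy sketch assembles this block adjacency matrix and applies one propagation step; the symmetric degree normalization and ReLU nonlinearity are standard GCN choices assumed here, and the dash blocks are filled with zeros:

```python
import numpy as np

def block_adjacency(A_dd, A_dw, A_dg, A_ww, A_wg, A_wc):
    """Assemble A for the word-character heterogeneous graph; dash blocks become zeros."""
    n_d, n_w = A_dw.shape
    n_g, n_c = A_dg.shape[1], A_wc.shape[1]
    Z = np.zeros
    return np.block([
        [A_dd,          A_dw,    A_dg,          Z((n_d, n_c))],
        [A_dw.T,        A_ww,    A_wg,          A_wc         ],  # A_wd = A_dw^T, etc.
        [A_dg.T,        A_wg.T,  Z((n_g, n_g)), Z((n_g, n_c))],
        [Z((n_c, n_d)), A_wc.T,  Z((n_c, n_g)), Z((n_c, n_c))],
    ])

def gcn_layer(A, H, W):
    """One GCN update H^(L+1) = f(Â H^(L) W^(L)) with self-loops and symmetric normalization."""
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))   # D^{-1/2}
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)          # f = ReLU
```

Stacking the node features as H = np.vstack([H_d, H_w, H_g, H_c]) then makes gcn_layer(block_adjacency(...), H, W) one WCTextGCN layer under these assumptions.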