Improving Graph-Based Text Representations with Character and Word Level N-grams
Wenzhe Li and Nikolaos Aletras
Computer Science Department, University of Sheffield, UK
{wli90, n.aletras}@sheffield.ac.uk
Abstract

Graph-based text representation focuses on how text documents are represented as graphs for exploiting dependency information between tokens and documents within a corpus. Despite the increasing interest in graph representation learning, there is limited research in exploring new ways for graph-based text representation, which is important in downstream natural language processing tasks. In this paper, we first propose a new heterogeneous word-character text graph that combines word and character n-gram nodes together with document nodes, allowing us to better learn dependencies among these entities. Additionally, we propose two new graph-based neural models, WCTextGCN and WCTextGAT, for modeling our proposed text graph. Extensive experiments in text classification and automatic text summarization benchmarks demonstrate that our proposed models consistently outperform competitive baselines and state-of-the-art graph-based models.¹

¹ Code is available here: https://github.com/GraphForAI/TextGraph
1 Introduction

State-of-the-art graph neural network (GNN) architectures (Scarselli et al., 2008) such as graph convolutional networks (GCNs) (Kipf and Welling, 2016) and graph attention networks (GATs) (Veličković et al., 2017) have been successfully applied to various natural language processing (NLP) tasks such as text classification (Yao et al., 2019; Liang et al., 2022; Ragesh et al., 2021; Yao et al., 2021) and automatic summarization (Wang et al., 2020; An et al., 2021).
The success of GNNs in NLP tasks highly depends on how effectively the text is represented as a graph. A simple and widely adopted way to construct a graph from text is to represent documents and words as graph nodes and encode their dependencies as edges (i.e., a word-document graph). A given text is converted into a heterogeneous graph where nodes representing documents are connected to nodes representing words if the document contains that particular word (Minaee et al., 2021; Wang et al., 2020). Edges among words are typically weighted using word co-occurrence statistics that quantify the association between two words, as shown in Figure 1 (left).
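For concreteness, a common co-occurrence weighting, used by TextGCN-style models (Yao et al., 2019), is pointwise mutual information (PMI) computed over fixed-size sliding windows. The following is a minimal sketch of such a scheme; the window size, whitespace tokenization, and the positive-PMI filter are standard choices assumed here, not details specified in this paper:

```python
import math
from collections import Counter
from itertools import combinations

def pmi_word_edges(docs, window=10):
    """Weight word-word edges by PMI over fixed-size sliding windows."""
    word_count, pair_count, n_windows = Counter(), Counter(), 0
    for doc in docs:
        tokens = doc.split()
        for i in range(max(1, len(tokens) - window + 1)):
            win = set(tokens[i:i + window])  # unique words in this window
            n_windows += 1
            word_count.update(win)
            pair_count.update(combinations(sorted(win), 2))
    edges = {}
    for (w1, w2), c in pair_count.items():
        # PMI = log( p(w1, w2) / (p(w1) p(w2)) ), estimated from window counts
        pmi = math.log(c * n_windows / (word_count[w1] * word_count[w2]))
        if pmi > 0:  # keep only positively associated word pairs
            edges[(w1, w2)] = pmi
    return edges

# e.g. pmi_word_edges(["the cat sat on the mat", "the dog sat"], window=5)
```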
However, word-document graphs have several drawbacks. Simply connecting individual word nodes to document nodes ignores the ordering of words in the document, which is important in understanding the semantic meaning of text. Moreover, such graphs cannot deal effectively with word sparsity. Most of the words in a corpus appear only a few times, which results in inaccurate representations of word nodes when using GNNs. This limitation is especially pronounced for languages with large vocabularies and many rare words, as noted by Bojanowski et al. (2017). Current word-document graphs also ignore explicit document relations, i.e., connections created from pairwise document similarity, that may play an important role in learning better document representations (Li et al., 2020).
Contributions: In this paper, we propose a new simple yet effective way of constructing graphs from text for GNNs. First, we assume that word ordering plays an important role in semantic understanding, which can be captured by higher-order n-gram nodes. Second, we introduce character n-gram nodes as an effective way of mitigating sparsity (Bojanowski et al., 2017). Third, we take document similarity into account, allowing the model to learn better associations between documents. Figure 1 (right) shows our proposed Word-Character Heterogeneous text graph compared to a standard word-document graph (left). Finally, we propose two variants of GNNs, WCTextGCN and WCTextGAT, that extend GCN and GAT respectively, for modeling our proposed text graph.
Figure 1: A simple word-document graph (left); and our proposed Word-Character Heterogeneous graph (right). For the right figure, the edge types are defined as follows: (1) word-document edges if a document contains a word (weighted by tf-idf); (2) word-word edges based on co-occurrence statistics (PMI); (3) document-document edges weighted by a similarity score (cosine similarity); (4) word n-gram-word edges if a word is part of an n-gram (0/1); (5) word n-gram-document edges if a document contains an n-gram (0/1); and (6) character n-gram-word edges if a character n-gram is part of a word (0/1).
2 Methodology

Given a corpus as a list of text documents $C = \{D_1, \dots, D_n\}$, our goal is to learn an embedding $h_i$ for each document $D_i$ using GNNs. This representation can subsequently be used in different downstream tasks such as text classification and summarization.
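As a minimal illustration of this downstream use, a linear softmax classifier can be applied directly on top of the learned document embeddings; the function below is a hypothetical sketch, not part of the paper's method (all names and shapes are illustrative):

```python
import numpy as np

def classify_documents(H_d, W_out, b_out):
    """Softmax classification over learned document embeddings H_d (n_d x k)."""
    logits = H_d @ W_out + b_out                              # (n_d, num_classes)
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
    return exp / exp.sum(axis=1, keepdims=True)               # class probabilities per document
```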
2.1 Word-Character Heterogeneous Graph

The Word-Character Heterogeneous graph $G = (V, E)$ consists of the node set $V = V_d \cup V_w \cup V_g \cup V_c$, where $V_d = \{d_1, \dots, d_n\}$ corresponds to a set of documents, $V_w = \{w_1, \dots, w_m\}$ denotes a set of unique words, $V_g = \{g_1, \dots, g_l\}$ denotes a set of unique word n-gram tokens, and finally $V_c = \{c_1, \dots, c_p\}$ denotes a set of unique character n-grams. The edge types among different nodes vary depending on the types of the connected nodes. In addition, we also add edges between two documents if their cosine similarity is larger than a pre-defined threshold.
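To make the construction concrete, the sketch below enumerates the node vocabularies beyond documents and the thresholded document-document edges; the n-gram orders, the choice of document vectors, and the threshold value are illustrative assumptions, not values specified in the paper:

```python
import numpy as np

def word_ngrams(tokens, n=2):
    """Unique contiguous word n-grams of a token sequence (nodes in V_g)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def char_ngrams(word, n=3):
    """Unique character n-grams of a word (nodes in V_c), e.g. 'where' -> {'whe', 'her', 'ere'}."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def document_edges(doc_vectors, threshold=0.5):
    """Connect document pairs whose cosine similarity exceeds a pre-defined threshold."""
    normed = doc_vectors / (np.linalg.norm(doc_vectors, axis=1, keepdims=True) + 1e-12)
    sim = normed @ normed.T
    n = len(sim)
    return [(i, j, sim[i, j]) for i in range(n) for j in range(i + 1, n) if sim[i, j] > threshold]
```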
2.2 Word and Character N-grams Enhanced Text GNNs

The goal of GNN models is to learn a representation for each node. We use $H_d \in \mathbb{R}^{n_d \times k}$, $H_w \in \mathbb{R}^{n_w \times k}$, $H_g \in \mathbb{R}^{n_g \times k}$, and $H_c \in \mathbb{R}^{n_c \times k}$ to denote the representations of document nodes, word nodes, word n-gram nodes and character n-gram nodes, where $k$ is the hidden dimension size and $n_d$, $n_w$, $n_g$, $n_c$ represent the number of documents, words, word n-grams and character n-grams in the graph, respectively. We use $e^{dw}_{ij}$ to denote the edge weight between the $i$-th document and the $j$-th word. Similarly, $e^{cw}_{kj}$ denotes the edge weight between the $k$-th character n-gram and the $j$-th word.
The original GCN and GAT models only consider simple graphs that contain a single type of nodes and edges. Since we are now dealing with our Word-Character Heterogeneous graph, we introduce appropriate modifications.
Word and Character N-grams Enhanced Text GCN (WCTextGCN) In order to support our new graph type for GCNs, we need to modify the adjacency matrix $A$. The update equation of the original GCN is:

$$H^{(L+1)} = f(\hat{A} H^{(L)} W^{(L)})$$

where $W^{(L)}$ is the free parameter to be learned for layer $L$. We assume $H$ is simply the concatenation of $H_d$, $H_w$, $H_g$, $H_c$. For WCTextGCN, the adjacency matrix $A$ is re-defined as:

$$A = \begin{pmatrix}
A^{dd}_{\mathrm{sim}} & A^{dw}_{\mathrm{tfidf}} & A^{dg}_{\mathrm{tfidf}} & - \\
A^{wd}_{\mathrm{tfidf}} & A^{ww}_{\mathrm{pmi}} & A^{wg}_{0/1} & A^{wc}_{0/1} \\
A^{gd}_{\mathrm{tfidf}} & A^{gw}_{0/1} & - & - \\
- & A^{cw}_{0/1} & - & -
\end{pmatrix}$$

where dashes denote absent (zero) blocks.
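The following numpy sketch assembles this block adjacency matrix and applies one propagation step; the symmetric degree normalization and ReLU nonlinearity are standard GCN choices assumed here, and the dash blocks are filled with zeros:

```python
import numpy as np

def block_adjacency(A_dd, A_dw, A_dg, A_ww, A_wg, A_wc):
    """Assemble A for the word-character heterogeneous graph; dash blocks become zeros."""
    n_d, n_w = A_dw.shape
    n_g, n_c = A_dg.shape[1], A_wc.shape[1]
    Z = np.zeros
    return np.block([
        [A_dd,          A_dw,    A_dg,          Z((n_d, n_c))],
        [A_dw.T,        A_ww,    A_wg,          A_wc         ],  # A_wd = A_dw^T, etc.
        [A_dg.T,        A_wg.T,  Z((n_g, n_g)), Z((n_g, n_c))],
        [Z((n_c, n_d)), A_wc.T,  Z((n_c, n_g)), Z((n_c, n_c))],
    ])

def gcn_layer(A, H, W):
    """One GCN update H^(L+1) = f(Â H^(L) W^(L)) with self-loops and symmetric normalization."""
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))   # D^{-1/2}
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)          # f = ReLU
```

Stacking the node features as H = np.vstack([H_d, H_w, H_g, H_c]) then makes gcn_layer(block_adjacency(...), H, W) one WCTextGCN layer under these assumptions.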