
Improving Graph-Based Text Representations with Character and Word
Level N-grams
Wenzhe Li and Nikolaos Aletras
Computer Science Department, University of Sheffield, UK
{wli90, n.aletras}@sheffield.ac.uk
Abstract
Graph-based text representation focuses on
how text documents are represented as graphs
for exploiting dependency information be-
tween tokens and documents within a corpus.
Despite the increasing interest in graph representation learning, there has been limited research on new ways of representing text as graphs, which is important for downstream natural language processing tasks. In this paper, we first propose a new heterogeneous
word-character text graph that combines word
and character n-gram nodes together with doc-
ument nodes, allowing us to better learn de-
pendencies among these entities. Additionally,
we propose two new graph-based neural mod-
els, WCTextGCN and WCTextGAT, for mod-
eling our proposed text graph. Extensive experiments on text classification and automatic text summarization benchmarks demonstrate that our proposed models consistently outperform competitive baselines and state-of-the-art graph-based models.1
1 Code is available at: https://github.com/GraphForAI/TextGraph
1 Introduction
State-of-the-art graph neural network (GNN) architectures (Scarselli et al., 2008) such as graph convolutional networks (GCNs) (Kipf and Welling, 2016) and graph attention networks (GATs) (Veličković et al., 2017) have been successfully applied to various natural language processing (NLP) tasks such as text classification (Yao et al., 2019; Liang et al., 2022; Ragesh et al., 2021; Yao et al., 2021) and automatic summarization (Wang et al., 2020; An et al., 2021).
The success of GNNs in NLP tasks depends heavily on how effectively the text is represented as
a graph. A simple and widely adopted way to con-
struct a graph from text is to represent documents
and words as graph nodes and encode their depen-
dencies as edges (i.e., word-document graph). A given text is converted into a heterogeneous graph where nodes representing documents are connected to nodes representing words if the document contains that particular word (Minaee et al., 2021; Wang et al., 2020). Edges among words are typically weighted using word co-occurrence statistics that quantify the association between two words, as shown in Figure 1 (left).
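To make this construction concrete, below is a minimal sketch of assembling such a word-document graph. The function and variable names are ours, and the weighting scheme is an illustrative assumption: TextGCN (Yao et al., 2019), for instance, uses TF-IDF for word-document edges and PMI for word-word edges, whereas this sketch uses TF-IDF and raw sliding-window co-occurrence counts.

```python
from collections import Counter
from itertools import combinations
import math

def build_word_document_graph(docs, window=5):
    """Illustrative sketch: nodes are documents and words; word-document
    edges carry TF-IDF weights, word-word edges co-occurrence counts."""
    edges = {}  # (node_u, node_v) -> weight

    # Word-document edges weighted by TF-IDF.
    doc_freq = Counter(w for doc in docs for w in set(doc))
    for i, doc in enumerate(docs):
        term_freq = Counter(doc)
        for w, count in term_freq.items():
            idf = math.log(len(docs) / doc_freq[w])
            edges[(f"doc_{i}", f"word:{w}")] = (count / len(doc)) * idf

    # Word-word edges counted within a sliding co-occurrence window.
    for doc in docs:
        for start in range(max(1, len(doc) - window + 1)):
            for u, v in combinations(sorted(set(doc[start:start + window])), 2):
                key = (f"word:{u}", f"word:{v}")
                edges[key] = edges.get(key, 0) + 1
    return edges

# Toy usage with two tokenized documents.
docs = [["graphs", "represent", "text"], ["text", "contains", "words"]]
print(build_word_document_graph(docs))
```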
However, word-document graphs have several
drawbacks. Simply connecting individual word
nodes to document nodes ignores the ordering of words in the document, which is important for understanding the semantic meaning of the text. Moreover, such graphs cannot deal effectively with word sparsity: most words in a corpus appear only a few times, which results in inaccurate word node representations when using GNNs. This limitation is especially pronounced for languages with large vocabularies and many rare words, as noted by Bojanowski et al. (2017). Current word-document graphs also ignore explicit document relations, i.e., connections derived from pairwise document similarity, which may play an important role in learning better document representations (Li et al., 2020).
Contributions: In this paper, we propose a new simple yet effective way of constructing graphs from text for GNNs. First, we assume that word ordering plays an important role in semantic understanding, which can be captured by higher-order n-gram nodes. Second, we introduce character n-gram nodes as an effective way of mitigating sparsity (Bojanowski et al., 2017). Third, we take into account document similarity, allowing the model to learn better associations between documents. Figure 1 (right) shows our proposed Word-Character Heterogeneous text graph compared to a standard word-document graph (left). Finally, we propose two variants of GNNs, WCTextGCN and WCTextGAT, that extend GCN and GAT respectively, for modeling our proposed text graph.
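As an illustration of the proposed graph structure (a sketch under our own assumptions, not the authors' exact construction; the n-gram sizes, the Jaccard similarity measure, and the threshold are all placeholder choices), the node set extends beyond words and documents to word n-grams and character n-grams, with additional edges between sufficiently similar documents:

```python
def word_ngrams(tokens, n=2):
    """Word-level n-grams preserve local word-ordering information."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n=3):
    """Character n-grams of a padded word, in the style of fastText
    (Bojanowski et al., 2017), to mitigate rare-word sparsity."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def build_word_char_graph(docs, sim_threshold=0.3):
    """Sketch of a heterogeneous word-character text graph with
    document, word, word n-gram, and character n-gram nodes."""
    edges = []
    for i, doc in enumerate(docs):
        d = f"doc_{i}"
        for w in doc:
            edges.append((d, f"word:{w}"))
            for cg in char_ngrams(w):          # word -> character n-gram
                edges.append((f"word:{w}", f"char:{cg}"))
        for ng in word_ngrams(doc):            # document -> word n-gram
            edges.append((d, f"ngram:{ng}"))
    # Document-document edges from pairwise similarity.
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(docs[i], docs[j]) >= sim_threshold:
                edges.append((f"doc_{i}", f"doc_{j}"))
    return edges
```

Linking words to their character n-grams lets rare words share parameters with frequent subword units, while word n-gram and document similarity edges inject the ordering and inter-document signals that plain word-document graphs lack.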