Diving into Unified Data-Model Sparsity for
Class-Imbalanced Graph Representation Learning
Chunhui Zhang1, Chao Huang2, Yijun Tian3, Qianlong Wen3,
Zhongyu Ouyang3, Youhuan Li4, Yanfang Ye3, Chuxu Zhang1
1Brandeis University, USA, 2University of Hong Kong, China
3University of Notre Dame, USA, 4Hunan University, China
{chunhuizhang,chuxuzhang}@brandeis.edu, chaohuang75@gmail.com,
{ytian5,qwen,zouyang2,yye7}@nd.edu, liyouhuan@hnu.edu.cn
Abstract
Recent research shows that, even when models are pruned by state-of-the-art network compression methods, deep learning training still demands massive amounts of data. In particular, training Graph Neural Networks (GNNs) on non-Euclidean graph data often incurs higher time costs than training on data in regular Euclidean space (e.g., images or text), due to the irregular structure and uneven density of graphs. Another natural property of graph data is class imbalance, which is not alleviated by sheer data volume and hinders GNNs' generalization. To tackle both of these undesirable properties, (i) theoretically, we introduce a hypothesis about the extent to which a subset of the training data can approximate the learning effectiveness of the full dataset; this effectiveness is further guaranteed and proved by the distance between the gradients of the subset and those of the full set; (ii) empirically, we discover that during the learning process of a GNN, some samples in the training dataset are more informative than others for providing gradients to update model parameters. Moreover, this informative subset is not fixed during training: samples that are informative in the current training epoch may not be so in the next one. We refer to this observation as dynamic data sparsity. We also notice that sparse subnets pruned from a well-trained GNN sometimes forget the information provided by the informative subset, reflected in their poor performance on that subset. Based on these findings, we develop a unified data-model dynamic sparsity framework named Graph Decantation (GraphDec) to address the challenges of training on massive, class-imbalanced graph data. The key idea of GraphDec is to identify the informative subset dynamically during training by adopting sparse graph contrastive learning. Extensive experiments on multiple benchmark datasets demonstrate that GraphDec outperforms state-of-the-art baselines for class-imbalanced graph classification and class-imbalanced node classification tasks, in terms of both classification accuracy and data usage efficiency.
1 Introduction
Graph representation learning (GRL) [23] has shown remarkable power in dealing with non-Euclidean structured data (e.g., social networks, biochemical molecules, knowledge graphs). Graph neural networks (GNNs) [23, 11, 38, 43], as the current state of the art of GRL, have become essential in various graph mining applications. To learn a representation of each node that reflects its local structure pattern, GNNs gather features of neighboring nodes and apply message passing along edges. This topology-aware mechanism enables GNNs to achieve superior performance across different tasks. However, in many real-world scenarios, graph data exhibits two properties: massiveness [36, 18] and class imbalance [30]. Firstly, message passing over nodes with high degrees brings about
heavy computation burdens. Some of these calculations are even redundant, in that not all neighbors are informative for learning task-related embeddings. Unlike regular data such as images or text, the connectivity of irregular graph data causes random memory access, which further slows down data readout. Secondly, class imbalance naturally exists in datasets from diverse practical domains, such as bioinformatics and social networks. GNNs are sensitive to this imbalance and can be biased toward the dominant classes. This bias can mislead the learning process, causing the model to underfit the samples that truly matter for the downstream tasks and, as a result, to perform poorly on the test data.
Figure 1: The principle of graph decantation. It decants
data samples based on rankings of their gradient scores,
and then uses them as the training set in the next epoch.
Accordingly, recent studies [3, 44, 30] have arisen to address the issues of massiveness and class imbalance in graph data. To tackle the massiveness issue, [7, 2] explore efficient data sampling policies to reduce the computational cost from the data perspective. From the model-improvement perspective, some approaches design quantization-aware training and low-precision inference methods to reduce GNNs' operating costs. For example, GLT [3] applies the lottery ticket technique [9] to simplify the graph data and the GNN model simultaneously. To deal with the imbalance issue in node classification on graphs, GraphSMOTE [44] generates new nodes for the minority classes to balance the training data. Improving upon GraphSMOTE, GraphENS [30] further proposes a new augmentation method that constructs an ego network to learn the representations of the minority classes. Despite the progress made so far, existing methods fail to tackle the two issues together. Furthermore, while one of the issues is being handled, extra computation costs are introduced at the same time. For example, the rewind steps in GLT [3], which search for lottery subnets and subsets, heavily increase the computation cost, even though the final lotteries are lightweight. The newly synthesized nodes in GraphSMOTE [44, 1] and GraphENS [30], although helpful in alleviating the data imbalance, bring extra computational burdens to the subsequent training process.
Regarding the issues above, we make several observations. Firstly, we notice that a sparsely pruned GNN easily "forgets" the minority samples when trained with class-imbalanced graph data, as it yields worse performance on the minority classes than the original GNN [17]. To investigate the cause of this observation, we study how each graph sample affects the model parameter updates by taking a closer look at the gradients it contributes. Specifically, at early training stages, we find that a small subset of the samples provides the most informative supervisory signals, as reflected by their gradient norms. One hypothesis we make is that the training effectiveness of the full training set can be approximated, to some extent, by that of the subset. We further hypothesize that this approximation is guaranteed by the distance between the gradients of the subset and those of the full dataset.
Based on the above observations and hypotheses, we propose a novel method called Graph Decantation (GraphDec) to explore dynamic sparsity training from both the model and the data aspects. The principle behind our method is illustrated in Figure 1. Given that informative samples yield higher gradient values/scores when trained with a sparse GNN, our method builds on contrastive self-supervised learning, whose backbone can be modified to dynamically prune one branch, improving its capability to identify minority samples in a class-imbalanced dataset. In particular, we design a new contrastive backbone with a sparse GNN and enable the model to identify informative samples in a self-supervised manner. To the best of our knowledge, other learning paradigms (e.g., graph auto-encoders, supervised learning) are either unable to identify informative samples or incapable of learning in a self-supervised manner. Accordingly, the proposed framework scores the samples in the current training set and keeps only the k most informative samples as the training set for the next epoch. Considering that a sample deemed unimportant now may become important later, we further design a data recycling process that randomly recycles previously discarded data samples (samples considered unimportant in earlier training epochs) and adds them back to the current informative sparse subset for reuse. The dynamically updated informative subset supports the sparse GNN in learning more balanced representations. To summarize, our major contributions in this work are:
• We develop a novel framework, Graph Decantation, which leverages dynamic sparse graph contrastive learning on class-imbalanced graph data with efficient data usage. To the best of our knowledge, this is the first study to explore the dynamic sparsity property for class-imbalanced graphs.
• We introduce cosine annealing to dynamically control the sizes of the sparse GNN model and the graph data subset, smoothing the training process. Meanwhile, we introduce data recycling to refresh the current data subset and avoid overfitting (see the sketch after this list).
• Comprehensive experiments on multiple benchmark datasets demonstrate that GraphDec outperforms state-of-the-art methods for both class-imbalanced graph classification and class-imbalanced node classification tasks. Additional results show that GraphDec effectively finds an informative subset dynamically across different training epochs.
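As referenced above, the following is a minimal sketch of how the decantation loop could be realized; the scoring source, recycle ratio, and cosine-schedule endpoints are illustrative assumptions rather than the paper's exact settings. Each epoch, samples are ranked by the gradient scores they receive from the sparse contrastive GNN, only the top-ranked fraction is kept, a few previously discarded samples are randomly recycled back, and the keep fraction follows a cosine annealing schedule.

```python
import math
import random

def cosine_keep_ratio(epoch, total_epochs, start=1.0, end=0.1):
    """Cosine-annealed fraction of samples kept at each epoch (assumed schedule endpoints)."""
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * epoch / total_epochs))

def decant(scores, pool, discarded, epoch, total_epochs, recycle_ratio=0.1):
    """One decantation step: keep the highest-scoring samples, recycle a few discarded ones.
    scores: dict mapping sample id -> gradient-norm score from the sparse contrastive GNN.
    pool: current training subset (sample ids); discarded: ids dropped in earlier epochs.
    """
    k = max(1, int(cosine_keep_ratio(epoch, total_epochs) * len(pool)))
    ranked = sorted(pool, key=lambda i: scores[i], reverse=True)
    kept, dropped = ranked[:k], ranked[k:]
    discarded.extend(dropped)
    # Randomly recycle a small portion of previously discarded samples back into the subset.
    n_recycle = min(len(discarded), int(recycle_ratio * k))
    recycled = random.sample(discarded, n_recycle)
    for i in recycled:
        discarded.remove(i)
    return kept + recycled, discarded

# Toy usage: 100 samples with random placeholder scores, at epoch 5 of 50.
scores = {i: random.random() for i in range(100)}
subset, discarded = decant(scores, list(range(100)), [], epoch=5, total_epochs=50)
```

In GraphDec, the scores would come from the sparse graph contrastive backbone; here they are random placeholders used only to show the control flow.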
2 Related Work
Training deep models with sparsity. Although deep neural networks generally work well in practice, they are usually over-parameterized. Over-parameterized models, while often achieving good performance when trained properly, come with enormous computational costs. Therefore, parameter pruning aimed at decreasing computational cost has been a popular topic, and many parameter-pruning strategies have been proposed to balance the trade-off between model performance and learning efficiency [5, 24]. Most existing parameter-pruning methods belong to the static pruning category, in which deep neural networks are pruned either by neurons [14, 13] or by architectures (layers and filters) [15, 6]; parameters deleted by these methods are not recovered later. In contrast to static pruning, more recent works propose dynamic pruning strategies in which different compact subnets are dynamically activated at each training iteration [26, 28, 32]. Another line of work on reducing computation cost lies in dataset sparsity [21, 25, 31]. The core idea is to prune the original dataset down to its most salient subset, on which an over-parameterized deep model can then be trained (e.g., the data diet subset [31]). Recently, the sparsity property has also been used to improve model robustness [4, 10]. In this work, we pursue dynamic sparsity in both the GNN model and the graph dataset simultaneously.
Class-imbalanced learning on graphs. In real-world scenarios, an imbalanced class distribution is a natural property of many datasets, including graph data. Beyond conventional rebalancing techniques such as sample reweighting [44, 30] and oversampling [44, 30], various methods have been proposed to solve the class imbalance issue in graph data for specific tasks. For node classification, an early work [45] tries to accurately characterize the rare categories through a curriculum self-paced strategy, while other previous works [34, 44, 30] address the class-imbalance issue by generating synthetic samples to rebalance the dataset. Compared with node-level tasks, rebalancing methods specific to graph-level tasks are relatively unexplored. A recent work [39] proposes to utilize additional supervisory signals from neighboring graphs to alleviate the class-imbalance problem for graph-level tasks. To the best of our knowledge, our proposed GraphDec is the first work to address class imbalance for both node-level and graph-level tasks.
3 Preliminary
In this work, we denote a graph as $G = (V, E, X)$, where $V$ is the set of nodes, $E$ is the set of edges, and $X \in \mathbb{R}^{d}$ represents the node (and edge) attributes of dimension $d$. In addition, we denote the neighbor set of node $v \in V$ as $\mathcal{N}_v$.
Graph Neural Networks. GNNs [40] learn node representations from the graph structure and node attributes. This process can be formulated as:
$$h_v^{(l)} = \mathrm{COMBINE}^{(l)}\Big(h_v^{(l-1)}, \mathrm{AGGREGATE}^{(l)}\big(\{h_u^{(l-1)}, \forall u \in \mathcal{N}_v\}\big)\Big), \quad (1)$$
where $h_v^{(l)}$ denotes the feature of node $v$ at the $l$-th GNN layer; $\mathrm{AGGREGATE}(\cdot)$ and $\mathrm{COMBINE}(\cdot)$ are the neighbor aggregation and combination functions, respectively; and $h_v^{(0)}$ is initialized with the node attribute $X_v$. We obtain the output representation of each node after repeating the process in Equation (1) for $L$ rounds. The representation of the whole graph, denoted as $h_G \in \mathbb{R}^{d}$, is obtained by using a READOUT function to combine the final node representations learned above:
$$h_G = \mathrm{READOUT}\big(\{h_v^{(L)} \mid \forall v \in V\}\big), \quad (2)$$
where the READOUT function can be any permutation-invariant function, such as summation or averaging.
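To make Equations (1) and (2) concrete, below is a minimal PyTorch sketch of one message-passing layer with mean aggregation and a sum READOUT. The class and function names (e.g., MeanAggregateGNNLayer) are hypothetical, and a dense adjacency matrix is used only for brevity.

```python
import torch
import torch.nn as nn

class MeanAggregateGNNLayer(nn.Module):
    """One message-passing layer: AGGREGATE = mean over neighbors, COMBINE = linear + ReLU."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.combine = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, adj):
        # h: [num_nodes, in_dim] node features; adj: [num_nodes, num_nodes] 0/1 adjacency.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)       # node degrees
        neigh = adj @ h / deg                                  # AGGREGATE: mean of neighbor features
        return torch.relu(self.combine(torch.cat([h, neigh], dim=1)))  # COMBINE

def readout(h):
    # Permutation-invariant READOUT (Eq. 2): sum of final node embeddings -> graph embedding.
    return h.sum(dim=0)

# Toy usage: a 4-node graph with random 8-dim node features.
adj = torch.tensor([[0, 1, 1, 0],
                    [1, 0, 0, 1],
                    [1, 0, 0, 1],
                    [0, 1, 1, 0]], dtype=torch.float)
h = torch.randn(4, 8)
layer = MeanAggregateGNNLayer(8, 16)
h_G = readout(layer(h, adj))   # graph-level representation, shape [16]
```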
Graph Contrastive Learning. Given a graph dataset $D = \{G_i\}_{i=1}^{N}$, Graph Contrastive Learning (GCL) methods first apply transformations to each graph $G_i$ to generate two views $G_i^1$ and $G_i^2$. The goal of GCL is to map samples within positive pairs closer together in the hidden space, while pushing those of negative pairs further apart. GCL methods are usually optimized with a contrastive loss. Taking the popular InfoNCE loss [29] as an example, the contrastive loss is defined as:
$$\mathcal{L}_{CL}(G_i^1, G_i^2) = -\log \frac{\exp\big(\mathrm{sim}(z_{i,1}, z_{i,2})\big)}{\sum_{j=1, j \neq i}^{N} \exp\big(\mathrm{sim}(z_{i,1}, z_{j,2})\big)}, \quad (3)$$
where $z_{i,1} = f_\theta(G_i^1)$, $z_{i,2} = f_\theta(G_i^2)$, and $\mathrm{sim}(\cdot, \cdot)$ denotes a similarity function.
Network Pruning. Given an over-parameterized deep neural network $f_\theta(\cdot)$ with weights $\theta$, network pruning is usually performed layer by layer. The pruning process for the $l$-th layer of $f_\theta(\cdot)$ can be formulated as follows:
$$\theta_{l}^{\mathrm{pruned}} = \mathrm{TopK}(\theta_{l}, k), \quad k = \alpha \times |\theta_{l}|, \quad (4)$$
where $\theta_{l}$ denotes the parameters in the $l$-th layer of $f_\theta(\cdot)$ and $\mathrm{TopK}(\cdot, k)$ refers to the operation that keeps the top-$k$ largest elements of $\theta_{l}$. A pre-defined sparsity rate $\alpha$ controls the fraction of parameters kept in the pruned layer $\theta_{l}^{\mathrm{pruned}}$, so that only the $k = \alpha \times |\theta_{l}|$ largest weights are retained. The pruning process is applied iteratively to the parameters of each layer of the deep neural network [12].
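A minimal sketch of Equation (4) follows, keeping the top-$k$ largest-magnitude weights of each layer and zeroing the rest. The mask-based zeroing and the restriction to linear layers are simplifying assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

def prune_layer_topk(weight: torch.Tensor, alpha: float) -> torch.Tensor:
    """Eq. (4): keep the k = alpha * |theta_l| largest-magnitude weights, zero out the rest."""
    k = max(1, int(alpha * weight.numel()))
    flat = weight.abs().flatten()
    threshold = torch.topk(flat, k, largest=True).values.min()   # k-th largest magnitude
    mask = (weight.abs() >= threshold).float()
    return weight * mask

def prune_network(model: nn.Module, alpha: float) -> None:
    """Apply Eq. (4) layer by layer to every linear weight matrix of the model (in place)."""
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, nn.Linear):
                module.weight.copy_(prune_layer_topk(module.weight, alpha))

# Toy usage: keep 20% of the weights in each linear layer.
mlp = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
prune_network(mlp, alpha=0.2)
```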
4 Methodology
In this section, we first present our sparse subset approximation hypothesis, supported by a theorem, which states that if the gradients of a data subset closely approximate the gradients of the full dataset, then a model trained on the subset performs close to a model trained on the full set. Guided by this hypothesis, we develop GraphDec to continually refine a compact training subset with a dynamic graph contrastive learning methodology. In detail, we describe how to rank the importance of each sample, how to smooth the refinement procedure, and how to avoid overfitting.
4.1 Sparse Subset Approximation Hypothesis
Firstly, we propose the sparse subset approximation hypothesis to show how a model trained with a data subset $D_S$ can approximate the effect of a model trained with the full data $D$. This hypothesis explains why a model trained on a subset selected by specific methods (e.g., data diet [31]) achieves performance close to that of a model trained on the full dataset.
Theorem 1. For a data selection algorithm, we assume the model is optimized with full gradient descent. At epoch $t$, where $t \in [1, T]$, denote the model's parameters as $\theta^{(t)}$ with $\|\theta^{(t)}\|_2 \le d^2$ for a constant $d$, the optimal model's parameters as $\theta^{*}$, the data subset as $D_S^{(t)}$, and the learning rate as $\alpha$. Define the gradient error $\mathrm{Err}(D_S^{(t)}, \mathcal{L}, \mathcal{L}_{train}, \theta^{(t)}) = \big\| \sum_{i \in D_S^{(t)}} \nabla_\theta \mathcal{L}_{train}^{i}(\theta^{(t)}) - \nabla_\theta \mathcal{L}(\theta^{(t)}) \big\|$, where $\mathcal{L}$ denotes the training loss $\mathcal{L}_{train}$ over the full training data or the validation loss $\mathcal{L}_{val}$ over the full validation data, and $\mathcal{L}$ is a convex function. Then we have the following guarantee:
If $\mathcal{L}_{train}$ is Lipschitz continuous with parameter $\sigma_T$ and $\alpha = \frac{d}{\sigma_T \sqrt{T}}$, then
$$\min_{t=1:T} \mathcal{L}(\theta^{(t)}) - \mathcal{L}(\theta^{*}) \;\le\; \frac{d\,\sigma_T}{\sqrt{T}} + \frac{d}{T} \sum_{t=1}^{T-1} \mathrm{Err}\big(D_S^{(t)}, \mathcal{L}, \mathcal{L}_{train}, \theta^{(t)}\big).$$
The detailed proof is provided in Section A of the Appendix. According to the above hypothesis, one intuitive implication is that reducing the distance between the gradients of the subset and those of the full set, formulated as $\big\| \sum_{i \in D_S^{(t)}} \nabla_\theta \mathcal{L}_{train}^{i}(\theta^{(t)}) - \nabla_\theta \mathcal{L}(\theta^{(t)}) \big\|$, is the key to minimizing the gap between the performance of the model trained with the subset and that of the optimal model, denoted as $\mathcal{L}(\theta) - \mathcal{L}(\theta^{*})$.
From the perspective of minimizing $\big\| \sum_{i \in D_S^{(t)}} \nabla_\theta \mathcal{L}_{train}^{i}(\theta^{(t)}) - \nabla_\theta \mathcal{L}(\theta^{(t)}) \big\|$, the success of data diet [31] (a prior coreset algorithm) is understandable: data diet computes each sample's error/gradient norm based on a lightly trained model, then deletes the portion of the full set with smaller values, which can be represented as $\bar{D}_S^{(t)} = D - D_S^{(t)}$. The gradient $\sum_{j \in \bar{D}_S^{(t)}} \nabla_\theta \mathcal{L}_{train}^{j}(\theta^{(t)})$ of the removed data samples is much smaller than that of the remaining data samples, $\sum_{i \in D_S^{(t)}} \nabla_\theta \mathcal{L}_{train}^{i}(\theta^{(t)})$. As we will show in the experiments (Section 5.5), the static data diet cannot always capture the most important samples across all epochs during training [31]. Although the rankings of the elements in $D_S$ seemingly remain static, the ranking order of elements in the full training dataset $D$ changes much more actively than that of the diet subset $D_S$, which implies that a one-shot subset $D_S$ cannot provide gradients $\sum_{i \in D_S^{(t)}} \nabla_\theta \mathcal{L}_{train}^{i}(\theta^{(t)})$ that approximate the full set $D$'s gradient $\nabla_\theta \mathcal{L}(\theta^{(t)})$.