Diving into Unified Data-Model Sparsity for
Class-Imbalanced Graph Representation Learning
Chunhui Zhang1, Chao Huang2, Yijun Tian3, Qianlong Wen3,
Zhongyu Ouyang3, Youhuan Li4, Yanfang Ye3, Chuxu Zhang1
1Brandeis University, USA, 2University of Hong Kong, China
3University of Notre Dame, USA, 4Hunan University, China
{chunhuizhang,chuxuzhang}@brandeis.edu, chaohuang75@gmail.com,
{ytian5,qwen,zouyang2,yye7}@nd.edu, liyouhuan@hnu.edu.cn
Abstract
Recent research shows that, even when models are pruned by state-of-the-art network compression methods, deep learning training still demands massive amounts of data. In particular, training Graph Neural Networks (GNNs) on non-Euclidean graph data often incurs higher time costs than training on data in regular Euclidean space (e.g., images or text), due to the irregular structure and uneven density of graphs. Another natural property of graph data is class imbalance, which is not alleviated by sheer data volume and hinders GNNs' generalization. To tackle both of these undesirable properties, (i) theoretically, we introduce a hypothesis about the extent to which a subset of the training data can approximate the learning effectiveness of the full dataset; this effectiveness is further guaranteed and proved by the distance between the gradients of the subset and those of the full set; (ii) empirically, we discover that during the learning process of a GNN, some samples in the training dataset are more informative than others for providing gradients to update model parameters. Moreover, this informative subset is not fixed during training: samples that are informative in the current training epoch may not be so in the next one. We refer to this observation as dynamic data sparsity. We also notice that sparse subnets pruned from a well-trained GNN sometimes forget the information provided by the informative subset, reflected in their poor performance on that subset. Based on these findings, we develop a unified data-model dynamic sparsity framework named Graph Decantation (GraphDec) to address the challenges of training on massive, class-imbalanced graph data. The key idea of GraphDec is to identify the informative subset dynamically during training by adopting sparse graph contrastive learning. Extensive experiments on multiple benchmark datasets demonstrate that GraphDec outperforms state-of-the-art baselines for class-imbalanced graph classification and class-imbalanced node classification tasks, in terms of both classification accuracy and data usage efficiency.
1 Introduction
Graph representation learning (GRL) [23] has shown remarkable power in dealing with non-Euclidean structured data (e.g., social networks, biochemical molecules, knowledge graphs). Graph neural networks (GNNs) [23, 11, 38, 43], as the current state of the art of GRL, have become essential in various graph mining applications. To learn a representation of each node that reflects its local structure pattern, GNNs gather features of neighboring nodes and apply message passing along edges. This topology-aware mechanism enables GNNs to achieve superior performance across different tasks. However, in many real-world scenarios, graph data exhibits two properties: massiveness [36, 18] and class imbalance [30]. Firstly, message passing over nodes with high degrees brings about
heavy computation burdens. Some of these calculations are even redundant, in that not all neighbors are informative for learning task-related embeddings. Unlike regular data such as images or text, the connectivity of irregular graph data causes random memory access, which further slows down data readout. Secondly, class imbalance naturally exists in datasets from diverse practical domains, such as bioinformatics and social networks. GNNs are sensitive to this imbalance and can be biased toward the dominant classes. This bias can mislead the learning process, causing the model to underfit the samples that truly matter for the downstream tasks and, as a result, to perform poorly on the test data.
Figure 1: The principle of graph decantation. It decants
data samples based on rankings of their gradient scores,
and then uses them as the training set in the next epoch.
Accordingly, recent studies [3, 44, 30] have arisen to address the issues of massiveness and class imbalance in graph data. To tackle the massiveness issue, [7, 2] explore efficient data sampling policies to reduce the computational cost from the data perspective. From the model-improvement perspective, some approaches design quantization-aware training and low-precision inference methods to reduce GNNs' operating costs. For example, GLT [3] applies the lottery ticket technique [9] to simplify the graph data and the GNN model simultaneously. To deal with the imbalance issue in node classification on graphs, GraphSMOTE [44] generates new nodes for the minority classes to balance the training data. Improving upon GraphSMOTE, GraphENS [30] further proposes a new augmentation method that constructs an ego network to learn the representations of the minority classes. Despite the progress made so far, existing methods fail to tackle the two issues together. Furthermore, while one of the issues is being handled, extra computation costs are introduced at the same time. For example, the rewind steps in GLT [3], which search for lottery subnets and subsets, heavily increase the computation cost, even though the final lotteries are lightweight. The newly synthesized nodes in GraphSMOTE [44, 1] and GraphENS [30], although helpful in alleviating the data imbalance, bring extra computational burdens to the subsequent training process.
Regarding the issues above, we make several observations. Firstly, we notice that a sparsely pruned GNN easily "forgets" the minority samples when trained with class-imbalanced graph data, as it yields worse performance on the minority classes than the original GNN [17]. To investigate the cause of this observation, we study how each graph sample affects the model parameter updates by taking a closer look at the gradients it contributes. Specifically, at early training stages, we find that a small subset of the samples provides the most informative supervisory signals, as reflected by their gradient norms. One hypothesis we make is that the training effectiveness of the full training set can be approximated, to some extent, by that of the subset. We further hypothesize that this approximation is guaranteed by the distance between the gradients of the subset and those of the full dataset.
Based on the above observations and hypotheses, we propose a novel method called Graph Decantation (GraphDec) to explore dynamic sparsity training from both the model and the data aspects. The principle behind our method is illustrated in Figure 1. Given that informative samples yield higher gradient values/scores when trained with a sparse GNN, our method builds on contrastive self-supervised learning, whose backbone can be modified to dynamically prune one branch, improving its capability to identify minority samples in a class-imbalanced dataset. In particular, we design a new contrastive backbone with a sparse GNN and enable the model to identify informative samples in a self-supervised manner. To the best of our knowledge, other learning paradigms (e.g., graph auto-encoders, supervised learning) are either unable to identify informative samples or incapable of learning in a self-supervised manner. Accordingly, the proposed framework scores the samples in the current training set and keeps only the k most informative samples as the training set for the next epoch. Considering that a sample deemed unimportant now may become important later, we further design a data recycling process that randomly recycles previously discarded data samples (samples considered unimportant in earlier training epochs) and adds them back to the current informative sparse subset for reuse. The dynamically updated informative subset supports the sparse GNN in learning more balanced representations. To summarize, our major contributions in this work are:
• We develop a novel framework, Graph Decantation, which leverages dynamic sparse graph contrastive learning on class-imbalanced graph data with efficient data usage. To the best of our knowledge, this is the first study to explore the dynamic sparsity property for class-imbalanced graphs.
• We introduce cosine annealing to dynamically control the sizes of the sparse GNN model and the graph data subset, smoothing the training process. Meanwhile, we introduce data recycling to refresh the current data subset and avoid overfitting (see the sketch after this list).
• Comprehensive experiments on multiple benchmark datasets demonstrate that GraphDec outperforms state-of-the-art methods for both class-imbalanced graph classification and class-imbalanced node classification tasks. Additional results show that GraphDec effectively finds an informative subset dynamically across different training epochs.
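As referenced above, the following is a minimal sketch of how the decantation loop could be realized; the scoring source, recycle ratio, and cosine-schedule endpoints are illustrative assumptions rather than the paper's exact settings. Each epoch, samples are ranked by the gradient scores they receive from the sparse contrastive GNN, only the top-ranked fraction is kept, a few previously discarded samples are randomly recycled back, and the keep fraction follows a cosine annealing schedule.

```python
import math
import random

def cosine_keep_ratio(epoch, total_epochs, start=1.0, end=0.1):
    """Cosine-annealed fraction of samples kept at each epoch (assumed schedule endpoints)."""
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * epoch / total_epochs))

def decant(scores, pool, discarded, epoch, total_epochs, recycle_ratio=0.1):
    """One decantation step: keep the highest-scoring samples, recycle a few discarded ones.
    scores: dict mapping sample id -> gradient-norm score from the sparse contrastive GNN.
    pool: current training subset (sample ids); discarded: ids dropped in earlier epochs.
    """
    k = max(1, int(cosine_keep_ratio(epoch, total_epochs) * len(pool)))
    ranked = sorted(pool, key=lambda i: scores[i], reverse=True)
    kept, dropped = ranked[:k], ranked[k:]
    discarded.extend(dropped)
    # Randomly recycle a small portion of previously discarded samples back into the subset.
    n_recycle = min(len(discarded), int(recycle_ratio * k))
    recycled = random.sample(discarded, n_recycle)
    for i in recycled:
        discarded.remove(i)
    return kept + recycled, discarded

# Toy usage: 100 samples with random placeholder scores, at epoch 5 of 50.
scores = {i: random.random() for i in range(100)}
subset, discarded = decant(scores, list(range(100)), [], epoch=5, total_epochs=50)
```

In GraphDec, the scores would come from the sparse graph contrastive backbone; here they are random placeholders used only to show the control flow.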
2 Related Work
Training deep models with sparsity. Although deep neural networks generally work well in practice, they are usually over-parameterized. Over-parameterized models, while often achieving good performance when trained properly, come with enormous computational costs. Therefore, parameter pruning aimed at decreasing computational cost has been a popular topic, and many parameter-pruning strategies have been proposed to balance the trade-off between model performance and learning efficiency [5, 24]. Most existing parameter-pruning methods belong to the static pruning category, in which deep neural networks are pruned either by neurons [14, 13] or by architectures (layers and filters) [15, 6]; parameters deleted by these methods are not recovered later. In contrast to static pruning, more recent works propose dynamic pruning strategies in which different compact subnets are dynamically activated at each training iteration [26, 28, 32]. Another line of work on reducing computation cost lies in dataset sparsity [21, 25, 31]. The core idea is to prune the original dataset down to its most salient subset, on which an over-parameterized deep model can then be trained (e.g., the data diet subset [31]). Recently, the sparsity property has also been used to improve model robustness [4, 10]. In this work, we pursue dynamic sparsity in both the GNN model and the graph dataset simultaneously.
Class-imbalanced learning on graphs. In real-world scenarios, an imbalanced class distribution is a natural property of many datasets, including graph data. Beyond conventional rebalancing techniques such as sample reweighting [44, 30] and oversampling [44, 30], various methods have been proposed to solve the class imbalance issue in graph data for specific tasks. For node classification, an early work [45] tries to accurately characterize the rare categories through a curriculum self-paced strategy, while other previous works [34, 44, 30] address the class-imbalance issue by generating synthetic samples to rebalance the dataset. Compared with node-level tasks, rebalancing methods specific to graph-level tasks are relatively unexplored. A recent work [39] proposes to utilize additional supervisory signals from neighboring graphs to alleviate the class-imbalance problem for graph-level tasks. To the best of our knowledge, our proposed GraphDec is the first work to address class imbalance for both node-level and graph-level tasks.
3 Preliminary
In this work, we denote a graph as $G = (V, E, X)$, where $V$ is the set of nodes, $E$ is the set of edges, and $X \in \mathbb{R}^{d}$ represents the node (and edge) attributes of dimension $d$. In addition, we denote the neighbor set of node $v \in V$ as $\mathcal{N}_v$.
Graph Neural Networks. GNNs [40] learn node representations from the graph structure and node attributes. This process can be formulated as:
$$h_v^{(l)} = \mathrm{COMBINE}^{(l)}\Big(h_v^{(l-1)}, \mathrm{AGGREGATE}^{(l)}\big(\{h_u^{(l-1)}, \forall u \in \mathcal{N}_v\}\big)\Big), \quad (1)$$
where $h_v^{(l)}$ denotes the feature of node $v$ at the $l$-th GNN layer; $\mathrm{AGGREGATE}(\cdot)$ and $\mathrm{COMBINE}(\cdot)$ are the neighbor aggregation and combination functions, respectively; and $h_v^{(0)}$ is initialized with the node attribute $X_v$. We obtain the output representation of each node after repeating the process in Equation (1) for $L$ rounds. The representation of the whole graph, denoted as $h_G \in \mathbb{R}^{d}$, is obtained by using a READOUT function to combine the final node representations learned above:
$$h_G = \mathrm{READOUT}\big(\{h_v^{(L)} \mid \forall v \in V\}\big), \quad (2)$$
where the READOUT function can be any permutation-invariant function, such as summation or averaging.
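To make Equations (1) and (2) concrete, below is a minimal PyTorch sketch of one message-passing layer with mean aggregation and a sum READOUT. The class and function names (e.g., MeanAggregateGNNLayer) are hypothetical, and a dense adjacency matrix is used only for brevity.

```python
import torch
import torch.nn as nn

class MeanAggregateGNNLayer(nn.Module):
    """One message-passing layer: AGGREGATE = mean over neighbors, COMBINE = linear + ReLU."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.combine = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, adj):
        # h: [num_nodes, in_dim] node features; adj: [num_nodes, num_nodes] 0/1 adjacency.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)       # node degrees
        neigh = adj @ h / deg                                  # AGGREGATE: mean of neighbor features
        return torch.relu(self.combine(torch.cat([h, neigh], dim=1)))  # COMBINE

def readout(h):
    # Permutation-invariant READOUT (Eq. 2): sum of final node embeddings -> graph embedding.
    return h.sum(dim=0)

# Toy usage: a 4-node graph with random 8-dim node features.
adj = torch.tensor([[0, 1, 1, 0],
                    [1, 0, 0, 1],
                    [1, 0, 0, 1],
                    [0, 1, 1, 0]], dtype=torch.float)
h = torch.randn(4, 8)
layer = MeanAggregateGNNLayer(8, 16)
h_G = readout(layer(h, adj))   # graph-level representation, shape [16]
```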
Graph Contrastive Learning. Given a graph dataset $D = \{G_i\}_{i=1}^{N}$, Graph Contrastive Learning (GCL) methods first apply transformations to each graph $G_i$ to generate two views $G_i^1$ and $G_i^2$. The goal of GCL is to map samples within positive pairs closer together in the hidden space, while pushing those of negative pairs further apart. GCL methods are usually optimized with a contrastive loss. Taking the popular InfoNCE loss [29] as an example, the contrastive loss is defined as:
$$\mathcal{L}_{CL}(G_i^1, G_i^2) = -\log \frac{\exp\big(\mathrm{sim}(z_{i,1}, z_{i,2})\big)}{\sum_{j=1, j \neq i}^{N} \exp\big(\mathrm{sim}(z_{i,1}, z_{j,2})\big)}, \quad (3)$$
where $z_{i,1} = f_\theta(G_i^1)$, $z_{i,2} = f_\theta(G_i^2)$, and $\mathrm{sim}(\cdot, \cdot)$ denotes a similarity function.
Network Pruning. Given an over-parameterized deep neural network $f_\theta(\cdot)$ with weights $\theta$, network pruning is usually performed layer by layer. The pruning process for the $l$-th layer of $f_\theta(\cdot)$ can be formulated as follows:
$$\theta_{l}^{\mathrm{pruned}} = \mathrm{TopK}(\theta_{l}, k), \quad k = \alpha \times |\theta_{l}|, \quad (4)$$
where $\theta_{l}$ denotes the parameters in the $l$-th layer of $f_\theta(\cdot)$ and $\mathrm{TopK}(\cdot, k)$ refers to the operation that keeps the top-$k$ largest elements of $\theta_{l}$. A pre-defined sparsity rate $\alpha$ controls the fraction of parameters kept in the pruned layer $\theta_{l}^{\mathrm{pruned}}$, so that only the $k = \alpha \times |\theta_{l}|$ largest weights are retained. The pruning process is applied iteratively to the parameters of each layer of the deep neural network [12].
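A minimal sketch of Equation (4) follows, keeping the top-$k$ largest-magnitude weights of each layer and zeroing the rest. The mask-based zeroing and the restriction to linear layers are simplifying assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

def prune_layer_topk(weight: torch.Tensor, alpha: float) -> torch.Tensor:
    """Eq. (4): keep the k = alpha * |theta_l| largest-magnitude weights, zero out the rest."""
    k = max(1, int(alpha * weight.numel()))
    flat = weight.abs().flatten()
    threshold = torch.topk(flat, k, largest=True).values.min()   # k-th largest magnitude
    mask = (weight.abs() >= threshold).float()
    return weight * mask

def prune_network(model: nn.Module, alpha: float) -> None:
    """Apply Eq. (4) layer by layer to every linear weight matrix of the model (in place)."""
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, nn.Linear):
                module.weight.copy_(prune_layer_topk(module.weight, alpha))

# Toy usage: keep 20% of the weights in each linear layer.
mlp = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
prune_network(mlp, alpha=0.2)
```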
4 Methodology
In this section, we first present our sparse subset approximation hypothesis, supported by a theorem, which states that if the gradients of a data subset closely approximate the gradients of the full dataset, then a model trained on the subset performs close to a model trained on the full set. Guided by this hypothesis, we develop GraphDec to continually refine a compact training subset with a dynamic graph contrastive learning methodology. In detail, we describe how to rank the importance of each sample, how to smooth the refinement procedure, and how to avoid overfitting.
4.1 Sparse Subset Approximation Hypothesis
Firstly, we propose the sparse subset approximation hypothesis to show how a model trained with a data subset $D_S$ can approximate the effect of a model trained with the full data $D$. This hypothesis explains why a model trained on a subset selected by specific methods (e.g., data diet [31]) achieves performance close to that of a model trained on the full dataset.
Theorem 1. For a data selection algorithm, we assume the model is optimized with full gradient descent. At epoch $t$, where $t \in [1, T]$, denote the model's parameters as $\theta^{(t)}$ with $\|\theta^{(t)}\|_2 \le d^2$ for a constant $d$, the optimal model's parameters as $\theta^{*}$, the data subset as $D_S^{(t)}$, and the learning rate as $\alpha$. Define the gradient error $\mathrm{Err}(D_S^{(t)}, \mathcal{L}, \mathcal{L}_{train}, \theta^{(t)}) = \big\| \sum_{i \in D_S^{(t)}} \nabla_\theta \mathcal{L}_{train}^{i}(\theta^{(t)}) - \nabla_\theta \mathcal{L}(\theta^{(t)}) \big\|$, where $\mathcal{L}$ denotes the training loss $\mathcal{L}_{train}$ over the full training data or the validation loss $\mathcal{L}_{val}$ over the full validation data, and $\mathcal{L}$ is a convex function. Then we have the following guarantee:
If $\mathcal{L}_{train}$ is Lipschitz continuous with parameter $\sigma_T$ and $\alpha = \frac{d}{\sigma_T \sqrt{T}}$, then
$$\min_{t=1:T} \mathcal{L}(\theta^{(t)}) - \mathcal{L}(\theta^{*}) \;\le\; \frac{d\,\sigma_T}{\sqrt{T}} + \frac{d}{T} \sum_{t=1}^{T-1} \mathrm{Err}\big(D_S^{(t)}, \mathcal{L}, \mathcal{L}_{train}, \theta^{(t)}\big).$$
The detailed proof is provided in Section A of the Appendix. According to the above hypothesis, one intuitive implication is that reducing the distance between the gradients of the subset and those of the full set, formulated as $\big\| \sum_{i \in D_S^{(t)}} \nabla_\theta \mathcal{L}_{train}^{i}(\theta^{(t)}) - \nabla_\theta \mathcal{L}(\theta^{(t)}) \big\|$, is the key to minimizing the gap between the performance of the model trained with the subset and that of the optimal model, denoted as $\mathcal{L}(\theta) - \mathcal{L}(\theta^{*})$.
From the perspective of minimizing $\big\| \sum_{i \in D_S^{(t)}} \nabla_\theta \mathcal{L}_{train}^{i}(\theta^{(t)}) - \nabla_\theta \mathcal{L}(\theta^{(t)}) \big\|$, the success of data diet [31] (a prior coreset algorithm) is understandable: data diet computes each sample's error/gradient norm based on a lightly trained model, then deletes the portion of the full set with smaller values, which can be represented as $\bar{D}_S^{(t)} = D - D_S^{(t)}$. The gradient $\sum_{j \in \bar{D}_S^{(t)}} \nabla_\theta \mathcal{L}_{train}^{j}(\theta^{(t)})$ of the removed data samples is much smaller than that of the remaining data samples, $\sum_{i \in D_S^{(t)}} \nabla_\theta \mathcal{L}_{train}^{i}(\theta^{(t)})$. As we will show in the experiments (Section 5.5), the static data diet cannot always capture the most important samples across all epochs during training [31]. Although the rankings of the elements in $D_S$ seemingly remain static, the ranking order of elements in the full training dataset $D$ changes much more actively than that of the diet subset $D_S$, which implies that a one-shot subset $D_S$ cannot provide gradients $\sum_{i \in D_S^{(t)}} \nabla_\theta \mathcal{L}_{train}^{i}(\theta^{(t)})$ that approximate the full set $D$'s gradient $\nabla_\theta \mathcal{L}(\theta^{(t)})$.