
methods assume that training labels (node / graph level) for the corresponding tasks (e.g., classification / regression problems at the node or whole-graph level) are readily available at the clients and that a global model is trained end-to-end in a federated fashion. However, in many cross-silo applications, clients might have very little or no labeled data. It is well known that annotating node / graph data takes considerable time and resources [18, 62], e.g., the difficulty of obtaining explicit user feedback in social network applications and the cost of in vitro experiments for biological networks. Moreover, certain clients may be unwilling to share labels due to competition or other regulatory reasons. (2) Downstream task heterogeneity: it is reasonable to assume that while clients may share the same graph data domain, the downstream tasks may be client-dependent and vary significantly across clients. It is also reasonable to expect that some clients may add new downstream tasks at a later point, where a model supervised by previous tasks may be ineffective.
With these observations, we propose a realistic and unexplored problem setting for FedGRL: Participating clients have a shared space of graph-structured data, though the distributions may differ across clients; clients have access to vast amounts of unlabeled data; and they may have very different local downstream tasks with very few private labeled data points. Fundamentally, our problem setting asks if one can leverage unlabeled data across clients to learn a shared graph representation (akin to “knowledge transfer”) which can then be further personalized to perform well on the local downstream tasks at each client. In a centralized training regime, a number of works that utilize GNN pre-training [18, 38] and self-supervision [44, 45, 51, 53] have shown the benefits of such approaches in dealing with label deficiency and transfer learning scenarios, which motivates us to explore and utilize them for the proposed FedGRL problem setting.
In this paper, we propose a novel FedGRL formulation based on model interpolation, where we aim to learn a shared global model that is optimized collaboratively using a self-supervised objective and receives downstream task supervision through local client models. We provide a specific instantiation of our general formulation using BGRL [45], a SoTA self-supervised graph representation learning method, and we empirically verify its effectiveness on realistic cross-silo datasets: (1) we adapt the Twitch Gamer Network, which naturally simulates a cross-geo scenario, and show that our formulation provides consistent gains, on avg. 6.1% over traditional supervised federated learning objectives and on avg. 1.7% over individual client-specific self-supervised training; and (2) we construct and introduce a new cross-silo dataset called Amazon Co-purchase Networks that has both characteristics of the motivating problem setting. We first show how standard supervised federated objectives can result in negative gains (on avg. -4.16%) compared to individual client-specific supervised training, due to the increased data heterogeneity and limited label availability. We then experimentally verify the effectiveness of our method and observe on avg. 11.5% gains over traditional supervised federated learning and on avg. 1.9% gains over individually trained self-supervised models. Both experimental results point to the effectiveness of our proposed formulation.
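To make the formulation concrete, below is a minimal PyTorch-style sketch of one client round under this split. It is one possible reading of the model-interpolation idea, not the exact algorithm of Sec. 4: Encoder, ssl_loss, and alpha are illustrative placeholders, the plain MLP merely stands in for a GNN encoder, and in our instantiation the self-supervised term would be a BGRL-style bootstrap loss.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    # Stand-in for a GNN encoder: in practice this would operate on
    # (node features, adjacency), e.g., GCN or GraphSage.
    def __init__(self, d_in: int, d_hid: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(),
                               nn.Linear(d_hid, d_hid))

    def forward(self, x):
        return self.f(x)

def client_round(global_enc, local_enc, head, x_unlab, x_lab, y_lab,
                 ssl_loss, alpha=0.5, lr=1e-3, steps=10):
    """One client round: (i) refine a copy of the shared encoder with a
    self-supervised objective on local unlabeled data, (ii) interpolate
    the refined weights into the private encoder, and (iii) fit the
    private encoder/head on the few local labels."""
    enc = copy.deepcopy(global_enc)
    opt = torch.optim.Adam(enc.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        ssl_loss(enc, x_unlab).backward()  # e.g., a BGRL-style loss
        opt.step()

    # Model interpolation: mix shared weights into the private encoder.
    with torch.no_grad():
        for p_loc, p_shr in zip(local_enc.parameters(), enc.parameters()):
            p_loc.mul_(1.0 - alpha).add_(alpha * p_shr)

    # Downstream task supervision stays local to the client.
    params = list(local_enc.parameters()) + list(head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(head(local_enc(x_lab)), y_lab).backward()
        opt.step()

    return enc.state_dict()  # sent to the server for aggregation
```

Under this reading, only the self-supervised encoder state is returned for server-side aggregation; the interpolated encoder and the task head remain private to each client.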
The remainder of this paper is organized as follows: in Sec. 2 we review relevant work on FL for graph-structured data, self-supervised techniques for GNNs, and finally some recent work on tackling label deficiency with FL. In Sec. 3 we introduce notation and some preliminaries. In Sec. 4 we provide a detailed problem setup and introduce our formulation and its instantiations. In Sec. 5 we describe the detailed experimental setup, and we finally present experimental results in Sec. 6.
2 RELATED WORK
The broad field of designing GNNs for graph representation learning (GRL) receives detailed coverage in recent surveys [3, 14, 49]. We refer the reader to [20, 27, 28] for an overview of FL methods.
2.1 Federated Learning for Graphs
FedGRL is a new research topic and current works have considered
the following two main problem formulations.
First, for node-level tasks (predicting node labels), there are three sub-categories based on the degree of overlap in graph nodes across clients: (1) No node overlap between client graphs. Here, each client maintains a GNN model which is trained on the local node labels, and the server aggregates the parameters of the client GNN models and communicates them back in every federation round [5, 55, 56] (see the FedAvg-style sketch after this list). ASFGNN [56] additionally tackles the non-IID data issue using split-based learning, and FedGraph [5] focuses on efficiency and utilizes a privacy-preserving cross-client GNN convolution operation. FedSage [55] considers a slightly different formulation, wherein each client has access to disjoint subgraphs of some global graph. They utilize GraphSage [13], train it with label information, and further propose to train a missing-neighbor generator to deal with missing links across local subgraphs. (2) Partial node overlap across clients. Here, each participating client holds subgraphs which may have overlapping nodes with other clients’ graphs. GraphFL [48] considers this scenario and utilizes a meta-learning based federated learning algorithm to personalize client models to downstream tasks. [36] considers overlapping nodes in local client knowledge graphs and utilizes them to translate knowledge embeddings across clients. (3) Complete node overlap across clients. Here all clients hold the same set of nodes; they upload node embeddings instead of model parameters to the server for FL aggregation. Existing works focus on vertically partitioned citation network data [35, 57]. Note that all the above problem settings differ from ours in motivation, as we focus on label deficiency and downstream task heterogeneity.
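The parameter aggregation step shared by the methods in sub-category (1) is essentially FedAvg applied to GNN weights. The sketch below is a generic weighted-average implementation for illustration, not the code of any particular cited system:

```python
import torch

def fedavg(client_states, weights=None):
    """Average client model state_dicts (identical keys and shapes),
    optionally weighted, e.g., by local training-set size; weights
    are assumed to sum to 1."""
    if weights is None:
        weights = [1.0 / len(client_states)] * len(client_states)
    avg = {}
    for key in client_states[0]:
        avg[key] = sum(w * s[key].float()
                       for w, s in zip(weights, client_states))
    return avg

# Each federation round the server broadcasts the result back, e.g.:
#   new_global = fedavg([m.state_dict() for m in client_models])
```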
Second, for graph-level tasks (predicting graph labels), each client has a local set of labeled graphs, and the goal is to learn one global model or personalized local models using federation. This problem setting is fundamentally similar to other federated learning settings widely considered in the vision and language domains: one only needs to replace the usual linear/DNN encoder with a graph kernel/GNN encoder to handle the graph data modality (a sketch follows below). [16] creates a benchmark towards this end. The issue of client data non-IID-ness carries over to the graph domain as well, and [50] utilizes client clustering to aggregate model parameters.
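As an illustration of this encoder swap, here is a minimal graph-level classifier sketch using PyTorch Geometric; depth and dimensions are arbitrary choices, not taken from [16] or [50]. Dropping such a model into an otherwise standard federated training loop is all that the graph modality requires:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class GraphClassifier(nn.Module):
    """A GNN encoder with mean-pooling readout, playing the role the
    linear/DNN encoder plays in vision/language FL pipelines."""
    def __init__(self, d_in: int = 32, d_hid: int = 64, n_classes: int = 2):
        super().__init__()
        self.conv1 = GCNConv(d_in, d_hid)
        self.conv2 = GCNConv(d_hid, d_hid)
        self.head = nn.Linear(d_hid, n_classes)

    def forward(self, x, edge_index, batch):
        h = torch.relu(self.conv1(x, edge_index))
        h = torch.relu(self.conv2(h, edge_index))
        g = global_mean_pool(h, batch)  # one embedding per graph
        return self.head(g)
```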