Neural Graphical Models
Harsh Shrivastava & Urszula Chajewska
Microsoft Research, Redmond, USA
Contact: {hshrivastava,urszc}@microsoft.com
Abstract.
Probabilistic Graphical Models are often used to understand the dynamics of a system. They can model relationships between features (nodes) and the underlying distribution. Theoretically, these models can represent very complex dependency functions, but in practice simplifying assumptions are often made due to computational limitations associated with graph operations. In this work we introduce Neural Graphical Models (NGMs), which attempt to represent complex feature dependencies at reasonable computational cost. Given a graph of feature relationships and corresponding samples, we capture the dependency structure between the features along with their complex function representations by using a neural network as a multi-task learning framework. We provide efficient learning, inference and sampling algorithms. NGMs can fit generic graph structures, including directed, undirected and mixed-edge graphs, and support mixed input data types. We present empirical studies that show NGMs' capability to represent Gaussian graphical models, perform inference analysis on lung cancer data and extract insights from real-world infant mortality data provided by the CDC.
Software: https://github.com/harshs27/neural-graphical-models
Keywords: Probabilistic Graphical Models, Deep learning, Learning representations
1 Introduction
Graphical models are a powerful tool to analyze data. They can represent the relationships between features and provide underlying distributions that model functional dependencies between them [20, 15]. Learning, inference and sampling are the operations that make such graphical models useful for domain exploration. Learning, in a broad sense, consists of fitting the distribution function parameters from data. Inference is the procedure of answering queries in the form of conditional distributions with one or more observed variables. Sampling is the ability to draw samples from the underlying distribution defined by the graphical model. One of the common bottlenecks of graphical model representations is the high computational complexity of one or more of these procedures. In particular, various graphical models place restrictions on the set of distributions or the types of variables in the domain. Some graphical models work with continuous variables only (or categorical variables only) or place restrictions
on the graph structure (e.g., that continuous variables cannot be parents of
categorical variables in a DAG). Other restrictions affect the set of distributions
the models are capable of representing, e.g., limiting them to multivariate Gaussians.
For wide adoption of graphical models, the following properties are desired:
- Rich representations of complex underlying distributions.
- Ability to simultaneously handle various input types such as categorical, continuous, images and embedding representations.
- Efficient algorithms for learning, inference and sampling.
- Support for various representations: directed, undirected, mixed-edge graphs.
- Access to the learned underlying distributions for analysis.
In this work, we propose Neural Graphical Models (NGMs) that satisfy the aforementioned desiderata in a computationally efficient way. NGMs accept a feature dependency structure that can be given by an expert or learned from data. The dependency structure may have the form of a graph with clearly defined semantics (e.g., a Bayesian network graph or a Markov network graph) or an adjacency matrix. Note that the graph may be either directed or undirected. Based on this dependency structure, NGMs learn to represent the probability function over the domain using a deep neural network. The parameterization of such a network can be learned from data efficiently, with a loss function that jointly optimizes adherence to the given dependency structure and fit to the data. Probability functions represented by NGMs are free of the common restrictions inherent in other PGMs. They also support efficient inference and sampling.
2 Related works
Probabilistic Graphical Models (PGMs) aim to learn the underlying joint distribution from which the input data is sampled. Often, inducing an independence graph structure between the features helps make learning the distribution computationally feasible. In cases where this independence graph structure is provided by a domain expert, the problem of fitting PGMs reduces to learning distributions over this graph. Alternatively, many methods have traditionally been used to jointly learn the structure as well as the parameters [12, 35, 15, 23], and they have been widely applied to analyze data in many domains [2, 6, 7, 26, 25, 1].
Recently, many interesting deep learning based approaches for DAG recovery have been proposed [43, 44, 16, 42]. These works primarily focus on structure learning, but technically they are learning a Probabilistic Graphical Model. They depend on the existing algorithms developed for Bayesian networks for the inference and sampling tasks. A parallel line of work combining graphical models with deep learning comprises Bayesian deep learning approaches: Variational AutoEncoders, Boltzmann Machines etc. [17, 13, 40]. These deep learning models have significantly more parameters than traditional Bayesian networks, which makes them less suitable for datasets with a small number of samples. Using these deep graphical models for downstream tasks is computationally expensive and often impedes their adoption.
We would be remiss not to mention the technical similarities NGMs have with some recent research works. We found "Learning sparse nonparametric DAGs" [44] to be the closest in terms of representation ability. In one of their versions, they model each independence structure with a different neural network (MLP). However, their criterion for modeling feature independence differs from NGMs': they zero out the weights of the corresponding row in the first layer of the neural network to induce independence between the input and output features. This formulation prevents them from sharing the NNs across different factors. Second, the path norm formulation in [16], which uses the product of NN weights to capture input-to-output connectivity, is similar to NGMs. They used the path norm to parametrize the DAG constraint for continuous optimization, while [33, 34] used it within an unrolled algorithm framework to learn sparse gene regulatory networks.
Methods that model conditional independence graphs [10, 3, 30, 29, 28, 27] are a type of graphical model based on an underlying multivariate Gaussian distribution. Probabilistic Circuits [21], Conditional Random Fields or Markov Networks [36] and some other PGM formulations like [39, 38, 41, 19] are popular. These PGMs often make simplifying assumptions about the underlying distributions and place restrictions on the accepted input data types. Real-world input data often consist of mixed datatypes (real, categorical, text, images etc.) and are challenging for existing graphical model formulations to handle.
3 Neural Graphical Models
We propose a new Probabilistic Graphical Model type, called Neural Graphical Models (NGMs), and describe the associated learning, inference and sampling algorithms. Our model accepts all input types and avoids placing any restrictions on the form of underlying distributions.
Problem setting: We consider input data $X$ that has $M$ samples, with each sample consisting of $D$ features. An example is gene expression data, where we have a matrix of microarray expression values (samples) and genes (features). In the medical domain, we can have a mix of continuous and categorical data describing a patient's health. We are also provided a graph $G$, which can be directed, undirected or have mixed-edge types, and which represents our belief about the feature dependency relationships (in a probabilistic sense). Such graphs are often provided by experts and include inductive biases and domain knowledge about the underlying system functions. In cases where the graph is not provided, we make use of state-of-the-art algorithms to recover DAGs or CI graphs (see Sec. 2). The NGM input is the tuple $(X, G)$.
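To make this input format concrete, the following is a minimal sketch (assuming Python with numpy and pandas; the feature names and values are illustrative only, not taken from the paper's datasets) of how the tuple $(X, G)$ could be assembled:

import numpy as np
import pandas as pd

# X: M samples x D features; mixed datatypes are allowed (continuous + categorical here).
X = pd.DataFrame({
    "age":        [63, 41, 57, 70],            # continuous
    "smoker":     ["yes", "no", "no", "yes"],   # categorical
    "tumor_size": [2.1, 0.0, 1.4, 3.8],         # continuous
})

# G: D x D adjacency matrix over the same features, provided by an expert or
# recovered by a structure-learning algorithm (DAG or CI-graph recovery, Sec. 2).
features = list(X.columns)
G = pd.DataFrame(np.array([[0, 1, 1],
                           [1, 0, 1],
                           [1, 1, 0]]),
                 index=features, columns=features)   # symmetric => undirected

ngm_input = (X, G)   # the (X, G) tuple consumed by NGM learning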
3.1 Representation
Fig. 1 shows a sample recovered graph and how we view the value of each feature as a function of the values of its neighbors. For directed graphs, each feature's value is represented as a function of its Markov blanket in the graph.
Fig. 1: Graphical view of NGMs: The input graph $G$ (undirected) for given input data $X \in \mathbb{R}^{M \times D}$. Each feature $x_i = f_i(\mathrm{Nbrs}(x_i))$ is a function of the neighboring features. For a DAG, the functions between features are defined by the Markov blanket relationship $x_i = f_i(\mathrm{MB}(x_i))$. The adjacency matrix (right) represents the associated dependency structures.
We use the graph $G$ to understand the domain's dependency structure, but ignore any potential parametrization associated with it.
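For illustration, the neighbor set $\mathrm{Nbrs}(x_i)$ and the Markov blanket $\mathrm{MB}(x_i)$ can be read off directly from an adjacency matrix. The sketch below uses Python with numpy and our own helper names, not the authors' code:

import numpy as np

def neighbors(adj, i):
    """Nbrs(x_i) for an undirected graph, given a symmetric 0/1 adjacency matrix."""
    return set(np.flatnonzero(adj[i]).tolist())

def markov_blanket(dag_adj, i):
    """MB(x_i) = parents + children + co-parents of children, for a DAG where
    dag_adj[p, c] = 1 encodes an edge p -> c."""
    parents  = set(np.flatnonzero(dag_adj[:, i]).tolist())
    children = set(np.flatnonzero(dag_adj[i, :]).tolist())
    co_parents = set()
    for c in children:
        co_parents |= set(np.flatnonzero(dag_adj[:, c]).tolist())
    return (parents | children | co_parents) - {i}

# Undirected chain x0 - x1 - x2:
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
print(neighbors(A, 1))        # {0, 2}

# DAG x0 -> x2 <- x1:
B = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]])
print(markov_blanket(B, 0))   # {1, 2}: child x2 and co-parent x1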
We introduce a neural view, which is another way of representing $G$, as shown in Fig. 2. The neural networks used are multi-layer perceptrons with appropriate input and output dimensions that represent graph connections in NGMs. We denote a NN with $L$ layers, weights $W = \{W_1, W_2, \cdots, W_L\}$ and biases $B = \{b_1, b_2, \cdots, b_L\}$ as $f_{W,B}(\cdot)$, with the non-linearity not mentioned explicitly.
We experimented with multiple non-linearities and found that ReLU fits well with our framework. Applying the NN to the input $X$ evaluates the following mathematical expression: $f_{W,B}(X) = \mathrm{ReLU}(W_L \cdot (\cdots (W_2 \cdot \mathrm{ReLU}(W_1 \cdot X + b_1) + b_2) \cdots) + b_L)$. The dimensions of the weights and biases are chosen such that the numbers of neural network input and output units are both equal to $D$, with the hidden layer dimension $H$ remaining a design choice. In experiments, we start with $H = 2D$ and subsequently adjust the dimensions based on the validation loss. The product of the weights of the neural network, $S_{nn} = \prod_{l=1}^{L} |W_l| = |W_1| \times |W_2| \times \cdots \times |W_L|$, where $|W|$ computes the absolute value of each element in $W$, gives us the path dependencies between the input and the output units. For shorthand, we denote $S_{nn} = \Pi_i |W_i|$. If $S_{nn}[x_i, x_o] = 0$, then the output unit $x_o$ is independent of the input unit $x_i$. Increasing the number of layers and the hidden dimensions of the NNs provides richer dependency function complexity.
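To make the path-product check concrete, here is a small sketch (assuming PyTorch; the two-layer architecture and dimensions are illustrative choices, not the released implementation) that builds an MLP with $D$ input/output units and hidden size $H = 2D$, and computes $S_{nn} = \Pi_i |W_i|$ to test input-output independence:

import torch
import torch.nn as nn

D, H = 5, 10                       # D features, hidden dimension H = 2*D
mlp = nn.Sequential(
    nn.Linear(D, H), nn.ReLU(),    # W1, b1
    nn.Linear(H, D),               # W2, b2
)

def path_product(model):
    """S_nn = |W1| x |W2| x ... x |WL|. S_nn[i, o] == 0 means no path connects
    input unit x_i to output unit x_o, i.e. x_o is independent of x_i."""
    S = None
    for layer in model:
        if isinstance(layer, nn.Linear):
            W = layer.weight.detach().abs()    # nn.Linear stores shape (out, in)
            S = W if S is None else W @ S
    return S.T                                 # index as S_nn[input, output]

S_nn = path_product(mlp)
print(S_nn.shape)                  # torch.Size([5, 5])
print(bool(S_nn[0, 3] == 0))       # True only if x_3 is independent of x_0
# After training with the structure loss of Sec. 3.2, thresholding S_nn
# (e.g. S_nn > tau) recovers the learned dependency structure.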
Initially, the NN is fully connected. Some of the connections will be dropped
during training, as the associated weights are zeroed out. We can view the
resulting NN as a glass-box model (indicating transparency), since we can discover
functional dependencies by analyzing paths from input to output.
3.2 Learning
Using the rich and compact functional representation achieved by the neural view, the learning task is to fit the neural networks to achieve the desired dependency structure $S$ (encoded by the input graph $G$), along with fitting the regression to the input data $X$.
Fig. 2: Neural view of NGMs: NN as a multitask learning architecture capturing non-linear dependencies for the features of the undirected graph in Fig. 1. If there is a path from an input feature to an output feature, that indicates a dependency between them. The dependency matrix between the input and output of the NN reduces to the matrix product operation $S_{nn} = \Pi_i |W_i| = |W_1| \times |W_2|$. Note that not all the zeroed-out weights of the MLP (in black dashed lines) are shown, for the sake of clarity.
Given the input data $X$, we want to learn the functions described by the NGM graphical view (Fig. 1). These can be obtained by solving the multiple regression problems shown in the neural view (Fig. 2). We achieve this by using the neural view as a multi-task learning framework. The goal is to find the set of parameters $W$ that minimizes the loss expressed as the distance from $X^k$ to $f_W(X^k)$ (averaged over all samples $k$), while maintaining the dependency structure provided in the input graph $G$. We can define the regression operation as follows:
$$\arg\min_{W,B} \; \sum_{k=1}^{M} \left\lVert X^k - f_{W,B}(X^k) \right\rVert_2^2 \quad \text{s.t.} \quad \left( \Pi_{i=1}^{L} |W_i| \right) \odot S^c = 0 \qquad (1)$$
where we introduced a soft graph constraint. Here, $S^c$ represents the complement of the matrix $S$, which essentially replaces 0 by 1 and vice-versa. $A \odot B$ denotes the Hadamard operator, which performs an element-wise matrix multiplication between the same-dimension matrices $A, B$. Including the constraint as a Lagrangian term with an $\ell_1$ penalty and a constant $\lambda$ that acts as a tradeoff between fitting the regression and matching the graph dependency structure, we get the following optimization formulation:
$$\arg\min_{W,B} \; \sum_{k=1}^{M} \left\lVert X^k - f_{W,B}(X^k) \right\rVert_2^2 + \lambda \log \left\lVert \left( \Pi_{i=1}^{L} |W_i| \right) \odot S^c \right\rVert_1 \qquad (2)$$
In our implementation, the individual weights are normalized using the $\ell_2$-norm before taking the product. We normalize the regression loss and the structure loss terms and apply appropriate scaling to the input data features.
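A minimal sketch of the resulting training objective (assuming PyTorch; the exact normalization granularity, the $\lambda$ value and the optimizer settings are our assumptions, not the released training code) could look as follows:

import torch
import torch.nn as nn

def structure_penalty(model, S_complement, eps=1e-8):
    """|| (prod_l |W_l|) Hadamard S^c ||_1, with each |W_l| l2-normalized
    before taking the product (as described above)."""
    prod = None
    for layer in model:
        if isinstance(layer, nn.Linear):
            W = layer.weight.abs()
            W = W / (W.norm(p=2) + eps)           # l2-normalize the weight matrix
            prod = W if prod is None else W @ prod
    return (prod.T * S_complement).sum()           # both factors are non-negative => l1

def ngm_loss(model, X, S_complement, lam):
    """Regression term of Eq. (2) plus lambda * log of the soft-graph penalty."""
    mse = ((X - model(X)) ** 2).sum(dim=1).mean()
    return mse + lam * torch.log(structure_penalty(model, S_complement) + 1e-8)

# Usage sketch, with mlp and S_complement (D x D, 1 exactly where G forbids a
# dependency) defined elsewhere, and X an M x D float tensor:
# optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)
# loss = ngm_loss(mlp, X, S_complement, lam=0.1)
# optimizer.zero_grad(); loss.backward(); optimizer.step()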
Proximal Initialization strategy: To get a good initialization for the NN parameters $W$ and $\lambda$, we implement the following procedure. We solve