Learning Multivariate CDFs and Copulas using Tensor
Factorization
Magda Amiridi, Nicholas D. Sidiropoulos
University of Virginia
{ma7bx,nikos}@virginia.edu
October 14, 2022
Abstract
Learning the multivariate distribution of data is a core challenge in statistics and machine learning.
Traditional methods aim for the probability density function (PDF) and are limited by the curse of
dimensionality. Modern neural methods are mostly based on black-box models, lacking identifiability
guarantees. In this work, we aim to learn multivariate cumulative distribution functions (CDFs), as they
can handle mixed random variables, allow efficient ‘box’ probability evaluation, and have the potential to
overcome local sample scarcity owing to their cumulative nature. We show that any grid-sampled version
of a joint CDF of mixed random variables admits a universal representation as a naive Bayes model via
the Canonical Polyadic (tensor-rank) decomposition. By introducing a low-rank model, either directly in
the raw data domain, or indirectly in a transformed (Copula) domain, the resulting model affords efficient
sampling, closed-form inference and uncertainty quantification, and comes with uniqueness guarantees under relatively mild conditions. We demonstrate the superior performance of the proposed model on several synthetic and real datasets and in applications including regression, sampling, and data imputation. Interestingly, our experiments with real data show that it is possible to obtain better density/mass estimates indirectly, via a low-rank CDF model, than via a low-rank PDF/PMF model.
1 Introduction
Modeling complex data distributions is a task of central interest in statistics and machine learning. Given an accurate and tractable estimate of the joint distribution function, various kinds of statistical tasks can follow naturally, including fast sampling, tractable computation of expectations, and derivation of conditional and marginal densities. To list a few recent applications, such models have demonstrated success in generating high-fidelity images [23, 5], realistic speech synthesis [18, 30], semi-supervised learning [29], reinforcement learning [31], and detecting adversarial data [10]. The purpose of this work is to introduce a new class of universal estimators for multivariate distributions based on CDFs and the Canonical Polyadic (tensor-rank) decomposition, and to demonstrate their direct applicability and efficiency in missing data imputation, sampling, density estimation, and regression tasks.
Distribution modeling is often studied from the perspective of non-parametric PDF estimation, in which histogram, kernel [35, 34], and orthogonal series methods [13, 11, 41] are popular approaches with well-understood statistical properties. Although these estimators are data-driven and do not impose restrictive parameterizations on the form of the data distribution, they usually perform poorly on high-dimensional datasets because of the "curse of dimensionality". Currently, the most prominent methods for modeling multivariate distributions rely on neural networks [8, 32, 25, 14]. Such methods are capable of modeling higher-dimensional data such as images and sound, either implicitly or explicitly. However, most of them are black-box models [6] without any identifiability guarantees, lacking the simplicity and interpretability of the classical methods. Additionally, they lack the ability to efficiently compute expectations, marginalize over subsets of variables, and evaluate conditionals, which is limiting in many critical machine learning applications.
This paper attempts to bridge the gap between principled traditional non-parametric statistics and the scalability benefits of neural-based models by developing two variants of a rank-constrained estimator for multivariate CDFs based on tensor rank decomposition, known as the Canonical Polyadic (CP) or CANDECOMP/PARAFAC decomposition [17, 16, 39]. Our starting point is that any grid-sampled version of an $N$-dimensional CDF is an $N$-way cumulative probability tensor $\widehat{\underline{\mathbf{F}}}$, evaluated on a predefined $I_1 \times I_2 \times \cdots \times I_N$ grid $\mathcal{G}$. We will refer to $\widehat{\underline{\mathbf{F}}}$ as the grid-sampled CDF tensor. The evaluation grid $\mathcal{G}$ describes the (finite) levels/cut-offs of the CDF for every dimension and can be taken to be the Cartesian product of the training samples in each dimension, or reduced via scalar or vector $k$-means. Each element of $\widehat{\underline{\mathbf{F}}}$ can be easily estimated via sample averaging from realizations of the random vector of interest.
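To make the sample-averaging step concrete, the sketch below (our own Python/NumPy illustration, not code from the paper; the function name `empirical_cdf_tensor` and its interface are hypothetical) builds the empirical CDF tensor entry-wise as $\widehat{\underline{\mathbf{F}}}(i_1,\ldots,i_N) = \frac{1}{M}\sum_{m=1}^{M} \prod_{n=1}^{N} \mathbb{1}\{\mathbf{x}_m(n) \le g_n(i_n)\}$. Note that each sample contributes a rank-one (outer-product) indicator tensor, which foreshadows the rank-one measurement view of the indirect formulation discussed below.

```python
import numpy as np

def empirical_cdf_tensor(X, grid):
    """Grid-sampled empirical CDF tensor via sample averaging.

    X    : (M, N) array holding M realizations of the random vector.
    grid : list of N 1-D arrays; grid[n] holds the I_n cut-off levels
           for dimension n.
    Entry (i_1, ..., i_N) is the fraction of samples falling in the
    semi-infinite box (-inf, grid[0][i_1]] x ... x (-inf, grid[N-1][i_N]].
    """
    M, N = X.shape
    # indicators[n][m, i] = 1{ x_m(n) <= grid[n][i] }, shape (M, I_n)
    indicators = [(X[:, [n]] <= grid[n][None, :]).astype(float)
                  for n in range(N)]
    F = np.zeros([len(g) for g in grid])
    for m in range(M):
        # each sample contributes a rank-one (outer-product) indicator tensor
        rank1 = indicators[0][m]
        for n in range(1, N):
            rank1 = np.multiply.outer(rank1, indicators[n][m])
        F += rank1
    return F / M  # forming the full tensor is feasible only for modest N

# example: 500 samples of a 3-dimensional Gaussian, 8 levels per dimension
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
grid = [np.linspace(-2.0, 2.0, 8) for _ in range(3)]
F_hat = empirical_cdf_tensor(X, grid)
```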
Any tensor can be decomposed as a sum of $R$ rank-one tensors, for high enough but finite $R$ [39]. To maintain direct control over the number of model parameters as the dimensionality $N$ grows, noting that the full CDF tensor entails $O\left(\prod_{n=1}^{N} I_n\right)$ elements, we introduce a reconstructed approximation of $\widehat{\underline{\mathbf{F}}}$ using a rank-$R$ $N$-dimensional parametrization $\underline{\mathbf{F}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. Such a parametrization has far fewer degrees of freedom, determined by the rank and the size of the tensor. Tensor decompositions are a powerful tool for extracting meaningful latent structure from given data and can encode the salient characteristics of the multivariate grid-sampled CDF tensor $\widehat{\underline{\mathbf{F}}}$. We seek to minimize the squared loss between $\underline{\mathbf{F}}$ and the empirical CDF tensor $\widehat{\underline{\mathbf{F}}}$. We formulate this task both directly, i.e., by forming and decomposing $\widehat{\underline{\mathbf{F}}}$, and indirectly, as a hidden tensor factorization problem, i.e., the pertinent parts of the latent factors of the CPD model are updated from rank-one measurements of the "hidden" discretized CDF tensor $\widehat{\underline{\mathbf{F}}}$. From an algorithmic perspective, we propose alternating optimization, where each matrix factor is updated using ADMM, as well as stochastic optimization using Adam to allow scaling to larger datasets.
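As a minimal illustration of the direct formulation (forming and decomposing $\widehat{\underline{\mathbf{F}}}$), the sketch below fits a rank-$R$ CP model by plain alternating least squares, assuming the TensorLy library and reusing `F_hat` from the previous sketch. This is an unconstrained stand-in for the paper's ADMM-based factor updates and Adam-based stochastic variant, which are not reproduced here; the rank `R` is an assumed hyperparameter.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac
from tensorly.cp_tensor import cp_to_tensor

R = 5  # assumed rank; in practice selected on held-out data
# Fit  min || F_hat - [[lambda; A_1, ..., A_N]] ||_F^2  by alternating
# least squares (no monotonicity/probability constraints in this sketch).
cp = parafac(tl.tensor(F_hat), rank=R, n_iter_max=200, tol=1e-8,
             normalize_factors=True)
weights, factors = cp  # lambda in R^R, and N factor matrices A_n (I_n x R)
rel_err = np.linalg.norm(cp_to_tensor(cp) - F_hat) / np.linalg.norm(F_hat)
```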
1.1 Contributions
In summary, the present paper shows that:
• Every multivariate CDF evaluated on a predefined grid admits a compact representation via a latent variable naive Bayes model, with a bounded number of hidden states equal to the rank of the grid-sampled CDF tensor.

• This affords easy sampling, marginalization (by discarding the subset of factor matrices corresponding to the variables we are not interested in), derivation of conditional distributions and expectations, and uncertainty quantification, bypassing the curse of dimensionality.

• The proposed model also affords direct and efficient estimates of (possibly semi-infinite) "box" probabilities, which is important for classification tasks; see the sketch after this list. Multivariate PDF estimators, on the other hand, require multidimensional integration (analytical, numerical, or sampling-based Monte Carlo) to estimate box probabilities, which is cumbersome and often intractable.

• At the same time, the proposed rank-constrained estimator is unassuming of the structure of the data (thus offering greater expressive power) and identifiable under relatively mild rank conditions; see [39].

• On the experimental side, our results indicate that, perhaps surprisingly, estimating the grid-sampled CDF and then deriving a PDF estimate from it yields improved performance relative to direct PDF estimation in several machine learning applications of interest. In addition, the proposed non-parametric model in the Copula domain outperforms state-of-the-art Copula-based baselines.
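To make the box-probability and sampling claims concrete: under a CP model $\underline{\mathbf{F}} = \sum_{r=1}^{R} \boldsymbol{\lambda}(r)\, \mathbf{A}_1(:, r) \circ \cdots \circ \mathbf{A}_N(:, r)$, the $2^N$-corner inclusion-exclusion sum for a grid-aligned box collapses into a single product of per-dimension factor differences. The sketch below is our own Python/NumPy illustration (function names are hypothetical); it assumes the factor columns are valid conditional CDFs, i.e., monotone nondecreasing from 0 to 1, as the model constraints are meant to enforce.

```python
import numpy as np

def box_probability(weights, factors, lo_idx, hi_idx):
    """P(a < X <= b) for a grid-aligned box under a CP CDF model.

    Inclusion-exclusion over the 2^N corners factorizes, thanks to the
    CP structure, into
        P = sum_r lambda_r * prod_n (A_n[hi_n, r] - A_n[lo_n, r]),
    where lo_idx[n] = -1 encodes a_n = -inf (semi-infinite box), in
    which case the lower CDF value is 0.
    """
    prod = np.array(weights, dtype=float).copy()
    for n, A in enumerate(factors):
        lo = A[lo_idx[n], :] if lo_idx[n] >= 0 else 0.0
        prod *= A[hi_idx[n], :] - lo
    return float(prod.sum())

def sample_indices(weights, factors, n_samples, rng=np.random.default_rng()):
    """Ancestral sampling via the naive Bayes reading of the model: draw
    the hidden state r with probability lambda_r, then invert each
    conditional CDF A_n[:, r] on the grid (returns grid indices)."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    out = np.empty((n_samples, len(factors)), dtype=int)
    for m in range(n_samples):
        r = rng.choice(len(p), p=p)
        for n, A in enumerate(factors):
            # smallest grid index whose CDF value exceeds a uniform draw;
            # assumes A[:, r] is nondecreasing and reaches 1 at the top level
            out[m, n] = min(np.searchsorted(A[:, r], rng.uniform()),
                            A.shape[0] - 1)
    return out
```

Note that evaluating a box probability this way costs $O(NR)$ operations rather than $2^N$ CDF evaluations, which is the efficiency claimed in the third bullet.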
2 Background
2.1 Related work
Unsupervised learning of multivariate distributions has seen tremendous progress over recent years, particularly for PDF modeling. Classical methods in the literature include kernel density estimation (KDE) [34, 35], histogram density estimation (HDE), and orthogonal series density estimation (OSDE) [11, 13, 41]. All of the aforementioned methods, however, are inefficient for datasets of higher dimensionality. Neural network-based approaches for distribution estimation have recently shown promising results in high-dimensional problems. Auto-regressive (AR) models such as [30, 36] decompose the distribution into a product of conditionals, where each conditional is modeled by a parametric distribution (e.g., a Gaussian or mixture of Gaussians in the continuous case). Normalizing flows (NFs) [33] represent a density value through an invertible transformation of latent variables with known density.
On the downside, AR models are naturally sensitive to the order of the variables/features, while the strong network constraints of NFs can restrict model expressiveness. Most importantly, AR and NF models do not yield an explicit estimate of the density function; they are 'oracles' that can be queried to output an estimate of the density at any given input point, or to generate samples of the sought density, and the difference is important. Therefore, given a trained model, calculating expectations, marginal and conditional distributions is not straightforward with these methods. The same holds for generative adversarial networks (GANs) [14], as they do not allow for likelihood evaluation on held-out data. Furthermore, deep multivariate CDF-based models such as [6] do not address model identifiability, and cannot guarantee recovery of the true latent factors that generated the observed samples.
Tensor modeling of distributions: Tensor models for estimating distributions have been proposed for both discrete and continuous variables. In the discrete case, the work in [22] showed that any joint PMF can be represented as an $N$-way probability tensor and, by introducing a CPD model, every multivariate PMF can be represented by a latent variable naive Bayes model with a finite number of latent states. For continuous random vectors, the joint PDF can no longer be directly represented by a tensor. Earlier work [40] has dealt with latent variable models, but not general distributions. In contrast to prior work [40, 21], we make no assumptions regarding a multivariate mixture model of non-parametric product distributions in this paper. Another line of work (see [2, 3]) proposed a "universal" approach for smooth, compactly supported multivariate densities by representing the underlying density in terms of a finite tensor of leading Fourier coefficients. Our work requires less restrictive assumptions, as it also works with discrete or mixed random variables, of possibly unbounded support.
2.2 Notation, Definitions, and Preliminaries
We use the symbols $\mathbf{x}$, $\mathbf{X}$, $\underline{\mathbf{X}}$ for vectors, matrices, and tensors, respectively. We use the notation $\mathbf{x}(n)$, $\mathbf{X}(:, n)$, $\underline{\mathbf{X}}(:, :, n)$ to refer to a particular element of a vector, a column of a matrix, and a slab of a tensor. The symbols $\circ$, $\otimes$, $\circledast$, $\odot$ denote the outer, Kronecker, Hadamard, and Khatri-Rao (column-wise Kronecker) products, respectively. The vectorization operator is denoted as $\mathrm{vec}(\mathbf{X})$, $\mathrm{vec}(\underline{\mathbf{X}})$ for a matrix and a tensor, respectively [39]. Additionally, $\mathrm{diag}(\mathbf{x}) \in \mathbb{R}^{I \times I}$ denotes the diagonal matrix with the elements of vector $\mathbf{x} \in \mathbb{R}^{I}$ on its diagonal. The symbols $\|\mathbf{x}\|_1$, $\|\mathbf{x}\|_2$, $\|\mathbf{X}\|_F$, and $d_{TV}$ correspond to the $L_1$ norm, $L_2$ norm, Frobenius norm, and total variation distance. The total variation distance between distributions $p$ and $q$ is defined as $d_{TV}(p, q) = \frac{1}{2}\|p - q\|_1$.

Given an $N$-dimensional random vector $\mathbf{X} := [X_1, \ldots, X_N]^T$, $\mathbf{X} \sim F_{\mathbf{X}}$ will denote that the random vector $\mathbf{X}$ follows the distribution $F_{\mathbf{X}}$. $\mathbb{1}(A)$ is the indicator function of event $A$, i.e., it is $1$ if and only if $A$ is true. The set of integers $\{1, \ldots, N\}$ is denoted as $[N]$. Given $M$ data samples, $\mathcal{D} = \{\mathbf{x}_m\}_{m=1}^{M}$ denotes the given dataset.
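For readers less familiar with the column-wise Kronecker product, here is a small NumPy illustration of the four products and the total variation distance defined above (our own example, not from the paper):

```python
import numpy as np

A = np.arange(6.0).reshape(3, 2)   # 3 x 2
B = np.arange(8.0).reshape(4, 2)   # 4 x 2

outer    = np.multiply.outer(A[:, 0], B[:, 0])   # outer product: 3 x 4
kron     = np.kron(A, B)                         # Kronecker: 12 x 4
hadamard = A * A                                 # Hadamard (elementwise): 3 x 2
# Khatri-Rao: Kronecker applied column by column -> (3*4) x 2
khatri_rao = np.column_stack(
    [np.kron(A[:, r], B[:, r]) for r in range(A.shape[1])])

def d_tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()
```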