Learning Multivariate CDFs and Copulas using Tensor
Factorization
Magda Amiridi, Nicholas D. Sidiropoulos
University of Virginia
{ma7bx,nikos}@virginia.edu
October 14, 2022
Abstract
Learning the multivariate distribution of data is a core challenge in statistics and machine learning.
Traditional methods aim for the probability density function (PDF) and are limited by the curse of
dimensionality. Modern neural methods are mostly based on black-box models, lacking identifiability
guarantees. In this work, we aim to learn multivariate cumulative distribution functions (CDFs), as they
can handle mixed random variables, allow efficient ‘box’ probability evaluation, and have the potential to
overcome local sample scarcity owing to their cumulative nature. We show that any grid-sampled version
of a joint CDF of mixed random variables admits a universal representation as a naive Bayes model via
the Canonical Polyadic (tensor-rank) decomposition. By introducing a low-rank model, either directly in
the raw data domain, or indirectly in a transformed (Copula) domain, the resulting model affords efficient
sampling, closed-form inference and uncertainty quantification, and comes with uniqueness guarantees under relatively mild conditions. We demonstrate the superior performance of the proposed model on several synthetic and real datasets and in applications including regression, sampling, and data imputation. Interestingly, our experiments with real data show that it is possible to obtain better density/mass estimates indirectly, via a low-rank CDF model, than via a low-rank PDF/PMF model.
1 Introduction
Modeling complex data distributions is a task of central interest in statistics and machine learning. Given an accurate and tractable estimate of the joint distribution function, various kinds of statistical tasks can follow naturally, including fast sampling, tractable computation of expectations, and derivation of conditional and marginal densities. To list a few recent applications, such models have demonstrated success in generating high-fidelity images [23, 5], realistic speech synthesis [18, 30], semi-supervised learning [29], reinforcement learning [31], and detecting adversarial data [10]. The purpose of this work is to introduce a new class of universal estimators for multivariate distributions based on CDFs and the Canonical Polyadic (tensor-rank) decomposition, and to demonstrate their direct applicability and efficiency in missing data imputation, sampling, density estimation, and regression tasks.
Distribution modeling is often studied from the perspective of non-parametric PDF estimation, in which histogram, kernel [35, 34], and orthogonal series methods [13, 11, 41] are popular approaches with well-understood statistical properties. Although these estimators are data-driven and do not impose restrictive parameterizations on the form of the data distribution, they usually perform poorly on high-dimensional datasets because of the "curse of dimensionality". Currently, the most prominent methods for modeling multivariate distributions rely on neural networks [8, 32, 25, 14]. Such methods are capable of modeling higher-dimensional data such as images and sound, either implicitly or explicitly. However, most of them are black-box models [6] without any identifiability guarantees, lacking the simplicity and interpretability of the classical methods. Additionally, they lack the ability to efficiently compute expectations, marginalize over subsets of variables, and evaluate conditionals, which is limiting in many critical machine learning applications.
This paper attempts to bridge the gap between principled traditional non-parametric statistics and the scalability benefits of neural-based models by developing two variants of a rank-constrained estimator for multivariate CDFs based on tensor rank decomposition, known as the Canonical Polyadic (CP) or CANDECOMP/PARAFAC decomposition [17, 16, 39]. Our starting point is that any grid-sampled version of an $N$-dimensional CDF is an $N$-way cumulative probability tensor $\widehat{\underline{\mathbf{F}}}$, evaluated on a predefined $I_1 \times I_2 \times \cdots \times I_N$ grid $\mathcal{G}$. We will refer to $\widehat{\underline{\mathbf{F}}}$ as the grid-sampled CDF tensor. The evaluation grid $\mathcal{G}$ describes the (finite) levels/cut-offs of the CDF for every dimension and can be taken to be the Cartesian product of the training samples in each dimension, or reduced via scalar or vector $k$-means. Each element of $\widehat{\underline{\mathbf{F}}}$ can be easily estimated via sample averaging from realizations of the random vector of interest.
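To make the sample-averaging step concrete, the sketch below (our own Python/NumPy illustration, not code from the paper; the function name `empirical_cdf_tensor` and its interface are hypothetical) builds the empirical CDF tensor entry-wise as $\widehat{\underline{\mathbf{F}}}(i_1,\ldots,i_N) = \frac{1}{M}\sum_{m=1}^{M} \prod_{n=1}^{N} \mathbb{1}\{\mathbf{x}_m(n) \le g_n(i_n)\}$. Note that each sample contributes a rank-one (outer-product) indicator tensor, which foreshadows the rank-one measurement view of the indirect formulation discussed below.

```python
import numpy as np

def empirical_cdf_tensor(X, grid):
    """Grid-sampled empirical CDF tensor via sample averaging.

    X    : (M, N) array holding M realizations of the random vector.
    grid : list of N 1-D arrays; grid[n] holds the I_n cut-off levels
           for dimension n.
    Entry (i_1, ..., i_N) is the fraction of samples falling in the
    semi-infinite box (-inf, grid[0][i_1]] x ... x (-inf, grid[N-1][i_N]].
    """
    M, N = X.shape
    # indicators[n][m, i] = 1{ x_m(n) <= grid[n][i] }, shape (M, I_n)
    indicators = [(X[:, [n]] <= grid[n][None, :]).astype(float)
                  for n in range(N)]
    F = np.zeros([len(g) for g in grid])
    for m in range(M):
        # each sample contributes a rank-one (outer-product) indicator tensor
        rank1 = indicators[0][m]
        for n in range(1, N):
            rank1 = np.multiply.outer(rank1, indicators[n][m])
        F += rank1
    return F / M  # forming the full tensor is feasible only for modest N

# example: 500 samples of a 3-dimensional Gaussian, 8 levels per dimension
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
grid = [np.linspace(-2.0, 2.0, 8) for _ in range(3)]
F_hat = empirical_cdf_tensor(X, grid)
```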
Any tensor can be decomposed as a sum of $R$ rank-one tensors, for high enough but finite $R$ [39]. To maintain direct control over the number of model parameters as the dimensionality $N$ grows, noting that the full CDF tensor entails $O\left(\prod_{n=1}^{N} I_n\right)$ elements, we introduce a reconstructed approximation of $\widehat{\underline{\mathbf{F}}}$ using a rank-$R$ $N$-dimensional parametrization $\underline{\mathbf{F}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. Such a parametrization has far fewer degrees of freedom, determined by the rank and the size of the tensor. Tensor decompositions are a powerful tool for extracting meaningful latent structure from given data and can encode the salient characteristics of the multivariate grid-sampled CDF tensor $\widehat{\underline{\mathbf{F}}}$. We seek to minimize the squared loss between $\underline{\mathbf{F}}$ and the empirical CDF tensor $\widehat{\underline{\mathbf{F}}}$. We formulate this task both directly, i.e., by forming and decomposing $\widehat{\underline{\mathbf{F}}}$, and indirectly, as a hidden tensor factorization problem, i.e., the pertinent parts of the latent factors of the CPD model are updated from rank-one measurements of the "hidden" discretized CDF tensor $\widehat{\underline{\mathbf{F}}}$. From an algorithmic perspective, we propose alternating optimization, where each matrix factor is updated using ADMM, as well as stochastic optimization using Adam to allow scaling to larger datasets.
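As a minimal illustration of the direct formulation (forming and decomposing $\widehat{\underline{\mathbf{F}}}$), the sketch below fits a rank-$R$ CP model by plain alternating least squares, assuming the TensorLy library and reusing `F_hat` from the previous sketch. This is an unconstrained stand-in for the paper's ADMM-based factor updates and Adam-based stochastic variant, which are not reproduced here; the rank `R` is an assumed hyperparameter.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac
from tensorly.cp_tensor import cp_to_tensor

R = 5  # assumed rank; in practice selected on held-out data
# Fit  min || F_hat - [[lambda; A_1, ..., A_N]] ||_F^2  by alternating
# least squares (no monotonicity/probability constraints in this sketch).
cp = parafac(tl.tensor(F_hat), rank=R, n_iter_max=200, tol=1e-8,
             normalize_factors=True)
weights, factors = cp  # lambda in R^R, and N factor matrices A_n (I_n x R)
rel_err = np.linalg.norm(cp_to_tensor(cp) - F_hat) / np.linalg.norm(F_hat)
```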
1.1 Contributions
In summary, the present paper shows that:
• Every multivariate CDF evaluated on a predefined grid admits a compact representation via a latent variable naive Bayes model, with a bounded number of hidden states equal to the rank of the grid-sampled CDF tensor.

• This affords easy sampling, marginalization (by discarding the subset of factor matrices corresponding to the variables we are not interested in), derivation of conditional distributions and expectations, and uncertainty quantification, bypassing the curse of dimensionality.

• The proposed model also affords direct and efficient estimates of (possibly semi-infinite) "box" probabilities, which is important for classification tasks; see the sketch after this list. Multivariate PDF estimators, on the other hand, require multidimensional integration (analytical, numerical, or sampling-based Monte Carlo) to estimate box probabilities, which is cumbersome and often intractable.

• At the same time, the proposed rank-constrained estimator is unassuming of the structure of the data (thus offering greater expressive power) and identifiable under relatively mild rank conditions; see [39].

• On the experimental side, our results indicate that, perhaps surprisingly, estimating the grid-sampled CDF and then deriving a PDF estimate from it yields improved performance relative to direct PDF estimation in several machine learning applications of interest. In addition, the proposed non-parametric model in the Copula domain outperforms state-of-the-art Copula-based baselines.
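To make the box-probability and sampling claims concrete: under a CP model $\underline{\mathbf{F}} = \sum_{r=1}^{R} \boldsymbol{\lambda}(r)\, \mathbf{A}_1(:, r) \circ \cdots \circ \mathbf{A}_N(:, r)$, the $2^N$-corner inclusion-exclusion sum for a grid-aligned box collapses into a single product of per-dimension factor differences. The sketch below is our own Python/NumPy illustration (function names are hypothetical); it assumes the factor columns are valid conditional CDFs, i.e., monotone nondecreasing from 0 to 1, as the model constraints are meant to enforce.

```python
import numpy as np

def box_probability(weights, factors, lo_idx, hi_idx):
    """P(a < X <= b) for a grid-aligned box under a CP CDF model.

    Inclusion-exclusion over the 2^N corners factorizes, thanks to the
    CP structure, into
        P = sum_r lambda_r * prod_n (A_n[hi_n, r] - A_n[lo_n, r]),
    where lo_idx[n] = -1 encodes a_n = -inf (semi-infinite box), in
    which case the lower CDF value is 0.
    """
    prod = np.array(weights, dtype=float).copy()
    for n, A in enumerate(factors):
        lo = A[lo_idx[n], :] if lo_idx[n] >= 0 else 0.0
        prod *= A[hi_idx[n], :] - lo
    return float(prod.sum())

def sample_indices(weights, factors, n_samples, rng=np.random.default_rng()):
    """Ancestral sampling via the naive Bayes reading of the model: draw
    the hidden state r with probability lambda_r, then invert each
    conditional CDF A_n[:, r] on the grid (returns grid indices)."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    out = np.empty((n_samples, len(factors)), dtype=int)
    for m in range(n_samples):
        r = rng.choice(len(p), p=p)
        for n, A in enumerate(factors):
            # smallest grid index whose CDF value exceeds a uniform draw;
            # assumes A[:, r] is nondecreasing and reaches 1 at the top level
            out[m, n] = min(np.searchsorted(A[:, r], rng.uniform()),
                            A.shape[0] - 1)
    return out
```

Note that evaluating a box probability this way costs $O(NR)$ operations rather than $2^N$ CDF evaluations, which is the efficiency claimed in the third bullet.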
2 Background
2.1 Related work
Unsupervised learning of multivariate distributions has seen tremendous progress over recent years, particularly for PDF modeling. Classical methods in the literature include kernel density estimation (KDE) [34, 35], histogram density estimation (HDE), and orthogonal series density estimation (OSDE) [11, 13, 41]. All of the aforementioned methods, however, are inefficient for datasets of higher dimensionality. Neural network-based approaches for distribution estimation have recently shown promising results in high-dimensional problems. Auto-regressive (AR) models such as [30, 36] decompose the distribution into a product of conditionals, where each conditional is modeled by a parametric distribution (e.g., a Gaussian or mixture of Gaussians in the continuous case). Normalizing flows (NFs) [33] represent a density value through an invertible transformation of latent variables with known density.
On the downside, AR models are naturally sensitive to the order of the variables/features, while the strong network constraints of NFs can restrict model expressiveness. Most importantly, AR and NF models do not yield an explicit estimate of the density function; they are 'oracles' that can be queried to output an estimate of the density at any given input point, or to generate samples of the sought density, and the difference is important. Therefore, given a trained model, calculating expectations, marginal and conditional distributions is not straightforward with these methods. The same holds for generative adversarial networks (GANs) [14], as they do not allow for likelihood evaluation on held-out data. Furthermore, deep multivariate CDF-based models such as [6] do not address model identifiability, and cannot guarantee recovery of the true latent factors that generated the observed samples.
Tensor modeling of distributions: Tensor models for estimating distributions have been proposed for both discrete and continuous variables. In the discrete case, the work in [22] showed that any joint PMF can be represented as an $N$-way probability tensor and, by introducing a CPD model, every multivariate PMF can be represented by a latent variable naive Bayes model with a finite number of latent states. For continuous random vectors, the joint PDF can no longer be directly represented by a tensor. Earlier work [40] has dealt with latent variable models, but not general distributions. In contrast to prior work [40, 21], we make no assumptions regarding a multivariate mixture model of non-parametric product distributions in this paper. Another line of work (see [2, 3]) proposed a "universal" approach for smooth, compactly supported multivariate densities by representing the underlying density in terms of a finite tensor of leading Fourier coefficients. Our work requires less restrictive assumptions, as it also works with discrete or mixed random variables, of possibly unbounded support.
2.2 Notation, Definitions, and Preliminaries
We use the symbols $\mathbf{x}$, $\mathbf{X}$, $\underline{\mathbf{X}}$ for vectors, matrices, and tensors, respectively. We use the notation $\mathbf{x}(n)$, $\mathbf{X}(:, n)$, $\underline{\mathbf{X}}(:, :, n)$ to refer to a particular element of a vector, a column of a matrix, and a slab of a tensor. The symbols $\circ$, $\otimes$, $\circledast$, $\odot$ denote the outer, Kronecker, Hadamard, and Khatri-Rao (column-wise Kronecker) products, respectively. The vectorization operator is denoted as $\mathrm{vec}(\mathbf{X})$, $\mathrm{vec}(\underline{\mathbf{X}})$ for a matrix and a tensor, respectively [39]. Additionally, $\mathrm{diag}(\mathbf{x}) \in \mathbb{R}^{I \times I}$ denotes the diagonal matrix with the elements of vector $\mathbf{x} \in \mathbb{R}^{I}$ on its diagonal. The symbols $\|\mathbf{x}\|_1$, $\|\mathbf{x}\|_2$, $\|\mathbf{X}\|_F$, and $d_{TV}$ correspond to the $L_1$ norm, $L_2$ norm, Frobenius norm, and total variation distance. The total variation distance between distributions $p$ and $q$ is defined as $d_{TV}(p, q) = \frac{1}{2}\|p - q\|_1$.

Given an $N$-dimensional random vector $\mathbf{X} := [X_1, \ldots, X_N]^T$, $\mathbf{X} \sim F_{\mathbf{X}}$ will denote that the random vector $\mathbf{X}$ follows the distribution $F_{\mathbf{X}}$. $\mathbb{1}(A)$ is the indicator function of event $A$, i.e., it is $1$ if and only if $A$ is true. The set of integers $\{1, \ldots, N\}$ is denoted as $[N]$. Given $M$ data samples, $\mathcal{D} = \{\mathbf{x}_m\}_{m=1}^{M}$ denotes the given dataset.
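For readers less familiar with the column-wise Kronecker product, here is a small NumPy illustration of the four products and the total variation distance defined above (our own example, not from the paper):

```python
import numpy as np

A = np.arange(6.0).reshape(3, 2)   # 3 x 2
B = np.arange(8.0).reshape(4, 2)   # 4 x 2

outer    = np.multiply.outer(A[:, 0], B[:, 0])   # outer product: 3 x 4
kron     = np.kron(A, B)                         # Kronecker: 12 x 4
hadamard = A * A                                 # Hadamard (elementwise): 3 x 2
# Khatri-Rao: Kronecker applied column by column -> (3*4) x 2
khatri_rao = np.column_stack(
    [np.kron(A[:, r], B[:, r]) for r in range(A.shape[1])])

def d_tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()
```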