Learning from aggregated data with a maximum entropy
model
Alexandre Gilotte a.gilotte@criteo.com
Criteo
AI Lab
Paris, France
Ahmed Ben Yahmed ahmed.ben yahmed@ens-paris-saclay.fr
Criteo
AI Lab
France
David Rohde d.rohde@criteo.com
Criteo
AI Lab
France
Abstract
Aggregating a dataset, then injecting some noise, is a simple and common way to
release differentially private data. However, aggregated data, even without noise, are not an
appropriate input for machine learning classifiers. In this work, we show how a new model,
similar to a logistic regression, may be learned from aggregated data only, by approximating
the unobserved feature distribution with a maximum entropy hypothesis. The resulting
model is a Markov Random Field (MRF), and we detail how to apply, modify and scale an
MRF training algorithm to our setting. Finally, we present empirical evidence on several
public datasets that the model learned this way can achieve performance comparable to
that of a logistic model trained with the full unaggregated data.
Keywords: aggregated data, differential privacy, Markov random field, logistic regression,
ad click prediction
1. Introduction
1.1 Learning from aggregated data
A recent societal trend is to provide users with more privacy protection, and this leads
to rethinking how user data are collected and shared in order to offer meaningful and provable
privacy guarantees. One popular and simple way to provide such guarantees is to aggregate
the dataset, discarding the individual records early in the data processing. The resulting
aggregated dataset is made of a list of contingency tables, counting the examples of the original
record-level dataset along several projections. As an example, Table 1 shows a toy dataset
made of individual records, and Table 2 contains the "aggregated data" computed from this
toy dataset by projecting on each pair of features¹.
1. A good reason to aggregate the data is that it is easy to make aggregated data differentially private, by
adding some Laplace or Gaussian noise (Dwork et al., 2014). The resulting noisy aggregated data may
then be shared or published without damaging the privacy of the users.
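As an illustration of this mechanism (our own sketch, not part of the original text), adding Laplace noise to a vector of aggregated counts could look as follows; the sensitivity of a counting query is 1 when each user contributes at most one record, and `epsilon` is a hypothetical privacy budget:

```python
import numpy as np

def laplace_noisy_counts(counts, epsilon, sensitivity=1.0, seed=None):
    """Perturb aggregated counts with the Laplace mechanism (noise scale = sensitivity / epsilon)."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    return counts + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=counts.shape)

# Example: the "Examples count" column of the first table of Table 2, with epsilon = 1.
print(laplace_noisy_counts([1, 2, 1, 1], epsilon=1.0, seed=0))
```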
There is however a strong limitation to the usefulness of aggregated data: the vast
majority of high-performance machine learning algorithms, from logistic regression to deep
learning, require access to disaggregated data (Table 1) and cannot be trained from aggregated
data (Table 2).
One notable exception is the "Naive Bayes classifier" (Lewis, 1998), which requires only
aggregated data. Specifically, it requires, for each feature, one contingency table aggregating
on this feature and the label. Naive Bayes heavily relies on an independence assumption
between the features of the dataset, and is known for often providing poor-quality models
on real-world datasets. We also note that the aggregated data of Table 2 allow learning
about the correlations between pairs of features, information that is lost in the Naive
Bayes model.
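For comparison, here is a minimal sketch (our own illustration, with hypothetical names) of how a Naive Bayes classifier can be fit from such per-feature aggregates alone, i.e. from one table per feature giving, for each modality, the example count and the count of positive labels:

```python
import math

def fit_naive_bayes(tables, n, n_pos, alpha=1.0):
    """tables: {feature: {modality: (count, positives)}}; n, n_pos: total and positive examples."""
    logp = {}
    for f, rows in tables.items():
        m = len(rows)  # number of modalities of feature f (for Laplace smoothing)
        logp[f] = {
            mod: (
                math.log((pos + alpha) / (n_pos + alpha * m)),            # log P(x_f = mod | Y = 1)
                math.log((cnt - pos + alpha) / (n - n_pos + alpha * m)),  # log P(x_f = mod | Y = 0)
            )
            for mod, (cnt, pos) in rows.items()
        }
    prior_log_odds = math.log(n_pos) - math.log(n - n_pos)
    return logp, prior_log_odds

def predict_proba(x, logp, prior_log_odds):
    """P(Y=1 | x) under the feature-independence assumption; x: {feature: modality}."""
    log_odds = prior_log_odds + sum(logp[f][x[f]][0] - logp[f][x[f]][1] for f in x)
    return 1.0 / (1.0 + math.exp(-log_odds))
```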
To summarize our method briefly: it consists in retrieving the distribution of maximal
entropy among those which would, in expectation, produce the observed aggregated data, and
in predicting with the conditional distribution of the label inferred from this distribution.
This method may be understood as a generalization of Naive Bayes which takes into account
all the observed correlations between features. The quality of the obtained model depends
on the set of available contingency tables; in our experiments we focus on the case
where there is one contingency table for each pair of features, as in Table 2. The case where
the aggregated data are noisy is kept outside the scope of this paper.
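For reference, and using the notation introduced in Section 2 below, the maximum entropy problem just described can be written (in our own paraphrase) as

$$
\hat{\pi} \;=\; \arg\max_{\pi}\; H(\pi)
\quad\text{subject to}\quad
n\,\mathbb{E}_{\pi}[\phi(X)] = d
\quad\text{and}\quad
n\,\mathbb{E}_{\pi}[\phi(X)\,Y] = c,
$$

where $H$ denotes the entropy; the prediction on a new example $x$ is then $\hat{\pi}(Y = 1 \mid X = x)$.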
Paper outline The problem and notation are defined in Section 2. We then detail
our solution in Section 3, discuss its limitations and how to scale it in Section 4, and present our
experimental results in Section 5. Finally, Section 6 reviews related work.
1.2 Online advertising, the Chrome Privacy Sandbox, and the Criteo AdKDD challenge
Our main motivation for developing this method comes from the changes happening
in online advertising with the introduction of the Google Chrome Privacy Sandbox (privacy-
sandbox).
The online advertising industry currently heavily relies on user data to train machine
learning (ML) models predicting the probability that an ad is clicked, as a function of the
ad's features. These models are used to price an ad display opportunity, and to bid in real
time on these opportunities. The Privacy Sandbox would however restrict how advertisers
can access these user data. In particular, in the FLEDGE (Kleber, 2021) proposal from
Chrome, advertisers would still be able to compute their bids as a function of many
of the features they use today, but they would not be able to collect an individual record
Table 1: Toy example dataset

              Feature 1   Feature 2   Feature 3   label
  Example 1      "1"          B           a         1
  Example 2      "2"          A           b         1
  Example 3      "1"          B           b         0
  Example 4      "2"          B           a         1
  Example 5      "1"          A           b         0
Table 2: Example of aggregated data

  Feature 1   Feature 2   Examples count   Sum(label)
     "1"          A              1              0
     "1"          B              2              1
     "2"          A              1              1
     "2"          B              1              1

  Feature 1   Feature 3   Examples count   Sum(label)
     "1"          a              1              1
     "1"          b              2              0
     "2"          a              1              1
     "2"          b              1              1

  Feature 2   Feature 3   Examples count   Sum(label)
      A           b              2              1
      B           a              2              2
      B           b              1              0
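For concreteness, a minimal sketch (assuming pandas is available; names are ours) showing how the pairwise tables of Table 2 can be recomputed from the record-level data of Table 1:

```python
from itertools import combinations
import pandas as pd

# Record-level toy dataset of Table 1.
df = pd.DataFrame({
    "f1": ["1", "2", "1", "2", "1"],
    "f2": ["B", "A", "B", "B", "A"],
    "f3": ["a", "b", "b", "a", "b"],
    "label": [1, 1, 0, 1, 0],
})

# One contingency table per pair of features: example count and sum of labels.
aggregated = {
    pair: (df.groupby(list(pair))["label"]
             .agg(examples_count="size", sum_label="sum")
             .reset_index())
    for pair in combinations(["f1", "f2", "f3"], 2)
}
print(aggregated[("f1", "f2")])
```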
training set². Instead, Chrome would provide an "aggregation API", from which advertisers
would recover noisy aggregated data similar to those of Table 2. This means that new
algorithms able to learn directly from such aggregated data are required.
Criteo AdKDD challenge Because learning a bidding model in this setting is a serious
challenge for the advertising industry, Criteo proposed in 2021 a public competition whose
goal was to learn a model from aggregated data (Diemert et al., 2022). We tackle a problem
similar to the one from the competition, except that in the AdKDD
competition a small amount of disaggregated examples was also provided and heavily used
by the top-performing solutions. Here we focus on a solution that does not require any
disaggregated sample.
2. Problem formalization
Let $X$ be a feature vector made of $D$ categorical features, each with $M$ modalities: $X \in \mathcal{X} := \{1, \dots, M\}^D$. Let $Y \in \{0, 1\}$ be a binary label, and $\pi$ an unknown joint distribution on $(X, Y)$. Let $(x_i, y_i)_{i=1\dots n}$ be a dataset of $n$ iid samples of this distribution.
From this dataset is computed a list of contingency tables, counting the displays and
labels projected on subsets of the feature vector, as in the example of Table 2. Formally,
we may define these tables as follows: let $\phi$ be a finite set of $K$ "projections" $\phi_k : \mathcal{X} \to \{0, 1\}$ for $k \in [1..K]$. We may think of each function $\phi_k$ as one row of a contingency table:
an example $x$ is either counted (when $\phi_k(x) = 1$) or not (when $\phi_k(x) = 0$) in this row.
We then define, for $x \in \mathcal{X}$, the binary vector
$$\phi(x) := \big(\phi_k(x)\big)_{\phi_k \in \phi} \in \{0, 1\}^K.$$
2. This would be achieved by letting the advertisers upload their bidding model to the browser, where the user
data would be stored. This model would be used by the browser itself to compute a bid, and
the advertiser would thus never directly observe the inputs of the model.
We also define the aggregated data as the two vectors³:
$$d := \sum_{i} \phi(x_i), \qquad c := \sum_{i} \phi(x_i) \cdot y_i.$$
Each coordinate of $d$ (respectively $c$) is thus the count of examples (respectively of
examples with label 1) on one row of a contingency table. To keep notations compact, we
will also note $a$ the concatenation of the vectors $c$ and $d$, i.e. $a = \binom{c}{d}$.
The vector $\phi(x)$ thus encodes the list of rows in which example $x$ is counted. In practice,
we first choose a list of contingency tables, and define $\phi$ from these tables.
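As an illustration (our own sketch with hypothetical names), when aggregating on pairs of features one can define one projection per (feature pair, value pair), and compute $d$ and $c$ as follows:

```python
from itertools import combinations, product
import numpy as np

def build_pairwise_projections(num_features, num_modalities):
    """One projection per (feature pair, value pair): phi_k(x) = 1 iff x matches that value pair."""
    keys = [
        (j1, j2, v1, v2)
        for j1, j2 in combinations(range(num_features), 2)
        for v1, v2 in product(range(num_modalities), repeat=2)
    ]
    def phi(x):
        return np.array([int(x[j1] == v1 and x[j2] == v2) for j1, j2, v1, v2 in keys])
    return keys, phi

def aggregate(xs, ys, phi):
    """d = sum_i phi(x_i); c = sum_i phi(x_i) * y_i."""
    d = sum(phi(x) for x in xs)
    c = sum(phi(x) * y for x, y in zip(xs, ys))
    return d, c

# Toy usage: D = 3 features with M = 2 modalities each (K = 3 * 4 = 12 projections).
keys, phi = build_pairwise_projections(3, 2)
d, c = aggregate(xs=[(0, 1, 0), (1, 0, 1)], ys=[1, 0], phi=phi)
```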
Problem statement Our goal is to learn a model predicting the label $Y$ as a function of
the features $X$, when the observed data consist only of the vectors $c$ and $d$ of aggregated data,
and the "granular" training set $(x_i, y_i)$ from which they were computed cannot be accessed.
Note that the state space $\mathcal{X}$ and the encoding $\phi$ are assumed to be known, i.e. we know how
the data were aggregated. In other words, noting $A$ the random variable associated
with the observed aggregated data $a$, the only information available at learning time is that
we observed the event $(A = a)$.
Aggregation on pairs of features Of course, the quality of the learned model depends
greatly on the list of available contingency tables. For example, in the extreme case where we
had one table aggregating on all the features, we could rebuild the full original dataset,
and could then apply classical ML methods. At the other extreme, if we only have one
table per feature, aggregating the examples and labels on this feature alone, and no
other information on the correlations between features, there is little opportunity to learn
a meaningful model other than Naive Bayes or a simple ensemble of predictors each using
a single feature.
We are thus interested in the intermediate cases, where the tables provide some infor-
mation on the correlations between the features of the $X$ vector, but the full dataset cannot
be reconstructed. In practice, we focus on the case where there is one contingency table
counting examples and labels for each pair of features of $X$.
3. Modelling the joint distribution of features and labels
Recall that classical ML methods, which directly model the conditional distribution $P(Y \mid X = x)$, require observing individual samples $(x, y)$ to define a loss, and are thus not applicable
here.
Instead, we directly model the joint distribution of $(X, Y)$. Since the aggregated
data $A$ are random variables computed from iid samples of $(X, Y)$, this model fully
defines the distribution of $A$, and may be fitted by maximising
the likelihood of the observed event $A = a$.
A joint model may also be used to infer the label of a sample $x$, using Bayes' rule
to retrieve the conditional $P(Y \mid X = x)$ from the joint model.
3. In the vocabulary of online advertising, $d$ contains the counts of displays and $c$ the counts of clicks.
Symbol                          Meaning
$\mathcal{X}$                   The state space of feature vectors
$\mathcal{Y} \equiv \{0, 1\}$   The set of values of the label
$n$                             The number of samples in the dataset
$\pi$                           The unknown distribution on $\mathcal{X} \times \mathcal{Y}$
$(x_i, y_i)_{i=1\dots n}$       Dataset of independent samples of $\pi$
$\phi$                          Finite family of $K$ binary functions $\phi_k : \mathcal{X} \to \{0, 1\}$
$\phi(x)$                       The $K$-dimensional binary vector $(\phi_k(x))_{k \in [1..K]}$
$c$ and $d$                     The observed vectors of aggregated labels and counts on the projections $\phi$
$a$                             The concatenation of the vectors $c$ and $d$
$C$, $D$, $A$                   The random variables associated with $c$, $d$, $a$

Table 3: Notation
3.1 Log linear model
The class of models we propose to use is the class of log-linear models whose sufficient
statistics are exactly the aggregated data.
Formally, let $\mu, \theta \in \mathbb{R}^K$ be two vectors of parameters, and $\pi_{\mu,\theta}$ the parametric distribution
on $(X, Y)$ defined as follows:
$$\pi_{\mu,\theta}(x, y) := \frac{1}{Z_{\mu,\theta}} \exp\big(\phi(x) \cdot (\mu + y \cdot \theta)\big), \qquad (1)$$
where $Z_{\mu,\theta}$ is the normalization constant:
$$Z_{\mu,\theta} := \sum_{x', y' \in \mathcal{X} \times \mathcal{Y}} \exp\big(\phi(x') \cdot (\mu + y' \cdot \theta)\big). \qquad (2)$$
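Before discussing how to fit this model, note that the conditional distribution of the label under Equation 1 has a logistic form; this is a direct consequence of the definition (our own one-line derivation) and explains the similarity with logistic regression mentioned in the abstract:

$$
\pi_{\mu,\theta}(Y = 1 \mid X = x)
= \frac{\pi_{\mu,\theta}(x, 1)}{\pi_{\mu,\theta}(x, 0) + \pi_{\mu,\theta}(x, 1)}
= \frac{\exp\big(\phi(x) \cdot \theta\big)}{1 + \exp\big(\phi(x) \cdot \theta\big)}
= \sigma\big(\phi(x) \cdot \theta\big),
$$

since the factors $Z_{\mu,\theta}$ and $\exp(\phi(x) \cdot \mu)$ cancel. In particular, at prediction time only $\theta$ is needed, exactly as in a logistic regression over the binary features $\phi(x)$.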
Remark 1 This normalisation constant $Z_{\mu,\theta}$ is a sum over a number of terms that is exponential
in the number of features. It is not reasonable to compute it explicitly, except in
small "toy" problems. But as we will see, this is not necessary.
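Purely to illustrate Remark 1, the following toy sketch (our own, reusing the `phi` built in the earlier sketch) computes $Z_{\mu,\theta}$ and the full distribution by enumeration; the cost is $2 M^D$ terms, which is only feasible for very small problems:

```python
from itertools import product
import numpy as np

def brute_force_distribution(phi, D, M, mu, theta):
    """Enumerate all (x, y) in {0..M-1}^D x {0, 1}: exponential in D, toy problems only."""
    states = [(x, y) for x in product(range(M), repeat=D) for y in (0, 1)]
    logits = np.array([phi(x) @ (mu + y * theta) for x, y in states])
    Z = np.exp(logits).sum()
    return states, np.exp(logits) / Z, Z

# Example with the pairwise projections defined earlier (D = 3, M = 2, hence 16 states).
keys, phi = build_pairwise_projections(3, 2)
mu = theta = np.zeros(len(keys))
states, probs, Z = brute_force_distribution(phi, D=3, M=2, mu=mu, theta=theta)
```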
Models such as Equation 1 belong to the class of "Markov Random Fields" (MRFs).
Motivations for choosing this parametric model Let us quickly summarize the rea-
sons why we chose this specific family of parametric models.
The aggregated data are sufficient statistics for this model, making the optimization
problem reasonably tractable. In particular, the objective⁴ is convex, with a uniquely
defined optimal distribution $\pi$.
4. To be more precise, this is the case if we allow the parameters to go to infinity, or if there is no cell
with a count of 0 in the contingency tables. Also note that the model is slightly over-parametrised: the
optimal distribution is unique, but the optimal parameters are not. In practice we use a regularization, which makes the
loss strictly convex and avoids these complications.
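To make the sufficient-statistics claim explicit (our own expansion of a standard argument), the log-likelihood of a granular dataset under Equation 1 depends on the data only through the aggregated vectors $d$ and $c$:

$$
\sum_{i=1}^{n} \log \pi_{\mu,\theta}(x_i, y_i)
= \mu \cdot \sum_{i} \phi(x_i) + \theta \cdot \sum_{i} y_i\,\phi(x_i) - n \log Z_{\mu,\theta}
= \mu \cdot d + \theta \cdot c - n \log Z_{\mu,\theta}.
$$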