Learning from aggregated data with a maximum entropy
model
Alexandre Gilotte a.gilotte@criteo.com
Criteo
AI Lab
Paris, France
Ahmed Ben Yahmed ahmed.ben yahmed@ens-paris-saclay.fr
Criteo
AI Lab
France
David Rohde d.rohde@criteo.com
Criteo
AI Lab
France
Abstract
Aggregating a dataset, then injecting some noise, is a simple and common way to
release differentially private data. However, aggregated data, even without noise, are not an
appropriate input for machine learning classifiers. In this work, we show how a new model,
similar to a logistic regression, may be learned from aggregated data only, by approximating
the unobserved feature distribution with a maximum entropy hypothesis. The resulting
model is a Markov Random Field (MRF), and we detail how to apply, modify and scale an
MRF training algorithm to our setting. Finally, we present empirical evidence on several
public datasets that the model learned this way can achieve performance comparable to
that of a logistic model trained with the full unaggregated data.
Keywords: aggregated data, differential privacy, Markov random field, logistic regression,
ad click prediction
1. Introduction
1.1 Learning from aggregated data
A recent societal trend is to provide users with more privacy protection, and this leads
to rethinking how user data are collected and shared in order to offer meaningful and provable
privacy guarantees. One popular and simple way to provide such guarantees is to aggregate
the dataset, discarding the individual records early in the data processing. The resulting
aggregated dataset is made of a list of contingency tables, counting the examples of the original
record-level dataset along several projections. As an example, Table 1 shows a toy dataset
made of individual records, and Table 2 contains the "aggregated data" computed from this
toy dataset by projecting on each pair of features¹.
1. A good reason to aggregate the data is that it is easy to make aggregated data differentially private, by
adding some Laplace or Gaussian noise (Dwork et al., 2014). The resulting noisy aggregated data may
then be shared or published without damaging the privacy of the users.
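As an illustration of this mechanism (our own sketch, not part of the original text), adding Laplace noise to a vector of aggregated counts could look as follows; the sensitivity of a counting query is 1 when each user contributes at most one record, and `epsilon` is a hypothetical privacy budget:

```python
import numpy as np

def laplace_noisy_counts(counts, epsilon, sensitivity=1.0, seed=None):
    """Perturb aggregated counts with the Laplace mechanism (noise scale = sensitivity / epsilon)."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    return counts + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=counts.shape)

# Example: the "Examples count" column of the first table of Table 2, with epsilon = 1.
print(laplace_noisy_counts([1, 2, 1, 1], epsilon=1.0, seed=0))
```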
There is however a strong limitation to the usefulness of aggregated data: the vast
majority of high-performance machine learning algorithms, from logistic regression to deep
learning, require access to disaggregated data (Table 1) and cannot be trained from aggregated
data (Table 2).
One notable exception is the "Naive Bayes classifier" (Lewis, 1998), which requires only
aggregated data. Specifically, it requires, for each feature, one contingency table aggregating
on this feature and the label. Naive Bayes heavily relies on an independence assumption
between the features of the dataset, and is known for often providing poor-quality models
on real-world datasets. We also note that the aggregated data of Table 2 allow learning
about the correlations between pairs of features, information that is lost in the Naive
Bayes model.
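For comparison, here is a minimal sketch (our own illustration, with hypothetical names) of how a Naive Bayes classifier can be fit from such per-feature aggregates alone, i.e. from one table per feature giving, for each modality, the example count and the count of positive labels:

```python
import math

def fit_naive_bayes(tables, n, n_pos, alpha=1.0):
    """tables: {feature: {modality: (count, positives)}}; n, n_pos: total and positive examples."""
    logp = {}
    for f, rows in tables.items():
        m = len(rows)  # number of modalities of feature f (for Laplace smoothing)
        logp[f] = {
            mod: (
                math.log((pos + alpha) / (n_pos + alpha * m)),            # log P(x_f = mod | Y = 1)
                math.log((cnt - pos + alpha) / (n - n_pos + alpha * m)),  # log P(x_f = mod | Y = 0)
            )
            for mod, (cnt, pos) in rows.items()
        }
    prior_log_odds = math.log(n_pos) - math.log(n - n_pos)
    return logp, prior_log_odds

def predict_proba(x, logp, prior_log_odds):
    """P(Y=1 | x) under the feature-independence assumption; x: {feature: modality}."""
    log_odds = prior_log_odds + sum(logp[f][x[f]][0] - logp[f][x[f]][1] for f in x)
    return 1.0 / (1.0 + math.exp(-log_odds))
```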
To summarize our method briefly: it consists in retrieving the distribution of maximal
entropy among those which would, in expectation, produce the observed aggregated data, and
in predicting with the conditional distribution of the label inferred from this distribution.
This method may be understood as a generalization of Naive Bayes which takes into account
all the observed correlations between features. The quality of the obtained model depends
on the set of available contingency tables; in our experiments we focus on the case
where there is one contingency table for each pair of features, as in Table 2. The case where
the aggregated data are noisy is kept outside the scope of this paper.
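For reference, and using the notation introduced in Section 2 below, the maximum entropy problem just described can be written (in our own paraphrase) as

$$
\hat{\pi} \;=\; \arg\max_{\pi}\; H(\pi)
\quad\text{subject to}\quad
n\,\mathbb{E}_{\pi}[\phi(X)] = d
\quad\text{and}\quad
n\,\mathbb{E}_{\pi}[\phi(X)\,Y] = c,
$$

where $H$ denotes the entropy; the prediction on a new example $x$ is then $\hat{\pi}(Y = 1 \mid X = x)$.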
Paper outline The problem and notation are defined in Section 2. We then detail
our solution in Section 3, discuss its limitations and how to scale it in Section 4, and present our
experimental results in Section 5. Finally, Section 6 reviews related work.
1.2 Online advertising, the Chrome Privacy Sandbox, and the Criteo AdKDD challenge
Our main motivation for developing this method comes from the changes happening
in online advertising with the introduction of the Google Chrome Privacy Sandbox (privacy-
sandbox).
The online advertising industry currently heavily relies on user data to train machine
learning (ML) models predicting the probability that an ad is clicked, as a function of the
ad's features. These models are used to price an ad display opportunity, and to bid in real
time on these opportunities. The Privacy Sandbox would however restrict how advertisers
can access these user data. In particular, in the FLEDGE (Kleber, 2021) proposal from
Chrome, advertisers would still be able to compute their bids as a function of many
of the features they use today, but they would not be able to collect an individual record
Table 1: Toy example dataset

              Feature 1   Feature 2   Feature 3   label
  Example 1      "1"          B           a         1
  Example 2      "2"          A           b         1
  Example 3      "1"          B           b         0
  Example 4      "2"          B           a         1
  Example 5      "1"          A           b         0
Table 2: Example of aggregated data

  Feature 1   Feature 2   Examples count   Sum(label)
     "1"          A              1              0
     "1"          B              2              1
     "2"          A              1              1
     "2"          B              1              1

  Feature 1   Feature 3   Examples count   Sum(label)
     "1"          a              1              1
     "1"          b              2              0
     "2"          a              1              1
     "2"          b              1              1

  Feature 2   Feature 3   Examples count   Sum(label)
      A           b              2              1
      B           a              2              2
      B           b              1              0
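For concreteness, a minimal sketch (assuming pandas is available; names are ours) showing how the pairwise tables of Table 2 can be recomputed from the record-level data of Table 1:

```python
from itertools import combinations
import pandas as pd

# Record-level toy dataset of Table 1.
df = pd.DataFrame({
    "f1": ["1", "2", "1", "2", "1"],
    "f2": ["B", "A", "B", "B", "A"],
    "f3": ["a", "b", "b", "a", "b"],
    "label": [1, 1, 0, 1, 0],
})

# One contingency table per pair of features: example count and sum of labels.
aggregated = {
    pair: (df.groupby(list(pair))["label"]
             .agg(examples_count="size", sum_label="sum")
             .reset_index())
    for pair in combinations(["f1", "f2", "f3"], 2)
}
print(aggregated[("f1", "f2")])
```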
training set². Instead, Chrome would provide an "aggregation API", from which advertisers
would recover noisy aggregated data similar to those of Table 2. This means that new
algorithms able to learn directly from such aggregated data are required.
Criteo AdKDD challenge Because learning a bidding model in this setting is a serious
challenge for the advertising industry, Criteo proposed in 2021 a public competition whose
goal was to learn a model from aggregated data (Diemert et al., 2022). We tackle a problem
similar to the one from the competition, except that in the AdKDD
competition a small amount of disaggregated examples was also provided and heavily used
by the top-performing solutions. Here we focus on a solution that does not require any
disaggregated sample.
2. Problem formalization
Let $X$ be a feature vector made of $D$ categorical features, each with $M$ modalities: $X \in \mathcal{X} := \{1, \dots, M\}^D$. Let $Y \in \{0, 1\}$ be a binary label, and $\pi$ an unknown joint distribution on $(X, Y)$. Let $(x_i, y_i)_{i=1\dots n}$ be a dataset of $n$ iid samples of this distribution.
From this dataset is computed a list of contingency tables, counting the displays and
labels projected on subsets of the feature vector, as in the example of Table 2. Formally,
we may define these tables as follows: let $\phi$ be a finite set of $K$ "projections" $\phi_k : \mathcal{X} \to \{0, 1\}$ for $k \in [1..K]$. We may think of each function $\phi_k$ as one row of a contingency table:
an example $x$ is either counted (when $\phi_k(x) = 1$) or not (when $\phi_k(x) = 0$) in this row.
We then define, for $x \in \mathcal{X}$, the binary vector
$$\phi(x) := \big(\phi_k(x)\big)_{\phi_k \in \phi} \in \{0, 1\}^K.$$
2. This would be achieved by letting the advertisers upload their bidding model to the browser, where the user
data would be stored. This model would be used by the browser itself to compute a bid, and
the advertiser would thus never directly observe the inputs of the model.
We also define the aggregated data as the two vectors³:
$$d := \sum_{i} \phi(x_i), \qquad c := \sum_{i} \phi(x_i) \cdot y_i.$$
Each coordinate of $d$ (respectively $c$) is thus the count of examples (respectively of
examples with label 1) on one row of a contingency table. To keep notations compact, we
will also note $a$ the concatenation of the vectors $c$ and $d$, i.e. $a = \binom{c}{d}$.
The vector $\phi(x)$ thus encodes the list of rows in which example $x$ is counted. In practice,
we first choose a list of contingency tables, and define $\phi$ from these tables.
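As an illustration (our own sketch with hypothetical names), when aggregating on pairs of features one can define one projection per (feature pair, value pair), and compute $d$ and $c$ as follows:

```python
from itertools import combinations, product
import numpy as np

def build_pairwise_projections(num_features, num_modalities):
    """One projection per (feature pair, value pair): phi_k(x) = 1 iff x matches that value pair."""
    keys = [
        (j1, j2, v1, v2)
        for j1, j2 in combinations(range(num_features), 2)
        for v1, v2 in product(range(num_modalities), repeat=2)
    ]
    def phi(x):
        return np.array([int(x[j1] == v1 and x[j2] == v2) for j1, j2, v1, v2 in keys])
    return keys, phi

def aggregate(xs, ys, phi):
    """d = sum_i phi(x_i); c = sum_i phi(x_i) * y_i."""
    d = sum(phi(x) for x in xs)
    c = sum(phi(x) * y for x, y in zip(xs, ys))
    return d, c

# Toy usage: D = 3 features with M = 2 modalities each (K = 3 * 4 = 12 projections).
keys, phi = build_pairwise_projections(3, 2)
d, c = aggregate(xs=[(0, 1, 0), (1, 0, 1)], ys=[1, 0], phi=phi)
```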
Problem statement Our goal is to learn a model predicting the label $Y$ as a function of
the features $X$, when the observed data consist only of the vectors $c$ and $d$ of aggregated data,
and the "granular" training set $(x_i, y_i)$ from which they were computed cannot be accessed.
Note that the state space $\mathcal{X}$ and the encoding $\phi$ are assumed to be known, i.e. we know how
the data were aggregated. In other words, noting $A$ the random variable associated
with the observed aggregated data $a$, the only information available at learning time is that
we observed the event $(A = a)$.
Aggregation on pairs of features Of course, the quality of the learned model depends
greatly on the list of available contingency tables. For example, in the extreme case where we
had one table aggregating on all the features, we could rebuild the full original dataset,
and could then apply classical ML methods. At the other extreme, if we only have one
table per feature, aggregating the examples and labels on this feature alone, and no
other information on the correlations between features, there is little opportunity to learn
a meaningful model other than Naive Bayes or a simple ensemble of predictors each using
a single feature.
We are thus interested in the intermediate cases, where the tables provide some infor-
mation on the correlations between the features of the $X$ vector, but the full dataset cannot
be reconstructed. In practice, we focus on the case where there is one contingency table
counting examples and labels for each pair of features of $X$.
3. Modelling the joint distribution of features and labels
Recall that classical ML methods, which directly model the conditional distribution $P(Y \mid X = x)$, require observing individual samples $(x, y)$ to define a loss, and are thus not applicable
here.
Instead, we directly model the joint distribution of $(X, Y)$. Since the aggregated
data $A$ are random variables computed from iid samples of $(X, Y)$, this model fully
defines the distribution of $A$, and may be fitted by maximising
the likelihood of the observed event $A = a$.
A joint model may also be used to infer the label of a sample $x$, using Bayes' rule
to retrieve the conditional $P(Y \mid X = x)$ from the joint model.
3. In the vocabulary of online advertising, $d$ contains the counts of displays and $c$ the counts of clicks.
Symbol                          Meaning
$\mathcal{X}$                   The state space of feature vectors
$\mathcal{Y} \equiv \{0, 1\}$   The set of values of the label
$n$                             The number of samples in the dataset
$\pi$                           The unknown distribution on $\mathcal{X} \times \mathcal{Y}$
$(x_i, y_i)_{i=1\dots n}$       Dataset of independent samples of $\pi$
$\phi$                          Finite family of $K$ binary functions $\phi_k : \mathcal{X} \to \{0, 1\}$
$\phi(x)$                       The $K$-dimensional binary vector $(\phi_k(x))_{k \in [1..K]}$
$c$ and $d$                     The observed vectors of aggregated labels and counts on the projections $\phi$
$a$                             The concatenation of the vectors $c$ and $d$
$C$, $D$, $A$                   The random variables associated with $c$, $d$, $a$

Table 3: Notation
3.1 Log linear model
The class of models we propose to use is the class of log-linear models whose sufficient
statistics are exactly the aggregated data.
Formally, let $\mu, \theta \in \mathbb{R}^K$ be two vectors of parameters, and $\pi_{\mu,\theta}$ the parametric distribution
on $(X, Y)$ defined as follows:
$$\pi_{\mu,\theta}(x, y) := \frac{1}{Z_{\mu,\theta}} \exp\big(\phi(x) \cdot (\mu + y \cdot \theta)\big), \qquad (1)$$
where $Z_{\mu,\theta}$ is the normalization constant:
$$Z_{\mu,\theta} := \sum_{x', y' \in \mathcal{X} \times \mathcal{Y}} \exp\big(\phi(x') \cdot (\mu + y' \cdot \theta)\big). \qquad (2)$$
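Before discussing how to fit this model, note that the conditional distribution of the label under Equation 1 has a logistic form; this is a direct consequence of the definition (our own one-line derivation) and explains the similarity with logistic regression mentioned in the abstract:

$$
\pi_{\mu,\theta}(Y = 1 \mid X = x)
= \frac{\pi_{\mu,\theta}(x, 1)}{\pi_{\mu,\theta}(x, 0) + \pi_{\mu,\theta}(x, 1)}
= \frac{\exp\big(\phi(x) \cdot \theta\big)}{1 + \exp\big(\phi(x) \cdot \theta\big)}
= \sigma\big(\phi(x) \cdot \theta\big),
$$

since the factors $Z_{\mu,\theta}$ and $\exp(\phi(x) \cdot \mu)$ cancel. In particular, at prediction time only $\theta$ is needed, exactly as in a logistic regression over the binary features $\phi(x)$.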
Remark 1 This normalisation constant $Z_{\mu,\theta}$ is a sum over a number of terms that is exponential
in the number of features. It is not reasonable to compute it explicitly, except in
small "toy" problems. But as we will see, this is not necessary.
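Purely to illustrate Remark 1, the following toy sketch (our own, reusing the `phi` built in the earlier sketch) computes $Z_{\mu,\theta}$ and the full distribution by enumeration; the cost is $2 M^D$ terms, which is only feasible for very small problems:

```python
from itertools import product
import numpy as np

def brute_force_distribution(phi, D, M, mu, theta):
    """Enumerate all (x, y) in {0..M-1}^D x {0, 1}: exponential in D, toy problems only."""
    states = [(x, y) for x in product(range(M), repeat=D) for y in (0, 1)]
    logits = np.array([phi(x) @ (mu + y * theta) for x, y in states])
    Z = np.exp(logits).sum()
    return states, np.exp(logits) / Z, Z

# Example with the pairwise projections defined earlier (D = 3, M = 2, hence 16 states).
keys, phi = build_pairwise_projections(3, 2)
mu = theta = np.zeros(len(keys))
states, probs, Z = brute_force_distribution(phi, D=3, M=2, mu=mu, theta=theta)
```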
Models such as Equation 1 belong to the class of "Markov Random Fields" (MRFs).
Motivations for choosing this parametric model Let us quickly summarize the rea-
sons why we chose this specific family of parametric models.
The aggregated data are sufficient statistics for this model, making the optimization
problem reasonably tractable. In particular, the objective⁴ is convex, with a uniquely
defined optimal distribution $\pi$.
4. To be more precise, this is the case if we allow the parameters to go to infinity, or if there is no cell
with a count of 0 in the contingency tables. Also note that the model is slightly over-parametrised: the
optimal distribution is unique, but the optimal parameters are not. In practice we use a regularization, which makes the
loss strictly convex and avoids these complications.
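To make the sufficient-statistics claim explicit (our own expansion of a standard argument), the log-likelihood of a granular dataset under Equation 1 depends on the data only through the aggregated vectors $d$ and $c$:

$$
\sum_{i=1}^{n} \log \pi_{\mu,\theta}(x_i, y_i)
= \mu \cdot \sum_{i} \phi(x_i) + \theta \cdot \sum_{i} y_i\,\phi(x_i) - n \log Z_{\mu,\theta}
= \mu \cdot d + \theta \cdot c - n \log Z_{\mu,\theta}.
$$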