ACTIVE LEARNING FOR REGRESSION WITH AGGREGATED
OUTPUTS
A PREPRINT
Tomoharu Iwata
NTT Communication Science Laboratories
ABSTRACT
In some real-world applications, due to privacy protection or the difficulty of data collection, we cannot observe individual outputs for each instance; instead, we observe aggregated outputs that are summed over the instances in a set. To reduce the labeling cost of training regression models on such aggregated data, we propose an active learning method that sequentially selects sets to be labeled so as to improve the predictive performance with fewer labeled sets. As the selection criterion, the proposed method uses the mutual information, which quantifies the reduction in the uncertainty of the model parameters obtained by observing an aggregated output. With Bayesian linear basis function models for the outputs given an input, which include approximated Gaussian processes and neural networks, the mutual information can be calculated efficiently in a closed form. In experiments using various datasets, we demonstrate that the proposed method achieves better predictive performance with fewer labeled sets than existing methods.
1 Introduction
Data are often aggregated for privacy protection, cost reduction, or the difficulty of data collection [28, 1, 3]. For example, census data are averaged over spatial regions, IoT data are aggregated to reduce the communication overhead, gene expression levels are measured for each set of multiple cells, and brain imaging data are observed for each set of voxels. Since learning from such aggregated data is important for applications where only aggregated data are available, many machine learning methods for aggregated data have been proposed [32, 38, 4].
Although the predictive performance of a machine learning model generally improves as the amount of labeled training data increases, obtaining many labeled data incurs considerable cost. Active learning has been successfully used for reducing the labeling cost, where instances to be labeled are sequentially selected to improve the predictive performance [41, 29, 48, 45]. However, there have been no active learning methods for regression with aggregated data.
In this paper, we propose an active learning method for regression with aggregated outputs. At the beginning of the active learning process, we are given unlabeled sets of instances. At each active learning step, we select a set and observe its aggregated output; the output of each individual instance cannot be observed. Our aim is to improve the predictive performance on individual test instances. The proposed method selects the set that maximizes the mutual information between the aggregated output and the model parameters, which corresponds to the reduction in the uncertainty of the model parameters obtained by observing the aggregated output of the set. Mutual information-based active learning has been successfully used for non-aggregated data [30, 25, 21].
We derive the mutual information using linear basis function models as the regression model that predicts the non-aggregated output given an input vector. Various regression models can be formulated as linear basis function models by changing the basis functions, including polynomial regression, approximated Gaussian processes [39], and neural networks. Within the Bayesian inference framework of linear basis function models, the distribution of the aggregated output is modeled as a Gaussian distribution, and the mutual information on aggregated outputs can be calculated efficiently in a closed form. Figure 1 shows the framework of our active learning.
The major contributions of this paper are as follows:
1. We propose the first active learning method for regression with aggregated outputs.
2. The proposed method is based on entropy and mutual information, which are calculated efficiently using Bayesian linear basis function models, and it considers the correlation among the instances in each set.
3. We demonstrate the effectiveness of the proposed method on various datasets in comparison with existing active learning methods for non-aggregated data.

arXiv:2210.01329v1 [stat.ML] 4 Oct 2022

Figure 1: Our framework of active learning with aggregated outputs. In the beginning, we are given unlabeled sets of instances. At each step, we iterate the following procedure: 1) Predict the distribution of the aggregated output for each of the unlabeled sets using the model. 2) Select a set to be labeled from the unlabeled sets using the mutual information calculated from the predicted distributions, and query the oracle. 3) Observe the aggregated output of the selected set, add it to the labeled sets, and remove it from the unlabeled sets. 4) Retrain the model using the updated labeled sets.
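The four-step loop of Figure 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `mutual_information` and `fit` methods, the `oracle` callable, and the toy scoring model are all hypothetical names introduced here for the sketch.

```python
import numpy as np

def active_learning_loop(unlabeled_sets, oracle, model, n_steps):
    """Sketch of the four-step loop in Figure 1.

    `oracle` returns the aggregated output of a queried set; `model` is
    assumed to expose `mutual_information(X_a)` and `fit(labeled)` methods
    (hypothetical names, not from the paper).
    """
    labeled = []  # list of (set, aggregated output) pairs
    for _ in range(n_steps):
        # 1)-2) Score each unlabeled set by the mutual information between
        # its aggregated output and the model parameters; pick the maximizer.
        scores = [model.mutual_information(X_a) for X_a in unlabeled_sets]
        best = int(np.argmax(scores))
        # 3) Query the oracle and move the set from unlabeled to labeled.
        X_a = unlabeled_sets.pop(best)
        labeled.append((X_a, oracle(X_a)))
        # 4) Retrain the model on the updated labeled sets.
        model.fit(labeled)
    return labeled

# Toy demonstration with a stand-in model that scores a set by its input sum.
class ToyModel:
    def mutual_information(self, X_a):
        return float(np.sum(X_a))
    def fit(self, labeled):
        self.num_labeled = len(labeled)

sets = [np.array([1.0]), np.array([5.0]), np.array([2.0])]
model = ToyModel()
labeled = active_learning_loop(sets, oracle=lambda X: float(X.sum()),
                               model=model, n_steps=2)
```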
The remainder of this paper is organized as follows. In Section 2, we briefly review related work. In Section 3, we describe the probability distribution of the weighted sum of Gaussian distributed random variables, which is used in the proposed method. In Section 4, we define our task and propose our active learning method for regression with aggregated outputs based on entropy and mutual information. In Section 5, we evaluate the performance of our method by comparing it with existing methods. Finally, we present concluding remarks and a discussion of future work in Section 6.
2 Related work
Several frameworks for learning from aggregated data have been proposed [13, 11, 53]. Learning from label proportions [38, 35] considers classification tasks, where outputs are categorical. Multiple instance learning [31] learns classification models from labeled sets, where each set is positively labeled if at least one instance is positive, and it is otherwise negatively labeled. Collective graphical models learn from contingency tables [42, 23, 17]. Regression from aggregated data, where outputs are continuous, has also been considered [4, 52, 47, 24]. Summed or averaged values are assumed to be observed in [33, 52, 47, 24], as in our setting, while histograms are assumed to be observed in [4]. Some methods assume that both inputs and outputs are aggregated [4]; others assume that only outputs are aggregated while inputs are not [32, 22]. In this paper, we consider regression with outputs aggregated by a linear weighted summation.
Many active learning methods have been proposed [41, 49, 5, 2, 29, 48], including methods for multiple instance learning [8] and for learning from label proportions [37]. However, they are not designed for regression with aggregated outputs and are inapplicable to our task. Batch active learning [20, 14, 36] selects multiple instances to be labeled, where the output of each instance is observed. It differs from our task, where aggregated outputs are observed but individual outputs cannot be.
3 Preliminaries
Let $\mathbf{y} = [y_1, \cdots, y_N]^\top \in \mathbb{R}^N$ be jointly Gaussian distributed random variables,
$$\mathbf{y} \sim \mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}), \quad (1)$$
Table 1: Notation.

Symbol                       Description
$\mathbf{x}_{an}$            input vector of the $n$th instance of the $a$th set.
$X_a$                        $a$th set of input vectors.
$y_{an}$                     unobserved output value of the $n$th instance of the $a$th set.
$\bar{y}_a$                  aggregated output value of the instances in the $a$th set.
$\boldsymbol{\theta}_a$      linear weights for aggregation of the $a$th set.
$N_a$                        number of instances in the $a$th set.
$D$                          number of attributes.
$\boldsymbol{\phi}(\cdot)$   basis function.
$\mathbf{w}$                 linear projection vector of the linear basis function model, or parameters of the neural network model.
$\mathcal{D}$                labeled sets with aggregated outputs.
where $\mathcal{N}(\cdot \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$ is the Gaussian distribution with mean $\boldsymbol{\mu} \in \mathbb{R}^N$ and covariance $\boldsymbol{\Sigma} \in \mathbb{R}^{N \times N}$. The weighted sum of the random variables, $\bar{y} = \sum_{n=1}^{N} \theta_n y_n = \boldsymbol{\theta}^\top \mathbf{y} \in \mathbb{R}$, follows the Gaussian distribution [18, 26],
$$\bar{y} \sim \mathcal{N}(\bar{y} \mid \boldsymbol{\theta}^\top \boldsymbol{\mu}, \boldsymbol{\theta}^\top \boldsymbol{\Sigma} \boldsymbol{\theta}), \quad (2)$$
where $\theta_n \in \mathbb{R}$ is the $n$th weight, and $\boldsymbol{\theta} = [\theta_1, \cdots, \theta_N]^\top \in \mathbb{R}^N$.
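Equation (2) can be verified numerically. The sketch below uses illustrative values (the mean, covariance, and weights are not from the paper) and compares the closed-form mean and variance of the weighted sum against a Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters of the joint Gaussian y ~ N(mu, Sigma) with N = 3.
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[1.0, 0.3, 0.1],
                  [0.3, 2.0, 0.2],
                  [0.1, 0.2, 0.5]])
theta = np.array([1.0, 0.5, -1.0])  # aggregation weights

# Closed-form mean and variance of y_bar = theta^T y from Eq. (2).
mean_closed = theta @ mu            # theta^T mu
var_closed = theta @ Sigma @ theta  # theta^T Sigma theta

# Monte Carlo check that the weighted sum is N(theta^T mu, theta^T Sigma theta).
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
y_bar = samples @ theta
```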
4 Proposed method
In Section 4.1, we define our task of active learning with aggregated outputs. In Section 4.2, we present our model for predicting outputs, which is trained from labeled sets with aggregated outputs based on linear basis function models. In Sections 4.3 and 4.4, we propose entropy-based and mutual information-based active learning methods, respectively, which use our model to select the set to be observed next so as to improve the predictive performance. In Section 4.5, we present the procedures of the proposed method.
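To fix ideas before the derivations, the following sketch shows how such a mutual information criterion can reduce to a closed form for a Bayesian linear basis function model. This is written under standard textbook assumptions and is not the paper's exact derivation (which appears in Sections 4.3 and 4.4); the function name, test values, and noise model are introduced here for illustration.

```python
import numpy as np

def aggregated_mutual_information(Phi_a, theta_a, S, noise_var):
    """Closed-form MI between a set's aggregated output and the parameters.

    Assumes y_an = w^T phi(x_an) + eps with eps ~ N(0, noise_var) and a
    current Gaussian posterior w ~ N(m, S) (standard Bayesian linear
    regression, a sketch rather than the paper's derivation). Then
      I(y_bar_a; w) = H(y_bar_a) - H(y_bar_a | w)
                    = 0.5 * log(1 + phibar^T S phibar / (noise_var * ||theta_a||^2)),
    where phibar = Phi_a^T theta_a aggregates the basis vectors of the set.
    """
    phibar = Phi_a.T @ theta_a                 # shape (K,)
    noise = noise_var * (theta_a @ theta_a)    # variance of the aggregated noise
    return 0.5 * np.log(1.0 + (phibar @ S @ phibar) / noise)

# Two instances with orthogonal unit basis vectors and unit prior covariance.
Phi_a = np.eye(2)       # rows are phi(x_an), shape (N_a, K)
theta_a = np.ones(2)    # summation weights
mi = aggregated_mutual_information(Phi_a, theta_a, S=np.eye(2), noise_var=1.0)
```

As expected of an information criterion, the score shrinks as the observation noise grows, since a noisier aggregated output reveals less about the parameters.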
4.1 Problem formulation
Suppose that we are given sets of input vectors $\{X_a\}_{a=1}^{A}$, where $X_a = \{\mathbf{x}_{an}\}_{n=1}^{N_a}$ is the $a$th set of input vectors, $\mathbf{x}_{an} \in \mathbb{R}^D$ is the $n$th input vector, $D$ is the number of attributes, and $N_a$ is the number of input vectors in the set. At each active learning step, we select a set $a$ from $\{1, \cdots, A\}$ and observe its aggregated output, which is the weighted sum of the outputs of the input vectors in the set,
$$\bar{y}_a = \sum_{n=1}^{N_a} \theta_{an} y_{an} = \boldsymbol{\theta}_a^\top \mathbf{y}_a \in \mathbb{R}, \quad (3)$$
where $\bar{\cdot}$ indicates an aggregated value, $y_{an} \in \mathbb{R}$ is the unknown output of input vector $\mathbf{x}_{an}$, $\mathbf{y}_a = [y_{a1}, \cdots, y_{aN_a}]^\top \in \mathbb{R}^{N_a}$, and $\boldsymbol{\theta}_a = [\theta_{a1}, \cdots, \theta_{aN_a}]^\top \in \mathbb{R}^{N_a}$ is the weight vector. We assume that the weights $\boldsymbol{\theta}_a$ for all sets are known. For example, $\theta_{an} = 1$ when the aggregated data are obtained by summation, and $\theta_{an} = \frac{1}{N_a}$ when they are obtained by averaging. Our aim is to improve the test predictive performance of the outputs with as few observations of aggregated outputs as possible. Table 1 shows our notation. Although we assume that outputs are scalar, the proposed method can be straightforwardly extended to multivariate outputs. When the aggregated response value is obtained by an integral, $\bar{y}_a = \int \theta_{an} y_{an} \, dn$, we can apply the proposed method by approximating the integral with a summation, dividing the space into a finite number of bins.
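The two weight choices mentioned above can be made concrete in a few lines. The individual outputs below are hypothetical placeholder values (in the actual task they are unobservable); the sketch only shows how Eq. (3) recovers the sum and the average.

```python
import numpy as np

# Hypothetical individual outputs of a set with N_a = 4 instances
# (unobservable in the actual task; placeholder values for illustration).
y_a = np.array([2.0, 4.0, 6.0, 8.0])

# theta_an = 1 gives the sum; theta_an = 1/N_a gives the average (Eq. (3)).
theta_sum = np.ones_like(y_a)
theta_avg = np.full_like(y_a, 1.0 / len(y_a))

ybar_sum = theta_sum @ y_a  # aggregation by summation
ybar_avg = theta_avg @ y_a  # aggregation by averaging
```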
4.2 Model
Let $\hat{y}(\mathbf{x}; \mathbf{w})$ be a regression model that predicts the non-aggregated output given input vector $\mathbf{x}$, where $\mathbf{w}$ denotes the parameters, which are treated as random variables. We consider the following linear basis function model,
$$\hat{y}(\mathbf{x}; \mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}), \quad (4)$$
where $\boldsymbol{\phi}(\mathbf{x}) \in \mathbb{R}^K$ is the nonlinear basis function that transforms a $D$-dimensional input vector to a $K$-dimensional vector, and $\mathbf{w} \in \mathbb{R}^K$. A wide variety of regression models can be formulated as linear basis function models, including linear regression, polynomial regression, approximated Gaussian processes with random features-based basis functions [39], and neural networks with the last layer represented by random variables and neural network-based
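One concrete choice of basis function is random Fourier features, which approximate a Gaussian process with an RBF kernel [39]. The sketch below is illustrative: the feature count, lengthscale, and random weights are assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rff(X, W, b):
    """Random Fourier features phi(x), approximating an RBF-kernel GP [39]."""
    K = W.shape[1]
    return np.sqrt(2.0 / K) * np.cos(X @ W + b)

D, K = 3, 500                       # illustrative sizes, not from the paper
W = rng.normal(size=(D, K))         # frequencies for a unit-lengthscale RBF kernel
b = rng.uniform(0.0, 2.0 * np.pi, size=K)

w = rng.normal(size=K)              # random linear weights, for illustration only
x = rng.normal(size=(1, D))
y_hat = rff(x, W, b) @ w            # Eq. (4): y_hat(x; w) = w^T phi(x)
```

With enough features, the inner product of two feature vectors approximates the RBF kernel value; for a point with itself this is close to one.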