
ACTIVE LEARNING FOR REGRESSION WITH AGGREGATED
OUTPUTS
A PREPRINT
Tomoharu Iwata
NTT Communication Science Laboratories
ABSTRACT
In some real-world applications, due to privacy protection or the difficulty of data collection, we cannot
observe individual outputs for each instance; instead, we observe aggregated outputs that are summed over
multiple instances in a set. To reduce the labeling cost of training regression models on such aggregated
data, we propose an active learning method that sequentially selects sets to be labeled so as to improve the
predictive performance with fewer labeled sets. As the selection criterion, the proposed method uses the
mutual information, which quantifies the reduction in the uncertainty of the model parameters obtained by
observing an aggregated output. With Bayesian linear basis function models for the outputs given an input,
which include approximated Gaussian processes and neural networks, the mutual information can be computed
efficiently in closed form. In experiments on various datasets, we demonstrate that the proposed method
achieves better predictive performance with fewer labeled sets than existing methods.
1 Introduction
Data are often aggregated for privacy protection, cost reduction, or the difficulty of data collection [28, 1, 3]. For
example, census data are averaged over spatial regions, IoT data are aggregated to reduce the communication overhead,
the gene expression level is measured for each set of multiple cells, and brain imaging data are observed for each set of
voxels. Since learning from such aggregated data is important for applications where only aggregated data are available,
many machine learning methods for aggregated data have been proposed [32, 38, 4].
Although the predictive performance of a machine learning model generally improves as the amount of labeled training
data increases, obtaining many labeled data incurs considerable cost. Active learning has been successfully
used to reduce the labeling cost, where instances to be labeled are sequentially selected to improve the predictive
performance [41, 29, 48, 45]. However, there have been no active learning methods for regression with aggregated data.
In this paper, we propose an active learning method for regression with aggregated outputs. At the beginning of the
active learning process, we are given unlabeled sets of instances. At each active learning step, we select a
set and observe its aggregated output; the outputs of the individual instances remain unobserved. Our aim is to improve the
predictive performance on the output of each test instance. The proposed method selects the set that maximizes the
mutual information between the aggregated output and the model parameters, which corresponds to the reduction in the
uncertainty of the model parameters obtained by observing the aggregated output of the set. Mutual information-based active
learning has been successfully used for non-aggregated data [30, 25, 21].
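The set-selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; `mi_fn` is a hypothetical callback that scores a candidate set by the mutual information between its aggregated output and the model parameters.

```python
import numpy as np

def select_set(candidate_sets, mi_fn):
    """Pick the unlabeled set whose aggregated output is most informative.

    candidate_sets: list of (n_i, d) feature matrices, one per unlabeled set.
    mi_fn: callable scoring a set by the mutual information criterion
           (hypothetical; its form depends on the underlying model).
    Returns the index of the set to query for its aggregated label.
    """
    scores = [mi_fn(Phi_S) for Phi_S in candidate_sets]
    return int(np.argmax(scores))
```

After querying the selected set's aggregated output, the model posterior is updated and the selection repeats, so the criterion always reflects the current parameter uncertainty.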
We derive the mutual information using linear basis function models as the regression model that predicts the non-
aggregated output given an input vector. Various regression models can be formulated as linear basis function models
by changing the basis functions, including polynomial regression, approximated Gaussian processes [39], and neural
networks. With the Bayesian inference framework for linear basis function models, we can model the distribution of
the aggregated output as a Gaussian distribution and compute the mutual information on aggregated outputs
efficiently in closed form. Figure 1 illustrates the framework of our active learning.
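As a concrete illustration of why the mutual information is closed-form here (a sketch under standard Bayesian linear regression assumptions, not the paper's exact derivation): with per-instance outputs $y_n = \mathbf{w}^\top \phi(\mathbf{x}_n) + \epsilon_n$, $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$, and a Gaussian posterior $\mathbf{w} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the summed output of a set $S$ is Gaussian with variance $\phi_S^\top \boldsymbol{\Sigma} \phi_S + |S|\sigma^2$, where $\phi_S = \sum_{n \in S} \phi(\mathbf{x}_n)$, so the mutual information reduces to a log-ratio of variances. All names below are illustrative.

```python
import numpy as np

def aggregated_mi(Phi_S, Sigma_w, noise_var):
    """I(y_S; w): mutual information between a set's summed output and the
    weights of a Bayesian linear basis function model.

    Phi_S: (n, d) basis features of the n instances in the set.
    Sigma_w: (d, d) posterior covariance of the weights.
    noise_var: per-instance observation noise variance sigma^2.
    """
    phi_sum = Phi_S.sum(axis=0)           # summed feature vector phi_S
    n = Phi_S.shape[0]
    signal = phi_sum @ Sigma_w @ phi_sum  # parameter-driven variance
    # 0.5 * log(total variance / noise-only variance) for Gaussians
    return 0.5 * np.log(1.0 + signal / (n * noise_var))
```

Note that the score depends on the set only through the summed feature vector and the set size, which is what makes evaluating many candidate sets cheap.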
The major contributions of this paper are as follows:
1. We propose the first active learning method for regression with aggregated outputs.
arXiv:2210.01329v1 [stat.ML] 4 Oct 2022