
ACTIVE LEARNING FOR REGRESSION WITH AGGREGATED
OUTPUTS
A PREPRINT
Tomoharu Iwata
NTT Communication Science Laboratories
ABSTRACT
In some real-world applications, due to privacy protection or the difficulty of data collection, we cannot
observe individual outputs for each instance; instead, we observe aggregated outputs that are summed over
multiple instances in a set. To reduce the labeling cost of training regression models on such aggregated
data, we propose an active learning method that sequentially selects sets to be labeled so as to improve the
predictive performance with fewer labeled sets. As the selection criterion, the proposed method uses the
mutual information, which quantifies the reduction in the uncertainty of the model parameters obtained by
observing an aggregated output. With Bayesian linear basis function models for the outputs given an input,
which include approximated Gaussian processes and neural networks, the mutual information can be computed
efficiently in closed form. In experiments on various datasets, we demonstrate that the proposed method
achieves better predictive performance with fewer labeled sets than existing methods.
1 Introduction
Data are often aggregated for privacy protection, cost reduction, or the difficulty of data collection [28, 1, 3]. For
example, census data are averaged over spatial regions, IoT data are aggregated to reduce the communication overhead,
the gene expression level is measured for each set of multiple cells, and brain imaging data are observed for each set of
voxels. Since learning from such aggregated data is important for applications where only aggregated data are available,
many machine learning methods for aggregated data have been proposed [32, 38, 4].
Although the predictive performance of a machine learning model generally improves as the amount of labeled training
data increases, obtaining many labeled data incurs considerable cost. Active learning has been successfully
used to reduce the labeling cost, where instances to be labeled are sequentially selected to improve the predictive
performance [41, 29, 48, 45]. However, there have been no active learning methods for regression with aggregated data.
In this paper, we propose an active learning method for regression with aggregated outputs. At the beginning of the
active learning process, we are given unlabeled sets of instances. At each active learning step, we select a
set and observe its aggregated output; the outputs of the individual instances remain unobserved. Our aim is to improve the
predictive performance on the output of each test instance. The proposed method selects the set that maximizes the
mutual information between the aggregated output and the model parameters, which corresponds to the reduction in the
uncertainty of the model parameters obtained by observing the aggregated output of the set. Mutual information-based active
learning has been successfully used for non-aggregated data [30, 25, 21].
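The set-selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; `mi_fn` is a hypothetical callback that scores a candidate set by the mutual information between its aggregated output and the model parameters.

```python
import numpy as np

def select_set(candidate_sets, mi_fn):
    """Pick the unlabeled set whose aggregated output is most informative.

    candidate_sets: list of (n_i, d) feature matrices, one per unlabeled set.
    mi_fn: callable scoring a set by the mutual information criterion
           (hypothetical; its form depends on the underlying model).
    Returns the index of the set to query for its aggregated label.
    """
    scores = [mi_fn(Phi_S) for Phi_S in candidate_sets]
    return int(np.argmax(scores))
```

After querying the selected set's aggregated output, the model posterior is updated and the selection repeats, so the criterion always reflects the current parameter uncertainty.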
We derive the mutual information using linear basis function models as the regression model that predicts the non-
aggregated output given an input vector. Various regression models can be formulated as linear basis function models
by changing the basis functions, including polynomial regression, approximated Gaussian processes [39], and neural
networks. With the Bayesian inference framework for linear basis function models, we can model the distribution of
the aggregated output as a Gaussian distribution and compute the mutual information on aggregated outputs
efficiently in closed form. Figure 1 illustrates the framework of our active learning.
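As a concrete illustration of why the mutual information is closed-form here (a sketch under standard Bayesian linear regression assumptions, not the paper's exact derivation): with per-instance outputs $y_n = \mathbf{w}^\top \phi(\mathbf{x}_n) + \epsilon_n$, $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$, and a Gaussian posterior $\mathbf{w} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the summed output of a set $S$ is Gaussian with variance $\phi_S^\top \boldsymbol{\Sigma} \phi_S + |S|\sigma^2$, where $\phi_S = \sum_{n \in S} \phi(\mathbf{x}_n)$, so the mutual information reduces to a log-ratio of variances. All names below are illustrative.

```python
import numpy as np

def aggregated_mi(Phi_S, Sigma_w, noise_var):
    """I(y_S; w): mutual information between a set's summed output and the
    weights of a Bayesian linear basis function model.

    Phi_S: (n, d) basis features of the n instances in the set.
    Sigma_w: (d, d) posterior covariance of the weights.
    noise_var: per-instance observation noise variance sigma^2.
    """
    phi_sum = Phi_S.sum(axis=0)           # summed feature vector phi_S
    n = Phi_S.shape[0]
    signal = phi_sum @ Sigma_w @ phi_sum  # parameter-driven variance
    # 0.5 * log(total variance / noise-only variance) for Gaussians
    return 0.5 * np.log(1.0 + signal / (n * noise_var))
```

Note that the score depends on the set only through the summed feature vector and the set size, which is what makes evaluating many candidate sets cheap.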
The major contributions of this paper are as follows:
1. We propose the first active learning method for regression with aggregated outputs.
arXiv:2210.01329v1 [stat.ML] 4 Oct 2022