Multiple Instance Learning for Detecting Anomalies over Sequential Real-World Datasets

Parastoo Kamranfar
George Mason University
Fairfax, Virginia, USA
pkamranf@gmu.edu
David Lattanzi
George Mason University
Fairfax, Virginia, USA
dlattanz@gmu.edu
Amarda Shehu
George Mason University
Fairfax, Virginia, USA
ashehu@gmu.edu
Daniel Barbará
George Mason University
Fairfax, Virginia, USA
dbarbara@gmu.edu
ABSTRACT
Detecting anomalies over real-world datasets remains a challenging task. Data annotation demands intensive human labor, particularly for sequential datasets, where the start and end times of anomalies are not known. As a result, data collected from sequential real-world processes are often largely unlabeled or carry inaccurate labels. These characteristics challenge the application of anomaly detection techniques based on supervised learning. In contrast, Multiple Instance Learning (MIL) has been shown effective on problems with incomplete knowledge of labels in the training dataset, mainly due to the notion of bags. While largely under-leveraged for anomaly detection, MIL provides an appealing formulation for anomaly detection over real-world datasets, and leveraging it for this purpose is the primary contribution of this paper. We propose an MIL-based formulation and various algorithmic instantiations of this framework based on different design decisions for its key components. We evaluate the resulting algorithms over four datasets that capture different physical processes along different modalities. The experimental evaluation draws out several observations. The MIL-based formulation performs no worse than single-instance learning on easy to moderate datasets and outperforms single-instance learning on more challenging datasets. Altogether, the results show that the framework generalizes well over diverse datasets resulting from different real-world application domains.
CCS CONCEPTS
• Computing methodologies → Machine learning algorithms; Multiple instance-based learning.
Corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ACM SIGKDD ’22, August 14–18, 2022, Washington, DC
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/22/08...$15.00
https://doi.org/XXXXXXX.XXXXXXX
KEYWORDS
Anomaly Detection, Multiple Instance Learning, Strangeness, Outlier Detection
ACM Reference Format:
Parastoo Kamranfar, David Lattanzi, Amarda Shehu, and Daniel Barbará. 2022. Multiple Instance Learning for Detecting Anomalies over Sequential Real-World Datasets. In ACM SIGKDD ’22: ACM SIGKDD Workshops, August 14–18, 2022, Washington, DC. ACM, New York, NY, USA, 9 pages. https://doi.org/XXXXXXX.XXXXXXX
1 INTRODUCTION
Anomaly detection (AD) is a well-studied problem in machine learning and seeks to detect data points/instances that deviate from an expected behavior; the deviating instances are known as outliers or anomalies [12]. Applications of AD have been explored extensively in the literature and over a variety of datasets, including data generated from sequential real-world processes [9, 14].
Despite being well-studied, AD over real-world datasets remains challenging. In many real-world settings, annotation of sequential data requires intensive human labor; annotation is also made difficult by unknown or noisy start and end times of anomalies. As a result, data collected from sequential real-world processes are often largely unlabeled, or labels are often inaccurate. This setting is particularly challenging for AD methods formulated under the umbrella of supervised learning.
In contrast, Multiple Instance Learning (MIL) has been shown effective in dealing with incomplete knowledge of labels in the training dataset. MIL was introduced in [17] in the context of binary classification as a weakly-supervised approach that reduces annotation efforts. MIL assumes that the data points/instances are organized in sets, also known as bags. Essentially, MIL deals with bag labels instead of individual instance labels [23]. The label of a bag (negative/positive) is assigned according to the labels of the instances it contains. The existence of at least one positive instance is enough for labeling the bag as positive, while a bag is labeled negative only if all of its instances are negative [11, 21]. This provides a way of building a more robust classifier than what can be built by relying exclusively on single-instance labels.
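To make the bag-level labeling rule concrete, the following is a minimal sketch; the function name and the 0/1 label encoding are illustrative choices, not taken from the paper:

```python
# Minimal sketch of the standard MIL bag-labeling assumption (names are illustrative).
# A bag is positive if it contains at least one positive instance; it is negative
# only if every instance in it is negative.
from typing import Sequence

def bag_label(instance_labels: Sequence[int]) -> int:
    """Return 1 (positive) if any instance label is 1, else 0 (negative)."""
    return int(any(label == 1 for label in instance_labels))

# Example: a bag with a single anomalous instance is labeled positive.
print(bag_label([0, 0, 1, 0]))  # -> 1
print(bag_label([0, 0, 0]))     # -> 0
```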
MIL has received much attention in the machine learning community in the past two decades, and various techniques have been developed to exploit the structure of the data to enhance performance in a variety of applications [2, 3, 6, 31, 41, 42]. In particular,
computer vision, where videos are naturally structured into bags and require significant single-instance labeling efforts, has benefited tremendously from MIL-based methods. Application of MIL for AD remains largely under-leveraged [4, 22, 38]. Yet, MIL provides an appealing formulation for semi-supervised AD over real-world datasets, and leveraging it for this purpose is the primary contribution of this paper.
In this paper, we propose an MIL-based formulation and various algorithmic instantiations of this framework based on different design decisions for its key components. In particular, we leverage the combination of MIL and the Strangeness based OUtlier Detection (StrOUD) algorithm [8]. StrOUD computes a strangeness/anomaly value for each data point and detects outliers by means of statistical testing and the calculation of p-values. Thus, the framework depends on two primary design decisions: the definition of the strangeness factor and the aggregation function. The degree of outlyingness/strangeness is needed for recognizing anomalous data points, while the aggregation function is needed to combine the measures of strangeness into a single anomaly score for a bag. In this paper, we utilize two alternative scores, the Local Outlier Factor and the Autoencoder (AE) reconstruction error, and six aggregation functions (minimum, maximum, average, median, spread, dspread) for the strangeness measure of a bag. We note that we are not limited to these design decisions; they can be easily generalized to other choices. To the best of our knowledge, the effort we describe here is the first to truly leverage the power of MIL in AD by hybridizing the concept of bags naturally with AD methods that are routinely used in the field.
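The following sketch illustrates the bag-scoring idea just described, assuming LOF (via scikit-learn) as the strangeness measure. The function and parameter names are illustrative, and the spread and dspread aggregates are not defined in this excerpt, so only the four unambiguous aggregates are shown:

```python
# Sketch: aggregate per-instance strangeness scores into a single bag-level score.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

AGGREGATES = {
    "min": np.min,
    "max": np.max,
    "average": np.mean,
    "median": np.median,
}

def bag_strangeness(bag: np.ndarray, reference: np.ndarray,
                    aggregate: str = "max", k: int = 20) -> float:
    """Score a bag of instances against a reference (normal) set.

    bag:       (n_instances, n_features) instances of one bag.
    reference: (n_reference, n_features) baseline data used to fit LOF.
    """
    lof = LocalOutlierFactor(n_neighbors=k, novelty=True).fit(reference)
    # score_samples returns the negated LOF; negate again so larger = stranger.
    strangeness = -lof.score_samples(bag)
    return float(AGGREGATES[aggregate](strangeness))
```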
We evaluate various algorithmic instantiations of the MIL framework over four datasets that capture different physical processes along different modalities (video and vibration signals). The results show that the MIL-based formulation performs no worse than single-instance learning on easy to moderate datasets and outperforms single-instance learning on more challenging datasets. Altogether, the results show that the framework generalizes well over diverse benchmark datasets resulting from different real-world application domains.
The rest of the paper is organized as follows. Section 2 reviews prior work in AD. The proposed methodology is described in Section 3, and the experimental evaluation is presented in Section 4. Section 5 concludes the paper.
2 PRIOR WORK
Extensive studies have been performed on AD in many application domains, from fraud detection in credit cards, to structural health monitoring in engineering, to bioinformatics in molecular biology [3, 27, 43]. Different types of data have been considered, from sequential time series data [9, 13], to image data, to molecular structure data [3, 16, 40].
Many methods have been developed for AD, varying from traditional density-based methods [25, 35] to more recent AE-based ones [7, 15]. Density-based methods assign an anomaly score to a single data point/instance by comparing the local neighborhood of a point to the local neighborhoods of its k nearest neighbours. Higher scores are indicative of anomalous instances.
The Local Outlier Factor (LOF) and its variants are density-based anomaly scores that are utilized extensively in the AD literature [1, 24]. For instance, work in [35] introduced an LOF variant called the Connectivity-based Outlier Factor (COF), which differs from LOF in the way that the neighborhood of an instance is computed. The Outlier Detection using In-degree Number (ODIN) score was presented in [24]. ODIN counts how many of a data point's k nearest neighbors also contain that point in their own k-neighborhoods, i.e., its in-degree in the k-nearest-neighbor graph. The inverse of ODIN is defined as the anomaly score.
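A minimal sketch of this in-degree score, assuming a k-nearest-neighbor graph built with scikit-learn (function and parameter names are illustrative):

```python
# Sketch: ODIN-style anomaly score = inverse of the in-degree in the kNN graph.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def odin_anomaly_scores(X: np.ndarray, k: int = 10, eps: float = 1e-9) -> np.ndarray:
    """Higher score = more anomalous (fewer points list this one as a neighbor)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own nearest neighbor
    _, idx = nn.kneighbors(X)                         # (n, k+1) neighbor indices
    in_degree = np.zeros(len(X))
    for neighbors in idx:
        for j in neighbors[1:]:                       # skip the point itself
            in_degree[j] += 1                         # point j appears in someone's k-neighborhood
    return 1.0 / (in_degree + eps)                    # inverse of in-degree; eps avoids division by zero
```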
There are other methods, known as deviation-based methods, that also utilize anomaly scores. These methods attempt to find a lower-dimensional space of normal data by capturing the correlation among the features. The data are projected onto a latent, lower-dimensional subspace, and unseen test data points with large reconstruction errors are determined to be anomalies. As these methods only encounter normal data in the training phase that seeks to learn the latent space, they are known as semi-supervised methods.
PCA- and AE-based methods are in the category of AD methods that seek to capture linear and non-linear feature correlations, respectively [15, 32]. Due to the capability of AEs in finding more complex, non-linear correlations, AE-based AD methods tend to perform better than PCA-based ones, generating fewer false anomalies.
In [15], conventional and convolutional AE-based (CAE) methods are presented and compared with PCA-based methods. Work in [5] proposes a variational AE-based (VAE) method which takes advantage of the probabilistic nature of the VAE. The method leverages the reconstruction probability instead of the reconstruction error as the anomaly score. AE-based methods, however, require setting a threshold for how large the reconstruction error or reconstruction probability has to be for an instance to be predicted as anomalous.
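As a concrete illustration of reconstruction-error scoring, here is a minimal sketch with a small fully-connected autoencoder in PyTorch; the architecture, names, and the idea of picking the threshold from a quantile of errors on held-out normal data are illustrative assumptions, not the method of any cited work:

```python
# Sketch: per-instance reconstruction error as an anomaly score.
import torch
import torch.nn as nn

class TinyAE(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                     nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                     nn.Linear(32, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_scores(model: TinyAE, x: torch.Tensor) -> torch.Tensor:
    """Mean squared reconstruction error per instance; higher = more anomalous."""
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1)

# Instances whose error exceeds a user-chosen threshold are flagged as anomalies,
# e.g. a high quantile of the errors observed on held-out normal data.
```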
Generally, most AD methods require the careful setting of many parameters, including the anomaly threshold, which is an ad-hoc process. This hyperparameter regulates the sensitivity to anomalies and the false alarm rate (the rate of normal instances detected as outliers/anomalies), and so strongly influences AD performance. To maintain a low false alarm rate, conformal AD (CAD) methods build on the conformal prediction (CFD) concept [28].
The underlying idea in CFD methods is to predict potential labels for each test data point by means of the p-value (one p-value per possible label). The non-conformity measure is utilized as an anomaly score, and p-values are calculated from it. To this end, the significance level needs to be determined in order to retain or reject the null hypothesis [29].
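The following is a minimal sketch of this p-value test, assuming the standard conformal construction in which the p-value of a test score is the fraction of calibration (baseline) non-conformity scores that are at least as large; names are illustrative:

```python
# Sketch: conformal-style p-value for a non-conformity / strangeness score.
import numpy as np

def p_value(test_score: float, baseline_scores: np.ndarray) -> float:
    """Fraction of baseline scores at least as large as the test score (test point included)."""
    n = len(baseline_scores)
    return (np.sum(baseline_scores >= test_score) + 1) / (n + 1)

def is_anomaly(test_score: float, baseline_scores: np.ndarray,
               significance: float = 0.05) -> bool:
    """Reject the 'normal' null hypothesis when the p-value falls below the significance level."""
    return p_value(test_score, baseline_scores) < significance
```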
In their utilization of significance testing to avoid overfitting, CFD methods are highly similar to the classic Strangeness based OUtlier Detection (StrOUD) method [8]. StrOUD was proposed as an AD method that combines the ideas of transduction and hypothesis testing. It eliminates the need for ad-hoc anomaly thresholds and can additionally be used for dataset cleaning.
Apart from the determination of the anomaly score or the AD category, there are many fundamental challenges in AD, not the least of which is finding appropriate training instances. It is generally easier to obtain instances of the normal behaviour of a system rather than anomalies, especially in real-world settings (e.g., large engineering infrastructures or industrial systems) [37], where the anomalous physical processes that generate anomalous data are rare events.