Multiple Instance Learning for Detecting Anomalies over Sequential Real-World Datasets



Parastoo Kamranfar

George Mason University

Fairfax, Virginia, USA

pkamranf@gmu.edu

David Lattanzi

George Mason University

Fairfax, Virginia, USA

dlattanz@gmu.edu

Amarda Shehu

George Mason University

Fairfax, Virginia, USA

ashehu@gmu.edu

Daniel Barbará∗

George Mason University

Fairfax, Virginia, USA

dbarbara@gmu.edu

ABSTRACT

Detecting anomalies over real-world datasets remains a challenging task. Data annotation is an intensive human labor problem, particularly in sequential datasets, where the start and end times of anomalies are not known. As a result, data collected from sequential real-world processes can be largely unlabeled or contain inaccurate labels. These characteristics challenge the application of anomaly detection techniques based on supervised learning. In contrast, Multiple Instance Learning (MIL) has been shown effective on problems with incomplete knowledge of labels in the training dataset, mainly due to the notion of bags. While largely under-leveraged for anomaly detection, MIL provides an appealing formulation for anomaly detection over real-world datasets, and this formulation is the primary contribution of this paper. We propose an MIL-based formulation and various algorithmic instantiations of this framework based on different design decisions for key components of the framework. We evaluate the resulting algorithms over four datasets that capture different physical processes along different modalities. The experimental evaluation draws out several observations. The MIL-based formulation performs no worse than single-instance learning on easy to moderate datasets and outperforms single-instance learning on more challenging datasets. Altogether, the results show that the framework generalizes well over diverse datasets resulting from different real-world application domains.

CCS CONCEPTS

• Computing methodologies → Machine learning algorithms; Multiple Instance-based learning.

∗Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

ACM SIGKDD ’22, August 14–18, 2022, Washington, DC

ACM ISBN 978-1-4503-XXXX-X/22/08...$15.00

https://doi.org/XXXXXXX.XXXXXXX

KEYWORDS

Anomaly Detection, Multiple Instance Learning, Strangeness, Outlier Detection.

ACM Reference Format:

Parastoo Kamranfar, David Lattanzi, Amarda Shehu, and Daniel Barbará. 2022. Multiple Instance Learning for Detecting Anomalies over Sequential Real-World Datasets. In ACM SIGKDD ’22: ACM SIGKDD Workshops, August 14–18, 2022, Washington, DC. ACM, New York, NY, USA, 9 pages. https://doi.org/XXXXXXX.XXXXXXX

1 INTRODUCTION

Anomaly detection (AD) is a well-studied problem in machine learning and seeks to detect data points/instances that deviate from an expected behavior; the deviating instances are known as outliers or anomalies [ ]. Applications of AD have been explored extensively in the literature and over a variety of datasets, including data generated from sequential real-world processes [9, 14].

Despite being well-studied, AD over real-world datasets remains challenging. In many real-world settings, annotation of sequential data requires intensive human labor; annotation is also made difficult by unknown or noisy start and end times of anomalies. As a result, data collected from sequential real-world processes are often largely unlabeled, or labels are often inaccurate. This setting is particularly challenging for AD methods formulated under the umbrella of supervised learning.

In contrast, Multiple Instance Learning (MIL) has been shown effective in dealing with incomplete knowledge of labels in the training dataset. MIL was introduced in [ ] in the context of binary classification as a weakly-supervised approach that reduces annotation efforts. MIL assumes that the data points/instances are organized in sets, also known as bags. Essentially, MIL deals with bag labels instead of individual instance labels [ ]. The label of a bag (negative/positive) is assigned according to the labels of the instances the bag contains. The existence of at least one positive instance is enough for labelling the bag as positive, while a negative bag must contain only negative instances [ ]. This provides a way of building a more robust classifier than what can be built by relying exclusively on single-instance labels.
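The bag-labelling rule described above can be illustrated with a minimal sketch. The helper names `make_bags` and `bag_label`, and the choice of fixed-size consecutive windows as bags, are assumptions made only for illustration; they are not taken from the paper.

```python
def make_bags(instances, bag_size):
    """Split a sequence into consecutive, non-overlapping bags.

    Fixed-size windows are an illustrative assumption, not the paper's scheme.
    """
    return [instances[i:i + bag_size] for i in range(0, len(instances), bag_size)]


def bag_label(instance_labels):
    """Standard MIL rule: a bag is positive (1) if it holds at least one
    positive instance, and negative (0) only if every instance is negative."""
    return 1 if any(label == 1 for label in instance_labels) else 0


# Toy sequence of instance labels with a single anomalous (positive) instance.
instance_labels = [0, 0, 0, 0, 1, 0, 0, 0, 0]
bags = make_bags(instance_labels, bag_size=3)
bag_labels = [bag_label(bag) for bag in bags]
print(bag_labels)  # [0, 1, 0]
```

Only the bag containing the positive instance is labelled positive, so an annotator would need to flag that window rather than the exact anomalous time step.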

MIL has received much attention in the machine learning community in the past two decades, and various techniques have been developed to exploit the structure of the data to enhance performance in a variety of applications [ ]. In particular, computer vision, where videos are naturally structured into bags and require significant single-instance labeling efforts, has benefited tremendously from MIL-based methods. Application of MIL for AD remains largely under-leveraged [ ]. Yet, MIL provides an appealing formulation for semi-supervised AD over real-world datasets, and it is the primary contribution of this paper.

arXiv:2210.01707v1 [cs.LG] 4 Oct 2022

In this paper, we propose an MIL-based formulation and various algorithmic instantiations of this framework based on different design decisions for key components of the framework. In particular, we leverage the combination of MIL and the Strangeness based OUtlier Detection (StrOUD) algorithm [ ]. StrOUD computes a strangeness/anomaly value for each data point and detects outliers by means of statistical testing and calculation of a p-value. Thus, the framework depends on two primary design decisions: the definition of the strangeness factor and the aggregation function. The degree of outlying/strangeness is needed for recognizing anomalous data points, while the aggregate function is needed to combine the measures of strangeness into a single anomaly score of a bag. In this paper, we utilize two alternative scores, the Local Outlier Factor and the Autoencoder (AE) reconstruction error. We utilize six aggregate functions (minimum, maximum, average, median, spread, dspread) for the strangeness measure of a bag. We note that we are not limited to these design decisions, and they can be easily generalized to any other choices. To the best of our knowledge, the effort we describe here is the first to truly leverage the power of MIL in AD by hybridizing the concept of bags naturally with AD methods that are routinely used in the field.
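As a rough illustration of the aggregation step, the sketch below collapses the per-instance strangeness values of one bag into a single bag-level score. Only four of the six aggregates are shown; spread and dspread are not defined in this part of the text, so they are left out rather than guessed. The function name `bag_strangeness` and the toy scores are hypothetical.

```python
from statistics import mean, median

# Four of the six aggregates named in the text; "spread" and "dspread"
# are omitted because their definitions are not given in this section.
AGGREGATES = {
    "minimum": min,
    "maximum": max,
    "average": mean,
    "median": median,
}


def bag_strangeness(instance_scores, how="maximum"):
    """Collapse per-instance strangeness values into one score for the bag."""
    if not instance_scores:
        raise ValueError("a bag must contain at least one instance")
    return AGGREGATES[how](instance_scores)


# Toy bag: three ordinary instances and one that looks strange.
instance_scores = [0.9, 1.1, 1.0, 4.2]
for name in AGGREGATES:
    print(name, bag_strangeness(instance_scores, name))
```

On this toy bag the maximum reacts strongly to the single strange instance, while the minimum and median barely move, which is one reason the choice of aggregate is treated as a design decision.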

We evaluate various algorithmic instantiations of the MIL framework over four datasets that capture different physical processes along different modalities (video and vibration signals). The results show that the MIL-based formulation performs no worse than single-instance learning on easy to moderate datasets and outperforms single-instance learning on more challenging datasets. Altogether, the results show that the framework generalizes well over diverse benchmark datasets resulting from different real-world application domains.

The rest of the paper is organized as follows. Section 2 relates prior work in AD. The proposed methodology is described in Section 3, and the experimental evaluation is related in Section 4. Section 5 concludes the paper.

2 PRIOR WORK

Extensive studies have been performed on AD in many application domains, from fraud detection in credit cards, to structural health monitoring in engineering, to bioinformatics in molecular biology [ ]. Different types of data have been considered, from sequential time series data [ ], to image data, to molecular structure data [3, 16, 40].

Many methods have been developed for AD, varying from traditional density-based methods [ ] to more recent AE-based ones [ ]. Density-based methods assign an anomaly score to a single data point/instance by comparing the local neighborhood of a point to the local neighborhoods of its 𝑘 nearest neighbours. Higher scores are indicative of anomalous instances.

The Local Outlier Factor (LOF) and its variants are density-based anomaly scores that are utilized extensively in AD literature [ ]. For instance, work in [ ] introduced an LOF variant score called the Connectivity based Outlier Factor (COF), which differs from LOF in the way that the neighborhood of an instance is computed. The Outlier Detection using In-degree Number (ODIN) score has been presented in [ ]. ODIN measures the number of the 𝑘 nearest neighbors of a data point that also have that data point in their own neighborhood. The inverse of ODIN is defined as the anomaly score.
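A small sketch may help make the in-degree idea concrete. It reads ODIN as the in-degree of a point in the 𝑘-nearest-neighbor graph, that is, how many other points list it among their own 𝑘 nearest neighbors; this reading is an assumption based on the method's name. The anomaly score 1 / (1 + in-degree) is also an assumption, chosen so that an in-degree of zero does not cause a division by zero. The function names and toy points are hypothetical.

```python
import math


def knn_indices(points, k):
    """For each point, return the indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


def odin_scores(points, k):
    """Inverse in-degree score: points that few others count as a neighbor
    receive a high anomaly score (assumed form: 1 / (1 + in-degree))."""
    indegree = [0] * len(points)
    for nbrs in knn_indices(points, k):
        for j in nbrs:
            indegree[j] += 1
    return [1.0 / (1 + d) for d in indegree]


# Four points in a tight cluster and one isolated point (index 4).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = odin_scores(points, k=2)
print(scores.index(max(scores)))  # 4
```

No clustered point counts the isolated point among its two nearest neighbors, so its in-degree is zero and it receives the highest score.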

There are other methods, known as deviation-based methods, that also utilize anomaly scores. These methods attempt to find a lower-dimensional space of normal data by capturing the correlation among the features. The data are projected onto a latent, lower-dimensional subspace, and unseen test data points with large reconstruction errors are determined to be anomalies. As these methods only encounter normal data in the training phase that seeks to learn the latent space, they are known as semi-supervised methods.
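The project-and-reconstruct idea can be sketched with a linear, PCA-style subspace standing in for the learned latent space. The sketch fits a one-dimensional subspace to normal data only and scores a test point by its reconstruction error. The power-iteration fit, the function names, and the toy data are illustrative assumptions; an autoencoder would replace the linear projection with a learned non-linear one.

```python
import math


def fit_direction(normal_data, iterations=100):
    """Estimate the mean and leading principal direction of normal data
    by power iteration on the covariance matrix."""
    n = len(normal_data)
    dim = len(normal_data[0])
    mean = [sum(row[d] for row in normal_data) / n for d in range(dim)]
    centered = [[row[d] - mean[d] for d in range(dim)] for row in normal_data]
    cov = [
        [sum(r[a] * r[b] for r in centered) / n for b in range(dim)]
        for a in range(dim)
    ]
    # The start vector must not be orthogonal to the leading direction.
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def reconstruction_error(x, mean, direction):
    """Project x onto the 1-D subspace, map it back, and measure the gap."""
    centered = [xi - m for xi, m in zip(x, mean)]
    coord = sum(c * d for c, d in zip(centered, direction))
    reconstructed = [m + coord * d for m, d in zip(mean, direction)]
    return math.dist(x, reconstructed)


# Toy "normal" data lying on the line y = 2x (training sees normal data only).
normal_data = [(float(t), 2.0 * t) for t in range(10)]
mean, direction = fit_direction(normal_data)

on_line = reconstruction_error((3.0, 6.0), mean, direction)   # follows the pattern
off_line = reconstruction_error((3.0, 0.0), mean, direction)  # breaks the pattern
print(on_line < off_line)  # True
```

A point that respects the correlation learned from normal data is reconstructed almost perfectly, whereas a point that violates it is left with a large error. Deciding how large is too large is exactly the thresholding question that the remainder of this section turns to.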

PCA- and AE-based methods are in the category of AD methods that seek to capture linear and non-linear feature correlations, respectively [ ]. Due to the capability of AEs in finding more complex, non-linear correlations, AE-based AD methods tend to perform better than PCA-based ones, generating fewer false anomalies.

In [ ], conventional and convolutional AE-based (CAE) methods are presented and compared with PCA-based methods. Work in [ ] proposes a variational AE-based (VAE) method which takes advantage of the probabilistic nature of VAE. The method leverages the reconstruction probability instead of the reconstruction error as the anomaly score. AE-based methods, however, require setting a threshold on the reconstruction error or reconstruction probability beyond which an instance is predicted as anomalous.

Generally, most AD methods require careful setting of many parameters, including the anomaly threshold, which is an ad-hoc process. This hyperparameter regulates sensitivity to anomalies and the false alarm rate (the rate of normal instances detected as outliers/anomalies), and thus largely determines AD performance. To maintain a low false alarm rate, conformal AD (CAD) methods build on the conformal prediction (CFD) concept [28].

The underlying idea in CFD methods is to predict potential labels for each test data point by means of the p-value (one p-value per possible label). The non-conformity measure is utilized as an anomaly score, and p-values are calculated. To this end, the significance level needs to be determined in order to retain or reject the null hypothesis [29].

In their utilization of significance testing to avoid overfitting, CFD methods are highly similar to the classic Strangeness based OUtlier Detection (StrOUD) method [8]. StrOUD was proposed as an AD method that combines the ideas of transduction and hypothesis testing. It eliminates the need for ad-hoc anomaly thresholds and can additionally be used for dataset cleaning.
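The hypothesis-testing step shared by these methods can be sketched as follows: a test point's strangeness is ranked against the strangeness values of a baseline of normal data, and the resulting p-value is compared with a chosen significance level. This is a minimal sketch, not the published StrOUD procedure. The "+1" convention (counting the test point itself), the significance level of 0.05, and the function names are assumptions for illustration.

```python
def p_value(test_strangeness, baseline_strangeness):
    """Fraction of points (baseline plus the test point itself) that are at
    least as strange as the test point."""
    at_least_as_strange = sum(
        1 for s in baseline_strangeness if s >= test_strangeness
    )
    return (at_least_as_strange + 1) / (len(baseline_strangeness) + 1)


def is_anomaly(test_strangeness, baseline_strangeness, significance=0.05):
    """Reject the 'test point is normal' hypothesis when p < significance."""
    return p_value(test_strangeness, baseline_strangeness) < significance


# Toy baseline: strangeness values 1..99 computed on normal data.
baseline = list(range(1, 100))

print(p_value(500, baseline), is_anomaly(500, baseline))  # 0.01 True
print(p_value(50, baseline), is_anomaly(50, baseline))    # 0.51 False
```

The significance level replaces a hand-tuned threshold on the raw score: it bounds the expected false alarm rate regardless of the scale of the strangeness function.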

Apart from determination of the anomaly score or the AD category, there are many fundamental challenges in AD, not the least of which is finding appropriate training instances. It is generally easier to obtain instances from the normal behaviour of a system rather than anomalies, especially in real-world settings (e.g., large engineering infrastructures or industrial systems) [ ], where anomalous physical processes that generate anomalous data are rare events.
