Unsupervised Anomaly Detection for Auditing Data
and Impact of Categorical Encodings
Ajay Chawda
Department of Computer Science
TU Kaiserslautern
Financial Mathematics, Fraunhofer ITWM
a_chawda19@cs.uni-kl.de
Stefanie Grimm
Financial Mathematics
Fraunhofer ITWM
Kaiserslautern DE
stefanie.grimm@itwm.fraunhofer.de
Marius Kloft
Department of Computer Science
TU Kaiserslautern
Kaiserslautern DE
kloft@cs.uni-kl.de
Abstract
In this paper, we introduce the Vehicle Claims dataset, consisting of fraudulent
insurance claims for automotive repairs. The data belongs to the broader
category of auditing data, which also includes journal and network intrusion
data. Insurance claim data are distinctively different from other auditing data
(such as network intrusion data) in their high number of categorical attributes. We
tackle the common problem of missing benchmark datasets for anomaly detection:
datasets are mostly confidential, and the public tabular datasets do not contain
relevant and sufficient categorical attributes. Therefore, a large-sized dataset is
created for this purpose and referred to as Vehicle Claims (VC) dataset. The
dataset is evaluated on shallow and deep learning methods. Due to the introduction
of categorical attributes, we encounter the challenge of encoding them for the
large dataset. As One Hot encoding of a high-cardinality dataset invokes the "curse
of dimensionality", we experiment with GEL encoding and an embedding layer
for representing categorical attributes. Our work compares competitive learning,
reconstruction-error, density estimation, and contrastive learning approaches combined
with Label, One Hot, and GEL encodings and an embedding layer to handle categorical values.
1 Introduction
In the context of auditing data, most of the anomaly detection methods are trained on private datasets.
These datasets contain personal information regarding individuals and companies, which can be used
for malicious purposes by hackers in the event the knowledge becomes public. To hide the sensitive
information, we need a dataset that models the behaviour of private auditing data but does not contain
personal information about individuals. The lack of a benchmark dataset in the context of auditing
data motivates us to create an anomaly benchmark. (Ruff et al., 2021) mentions three strategies
for anomaly benchmarks: k-classes-out, where we consider a few classes in multiclass data to be
normal and the rest as anomalous; Synthetic, where we use a supervised dataset and insert synthetic
anomalies; and Real World, where we label the dataset with the help of a human expert. In this paper,
we follow the Synthetic approach and create a synthetic dataset based on domain knowledge learned
from the fraudulent claims dataset of an automotive dealer.
NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research.
arXiv:2210.14056v2 [cs.LG] 26 Oct 2022
Fraud in finance, insurance, and healthcare is ubiquitous. An insurance claim for a vehicle can be
marked as an anomaly if it amounts to the cost of repairing an engine issue when in reality the
claim is for fixing a punctured tyre; likewise, a single credit card transaction amounting to the total
yearly bill for the purchase of a book is an anomalous sample. These seemingly anomalous
data points might be fraudulent. However, they may be noisy normal observations naturally varying
from the default behavior and not fraudulent, which may become evident when studying them using
domain knowledge. An insurance claim of an unusually high amount might be genuine but, due to
an incorrect reason listed in the dataset, appear as a fraudulent claim. If there is prior knowledge
about the data being an anomaly, it will make our task easier. However, in a real-world scenario, data
does not come with labels, and labeling a dataset is an expensive task. Due to the unavailability of
labels, we focus our attention on unsupervised methods for anomaly detection.
In anomaly detection, our goal is to distinguish between normal and anomalous samples. We use
classic machine learning and modern deep learning methods for anomaly detection (AD) as baselines
in this paper. Classic unsupervised methods include Isolation Forest (Liu et al., 2008), an
ensemble of binary decision trees that creates shorter paths for anomalies, One-class support vector
machines (OC-SVM) (Schölkopf et al., 2001), which encompasses the normal samples with the
smallest possible hypersphere, and Local Outlier Factor (Breunig et al., 2000), which predicts outliers
based on the local neighborhood of the sample. These methods are used as shallow unsupervised
baselines. Other supervised baselines used in this paper are Random Forest (Breiman, 2001) and
Gradient Boosting (Friedman, 2001). Our work evaluates RSRAE (Lai et al., 2020) for reconstruction
error method, DAGMM, SOM-DAGMM (Zong et al., 2018; Chen et al., 2021a) for density estimation
method, SOM (Kohonen, 1990) for competitive learning and NeutralAD (Qiu et al., 2021) and
LOE (Qiu et al., 2022) for contrastive learning approaches in unsupervised anomaly detection.
Comparing deep unsupervised methods based on distinguishing concepts helps us analyze the suitable
methods for auditing datasets. Further, the encoding of categorical attributes is another challenge in
unsupervised learning. In the context of auditing data, there is literature associated with the evaluation
of models (Nonnenmacher and Gómez, 2021; Gomes et al., 2021), but there is a lack of information
on the representation of categorical features in unsupervised learning scenarios. Recent use of
embeddings (Dahouda and Joe, 2021; Guo and Berkhahn, 2016; Karingula et al., 2021) encourages
us to use an embedding layer for encoding categorical features. We also use GEL (Golinko and Zhu, 2019)
encoding to mitigate issues due to the One Hot representation of the high-cardinality dataset. The
labels are available for the datasets used in this paper, which allows us to report AUC and F1 scores.
In an entirely unsupervised setting, this would not be possible, and we need to look for alternatives.
The choice of evaluation metric for anomaly detection is important as noted by (Alvarez et al., 2022),
so we discuss the difference between AUC and evaluating under thresholds using F1 scores.
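The dimensionality blow-up that One Hot encoding causes on high-cardinality data, compared with Label encoding, can be sketched as follows. This is a minimal illustration with made-up maker values, not the paper's preprocessing code:

```python
# Minimal sketch contrasting Label and One Hot encoding of a categorical
# column. The example values are illustrative, not from the VC dataset.

def label_encode(values):
    """Map each distinct category to a single integer code (one column total)."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

def one_hot_encode(values):
    """Map each category to a binary indicator vector (one column per category)."""
    categories = sorted(set(values))
    index = {v: i for i, v in enumerate(categories)}
    return [[1 if index[v] == j else 0 for j in range(len(categories))]
            for v in values]

makers = ["Volkswagen", "Ferrari", "Volkswagen", "BMW"]
labels = label_encode(makers)     # one column: [2, 1, 2, 0]
one_hot = one_hot_encode(makers)  # three columns, one per distinct maker
```

Label encoding keeps one column regardless of cardinality, while One Hot grows linearly with the number of unique values; for the VC dataset, with a total categorical cardinality of 1171, One Hot encoding adds 1171 binary columns.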
Our work solves the problem of a missing public dataset for auditing data where categorical attributes
are dominant and decisive in the classification process of normal and anomalous samples. We
demonstrate the performance of the synthetic dataset for various shallow and deep unsupervised
methods and show that it performs similarly to the existing datasets. We also observe that encoding
categorical attributes is an essential factor in the task of unsupervised anomaly detection, where the
goal is to embed the attributes into meaningful representations for the model.
2 Dataset
Table 1: Summary of datasets. Auditing data differ from the Credit Card and Arrhythmia
datasets in the number of categorical features.
Dataset Source Rows Num. features Cat. features Anomaly ratio
Car Insurance Kaggle 1000 16 17 0.25
Vehicle Insurance Oracle 15420 6 26 0.06
Vehicle Claims Ours 268255 5 13 0.21
Credit Card MLG-ULB 284807 30 0 0.001
KDD UCI 494021 34 7 0.2
Arrhythmia UCI 452 279 0 0.26
The principal component of a machine learning or deep learning model is data. Experimental data
is collected in a laboratory or controlled environment, where the researcher participates in the data
collection process and can alter the outcome by changing variables, whereas for observational
data the researcher is an observer of the data collection process. The latter type of data
is usually a byproduct of business activities and represents facts that have happened in the
past (Cunningham, 2020). Auditing data commonly belong to the category of observational data.
Prior knowledge of fraudulent insurance claims would therefore exist in the case of experimental
data and help in labeling the dataset. In our work, we evaluate unsupervised methods on both
experimental and observational data. The datasets used in our work have labels available, which
makes our task of evaluating the models easier. The observational data for insurance claims are
small and medium size datasets that are downloaded from (Kaggle, 2018; Oracle, 2017). The
experimental dataset is a large dataset generated from metadata of the DVM car dataset (Huang et al.,
2021). Although the availability of labels helps with the analysis of our models, the goal is to reduce
dependence on them for training purposes in future scenarios because of the economic expense
incurred while labeling the data. Using different datasets with various sizes and
features provides us with a platform to analyze the behavior of trained models as a function of the amount of data.
The arrhythmia and KDD datasets are used as benchmarks in unsupervised learning methods (Zong
et al., 2018),(Qiu et al., 2022) for anomaly detection and have fewer categorical attributes than those
found in auditing datasets. As observed in Table 1, auditing datasets have more categorical features,
which are important to understand the behavior of data points. Another important piece of information
in the above table is the anomaly ratio, i.e., the fraction of positive class samples (or anomalies)
in the complete dataset. We observe that it ranges from 0.001 to 0.26. It is an individual choice to consider
either positive or negative class as anomalous data, and in this paper, we consider positive labels as
anomalies.
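The anomaly ratio in Table 1 is simply the fraction of positive samples; a toy illustration with hypothetical labels:

```python
# Hypothetical binary labels; as in the paper, the positive class (1) is
# treated as anomalous.
labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]

# Anomaly ratio = number of positive samples / total samples.
anomaly_ratio = sum(labels) / len(labels)  # 0.2 for this toy list
```

The same computation on the datasets of Table 1 yields the reported range of 0.001 to 0.26.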
2.1 Vehicle Claims
There are multiple tabular datasets available for anomaly detection. However, the available
benchmark datasets, such as Credit Card fraud (MLG-ULB, 2008), KDD, Arrhythmia, and
Thyroid (Blake et al., 1998), are not suitable for drawing conclusions about the performance of models on auditing data.
Therefore, a new synthetic dataset is created with the domain knowledge about insurance claims of
vehicles for the auditing scenario. (Huang et al., 2021) released the DVM-Car dataset, which consists
of 1.4 million car images from various advertising platforms and six tables consisting of metadata
about the cars in the dataset. The new dataset, with 268,255 rows, is created from a Basic table
(mainly used for indexing the other tables, covering 1,011 generic models from 101 automakers), a Sales
table (sales of 773 car models from 2001 to 2020), a Price table (basic price data of historical
models), and a Trim table (0.33 million entries of trim-level information such as selling price, fuel
type, and engine size). This dataset is referred to as the Vehicle Claims (VC) dataset.
A fraudulent sample will have an insurance amount claimed that is higher than the average claim
amount for the respective case. For example, a tyre is replaced by the customer due to a puncture, but
the claim amount reflects the cost of replacing two tyres. This is a fraudulent claim and an anomaly.
Our idea is to model this information in the dataset with the following steps. Firstly, the categorical
features issue and issue_id are added, where issue_id is a subcategory of the issue column. Secondly,
the repair_complexity column is added based on the maker of the vehicle; common brands like
Volkswagen have complexity one, whereas Ferrari has four. Thirdly, repair_hours and repair_cost
are calculated based on issue, issue_id, and repair_complexity. Every tenth row in repair_cost
and every 20th row in repair_hours is an anomaly. Lastly, labels are added for the purpose of verification.
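The first two construction steps can be sketched as follows; the issue list, sub-category counts, and complexity map are illustrative assumptions, not the exact values used to build the VC dataset:

```python
import random

# Illustrative constants (assumptions, not the dataset's actual tables).
ISSUES = ["Warning Light", "Engine Issue", "Transmission Issue", "Electrical Issue"]
SUBCATEGORIES = {"Warning Light": 8, "Engine Issue": 4,
                 "Transmission Issue": 3, "Electrical Issue": 5}
COMPLEXITY = {"Volkswagen": 1, "Ferrari": 4}  # common makers low, exotic high

def add_claim_columns(row, rng):
    """Steps 1-2: randomly assign issue/issue_id, look up repair_complexity."""
    row["issue"] = rng.choice(ISSUES)
    row["issue_id"] = rng.randrange(SUBCATEGORIES[row["issue"]])
    # Default complexity of 2 for makers outside the map is an assumption.
    row["repair_complexity"] = COMPLEXITY.get(row["maker"], 2)
    return row

rng = random.Random(0)  # seeded for reproducibility
row = add_claim_columns({"maker": "Ferrari"}, rng)
```

In the actual dataset, issue_id is drawn from the sub-categories of the assigned issue, exactly as the bounded `randrange` call mimics here.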
There are 56,749 anomalous points in the VC dataset. One may argue about the importance of
having anomalous values in categorical features like door_num, Engin_size, etc., since for insurance
claims numerical features such as Price are more important. However, the categorical attributes help
to observe the amount of bias added to the model by low-importance features, and the categorical
anomalies will also be helpful for explainable anomaly detection models.
Issues are randomly assigned from a list of existing known issues about the vehicles. Issue_Id is a
subcategory of certain issues. Warning Light has 8 sub-categories because a warning light can mean
anything from low fuel or engine oil to a damaged braking system. Similarly, Engine Issue,
Transmission Issue, and Electrical Issue have 4, 3, and 5 sub-categories, respectively. The list of
issues contains the values as follows.
The repair_complexity column contains the complexity of repairing a vehicle, depending on
the availability of the workshop and the price of the vehicle. Common automotive makers like
Volkswagen have complexity 1, whereas Ferrari has complexity 3. The complexity of the models and
the respective makers are listed in Table 4.
The repair_hours attribute is the time required to repair the vehicle, calculated by multiplying
the repair_complexity of the vehicle with a predefined number of hours for the issue and issue_id.
The repair_cost is the sum of repair_hours times 20 plus a fraction of the price of the vehicle; we
have assumed the hourly rate of labour to be 20 units. The anomalies are introduced at every 10th
instance in repair_cost and lie between 3 and 6 standard deviations from the mean. Every 20th
row in the repair_hours attribute is an anomaly and lies between 3 and 4 standard deviations from the
mean. This ratio can be modified to suit the problem statement; if the goal is anomaly detection in an
imbalanced setting, the anomalies can be diluted to a ratio of 0.01. Table 3 lists the number of hours
and cost of repair for each Issue and Issue_Id.
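The repair_hours/repair_cost computation and the standard-deviation-based anomaly injection described above can be sketched as follows; the base hours per issue and the price fraction are assumed values for illustration:

```python
import statistics

HOURLY_RATE = 20       # assumed labour rate in units, as stated in the text
PRICE_FRACTION = 0.01  # hypothetical fraction of the vehicle price

def repair_hours(complexity, base_hours):
    """repair_hours = repair_complexity * predefined hours for the issue."""
    return complexity * base_hours

def repair_cost(hours, price):
    """repair_cost = repair_hours * 20 + fraction of the vehicle price."""
    return hours * HOURLY_RATE + PRICE_FRACTION * price

# Normal costs for complexities 1-4 with 2 base hours and a 20,000-unit price.
costs = [repair_cost(repair_hours(c, 2.0), 20000.0) for c in (1, 2, 3, 4)]
mu, sigma = statistics.mean(costs), statistics.stdev(costs)

def inject_cost_anomaly(value, k=4.5):
    """Shift a repair_cost value by k standard deviations, with 3 <= k <= 6."""
    assert 3 <= k <= 6
    return value + k * sigma

anomalous = inject_cost_anomaly(costs[0])
```

In the dataset this shift is applied to every 10th repair_cost row (with k between 3 and 6) and to every 20th repair_hours row (with k between 3 and 4).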
Missing values in the table are replaced by anomalous values. Our idea was to introduce categorical
anomalies in the dataset. In the real world, incorrect information filled into healthcare or insurance
forms often makes it difficult to find the reason for an anomalous sample. We believe replacing these
missing values with anomalous values will be useful for explainable models to determine the cell
values due to which a data point behaves as an anomaly. The missing values in each attribute of the
original data are replaced as follows.
Color: Gelb, Reg_year: 3010, Body_type: Wood, Door_num: 0, Engin_size: 999.0L, Gearbox:
Hybrid, Fuel_type: Hydrogen
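The replacement of missing values with the anomalous sentinels listed above can be sketched as a simple mapping; the attribute names and sentinel values follow the text, while the helper function is hypothetical:

```python
# Anomalous sentinel values per attribute, as listed in the text.
ANOMALOUS_FILL = {
    "Color": "Gelb",
    "Reg_year": 3010,
    "Body_type": "Wood",
    "Door_num": 0,
    "Engin_size": "999.0L",
    "Gearbox": "Hybrid",
    "Fuel_type": "Hydrogen",
}

def fill_missing(row):
    """Replace None entries with the anomalous sentinel for that attribute."""
    return {k: (ANOMALOUS_FILL[k] if v is None else v) for k, v in row.items()}

row = fill_missing({"Color": None, "Door_num": 4})
```

Each sentinel is chosen to be implausible for its attribute (a registration year of 3010, a wooden body type), so an explainable model can point at the offending cell.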
There are two other attributes, breakdown_date and repair_date, that are not used in our evaluation;
they contain the date of breakdown and the date of repair of the vehicle. The anomalous points in
these attributes are those with a large difference between the two dates: if the number of hours required
to repair is 9, the vehicle should be returned in 2 days, considering an 8-hour work day.
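The plausibility check relating repair_hours to the gap between breakdown_date and repair_date can be sketched as follows, assuming an 8-hour work day as in the text (the zero-slack threshold is an assumption for illustration):

```python
import math

def expected_return_days(repair_hours, hours_per_day=8):
    """Days a repair should take under an 8-hour work day."""
    return math.ceil(repair_hours / hours_per_day)

def is_date_anomaly(actual_days, repair_hours):
    """Flag claims where the date gap exceeds the expected repair duration."""
    return actual_days > expected_return_days(repair_hours)

# 9 hours of work should be finished within 2 work days, as in the example.
```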
We use three insurance claims datasets to evaluate unsupervised models for auditing data. The
characteristics of the VC dataset are modeled on the real-world problem of fraudulent claims. Our
dataset consists of the essential features that are decisive in distinguishing normal and anomalous
samples. The number of anomalies in our dataset can be adjusted to suit the training
environment. The total cardinality of the categorical attributes in the dataset is 1171: since it is a large
dataset, each categorical attribute contains more unique values than in the CI and VI datasets. We will
observe that the VC dataset suffers from the curse of dimensionality.
3 Data splitting and Evaluation metrics
In this section, we briefly describe the training and evaluation strategy of our work. We select models
from different categories of unsupervised learning approaches and observe the performance of our
dataset. (Ruff et al., 2021) categorizes anomaly detection methods into three topics. First, density
estimation and probabilistic methods, which predict anomalies by estimating the distribution of
normal data; DAGMM (Zong et al., 2018) and SOM-DAGMM (Chen et al., 2021a) belong to the energy-
based models under this category. Second, one-class classification approaches, which
aim to learn a decision boundary around the normal data distribution. Third, reconstruction models, which predict
anomalies based on a high reconstruction error on test data; RSRAE (Lai et al., 2020) belongs to the
category of deep autoencoders. Another deep learning variant is Self-Organizing Maps, which are
trained by competitive learning instead of backpropagation learning. SOM has been recently involved
in the field of intrusion detection (Chen et al., 2021a;b) and fraud detection (Mongwe and Malan,
2020). Since SOM (Kohonen, 1990) is used in SOM-DAGMM (Chen et al., 2021a) to improve the
topological structure of the latent representation of the data; our aim is to investigate the results using
a simple SOM. In the original papers, DAGMM and SOM are trained on a network intrusion dataset
(KDD) containing more numerical than categorical features, and RSRAE is the state-of-the-art
model on image (Fashion MNIST) and document (20NewsGroups) datasets in the field of unsupervised
anomaly detection. LOE trains the model with contaminated data, which is suitable for our approach.
The code and dataset are available on Github1.
In the literature for unsupervised anomaly detection, different data splits are used to train and evaluate
models. We need to decide whether to use only normal data or data with anomalous samples for
training. In the latest work, (Qiu et al., 2022) provides the training framework where the model is
1https://github.com/ajaychawda58/UADAD