Unsupervised Anomaly Detection for Auditing Data
and Impact of Categorical Encodings
Ajay Chawda
Department of Computer Science
TU Kaiserslautern
Financial Mathematics, Fraunhofer ITWM
a_chawda19@cs.uni-kl.de
Stefanie Grimm
Financial Mathematics
Fraunhofer ITWM
Kaiserslautern DE
stefanie.grimm@itwm.fraunhofer.de
Marius Kloft
Department of Computer Science
TU Kaiserslautern
Kaiserslautern DE
kloft@cs.uni-kl.de
Abstract
In this paper, we introduce the Vehicle Claims dataset, consisting of fraudulent
insurance claims for automotive repairs. The data belongs to the broader
category of auditing data, which also includes journal and network intrusion
data. Insurance claim data are distinctively different from other auditing data
(such as network intrusion data) in their high number of categorical attributes. We
tackle the common problem of missing benchmark datasets for anomaly detection:
datasets are mostly confidential, and the public tabular datasets do not contain
relevant and sufficient categorical attributes. Therefore, a large-sized dataset is
created for this purpose and referred to as Vehicle Claims (VC) dataset. The
dataset is evaluated on shallow and deep learning methods. Due to the introduction
of categorical attributes, we encounter the challenge of encoding them for the
large dataset. As One Hot encoding of a high-cardinality dataset invokes the "curse
of dimensionality", we experiment with GEL encoding and an embedding layer
for representing categorical attributes. Our work compares competitive learning,
reconstruction-error, density estimation, and contrastive learning approaches combined
with Label, One Hot, and GEL encodings and an embedding layer to handle categorical values.
1 Introduction
In the context of auditing data, most of the anomaly detection methods are trained on private datasets.
These datasets contain personal information regarding individuals and companies, which can be used
for malicious purposes by hackers in the event the knowledge becomes public. To hide the sensitive
information, we need a dataset that models the behaviour of private auditing data but does not contain
personal information about individuals. The lack of a benchmark dataset in the context of auditing
data motivates us to create an anomaly benchmark. (Ruff et al., 2021) mentions three strategies
for anomaly benchmarks: k-classes-out, where we consider a few classes in multiclass data to be
normal and the rest as anomalous; Synthetic, where we use a supervised dataset and insert synthetic
anomalies; and Real World, where we label the dataset with the help of a human expert. In this paper,
we follow the Synthetic approach and create a synthetic dataset based on domain knowledge learned
from the fraudulent claims dataset of an automotive dealer.
NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research.
arXiv:2210.14056v2 [cs.LG] 26 Oct 2022
Fraud in finance, insurance, and healthcare is ubiquitous. An insurance claim for a vehicle can be
marked as an anomaly if it amounts to the cost of repairing an engine issue when in reality the
claim is for fixing a punctured tyre; likewise, a single credit card transaction amounting to the total
yearly bill for the purchase of a book is an anomalous sample. These seemingly anomalous
data points might be fraudulent. However, they may be noisy normal observations naturally varying
from the default behavior and not fraudulent, which may become evident when studying them using
domain knowledge. An insurance claim of an unusually high amount might be genuine but, due to
an incorrect reason listed in the dataset, appear as a fraudulent claim. If there is prior knowledge
about the data being an anomaly, it will make our task easier. However, in a real-world scenario, data
does not come with labels, and labeling a dataset is an expensive task. Due to the unavailability of
labels, we focus our attention on unsupervised methods for anomaly detection.
In anomaly detection, our goal is to distinguish between normal and anomalous samples. We use
classic machine learning and modern deep learning methods for anomaly detection (AD) as baselines
in this paper. Classic unsupervised methods include Isolation Forest (Liu et al., 2008), an
ensemble of binary decision trees that creates shorter paths for anomalies, One-class support vector
machines (OC-SVM) (Schölkopf et al., 2001), which encompasses the normal samples with the
smallest possible hypersphere, and Local Outlier Factor (Breunig et al., 2000), which predicts outliers
based on the local neighborhood of the sample. These methods are used as shallow unsupervised
baselines. Other supervised baselines used in this paper are Random Forest (Breiman, 2001) and
Gradient Boosting (Friedman, 2001). Our work evaluates RSRAE (Lai et al., 2020) for reconstruction
error method, DAGMM, SOM-DAGMM (Zong et al., 2018; Chen et al., 2021a) for density estimation
method, SOM (Kohonen, 1990) for competitive learning and NeutralAD (Qiu et al., 2021) and
LOE (Qiu et al., 2022) for contrastive learning approaches in unsupervised anomaly detection.
Comparing deep unsupervised methods based on distinguishing concepts helps us analyze the suitable
methods for auditing datasets. Further, the encoding of categorical attributes is another challenge in
unsupervised learning. In the context of auditing data, there is literature associated with the evaluation
of models (Nonnenmacher and Gómez, 2021; Gomes et al., 2021), but there is a lack of information
on the representation of categorical features in unsupervised learning scenarios. Recent use of
embeddings (Dahouda and Joe, 2021; Guo and Berkhahn, 2016; Karingula et al., 2021) encourages
us to use an embedding layer for encoding categorical features. We also use GEL (Golinko and Zhu, 2019)
encoding to mitigate issues due to the One Hot representation of the high-cardinality dataset. The
labels are available for the datasets used in this paper, which allows us to report AUC and F1 scores.
In an entirely unsupervised setting, this would not be possible, and we need to look for alternatives.
The choice of evaluation metric for anomaly detection is important as noted by (Alvarez et al., 2022),
so we discuss the difference between AUC and evaluating under thresholds using F1 scores.
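The dimensionality blow-up that One Hot encoding causes on high-cardinality data, compared with Label encoding, can be sketched as follows. This is a minimal illustration with made-up maker values, not the paper's preprocessing code:

```python
# Minimal sketch contrasting Label and One Hot encoding of a categorical
# column. The example values are illustrative, not from the VC dataset.

def label_encode(values):
    """Map each distinct category to a single integer code (one column total)."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

def one_hot_encode(values):
    """Map each category to a binary indicator vector (one column per category)."""
    categories = sorted(set(values))
    index = {v: i for i, v in enumerate(categories)}
    return [[1 if index[v] == j else 0 for j in range(len(categories))]
            for v in values]

makers = ["Volkswagen", "Ferrari", "Volkswagen", "BMW"]
labels = label_encode(makers)     # one column: [2, 1, 2, 0]
one_hot = one_hot_encode(makers)  # three columns, one per distinct maker
```

Label encoding keeps one column regardless of cardinality, while One Hot grows linearly with the number of unique values; for the VC dataset, with a total categorical cardinality of 1171, One Hot encoding adds 1171 binary columns.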
Our work solves the problem of a missing public dataset for auditing data where categorical attributes
are dominant and decisive in the classification process of normal and anomalous samples. We
demonstrate the performance of the synthetic dataset for various shallow and deep unsupervised
methods and show that it performs similarly to the existing datasets. We also observe that encoding
categorical attributes is an essential factor in the task of unsupervised anomaly detection, where the
goal is to embed the attributes into meaningful representations for the model.
2 Dataset
Table 1: Summary of datasets. Auditing data differ from the Credit Card and Arrhythmia
datasets in the number of categorical features.
Dataset Source Rows Num. features Cat. features Anomaly ratio
Car Insurance Kaggle 1000 16 17 0.25
Vehicle Insurance Oracle 15420 6 26 0.06
Vehicle Claims Ours 268255 5 13 0.21
Credit Card MLG-ULB 284807 30 0 0.001
KDD UCI 494021 34 7 0.2
Arrhythmia UCI 452 279 0 0.26
The principal component of a machine learning or deep learning model is data. Experimental data
is collected in a laboratory or controlled environment, where the researcher participates in the data
collection process and can alter the outcome by changing variables, whereas for observational
data the researcher is an observer of the data collection process. The latter type of data
is usually a byproduct of business activities and represents facts that have happened in the
past (Cunningham, 2020). Auditing data commonly belong to the category of observational data.
Prior knowledge of fraudulent insurance claims would therefore exist in the case of experimental
data and help in labeling the dataset. In our work, we evaluate unsupervised methods on both
experimental and observational data. The datasets used in our work have labels available, which
makes our task of evaluating the models easier. The observational data for insurance claims are
small and medium size datasets that are downloaded from (Kaggle, 2018; Oracle, 2017). The
experimental dataset is a large dataset generated from metadata of the DVM car dataset (Huang et al.,
2021). Although the availability of labels helps with the analysis of our models, the goal is to reduce
dependence on them for training purposes in future scenarios because of the economic expense
incurred while labeling the data. Using different datasets with various sizes and
features provides us with a platform to analyze the behavior of trained models as a function of the amount of data.
The arrhythmia and KDD datasets are used as benchmarks in unsupervised learning methods (Zong
et al., 2018),(Qiu et al., 2022) for anomaly detection and have fewer categorical attributes than those
found in auditing datasets. As observed in Table 1, auditing datasets have more categorical features,
which are important to understand the behavior of data points. Another important piece of information
in the above table is the anomaly ratio, i.e., the fraction of positive class samples (or anomalies)
in the complete dataset. We observe that it ranges from 0.001 to 0.26. It is an individual choice to consider
either positive or negative class as anomalous data, and in this paper, we consider positive labels as
anomalies.
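The anomaly ratio in Table 1 is simply the fraction of positive samples; a toy illustration with hypothetical labels:

```python
# Hypothetical binary labels; as in the paper, the positive class (1) is
# treated as anomalous.
labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]

# Anomaly ratio = number of positive samples / total samples.
anomaly_ratio = sum(labels) / len(labels)  # 0.2 for this toy list
```

The same computation on the datasets of Table 1 yields the reported range of 0.001 to 0.26.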
2.1 Vehicle Claims
There are multiple tabular datasets available for anomaly detection. However, the available
benchmark datasets, such as Credit Card fraud (MLG-ULB, 2008), KDD, Arrhythmia, and
Thyroid (Blake et al., 1998), are not suitable for drawing conclusions about the performance of models on auditing data.
Therefore, a new synthetic dataset is created with the domain knowledge about insurance claims of
vehicles for the auditing scenario. (Huang et al., 2021) released the DVM-Car dataset, which consists
of 1.4 million car images from various advertising platforms and six tables consisting of metadata
about the cars in the dataset. The new dataset, with 268,255 rows, is created from a Basic table
(mainly used for indexing the other tables, covering 1,011 generic models from 101 automakers), a Sales
table (sales of 773 car models from 2001 to 2020), a Price table (basic price data of historical
models), and a Trim table (0.33 million entries of trim-level information such as selling price, fuel
type, and engine size). This dataset is referred to as the Vehicle Claims (VC) dataset.
A fraudulent sample will have an insurance amount claimed that is higher than the average claim
amount for the respective case. For example, a tyre is replaced by the customer due to a puncture, but
the claim amount reflects the cost of replacing two tyres. This is a fraudulent claim and an anomaly.
Our idea is to model this information in the dataset with the following steps. Firstly, the categorical
features issue and issue_id are added, where issue_id is a subcategory of the issue column. Secondly,
the repair_complexity column is added based on the maker of the vehicle; common brands like
Volkswagen have complexity one, whereas Ferrari has four. Thirdly, repair_hours and repair_cost
are calculated based on issue, issue_id, and repair_complexity. Every tenth row in repair_cost
and every 20th row in repair_hours is an anomaly. Lastly, labels are added for the purpose of verification.
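The first two construction steps can be sketched as follows; the issue list, sub-category counts, and complexity map are illustrative assumptions, not the exact values used to build the VC dataset:

```python
import random

# Illustrative constants (assumptions, not the dataset's actual tables).
ISSUES = ["Warning Light", "Engine Issue", "Transmission Issue", "Electrical Issue"]
SUBCATEGORIES = {"Warning Light": 8, "Engine Issue": 4,
                 "Transmission Issue": 3, "Electrical Issue": 5}
COMPLEXITY = {"Volkswagen": 1, "Ferrari": 4}  # common makers low, exotic high

def add_claim_columns(row, rng):
    """Steps 1-2: randomly assign issue/issue_id, look up repair_complexity."""
    row["issue"] = rng.choice(ISSUES)
    row["issue_id"] = rng.randrange(SUBCATEGORIES[row["issue"]])
    # Default complexity of 2 for makers outside the map is an assumption.
    row["repair_complexity"] = COMPLEXITY.get(row["maker"], 2)
    return row

rng = random.Random(0)  # seeded for reproducibility
row = add_claim_columns({"maker": "Ferrari"}, rng)
```

In the actual dataset, issue_id is drawn from the sub-categories of the assigned issue, exactly as the bounded `randrange` call mimics here.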
There are 56,749 anomalous points in the VC dataset. One may argue about the importance of
having anomalous values in categorical features like door_num, Engin_size, etc., since for insurance
claims numerical features such as Price are more important. However, the categorical attributes help
to observe the amount of bias added to the model by low-importance features, and the categorical
anomalies will also be helpful for explainable anomaly detection models.
Issues are randomly assigned from a list of existing known issues about the vehicles. Issue_Id is a
subcategory of certain issues. Warning Light has 8 sub-categories because a warning light can mean
anything from low fuel or engine oil to a damaged braking system. Similarly, Engine Issue,
Transmission Issue, and Electrical Issue have 4, 3, and 5 sub-categories, respectively. The list of
issues contains the values as follows.
The repair_complexity column contains the complexity of repairing a vehicle, depending on
the availability of the workshop and the price of the vehicle. Common automotive makers like
Volkswagen have complexity 1, whereas Ferrari has complexity 3. The complexity of the models and
the respective makers are listed in Table 4.
The repair_hours attribute is the time required to repair the vehicle, calculated by multiplying
the repair_complexity of the vehicle with a predefined number of hours for the issue and issue_id.
The repair_cost is the sum of repair_hours times 20 plus a fraction of the price of the vehicle; we
have assumed the hourly rate of labour to be 20 units. The anomalies are introduced at every 10th
instance in repair_cost and lie between 3 and 6 standard deviations from the mean. Every 20th
row in the repair_hours attribute is an anomaly and lies between 3 and 4 standard deviations from the
mean. This ratio can be modified to suit the problem statement; if the goal is anomaly detection in an
imbalanced setting, the anomalies can be diluted to a ratio of 0.01. Table 3 lists the number of hours
and cost of repair for each Issue and Issue_Id.
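The repair_hours/repair_cost computation and the standard-deviation-based anomaly injection described above can be sketched as follows; the base hours per issue and the price fraction are assumed values for illustration:

```python
import statistics

HOURLY_RATE = 20       # assumed labour rate in units, as stated in the text
PRICE_FRACTION = 0.01  # hypothetical fraction of the vehicle price

def repair_hours(complexity, base_hours):
    """repair_hours = repair_complexity * predefined hours for the issue."""
    return complexity * base_hours

def repair_cost(hours, price):
    """repair_cost = repair_hours * 20 + fraction of the vehicle price."""
    return hours * HOURLY_RATE + PRICE_FRACTION * price

# Normal costs for complexities 1-4 with 2 base hours and a 20,000-unit price.
costs = [repair_cost(repair_hours(c, 2.0), 20000.0) for c in (1, 2, 3, 4)]
mu, sigma = statistics.mean(costs), statistics.stdev(costs)

def inject_cost_anomaly(value, k=4.5):
    """Shift a repair_cost value by k standard deviations, with 3 <= k <= 6."""
    assert 3 <= k <= 6
    return value + k * sigma

anomalous = inject_cost_anomaly(costs[0])
```

In the dataset this shift is applied to every 10th repair_cost row (with k between 3 and 6) and to every 20th repair_hours row (with k between 3 and 4).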
Missing values in the table are replaced by anomalous values. Our idea was to introduce categorical
anomalies in the dataset. In the real world, incorrect information filled into healthcare or insurance
forms often makes it difficult to find the reason for an anomalous sample. We believe replacing these
missing values with anomalous values will be useful for explainable models to determine the cell
values due to which a data point behaves as an anomaly. The missing values in each attribute of the
original data are replaced as follows.
Color: Gelb, Reg_year: 3010, Body_type: Wood, Door_num: 0, Engin_size: 999.0L, Gearbox:
Hybrid, Fuel_type: Hydrogen
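The replacement of missing values with the anomalous sentinels listed above can be sketched as a simple mapping; the attribute names and sentinel values follow the text, while the helper function is hypothetical:

```python
# Anomalous sentinel values per attribute, as listed in the text.
ANOMALOUS_FILL = {
    "Color": "Gelb",
    "Reg_year": 3010,
    "Body_type": "Wood",
    "Door_num": 0,
    "Engin_size": "999.0L",
    "Gearbox": "Hybrid",
    "Fuel_type": "Hydrogen",
}

def fill_missing(row):
    """Replace None entries with the anomalous sentinel for that attribute."""
    return {k: (ANOMALOUS_FILL[k] if v is None else v) for k, v in row.items()}

row = fill_missing({"Color": None, "Door_num": 4})
```

Each sentinel is chosen to be implausible for its attribute (a registration year of 3010, a wooden body type), so an explainable model can point at the offending cell.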
There are two other attributes, breakdown_date and repair_date, that are not used in our evaluation;
they contain the date of breakdown and the date of repair of the vehicle. The anomalous points in
these attributes are those with a large difference between the two dates: if the number of hours required
to repair is 9, the vehicle should be returned in 2 days, considering an 8-hour work day.
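The plausibility check relating repair_hours to the gap between breakdown_date and repair_date can be sketched as follows, assuming an 8-hour work day as in the text (the zero-slack threshold is an assumption for illustration):

```python
import math

def expected_return_days(repair_hours, hours_per_day=8):
    """Days a repair should take under an 8-hour work day."""
    return math.ceil(repair_hours / hours_per_day)

def is_date_anomaly(actual_days, repair_hours):
    """Flag claims where the date gap exceeds the expected repair duration."""
    return actual_days > expected_return_days(repair_hours)

# 9 hours of work should be finished within 2 work days, as in the example.
```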
We use three insurance claims datasets to evaluate unsupervised models for auditing data. The
characteristics of the VC dataset are modeled on the real-world problem of fraudulent claims. Our
dataset consists of the essential features that are decisive in distinguishing normal and anomalous
samples. The number of anomalies in our dataset can be adjusted to suit the training
environment. The total cardinality of the categorical attributes in the dataset is 1171: since it is a large
dataset, each categorical attribute contains more unique values than in the CI and VI datasets. We will
observe that the VC dataset suffers from the curse of dimensionality.
3 Data splitting and Evaluation metrics
In this section, we briefly describe the training and evaluation strategy of our work. We select models
from different categories of unsupervised learning approaches and observe the performance of our
dataset. (Ruff et al., 2021) categorizes anomaly detection methods into three topics. First, density
estimation and probabilistic methods, which predict anomalies by estimating the distribution of
normal data; DAGMM (Zong et al., 2018) and SOM-DAGMM (Chen et al., 2021a) belong to the energy-
based models under this category. Second, one-class classification approaches, which
aim to learn a decision boundary around the normal data distribution. Third, reconstruction models, which predict
anomalies based on a high reconstruction error on test data; RSRAE (Lai et al., 2020) belongs to the
category of deep autoencoders. Another deep learning variant is Self-Organizing Maps, which are
trained by competitive learning instead of backpropagation learning. SOM has been recently involved
in the field of intrusion detection (Chen et al., 2021a;b) and fraud detection (Mongwe and Malan,
2020). Since SOM (Kohonen, 1990) is used in SOM-DAGMM (Chen et al., 2021a) to improve the
topological structure of the latent representation of the data; our aim is to investigate the results using
a simple SOM. In the original papers, DAGMM and SOM are trained on a network intrusion dataset
(KDD) containing more numerical than categorical features, and RSRAE is the state-of-the-art
model on image (Fashion MNIST) and document (20NewsGroups) datasets in the field of unsupervised
anomaly detection. LOE trains the model with contaminated data, which is suitable for our approach.
The code and dataset are available on Github1.
In the literature for unsupervised anomaly detection, different data splits are used to train and evaluate
models. We need to decide whether to use only normal data or data with anomalous samples for
training. In the latest work, (Qiu et al., 2022) provides the training framework where the model is
1https://github.com/ajaychawda58/UADAD