collection process and can alter the outcome by changing variables, whereas in observational data the researcher is merely an observer of the data collection process. The latter type of data is usually a byproduct of business activities and represents events that have already occurred (Cunningham, 2020). Auditing data commonly belong to the category of observational data.
Therefore, prior knowledge of fraudulent insurance claims would exist in the case of experimental data and would help in labeling the dataset. In our work, we evaluate unsupervised methods on both
experimental and observational data. The datasets used in our work have labels available, which
makes our task of evaluating the models easier. The observational data for insurance claims are small and medium-sized datasets downloaded from (Kaggle, 2018; Oracle, 2017). The
experimental dataset is a large dataset generated from metadata of the DVM car dataset (Huang et al.,
2021). Although the availability of labels helps with the analysis of our models, the goal is to reduce
dependence on them for training purposes in future scenarios because of the economic expense
incurred while labeling the data. Using datasets of different sizes and feature sets provides us with a platform to analyze how the behavior of trained models depends on the amount of available data.
The Arrhythmia and KDD datasets are used as benchmarks for unsupervised anomaly detection methods (Zong et al., 2018; Qiu et al., 2022) and have fewer categorical attributes than those
found in auditing datasets. As observed in Table 1, auditing datasets have more categorical features,
which are important for understanding the behavior of data points. Another important piece of information in Table 1 is the anomaly ratio, i.e., the fraction of positive-class samples (anomalies) in the complete dataset, which ranges from 0.001 to 0.26. Whether the positive or the negative class is treated as anomalous is a matter of convention; in this paper, we consider positive labels as anomalies.
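As a minimal illustration, and assuming the labels are stored in a binary column named label (an assumed name, not necessarily the one used in the released files), the anomaly ratio reported in Table 1 can be computed as follows:

```python
import pandas as pd

# Minimal sketch: compute the anomaly ratio of a labelled tabular dataset.
# The file name and the binary column name "label" are illustrative
# assumptions; positive labels (1) are treated as anomalies.
df = pd.read_csv("vehicle_claims.csv")        # hypothetical file
anomaly_ratio = (df["label"] == 1).mean()     # positives / total samples
print(f"anomaly ratio: {anomaly_ratio:.3f}")  # between 0.001 and 0.26 for our datasets
```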
2.1 Vehicle Claims
There are multiple tabular datasets available for anomaly detection. However, the common benchmark datasets, Credit Card Fraud (MLG-ULB, 2008), KDD, Arrhythmia, and Thyroid (Blake et al., 1998), are not suitable for assessing the performance of models on auditing data.
Therefore, a new synthetic dataset is created for the auditing scenario using domain knowledge about vehicle insurance claims. Huang et al. (2021) released the DVM-Car dataset, which consists of 1.4 million car images from various advertising platforms and six tables of metadata about the cars. The new dataset, with 268,255 rows, is created from four of these tables: the Basic table, used mainly for indexing the other tables, which includes 1,011 generic models from 101 automakers; the Sales table, with sales of 773 car models from 2001 to 2020; the Price table, with basic price data of historical models; and the Trim table, with 0.33 million rows of trim-level information such as selling price, fuel type, and engine size. This dataset is referred to as the Vehicle Claims (VC) dataset.
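A minimal sketch of how such a table could be assembled from the DVM-Car metadata is given below; the file names, column names, and join key are assumptions for illustration and may differ from the released CSVs.

```python
import pandas as pd

# Hedged sketch: assemble the Vehicle Claims (VC) table from the DVM-Car
# metadata tables. File names, column names, and the join key ("Genmodel_ID")
# are assumptions; the released CSVs may differ.
basic = pd.read_csv("Basic_table.csv")  # 1,011 generic models from 101 automakers
sales = pd.read_csv("Sales_table.csv")  # sales of 773 car models, 2001-2020
price = pd.read_csv("Price_table.csv")  # basic price data of historical models
trim  = pd.read_csv("Trim_table.csv")   # ~0.33M rows: selling price, fuel type, engine size

# The Basic table mainly indexes the other tables, so it supplies the join key.
vc = (trim.merge(basic, on="Genmodel_ID", how="left", suffixes=("", "_basic"))
          .merge(price, on="Genmodel_ID", how="left", suffixes=("", "_price"))
          .merge(sales, on="Genmodel_ID", how="left", suffixes=("", "_sales")))

# After joining and filtering, the VC dataset used in this work has 268,255 rows.
print(vc.shape)
```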
A fraudulent sample will have an insurance amount claimed that is higher than the average claim
amount for the respective case. For example, a tyre is replaced by the customer due to a puncture, but
the claim amount reflects the cost of replacing two tyres. This is a fraudulent claim and an anomaly.
Our idea is to model this information in the dataset with the following steps. Firstly, the categorical features issue and issue_id are added, where issue_id is a subcategory of the issue column. Secondly, the repair_complexity column is added based on the maker of the vehicle; common brands like Volkswagen have complexity one, whereas Ferrari has four. Thirdly, repair_hours and repair_cost are calculated based on issue, issue_id, and repair_complexity. Every 10th row in repair_cost and every 20th row in repair_hours is an anomaly. Lastly, labels are added for the purpose of verification.
There are 56,749 anomalous points in the VC dataset. One may argue that, for insurance claims, numerical features such as Price are more important than anomalous values in categorical features like door_num and Engin_size. However, the categorical attributes help us observe how much bias low-importance features add to the model, and categorical anomalies are also helpful for explainable anomaly detection models.
Issues are randomly assigned from a list of existing known issues about the vehicles. Issue_Id is a subcategory of certain issues. Warning Light has 8 sub-categories because a warning light can mean anything ranging from low fuel or engine oil to a damaged braking system. Similarly, Engine Issue, Transmission Issue, and Electrical Issue have 4, 3, and 5 sub-categories. The list of issues contains the values as follows.
The repair_complexity column contains the complexity of repairing a vehicle depending upon the availability of the workshop and the price of the vehicle. Common automotive makers like