the minimization of a Loss Function, which is achieved by using the “Gradient Descent Optimization Algorithm”.
Since Logistic Regression first predicts the probability of belonging to the positive class, it creates a
linear decision boundary (based on a threshold set by the user), separating the two classes from one another. This
decision boundary can be represented as a conditional probability [5]. Implementation of the Random Forest
algorithm involves the construction of several decision trees during the training stage [6], and the predictions emanating from
these trees are averaged to arrive at a final prediction. Since the algorithm uses an average of results to make
the final prediction, the Random Forest algorithm is referred to as an ensemble technique. Decision Trees are
designed to optimally split the considered dataset into smaller and smaller subsets, in order to predict the targeted
value [7, 8]. Some of the criteria used to calculate the purity or impurity of a node include Entropy
and Gini Impurity. In summary, a decision tree evaluates candidate splits on all the attributes present in the data, and then
chooses the split with the highest Information Gain; the Classification and Regression Trees (CART) model is the
decision tree variant used in this paper.
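As an illustration of the splitting criteria mentioned above, the following sketch computes the Entropy and Gini Impurity of a node and the Information Gain of a candidate split; the labels and the split shown are purely illustrative and are not taken from the Lending Club data.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2 over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy of a node: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right, criterion=entropy):
    """Reduction in impurity obtained by splitting a node into two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * criterion(left) + (len(right) / n) * criterion(right)
    return criterion(parent) - weighted_children

# Illustrative labels only: 0 = fully paid, 1 = defaulted.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 0])
left, right = parent[:4], parent[4:]   # one candidate split
print(gini_impurity(parent), information_gain(parent, left, right))
```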
The firm-based model uses the value of a firm to represent the event of default, with the default event being
represented by the boundary conditions of the process and the dynamics of the firm value [9]. In particular, we
refer to two well-established models. Firstly, we mention the Merton model, based on the seminal paper of
Black and Scholes [10], which is used to calculate the default probability of a reference entity. In this context,
the joint density function for the first hitting times is determined [11]. Secondly, we have the Black-Cox model,
which addresses some of the disadvantages of the Merton model [10]. In order to hedge against credit risk, the
usage of credit derivatives is a customary approach [12, 13, 14], with Credit Default Swaps (CDS) being the most
common choice of credit derivative. Contracts of this type enable the buyer of the CDS to transfer the credit risk
of a reference entity (“Loans” in this case constitute the reference entity) to the seller of the protection, until the
credit has been settled. In return, the protection buyer pays premiums (predetermined payments) to the protection
seller, which continue until the maturity of the CDS or a default, whichever occurs earlier [10]. The interested reader
may refer to [10] for the formula of the CDS spread per annum, to be used later in this paper. Another widely used
credit derivative, albeit more sophisticated than the CDS, is the Collateralized Debt Obligation (CDO), which is
a structured product based on tranches [14]. CDOs can further be classified into cash, synthetic and hybrid CDOs. The
interested reader may refer to [14] for a detailed presentation on the pricing of synthetic CDOs.
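As a point of reference for the firm-value approach, the sketch below implements the standard textbook Merton default probability, in which default occurs if the firm value at maturity falls below the face value of debt; the parametrisation and the numerical values are illustrative and are not necessarily those used in [10] or [11].

```python
from math import log, sqrt
from scipy.stats import norm

def merton_default_probability(V0, D, r, sigma_V, T):
    """Risk-neutral probability that the firm value falls below the face value of
    debt D at maturity T, in the standard Merton setting where the firm value
    follows a geometric Brownian motion with volatility sigma_V."""
    d2 = (log(V0 / D) + (r - 0.5 * sigma_V ** 2) * T) / (sigma_V * sqrt(T))
    return norm.cdf(-d2)

# Illustrative parameter values, not taken from the paper or its references.
print(merton_default_probability(V0=120.0, D=100.0, r=0.03, sigma_V=0.25, T=1.0))
```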
2 METHODOLOGY
The goal of this exercise is the estimation of a classification model on the data considered (from the
peer-to-peer lending company Lending Club), in order to predict whether an applicant (whose details are
contained in the considered database) is likely to default on the loan or not. To this end, it is necessary
to identify and understand the essential variables, taking into account the summary statistics in conjunction
with data visualization. The dataset used for the estimation of PD, EAD, LGD and EL was obtained from Kaggle
[15]. The data contains details of all the applicants who had applied for a loan at Lending Club. Separate
files were obtained for the accepted and the rejected loans. Only the “accepted loans” file was used, since the
observations of interest are the applicants who ultimately repaid the loan and those who defaulted on it. The
data involves the details of the applicants for the loan, such as the FICO score, loan amount, interest rate and purpose of
the loan. Machine Learning (ML) algorithms were applied to this data to predict the PD, after data exploration,
data pre-processing and feature cleaning.
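A minimal sketch of the data-loading and labelling step is given below, assuming the Kaggle “accepted loans” file [15] and its usual column names (loan_status, fico_range_low, loan_amnt, int_rate, purpose); the exact file name and column names may differ from the version used in this paper.

```python
import pandas as pd

# File and column names are assumptions about the Kaggle Lending Club dump [15];
# adjust them to match the actual "accepted loans" file.
loans = pd.read_csv("accepted_2007_to_2018Q4.csv", low_memory=False)

# Keep only loans with a resolved outcome and derive a binary default flag
# (1 = charged off / defaulted, 0 = fully paid), as required for PD modelling.
resolved = loans[loans["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()
resolved["default"] = (resolved["loan_status"] == "Charged Off").astype(int)

# A few of the applicant details mentioned in the text.
features = ["fico_range_low", "loan_amnt", "int_rate", "purpose"]
print(resolved[features + ["default"]].head())
```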
Feature selection was performed on the considered data, bearing in mind the goals of predictive modelling.
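One possible way to screen features in this spirit is sketched below; the missingness threshold, the importance-based ranking and the helper function are illustrative choices and do not represent the selection procedure actually used in this paper.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def screen_features(df, target="default", max_missing=0.3, top_k=20):
    """Drop columns with too many missing values, then rank the remaining
    numeric features by Random Forest importance (an illustrative heuristic)."""
    keep = df.columns[df.isna().mean() <= max_missing]
    numeric = df[keep].select_dtypes(include="number").drop(columns=[target])
    X = numeric.fillna(numeric.median())
    y = df[target]
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False).head(top_k)

# Example call on the prepared loan data from the previous sketch:
# print(screen_features(resolved))
```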