Bank Loan Default Prediction

Tianhao Wu, Mar 14

Introduction

Loans are usually one of the most important products of a bank.

Acting as a provider of loans is one of the main activities of financial institutions such as banks.

Practically, the funds of a bank are mainly used for lending activities.

On the other hand, a bank faces a substantial loss when a loan goes into default.

Therefore, banks pay close attention to detecting and predicting default behavior among their customers.

(1). Terminology explanation

(a). Bank loan

A bank loan is the lending of money by a bank to individuals, organizations, corporations, etc.

The recipient (i.e., the borrower) incurs a debt and is usually liable to pay interest on that debt until it is repaid, as well as to repay the principal amount borrowed.

Bank loans are good for financing investment in fixed assets (such as plant & machinery, land, and buildings).

The interest rate can be either fixed or variable.

(b). Default

The term default means failing to meet the legal obligations (or conditions) of a loan, for example when a home buyer fails to make a mortgage payment, or when a corporation or government fails to pay a bond that has reached maturity.

(2). Project objective

To prevent loans from going into default, banks need to be able to make predictions based on customers’ behavior.

Machine learning models appear to be one of the most effective solutions for predicting loan defaults.

Therefore, the objective of this project is to build supervised models for loan default prediction and to explore how customer behavioral factors affect those predictions.

(3). Project Workflow

Data Preprocessing

1. Understanding the dataset

The data set consists of eight tables related to the clients and their accounts.

The column “status” in the table “loan” is the target variable, which shows the current status of a loan.

It is classified into four categories:

A. Contract completed, loan paid and closed
B. Contract completed, loan not paid
C. Running contract, customer making regular payments
D. Running contract, customer in debt

Each account has both static characteristics (e.g., the date of creation, the address of the branch) given in “account” and dynamic characteristics (e.g., payments debited or credited, balances) given in “permanent order” and “transaction”.

One client can have several accounts, and several clients can operate a single account; clients and accounts are linked through the relation “disposition”.

“loan” and “credit card” describe services that the bank offers to its clients; several credit cards can be issued to an account, but at most one loan can be granted per account.

“demographic data” gives some publicly available information about the districts (e.g., the unemployment rate); additional information about the clients can be deduced from this.

Loading data into MySQL database

These eight tables are loaded into a MySQL database separately.

The data in these tables are not clean enough for modeling.

Almost none of the date columns are in the correct format.

Some columns contain unnecessary punctuation.

For further exploration, we need to cleanse the data.
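To make the kind of cleansing described here concrete, the following is a minimal pandas sketch. It is not the cleansing performed for this project: the column names, the six-digit `YYMMDD` date encoding, and the stray quotation marks are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical raw rows: dates stored as YYMMDD integers and text values
# wrapped in stray quotation marks (both are assumptions for this sketch).
raw = pd.DataFrame({
    "account_id": [1, 2, 3],
    "date": [930705, 940211, 971208],
    "frequency": ['"monthly"', '"weekly"', '"monthly"'],
})

clean = raw.copy()
# Parse the six-digit value into a proper datetime column.
clean["date"] = pd.to_datetime(clean["date"].astype(str), format="%y%m%d")
# Remove the surrounding quotation marks from the text column.
clean["frequency"] = clean["frequency"].str.strip('"')

print(clean)
```

Doing the same work in SQL is equally reasonable; the point is only that dates become real date types and text values lose their extra punctuation before any modeling starts.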

Data Exploration

After cleansing the data in MySQL Workbench, we use Python to connect to the MySQL server and load the data into Pandas DataFrames for exploration and visualization.

Because the objective is to predict default, the loan table, which contains the loan status, should be the main table.

Therefore, we need to join all the other tables to the loan table on the common account IDs.

We then explore the whole data set to see how loan status relates to the other data.
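The join onto the loan table can be sketched as follows. The miniature tables and column names are hypothetical. Summarising the transaction table per account before joining is an assumption of this sketch rather than a step stated above; it is one way to keep a single row per loan when a table holds many rows per account.

```python
import pandas as pd

# Hypothetical miniature tables; all column names are assumptions.
loan = pd.DataFrame({
    "loan_id": [10, 11, 12],
    "account_id": [1, 2, 3],
    "amount": [80000, 30000, 120000],
    "status": ["A", "B", "C"],
})
account = pd.DataFrame({
    "account_id": [1, 2, 3, 4],
    "district_id": [5, 5, 7, 9],
})
trans = pd.DataFrame({
    "account_id": [1, 1, 2, 3, 3, 3],
    "amount": [500, 700, 200, 900, 100, 300],
    "balance": [1500, 2200, 400, 5000, 4900, 5200],
})

# Summarise the one-to-many table per account first so that the join
# does not duplicate loan rows.
trans_summary = (
    trans.groupby("account_id")
    .agg(avg_trans_amount=("amount", "mean"), min_balance=("balance", "min"))
    .reset_index()
)

# Left joins from the loan table keep exactly one row per loan.
merged = (
    loan.merge(account, on="account_id", how="left")
        .merge(trans_summary, on="account_id", how="left")
)
print(merged)
```

Accounts without a loan (account 4 in this toy data) drop out because the loan table is on the left side of every join.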

Labeling the dataset

Before making the comparison, we need to verify which classes appear in loan status.

The distribution of loan amount for each status class is:

[Figure: Loan Amount Distribution (Status Level)]

Where:

“A” stands for finished contracts, no problems.
“B” stands for finished contracts, loan not paid.
“C” stands for running contracts, OK so far.
“D” stands for running contracts, clients in debt.

Rather than a multiclass supervised model, a binary classification model is better suited to predicting whether or not a loan will default.

Accordingly, we label the classes “A” and “C” as “0”, meaning the loan has not defaulted, and the classes “B” and “D” as “1”, meaning the loan has defaulted.
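The relabeling can be sketched in a few lines of pandas. Only the A/B/C/D coding comes from the text; the sample rows and the column name `default` are made up for illustration.

```python
import pandas as pd

# Made-up status values; only the A/B/C/D coding comes from the text.
loans = pd.DataFrame({"status": ["A", "C", "B", "A", "D", "C", "C", "A"]})

# "A" and "C" are good loans (0); "B" and "D" are defaulted loans (1).
status_to_label = {"A": 0, "C": 0, "B": 1, "D": 1}
loans["default"] = loans["status"].map(status_to_label)

# The mean of a 0/1 label is the default rate of the sample.
default_rate = loans["default"].mean()
print(f"default rate: {default_rate:.2%}")
```

Because the label is 0 or 1, its mean is exactly the fraction of defaulted loans, which is how a figure such as a data set's overall default rate can be computed.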

The default rate of the dataset is around 11.14%:

[Figure: Loan Status Proportion]

With the labels converted, the next step is to examine how the other variables relate to loan status.

Variables exploration

(1). Loan

(a). Loan Monthly Payments vs. Loan Amount

We plot the distribution of monthly loan payments against loan amount for each status.

There is a large difference between good and defaulted loans, so these two columns could be strong predictors for a machine learning model.

[Figure: Monthly Loan Payment vs. Loan Amount]

(b). Approved Year vs. Loan Amount

We compare the approval year of loans with the loan amount and split each year into two sections, one per status.

The numbers of good and defaulted loans differ considerably in every year except 1996.

The default rate trends downward from 1993 to 1998, although it is roughly flat between 1994 and 1997.

[Figure: Approved Year vs. Loan Amount (Status Level)]

(c). Loan Duration vs. Loan Amount

As in the previous plot, we compare loan duration with loan amount.

Loans with a 12-month duration have the lowest default rate (around 8.40%), whereas loans with a 24-month duration have the highest (around 12.32%).
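A per-duration default rate like the ones quoted here can be computed with a simple group-by. The rows below are invented and do not reproduce the reported percentages; the column names are assumptions.

```python
import pandas as pd

# Invented loans for illustration; column names are assumptions.
loans = pd.DataFrame({
    "duration": [12, 12, 12, 12, 24, 24, 24, 24, 36, 36],
    "default":  [0,  0,  0,  1,  0,  1,  1,  0,  0,  0],
})

# Count loans and defaults per duration; the mean of the 0/1 label
# is the default rate within each group.
rate_by_duration = (
    loans.groupby("duration")["default"]
    .agg(n_loans="count", n_defaults="sum", default_rate="mean")
)
print(rate_by_duration)
```

Keeping the loan count next to the rate helps when reading the result, since a rate computed from only a handful of loans is far less reliable than one computed from hundreds.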

[Figure: Loan Duration vs. Loan Amount (Status Level)]

(2). Gender

Gender can sometimes be useful when making predictions, so we plot the gender distribution of good and defaulted loans as well as the default proportion for males and females.

Females hold slightly more loans than males, while the share of defaulted loans held by females is significantly lower than that held by males.

However, the default rate within each gender is similar (about 13.37% for males and about 11.73% for females).

It therefore seems that gender might not be very helpful for default prediction.
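Two different proportions are being compared in this passage: each gender's share of all defaulted loans, and the default rate within each gender. A small sketch with invented data shows how the two are computed and why they can differ; the column names are assumptions.

```python
import pandas as pd

# Invented data: 10 loans held by females (1 default), 8 by males (2 defaults).
loans = pd.DataFrame({
    "gender":  ["F"] * 10 + ["M"] * 8,
    "default": [1] + [0] * 9 + [1, 1] + [0] * 6,
})

# Default rate within each gender: defaults / loans held by that gender.
rate_within_gender = loans.groupby("gender")["default"].mean()

# Share of all defaulted loans that belongs to each gender.
share_of_defaults = (
    loans.loc[loans["default"] == 1, "gender"].value_counts(normalize=True)
)

print(rate_within_gender)
print(share_of_defaults)
```

For judging whether gender helps a classifier, the within-gender default rate is the more relevant of the two, because it is the quantity a model would use to separate the classes.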

(a). Loan Amount vs. Default Amount

[Figure: Loan Amount vs. Default Amount (Gender)]

(b). Default Proportion

[Figure: Default Proportion (Gender)]

(3). Order Amount

Order amount is the amount of the permanent orders made from each debit account.

It reflects how active an account is.

There is a big difference between good and defaulted loans in order amount, so it might be a good predictor.

[Figure: Order Amount]

(4). Transaction Amount vs. Transaction Balance

The transaction table records most of the activity on an account.

It may therefore hold the most important information for default prediction.

We take two columns from the transaction table and plot their joint distribution for defaulted and non-defaulted loans.

The transaction amount column records the amount of each transaction, while the transaction balance is the account balance after the transaction.

One area of the default heat map contains no instances, which suggests that a loan is unlikely to default when its transaction amount and balance fall within that range.

Consequently, these two variables could be very useful.
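The idea of an empty area in the heat map can be sketched with a two-dimensional histogram. The values and bin edges below are invented; this illustrates the technique and does not reconstruct the plot.

```python
import numpy as np

# Invented (transaction amount, balance after transaction) pairs
# for transactions on defaulted loans.
amount = np.array([100, 200, 250, 4000, 4200, 300])
balance = np.array([500, 800, 600, 9000, 9500, 700])

# Bin edges are arbitrary choices for this example.
amount_edges = np.array([0, 1000, 5000])
balance_edges = np.array([0, 1000, 10000])

# counts[i, j] = number of points in amount bin i and balance bin j.
counts, _, _ = np.histogram2d(amount, balance, bins=[amount_edges, balance_edges])

# Cells with a zero count correspond to the empty areas of a heat map.
empty_cells = np.argwhere(counts == 0)
print(counts)
print(empty_cells)
```

An empty cell is only suggestive: with few observations or very fine bins, cells can be empty by chance, so it is worth checking how many non-defaulted transactions fall into the same cell before relying on it.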

[Figure: Transaction Amount vs. Transaction Balance]

(5). Geography

Statistically, geographical data often tell a story.

It is therefore worth exploring the geographical data.

As an illustration, we plot the default rate by region and by district.

(a). Region

Three major default-rate ranges can be found at the region level.

“north Bohemia” has the lowest rate (approximately 1.64%), with a large gap to the second lowest.

[Figure: Default Rate (Region)]

(b). District

The default rate varies even more widely between districts than between regions, and some districts have a default rate of 0.

[Figure: Default Rate (District)]

(6). Demographic

Demographic data also matter in analysis and prediction.

Some demographic variables can be strong predictors of default.

As an illustration, we take a few logically related columns from the demographic table and plot their distributions for both good and defaulted loans.

(a).