P2P Lending Platform Data Analysis: Exploratory Data Analysis in R, Part 2
Finding how P2P data Features are Linked to Loan Quality
Lorna Yen
Jan 4
In the previous post, we found that Prosper adopted a better credit risk metric, the Prosper Rating, for its lending platform.
Knowing that the Prosper Rating is composed of the Prosper Score and the Credit Bureau Score, we also found that the Prosper Score plays a key role: it is what makes the Prosper Rating more discriminating than the Credit Bureau Score alone.
In this post, I will investigate the relationship between Prosper Score and other features in this loan data to see how these features are linked to Prosper Score, and how well they discriminate between completed and high-risk loans on this P2P lending platform.
First Step: a Correlation Matrix
This data set is huge, with 81 variables, so to keep this article from getting too long I won’t search for potentially useful features one by one.
Instead, I will start with a correlation matrix drawn with ggcorrplot to get a quick overview of the linear relationships between features before exploring more deeply.
With a correlation matrix we can focus on the variable of interest, Prosper Score, and see which other variables correlate with it.
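As a rough sketch of this step (not the exact code behind the plot), the correlation matrix can be computed with base R's `cor()` and then handed to ggcorrplot. The `loans` data frame below is a small invented stand-in for the Prosper data, and the choice of pairwise-complete observations is an assumption about how missing values are handled.

```r
# Toy stand-in for the Prosper loan data; the values are invented
# purely so the snippet runs on its own.
loans <- data.frame(
  ProsperScore        = c(2, 4, 5, 7, 8, 10, 6, 3),
  BorrowerRate        = c(0.31, 0.26, 0.22, 0.15, 0.12, 0.07, 0.18, NA),
  DebtToIncomeRatio   = c(0.60, 0.45, 0.40, 0.25, 0.20, 0.10, NA, 0.50),
  StatedMonthlyIncome = c(2500, 3000, 4200, 5600, 7000, 9800, 5000, 2800)
)

# Keep numeric columns only, then use pairwise-complete observations so
# that a missing value in one column does not discard the whole row.
numeric_cols <- loans[sapply(loans, is.numeric)]
corr <- cor(numeric_cols, use = "pairwise.complete.obs")

# Correlations of every other feature with the variable of interest,
# sorted by absolute strength.
score_corr <- corr["ProsperScore", colnames(corr) != "ProsperScore"]
score_corr <- score_corr[order(abs(score_corr), decreasing = TRUE)]
print(round(score_corr, 2))

# Optional plot object, built only if the ggcorrplot package is installed.
if (requireNamespace("ggcorrplot", quietly = TRUE)) {
  corr_plot <- ggcorrplot::ggcorrplot(corr, type = "lower", lab = TRUE)
}
```

Sorting one row of the matrix by absolute value is a quick way to shortlist candidate features before looking at the full plot.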
We can see that Prosper Rating has a strong negative relationship with Borrower Rate, which makes sense because Borrower Rate is mainly determined by Prosper Rating.
Besides, Prosper Rating has a stronger relationship with Prosper Score, with a correlation coefficient of 0.8, than Credit Score Average’s 0.
From a linear perspective, this again suggests that Prosper may put more weight on Prosper Score than on Credit Score Average when modeling Prosper Rating.
Furthermore, Prosper Score shows a weak linear relationship with ScorexChangeAtTimeOfListing, InquiriesLast6Months, BankcardUtilization, DebtToIncomeRatio and StatedMonthlyIncome, with correlation coefficients ranging from ±0.2 to ±0.
Although these features show only a weak linear relationship with Prosper Score, I think they may still carry some information.
Using the correlation matrix as a reference, I will pick these five features and draw scatter plots to investigate the trends between them and Prosper Score further.
My investigation through plots has two parts: one is to see how these features are linked to Prosper Score; the other is to see whether these features can discriminate between completed and high-risk loans.
Score X Change At Time Of Listing
ScorexChangeAtTimeOfListing measures the change in the borrower’s credit score at the time the credit profile was pulled, relative to the borrower’s last Prosper loan.
Because it measures a change in credit score, its value can be positive or negative.
The graph above shows a clear linear trend between Prosper Score and ScorexChangeAtTimeOfListing: ScorexChangeAtTimeOfListing rises with each one-point increase in Prosper Score.
I infer that Prosper treats ScorexChangeAtTimeOfListing as an important signal and uses it in roughly linear increments to model Prosper Score.
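One way to check a linear trend like this numerically is to average the feature at each Prosper Score and fit a least-squares line. This is only a sketch on invented rows, not Prosper's actual model; the slope simply estimates how much ScorexChangeAtTimeOfListing moves per point of Prosper Score.

```r
# Invented toy rows standing in for the real loan data.
loans <- data.frame(
  ProsperScore = c(1, 1, 3, 3, 5, 5, 7, 7, 9, 9),
  ScorexChangeAtTimeOfListing = c(-60, -40, -35, -15, -10, 10, 15, 35, 40, 60)
)

# Average credit-score change at each Prosper Score level.
mean_change <- tapply(loans$ScorexChangeAtTimeOfListing,
                      loans$ProsperScore, mean, na.rm = TRUE)
print(mean_change)

# Least-squares line: the slope is the average change in
# ScorexChangeAtTimeOfListing per one-point increase in Prosper Score.
fit <- lm(ScorexChangeAtTimeOfListing ~ ProsperScore, data = loans)
slope <- unname(coef(fit)["ProsperScore"])
print(slope)
```

If the per-score means fall close to the fitted line, a linear description of the relationship is reasonable; large systematic gaps would point to a curved trend instead.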
How is it linked to loan status?
Grouping the loans by status across Prosper Score and ScorexChangeAtTimeOfListing, we can see a clear trend.
High-risk loans tend to have lower Prosper Scores of 1 to 6 and lower ScorexChangeAtTimeOfListing values, ranging from -100 to 50.
Completed loans tend to have higher Prosper Scores, ranging from 6 to 10, and higher ScorexChangeAtTimeOfListing values, ranging from -50 to 100.
It seems that Prosper Score can discriminate between completed and high-risk loans at different levels of ScorexChangeAtTimeOfListing.
This result is interesting, because ScorexChangeAtTimeOfListing measures the borrower’s credit score relative to the borrower’s last Prosper loan, which means the metric takes into account not only the borrower’s credit score but also the borrower’s history in Prosper’s data.
I was impressed that Prosper makes the most of both credit score and its own data, and that the combination does show such a trend between completed and high-risk loans.
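The grouped comparison described above can be approximated with a per-group summary. The sketch below assumes a hypothetical `LoanGroup` column that already labels each loan as "Completed" or "HighRisk"; how those labels are derived from the raw loan status is not shown here, and the rows are invented.

```r
# Invented rows; LoanGroup is a hypothetical, pre-computed label.
loans <- data.frame(
  LoanGroup    = c("HighRisk", "HighRisk", "HighRisk",
                   "Completed", "Completed", "Completed"),
  ProsperScore = c(2, 4, 5, 6, 8, 9),
  ScorexChangeAtTimeOfListing = c(-80, -20, 10, -10, 40, 90)
)

# Median Prosper Score and median score change within each loan group.
group_summary <- aggregate(
  cbind(ProsperScore, ScorexChangeAtTimeOfListing) ~ LoanGroup,
  data = loans,
  FUN = median
)
print(group_summary)
```

Medians are used rather than means so that a few extreme score changes do not dominate the comparison.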
Inquiries Last 6 Months
This variable counts the number of inquiries in the past six months at the time the credit profile was pulled.
An inquiry is a request by an institution for credit information from a credit agency.
In general, a high number of inquiries in a short period usually occurs when someone frequently applies for credit accounts, such as credit cards, which implies that he or she has a high demand for funding.
In the scatter plot the relationship is not clear: most loans have 1 to 2 inquiries in InquiriesLast6Months across Prosper Scores of 2 to 10.
But a trend is still present, with the upper bound of InquiriesLast6Months decreasing as Prosper Score increases.
The trend in loan status across Prosper Score and InquiriesLast6Months is also not clear.
Most high-risk loans are located at lower levels of Prosper Score, but at any given Prosper Score they are spread widely across all levels of InquiriesLast6Months.
How about just investigating the difference in InquiriesLast6Months by loan status?
From the graph above we can see no significant difference in InquiriesLast6Months between high-risk loans and completed loans.
The result is fairly reasonable, since a higher InquiriesLast6Months does not directly imply that a borrower has bad credit.
It may signal a potential risk, but not necessarily, at least in this data set.
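A box-plot style comparison like this one can also be read off as numbers by computing the quartiles of InquiriesLast6Months within each loan group. The snippet is a sketch on invented values, again assuming a hypothetical `LoanGroup` label column; heavily overlapping quartiles are what "no clear difference" looks like numerically.

```r
# Invented rows; LoanGroup is a hypothetical, pre-computed label.
loans <- data.frame(
  LoanGroup = rep(c("Completed", "HighRisk"), each = 6),
  InquiriesLast6Months = c(0, 1, 1, 2, 2, 5,
                           0, 1, 1, 2, 3, 6)
)

# Quartiles per group: rows are the 25%, 50% and 75% points,
# columns are the loan groups.
quartiles <- sapply(
  split(loans$InquiriesLast6Months, loans$LoanGroup),
  quantile, probs = c(0.25, 0.5, 0.75), na.rm = TRUE
)
print(quartiles)
```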
Bank Card Utilization
BankcardUtilization measures the share of the borrower’s total credit limit that was in use at the time the credit profile was pulled.
The lower the ratio, the more financial liquidity one has.
In the box plot, the distribution of BankcardUtilization shifts downward as Prosper Score increases, but its variance is high at every Prosper Score, which makes the trend between Prosper Score and BankcardUtilization unclear.
A rough inverse trend still exists, though, especially at the high and low ends of Prosper Score.
Maybe Prosper uses BankcardUtilization in some more complicated way in its model.
The trend in loan status across Prosper Score and BankcardUtilization is not clear.
Most high-risk loans are located at lower Prosper Scores of 1 to 6, but at any given Prosper Score they are spread widely across all levels of BankcardUtilization.
Still, the result does not surprise me much, since a high BankcardUtilization does not necessarily mean an individual has bad credit.
He or she may just carry a high credit balance at that time but still pay on time.
The graph above shows that BankcardUtilization is distributed slightly higher for high-risk loans than for completed loans, but both groups have high variance in BankcardUtilization, so there is no significant difference between them.
Debt To Income Ratio
In general, a higher DebtToIncomeRatio indicates that an individual has too much debt for the amount of income he or she has.
Conversely, a lower DebtToIncomeRatio means a good balance between debt and income.
The graph above presents a clear inverse linear trend: DebtToIncomeRatio decreases as Prosper Score increases.
The loan status graph also shows a clear trend in loan status and DebtToIncomeRatio across Prosper Score.
High-risk loans tend to have lower Prosper Scores of 1 to 6 and a DebtToIncomeRatio of 0.5 to 1.
Completed loans tend to have higher Prosper Scores of 6 to 10 and a DebtToIncomeRatio ranging from 0.1 to 0.
It seems that Prosper Score can discriminate between completed and high-risk loans at different levels of DebtToIncomeRatio as well.
This result looks reasonable, since DebtToIncomeRatio is a direct measure of an individual’s ability to repay.
Stated Monthly Income
As is typical for income data, a histogram (not shown here) shows that the distribution of StatedMonthlyIncome is right-skewed, so I drop the top 1% of StatedMonthlyIncome values.
The graph above shows that the trend between Prosper Score and StatedMonthlyIncome has a concave-up shape.
That is reasonable, since income is also a direct measure of debt-paying ability, and the nature of income data means the income level rises roughly exponentially with Prosper Score.
Each additional point of Prosper Score requires a progressively higher tier of income.
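Dropping the top 1% of a right-skewed variable, as done for StatedMonthlyIncome at the start of this section, can be sketched with `quantile()`. The incomes below are invented, and whether rows exactly at the 99th percentile are kept or dropped is an assumption; here they are kept.

```r
# Invented incomes, with one extreme value in the right tail.
loans <- data.frame(
  StatedMonthlyIncome = c(seq(1000, 10900, by = 100), 500000)
)

# 99th percentile of income; rows above it make up the top 1%.
cutoff <- quantile(loans$StatedMonthlyIncome, probs = 0.99, na.rm = TRUE)

# Keep rows at or below the cutoff. Rows with a missing income are
# excluded explicitly, since comparing NA with the cutoff gives NA.
keep <- !is.na(loans$StatedMonthlyIncome) &
  loans$StatedMonthlyIncome <= cutoff
trimmed <- loans[keep, , drop = FALSE]

print(nrow(loans) - nrow(trimmed))      # number of rows removed
print(max(trimmed$StatedMonthlyIncome)) # largest income that remains
```

Trimming by quantile rather than by a fixed dollar amount keeps the rule independent of the scale of the data.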
The graph above shows a clear trend among Prosper Score, StatedMonthlyIncome and loan status.
High-risk loans tend to have lower Prosper Scores of 1 to 6 and lower StatedMonthlyIncome, from $2,000 to $5,000.
Conversely, completed loans tend to have higher Prosper Scores of 7 to 10 and higher StatedMonthlyIncome, from $5,000 to $10,000.
As with DebtToIncomeRatio and ScorexChangeAtTimeOfListing, Prosper Score can probably discriminate between completed and high-risk loans at different levels of StatedMonthlyIncome as well.
Exploration Summary
We have picked five features to see whether they show trends with Prosper Score and whether they can discriminate between completed and high-risk loans.
The exploration results show:
Both ScorexChangeAtTimeOfListing and DebtToIncomeRatio present a linear trend with Prosper Score.
The former has a positive relationship with Prosper Score; the latter has a negative one.
Both ScorexChangeAtTimeOfListing and DebtToIncomeRatio also show clear patterns for completed and high-risk loans across different levels of Prosper Score.
High-risk loans tend to have lower Prosper Scores, and lower ScorexChangeAtTimeOfListing and higher DebtToIncomeRatio holding Prosper Score constant; completed loans tend to have higher Prosper Scores, and higher ScorexChangeAtTimeOfListing and lower DebtToIncomeRatio holding Prosper Score constant.
StatedMonthlyIncome also presents a trend with Prosper Score, but the trend is increasing and concave-up rather than linear.
It shows that each additional point of Prosper Score requires a progressively higher level of StatedMonthlyIncome.
StatedMonthlyIncome also shows a clear pattern for completed and high-risk loans across different levels of Prosper Score.
High-risk loans tend to have lower Prosper Scores, and lower StatedMonthlyIncome holding Prosper Score constant.
Neither InquiriesLast6Months nor BankcardUtilization presents a significant trend with Prosper Score or loan status.
So far, we have seen some interesting patterns linked to Prosper Score and loan status, and we found that ScorexChangeAtTimeOfListing, DebtToIncomeRatio and StatedMonthlyIncome have clear trends with Prosper Score.
Looking at these three features at different levels of Prosper Score, we can see that they also present clear patterns between completed and high-risk loans holding Prosper Score constant.
It is not surprising that DebtToIncomeRatio and StatedMonthlyIncome present such trends, since the two are direct measures of debt-paying ability.
But I was impressed that ScorexChangeAtTimeOfListing presents a trend with Prosper Score and can also discriminate between completed and high-risk loans.
Peer-to-peer lending is a novel concept, and so is its credit risk measurement.
Throughout this project, we have dug into some background knowledge about Prosper and used basic EDA tools to find features that are linked to Prosper’s credit risk measurements, Prosper Rating and Prosper Score.
Compared with the traditional measurement, the credit bureau score, these features show a decent ability to distinguish between high-risk loans and completed loans on the lending platform.
These features are also built in a way that takes advantage of both credit bureau scores and the platform’s own loan data.
I think such an interesting concept can be used for further data science applications in peer-to-peer lending platforms, like predicting default rates or bad loans.
Hope you enjoyed this concept-based exploratory data analysis!
Note: For more detailed exploration results, see my report on RPubs and the code on GitHub.