Preventing Discriminatory Outcomes in Credit Models

Machine learning is being deployed for large-scale decision making, which can strongly impact the lives of individuals.

If we do not consider and analyse such scenarios, we may end up building models that fail to treat groups in society equally and may even infringe anti-discrimination laws.

There are several algorithmic interventions for identifying unfair treatment, each based on a particular notion of what is considered fair.

In this article, we will visit these and explain their benefits and limitations with a case study.

You can find the complete Python project in this GitHub repository.

❗️Warning: this article contains a lot of info on confusion matrices and related metrics.

1) Sources of discrimination

Measurement

Within machine learning, our world is measured and stored in a dataset.

The greatest challenge with this practice is that it is subjective by nature.

Humans need to come up with optimal labels to categorise our world, which can lead to selection bias.

Not only may the target variable be wrong for some groups, but so may the information collected to describe it.

On the technical side, datasets can also have incorrect, incomplete and/or outdated information (read Police across the US are training crime-predicting AIs on falsified data).

It is also important to remember that as the world changes, so do the labels that describe it.

This can lead to an outdated representation of our surroundings.

Finally, we also have the promotion of historical biases, whereby stereotypes and stigmas are preserved as the norm.

Learning

A model learns from the data to detect patterns and make generalisations from it, which may include disparities, distortion and bias.

Even if we do not explicitly provide sensitive information to avoid bias, there are several proxies that can approximate it (read Personality Tests Are Failing American Worker).

At the same time, models work better as more relevant information is supplied.

Minority groups are by default at a disadvantage, given that they will usually have less data for the model to learn from (read Amazon scraps secret AI recruiting tool that showed bias against women).

Action

An action is a decision made by a model, such as granting a loan or displaying an ad.

When a model is calibrated, it will most likely produce different error rates for different groups.

The lack of analysis of these error rates can lead to unfair treatment (we'll revisit this later in the case study; also, read Machine Bias by ProPublica).

Feedback

Some models get 'feedback' from user interactions to refine their predictions.

Here, we can encounter problems with discrimination in the form of self-fulfilling predictions.

When data is biased, unfair predictions will often end up being validated.

Humans in this case will be influenced so that their reactions validate the unfair predictions (read Police across the US are training crime-predicting AIs on falsified data).

2) Algorithmic interventions

In this section, we will review four different practices that have been proposed to measure unfairness in supervised machine learning.

Fairness through unawareness

This is the case when we run a model after simply excluding protected attributes.

This concept is largely ineffective in ensuring fairness, as there are plenty of proxies that can predict such attributes.
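As a minimal sketch of what this looks like in practice (assuming a pandas DataFrame with hypothetical column names such as "Gender" and "Age"), unawareness amounts to nothing more than dropping the protected columns before training, which is exactly why proxies slip through:

```python
import pandas as pd

# Hypothetical protected attribute columns; the real loan book uses its own schema.
PROTECTED = ["Gender", "Age"]

def drop_protected(df: pd.DataFrame) -> pd.DataFrame:
    """Fairness through unawareness: remove protected attributes before training.
    Proxies (e.g. occupation, postcode) remain and can still encode them."""
    return df.drop(columns=[c for c in PROTECTED if c in df.columns])
```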

Demographic parity

Demographic parity requires that each group qualifies for loans at the same frequency.

This also means that the decision should be independent of the protected attribute A.

This can be described as: P(Ŷ = 1 | A = 0) = P(Ŷ = 1 | A = 1).

For credit models, this can be interpreted as the fraction of qualified loans being the same across all classes in the group (equal positive rate).
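A small sketch of how this could be checked (the helper names are my own, not from the repository): compare the fraction of granted loans per protected class.

```python
import numpy as np

def positive_rate(y_pred: np.ndarray, group: np.ndarray, value) -> float:
    """P(Y_hat = 1 | A = value): fraction of granted loans within one class."""
    return y_pred[group == value].mean()

def demographic_parity_gap(y_pred, group, a=0, b=1) -> float:
    """Difference in positive rates between two classes; 0 means parity."""
    return abs(positive_rate(y_pred, group, a) - positive_rate(y_pred, group, b))
```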

Equal opportunity

Equal opportunity suggests that the fraction of correctly classified members in the "advantaged" outcome, namely Y = 1, should be the same across all groups.

For instance, we could consider the “advantaged” outcome when an entry is considered as “not defaulting on a loan”.

For all protected classes within A, the following must hold true:

P(Ŷ = 1 | A = 0, Y = 1) = P(Ŷ = 1 | A = 1, Y = 1)

This concept usually results in better utility but, as with demographic parity, it can still penalise a certain class through a higher false positive rate.

In practice, this requires the true positive rate to be the same across the two groups A = 1 and A = 0.
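As a sketch (again with helper names of my own), equal opportunity can be checked by comparing true positive rates conditioned on Y = 1:

```python
import numpy as np

def true_positive_rate(y_true, y_pred, group, value) -> float:
    """P(Y_hat = 1 | Y = 1, A = value): TPR within one protected class."""
    mask = (group == value) & (y_true == 1)
    return y_pred[mask].mean()

def equal_opportunity_gap(y_true, y_pred, group, a=0, b=1) -> float:
    """Difference in TPR between two classes; 0 satisfies equal opportunity."""
    return abs(true_positive_rate(y_true, y_pred, group, a)
               - true_positive_rate(y_true, y_pred, group, b))
```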

Equalised odds

Equalised odds, as with equal opportunity, requires that the fraction of correctly classified members in the "advantaged" outcome be the same across all classes in the group.

However, it also requires that the fraction of members misclassified into the "advantaged" outcome be the same across all classes in the corresponding groups.

This can be formulated as: P(Ŷ = 1 | A = 0, Y = y) = P(Ŷ = 1 | A = 1, Y = y) for y ∈ {0, 1}.

This means that not only the true positive rate but also the false positive rate should be the same for each group.
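A sketch of the corresponding check (helper names are mine): equalised odds compares the positive rate conditioned on both outcomes, i.e. the TPR and the FPR.

```python
import numpy as np

def conditional_positive_rate(y_true, y_pred, group, value, outcome) -> float:
    """P(Y_hat = 1 | Y = outcome, A = value); outcome=1 gives TPR, outcome=0 gives FPR."""
    mask = (group == value) & (y_true == outcome)
    return y_pred[mask].mean()

def equalised_odds_gaps(y_true, y_pred, group, a=0, b=1):
    """TPR gap and FPR gap between two classes; both 0 satisfies equalised odds."""
    tpr_gap = abs(conditional_positive_rate(y_true, y_pred, group, a, 1)
                  - conditional_positive_rate(y_true, y_pred, group, b, 1))
    fpr_gap = abs(conditional_positive_rate(y_true, y_pred, group, a, 0)
                  - conditional_positive_rate(y_true, y_pred, group, b, 0))
    return tpr_gap, fpr_gap
```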

3) Case study

For this case study, I made use of a public loan book from Bondora, a P2P lending platform based in Estonia.

I looked into two different protected groups: gender and age.

Bondora provides lending to less credit-worthy customers, with much higher default rates than those seen in traditional banks.

This means that the interest collected is significantly higher.

On average, the loan amount for this dataset was around €2,100, with a payment duration of 38 months and an interest rate of 26.30%.

For traditional banks, the cost of a false positive (misclassifying a defaulting loan) is many times greater than the reward of a true positive (correctly classifying a non-defaulting loan).

Given the higher interest rates collected by Bondora compared to banks, I will assume for illustration purposes that the reward to cost ratio is much smaller at 1 to 2.

This will be used to find the best thresholds to maximise profits while meeting all requirements for each algorithmic intervention.
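To make the threshold search concrete, here is a minimal sketch under the assumed 1:2 reward-to-cost ratio (the score and label arrays are placeholders; the exact profit figures in the repository may be computed differently):

```python
import numpy as np

REWARD_TP = 1.0   # reward for correctly granting a loan that is paid back
COST_FP = 2.0     # cost of granting a loan that defaults (assumed 1:2 ratio)

def expected_profit(y_true: np.ndarray, scores: np.ndarray, threshold: float) -> float:
    """Profit at a given threshold; y_true = 1 means the loan was paid back."""
    grant = scores >= threshold
    tp = np.sum(grant & (y_true == 1))
    fp = np.sum(grant & (y_true == 0))
    return tp * REWARD_TP - fp * COST_FP

def best_threshold(y_true, scores, grid=np.linspace(0.0, 1.0, 101)) -> float:
    """Pick the threshold that maximises profit over a simple grid."""
    return max(grid, key=lambda t: expected_profit(y_true, scores, t))
```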

I then developed a classification model that predicts whether a loan is likely to be paid back, using Gradient Boosted Decision Trees.
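A minimal training sketch is shown below; the file name, target definition and hyperparameters are assumptions for illustration, not the exact pipeline from the repository.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical preprocessing: "LoanData.csv" and the target definition are
# assumptions about the exported loan book, not its exact schema.
df = pd.read_csv("LoanData.csv")
y = df["DefaultDate"].isna().astype(int)                 # 1 = loan paid back (assumed)
X = df.drop(columns=["DefaultDate"]).select_dtypes("number")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                   max_depth=3, random_state=42)
model.fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]               # predicted probability of repayment
```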

With the results of the model predictions, I then analysed the following scenarios:

Maximise profit uses different classification thresholds for each group and aims only at maximising profit.

Fairness through unawareness uses the same classification threshold for all groups while maximising profit.

Demographic parity applies different classification thresholds for each group, while keeping the same fraction of positives in each group.

Equal opportunity uses different classification thresholds for each group, while keeping the same true positive rate in each group.

Equalised odds applies different classification thresholds for each group, while keeping the same true positive rate and false positive rate in each group.
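The sketch below illustrates how one of these constrained searches could be implemented, using the equal opportunity scenario as an example: it grid-searches one threshold per group, keeps only combinations whose true positive rates roughly match, and maximises profit among them (helper names and the tolerance are my own choices). For equalised odds, the same search would additionally constrain the false positive rates; for demographic parity, it would constrain the positive rates instead.

```python
import itertools
import numpy as np

REWARD_TP, COST_FP = 1.0, 2.0   # assumed 1:2 reward-to-cost ratio from above

def group_profit(y_true, scores, t):
    """Profit within one group at threshold t (y_true = 1 means paid back)."""
    grant = scores >= t
    return (np.sum(grant & (y_true == 1)) * REWARD_TP
            - np.sum(grant & (y_true == 0)) * COST_FP)

def constrained_thresholds(y_true, scores, group, tol=0.01,
                           grid=np.linspace(0.0, 1.0, 51)):
    """Grid-search one threshold per group that maximises total profit while
    keeping per-group true positive rates within `tol` of each other
    (the equal opportunity scenario)."""
    groups = list(np.unique(group))
    best, best_total = None, -np.inf
    for combo in itertools.product(grid, repeat=len(groups)):
        tprs, total = [], 0.0
        for g, t in zip(groups, combo):
            m = group == g
            grant = scores[m] >= t
            tprs.append(grant[y_true[m] == 1].mean())   # TPR within this group
            total += group_profit(y_true[m], scores[m], t)
        if max(tprs) - min(tprs) <= tol and total > best_total:
            best, best_total = dict(zip(groups, combo)), total
    return best
```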

Brief data exploration

I selected all loans from Bondora's platform that were granted to residents in Estonia.

This totals 21,450 entries as of the date accessed.

Loans granted to females represented 39% of all entries, while loans granted to males represented 61%.

The fraction of loans received by people under 40 was higher at 63% compared to 37% of loans received by people over 40.

The overall default rate was around 42%.

If we look into the default rate by gender, we notice that both groups had nearly the same default rate, at around 44%.

If we look into age, the default rate is slightly higher for people under 40 at around 45% compared to 42% for people over 40.
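These figures can be reproduced with a few lines of pandas; the file and column names below ("LoanData.csv", "Gender", "Age", "DefaultDate") are assumptions about the exported loan book.

```python
import pandas as pd

df = pd.read_csv("LoanData.csv")                           # hypothetical file name
df["Defaulted"] = df["DefaultDate"].notna().astype(int)    # assumed default definition

print(df["Defaulted"].mean())                              # overall default rate
print(df.groupby("Gender")["Defaulted"].mean())            # default rate by gender
print(df.groupby(df["Age"] < 40)["Defaulted"].mean())      # default rate by age group
```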

ROC and AUC

We will now look at the ROC curves for each group's classes.

The differences that we will notice among the curves (e.g. male vs female) are due to differences in true positive rates and false positive rates.

Lower curves, and consequently lower AUC values, suggest that the predictions are less accurate for those classes.
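A sketch of how the per-group curves can be computed (group_test is assumed to hold the protected attribute aligned with the held-out labels and scores):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def roc_by_group(y_test, scores, group_test):
    """Compute an ROC curve and AUC separately for each protected class."""
    curves = {}
    for g in np.unique(group_test):
        m = group_test == g
        fpr, tpr, _ = roc_curve(y_test[m], scores[m])
        curves[g] = (fpr, tpr, roc_auc_score(y_test[m], scores[m]))
        print(f"{g}: AUC = {curves[g][2]:.3f}")
    return curves
```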

If we look at the ROC curve for gender in the graph above, we notice that the line for females lies below the line for males on most occasions.

We can therefore see a slightly lower AUC value of 0.760 for females compared to 0.767 for males.

If we look at the same graph, but this time for age group, we can notice that the line for people over 40 is usually above the line for people under 40.

It is important to note that in Bondora's loan book the percentage of people under 40 was much higher than the percentage of people over 40.

Despite having more data, predictions for people under 40 are less accurate, which suggests that alternative features to describe this group may be necessary.

Given the discrepancies identified in each group, we may start to suspect that the set of features used is more representative of males and/or people over 40 years old.

The dataset may be biased towards men, and usually older ones, as this is historically the data that has been most accessible to study and work with.

Algorithmic interventions (Gender)

When looking at the breakdown of the true positive rate and false positive rate for each gender, we can notice that these vary very little across interventions, with the exception of the Maximise profit intervention.

In that case, we can see that the model better classifies positives for males (73%) than females (70%).

However, it also grants a slightly higher proportion of loans that will default to males (33.6%) than to females (32%).

When looking at profits, these clearly drop under the equalised odds intervention.

This occurs due to the more restrictive thresholds used.

The thresholds for the female and male groups were 0.56 and 0.55 respectively, while the average threshold used across all interventions was 0.49 for females and 0.48 for males.

Algorithmic interventions (Age group)

If we now look into the age group, we can see that when we pick a common threshold for all groups under fairness through unawareness, we tend to better classify loans to people who can pay back among those over 40 than among those under 40.

The false positive rate remains about the same.

Under demographic parity, people under 40 (13%) are at a slightly higher risk of receiving a loan they cannot pay back compared to people over 40 (11%).

The opposite occurs for maximise profit.

On the other hand, equal opportunity as well as equalised odds provide a better balance with regard to the trade-off between the true positive rate and false positive rate of each category in the group.

Nevertheless, profits decrease significantly in equalised odds due to very restrictive thresholds leading to small false positive rates.

The thresholds used were 0.83 for females and 0.81 for males, while the average threshold used across all interventions was 0.69 for females and 0.68 for males.

4) Conclusion

In this article, we reviewed how discrimination can occur within supervised machine learning.

We then looked into different algorithmic interventions to identify whether certain classes within protected groups were treated unfairly.

We reviewed the pitfalls of each intervention and evaluated them with a concrete example using a public loan book from Bondora, a P2P lending platform.

When analysing the results of the gender- and age-based models using Gradient Boosted Decision Trees, we could immediately notice that profits were maximised when different thresholds without restrictions were used for each class (e.g. female/male).

Nevertheless, the true positive rates differed across groups.

Equal opportunity as well as equalised odds provided a better balance with regard to the trade-off between the true positive rate and false positive rate of each class in the protected group.

However, under equalised odds we could see profits being compromised significantly.

This simple post-processing analysis has the benefit that it does not involve modifying complex models.

It can be used as a measure of unfairness that helps stakeholders act when a specific class is treated unfairly.

It also creates an incentive to ensure that information is collected in a fair manner and that the features used to describe the target variable are representative of each class in the group.

Finally, this analysis does not challenge the causal reasoning of the model.

The criteria used are purely observational and depend only on the joint distribution of the predictor, protected attribute, features, and outcome.

Can I see some code?

I've made my code public on GitHub.

I'd love to hear your feedback and recommendations to improve it.

References

Data

Bondora's loan book. Available at: https://www.bondora.com/en/public-reports [Accessed August 18, 2018].

Main Literature

Barocas, S., Hardt, M. & Narayanan, A., 2018. Fairness and Machine Learning. Available at: http://fairmlbook.org/ [Accessed August 29, 2018].

Dwork, C. et al., 2012. Fairness Through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS '12). New York, NY, USA: ACM, pp. 214–226.

Hardt, M. et al., 2016. Equality of Opportunity in Supervised Learning. In Advances in Neural Information Processing Systems, pp. 3315–3323.

Pedreshi, D., Ruggieri, S. & Turini, F., 2008. Discrimination-aware Data Mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08). New York, NY, USA: ACM, pp. 560–568.
