If you didn't catch part 1, you can find it on towardsdatascience.com: Suicide in the 21st Century (Part 1): Suicide is not contagious, we need to talk about it.

As mentioned there, part 2 incorporates machine learning, or, more specifically, K-Means clustering in Python.
Before we get into that, here’s a quick recap of part 1 if you missed it.
Recap

In part 1, we mainly did data preprocessing and some EDA (exploratory data analysis).
The data used was the worldwide suicide rates dataset from Kaggle.
We cleaned the data and added an additional column, HappinessScore, taken from the World Happiness Report published every year by the UN.
We then did some basic visual data analysis in the form of plots. Now that we've recapped, we can move on to some machine learning!

K-Means

K-Means is a widely used and relatively simple unsupervised machine learning technique for clustering.
The algorithm works iteratively and attempts to partition the dataset into K pre-defined subgroups (or clusters) whereby each data point belongs to a single group.
It aims to make the data points within each cluster (intra-cluster) as similar as possible, whilst keeping the clusters themselves as far apart as possible.
The algorithm works as follows:

1. Initialization: K-Means randomly chooses K data points from the dataset as the initial centroids (a centroid being the centre of a cluster). Note that the algorithm does not yet know the correct positions for the clusters.
2. Cluster assignment: it calculates the Euclidean distance between each data point and each cluster centre, then assigns the data point to the cluster whose centre is closest.
3. Moving centroids: it recalculates each cluster centre as the mean of the points currently assigned to it, i.e. c_i = (1 / |C_i|) * Σ(x for x in C_i), where |C_i| is the number of data points in the ith cluster.

Steps 2 and 3 are repeated until the centroids stop moving (a minimal sketch of these steps is given below).
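To make the steps concrete, here is a minimal NumPy sketch of the algorithm. It is purely illustrative; the actual clustering later in this post uses scikit-learn's KMeans.

```python
import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    """Toy K-Means: X is an (n_samples, n_features) array. Empty clusters are not handled."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Cluster assignment: each point goes to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Moving centroids: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids
```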
Preparation

Before implementing machine learning of any sort, it is essential to make sure the data is prepared correctly.
If not, the algorithms are extremely likely to spit out errors.
For the first step of preparation, an extra column was added to the previously used data frame: GiniIndex.
The Gini Index is a measure of income distribution, developed by Corrado Gini in 1912; the index (or coefficient) is used as a gauge of economic inequality.
The coefficient is measured with a value between 0 and 1, where 0 represents perfect equality, and 1 represents perfect inequality.
Data for the Gini Index was pulled from the CIA public library and added in the same way as HappinessScore: a list of Gini Index values was created, read into the new column, and again converted to float values.
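In code, this looks something like the sketch below, where gini_by_country is a hypothetical lookup (with placeholder values) built from the CIA data:

```python
# hypothetical lookup built from the CIA data: {country name: Gini index}
gini_by_country = {"Lithuania": 0.37, "Japan": 0.33, "Finland": 0.27}  # placeholder values only

# map each row's country to its Gini index and store it as a float column
# (countries missing from the lookup would become NaN)
dfcont['GiniIndex'] = dfcont['Country'].map(gini_by_country).astype(float)
```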
As previously mentioned, K-Means uses Euclidean distance, which can become problematic when faced with features that are very different in scale/units.
For example, GDP per capita is often in the tens of thousands while the happiness score is generally a float below 10, so the distance calculation would be dominated by GDP and the clustering would be extremely sub-optimal. To avoid this obstacle, a Min-Max scaler is applied to the data frame. In this scaling, the minimum of each feature is subtracted from its values, and the result is divided by the difference between the feature's maximum and minimum, i.e. x_scaled = (x - min) / (max - min), so that every feature in the dataset ends up with values between 0 and 1.
The two string columns — Country and Continent — are dropped, as MinMaxScaling cannot be applied to strings.
The scaler is then applied, resulting in a new data frame — dfscaled — with features Suicides/100kPop, GdpPerCapita($), HappinessScore, and GiniIndex.
```python
import pandas as pd
from sklearn import preprocessing

# drop the two string columns, as MinMaxScaler cannot be applied to strings
dfcontpre = dfcont.drop('Country', axis=1)
dfcontpre = dfcontpre.drop('Continent', axis=1)

# scale every feature to the [0, 1] range and rebuild the data frame
minmax_processed = preprocessing.MinMaxScaler().fit_transform(dfcontpre)
dfscaled = pd.DataFrame(minmax_processed, index=dfcontpre.index, columns=dfcontpre.columns)
```

Before implementing the algorithm, it is important to choose an optimal number of clusters, K.
This could theoretically be achieved through trial and error; however, it is more efficient to plot an elbow curve. The idea of this method is to run K-Means on the dataset for a range of values of k (in this case, 1 to 19) and, for each k, calculate and plot the sum of squared errors (SSE).
Ideally this line graph will look like an arm, with the ‘elbow’ showing the optimal number k for clustering in the dataset.
Firstly, let's define a nice colour palette to use:

```python
import seaborn as sns

flatui = ["#6cdae7", "#fd3a4a", "#ffaa1d", "#ff23e5", "#34495e", "#2ecc71"]
sns.set_palette(flatui)
sns.palplot(sns.color_palette())
```

Now we can plot the elbow curve:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# assumed: Y is the scaled feature data used for the elbow search
Y = dfscaled.values

# plotting elbow curve for ideal number of clusters
Nc = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in Nc]
# note: score() returns the negative of the SSE (inertia), so higher is better
score = [kmeans[i].fit(Y).score(Y) for i in range(len(kmeans))]

plt.plot(Nc, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
```

It can be seen from the elbow plot that 3 is the optimal number of clusters; although the elbow is not overly distinctive, it is enough to decide on k.
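If the elbow is hard to read visually, the improvement gained by each additional cluster can also be inspected numerically; a quick sketch, reusing Nc and score from above:

```python
import numpy as np

# improvement in score gained by each additional cluster
improvements = np.diff(score)
for k, gain in zip(list(Nc)[1:], improvements):
    print(f"k={k}: score improved by {gain:.3f}")
```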
Now that we have decided on the optimal number of clusters, we can start implementing K-Means.
Implementation

Suicides/100kPop vs GdpPerCapita($)

Firstly, the data from the two columns is placed into 1D NumPy arrays, and these arrays are zipped together to form a 2D NumPy array, X.
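The construction of X for this first pair follows the same pattern used for the later feature pairs; a minimal sketch, assuming the dfscaled columns defined above:

```python
import numpy as np

# 1d numpy arrays zipped into a 2d array (same pattern as the later pairs)
f1 = dfscaled['Suicides/100kPop'].values
f2 = dfscaled['GdpPerCapita($)'].values
X = np.array(list(zip(f1, f2)))
```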
A sample of the scaled 2D array can be printed as shown below (note that all values are between 0 and 1):

```python
print(X[0:5])
```

K-Means is then run on the 2D array X, using three clusters as previously decided.
A scatterplot is then created with the X and Y axes being Suicides/100kPop and GdpPerCapita($), respectively.
The points are then coloured by kmeans.labels_, i.e. according to the cluster each one was assigned by K-Means. The colours are mapped using cmap with the 'viridis' colour scheme (its colours contrast better than our custom-defined palette). Finally, the cluster centres are plotted using kmeans.cluster_centers_, producing the following plot:

```python
# k-means plot: suicide rate vs gdp, 3 clusters
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)

plt.figure(figsize=(20, 16))
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis', s=300)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red')
```

This plot shows an interesting result.
It can be seen that there is no data where x and y are both high, which, when referring back to Figure 14, makes sense.
We can therefore categorise these clusters as follows:

- Deep purple: high-GDP, low-suicide-risk countries
- Yellow: low-GDP, low-suicide-risk countries
- Teal: low-GDP, high-suicide-risk countries

It is fairly surprising that K-Means managed to cluster these groups so cleanly, especially given the fairly low correlation between GDP per capita and Suicides/100kPop.
We can also run the following code if we wish to see which countries were assigned to which cluster:

```python
cluster_map = pd.DataFrame()
cluster_map['data_index'] = dfscaled.index.values
cluster_map['cluster'] = kmeans.labels_
cluster_map.head(50)
```
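The data_index on its own only identifies rows; to see the actual country names, the cluster labels can be joined back to the unscaled frame. A minimal sketch, assuming dfcont (which still holds the Country column) shares its index with dfscaled:

```python
# attach country names by aligning on the shared index (assumption: the indices match)
cluster_map['Country'] = dfcont.loc[dfscaled.index, 'Country'].values
cluster_map.head(50)
```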
Suicides/100kPop vs. HappinessScore

Applying K-Means to Suicides/100kPop and HappinessScore is managed in much the same way: two one-dimensional NumPy arrays are zipped into a single two-dimensional array, which is then clustered by K-Means.
The colours are then mapped and cluster centroids added.
```python
# 1d numpy arrays zipped to 2d
f1 = dfscaled['Suicides/100kPop'].values
f2 = dfscaled['HappinessScore'].values
X = np.array(list(zip(f1, f2)))

# k-means: suicide rate vs happiness score
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)

plt.figure(figsize=(20, 16))
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis', s=300)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red')
```

Again, three fairly distinct clusters can be seen, which can be categorised as follows:

- Yellow: low-risk 'happy' countries
- Deep purple: low-risk 'happy' countries
- Teal: high-risk 'unhappy' countries

Again, the data being clustered so cleanly is surprising given the low correlation between the two features (-0.24, from part 1).
Suicides/100kPop vs. GiniIndex

Applying K-Means to Suicides/100kPop and GiniIndex is undertaken using the same approach as previously.

```python
# 1d numpy arrays zipped to 2d
f1 = dfscaled['Suicides/100kPop'].values
f2 = dfscaled['GiniIndex'].values
X = np.array(list(zip(f1, f2)))

# plot k-means: suicide rate vs gini index
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)

plt.figure(figsize=(20, 16))
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis', s=300)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red')
```

Again, the algorithm presents three distinct clusters that are very similar to the previous results, with no data points in the top-right (both high X and high Y values).
This data can be categorised as follows:

- Deep purple: low-risk 'unequal' countries
- Teal: low-risk 'equal' countries
- Yellow: high-risk 'equal' countries

This is a very surprising result, as a high Gini Index score means higher wealth inequality, and there are no countries with both a very high Gini Index score and a very high suicide rate.
This could be explained by a variety of factors; for example, countries with high wealth inequality may have a true suicide rate much higher than the reported figure.
Although the result is surprising in itself, it can be explained through the feature correlations. Correlating the data again with the new column included shows a correlation of -0.17 between Suicides/100kPop and GiniIndex, meaning that there is a slight inverse relationship: as GiniIndex increases, Suicides/100kPop tends to decrease.
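This correlation can be checked with pandas' corr(); for example:

```python
# correlation matrix of the scaled features, including the new GiniIndex column
corr = dfscaled[['Suicides/100kPop', 'GdpPerCapita($)', 'HappinessScore', 'GiniIndex']].corr()
print(corr['Suicides/100kPop'])
```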
To show how much influence each of these features has on Suicides/100kPop, permutation importance with a random forest can be applied.
Permutation Importance

Feature importance is a method that calculates which features have the biggest impact on predictions.
There are many ways to implement this; however, permutation importance is one of the fastest and most widely used.
Permutation importance works on the idea that feature importance can be measured by looking at how much the score or accuracy decreases when a feature becomes unavailable.
Theoretically, this could be tested by removing a feature, re-training the estimator, then checking the resulting score, and repeating with each feature.
This would take a lot of time as well as being rather computationally intensive.
Instead, a feature is ‘removed’ from the test part of the dataset.
Estimators would expect the feature to be present, so fully removing it would result in errors; therefore the feature's values are replaced with noise by shuffling them (a minimal sketch of this idea is shown below).
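As an illustration of the idea (rather than the eli5 implementation used later in this post), a rough sketch of the shuffling approach for an already-fitted model might look like this:

```python
import numpy as np

def manual_permutation_importance(model, X, y, n_repeats=5, seed=42):
    """Toy permutation importance: how much does the score drop when each feature is shuffled?"""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)  # score with all features intact
    importances = []
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # replace the feature with 'noise' by shuffling its values
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(baseline - model.score(X_perm, y))
        importances.append(np.mean(drops))  # average score drop = importance
    return importances
```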
It should also be noted that permutation importance struggles with a large number of columns as it can become very resource-intensive.
This is not an issue, however, with the dataset used in this post, as there are very few columns.
Therefore, the necessary modules were imported into the project: permutation importance (from eli5) and the random forest regressor.
A new data frame was created, dfrfr, from dfscaled whilst removing the index column.
The target variable (Suicides/100kPop) is stored in y, whilst the rest of the columns are stored in X.
Permutation importance using random forest was then implemented, with the resulting weights and features displayed.
```python
# import modules for permutation importance, and show the output
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.ensemble import RandomForestRegressor

# drop the index column so only the features and target remain
dfrfr = dfscaled.drop("index", axis=1)

rfr = RandomForestRegressor(random_state=42)
y = dfrfr[['Suicides/100kPop']].values
X = dfrfr.drop('Suicides/100kPop', axis=1).values

perm = PermutationImportance(rfr.fit(X, y), random_state=42).fit(X, y)
eli5.show_weights(perm, feature_names=dfrfr.drop('Suicides/100kPop', axis=1).columns.tolist())
```

This result makes sense when referring back to the feature correlations: HappinessScore had the highest correlation with Suicides/100kPop, followed by GdpPerCapita($), and finally GiniIndex.
Conclusions

Overall, although the data did not show the expected impact of certain features, it still leads to some very interesting conclusions, such as the apparent irrelevance of income inequality to suicide rates across countries, something that was previously assumed to have a significant effect. The analysis is limited by the lack of correlation between features that were expected to be heavily linked, i.e. HappinessScore and Suicides/100kPop, as general unhappiness was assumed to increase suicide rates.
It is also limited by the lack of data available for certain countries and certain years, meaning that the main analysis had to be undertaken on data from 2015 rather than 2018/19.
In future analysis, it would be ideal to have a feature that is verified to have a large impact on suicide rates across countries.
This would allow for more accurate plots and also a more meaningful K-Means clustering.
Full code for this post can be found on my GitHub: HarryBitten/Suicide-Rates-Analysis (github.com).

Thanks for reading! Please feel free to clap the article if you enjoyed it! I'm new to all this, but I intend to keep posting lots of fun projects, mainly in data science/machine learning.