Predicting Income with Feature Limitation

Written in collaboration with Julia Taussig, Anna Haas, Alanna Besaw, Patrick Cavins, and William Holder.

Figure 1: Illustration of tradeoffs between product delivery, product quality, and cost.

The image is courtesy of Rene T. Domingo, www.rtdonline.com/BMA/MM/qcd.htm.

Problem Statement

Now, we’ve all heard you can’t always get what you want, and as delightful as it is to rebel against something so painfully proven to be accurate, our semi-informed optimism demands that we try.

Today, we were given a task: to predict if a person’s income is > $50,000 (50k USD) given certain profile information, and more specifically to generate predicted probabilities of income being > 50k USD for each row in a test set.

We were constrained to a maximum of 20 features for creating the model and predicting whether people made > 50k USD.

People on project/product teams often have to balance quality, cost, and time.

We were posed this data science problem with a limit on the number of features to simulate a ‘cheap’ model constraint, and we were given 7 hours to work together (using various communication tools, since three team members were in Denver and two were in Seattle) to optimize a model and submit predictions.

The quality-cost-delivery illustration above (Figure 1) shows that when time and input features are constrained, the quality of the product can suffer.

We did our best to create a high-quality model despite these constraints, and this is a good practice given that typically there is a need for balancing quality, cost, and time in our industry.

Exploratory Data Analysis (EDA), Feature Engineering, and Cleaning

The data used for this analysis was extracted by Barry Becker from the 1994 U.S. Census Bureau database found at http://www.census.gov/ftp/pub/DES/www/welcome.html. The data and data dictionary can be found on Kaggle’s “Adult Census Income: Predict whether income exceeds $50K/yr based on census data” competition page at https://www.kaggle.com/uciml/adult-census-income.

There were a train dataset and a test dataset; the train dataset was explored first. It contained 32,561 rows (no null values, thankfully) and 14 columns. Six columns held integer data (age, fnlwgt, education-num, capital-gain, capital-loss, and hours-per-week), and the rest held object (string) data (workclass, education, marital-status, occupation, relationship, sex, native-country, and wage).

The Pandas get_dummies function was used to generate dummies for every categorical feature, and the result was a dataframe with over 100 columns.
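A minimal sketch of this encoding step, assuming the train set has already been loaded into a Pandas DataFrame named train and that the target column is named wage (the file and variable names are our illustration, not necessarily the team's code):

```python
import pandas as pd

# Hypothetical file name; the Kaggle competition page provides the actual CSVs.
train = pd.read_csv("adult_train.csv")

# One-hot encode every categorical (object) column except the wage target.
cat_cols = train.select_dtypes(include="object").columns.drop("wage")
dummies = pd.get_dummies(train, columns=cat_cols, dtype=int)

print(dummies.shape)  # roughly 32,561 rows and well over 100 columns
```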

The tables below summarize each categorical variable.

The workclass column had 9 different values (the “?” value was kept because it might prove informative with more context during analysis).

Figure 2: The table on the left shows the most to least frequent workclass values, the table in the middle shows the workclass values with the highest to lowest proportion of wage > 50k USD, and the table to the right shows the workclass values with the highest to lowest correlation to wage > 50k USD.

The education column had 16 different values.

Figure 3: The table on the left shows the most to least frequent education values, the table in the middle shows the education values with the highest to lowest proportion of wage > 50k USD, and the table to the right shows the education values with the highest to lowest correlation to wage > 50k USD.

The marital-status column had 7 different values.

Figure 4: The table on the left shows the most to least frequent marital-status values, the table in the middle shows the marital-status values with the highest to lowest proportion of wage > 50k USD, and the table to the right shows the marital-status values with the highest to lowest correlation to wage > 50k USD.

The relationship column had 6 different values.

Figure 5: The table on the left shows the most to least frequent relationship values, the table in the middle shows the relationship values with the highest to lowest proportion of wage > 50k USD, and the table to the right shows the relationship values with the highest to lowest correlation to wage > 50k USD.

The occupation column had 15 different values (the “?” value was kept because it might prove informative with more context during analysis).

Figure 6: The table on the left shows the most to least frequent occupation values, the table in the middle shows the occupation values with the highest to lowest proportion of wage > 50k USD, and the table to the right shows the occupation values with the highest to lowest correlation to wage > 50k USD.

The sex column had two values.

Note that the dataset is very unbalanced toward men since men account for approximately ⅔ of the people represented in the dataset.

It’s sad to see how much lower the correlation to wage > 50k USD is for women than for men, but hopefully the data looks very different now, more than two decades after this data was collected.

Figure 7: The table on the left shows the most to least frequent sex values, the table in the middle shows the sex values with the highest to lowest proportion of wage > 50k USD, and the table to the right shows the sex values with the highest to lowest correlation to wage > 50k USD.

The native-country column had 42 values.

The 15 most frequently found values in the dataset are shown in the table below to the left.

The 15 native-countries with the highest proportion of wage >50k USD are shown in the table in the middle.

The 15 native-countries with the highest correlation to wage >50k USD are shown to the right.

The “?” value was again kept because it might prove informative with more context during analysis.

Figure 8: The table on the left shows the top 15 most to least frequent native-country values, the table in the middle shows the top 15 native-country values with the highest proportions of wage > 50k USD, and the table to the right shows the top 15 native-country values with the highest to lowest correlation to wage > 50k USD.

The target column, wage, had two potential values: ≤50k USD and >50k USD.

Note that the dataset is imbalanced toward wages of ≤ 50k USD since more than 75% of the people in the dataset have wages ≤ 50k USD.

The tables below display the frequency and proportion of each wage value in the dataset.

Figure 9: The table on the left shows the frequency of wage values ≤ 50k USD and wage values > 50k USD, and the table on the right shows the proportion of wage values ≤ 50k USD and wage values > 50k USD.

The workclass_self-emp-not-inc (self-employed, not incorporated) feature had the highest correlation with wage >50k USD (approx. 0.139), while workclass_private had the lowest correlation with wage >50k USD (approx. -0.0785). These features had the strongest magnitudes of correlation with wage >50k USD, so they were used in further analysis and modeling.
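As a sketch (continuing from the dummy-encoding snippet above, and assuming the raw wage labels are "<=50K" and ">50K"), these dummy-vs-target correlations can be computed like so:

```python
# Binary target: 1 if wage > 50k USD (label strings assumed from the Kaggle data).
target = (train["wage"].str.strip() == ">50K").astype(int)

# Correlation of each workclass dummy with the target, sorted high to low.
workclass_cols = [c for c in dummies.columns if c.startswith("workclass_")]
print(dummies[workclass_cols].corrwith(target).sort_values(ascending=False))
```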

The education feature had 16 possible values.

The education number feature that assigned numbers to education level had a range of values from 1 (preschool) to 16 (doctorate), and it did not account for similarities between some education levels and proportion of wages >50k USD.

For example, Assoc-acdm and Assoc-voc had similar proportions of wages >50k USD, so it made sense to group them together.

The education feature was converted to a numerical feature by grouping education levels with similar proportions of wages >50k USD: Preschool → 0; 1st-4th → 1; 5th-6th → 2; 7th-8th → 3; 9th, 10th, and 11th → 4; 12th → 5; HS-grad → 6; Some-college → 7; Assoc-acdm and Assoc-voc → 8; Bachelors → 9; Masters → 10; Prof-school → 11; Doctorate → 12.

This new education feature was called edu_scale, and its correlation to wage >50k USD was approx. 0.342, greater than the original education-num feature’s correlation of approx. 0.335.
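A sketch of that mapping in Pandas, reusing the target series from the earlier snippet (the education and education-num column names are assumptions based on the data dictionary):

```python
# Group education levels with similar proportions of wage > 50k USD onto one ordinal scale.
edu_map = {
    "Preschool": 0, "1st-4th": 1, "5th-6th": 2, "7th-8th": 3,
    "9th": 4, "10th": 4, "11th": 4, "12th": 5,
    "HS-grad": 6, "Some-college": 7,
    "Assoc-acdm": 8, "Assoc-voc": 8,
    "Bachelors": 9, "Masters": 10, "Prof-school": 11, "Doctorate": 12,
}
train["edu_scale"] = train["education"].str.strip().map(edu_map)

print(train["edu_scale"].corr(target))       # approx. 0.342
print(train["education-num"].corr(target))   # approx. 0.335
```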

Marital status and relationship values were inspected, and a new feature was created.

It was called is_married and had values of 1 if the marital status was Married-civ-spouse or Married-AF-spouse and values of 0 otherwise.
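A short sketch of that flag (the value strings are taken from the data dictionary):

```python
# 1 if married with a civilian or Armed Forces spouse, 0 otherwise.
married = {"Married-civ-spouse", "Married-AF-spouse"}
train["is_married"] = train["marital-status"].str.strip().isin(married).astype(int)
```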

Occupation values were investigated.

The four occupations with the largest correlation to wage >50k USD were selected to be used in future modeling, and they are listed below.

occupation_ Exec-managerial (correlation to wage >50k USD: 0.215)
occupation_ Prof-specialty (correlation to wage >50k USD: 0.186)
occupation_ Protective-serv (correlation to wage >50k USD: 0.0281)
occupation_ Tech-support (correlation to wage >50k USD: 0.0257)

In retrospect, some occupations with the lowest correlation to wage >50k USD could also have been used to improve the model.

The sex feature was made into a dummy variable.

Values of male were given a numerical value of 1 while values of female were given a numerical value of 0.

The native-country feature was made into a dummy variable called is_USA since being native to the USA had the highest correlation to wage >50k USD (correlation: 0.0345). In retrospect, it would have been interesting to include native-country values with the lowest correlation to wage >50k USD as well.
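Both encodings can be sketched the same way (column and value names assumed from the data dictionary):

```python
# Male -> 1, Female -> 0; native to the USA -> 1, otherwise 0.
train["sex_male"] = (train["sex"].str.strip() == "Male").astype(int)
train["is_USA"] = (train["native-country"].str.strip() == "United-States").astype(int)
```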

Feature combinations (two features multiplied with each other such as edu_scale and is_married) and their correlation with wage >50k USD were inspected.

The only feature combination that had a considerably high correlation with wage >50k USD was edu_scale * is_married (correlation approx. 0.530).

This feature combination had a correlation to wage >50k USD that was greater than the correlation to wage >50k USD of either of the features used to build this feature combination (edu_scale and is_married).

Therefore, the feature combination edu_scale * is_married was included in further analysis and modeling.
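A sketch of the interaction term and its correlation check, continuing from the snippets above:

```python
# Interaction of education scale and marital status.
train["edu_scale_x_is_married"] = train["edu_scale"] * train["is_married"]

# Should be roughly 0.53, higher than edu_scale or is_married alone.
print(train["edu_scale_x_is_married"].corr(target))
```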

Numerical features were also evaluated.

See the heatmap below (generated using Python, Pandas, and Seaborn).

The education_num feature was replaced with edu_scale, as described earlier.

The fnlwgt feature had a low magnitude of correlation to wage >50k USD, so it was not used often in further EDA or modeling.

The age, hours-per-week, capital-gain, and capital-loss features had high correlations to wage >50k USD, so they were included in further EDA and modeling.

Figure 10: The heatmap above shows the correlation of the dataset’s numerical features to wage >50k USD.

The heatmap was generated using Python with Pandas and Seaborn functions.
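A sketch of how such a heatmap can be produced with Seaborn (numerical column names assumed; target series from the earlier snippets):

```python
import matplotlib.pyplot as plt
import seaborn as sns

num_cols = ["age", "fnlwgt", "edu_scale", "capital-gain", "capital-loss", "hours-per-week"]
corr = train[num_cols].assign(wage_gt_50k=target).corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation of numerical features and wage > 50k USD")
plt.tight_layout()
plt.show()
```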

The distributions for numerical features were analyzed using seaborn’s pairplot function.

In general, the features were not normally distributed.
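A sketch of that distribution check, continuing from the heatmap snippet (a random sample keeps the pairplot fast on 32,561 rows):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise scatterplots and per-feature histograms for the numerical columns.
sns.pairplot(train[num_cols].sample(2000, random_state=42))
plt.show()
```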

The features of focus in further modeling and EDA were: sex, capital-gain, capital-loss, hours-per-week, is_USA, edu_scale * is_married, workclass_private, workclass_selfemp_notinc, age, occupation_ Exec-managerial, occupation_ Prof-specialty, occupation_ Tech-support, and occupation_ Protective-serv.

Modeling

After data cleaning and feature engineering were complete, the team began evaluating models, which included model-based EDA to better understand each feature’s effect on the data.

The team settled on the list of 13 features shown above (comfortably under the maximum of 20 features).

Several different models and ensemble methods were evaluated to determine what model performed best.

The following classification models were built and analyzed: logistic regression, k-nearest neighbors (KNN), decision tree, random forest, and support vector machine.

Along with these models, we tried boosting methods such as AdaBoost, Gradient Boosting, and XGBoost.

We chose accuracy as the measure for picking our model because the problem we were trying to solve did not favor reducing false negatives over false positives or vice versa.

The majority of models developed by the team produced similar results, with accuracy averaging around 0.84.

Upon initial evaluation, the decision tree performed best with an accuracy of 0.853. With this knowledge, we used a grid search to find the optimal hyperparameters for the decision tree model. Grid searching yielded a slight improvement, with a new accuracy score of 0.856. With the selected hyperparameters we tried an ensemble method and achieved the highest accuracy score yet, 0.862 (using AdaBoost). The final model we based our predictions on had an accuracy of 0.874 on the training data and 0.867 on the test data, showing only a very slight indication of overfitting (the final model was an optimized decision tree boosted with AdaBoost).
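The snippet below is a minimal sketch of that workflow with scikit-learn, not the team's exact code: it assumes a DataFrame model_df holding the 13 engineered features under illustrative names, plus the binary target series built earlier, and the hyperparameter grid is our own choice.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative feature names; model_df is assumed to hold the engineered columns.
features = [
    "sex_male", "capital-gain", "capital-loss", "hours-per-week", "is_USA",
    "edu_scale_x_is_married", "workclass_private", "workclass_self_emp_not_inc",
    "age", "occupation_exec_managerial", "occupation_prof_specialty",
    "occupation_tech_support", "occupation_protective_serv",
]
X_train, X_test, y_train, y_test = train_test_split(
    model_df[features], target, stratify=target, random_state=42
)

# Grid search over decision tree hyperparameters, scored on accuracy.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [4, 6, 8, 10], "min_samples_leaf": [1, 5, 10]},
    scoring="accuracy",
    cv=5,
)
grid.fit(X_train, y_train)

# Boost the tuned tree with AdaBoost (use base_estimator= on scikit-learn < 1.2).
ada = AdaBoostClassifier(estimator=grid.best_estimator_, n_estimators=100, random_state=42)
ada.fit(X_train, y_train)

print(accuracy_score(y_train, ada.predict(X_train)))  # train accuracy
print(accuracy_score(y_test, ada.predict(X_test)))    # test accuracy

# Predicted probabilities of wage > 50k USD, as the task required.
probs = ada.predict_proba(X_test)[:, 1]
```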

Conclusion

We have discussed building data science models in the context of balancing quality, cost, and time.

While the project was constrained in many ways (both time and the number of features were limited), we learned as a team to carefully evaluate data and to generate models with close train and test accuracy scores.

The final model we based our predictions on had a training accuracy score of 0.874 and a test accuracy score of 0.867.

We learned that it is indeed possible to balance quality, cost, and time, and we were able to deliver a good model as a team given strong teamwork and communication.

Sources:

Figure 1 (Image): Domingo, Rene T. “The QCD Approach to Management.” http://www.rtdonline.com/BMA/MM/qcd.htm

Data: extracted by Barry Becker from the 1994 U.S. Census Bureau database found at http://www.census.gov/ftp/pub/DES/www/welcome.html. The data and data dictionary can be found on Kaggle’s “Adult Census Income: Predict whether income exceeds $50K/yr based on census data” competition page: https://www.kaggle.com/uciml/adult-census-income.
