College acceptance predictions and major-salary analysis

These are the questions that linger in every college hunter’s mind.

In this post we explore the dynamics behind each question and try to design a model that predicts the acceptance rate of a specific school.

In addition, we analyze college majors and the salaries associated with them.

Data exploration

We have multiple data sets for our analysis.

They cover our acceptance prediction analysis and our college salary analysis.

Acceptance Prediction data

Our data set contains 1535 observations and many attributes (columns).

The data comes from many university application offices; each row contains the college name, graduation rates for different degrees, and many other important attributes.

A snapshot of the data is posted below.

The data has many attributes that are not useful for our analysis, so it needs to be cleaned before any serious operation.

We eliminated the attributes that are not useful and centered the data on the most relevant ones.

Our problem statement is to predict the acceptance rate.

We ran into a data challenge: only a couple of attributes are related to school acceptance.

Modeling Acceptance Rate

Our goal with this dataset is to predict the acceptance rate of a particular college, given features such as the degrees the school offers, the tuition, SAT/ACT scores, etc.

Before we attempt to model the acceptance rate, let’s look at a histogram of acceptance rates for our dataset.

As we can see, the data is roughly normally distributed about the mean of this dataset, which is approximately 64.


Cross Validation Set

We used cross validation with five splits and a test set size of 25% of the total data.

Cross validation lets us check whether our models are overfitting, that is, whether they perform well on training data but poorly on testing data.
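The post does not say which splitter was used. Five splits with a 25% test set matches repeated random hold-out (what scikit-learn calls ShuffleSplit), so the minimal sketch below assumes that scheme. The function name `shuffle_splits` and the seed are hypothetical; only the row count (1535), the number of splits, and the test size come from the text.

```python
import random


def shuffle_splits(n_samples, n_splits=5, test_size=0.25, seed=0):
    """Yield (train_indices, test_indices) pairs from repeated random shuffles."""
    rng = random.Random(seed)
    n_test = int(round(n_samples * test_size))
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # The first n_test shuffled rows are held out; the rest are used for training.
        yield sorted(indices[n_test:]), sorted(indices[:n_test])


splits = list(shuffle_splits(n_samples=1535))
for number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"split {number}: {len(train_idx)} train rows, {len(test_idx)} test rows")
```

Each split draws a fresh random test set, so a model is scored five times on rows it never saw during training. A large gap between training and test error across those splits is the overfitting signal described above.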

Models

The models we have chosen to use are as follows:

Decision Tree Regressor
Random Forest Regressor
Gradient Boosting Regressor
Dummy Regressor

Performance Metrics

To measure the performance of these models, we have used mean squared error and mean absolute error.
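To make the setup concrete, here is a rough sketch of how the four models and two metrics listed above could be compared with scikit-learn. It runs on synthetic stand-in data, not the real admissions data, and the hyperparameters, random seeds, and the choice of `ShuffleSplit` are assumptions rather than the settings used in this post.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 300 rows, 4 numeric features, target centred near 64.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 64 + 10 * X[:, 0] + rng.normal(scale=5, size=300)

models = {
    "Dummy (median)": DummyRegressor(strategy="median"),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# Five random splits, each holding out 25% of the rows for testing.
cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
scoring = ("neg_mean_squared_error", "neg_mean_absolute_error")

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # scikit-learn reports errors negated, so flip the sign back.
    mse = -scores["test_neg_mean_squared_error"]
    mae = -scores["test_neg_mean_absolute_error"]
    results[name] = {
        "mse_mean": mse.mean(), "mse_std": mse.std(),
        "mae_mean": mae.mean(), "mae_std": mae.std(),
    }
    print(f"{name}: MSE {mse.mean():.1f} (STD {mse.std():.1f}), "
          f"MAE {mae.mean():.1f} (STD {mae.std():.1f})")
```

The mean and STD printed for each metric are taken over the five test splits, which is the same shape as the results reported in this post.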

Model Evaluation

Dummy Regressor

The Dummy Regressor is a model provided by scikit-learn to use as a reference point to compare your other models.

We used a Dummy Regressor that always predicted the median of the dataset regardless of the features.

The results of the Dummy Regressor are shown below.

Mean Squared Error
Mean: 413.2
STD: 32.1

Mean Absolute Error
Mean: 15.7
STD: 0.8

We can see that these two performance metrics tell different stories.

Mean squared error squares each error, so it penalizes large errors (outliers) more harshly, leading to a higher value.

Mean absolute error weights every error in proportion to its size, so outliers receive no extra penalty.

The benefit of using mean absolute error is that its units are the same as those of the quantity we are trying to predict.

A Dummy Regressor that always predicts the median will, on average, be off by 15.7.
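The difference between the two metrics is easy to see on a small example. The sketch below uses made-up acceptance rates (not values from our dataset) and a hand-written median baseline. It then adds one extreme school to the test set to show how each metric reacts.

```python
from statistics import median


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up acceptance rates for illustration only.
train_rates = [55, 60, 65, 70, 75]
test_rates = [50, 62, 68, 80]

# The baseline ignores all features and always predicts the training median.
baseline = median(train_rates)  # 65
mse = mean_squared_error(test_rates, [baseline] * len(test_rates))
mae = mean_absolute_error(test_rates, [baseline] * len(test_rates))
print(mse, mae)  # 117.0 9.0

# Same test set plus one extreme school with a 5% acceptance rate.
with_outlier = test_rates + [5]
mse_out = mean_squared_error(with_outlier, [baseline] * len(with_outlier))
mae_out = mean_absolute_error(with_outlier, [baseline] * len(with_outlier))
print(mse_out, mae_out)  # 813.6 19.2
```

One outlier multiplies the mean squared error by roughly seven, while the mean absolute error only about doubles. The mean absolute error also stays in the units of the acceptance rate itself, which is why it is easier to interpret.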


Now that we know how a Dummy Regressor performs on our data, let’s look at how the other models do.

Decision Tree Regressor

Mean Squared Error
Mean: 306.5
STD: 28.2

Mean Absolute Error
Mean: 13.4
STD: 0.5

Random Forest Regressor

Mean Squared Error
Mean: 183.4
STD: 7.5

Mean Absolute Error
Mean: 10.4
STD: 0.3

Gradient Boosting Regressor

Mean Squared Error
Mean: 198.…
STD: …

Mean Absolute Error
Mean: 10.8
STD: 0.4

Of the models we tried on this dataset, the Random Forest Regressor performed the best.

College Data

For this analysis, we have three different data sets to work with.

The first concerns the degrees that pay back, the second the salaries per school, and the third the distribution of salaries per region.

Each dataset tells a particular story that we will try to dissect in detail.

To make processing easier, we abbreviated the attribute descriptions in all three datasets.

The datasets are then smaller and easier to read.
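As a rough sketch of this abbreviation step, the snippet below renames long column headers to short keys and derives the percent change from starting to mid-career salary. The long header strings, the `Major` column, the sample row, and the helper names are assumptions for illustration. Only the short names sms, mcms, and delta_sms_mcms come from this post.

```python
# Long header -> short key. The long header strings are assumed for illustration.
ABBREVIATIONS = {
    "Starting Median Salary": "sms",
    "Mid-Career Median Salary": "mcms",
}


def abbreviate(row):
    """Return a copy of row with long headers swapped for their short keys."""
    return {ABBREVIATIONS.get(key, key): value for key, value in row.items()}


def add_delta(row):
    """Add delta_sms_mcms: percent change from starting to mid-career salary."""
    out = dict(row)
    out["delta_sms_mcms"] = (row["mcms"] - row["sms"]) * 100 / row["sms"]
    return out


# A made-up row, not taken from the real data.
raw = {
    "Major": "Example Major",
    "Starting Median Salary": 50000,
    "Mid-Career Median Salary": 80000,
}
row = add_delta(abbreviate(raw))
print(row)
```

A high delta_sms_mcms marks a major whose salary grows strongly after the starting years, while a low value marks one that starts high but grows slowly.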

Among the attributes we have are sms (Starting Median Salary), mcms (Mid-Career Median Salary), delta_sms_mcms (Percent change from Starting to Mid-Career Salary), and mc10ps, mc25…, which correspond to the 10th Percentile Salary, 25th Percentile Salary …

For the degrees that pay back, we started by exploring the data and its structure.

The data has eight attributes and is structured as shown below.

For median starting salary, physician assistant majors lead by a wide margin.

Engineering and other STEM majors come next.

The majors with the lowest median starting salaries include Religion, Education, and Language.

The graph below illustrates this dynamic.

Starting median salary per major

When it comes to mid-career median salary, engineering majors lead while physician assistant salaries stagnate.

Mid-career median salary per major

So which majors pay off right away, and which are good in the long term?

Based on our analysis, Economics does not pay off until mid-career.

After mid-career, however, it is one of the top majors.

Finance has some similarities with Economics.

Finance majors see a boost in salary a few years into their careers.

Nursing and Physician Assistant majors have very high starting salaries.

However, their salary increase is below average in the longer term.

Engineering majors maintain a top salary and see substantial increases throughout their careers.

When it comes to colleges that pay well, we noticed that top engineering schools have the highest starting median salaries, followed by many well-known Ivy League schools.

A few state universities from New York, California, Pennsylvania, and Texas made it among the well-paying schools.

Overall, Ivy League graduates maintain top salaries throughout their careers.

Engineering schools come in second in terms of maintaining their salary levels throughout.

Liberal arts schools are the lowest-performing schools.

Graduates of some well-founded state universities enjoy a decent income throughout their careers.

The whole dynamic is shown in the graph below.

Success per region

Lastly, we check whether the region of study affects students' career success.

The Northeastern region appears to produce the most successful students in the country.

The California region comes next, and the last three regions all compete at the same level.

In conclusion, the acceptance predictions came out with an accuracy a little lower than expected, but the overall analysis conveys a lot and helps in making a better choice about where to go to college based on the salary we aim for.

For the whole analysis, you can visit our GitHub repository.
