Predicting “Bikeability” in U.S. Cities

After gathering and cleaning data from the aforementioned sources, I needed to join the disparate information.

One main source of headache was the different naming conventions for the same city (Washington, D.

C.

and Washington DC, for example).

I ran the following function on all datasets to make sure cities with multiple naming conventions (or misspellings) were harmonized across datasets.

def fix_cities(df): df.

loc[df['city'] == 'Nashville', 'city'] = 'Nashville-Davidson' df.

loc[df['city'] == 'Louisville', 'city'] = 'Louisville-Jefferson' df.

loc[df['city'] == 'Lexington', 'city'] = 'Lexington-Fayette' df.

loc[df['city'] == 'OklahomaCity', 'city'] = 'Oklahoma City' df.

loc[df['city'] == 'Salt LakeCity', 'city'] = 'Salt Lake City' df.

loc[df['city'] == 'SanFrancisco', 'city'] = 'San Francisco' df.

loc[df['city'] == 'VirginiaBeach', 'city'] = 'Virginia Beach' df.

loc[df['city'] == 'Washington,D.

C.

', 'city'] = 'Washington, D.

C.

' df.

loc[df['city'] == 'Washington Dc', 'city'] = 'Washington, D.

C.

' df.

loc[df['city'] == 'Washington', 'city'] = 'Washington, D.

C.

' df.

loc[df['city'] == 'New York', 'city'] = 'New York City'Next, since there are some cities with the same name across different states, I needed to join dataframes on a combination of city and state.

However, some datasets included the full state name while others included the two-letter abbreviation only.

I created a dictionary to map state names to state abbreviations.

states_abbrev = {'Alaska': 'AK', 'Alabama': 'AL', 'Arkansas': 'AR', 'Arizona': 'AZ','California': 'CA', 'Colorado': 'CO', 'Connecticut': 'CT','District of Columbia': 'DC','Delaware': 'DE','Florida': 'FL','Georgia': 'GA','Hawaii': 'HI','Iowa': 'IA', 'Idaho': 'ID', 'Illinois': 'IL', 'Indiana': 'IN','Kansas': 'KS', 'Kentucky': 'KY', 'Louisiana': 'LA', 'Massachusetts': 'MA','Maryland': 'MD', 'Maine': 'ME', 'Michigan': 'MI', 'Minnesota': 'MN', 'Missouri': 'MO', 'Mississippi': 'MS', 'Montana': 'MT', 'North Carolina': 'NC','North Dakota': 'ND', 'Nebraska': 'NE', 'New Hampshire': 'NH', 'New Jersey': 'NJ','New Mexico': 'NM', 'Nevada': 'NV', 'New York': 'NY', 'Ohio': 'OH','Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA', 'Puerto Rico': 'PR','Rhode Island': 'RI', 'South Carolina': 'SC', 'South Dakota': 'SD', 'Tennessee': 'TN','Texas': 'TX', 'Utah': 'UT', 'Virginia': 'VA', 'Vermont': 'VT','Washington': 'WA', 'Wisconsin': 'WI', 'West Virginia': 'WV', 'Wyoming': 'WY'}I had previously “pickled” each individual dataframe, so my next step was to read in each pickle, make sure city names and states were harmonized, and merge dataframes by city and state.

This gave me one master dataset to pickle for later when I was ready to start my exploratory data analysis (EDA) and feature engineering.

%pylab inline%config InlineBackend.

figure_formats = ['retina']import pandas as pdimport seaborn as snsdf = pd.

read_pickle('city_traffic.

pkl')df['city'] = df['city'].

str.

lower()df['state'] = df['state'].

str.

strip()df['city'] = df['city'].

str.

strip()fix_cities(df)files = ['city_busdensity.

pkl','city_percip.

pkl','city_poulations.

pkl', 'city_taxes.

pkl','city_temp.

pkl','city_walkscore.

pkl']for file in files: df_pkl = pd.

read_pickle(file) if 'city' in df_pkl.

columns: df_pkl['city'] = df_pkl['city'].

str.

strip() df_pkl['city'] = df_pkl['city'].

str.

lower() if 'state_long' in df_pkl.

columns: df_pkl['state'] = df_pkl['state_long'].

map(states_abbrev) #Get state abreviation for these df_pkl['state'] = df_pkl['state'].

str.

strip() fix_cities(df) if 'state' in df_pkl.

columns: df = pd.

merge(df, df_pkl, on = ['city','state'], how = 'outer') print(file,'success') else: print(file, 'no state column')df.

to_pickle('city_data.

pkl')Part II: EDA and feature engineeringI loaded back in my pickled dataframe and dropped everything except for my target value (bike_score) and features of interest.

df = pd.

read_pickle('city_data.

pkl')df = df[['bike_score','walk_score','transit_score', 'congestion', 'bus_density','pop_density', 'population', 'gdp_per_cap', 'state_tax', 'city_tax', 'total_tax', 'avg_percip', 'avg_temp']]Next I created pairplots to view distributions of my variables, as well as bivariate relationships within my dataframe.

sns.

pairplot(df, height=1.

2, aspect=1.

5);While most features appeared to be normally distributed, population, population density (‘pop_density’), and business density (‘bus_density’) were notable exceptions, and I wondered if they might benefit from log transformations.

log_vars = ['population','pop_density','bus_density']for v in log_vars: df['log_'+v] = log(df[v])I plotted histograms of these three features before and after their log transformations.

The code and plots are shown below.

f, axes = plt.

subplots(3, 2)f.

subplots_adjust(hspace = 0.

5)sns.

distplot(df.

population, ax=axes[0][0], kde=False, bins='auto')sns.

distplot(df.

log_population, ax=axes[0][1], kde=False, bins='auto')sns.

distplot(df.

pop_density, ax=axes[1][0], kde=False, bins='auto')sns.

distplot(df.

log_pop_density, ax=axes[1][1], kde=False, bins='auto')sns.

distplot(df.

bus_density, ax=axes[2][0], kde=False, bins='auto')sns.

distplot(df.

log_bus_density, ax=axes[2][1], kde=False, bins='auto')for i in range(0,3): for j in range(0,2): axes[i][j].

set_xticks([]) axes[i][j].

set_yticks([])Histograms of select features, before and after log transformationAs can be seen from the histograms, the distribution of population appeared more normally distributed after the log transformation.

Population and business density, on the other hand, still did not appear to be normally distributed.

Rather, these features appeared to have large outliers (for example, New York City) that caused them to skew right.

Nonetheless, the relationship between the transformed density features and the target value appeared more linear after the log transformation, so I decided to leave them in to see if they were in fact significant.

Bike score as a function of urban population density and log urban population densityBike score as a function of urban business density and log urban business densityNext, I explored correlations among my features to account for any collinearity.

Heatmap of feature correlationscorrs = {}cols = list(df.

iloc[:,1:].

columns)for x in cols: cols = [w for w in cols if w != x] for w in cols: corrs[(x, w)] = df[x].

corr(df[w])results = [(x, v.

round(2)) for x, v in corrs.

items() if corrs[x] > 0.

5]results = sorted(results, key=lambda x: x[1], reverse=True)resultsCorrelations for pairs of featuresThe first three correlations were expected (the log of a variable should be highly correlated with the variable itself).

Total tax was simply the sum of city and state taxes, so I chose total_taxes as the tax feature to include in the model.

The bottom four pairs of features with correlations between 0.

52 and 0.

56 were of concern so I created interaction terms for these.

interact = [x[0] for x in results[4:]]interacts = []for i in range(len(interact)): col1 = interact[i][0] col2 = interact[i][1] col_interact = col1+'_x_'+col2 interacts.

append(col_interact) df[col_interact] = df[col1]*df[col2]Next, I checked variance inflation factors of various combinations of features to ensure that any collinearity among my final features was low.

Luckily enough, all of my VIF’s were below the “magic number” of 3.

from statsmodels.

stats.

outliers_influence import variance_inflation_factordf['constant'] = 1X = df[['congestion', 'population', 'gdp_per_cap', 'walk_score_x_transit_score','total_tax', 'avg_percip', 'avg_temp', 'log_population','bus_density_x_pop_density', 'constant']]vif = pd.

Series([variance_inflation_factor(X.

values, i) for i in range(X.

shape[1])],index=X.

columns)vif.

sort_values(ascending = False)Variance inflation factors of model featuresWith my data primed for analysis, I pickled this version of the dataframe before continuing to hpyerparameter tuning and model selection.

df.

to_pickle('regression_data.

pkl')Part III: Hyperparameter tuning and model selectionFinally, I was ready to create my model and see which features determine “bikeability” in a city!from sklearn.

model_selection import train_test_splitfrom sklearn.

linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCVfrom sklearn.

metrics import r2_score from sklearn.

preprocessing import StandardScaler, PolynomialFeaturesfrom sklearn.

pipeline import Pipelinefrom sklearn.

metrics import mean_absolute_error as maefrom sklearn.

model_selection import cross_val_score, cross_val_predictfrom sklearn.

model_selection import GridSearchCVdf = pd.

read_pickle('regression_data.

pkl')First, I defined ‘y’ as my target variable (bike_score) and ‘X’ as a matrix of urban features.

y = df['bike_score']X = df[['constant','congestion', 'transit_score', 'gdp_per_cap','total_tax', 'avg_percip', 'avg_temp', 'log_population','bus_density_x_pop_density']]Next, I split my data into training and testing sets.

I trained my model on 80% of the data and reserved 20% for testing.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.

2,random_state = 88)a.

Linear regressionI first fit a “vanilla” linear regression model on 5 cross-validation folds of data (meaning that one-fifth of my data was always being held out for scoring).

I got an average mean absolute error (MAE) of close to 8 across all validation folds.

lr=LinearRegression()scores = cross_val_score(lr, X_train, y_train, cv = 5, scoring='neg_mean_absolute_error')print (np.

mean(scores)*(-1))b.

Ridge regularizationKnowing that there was a lot of multicollinearity among my features, I next fit a ridge regression, which adds a penalty for multicollinearity, on my (standardized) features.

std_scale = StandardScaler()X_train_scaled = std_scale.

fit_transform(X_train)X_test_scaled = std_scale.

transform(X_test)ridge = Ridge(random_state=88)I used GridSearchCV from scikit learn to find the alpha parameter that minimizes the average mean absolute error across five cross-validation folds.

grid_r1 = {'alpha': np.

array([100,75,50,25,10,1,0.

1,0.

01,0.

001,0.

0001,0])}ridge_grid = GridSearchCV(ridge, grid_r1, cv=5, scoring='neg_mean_absolute_error')ridge_grid.

fit(X_train_scaled, y_train)print("tuned hpyerparameters :(best parameters)", ridge_grid.

best_params_)print("MAE score :",-1*(ridge_grid.

best_score_))The best alpha was 100, which gave an average MAE of 7.

5 across the five folds of the training set — a slight improvement from the “vanilla” linear regression.

c.

Lasso regularizationLasso regularization penalizes for multicollinearity and keeps only significant features with non-zero coefficient.

Given this, I decided to throw the “kitchen sink” of features into this model.

Once again, my first step was to use GridsearchCV to determine the optimal alpha parameter.

An alpha of 1.

0 returned an MAE of 6.

6, already a significant improvement from the ridge model.

y = df['bike_score']XL = df.

drop(['bike_score','constant'], axis=1)X_train, X_test, y_train, y_test = train_test_split(XL, y, test_size=0.

2, random_state = 88)std_scale = StandardScaler()X_train_scaled = std_scale.

fit_transform(X_train)X_test_scaled = std_scale.

transform(X_test)grid_l1 = {'alpha': np.

array([100,75,50,25,10,1,0.

1,0.

01,0.

001,0.

0001,0])}lasso = Lasso(random_state=88)lasso_grid = GridSearchCV(lasso, grid_l1, cv=5, scoring='neg_mean_absolute_error')lasso_grid.

fit(X_train_scaled, y_train)print("tuned hpyerparameters :(best parameters) ",lasso_grid.

best_params_)print("MAE score :",-1*(lasso_grid.

best_score_))As I mentioned above, lasso regularization tends to take several feature coefficients to zero and keep only a select set of significant features.

Indeed, only walk score, walk score/transit score interaction, congestion, and average precipitation returned non-zero coefficients.

I decided to keep all but walk score in the final model (to account for multicollinearity between walk_score and walk_score*transit_score).

coefficients = sorted(list(zip(X_test.

columns, lm.

coef_.

round(2))), key=lambda x: x[1], reverse=True)coefficientsFeature coefficients with lasso regularizationBelow is the code for my final model, which fits a Lasso Regression with alpha 1.

0 on the three significant features in the training data (walks score/transit score interaction, congestion, and average precipitation).

Finally, I used the model to predict bike scores in the test set (remember that 20% of the data that I put aside?).

The MAE on the test data (the mean absolute error between model predictions and actual target values) was 6.

3, meaning that my model was able to predict city bike scores within 6.

3 points of the actual score on average.

y = df['bike_score']XL2 = df[['walk_score_x_transit_score','congestion','avg_percip']]X_train, X_test, y_train, y_test = train_test_split(XL2, y, test_size=0.

2, random_state = 88)std_scale = StandardScaler()X_train_scaled = std_scale.

fit_transform(X_train)X_test_scaled = std_scale.

transform(X_test)lasso = Lasso(random_state=88, alpha=1.

0, fit_intercept=True)lm2 = lasso.

fit(X_train_scaled, y_train)y_pred = lm2.

predict(X_test_scaled)mae(y_pred, y_test)ResultsBased on the model coefficients, more “bikeable” cities are those with a precedent for alternative modes of transit (public transit and walking), more congestion (greater incentive to find an alternative mode to private vehicles), and less precipitation (rain/snow).

According to the lasso model, features such as local tax rates, business density, and population density had no significance on a city’s “bikeability”.

coefficients = sorted(list(zip(X_test.

columns, lm2.

coef_.

round(2))), key=lambda x: x[1], reverse=True)coefficientsFeature coefficients, final regression model with lasso regularizationThe low R-squared and adjusted R-squared of my model suggest that much of the variance of a city’s bikeability is not explained by my selected features.

A next step will be to gather additional features that might explain bikeability, such as the quality of bike-related infrastructure and overall hilliness.

def r2_adj(X,Y,r_squared): adjusted_r_squared = 1 – (1-r_squared)*(len(y)-1)/(len(y)- X.

shape[1]-1) return(adjusted_r_squared)r2 = lm2.

score(X_test_scaled, y_test)r2a = r2_adj(X_test_scaled, y_test, r2)print('R-squared', r2.

round(2), 'Adjusted r-squared', r2a.

round(2))Once I am able to refine the model and further reduce the MAE through additional feature selection and feature engineering, future work will be to incorporate international cities for which walkscore.

com does not currently provide bike scores.

With urbanization increasing worldwide and the threat of climate change looming over us, it is imperative that cities find ways to reduce private vehicle congestion and emissions.

Bike and scooter share programs can contribute to this endeavor and it is my hope that this project will shed some light on those factors that determine viable pre-conditions for implementation of such programs.

I am always striving to improve my work, so suggestions are always welcome!.Please feel free to provide feedback in the comments.

.

. More details

Leave a Reply