Solving A Data Science Challenge – The Visual Way

OK, you got it.

The Graph

The Code

# Used a def so that if you wish to add interactivity you can do that easily later on.
def plot(min_hour, max_hour, n):
    # boundaries of the main rectangle
    upper_right = [51.1741, -113.8925]
    lower_left = [50.8672, -114.2715]

    # Creating a grid of n x n from the given coordinate corners
    grid = get_geojson_grid(upper_right, lower_left, n)

    # Holds number of points that fall in each cell & time window if provided
    counts_array = []

    # Adding the total number of visits to each cell
    for box in grid:
        # get the corners for each cell
        upper_right = box["properties"]["upper_right"]
        lower_left = box["properties"]["lower_left"]

        # check to make sure it's in the box and within the time window if one is given
        mask = ((sliceDF.Lat <= upper_right[1]) & (sliceDF.Lat >= lower_left[1]) &
                (sliceDF.Lng <= upper_right[0]) & (sliceDF.Lng >= lower_left[0]) &
                (sliceDF.Hour >= min_hour) & (sliceDF.Hour <= max_hour))

        # Number of points that fall in the cell and meet the condition
        counts_array.append(len(sliceDF[mask]))

    # creating a base map
    m = folium.Map(zoom_start=10, location=[latitude, longitude])

    # Add GeoJson to map
    for i, geo_json in enumerate(grid):
        relativeCount = counts_array[i] * 100 / 4345
        color = plt.cm.YlGn(relativeCount)
        color = mpl.colors.to_hex(color)
        gj = folium.GeoJson(geo_json,
                            style_function=lambda feature, color=color: {
                                'fillColor': color,
                                'color': "gray",
                                'weight': 0.5,
                                'dashArray': '6,6',
                                'fillOpacity': 0.8,
                            })
        m.add_child(gj)

    colormap = branca.colormap.linear.YlGn_09.scale(0, 1)
    colormap = colormap.to_step(index=[0, 0.3, 0.6, 0.8, 1])
    colormap.caption = 'Relative density of fleet activity per cell'
    colormap.add_to(m)

    return m

# limiting the time window for our data to 8 am – 5 pm, with a 20 x 20 grid
plot(8, 17, 20)

The second part of this post aims to show you how to use the Foursquare API to get some geospatial information about different neighborhoods, group the neighborhoods into clusters, and eventually combine the results to reach our conclusion.
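The helper get_geojson_grid is not defined in this post. If you want something to start from, below is a minimal sketch of such a helper written to match how it is used above (corners passed in as [lat, lng], each cell's corners stored under "properties" as [lng, lat]); treat it as an illustration rather than the exact function used in the original analysis.

import numpy as np

def get_geojson_grid(upper_right, lower_left, n=6):
    # upper_right / lower_left are [lat, lng]; returns n x n GeoJSON rectangles,
    # each keeping its own corners in 'properties' so callers can build masks
    all_boxes = []
    lat_steps = np.linspace(lower_left[0], upper_right[0], n + 1)
    lng_steps = np.linspace(lower_left[1], upper_right[1], n + 1)
    for lat_start, lat_end in zip(lat_steps[:-1], lat_steps[1:]):
        for lng_start, lng_end in zip(lng_steps[:-1], lng_steps[1:]):
            # GeoJSON polygons use [lng, lat] pairs and must close on themselves
            coordinates = [[lng_start, lat_start],
                           [lng_start, lat_end],
                           [lng_end, lat_end],
                           [lng_end, lat_start],
                           [lng_start, lat_start]]
            all_boxes.append({
                "type": "Feature",
                "properties": {"lower_left": [lng_start, lat_start],
                               "upper_right": [lng_end, lat_end]},
                "geometry": {"type": "Polygon", "coordinates": [coordinates]},
            })
    return all_boxes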

I have worked quite a bit with Google's APIs and was later introduced to Foursquare when I started digging into data science, and it turned out to be awesome.

So for those of you who are not familiar with Foursquare, I highly recommend checking it out.

It’s worth it.

The community data was shown above.

For now we ignore the labels the author has used and assume we don’t have them.

Our aim is to cluster those neighborhoods ourselves and find a suitable area for our retail shop (warehouse).

For this, we use Foursquare's explore API, but feel free to check the list of all their APIs; they might come in handy in your projects.

By now, you should know how to plot pretty maps, so let's make one from the community data using the original labels just to see what's going on.
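A minimal sketch of such a map follows; the data frame name communityDF and its latitude, longitude, and CLASS_CODE columns are assumptions standing in for the community data frame, while the map-center latitude and longitude are the same variables used earlier.

import folium

# color each community by its original (author-provided) label
label_colors = dict(zip(communityDF['CLASS_CODE'].unique(),
                        ['red', 'blue', 'green', 'purple', 'orange']))

community_map = folium.Map(location=[latitude, longitude], zoom_start=10)
for lat, lng, label in zip(communityDF['latitude'],
                           communityDF['longitude'],
                           communityDF['CLASS_CODE']):
    folium.CircleMarker([lat, lng],
                        radius=4,
                        color=label_colors.get(label, 'gray'),
                        fill=True,
                        popup=str(label)).add_to(community_map)
community_map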

Next we attempt to get the n most common venues for each neighborhood and feed that into our k-means clustering code to group the neighborhoods into clusters.

To get a list of common venues for a neighborhood using Foursquare's explore API, you would do something like the snippet below.

# Using Foursquare's explore API get 10 most common venues around
# the latitude, longitude provided within 500 m radius.
# You'll get the CLIENT_ID, CLIENT_SECRET and VERSION after signing up for Foursquare.
# (Pay attention to API call limits.)
url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius=500&limit=10".format(
    CLIENT_ID, CLIENT_SECRET, VERSION, neighborhood_lat, neighborhood_lng)

# results come back in JSON format
results = requests.get(url).json()
results

We can extend this to all the neighborhoods.
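One way to extend it, sketched below, is to loop over the neighborhood centroids and flatten the relevant part of each JSON response into a venues data frame shaped like the one used later (Neighborhood and Venue Category columns). The helper name get_nearby_venues and the input lists are assumptions; CLIENT_ID, CLIENT_SECRET, and VERSION come from the snippet above, and the response is assumed to follow the standard structure of the explore endpoint.

import pandas as pd
import requests

def get_nearby_venues(names, latitudes, longitudes, radius=500, limit=10):
    # call the explore endpoint for each neighborhood and collect its venues
    rows = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        url = ("https://api.foursquare.com/v2/venues/explore"
               "?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}"
               ).format(CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, limit)
        items = requests.get(url).json()['response']['groups'][0]['items']
        for item in items:
            venue = item['venue']
            rows.append((name, lat, lng,
                         venue['name'],
                         venue['location']['lat'],
                         venue['location']['lng'],
                         venue['categories'][0]['name']))
    return pd.DataFrame(rows, columns=['Neighborhood', 'Neighborhood Latitude',
                                       'Neighborhood Longitude', 'Venue',
                                       'Venue Latitude', 'Venue Longitude',
                                       'Venue Category'])

# e.g. calgary_venues = get_nearby_venues(communities['NAME'],
#                                         communities['latitude'],
#                                         communities['longitude'])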

The data frame below shows a few rows of the results.

Example of data frame including venues with their latitude and longitude

One hot encoding

Why? Because we have strings as labels for each neighborhood and need a way of digitizing them so that we can use them in our clustering algorithm.

"One hot encoding" basically parses your labels, creates a new column for each label, and uses 1 or 0 to mark whether that row of the table has that feature or not.

So, for instance, Spruce Cliff has a café but may not have a gym, and so on.

The snippet below shows how to "one hot encode" your results:

# one hot encoding
calgary_onehot = pd.get_dummies(calgary_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
calgary_onehot['Neighbourhood'] = calgary_venues['Neighborhood']

# move neighborhood column to the first column
fixed_columns = [calgary_onehot.columns[-1]] + list(calgary_onehot.columns[:-1])
calgary_onehot = calgary_onehot[fixed_columns]

print("calgary_onehot shape is ", calgary_onehot.shape)
calgary_onehot.head()

The resulting table is something like this:

One hot encoded dataframe

To gain a better insight into the nature of each neighborhood, we can group these results and find the most common venues per neighborhood.

We can then attempt to label each neighborhood: for instance, a neighborhood with more coffee shops and grocery stores is most likely a residential area, while a neighborhood with more construction zones or factories might be an industrial one.

We will create a pandas dataframe from the results and include the 10 most common venues for each neighborhood.

num_top_venues = 10
indicators = ['st', 'nd', 'rd']

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = calgary_grouped['Neighbourhood']
neighborhoods_venues_sorted.rename(columns={'Neighborhood': "NAME"}, inplace=True)

for ind in np.arange(calgary_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(calgary_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

and the result is something like this:

Common venues per neighborhood (cropped to fit better here, but the code above finds the 10 most common venues)

Clustering Neighborhoods

Now we are at a point where we can cluster our neighborhoods based on the one hot encoded data frame we have.

In this case I used k-means clustering from the scikit-learn package, and to be able to compare the results later on with the original cluster labels in our community data, I chose n=4 as the number of clusters.

# set number of clusters
kclusters = 4

calgary_grouped_clustering = calgary_grouped.drop('Neighbourhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(calgary_grouped_clustering)

# check cluster labels generated for each row in the dataframe
neighborhoods_venues_sorted['labels'] = kmeans.labels_
neighborhoods_venues_sorted.head()

The label column shows the clusters.

Let's merge our results with the original dataframe, which includes the geolocations, and make a pretty plot with some selectors so we can filter through the clusters and see what's going on.
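A minimal sketch of that merge step; the communities data frame name is an assumption standing in for the community data frame with the geolocations, while neighborhoods_venues_sorted already carries the NAME column and the new labels column from above.

# join the cluster labels onto the community geolocations by neighborhood name
calgary_merged = communities.merge(
    neighborhoods_venues_sorted[['NAME', 'labels']],
    on='NAME',
    how='inner')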

The graph: Clustering results plotted with a filtering control panel

The code to achieve this:

calgary_merged['labels'] = calgary_merged['labels'].astype(int)

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for cluster in range(0, kclusters):
    group = folium.FeatureGroup(name='<span style="color: {0};">{1}</span>'.format(rainbow[cluster-1], cluster))
    for lat, lon, poi, label in zip(calgary_merged['latitude'], calgary_merged['longitude'],
                                    calgary_merged['CLASS_CODE'], calgary_merged['labels']):
        if int(label) == cluster:
            label = folium.Popup('ORIG. ' + str(poi) + ' Cluster ' + str(cluster), parse_html=True)
            folium.CircleMarker(
                (lat, lon),
                radius=5,
                popup=label,
                color=rainbow[cluster-1],
                fill=True,
                fill_color=rainbow[cluster-1],
                fill_opacity=0.7).add_to(group)
    group.add_to(map_clusters)

folium.map.LayerControl('topright', collapsed=False).add_to(map_clusters)
map_clusters.save(outfile="map_clusters.html")
map_clusters

Comparing clusters with the original labels

OK, all that is fine and dandy, but so what? The whole point of this study was to compare the clusters with the labels and try to identify a suitable location close to the center of minimum distance for a retail store, or rather a warehouse.

So let's group by our cluster labels and the original labels and have a look at the confusion matrix.

Keep in mind the original labels are not necessarily the true labels; they are merely subjective labels made by the author of the data set.

So this should give us some idea of how similar the distributions of our labels are to the original ones, and nothing more.
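A quick sketch of that comparison, building a contingency table of our k-means labels against the original CLASS_CODE labels with pandas (the calgary_merged columns are the same ones used in the plotting code above):

import pandas as pd

confusion = pd.crosstab(calgary_merged['labels'],
                        calgary_merged['CLASS_CODE'],
                        rownames=['cluster'],
                        colnames=['original label'])
print(confusion)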

The final result of our finding is shown below.

The red circle encloses the center of minimum distance (Median Center) as well as two of the neighborhoods which were determined to be most likely under development or industrial (the most common venues were construction sites and big retail stores).

Most ideal area for a mid-size retail store.

Conclusion

GPS data from the fleet and the city community data were used to support this finding and form a basis for this conclusion.

Keep in mind I was trying to show you how you can quickly, and without digging too much into the details, find an approximate solution for a problem of this nature.

Finding the median center using the actual routing data is much more complex, and perhaps a next step towards modeling and finding a more precise answer would be to use relative weights for each point.
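To illustrate that idea, here is a small sketch of a weighted median center (center of minimum distance) approximated with Weiszfeld's iterative algorithm; the weights argument is where relative weights per point would go. This is an illustration of the concept, not code from the original analysis.

import numpy as np

def weighted_median_center(lats, lngs, weights, iters=100, eps=1e-9):
    # approximate the weighted center of minimum distance (geometric median)
    pts = np.column_stack([lats, lngs]).astype(float)
    w = np.asarray(weights, dtype=float)
    center = np.average(pts, axis=0, weights=w)   # start from the weighted mean
    for _ in range(iters):
        d = np.linalg.norm(pts - center, axis=1)
        d = np.where(d < eps, eps, d)             # guard against division by zero
        new_center = np.average(pts, axis=0, weights=w / d)
        if np.linalg.norm(new_center - center) < eps:
            break
        center = new_center
    return center  # [lat, lng]

# e.g. equal weights as a starting point:
# weighted_median_center(sliceDF['Lat'], sliceDF['Lng'], np.ones(len(sliceDF)))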

References

Geospatiality (Open Source Geospatial Python), "What is it?": "Also known as the Center of Minimum Distance, the Median Center is a…" (glenbambrick.com)

Folium – Folium 0.8.3 documentation: "builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the library. Manipulate…" (python-visualization.github.io)

Data 101s: Spatial Visualizations and Analysis in Python with Folium: "This is the first post of a series I am calling Data 101s, a series focusing on breaking down the essentials of how to…" (towardsdatascience.com)
