DBSCAN Algorithm: Complete Guide and Application with Python Scikit-Learn

Below are the 2 steps —Density at a point P: Number of points within a circle of Radius Eps (ϵ) from point P.

Dense Region: For each point in the cluster, the circle with radius ϵ contains at least minimum number of points (MinPts).

The Epsilon neighborhood of a point P in the database D is defined as (following the definition from Ester et.

al.

)N (p) = {q ∈ D | dist(p, q) ≤ ϵ}….

(1)Following the definition of dense region , a point can be classified as a Core Point if |N (p)|≥ MinPts.

The Core Points, as the name suggests, lie usually within the interior of a cluster.

A Border Point has fewer than MinPts within its ϵ-neighborhood (N) , but it lies in the neighborhood of another core point.

Noise is any data point that is neither core nor border point.

See the picture below for better understanding.

Core and Border Points in a Database D.

Green data point is Noise.

(Source: Author, Real Reference [2])One problem of this approach could be — ϵ-neighborhood of the border points contain significantly less number of points than the core points.

Since MinPts is a parameter in the algorithm, setting it to a low value to include the border points in the cluster can cause problem to eliminate the noise.

Here comes the concept of density-reachable and density-connected points.

Directly Density Reachable: Data-point a is directly density reachable from a point b if —|N (b)|≥ MinPts; i.

e.

b is a core point.

a ∈ N(b) i.

e.

a is in the epsilon neighborhood of b.

Considering a border point and a core point, we can understand that notion of directly density reachable is not symmetric, because even though the core point falls in the epsilon neighborhood of border point, the border point doesn’t have enough MinPts, and thus fail to satisfy both conditions.

Density Reachable: Point a is density reachable from a point b with respect to ϵ and MinPts, if —Density reachable is transitive in nature but, just like direct density reachable, it is not symmetric.

Density Connected: There can be cases when 2 border points will belong to the same cluster but they don’t share a specific core point, then we say that they are density connected if, there exists a common core point, from which these borders points are density reachable.

As you can understand that density connectivity is symmetric.

Definition from the Ester et.

al.

paper is given below —“A point a is density connected to a point b with respect to ϵ and MinPts, if there is a point c such that, both a and b are density reachable from c w.

r.

t.

to ϵ and MinPts.

”Two border points a, b are density connected through the core point c.

2.

Steps of DBSCAN Algorithm:With the definitions above, we can go through the steps of DBSCAN algorithm as below —The algorithm starts with an arbitrary point which has not been visited and its neighborhood information is retrieved from the ϵ parameter.

If this point contains MinPts within ϵ neighborhood, cluster formation starts.

Otherwise the point is labeled as noise.

This point can be later found within the ϵ neighborhood of a different point and, thus can be made a part of the cluster.

Concept of density reachable and density connected points are important here.

If a point is found to be a core point then the points within the ϵ neighborhood is also part of the cluster.

So all the points found within ϵ neighborhood are added, along with their own ϵ neighborhood, if they are also core points.

The above process continues until the density-connected cluster is completely found.

The process restarts with a new point which can be a part of a new cluster or labeled as noise.

From the definitions and algorithm steps above, you can guess two of the biggest drawbacks of DBSCAN algorithm.

If the database has data points that form clusters of varying density, then DBSCAN fails to cluster the data points well, since the clustering depends on ϵ and MinPts parameter, they cannot be chosen separately for all clusters.

If the data and features are not so well understood by a domain expert then, setting up ϵ and MinPts could be tricky and, may need comparisons for several iterations with different values of ϵ and MinPts.

Once the fundamentals are cleared a little, now will see an example of DBSCAN algorithm using Scikit-learn and python.

3.

Example of DBSCAN Algorithm with Scikit-Learn:To see one realistic example of DBSCAN algorithm, I have used Canada Weather data for the year 2014 to cluster weather stations.

First let’s load the data —The data-frame consists of 1341 rows and 25 columns and to understand what column names represent, let’s take a look below for the most important features —Since, I wanted to use different temperatures as few of the main features as a first try to cluster weather stations, first, let’s drop the rows that contain NaN values in the column ‘Mean Temperature (Tm)’, ‘Minimum Temperature (Tn)’, ‘Maximum Temperature (Tx)’.

After dropping the rows containing NaN values in the above mentioned columns, we are left with 1255 samples.

Even though that’s almost ~7% of data loss, but given that we still have more than 1000 samples, let’s go ahead with the clustering.

Since we want to do spatial clustering and view the clustering in a map projection, along with different temperatures (‘Tm’, ‘Tn’, ‘Tx’), ‘Lat’, ‘Long’ should also be taken as features.

Here, I have used Basemap Toolkit, a library to plot 2D data for plotting maps in Python.

As mentioned in the Basemap documentation that “Basemap does not do any plotting on its own, but provides the facilities to transform coordinates to one of 25 different map projections”.

One of the important properties of Basemap is — calling a Basemap class instance with the arguments latitude/longitude (in degrees, as in our data-frame), to get x/y map projection coordinates.

Let’s plot the weather stations using Basemap to get familiarized with it.

Start by importing the necessary librariesfrom mpl_toolkits.

basemap import Basemapimport matplotlibfrom PIL import Imageimport matplotlib.

pyplot as pltfrom pylab import rcParams%matplotlib inlinercParams['figure.

figsize'] = (14,10)We are ready to call the Basemap class now —Let me explain the code block in brief.

I started off calling a Basemap class instance with Mercator projection (projection= ‘merc’), low resolution (resolution= ‘l’), and the boundaries of the map domain are given by the 4 parameters llcrnrlon, llcrnrlat, urcrnrlon, urcrnrlat, where llcrnrlon denotes longitude of lower left hand corner of the selected map domain and so on.

Drawcoastlines, drawcountries do exactly what the names suggest, drawlsmask draws a high resolution land-sea mask as an image with land and ocean colors specified to orange and sky-blue.

Latitude and Longitude are converted to x/y map projection coordinates using the command below —xs, ys = my_map(np.

asarray(weather_df.

Long), np.

asarray(weather_df.

Lat))These map projection coordinates will be used as features to cluster the data points spatially along with the temperatures.

First let’s take a look below the weather stations —Weather Stations in Canada, plotted using Basemap.

3.

1.

Clustering the Weather Data (Temperatures & Coordinates as Features)For clustering data, I’ve followed the steps shown in scikit-learn demo of DBSCAN.

Choosing temperatures (‘Tm’, ‘Tx’, ‘Tn’) and x/y map projections of coordinates (‘xm’, ‘ym’) as features and, setting ϵ and MinPts to 0.

3 and 10 respectively, gives 8 unique clusters (noise is labeled as -1).

Feel free to change these parameters to test how much clustering is affected accordingly.

Let’s visualize these clusters using Basemap —8 Unique Clusters in Canada Based on Few Selected Features in the Weather Data.

ϵ and MinPts set to 0.

3 and 10 Respectively.

Finally, I included precipitation (‘P’) in the features and repeated the same clustering steps with ϵ and MinPts set to 0.

5 and 10.

We see some differences from the previous clustering and, thus it gives us an idea about problem of clustering unsupervised data even using DBSCAN when we lack the domain knowledge.

4 Unique Clusters in Canada Based on Selected Features (now included precipitation compared to previous case) in the Weather Data.

ϵ and MinPts set to 0.

5 and 10 Respectively.

You can try to repeat the process including some more features, or, change the clustering parameters, to get a better overall knowledge.

To conclude this post, we went through some basic concepts of DBSCAN algorithm and tested this algorithm to cluster weather stations in Canada.

The detail codes and all the pictures will be available in my Github.

A direct comparison with K-Means clustering can be made to understand even better the differences between these algorithms.

Hopefully, this will help you to get started with one of the most used clustering algorithm for unsupervised problems.

Share your thoughts and ideas and, stay strong.

Cheers !References:[1] “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”; Martin Ester et.

al.

KDD-96 Proceedings.

[2] Density Based Clustering Methods; Gao,J.

, Associate Professor Buffalo University.

Presentation Link.

[3] Link to Github!.

. More details

Leave a Reply