What is Clustering?
Clustering, often known as cluster analysis, is an unsupervised machine learning task. Using a clustering algorithm entails providing the algorithm with a large amount of unlabeled data and allowing it to locate whatever groupings in the data it can. The names given to these groups are clusters. A cluster is a collection of data points that are related to one another based on their proximity to other data points. Clustering is mainly used for things like feature engineering or pattern discovery.
Density-based spatial clustering of applications with noise (DBSCAN)
Density-based spatial clustering of applications with noise is a type of density based clustering techniques. In density based clustering, data is grouped by areas of high concentrations of data points surrounded by areas of low concentrations of data points. Basically the algorithm finds the places that are dense with data points and calls those clusters.
DBSCAN is a base algorithm for density-based clustering. It can discover clusters of different shapes and sizes from a large amount of data, which is containing noise and outliers. DBSCAN clustering’s most appealing feature is its robustness against outliers. This Algorithm requires only two parameter namely minPoints and epsilon. The minimum number of points (a threshold) clustered together for a region is minpoint, where as epsilon is the distance measure that will be used to locate the points in the neighborhood of any point. DBSCAN creates a circle of epsilon radius around every data point and classifies them into Core point, Border point, and Noise.
The parameter estimation is a problem for every data mining task. For choosing good parameters we need to understand how they are used and have at least a basic previous knowledge about the data set that will be used.
Epsilon: A considerable portion of the data will not be clustered if the eps value used is too tiny. Outliers are those that do not meet the required number of points to form a dense zone. If the value specified is too high, however, clusters will merge and the bulk of items will be in the same cluster. The eps should be chosen based on the dataset’s distance (which may be determined using a k-distance graph), but modest eps values are preferred in general.
MinPoints: minPoints can be calculated as minPoints ≥ D + 1 from a number of dimensions (D) in the data set as a general rule. For data sets with noise, larger values are usually better since they form more significant clusters. The minPoints value must be at least 3, but the larger the data set, the higher the minPoints value should be.
Classification of data points
Core Point : A selected point is considered a core point if it has at least a minimal number of points (MinPts) within its epsilon-neighborhood including itself, black spots in above figure are core points that have at least MinPts=4 in their immediate vicinity.
Border Point: A border point is a chosen point that is in the vicinity of a core point but is not itself a core point. Blue spots are marked as border points in above figure. If we have a border point, it signifies that the point is close to or on the edge of a dense zone.
Noise Point: point: A selected point that is neither a core point nor a boundary point. This means that these points are outliers that are not associated with a dense cluster. In above figure, the red dots are marked as noise dots.
Initially, the algorithm starts by randomly and uniformly selecting a point from the set of data points. Checks if the selected point is the core point. Again, a point is considered core if it contains at least minpoints, a minimum number of points in its epsilon neighborhood. Next, search for the connected components of all the core points, ignoring the noise points. Assign each non-core point to the nearest cluster if the cluster is its epsilon neighbor. If not, assign it as noise. The algorithm stops when it explores all the points one by one and classifies them as either core, border or noise point.
Implementation of DBSCAN using Python
# Importing the python libraries import numpy as np import pandas as pd # Importing the dataset dataset = pd.read_csv('Mall_Customers.csv') X = dataset.iloc[:, [3, 4]].values # Using the elbow method to find the optimal number of cluster from sklearn.cluster import DBSCAN dbscan=DBSCAN(eps=3,min_samples=4) # Fitting the model with the data in dataset model=dbscan.fit(X) labels=model.labels_ from sklearn import metrics #identifying the points which makes up our core points sample_cores=np.zeros_like(labels,dtype=bool) sample_cores[dbscan.core_sample_indices_]=True #Calculating the number of clusters n_clusters=len(set(labels))- (1 if -1 in labels else 0) # Measuring the performance of clustering technique print(metrics.silhouette_score(X,labels))
So, that’s how easy it is to implement DBSCAN using sci-kit learn.
Advantages of DBSCAN:
- DBSCAN algorithm is robust to outliers (noise points).
- DBSCAN is great at separating high density clusters from low density clusters.
- Unlike K-means, DBSCAN does not require number of clusters to be specified prior execution.
- DBSCAN supports non-globular structures as well.
Disadvantages of DBSCAN:
- DBSCAN does not work well for clusters of varying density. Although it clusters high density regions separating them from low density points, it struggles a lot in case of clusters of similar density.
- DBSCAN algorithm is not deterministic in the sense that it forms different clusters on different trials. The reason is that the border point that is reachable from more than one core point can join either cluster, depending on the order of processing the data points.
- Sometimes, choosing the value of ‘epsilon’ can be difficult especially when the data is in higher dimensions.
In this article, we’ve looked at detailed explanation of DBSCAN density-based clustering algorithm, its two important parameters and advantages and disadvantages of using it. Then, we looked at its implementation in python using scikit-learn.
I hope this article will help you in understanding DBSCAN clustering algorithm and its basic implementation.