Machine Learning has gained popularity in the last couple of years and has witnessed an exponential rise in its usage. It gives a computer/machine to act without being explicitly programmed. Unsupervised learning is a technique to model the underlying structure or distribution in the data. It enables us to learn more about the data without providing any pre-assigned labels or scores for the training data. In this blog, we will explore K-Means-Algorithm as it is one of the most popular unsupervised learning techniques due to its simplicity and efficiency. It is an iterative and unsupervised clustering algorithm used in machine learning.
Quick Introduction to K-Means.
A cluster is a group of data points with similarities in their features. The k-means clustering algorithm assigns data points to categories, or clusters, by finding the mean distance between data points. It then iterates through this technique in order to perform more accurate classifications over time.
To put it simply, K-Means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean.
Clusters formation with K-Means
The objective of K-Means clustering is to minimize total intra-cluster variance.
- Specify the number of clusters( k ) that we need to be identify from the data.
- Select k points at random as initial cluster centers.
- Assign all the objects/points to their closest cluster center by calculating the Euclidean distance between them (the point and centers ).
- Calculate the centroid or mean of all objects in each cluster.
- Repeat steps 2, 3 and 4 until the same points gets assign to each cluster in consecutive rounds.
Note : The points choosen as initial cluster centers hugely dominate the final result of the cluster.
So, it’s really recommended to run this algorithm multiple times and keep track of the different clusters it forms. Always use different randomly chosen initial cluster centers or you can simply call them starting points.
What should be the value of k?
The algorithm explained above finds clusters for the number k that we chose. So, here the big question is how do we decide on that number?
The most traditional and straightforward method is to try with different values of k starting with 1 ( the worst-case scenario ) which will form only a single cluster. Each time we add a new cluster, the total variance within each cluster reduces. The variance will become 0 when there is only one point in the cluster.
When the reduction in variance for the increasing value of k is plotted as a 2D graph, it takes the shape of an elbow and that is why this method of finding k is known as the elbow method.
The point after which the variance doesn’t go down quickly is called the elbow point. It represents the best value of k with satisfactory outcomes of clustering.
Other methods :
- Average silhouette method
- Gap statistic method
The K-means clustering is a simple, fast, and efficient algorithm. We can use it to classify data points into categories when you have little available information about your data. However, there are many famous algorithms available for unsupervised learning such as:
- Apriori algorithm
- K-NN (k nearest neighbors)
- Principal Component Analysis
Recommendation: Firstly, understand your data before deciding which technique or algorithm to use for solving your problem.