In this blog we will demonstrate a full ML pipeline applied to a set of documents, in this case 10 books taken from the internet.
So let's start with first things first.
What is Clustering?
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
When clustering is applied to textual data, it is known as Document Clustering.
What is the basic difference between Clustering and Classification?
Clustering algorithms in computational text analysis group documents into subsets, called clusters, where the algorithm’s goal is to create internally coherent clusters that are distinct from one another. Classification, on the other hand, is a form of supervised learning where the features of the documents are used to predict the “type” of documents.
The basic difference is that in clustering we form the clusters by simply telling the algorithm how many clusters we want and letting it go and cook them! In classification, we first train a model on data labelled with a set of classes that we have already defined, and then use that trained model to classify new data.
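A basic example of a clustering algorithm would be LDA (Latent Dirichlet Allocation), and of a classification algorithm would be SVM (Support Vector Machine). Strictly speaking, LDA is a topic model: it describes each document as a mixture of topics rather than placing it in exactly one cluster.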
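To make the contrast concrete, here is a toy sketch in plain Scala. It deliberately uses much simpler algorithms than LDA and SVM (a one-dimensional k-means and a nearest-centroid classifier), and every name in it is made up for illustration. The point is only the shape of the inputs: the clustering function receives raw points and a cluster count, while the classifier has to be trained on labelled examples first.

```scala
object ClusterVsClassify {
  /** Clustering: we only say how many clusters we want; the groups are discovered. */
  def kMeans1D(points: Seq[Double], k: Int, iterations: Int = 10): Seq[Double] = {
    var centroids = points.distinct.sorted.take(k) // naive initialisation
    for (_ <- 0 until iterations) {
      // Assign each point to its nearest centroid, then move each centroid to the mean of its group.
      val groups = points.groupBy(p => centroids.minBy(c => math.abs(c - p)))
      centroids = centroids.map(c => groups.get(c).map(g => g.sum / g.size).getOrElse(c))
    }
    centroids.sorted
  }

  /** Classification, training: labelled examples define the classes (one centroid per label). */
  def train(labelled: Seq[(Double, String)]): Map[String, Double] =
    labelled.groupBy(_._2).map { case (label, xs) => label -> xs.map(_._1).sum / xs.size }

  /** Classification, prediction: reuse the trained model on new data. */
  def predict(model: Map[String, Double], x: Double): String =
    model.minBy { case (_, centroid) => math.abs(centroid - x) }._1
}
```

Note that `kMeans1D` never sees a label, whereas `predict` can only answer with labels that were present in the training data.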
How Does LDA Actually Work?
For this I would recommend you go through these:
In a nutshell, what it does is:
Step 1
You tell the algorithm how many topics you think there are. You can either use an informed estimate (e.g. results from a previous analysis), or simply use trial and error. In trying different estimates, you may pick the one that generates topics at your desired level of interpretability, or the one yielding the highest statistical certainty (i.e. log likelihood). For a small collection, the number of topics might be inferred just by eyeballing the documents.
Step 2
The algorithm will assign every word to a temporary topic. Topic assignments are temporary as they will be updated in Step 3. Temporary topics are assigned to each word in a semi-random manner (according to a Dirichlet distribution, to be exact). This also means that if a word appears twice, each occurrence may be assigned to a different topic. Note that in analyzing actual documents, function words (e.g. “the”, “and”, “my”) are removed and not assigned to any topic.
Step 3 (iterative)
The algorithm will check and update topic assignments, looping through each word in every document. For each word, its topic assignment is updated based on two criteria:
- How prevalent is that word across topics?
- How prevalent are topics in the document?
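The three steps above can be sketched in plain Scala as a toy collapsed Gibbs sampler, which is one common way to implement this loop. This is an illustration under assumptions, not what Spark runs internally (Spark's LDA uses its `em` and `online` optimizers). The object name, the default `alpha` and `beta` smoothing values, and the uniform random initialisation in Step 2 are simplifications of mine.

```scala
import scala.util.Random

object ToyLdaGibbs {
  /** Returns one topic id per word occurrence, in the same shape as `docs`. */
  def run(docs: Vector[Vector[String]], numTopics: Int, iterations: Int,
          alpha: Double = 0.1, beta: Double = 0.01, seed: Long = 42L): Vector[Vector[Int]] = {
    val rng = new Random(seed)
    val wordId = docs.flatten.distinct.zipWithIndex.toMap
    val vocabSize = wordId.size

    // Step 1 is the caller's choice of numTopics.
    // Step 2: give every word occurrence a temporary topic (uniformly at random here).
    val z = docs.map(d => Array.fill(d.length)(rng.nextInt(numTopics))).toArray
    val docTopic = Array.ofDim[Int](docs.length, numTopics) // how often each topic occurs in a document
    val topicWord = Array.ofDim[Int](numTopics, vocabSize)  // how often each word occurs in a topic
    val topicTotal = new Array[Int](numTopics)
    for (d <- docs.indices; i <- docs(d).indices) {
      val k = z(d)(i); val w = wordId(docs(d)(i))
      docTopic(d)(k) += 1; topicWord(k)(w) += 1; topicTotal(k) += 1
    }

    // Step 3: revisit every word in every document and resample its topic.
    for (_ <- 0 until iterations; d <- docs.indices; i <- docs(d).indices) {
      val w = wordId(docs(d)(i))
      val old = z(d)(i)
      // Take the current assignment out of the counts before scoring.
      docTopic(d)(old) -= 1; topicWord(old)(w) -= 1; topicTotal(old) -= 1

      val weights = Array.tabulate(numTopics) { k =>
        val wordInTopic = (topicWord(k)(w) + beta) / (topicTotal(k) + vocabSize * beta) // criterion 1
        val topicInDoc = docTopic(d)(k) + alpha                                        // criterion 2
        wordInTopic * topicInDoc
      }

      // Draw a topic with probability proportional to its weight.
      var r = rng.nextDouble() * weights.sum
      var k = 0
      while (k < numTopics - 1 && r >= weights(k)) { r -= weights(k); k += 1 }

      z(d)(i) = k
      docTopic(d)(k) += 1; topicWord(k)(w) += 1; topicTotal(k) += 1
    }
    z.map(_.toVector).toVector
  }
}
```

Calling `ToyLdaGibbs.run(docs, numTopics = 2, iterations = 50)` on a handful of tokenized documents returns the final topic id of every word occurrence. The two factors multiplied inside `weights` correspond directly to the two criteria listed above.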
What are we going to do in this blog?
We are going to perform the following steps for document clustering:
1. Spark RegexTokenizer : For Tokenization
4. Spark LDA : For Clustering of documents.
So let’s get started with the code:
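A minimal sketch of the setup, assuming the books are plain-text files in a local folder and that Spark SQL is on the classpath. The path `books/*.txt`, the column names `path` and `text`, and the object name `LoadBooks` are placeholders of mine.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LoadBooks {
  /** Reads every file matched by `dir` into a DataFrame with one row per book. */
  def load(spark: SparkSession, dir: String): DataFrame = {
    import spark.implicits._
    spark.sparkContext
      .wholeTextFiles(dir) // RDD of (file path, whole file content)
      .toDF("path", "text")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DocumentClustering")
      .master("local[*]") // local mode, for experimenting on a laptop
      .getOrCreate()

    val books = load(spark, "books/*.txt") // hypothetical location of the 10 books
    books.select("path").show(truncate = false)
    spark.stop()
  }
}
```

Reading each book as a single row keeps one document per record, which is the unit the later stages operate on.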
Now comes your pipeline, which should look like this:
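A sketch of how such a pipeline might be assembled with Spark ML. `RegexTokenizer` and `LDA` come from the steps listed above. The two stages in between (`StopWordsRemover` and `CountVectorizer`) are my assumption about what sits between tokenization and LDA, and the column names, the pattern and the numeric settings are illustrative only.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}

object BookPipeline {
  def build(numTopics: Int): Pipeline = {
    // Split the raw text into lowercase tokens on runs of non-word characters.
    val tokenizer = new RegexTokenizer()
      .setInputCol("text").setOutputCol("tokens")
      .setPattern("\\W+")
      .setMinTokenLength(3)

    // Assumed stage: drop function words such as "the" and "and".
    val stopWords = new StopWordsRemover()
      .setInputCol("tokens").setOutputCol("filtered")

    // Assumed stage: turn each token list into a term-count vector for LDA.
    val vectorizer = new CountVectorizer()
      .setInputCol("filtered").setOutputCol("features")
      .setVocabSize(10000)

    val lda = new LDA()
      .setK(numTopics)
      .setMaxIter(50)
      .setFeaturesCol("features")

    new Pipeline().setStages(Array[PipelineStage](tokenizer, stopWords, vectorizer, lda))
  }
}
```

With a DataFrame that has a `text` column, `BookPipeline.build(10).fit(books)` would run all four stages in order and return a fitted `PipelineModel`.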
Currently, we are applying the algorithm to just the sample books and trying to figure out the topic of each cluster:
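One way to read the topics out of a fitted pipeline is sketched below. It assumes the pipeline contains a `CountVectorizer` stage followed by an `LDA` stage. The helper name `TopicTerms.top` is made up for illustration.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.clustering.LDAModel
import org.apache.spark.ml.feature.CountVectorizerModel

object TopicTerms {
  /** Returns, for each topic id, its highest-weighted vocabulary terms. */
  def top(model: PipelineModel, termsPerTopic: Int): Seq[(Int, Seq[String])] = {
    // The vocabulary maps the term indices used by LDA back to words.
    val vocab = model.stages
      .collectFirst { case m: CountVectorizerModel => m.vocabulary }
      .getOrElse(sys.error("no CountVectorizerModel stage found"))
    val lda = model.stages
      .collectFirst { case m: LDAModel => m }
      .getOrElse(sys.error("no LDAModel stage found"))

    lda.describeTopics(termsPerTopic) // columns: topic, termIndices, termWeights
      .select("topic", "termIndices")
      .collect()
      .map(row => (row.getInt(0), row.getSeq[Int](1).map(vocab(_))))
      .toSeq
  }
}
```

`describeTopics` only returns term indices, so the lookup through the vocabulary is what turns each topic into a readable list of related terms.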
So our LDA output looks something like this:
and so on…
We can improve these results by tuning the parameters of LDA, and hence get a better set of related terms.
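As a rough sketch of what such tuning could look like, the helper below fits one model per candidate number of topics and ranks them by log perplexity (lower is better). The object name, the candidate values and the fixed settings are assumptions of mine. It expects a DataFrame that already has a `features` column of term-count vectors.

```scala
import org.apache.spark.ml.clustering.{LDA, LDAModel}
import org.apache.spark.sql.DataFrame

object TuneLda {
  /** Fits one model per candidate k and returns (k, logPerplexity) pairs, best first. */
  def compare(features: DataFrame, candidates: Seq[Int]): Seq[(Int, Double)] =
    candidates.map { k =>
      val model: LDAModel = new LDA()
        .setK(k)
        .setMaxIter(100)        // more iterations give the optimizer more time to converge
        .setOptimizer("online") // "em" is the other optimizer Spark offers
        .setSeed(42L)           // fixed seed so that runs are comparable
        .fit(features)
      (k, model.logPerplexity(features))
    }.sortBy(_._2)
}
```

Scoring on the same data that was used for fitting is optimistic, so with more than a handful of books it would be safer to score on a held-out split. Other knobs worth trying are `setDocConcentration` and `setTopicConcentration`.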
The next task I am working on is finding the core of each topic, i.e. assigning a label to the topic by finding the core of its cluster!
If anyone has any comments or suggestions on how to find the topic label from the related terms, I would be happy to hear from you. Currently what I have in mind is finding collocations using a PMI (pointwise mutual information) approach, but I didn't find any good package for this in Scala. There is one in NLTK in Python, but maybe something better can come up.
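For what it is worth, the PMI idea itself is small enough to sketch in plain Scala without any package. The version below scores adjacent word pairs with PMI(x, y) = ln(p(x, y) / (p(x) p(y))). The object name and the `minCount` cut-off are choices of mine, and this is only the scoring step, not a full labelling method.

```scala
object PmiCollocations {
  /** Scores each adjacent word pair by pointwise mutual information, highest first. */
  def score(tokens: Seq[String], minCount: Int = 2): Seq[((String, String), Double)] = {
    val n = tokens.size.toDouble
    val unigram = tokens.groupBy(identity).map { case (w, ws) => w -> ws.size }

    val pairs = tokens.zip(tokens.drop(1)) // all adjacent word pairs
    val total = pairs.size.toDouble
    val bigram = pairs.groupBy(identity).map { case (p, ps) => p -> ps.size }

    bigram.toSeq
      .filter { case (_, count) => count >= minCount } // rare pairs get unreliable PMI scores
      .map { case ((x, y), count) =>
        val pmi = math.log((count / total) / ((unigram(x) / n) * (unigram(y) / n)))
        ((x, y), pmi)
      }
      .sortBy(-_._2)
  }
}
```

Pairs that occur together far more often than their individual frequencies would suggest end up at the top of the list, which is what makes them candidates for a topic label.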
Any comments or suggestions are welcome here or on Twitter: @shiv4nsh