Spark: Introduction to Datasets

In my previous blog, Spark: RDD vs DataFrames, I discussed the shortcomings of RDDs and how DataFrames overcome them. Now we’ll look at the shortcomings of DataFrames and how the Dataset API can overcome them. DataFrames: A DataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to the relational tables with Continue Reading

Spark Streaming vs. Structured Streaming

Fan of Apache Spark? I am too. The reason is simple: interesting APIs to work with, fast and distributed processing, far less disk I/O overhead than MapReduce, fault tolerance, and much more. With this much, you can do a lot in this world of Big Data and Fast Data. From “processing huge chunks of data” to “working on streaming data”, Spark handles them all well. In this Continue Reading

Spark: RDD vs DataFrames

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. One use of Spark SQL is to execute SQL queries. When running SQL from within another Continue Reading

Apache Spark 2.4: Adding a little more Spark to your code

Continuing with the objective of making Spark faster, easier, and smarter, Apache Spark recently shipped the fifth release in its 2.x line, i.e., Spark 2.4. We were lucky enough to experiment with it early in one of our projects. Today we will highlight the major changes in this version that we explored and experienced in our project. In our Continue Reading

CAP Theorem for the distributed systems

A few days back I completed the certification for the first course of the Lightbend Reactive Architecture Advanced track, i.e., Building Scalable Systems. I found this course very helpful and informative for understanding Reactive architecture. So if you have not started yet, please go there and let’s become reactive. There are a few foundational courses as well to build the foundation of reactive architecture. Continue Reading

Tuning a Spark Application

Having trouble optimizing your Spark application? If yes, then this blog will guide you on how to optimize it and which parameters should be tuned so that your Spark application gives the best performance. Spark applications can be bottlenecked by resources such as CPU, memory, network, etc. We need to tune our memory usage, data structures, how RDDs need to Continue Reading

HDFS: A Conceptual View

There has been a significant boom in distributed computing over the past few years. Various components communicate with each other over the network in spite of being deployed on different physical machines. A distributed file system (DFS) is a file system with data stored on a server. The data is accessed and processed as if it were stored on the local client machine. The DFS makes it convenient to share information Continue Reading

Spark: Why should we use SparkSession?

Spark 2.0 is the next major release of Apache Spark. It brings a major change in the level of abstraction of the Spark API and libraries, one that matters to anyone who wants to make use of all the advancements in this release. So in this blog post, I’ll be discussing SparkSession. Need Of Spark-Session


Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. The Naive Bayes classifier is a straightforward and powerful algorithm for the classification task. Even if we are working on a data set with millions of records with some attributes, it is suggested to try Continue Reading

MachineX: Logistic Regression with KSAI

Logistic Regression, a predictive analysis technique, is mostly used for classification with binary outcome variables and can also be extended to handle multiple classes. We have already studied the algorithm in depth in this blog. Today we will be using the KSAI library to build our logistic regression model. Setup

MachineX: Association Rule Learning with KSAI

In many of my previous blogs, I have written about Association Rule Learning, what it’s about and how it is performed. In this blog, we are going to see Association Rule Learning in action, and for this purpose we are going to use KSAI, a machine learning library written purely in Scala. So, let’s begin. Adding KSAI to your project You Continue Reading

MachineX: A tour to KSAI – Neural Networks

In this blog we will look into how we can use KSAI, a machine learning library written purely in Scala that uses most of the language’s features and the functional aspects of programming. You can read more about the library at the KSAI Wiki; alternatively, you can fork the project from here. KSAI has a rich set of algorithms that address some of the vital problems in classification, Continue Reading

