Apache Spark 2.4: Adding a little more Spark to your code

Continuing with its objective of making Spark faster, easier, and smarter, Apache Spark recently shipped the fifth release in its 2.x line, i.e., Spark 2.4. We were lucky enough to experiment with it early in one of our projects. Today we will highlight the major changes in this version that we explored and experienced in that project.

CAP Theorem for the distributed systems

A few days back I completed the certification for the first course of the Lightbend Reactive Architecture Advanced track, i.e., Building Scalable Systems. I found this course very helpful and informative for understanding Reactive architecture. So if you have not started yet, please go there and let's become reactive. There are a few foundational courses as well to build the foundation of reactive architecture.

Tuning a Spark Application

Having trouble optimizing your Spark application? If yes, then this blog will guide you on how to optimize it and which parameters should be tuned so that your Spark application gives the best performance. Spark applications can be bottlenecked by resources such as CPU, memory, and network. We need to tune our memory usage, data structures, how RDDs need to

HDFS: A Conceptual View

There has been a significant boom in distributed computing over the past few years. Various components communicate with each other over the network in spite of being deployed on different physical machines. A distributed file system (DFS) is a file system with data stored on a server. The data is accessed and processed as if it were stored on the local client machine. The DFS makes it convenient to share information

Spark: Why should we use SparkSession?

Spark 2.0 is the next major release of Apache Spark. It brings a major change in the level of abstraction of the Spark API and libraries. This change matters to anyone who wants to take advantage of the advancements in this release, so in this blog post I’ll be discussing SparkSession. Need of SparkSession


Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. The Naive Bayes classifier is a straightforward and powerful algorithm for the classification task. Even if we are working on a data set with millions of records and some attributes, it is suggested to try

MachineX: Logistic Regression with KSAI

Logistic regression, a predictive analysis technique, is mostly used for classification with binary variables and can also be extended to results with multiple classes. We have already studied the algorithm in depth in this blog. Today we will be using the KSAI library to build our logistic regression model. Setup

MachineX: Association Rule Learning with KSAI

In many of my previous blogs, I have posted about Association Rule Learning, what it's about and how it is performed. In this blog, we are going to see Association Rule Learning in action, and for this purpose we are going to use KSAI, a machine learning library written purely in Scala. So, let's begin. Adding KSAI to your project You

MachineX: A tour to KSAI – Neural Networks

In this blog we will look at how to use KSAI, a machine learning library written purely in Scala that uses most of the language's features and the functional aspects of programming. You can read more about the library at the KSAI Wiki; alternatively, you can fork the project from here. KSAI has a rich set of algorithms that address some of the vital problems in classification,

MachineX: KNN algorithm using KSAI

Classification is a well-known area of machine learning. The K-nearest neighbor (KNN) algorithm is a simple algorithm that keeps all available cases and classifies new cases based on their similarity to existing cases. KNN has been used in pattern recognition as a non-parametric technique. In this algorithm, a case is classified by a majority vote of its neighbors. If K=1 then the cases are assigned
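The majority-vote idea described in this excerpt can be sketched in a few lines of plain Python. This is only an illustrative sketch, not KSAI's API: the function name `knn_classify`, the Euclidean distance metric, and the toy training data are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training cases.

    `train` is a list of (features, label) pairs; this layout and the
    Euclidean metric are assumptions for illustration only.
    """
    # Keep all available cases and rank them by distance to the new case.
    neighbors = sorted(train, key=lambda case: math.dist(case[0], query))[:k]
    # Each of the k nearest neighbors casts one vote for its own label.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters labeled "a" and "b".
train = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"),
    ((5.5, 4.5), "b"),
]

near_a = knn_classify(train, (1.0, 0.9), k=3)
near_b = knn_classify(train, (5.2, 4.8), k=1)
```

With `k=1` the vote reduces to copying the label of the single nearest case; larger values of `k` trade sensitivity to individual cases for smoother decisions.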

MachineX: An Introduction to KSAI, a machine learning library

Take a closer look at LinkedIn or any media platform for a couple of minutes and you'll find that the hot topic in the technology section nowadays is Machine Learning and Artificial Intelligence. Why machine learning and artificial intelligence? Well, needless to say, it is transforming the world. People are doing well in business by predicting different aspects, doctors are doing good in medical

DynamoDB Core Components

Amazon DynamoDB: Core Components

DynamoDB is a part of Amazon Web Services. It is a NoSQL database that supports key-value and document data structures. In this blog, we will be discussing the core components of DynamoDB. Features of DynamoDB: It is a fully managed NoSQL database. It can store and retrieve any amount of data, and can serve any amount of traffic. To maintain fast performance, it distributes data

CuriosityX: RDDs – The backbone of Apache Spark

In our last blog, we looked at using Spark Streaming to transform and transport data between Kafka topics. After reading it, many readers asked us for a brief description of the RDDs in Spark that we used. So this blog is dedicated entirely to RDDs in Spark. Let's start with the very basic question that comes to our mind

