Cluster Computing

Kafka Tuning: Consistency vs Availability

Reading Time: 3 minutes Tuning a distributed Kafka cluster to attain consistency along with high availability of the system.
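As a rough illustration of the consistency side of that trade-off, here is a minimal producer sketch in Scala. The broker address and the topic name "events" are hypothetical; acks=all is the standard producer setting that waits for all in-sync replicas before acknowledging a write, trading latency for consistency.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ConsistentProducer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // assumed local broker
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  // acks=all favours consistency: the write is acknowledged only after
  // all in-sync replicas have it, at the cost of higher latency
  props.put("acks", "all")
  // retry transient broker failures instead of silently dropping records
  props.put("retries", "3")

  val producer = new KafkaProducer[String, String](props)
  producer.send(new ProducerRecord[String, String]("events", "user-42", "clicked")) // hypothetical topic
  producer.close()
}
```

Relaxing acks to 1, or enabling unclean leader election on the brokers, shifts the balance back toward availability.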

Congregating Spark files on S3

Reading Time: 2 minutes We all know that Apache Spark is a fast and general engine for large-scale data processing, and it is because of its speed that Spark has become one of the most popular frameworks in the world of big data. Working with Spark is a pleasant experience, as it has a simple API for Scala, Java, Python and R. But some tasks in Spark are still tough rows Continue Reading
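One common "tough row" when writing to S3 is the pile of small part files a job leaves behind; a frequently used workaround is to coalesce partitions before writing. The sketch below assumes the s3a connector and AWS credentials are already configured, and the bucket and prefixes are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CongregateOnS3 extends App {
  val conf = new SparkConf().setAppName("congregate-on-s3").setMaster("local[*]")
  val sc   = new SparkContext(conf)

  val lines = sc.textFile("s3a://my-bucket/input/")           // hypothetical bucket and prefix
  // coalesce(1) congregates all partitions into a single part file;
  // convenient for small results, a bottleneck for large ones
  lines.coalesce(1).saveAsTextFile("s3a://my-bucket/output/") // hypothetical output prefix

  sc.stop()
}
```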

Simplifying Sorting with Spark DataFrames

Reading Time: 2 minutes In our previous blog post, Using Spark DataFrames for Word Count, we saw how easy it has become to code in Spark using DataFrames. It has also made programming in Spark much more logical rather than technical. So, let's continue our quest for simplifying coding in Spark with DataFrames via sorting. We all know that sorting has always been an inseparable part of analytics. Whether it is E-Commerce or Applied Continue Reading
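For a flavour of how little code sorting takes with DataFrames, here is a minimal sketch, assuming a recent Spark with SparkSession and using a toy word-count result as input.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

object SortingWithDataFrames extends App {
  val spark = SparkSession.builder()
    .appName("sorting-with-dataframes")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // a toy word-count result standing in for real data
  val counts = Seq(("spark", 5), ("kafka", 2), ("scala", 7)).toDF("word", "count")

  // orderBy (an alias of sort) returns a new DataFrame sorted by the given columns
  counts.orderBy(desc("count")).show()

  spark.stop()
}
```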

Introduction to Machine Learning with Spark (Clustering)

Reading Time: 2 minutes In this blog, we will learn how to group similar data objects using K-means clustering, offered by the Spark Machine Learning Library. Prerequisites: the code example needs only the Spark shell to execute. What is clustering? Clustering is grouping data objects into clusters (with no initial class or group defined) on the basis of similarity, or their natural closeness to each other. The “closeness” Continue Reading
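A minimal sketch of that idea, runnable by pasting into the Spark shell (where sc is already defined); the four two-dimensional points are made up so that two groups are obvious.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// toy points: two near the origin, two near (9, 9)
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.5, 0.5),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.5, 9.5)
))

// k = 2 clusters, at most 20 iterations
val model = KMeans.train(points, 2, 20)

model.clusterCenters.foreach(println)           // the two learned centres
println(model.predict(Vectors.dense(0.2, 0.3))) // cluster assigned to a new point
```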

Tuning an Apache Spark Application with Speculation

Reading Time: 2 minutes What happens if a Spark job is slow? It is a big question for application performance, and we can optimize such jobs in Spark with speculation. Speculation basically starts a copy of a task on another worker if the existing one is slow. It does not stop the slow execution; both workers execute the task simultaneously. To make our job speculative we need to set Continue Reading
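The settings involved are the standard spark.speculation family; a minimal sketch, assuming a local master, might look like this.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SpeculativeJob extends App {
  val conf = new SparkConf()
    .setAppName("speculative-job")
    .setMaster("local[*]")
    // launch speculative copies of tasks that run markedly slower than the rest
    .set("spark.speculation", "true")
    // a task becomes a candidate once it is 1.5x slower than the median task
    .set("spark.speculation.multiplier", "1.5")
    // only check for stragglers after 75% of a stage's tasks have finished
    .set("spark.speculation.quantile", "0.75")

  val sc = new SparkContext(conf)
  sc.parallelize(1 to 1000000).map(_ * 2).count()
  sc.stop()
}
```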

Apache Spark Cluster Internals: How Spark Jobs Are Computed by the Cluster

Reading Time: 2 minutes In this blog we explain how the Spark cluster computes jobs. Spark jobs are collections of stages, and stages are collections of tasks. So, before the deep dive, let us first look at the Spark cluster architecture. In the cluster above we can see the driver program; it is the main program of our Spark application, and the driver program runs on the master node of Continue Reading
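To make the job/stage/task vocabulary concrete, here is a small hypothetical word count: the collect action triggers one job, and the shuffle introduced by reduceByKey splits it into two stages, each with one task per partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JobsStagesTasks extends App {
  val conf = new SparkConf().setAppName("jobs-stages-tasks").setMaster("local[2]")
  val sc   = new SparkContext(conf)

  // two partitions, so each stage runs two tasks
  val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 2)

  // `collect` is the action that triggers the job; `reduceByKey` adds a
  // shuffle boundary, splitting the job into two stages
  val counts = words.map(w => (w, 1)).reduceByKey(_ + _).collect()
  counts.foreach(println)

  sc.stop()
}
```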

Set Up an Apache Spark Cluster on a Single Standalone Machine

Reading Time: 2 minutes If we want to make a cluster on a standalone machine, we need to set up some configuration. We will be using the launch scripts provided by Spark, but there are a couple of configurations we need to set first. To set up the Spark environment, open the following file, or create it from its template if it is not available Continue Reading
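Once the master and a worker are up via the launch scripts, a quick way to confirm the cluster works is to point a small application at it; this sketch assumes the master runs on localhost with the default port 7077.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StandaloneSmokeTest extends App {
  // assumes the standalone master and a worker have already been started
  // with Spark's launch scripts, and the master listens on the default port 7077
  val conf = new SparkConf()
    .setAppName("standalone-smoke-test")
    .setMaster("spark://localhost:7077")

  val sc = new SparkContext(conf)
  // a trivial job to confirm that the registered worker picks up tasks
  println(sc.parallelize(1 to 100).sum())
  sc.stop()
}
```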