Big Data

Introduction to Machine Learning with Spark (Clustering)

Reading Time: 2 minutes In this blog, we will learn how to group similar data objects using the K-means clustering offered by the Spark Machine Learning Library. Prerequisites: The code example needs only the Spark shell to execute. What is Clustering? Clustering groups data objects into clusters (with no predefined classes or groups) on the basis of their similarity or natural closeness to each other. The “closeness” Continue Reading
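To make the idea of grouping by closeness concrete, here is a minimal plain-Scala sketch of the classic K-means loop (assign each point to its nearest centroid, then recompute each centroid as the mean of its points). It is not the Spark MLlib API used in the post. The names KMeansSketch, distSq, closest, mean and kMeans are hypothetical, and squared Euclidean distance on 2-D points is an assumption made for the illustration.

```scala
// A plain-Scala sketch of K-means on 2-D points (no Spark required).
// All names here are hypothetical and chosen for illustration only.
object KMeansSketch {
  type Point = (Double, Double)

  // Squared Euclidean distance: the "closeness" measure assumed in this sketch.
  def distSq(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Index of the centroid nearest to point p.
  def closest(p: Point, centroids: Vector[Point]): Int =
    centroids.indices.minBy(i => distSq(p, centroids(i)))

  // Mean of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

  // Repeat the assign-then-recompute step for a fixed number of iterations.
  def kMeans(points: Seq[Point], initial: Vector[Point], iterations: Int): Vector[Point] =
    (1 to iterations).foldLeft(initial) { (centroids, _) =>
      val groups = points.groupBy(p => closest(p, centroids))
      // A centroid with no assigned points is left where it is.
      centroids.indices
        .map(i => groups.get(i).map(mean).getOrElse(centroids(i)))
        .toVector
    }
}
```

Calling KMeansSketch.kMeans with the points (1, 1), (1, 2), (9, 9), (10, 9) and starting centroids (0, 0) and (10, 10) settles on one centroid per visible group. The starting centroids are passed in explicitly so the sketch stays deterministic.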

Using Spark DataFrames for Word Count

Reading Time: 2 minutes As we all know, the DataFrame API was introduced in Spark 1.3.0, in March 2015. Its goal was to make distributed processing of “Big Data” more intuitive by organizing a distributed collection of data (known as an RDD) into named columns. This enabled both engineers and data scientists to use Apache Spark for distributed processing of “Big Data” with ease. Also, the DataFrame API came with many under-the-hood optimizations Continue Reading

Demystifying Asynchronous Actions in Spark

Reading Time: 3 minutes What if we want to execute two actions concurrently on different RDDs? Spark actions are always synchronous: if we perform two actions one after the other, they always execute sequentially. Let's see an example: val rdd = sc.parallelize(List(32, 34, 2, 3, 4, 54, 3), 4) rdd.collect().map{ x => println("Items in the lists:" + x)} val rddCount = sc.parallelize(List(434, 3, 2, 43, Continue Reading

Tuning apache spark application with speculation

Reading Time: 2 minutes What happens if a Spark job is slow? This is a big question for application performance. We can optimize Spark jobs with speculation, which starts a copy of a slow task on another worker. It does not stop the slow execution; both workers run the task simultaneously. To make our job speculative, we need to set Continue Reading
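As a hedged illustration of switching speculation on, the sketch below sets Spark's documented spark.speculation property on a SparkConf. It is not necessarily the exact configuration the post goes on to describe, and the application name is a hypothetical placeholder.

```scala
import org.apache.spark.SparkConf

// Sketch only: enable speculative execution of slow tasks.
// "speculation-sketch" is a hypothetical application name.
val conf = new SparkConf()
  .setAppName("speculation-sketch")
  .set("spark.speculation", "true") // off by default
```

The same property can also be supplied at submit time, for example with --conf spark.speculation=true.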

Spark with Spray Starter Kit

Reading Time: 3 minutes Over the last few months, Spark has gained a lot of momentum in the Big Data world. It has won a lot of competitions and surveys, like the Daytona GraySort 100TB competition, and has become a top-level Apache project, among much else. Beyond being a fast, general engine for large-scale data processing, Spark has found its use everywhere. The best part about Spark is that it can Continue Reading

Apache Spark Cluster Internals: How spark jobs will be computed by the spark cluster

Reading Time: 2 minutes In this blog, we explain how a Spark cluster computes jobs. Spark jobs are collections of stages, and stages are collections of tasks. So before the deep dive, we first look at the Spark cluster architecture. In the above cluster we can see the driver program; it is the main program of our Spark application. The driver program runs on the master node of Continue Reading

Gnip using Spark Streaming :- An Apache Spark Utility to pull Tweets from Gnip in realtime

Reading Time: 2 minutes We are all familiar with Gnip, Inc., which provides data from dozens of social media websites via a single API. It is also known as the Grand Central Station for the social media web. One of its popular APIs is PowerTrack, which provides Tweets from Twitter in realtime along with the ability to filter Twitter’s full firehose, giving its customers only what they are interested in. This Continue Reading

Stateful transformation on Dstream in apache spark with example of wordcount

Reading Time: 2 minutes Sometimes we have a use case in which we need to maintain the state of a paired DStream to use it in the next DStream. So we take the example of a stateful word count in socketTextStreaming. In the word count example, if the word “xyz” comes twice in the first DStream or window, it is reduced and its value is 2, but its state will be lost in the next DStream Continue Reading
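One way to carry a word's count from one batch to the next is an update function that merges the new counts with the previous total. The sketch below shows only that function in plain Scala. Its shape, (Seq[Int], Option[Int]) => Option[Int], matches what Spark Streaming's updateStateByKey expects on a paired DStream, but the names StatefulCount and updateCount are hypothetical and this is not necessarily the code used in the post.

```scala
// Hypothetical helper: the update step of a stateful word count.
object StatefulCount {
  // newCounts: counts seen for one word in the current batch.
  // previous:  running total from earlier batches (None the first time a word appears).
  def updateCount(newCounts: Seq[Int], previous: Option[Int]): Option[Int] =
    Some(newCounts.sum + previous.getOrElse(0))
}
```

With this function, a word seen twice in the first batch gets the state 2, and seeing it once more in the next batch raises the state to 3 instead of resetting it. In a streaming application it would typically be passed as pairs.updateStateByKey(StatefulCount.updateCount _), with a checkpoint directory configured on the streaming context.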

Setup a Apache Spark cluster in your single standalone machine

Reading Time: 2 minutes If we want to make a cluster on a single standalone machine, we need to set up some configuration. We will be using the launch scripts that are provided by Spark, but first there are a couple of configurations we need to set. First of all, set up a Spark environment: open the following file, or create it from the template if it is not available Continue Reading

Scala in Business | Knoldus Newsletter – March 2015

Reading Time: 2 minutes Hello Folks, We are back again with the March 2015 newsletter. Here is this Scala in Business | Knoldus Newsletter – March 2015. In this newsletter, you will get the business-related news for Scala: how organisations are adopting Scala for their business, how Scala-related technologies are increasing the performance of applications, and how Scala is getting popular in the industry. So, if you haven’t subscribed to Continue Reading

Dribbling with Filter.js: client-side JS filtering of JSON objects

Reading Time: < 1 minute Dribbling Filter.js: the Play framework with client-side JS filtering of JSON objects and rendering of HTML snippets via jQuery. Big chunk to display? Interactive filtering? Most importantly, it has to be really fast. Isn’t it like dribbling against the Netherlands? Big ground, lots of hooting, and most importantly you have to be fast and win. UI programming is an exciting ground to play on. That’s why I chose the reactive platform Continue Reading

Play with Spark: Building Spark MLLib in a Play Spark Application

Reading Time: 2 minutes In our last post of the Play with Spark! series, we saw how to integrate Spark SQL in a Play Scala application. Now, in this blog, we will see how to add the Spark MLLib feature to a Play Scala application. Spark MLLib is a new component under active development. It was first released with Spark 0.8.0. It contains some common machine learning algorithms and utilities, including classification, regression, clustering, Continue Reading

Play with Spark: Building Spark SQL in a Play Spark Application

Reading Time: 2 minutes In our last post of the Play with Spark! series, we saw how to integrate Spark Streaming in a Play Scala application. Now, in this blog, we will see how to add the Spark SQL feature to a Play Scala application. Spark SQL is a powerful tool of Apache Spark. It allows relational queries, expressed in SQL, HiveQL, or Scala, to be executed using Spark. Apache Spark has a new Continue Reading