Apache Spark

Meetup: An Overview of Spark DataFrames with Scala

Reading Time: < 1 minute Knoldus organized a Meetup on Wednesday, 18 Nov 2015, in which an overview of Spark DataFrames with Scala was given. Apache Spark is a distributed computing engine for large-scale data processing, and a wide range of organizations use it to process large datasets. Many Spark and Scala enthusiasts attended this session and learned why DataFrames are a great fit for building an application in Spark with Scala Continue Reading

Simplifying Sorting with Spark DataFrames

Reading Time: 2 minutes In our previous blog post, Using Spark DataFrames for Word Count, we saw how easy it has become to code in Spark using DataFrames, and how they have made programming in Spark more logical than technical. So let's continue our quest to simplify coding in Spark with DataFrames, this time via sorting. We all know that sorting has always been an inseparable part of analytics, whether it is E-Commerce or Applied Continue Reading
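
The full walkthrough sits behind the Continue Reading link; as a flavour of how little code sorting takes, here is a minimal sketch of sorting a DataFrame by a column (the column names and sample data are illustrative, not from the post):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SortingExample extends App {
  val conf = new SparkConf().setMaster("local[*]").setAppName("sorting-example")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // Illustrative sample data: (product, price)
  val orders = sc.parallelize(Seq(("book", 12.0), ("pen", 2.0), ("lamp", 25.0)))
    .toDF("product", "price")

  // sort (alias: orderBy) takes column expressions; .desc flips the order
  orders.sort($"price".desc).show()

  sc.stop()
}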

MeetUp on “An Overview of Spark DataFrames with Scala”

Reading Time: < 1 minute Knoldus is organizing a one-hour session on 18th Nov 2015 at 6:00 PM. The topic will be An Overview of Spark DataFrames with Scala. All of you are invited to join this session. Address: 30/29, First Floor, Above UCO Bank, Near Rajendra Place Metro Station, New Delhi, India. Please click here for more details.

Using Spark DataFrames for Word Count

Reading Time: 2 minutes As we all know, the DataFrame API was introduced in Spark 1.3.0, in March 2015. Its goal was to make distributed processing of “Big Data” more intuitive, by organizing a distributed collection of data (known as an RDD) into named columns. This enabled both engineers and data scientists to use Apache Spark for distributed processing of “Big Data” with ease. The DataFrame API also came with many under-the-hood optimizations Continue Reading
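
For a taste of what the post covers, here is a minimal word count sketch with the DataFrame API (the input path is an assumption; point it at any text file):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object WordCount extends App {
  val conf = new SparkConf().setMaster("local[*]").setAppName("df-word-count")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // Assumed input path
  val words = sc.textFile("src/main/resources/sample.txt")
    .flatMap(_.split("\\s+"))
    .toDF("word")

  // groupBy + count yields the frequency of each word
  words.groupBy("word").count().show()

  sc.stop()
}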

Demystifying Asynchronous Actions in Spark

Reading Time: 3 minutes What if we want to execute two actions concurrently on different RDDs? Spark actions are always synchronous: if we perform two actions one after the other, they always execute sequentially. Let's see an example:

val rdd = sc.parallelize(List(32, 34, 2, 3, 4, 54, 3), 4)
rdd.collect().map { x => println("Items in the lists:" + x) }
val rddCount = sc.parallelize(List(434, 3, 2, 43, Continue Reading
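
Spark's way out of this, covered behind the link, is the async variants of the actions: collectAsync and countAsync return a FutureAction, so both jobs can be submitted without waiting on each other. A minimal sketch, with illustrative data and timeouts:

import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.Await
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object AsyncActions extends App {
  val conf = new SparkConf().setMaster("local[*]").setAppName("async-actions")
  val sc = new SparkContext(conf)

  val rdd = sc.parallelize(List(32, 34, 2, 3, 4, 54, 3), 4)
  val rddCount = sc.parallelize(List(434, 3, 2, 43), 4)

  // Both actions are submitted immediately; neither blocks the other
  val itemsFuture = rdd.collectAsync()
  val countFuture = rddCount.countAsync()

  itemsFuture.foreach(items => println("Items in the list: " + items.mkString(", ")))
  countFuture.foreach(count => println("Count: " + count))

  // Wait for both jobs to finish before tearing down the context
  Await.ready(itemsFuture, 1.minute)
  Await.ready(countFuture, 1.minute)
  sc.stop()
}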

Scala Days 2015 Pulse – Sentiment Analysis of the Sessions

Reading Time: 2 minutes As you might have read, the Pulse application from Knoldus created a flutter at Scala Days 2015. In this post, let us dig deeper into some of the findings. We are already getting some cool requests to open source the code, which we will do soon; we would just like to strip out the dictionaries that we have used. We collected 5K+ tweets over the period of Continue Reading

Tuning an Apache Spark application with speculation

Reading Time: 2 minutes What happens if a Spark job runs slowly? That is a big question for application performance, and speculation is one way to optimize jobs in Spark. Speculation basically launches a copy of a slow task on another worker if the existing one is lagging. It does not stop the slow execution; both workers execute the task simultaneously. To make our job speculative we need to set Continue Reading
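
The keys involved are standard Spark configuration; a minimal sketch follows (the threshold values are illustrative, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

object SpeculativeApp extends App {
  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("speculative-app")
    .set("spark.speculation", "true")            // enable speculative execution
    .set("spark.speculation.interval", "100ms")  // how often to check for slow tasks
    .set("spark.speculation.multiplier", "1.5")  // how much slower than the median a task must be
    .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before checks begin

  val sc = new SparkContext(conf)
  // ... run jobs as usual; straggling tasks may now be re-launched elsewhere
  sc.stop()
}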

Start/Deploy an Apache Spark application programmatically using SparkLauncher

Reading Time: < 1 minute Sometimes we need to start our Spark application from another Scala/Java application, and for that we can use SparkLauncher. Here we have an example in which we build a Spark application and run it from another Scala application. Let's see our Spark application code:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object SparkApp extends App {
  val conf = new SparkConf().setMaster("local[*]").setAppName("spark-app")
  val sc = new SparkContext(conf)
  val rdd = sc.parallelize(Array(2, 3, 2, 1))
  rdd.saveAsTextFile("result")
  sc.stop()
}

This is our simple Spark Continue Reading
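
The launcher side of the example sits behind the Continue Reading link; a minimal sketch of how it could look with org.apache.spark.launcher.SparkLauncher (the Spark home and jar path are assumptions for illustration):

import org.apache.spark.launcher.SparkLauncher

object Launcher extends App {
  // The paths below are placeholders; point them at your own installation and build
  val process = new SparkLauncher()
    .setSparkHome("/opt/spark")                         // assumed Spark installation
    .setAppResource("target/scala-2.11/spark-app.jar")  // assumed application jar
    .setMainClass("SparkApp")
    .setMaster("local[*]")
    .launch()                                           // returns a java.lang.Process

  // Block until the launched Spark application exits
  process.waitFor()
}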

Stateful transformation on DStream in Apache Spark with an example of word count

Reading Time: 2 minutes Sometimes we have a use case in which we need to maintain the state of a paired DStream so that we can use it in the next DStream. So we are taking the example of a stateful word count over socketTextStream. In the plain word count example, if the word "xyz" comes twice in the first DStream or window, it is reduced to the value 2, but that state is lost in the next DStream Continue Reading
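
The standard way to carry that state across batches is updateStateByKey; a minimal sketch follows (the host, port, and checkpoint path are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount extends App {
  val conf = new SparkConf().setMaster("local[2]").setAppName("stateful-word-count")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("checkpoint")  // updateStateByKey requires a checkpoint directory

  // Merge this batch's counts into the running total for each word
  val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) =>
    Some(newValues.sum + runningCount.getOrElse(0))

  val counts = ssc.socketTextStream("localhost", 9999)
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .updateStateByKey(updateFunc)

  counts.print()
  ssc.start()
  ssc.awaitTermination()
}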

Shuffling and repartitioning of RDDs in Apache Spark

Reading Time: 3 minutes To write an optimized Spark application you should use transformations and actions carefully; the wrong transformation or action can make your application slow. So when you are writing an application, some points should be remembered to make it more optimized. 1. Number of partitions when creating an RDD: By default, Spark creates one partition for each block of the file in HDFS; it is Continue Reading
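
As a quick illustration of the partitioning knobs the post goes on to discuss (the sizes here are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

object PartitionDemo extends App {
  val conf = new SparkConf().setMaster("local[*]").setAppName("partition-demo")
  val sc = new SparkContext(conf)

  // The number of partitions can be fixed up front when creating the RDD
  val rdd = sc.parallelize(1 to 1000, 8)
  println("Initial partitions: " + rdd.partitions.length)

  // repartition always shuffles and can grow or shrink the partition count
  println("After repartition: " + rdd.repartition(16).partitions.length)

  // coalesce avoids a full shuffle when only reducing the count
  println("After coalesce: " + rdd.coalesce(4).partitions.length)

  sc.stop()
}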

Set up an Apache Spark cluster on your single standalone machine

Reading Time: 2 minutes If we want to make a cluster on a standalone machine, we need to set up some configuration. We will be using the launch scripts that are provided by Spark, but first there are a couple of configurations to set. First of all, set up a Spark environment: open the following file, or create it from the template if it is not available Continue Reading
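
A minimal sketch of that configuration, assuming a Unix-style Spark installation (the worker counts and sizes below are illustrative): copy conf/spark-env.sh.template to conf/spark-env.sh and set, for example,

export SPARK_WORKER_INSTANCES=2  # run two workers on this one machine
export SPARK_WORKER_CORES=2      # cores each worker may use
export SPARK_WORKER_MEMORY=1g    # memory each worker may use

then bring the cluster up with the bundled launch scripts: ./sbin/start-master.sh followed by ./sbin/start-slaves.sh.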