Apache Spark

Streaming with Apache Spark Custom Receiver

Reading Time: 2 minutes Hello inquisitor. In the previous blog we looked at Spark's predefined stream receivers. In this blog we are going to discuss Spark's custom receivers, so that we can source data from any source. If we want to use a custom receiver, we should first know that we are not going to use SparkSession as the entry point; if there are Continue Reading

Introduction To Hadoop Map Reduce

Reading Time: 4 minutes In this blog we will read about Hadoop MapReduce. As we all know, faster processing requires processing the data in parallel, and that is exactly what Hadoop MapReduce provides. MapReduce: MapReduce is a programming model for data processing. MapReduce programs are inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal. MapReduce works Continue Reading
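The map/shuffle/reduce flow the excerpt describes can be sketched in plain Scala (a toy illustration, not the Hadoop API): the map phase emits a (word, 1) pair per word, and the reduce phase groups by word and sums the counts.

```scala
// Toy word count illustrating the MapReduce model in plain Scala.
// Map phase: emit a (word, 1) pair for every word in a line.
def mapPhase(line: String): Seq[(String, Int)] =
  line.split("\\s+").filter(_.nonEmpty).map(w => (w.toLowerCase, 1)).toSeq

// Reduce phase: group the pairs by word and sum the counts per word.
def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
  pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

val counts = reducePhase(Seq("Hello world", "hello hadoop").flatMap(mapPhase))
// counts("hello") == 2, counts("world") == 1, counts("hadoop") == 1
```

In real Hadoop the shuffle between the two phases is done by the framework across machines; `groupBy` stands in for it here.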

Apache Spark 2.0 with Hive

Reading Time: < 1 minute Hello geeks, we have already discussed how to start programming with Spark in Scala. In this blog we will discuss how to use Hive with Spark 2.0. When you start to work with Hive, you first need HiveContext (which inherits SqlContext), along with core-site.xml, hdfs-site.xml and hive-site.xml for Spark. If you don't configure hive-site.xml, the context automatically creates metastore_db in the Continue Reading
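A minimal hive-site.xml of the kind the post refers to might look like the sketch below; the warehouse path is an assumption and should point at your own cluster. `hive.metastore.warehouse.dir` is the standard Hive property for where managed table data lives.

```xml
<configuration>
  <!-- Where managed table data lives; the HDFS path here is an assumption -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://localhost:9000/user/hive/warehouse</value>
  </property>
</configuration>
```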

Meetup: Stream Processing Using Spark & Kafka

Reading Time: < 1 minute Knoldus organized a Meetup on Friday, 9 September 2016. The topics covered in this meetup were: Overview of Spark Streaming, Fault-tolerance Semantics & Performance Tuning, and Spark Streaming Integration with Kafka. The Meetup code sample is available here, and the real-time stream processing engine application code is available here

Scala, Couchbase, Spark and Akka-http: A combinatory tutorial for starters

Reading Time: 5 minutes Couchbase and Apache Spark are among the best so far for in-memory computation. I am using akka-http because it's new in the business. If you are not a big fan of akka-http and don't think it is yet ready for production, then you can take a look at this blog, which shows how to do the same task using Spray. If you are new to all Continue Reading

Apache Ignite

Sharing RDD’s states across Spark applications with Apache Ignite

Reading Time: 4 minutes Apache Ignite offers an abstraction over native Spark RDDs such that the state of RDDs can be shared across Spark jobs, workers, and applications, which is not possible with native Spark RDDs. In this blog, we will walk through the steps to share RDDs between two Spark applications. Preparing Ingredients: To test the Apache Ignite with Apache Spark application, we need at least one master Continue Reading

Controlling RDD Partitions in Apache Spark

Reading Time: 2 minutes In this blog, we will discuss what RDD partitioning is, why partitioning is important, and how to create and use Spark partitioners to minimize shuffle operations across the nodes of a distributed Spark application. What is Partitioning? Partitioning is a transformation operation which is available on all key-value pair RDDs in Apache Spark. It is required when we try to group values on the basis Continue Reading
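As a rough sketch of what a hash partitioner does (plain Scala mirroring the behavior of Spark's HashPartitioner, not the Spark API itself): a key's hashCode, taken modulo the partition count, decides which partition a record lands in, so equal keys always land together and grouping them needs no shuffle.

```scala
// Sketch of hash partitioning: the same key always maps to the same
// partition, which is what lets Spark group a key without moving data.
def hashPartition(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw // Scala's % can be negative
}

hashPartition("spark", 8) // always the same partition index for "spark"
```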

Is using Accumulators really worth it? Apache Spark

Reading Time: 2 minutes Before jumping right into the topic, you must know what Accumulators are; for that, you can refer to this blog. Now that we know the what and why of Accumulators, let's jump to the main point. Description: Spark automatically deals with failed or slow machines by re-executing failed or slow tasks. Example: if the node running a partition of a map() operation crashes, Spark will rerun it Continue Reading

Email spam detection using Apache Spark MLlib

Reading Time: 2 minutes In this blog we will see a real use case of Spark MLlib: email spam detection. Using the Apache Spark MLlib component, we will detect whether an email should go to the spam folder or the primary folder. Now let's jump into the programming and see how to implement it. First we will load the training data from the spam dataset Continue Reading
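Spam classifiers of this kind usually turn each email into a numeric feature vector first. The plain-Scala sketch below shows the idea behind MLlib's HashingTF (an assumption about the post's approach, not its actual code): every word hashes into a bucket of a fixed-size vector of term frequencies.

```scala
// Illustration of the feature-hashing (HashingTF-style) idea in plain Scala:
// each word is hashed to a bucket; the vector stores bucket frequencies.
def hashingTF(words: Seq[String], numFeatures: Int): Array[Double] = {
  val vec = Array.fill(numFeatures)(0.0)
  for (w <- words) {
    val raw = w.hashCode % numFeatures
    val idx = if (raw < 0) raw + numFeatures else raw
    vec(idx) += 1.0 // count how often this bucket's words occur
  }
  vec
}

val features = hashingTF("free money free prizes".split(" ").toSeq, 16)
// features.sum == 4.0 (one increment per word)
```

A classifier such as logistic regression is then trained on these vectors, one per email, with a spam/ham label.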

Introduction to Accumulators: Apache Spark

Reading Time: 2 minutes What's the Problem: Functions like map() and filter() can use variables defined outside them in the driver program, but each task running on the cluster gets a new copy of each variable, and updates to these copies are not propagated back to the driver. The Solution: Spark provides two types of shared variables: 1. Accumulators 2. Broadcast variables. Here we are Continue Reading
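The problem can be mimicked in plain Scala (no Spark here, just a sketch of the semantics): each simulated task works on its own copy of the counter, so the driver-side variable never changes.

```scala
// Plain-Scala sketch of why closures over driver variables lose updates:
// every "task" gets its own copy of the counter, mutates the copy, and
// the driver-side variable is left untouched.
var driverCounter = 0
val partitions = Seq(Seq(1, 2), Seq(3, 4))

partitions.foreach { partition =>
  var taskCounter = driverCounter // the task's private copy
  taskCounter += partition.size   // the update stays local to the task
}

// driverCounter is still 0 -- exactly the gap accumulators close.
```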

Broadcast variables in Spark: how and when to use them?

Reading Time: 2 minutes As the documentation for Spark broadcast variables states, they are immutable shared variables which are cached on each worker node of a Spark cluster. In this blog, we will demonstrate a simple use case of broadcast variables. When to use a Broadcast variable? Think of a problem such as counting grammar elements in any random English paragraph, document, or file. Suppose you have a Map of each word as a specific Continue Reading
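For the grammar-element example, the shared read-only Map might look like the sketch below (plain Scala; the word-to-part-of-speech entries are made up for illustration). In real Spark you would wrap the Map with sc.broadcast(...) once and read bc.value inside the task closure, instead of letting each task ship its own copy.

```scala
// Hypothetical read-only lookup table that every task needs; in Spark this
// is the value you would broadcast once rather than send with each task.
val posTags: Map[String, String] =
  Map("quick" -> "adjective", "fox" -> "noun", "jumps" -> "verb")

val words = Seq("the", "quick", "brown", "fox", "jumps")

// Count grammar elements by looking each word up in the shared Map.
val counts = words.flatMap(posTags.get).groupBy(identity).map {
  case (tag, occurrences) => (tag, occurrences.size)
}
// counts == Map("adjective" -> 1, "noun" -> 1, "verb" -> 1)
```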

UDF overloading in Spark

Reading Time: 2 minutes UDFs are User Defined Functions which are registered with the Hive context to use custom functions in Spark SQL queries. For example, if you want to prepend some string to any other string or column, then you can create the following UDF: def addSymbol(input: String, symbol: String) = { symbol + input } Now, to register the above function in hiveContext, we register the UDF as follows: hiveContext.udf.register("addSymbol", (input: String, symbol: String) => addSymbol(input, symbol)) Now you can use Continue Reading
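The addSymbol function from the excerpt is plain Scala and can be tried on its own; the registration call is shown only as a comment because it assumes a Spark/Hive context on the classpath, exactly as in the post.

```scala
// The post's UDF as a plain Scala function: prepend a symbol to a value.
def addSymbol(input: String, symbol: String): String = symbol + input

addSymbol("100", "$") // "$100"

// With a HiveContext available (as in the post), you would register it:
// hiveContext.udf.register("addSymbol",
//   (input: String, symbol: String) => addSymbol(input, symbol))
// and then call addSymbol(...) inside Spark SQL query strings.
```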