Apache Spark

Neo4j With Scala: Awesome Experience with Spark

Reading Time: 4 minutes Let's start our journey again in this series. In the last blog we discussed data migration from other databases to Neo4j. Now we will discuss how we can combine Neo4j with Spark. Before starting, here is a recap: Getting Started Neo4j with Scala: An Introduction, Neo4j with Scala: Defining User Defined Procedures and APOC, Neo4j with Continue Reading
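
As a taste of what the full post covers, here is a minimal sketch of pulling Neo4j data into a Spark RDD, assuming the neo4j-spark-connector is on the classpath; the bolt URL, credentials, and the Person/name Cypher query are illustrative placeholders, not the post's actual code:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.neo4j.spark._

object Neo4jSparkSketch extends App {
  // The connector reads bolt settings from the Spark config (illustrative values)
  val conf = new SparkConf()
    .setAppName("neo4j-spark-sketch")
    .setMaster("local[*]")
    .set("spark.neo4j.bolt.url", "bolt://localhost:7687")
    .set("spark.neo4j.bolt.user", "neo4j")
    .set("spark.neo4j.bolt.password", "password")
  val sc = new SparkContext(conf)

  // Run a Cypher query and read the result back as an RDD of Rows
  val names = Neo4j(sc)
    .cypher("MATCH (p:Person) RETURN p.name AS name")
    .loadRowRdd
    .map(_.getString(0))

  names.take(10).foreach(println)
  sc.stop()
}
```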

Spark – IoT : Combining Big Data Analysis with IoT

Reading Time: 3 minutes Welcome back, folks! Time for a new gig! I think the last series, Scala-IoT, was pretty amazing and got an overwhelming response from you all, which pumped up the idea of this new web series, Spark-IoT. So let's get started. What was the motivation? I have been active in the IoT community here, and I found Continue Reading

Streaming with Apache Spark Custom Receiver

Reading Time: 2 minutes Hello, inquisitor. In the previous blog we looked at the predefined stream receivers of Spark. In this blog we are going to discuss the custom receiver of Spark, so that we can source the data from any source. If we want to use a custom receiver, we should first know that we are not going to use SparkSession as the entry point; if there are Continue Reading
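
A minimal sketch of such a custom receiver, modeled on Spark Streaming's Receiver API; the socket source, host, and port are placeholder assumptions:

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// A custom receiver that reads newline-delimited text from a socket
class SocketLineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receive on a separate thread so onStart() returns immediately
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = () // the receive thread stops itself via isStopped

  private def receive(): Unit =
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)               // hand each record over to Spark
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case e: Throwable => restart("Error receiving data", e)
    }
}

object CustomReceiverApp extends App {
  // Note: StreamingContext, not SparkSession, is the entry point here
  val conf = new SparkConf().setAppName("custom-receiver").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(5))
  val lines = ssc.receiverStream(new SocketLineReceiver("localhost", 9999))
  lines.print()
  ssc.start()
  ssc.awaitTermination()
}
```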

Introduction To Hadoop MapReduce

Reading Time: 4 minutes In this blog we will be reading about Hadoop MapReduce. As we all know, to achieve faster processing we need to process the data in parallel. That is what Hadoop MapReduce provides us. MapReduce: MapReduce is a programming model for data processing. MapReduce programs are inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal. MapReduce works Continue Reading
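
To make the programming model concrete, here is a hedged word-count sketch against the Hadoop MapReduce API, written in Scala for consistency with the rest of this page; the class names are illustrative, and a Job driver would still be needed to wire the two phases together:

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.collection.JavaConverters._

// Map phase: emit (word, 1) for every token in the input line
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)
    }
}

// Reduce phase: sum the counts emitted for each word
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    val sum = values.asScala.map(_.get).sum
    context.write(key, new IntWritable(sum))
  }
}
```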

Apache Spark 2.0 with Hive

Reading Time: 1 minute Hello geeks, we have already discussed how to start programming with Spark in Scala. In this blog we will discuss how we can use Hive with Spark 2.0. When you start to work with Hive, you first need HiveContext (which inherits from SQLContext), core-site.xml, hdfs-site.xml, and hive-site.xml for Spark. If you don't configure hive-site.xml, the context automatically creates metastore_db in the Continue Reading
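
For reference, in Spark 2.0 the HiveContext route is typically replaced by a SparkSession with Hive support enabled; a minimal sketch, assuming the three XML files are on the classpath and with an illustrative table name:

```scala
import org.apache.spark.sql.SparkSession

object SparkHiveSketch extends App {
  // Spark 2.0: SparkSession with Hive support subsumes the old HiveContext
  // (core-site.xml, hdfs-site.xml and hive-site.xml should be on the classpath)
  val spark = SparkSession.builder()
    .appName("spark-hive-sketch")
    .master("local[*]")
    .enableHiveSupport()
    .getOrCreate()

  // Illustrative Hive DDL and query
  spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
  spark.sql("SHOW TABLES").show()

  spark.stop()
}
```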

Meetup: Stream Processing Using Spark & Kafka

Reading Time: 1 minute Knoldus organized a meetup on Friday, 9 September 2016. The topics covered in this meetup were: an overview of Spark Streaming, fault-tolerance semantics and performance tuning, and Spark Streaming integration with Kafka. The meetup code sample is available here, and the real-time stream processing engine application code is available here.

Scala, Couchbase, Spark and Akka-http: A combinatory tutorial for starters

Reading Time: 5 minutes Couchbase and Apache Spark are among the best options so far for in-memory computation. I am using akka-http because it is new in the business. If you are not a big fan of akka-http and don't think it is yet ready for production, you can take a look at this blog, which shows how to do the same task using Spray. If you are new to all Continue Reading
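
As a flavour of the stack, here is a minimal akka-http endpoint of the kind such a tutorial would wire up; the route, port, and actor system name are illustrative, and the Couchbase/Spark plumbing is omitted:

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer

object HttpServiceSketch extends App {
  implicit val system       = ActorSystem("demo-system")
  implicit val materializer = ActorMaterializer()

  // A single illustrative route; the real tutorial backs this with Couchbase and Spark
  val route = path("ping") {
    get {
      complete("pong")
    }
  }

  Http().bindAndHandle(route, "localhost", 8080)
  println("Server running at http://localhost:8080/ping")
}
```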

Apache Ignite

Sharing RDD states across Spark applications with Apache Ignite

Reading Time: 4 minutes Apache Ignite offers an abstraction over native Spark RDDs such that the state of an RDD can be shared across Spark jobs, workers, and applications, which is not possible with native Spark RDDs. In this blog, we will walk through the steps for sharing RDDs between two Spark applications. Preparing Ingredients: to test the Apache Ignite with Apache Spark application we need at least one master Continue Reading
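
A rough sketch of the sharing mechanism, assuming the ignite-spark module and Ignite's IgniteContext/IgniteRDD API; the Spring config path and cache name are placeholders:

```scala
import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
import org.apache.spark.{SparkConf, SparkContext}

object SharedRddWriter extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("ignite-writer").setMaster("local[*]"))

  // IgniteContext starts (or connects to) Ignite nodes alongside the Spark workers
  val igniteContext = new IgniteContext(sc, "config/example-shared-rdd.xml")

  // The IgniteRDD is a live view over the named Ignite cache
  val sharedRDD: IgniteRDD[Int, Int] = igniteContext.fromCache("sharedNumbers")

  // State written here stays in Ignite and is visible to other Spark applications
  sharedRDD.savePairs(sc.parallelize(1 to 1000).map(i => (i, i * i)))

  igniteContext.close(false)
  sc.stop()
}
```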

Controlling RDD Partitions in Apache Spark

Reading Time: 2 minutes In this blog, we will discuss what RDD partitioning is, why partitioning is important, and how to create and use Spark partitioners to minimize shuffle operations across the nodes of a distributed Spark application. What is Partitioning? Partitioning is a transformation operation which is available on all key-value pair RDDs in Apache Spark. It is required when we try to group values on the basis Continue Reading
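
A small illustration of the idea, using Spark's built-in HashPartitioner; the data and partition count are made up for the example:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitioningSketch extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("partitioning-sketch").setMaster("local[*]"))

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

  // Hash-partition the pair RDD into 4 partitions and cache that layout;
  // key-based operations that reuse the same partitioner avoid a reshuffle
  val partitioned = pairs.partitionBy(new HashPartitioner(4)).persist()

  println(s"partitioner = ${partitioned.partitioner}")  // Some(HashPartitioner)
  println(partitioned.reduceByKey(_ + _).collect().mkString(", "))

  sc.stop()
}
```

Because reduceByKey sees that the RDD is already partitioned by key, it can aggregate within each partition without moving data across the network.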

Is using Accumulators really worth it? : Apache Spark

Reading Time: 2 minutes Before jumping right into the topic, you must know what Accumulators are; for that you can refer to this blog. Now that we know the what and why of Accumulators, let's jump to the main point. Description: Spark automatically deals with failed or slow machines by re-executing failed or slow tasks. Example: if the node running a partition of a map() operation crashes, Spark will rerun it Continue Reading
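
That re-execution behaviour is exactly why accumulator updates made inside transformations can over-count; a small sketch of the pitfall, with illustrative data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorCaveat extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("accumulator-caveat").setMaster("local[*]"))

  val blankLines = sc.longAccumulator("blankLines")
  val lines = sc.parallelize(Seq("hello", "", "world", ""))

  // Inside a transformation, the update may be applied more than once
  // if Spark re-executes a failed or slow (speculative) task, or if the
  // lineage is simply evaluated again
  val tagged = lines.map { line =>
    if (line.isEmpty) blankLines.add(1)
    line
  }

  tagged.count()
  tagged.count() // re-running the uncached lineage updates the accumulator again
  println(s"blank lines counted: ${blankLines.value}") // 4, not 2

  sc.stop()
}
```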

Email spam detection using Apache Spark MLlib

Reading Time: 2 minutes In this blog we will see a real use case of Spark MLlib: email spam detection. With the help of the Apache Spark MLlib component, we will detect whether an email should go to the spam folder or the primary folder. So now let's jump into the programming and see how to implement it. First we will load the training data from the spam dataset Continue Reading
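
A hedged sketch of the classic RDD-based approach (HashingTF features plus logistic regression); the file paths, feature count, and test message are assumptions, not the post's actual dataset:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

object SpamDetectionSketch extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("spam-detection").setMaster("local[*]"))

  // Hypothetical input files: one message per line
  val spam = sc.textFile("data/spam.txt")
  val ham  = sc.textFile("data/ham.txt")

  // Map each message to a sparse term-frequency vector
  val tf = new HashingTF(numFeatures = 10000)
  val spamPoints = spam.map(m => LabeledPoint(1.0, tf.transform(m.split(" "))))
  val hamPoints  = ham.map(m => LabeledPoint(0.0, tf.transform(m.split(" "))))

  val training = spamPoints.union(hamPoints).cache()
  val model = LogisticRegressionWithSGD.train(training, numIterations = 100)

  // 1.0 = spam folder, 0.0 = primary folder
  val test = tf.transform("cheap money offer click now".split(" "))
  println(s"prediction: ${model.predict(test)}")

  sc.stop()
}
```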

Introduction to Accumulators : Apache Spark

Reading Time: 2 minutes What's the Problem: Functions like map() and filter() can use variables defined outside them in the driver program, but each task running on the cluster gets a new copy of each variable, and updates from these copies are not propagated back to the driver. The Solution: Spark provides two types of shared variables: 1. Accumulators 2. Broadcast variables. Here we are Continue Reading
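
A compact sketch showing both kinds of shared variable side by side; the data and stop-word set are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("shared-variables").setMaster("local[*]"))

  val data = sc.parallelize(Seq("spark", "", "scala", ""))

  // 1. Accumulator: tasks can only add to it; the driver reads the total
  val emptyCount = sc.longAccumulator("emptyCount")

  // 2. Broadcast variable: one read-only copy shipped to each executor
  val stopWords = sc.broadcast(Set("", "the", "a"))

  val kept = data.filter { word =>
    if (word.isEmpty) emptyCount.add(1)
    !stopWords.value.contains(word)
  }

  println(kept.collect().mkString(", "))               // spark, scala
  println(s"empty strings seen: ${emptyCount.value}")  // 2

  sc.stop()
}
```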
