Apache Spark

Streaming in Spark, Flink and Kafka

Reading Time: 7 minutes There is a lot of buzz around when to use Spark, when to use Flink, and when to use Kafka. Both Spark Streaming and Flink provide an exactly-once guarantee that every record will be processed exactly once, thereby eliminating any duplicates that might otherwise appear. Both provide very high throughput compared to other processing systems like Storm, and the overhead of Continue Reading
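As a quick taste of the exactly-once machinery being compared, here is a minimal Spark Structured Streaming sketch with checkpointing enabled; the socket source and checkpoint path are hypothetical stand-ins, not taken from the post.

```scala
import org.apache.spark.sql.SparkSession

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("exactly-once-sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical source: lines arriving on a local socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    // The checkpoint location lets Spark replay from a consistent
    // offset after a failure, which (together with an idempotent or
    // transactional sink) is the basis of the exactly-once guarantee.
    counts.writeStream
      .outputMode("complete")
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/wordcount")
      .start()
      .awaitTermination()
  }
}
```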

Apache Spark: Reading CSV using custom timestamp format

Reading Time: 3 minutes In this blog, we consider a situation where I wanted to read a CSV through Spark, but the CSV contains some timestamp columns. Is this going to be a problem while inferring the schema at the time of reading the CSV using Spark? Well, the answer may be no, if the CSV has the timestamp field in the specific yyyy-MM-dd hh:mm:ss format. In this particular case, the Spark CSV reader can Continue Reading
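For timestamps that do not follow the default pattern, Spark's CSV reader accepts a timestampFormat option. A minimal sketch; the file path and the dd/MM/yyyy pattern are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-timestamp-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical file whose timestamps look like 25/01/2017 10:30:45.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  // Tell the reader how to parse the non-default timestamp column.
  .option("timestampFormat", "dd/MM/yyyy HH:mm:ss")
  .csv("/path/to/data.csv")

df.printSchema() // the timestamp column should now infer as TimestampType
```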

Apache Spark: 3 Reasons Why You Should Not Use RDDs

Reading Time: 4 minutes Apache Spark: whenever we hear these two words, the first thing that comes to mind is RDD, i.e., Resilient Distributed Datasets. Now, it has been more than 5 years since Apache Spark came into existence, and after its arrival a lot changed in the big data industry. But the major change was the dethroning of Hadoop MapReduce; Spark literally replaced MapReduce, and this Continue Reading
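One of the usual arguments here is that the structured APIs give the Catalyst optimizer and Tungsten something to work with, while RDD lambdas are opaque to Spark. A small illustrative sketch with hypothetical data, not taken from the post:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rdd-vs-dataset-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)

// RDD style: opaque lambdas, no optimizer help.
val rdd = spark.sparkContext
  .parallelize(Seq(Person("a", 25), Person("b", 35)))
val adultsRdd = rdd.filter(_.age > 30)

// Dataset style: the same logic, but expressed so Catalyst can
// optimize the plan and Tungsten can manage memory efficiently.
val ds = Seq(Person("a", 25), Person("b", 35)).toDS()
val adultsDs = ds.filter($"age" > 30)
```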

Dealing With Deltas In Amazon Redshift

Reading Time: 5 minutes Hi, in this blog I would like to discuss a scenario: implementing deltas in Amazon Redshift using spark-redshift. Before that, I would like to make you aware of Amazon Redshift, the spark-redshift library, and the integration of Spark with Redshift. It is assumed that you have a fair knowledge of programming in Apache Spark and Spark SQL. You may refer to the documentation links Continue Reading
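For orientation, here is a minimal sketch of reading from and writing back to Redshift through the spark-redshift data source; the JDBC URL, table name, and S3 temp directory are placeholders, and the actual delta computation is elided:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("redshift-delta-sketch")
  .getOrCreate()

// Hypothetical connection details.
val jdbcUrl = "jdbc:redshift://host:5439/db?user=user&password=pass"

// Read the current state of a table via the spark-redshift library.
val current = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", jdbcUrl)
  .option("dbtable", "events")
  .option("tempdir", "s3n://bucket/tmp") // staging area for UNLOAD/COPY
  .load()

// ... compute the delta against newly arrived data here ...

// Write only the changed rows back (here `current` stands in
// for the delta DataFrame you would actually compute).
current.write
  .format("com.databricks.spark.redshift")
  .option("url", jdbcUrl)
  .option("dbtable", "events")
  .option("tempdir", "s3n://bucket/tmp")
  .mode(SaveMode.Append)
  .save()
```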

Apache Spark: Handle null timestamp while reading CSV in Spark 2.0.0

Reading Time: 2 minutes Hello folks, hope you are all doing well! In this blog, I will discuss a problem I faced some days back. One thing to keep in mind is that this problem is specific to Spark version 2.0.0; it does not occur in other versions. Problem: the Spark code was reading a CSV file. This particular CSV file had one timestamp column that might Continue Reading
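The teaser cuts off before the fix, but one common workaround (not necessarily the one the post uses) is to read the fragile column as a plain string and cast it afterwards:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("null-timestamp-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical schema: read the problematic column as StringType
// instead of letting Spark 2.0.0 infer it as a timestamp.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("created_at", StringType, nullable = true)
))

val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("/path/to/data.csv")

// Cast after the fact; unparseable or empty values become null timestamps.
val fixed = df.withColumn("created_at", col("created_at").cast(TimestampType))
```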

Apache Spark

Getting Started with Apache Spark

Reading Time: 2 minutes Apache Spark is an open-source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab and open sourced in 2010; it later became a top-level Apache project. Spark has several advantages over other big data and MapReduce technologies like Hadoop and Storm. Apache Spark is an improvement on the original Hadoop MapReduce Continue Reading
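If you want to try Spark right away, the classic first program is a word count. A minimal sketch with a hypothetical input path:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("word-count").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical input file; counts every word across the file.
    val counts = sc.textFile("/path/to/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```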

Introduction to Hadoop!

Reading Time: 4 minutes Here I am going to write a blog on Hadoop! “Big data is not about the data! The value in big data [is in] the analytics.” – Harvard Prof. Gary King. So Hadoop comes into the picture! Hadoop is an open-source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache Continue Reading

Apache Spark: Spark union adds up the partitions of input RDDs

Reading Time: 2 minutes Some days back, when I was doing a union of two pair RDDs, I found strange behavior in the number of partitions: the output RDD got a different number of partitions than the input RDDs. For example, suppose rdd1 and rdd2 each have 2 partitions; after a union of these RDDs, I was expecting the same number of partitions for the output RDD, but the output RDD got the Continue Reading
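The behavior is easy to reproduce; a small sketch with two 2-partition RDDs:

```scala
// Paste into spark-shell, where sc is already provided.
val rdd1 = sc.parallelize(Seq(("a", 1), ("b", 2)), 2) // 2 partitions
val rdd2 = sc.parallelize(Seq(("c", 3), ("d", 4)), 2) // 2 partitions

// union concatenates the partitions instead of merging them,
// so the output has 2 + 2 = 4 partitions.
// (If both RDDs shared the same partitioner, Spark could instead
// keep the original partition count.)
val unioned = rdd1.union(rdd2)
println(unioned.getNumPartitions) // 4
```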


Reading data from different sources using Spark 2.1

Reading Time: 2 minutes Hi all, in this blog we’ll discuss fetching data from different sources using Spark 2.1, such as CSV, JSON, text, and Parquet files. So first of all, let’s discuss what’s new in Spark 2.1. In previous versions of Spark, you had to create a SparkConf and SparkContext to interact with Spark, whereas in Spark 2.1 the same effect can be achieved through SparkSession, without explicitly Continue Reading
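A quick sketch of what that looks like in practice: one SparkSession, four readers. All paths are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// In Spark 2.x a single SparkSession replaces SparkConf + SparkContext
// as the entry point.
val spark = SparkSession.builder()
  .appName("multi-source-sketch")
  .master("local[*]")
  .getOrCreate()

// Each reader returns a DataFrame.
val csvDf     = spark.read.option("header", "true").csv("/data/people.csv")
val jsonDf    = spark.read.json("/data/people.json")
val textDf    = spark.read.text("/data/notes.txt")
val parquetDf = spark.read.parquet("/data/people.parquet")
```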

Spark Cassandra Connector On Spark-Shell

Reading Time: 2 minutes Using the Spark-Cassandra-Connector on the Spark shell. Hi all, in this blog we will see how we can execute our Spark code against Cassandra on the Spark shell. This is very handy for testing or learning, where we want to execute our code on the Spark shell rather than in an IDE. Here we will use Spark version 1.6.2; you can download Continue Reading
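A rough sketch of the workflow; the connector version, host, keyspace, and table names are illustrative placeholders:

```scala
// Launch the shell with the connector on the classpath, e.g.:
// spark-shell --packages datastax:spark-cassandra-connector:1.6.2-s_2.10 \
//             --conf spark.cassandra.connection.host=127.0.0.1
// In spark-shell, sc is already provided.
import com.datastax.spark.connector._

// Read a (hypothetical) table as an RDD of CassandraRow.
val rows = sc.cassandraTable("test_keyspace", "users")
println(rows.count())

// Write a pair RDD back to a pre-created table.
val data = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
data.saveToCassandra("test_keyspace", "users", SomeColumns("name", "age"))
```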

Application compatibility for different Spark versions

Reading Time: 3 minutes Recently, Spark version 2.1 was released, and there is a significant difference between the two versions. Spark 1.6 has DataFrame and SparkContext, while 2.1 has Dataset and SparkSession. Now the question arises: how do we write code so that both versions of Spark are supported? Fortunately, Maven provides the feature of building your application with different profiles. In this blog I will show you how to Continue Reading
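The idea, sketched below with hypothetical coordinates and versions, is to bind each Spark version to a Maven profile and select one at build time:

```xml
<!-- Hypothetical pom.xml fragment: one profile per Spark version. -->
<profiles>
  <profile>
    <id>spark-1.6</id>
    <properties>
      <spark.version>1.6.3</spark.version>
    </properties>
  </profile>
  <profile>
    <id>spark-2.1</id>
    <properties>
      <spark.version>2.1.0</spark.version>
    </properties>
  </profile>
</profiles>

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>${spark.version}</version>
  </dependency>
</dependencies>
```

You would then build against one version or the other with mvn package -P spark-1.6 or mvn package -P spark-2.1.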

Twitter’s tweets analysis using Lambda Architecture

Reading Time: 3 minutes Hello folks, in this blog I will explain the analysis of Twitter’s tweets with the Lambda Architecture. So first we need to understand what the Lambda Architecture is, along with its components and usage. According to Wikipedia, the Lambda Architecture is a data processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. Now let us look at the Lambda Architecture components in detail.
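Purely as a schematic illustration of the two layers (not the post’s code): a batch layer could periodically recompute hashtag counts from the master dataset, while a speed layer counts over tweets still arriving. Paths and the socket source are hypothetical stand-ins for a real tweet feed.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("lambda-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Batch layer: recompute a view from the (hypothetical) master dataset,
// assumed to have a "hashtag" column.
val batchView = spark.read.parquet("/data/tweets/master")
  .groupBy("hashtag")
  .count()

// Speed layer: incremental hashtag counts over tweets still arriving.
val live = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]
  .flatMap(_.split("\\s+").filter(_.startsWith("#")))
  .toDF("hashtag")
  .groupBy("hashtag")
  .count()

// A serving layer would merge batchView with these live counts;
// here we just print the streaming side.
live.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```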