Apache Spark

Deep Dive into Spark Cluster Managers

Reading Time: 5 minutes This blog digs into the different cluster management modes in which you can run your Spark application. Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers, which allocate resources across Continue Reading

Having an Issue with How to Order a Streamed DataFrame?

Reading Time: 3 minutes A few days ago, I had to perform an aggregation on a streaming DataFrame. The moment I applied groupBy for the aggregation, the data got shuffled. So the question arises: how do you maintain order? Yes, I can use orderBy on a streaming DataFrame in Spark Structured Streaming, but only in complete mode. There is no way to order streaming data in append mode or update mode. I Continue Reading

Difference between RDD, DF and DS in Spark

Reading Time: 3 minutes In this blog I try to cover the differences between RDD, DF and DS. Many of you may be a little confused about RDD, DF and DS, but don’t worry: after this blog everything will be clear. As of the Spark 2.0 release, there are three data abstractions that Spark officially provides: RDD, DataFrame and Dataset. So let’s start the discussion. Continue Reading

Integrating Kafka With Spark Structured Streaming

Reading Time: 2 minutes Kafka is a message broker system that passes messages between producers and consumers, whereas Spark Structured Streaming consumes static and streaming data from various sources such as Kafka, Flume, Twitter or any other socket. That data can be processed and analysed using high-level machine learning algorithms, and the results are finally pushed out to an external storage system. The main advantage of structured streaming Continue Reading

Exploring Spark Structured Streaming

Reading Time: 6 minutes Hello Spark Enthusiasts, Streaming apps are growing more complex, and they are getting difficult to build with current distributed streaming engines. Why is streaming hard? Streaming computations don’t run in isolation. Data arriving out of time order is a problem for batch-based processing. Writing stream processing operations from scratch is not easy. Problems with DStreams: Processing with event-time: dealing with late data. Interoperate streaming Continue Reading

Spark Streaming vs Kafka Streams

Reading Time: 4 minutes The demand for stream processing is increasing a lot these days. The reason is that processing big volumes of data is often not enough. Data has to be processed fast, so that a firm can react to changing business conditions in real time. Stream processing is the real-time processing of data continuously and concurrently. Stream processing is the ideal platform to process data streams or Continue Reading

Streaming in Spark, Flink and Kafka

Reading Time: 7 minutes There is a lot of buzz about when to use Spark, when to use Flink, and when to use Kafka. Both Spark Streaming and Flink provide an exactly-once guarantee: every record is processed exactly once, which eliminates any duplicates that might otherwise appear. Both provide very high throughput compared to other processing systems like Storm, and the overhead of Continue Reading

Apache Spark: Reading CSV using custom timestamp format

Reading Time: 3 minutes In this blog, we consider a situation where I wanted to read a CSV file through Spark, but the CSV contains some timestamp columns. Is this going to be a problem while inferring the schema at the time of reading the CSV with Spark? Well, the answer may be no, if the CSV has the timestamp field in the specific yyyy-MM-dd HH:mm:ss format. In this particular case, the Spark CSV reader can Continue Reading

Apache Spark: 3 Reasons Why You Should Not Use RDDs

Reading Time: 4 minutes Apache Spark: whenever we hear these two words, the first thing that comes to mind is RDD, i.e., Resilient Distributed Datasets. It has now been more than 5 years since Apache Spark came into existence, and since its arrival a lot has changed in the big data industry. But the major change was the dethroning of Hadoop MapReduce. I mean Spark literally replaced MapReduce and this Continue Reading

Dealing With Deltas In Amazon Redshift

Reading Time: 5 minutes Hi, in this blog I would like to discuss a scenario for implementing deltas in Amazon Redshift using spark-redshift. Before that, I would like to introduce Amazon Redshift, the spark-redshift library and the integration of Spark with Redshift. It is assumed that you have a fair knowledge of programming in Apache Spark and Spark SQL. You may refer to the documentation links Continue Reading

Apache Spark: Handle null timestamp while reading CSV in Spark 2.0.0

Reading Time: 2 minutes Hello folks, hope you are all doing well! In this blog, I will discuss a problem that I faced some days back. One thing to keep in mind is that this problem is specific to Spark version 2.0.0; it does not occur in other versions. Problem: Spark code was reading a CSV file. This particular CSV file had one timestamp column that might Continue Reading

Getting Started with Apache Spark

Reading Time: 2 minutes Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab, open sourced in 2010, and later donated to the Apache Software Foundation. Spark has several advantages compared to other big data and MapReduce technologies like Hadoop and Storm. Apache Spark is an improvement on the original Hadoop MapReduce Continue Reading

Introduction to Hadoop!

Reading Time: 4 minutes Here I am going to write a blog on Hadoop! “Bigdata is not about data! The value in Bigdata [is in] the analytics.” -Harvard Prof. Gary King. And that is where Hadoop comes in! Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache Continue Reading