Apache Spark

Exploring Spark Structured Streaming

Reading Time: 6 minutes Hello Spark Enthusiasts, streaming applications are growing more complex, and they are getting difficult to build with current distributed streaming engines. Why is streaming hard? Streaming computations don’t run in isolation. Data arriving out of time order is a problem for batch-style processing, and writing stream processing operations from scratch is not easy. Problems with DStreams include processing with event time, dealing with late data, and interoperating streaming Continue Reading
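The Structured Streaming API the post introduces can be sketched roughly as follows; the socket source, host, and port are illustrative assumptions, not details from the post:

```scala
import org.apache.spark.sql.SparkSession

object StructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Read lines from a socket source (hypothetical host/port for illustration)
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // A running word count over the unbounded stream
    val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```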

Spark Streaming vs Kafka Stream

Reading Time: 4 minutes The demand for stream processing is increasing a lot these days. The reason is that often processing big volumes of data is not enough. Data has to be processed fast, so that a firm can react to changing business conditions in real time. Stream processing is the real-time processing of data continuously and concurrently. Stream processing is the ideal platform to process data streams or Continue Reading

Streaming in Spark, Flink and Kafka

Reading Time: 7 minutes There is a lot of buzz about when to use Spark, when to use Flink, and when to use Kafka. Both Spark Streaming and Flink provide an exactly-once guarantee: every record will be processed exactly once, thereby eliminating any duplicates that might arrive. Both provide very high throughput compared to other processing systems like Storm, and the overhead of Continue Reading

Apache Spark: Reading CSV using a custom timestamp format

Reading Time: 3 minutes In this blog, we consider a situation where I wanted to read a CSV through Spark, but the CSV contains some timestamp columns. Is this going to be a problem while inferring the schema at the time of reading the CSV with Spark? Well, the answer may be no, if the CSV has the timestamp field in the specific yyyy-MM-dd hh:mm:ss format. In this particular case, the Spark CSV reader can Continue Reading
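As a rough sketch of the idea in the teaser, Spark's CSV reader accepts a timestampFormat option for non-default patterns; the file path, header layout, and pattern below are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-custom-timestamp")
  .master("local[*]")
  .getOrCreate()

// timestampFormat tells the CSV reader how to parse timestamp columns that
// don't use the default yyyy-MM-dd HH:mm:ss pattern (path/pattern are placeholders)
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("timestampFormat", "dd/MM/yyyy HH:mm")
  .csv("/path/to/data.csv")

df.printSchema()
```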

Apache Spark: 3 Reasons Why You Should Not Use RDDs

Reading Time: 4 minutes Apache Spark: whenever we hear these two words, the first thing that comes to our mind is RDD, i.e., Resilient Distributed Datasets. Now, it has been more than 5 years since Apache Spark came into existence, and after its arrival a lot of things changed in the big data industry. But the major change was the dethroning of Hadoop MapReduce. I mean, Spark literally replaced MapReduce, and this Continue Reading

Dealing With Deltas In Amazon Redshift

Reading Time: 5 minutes Hi, in this blog I would like to discuss a scenario: implementing deltas in Amazon Redshift using spark-redshift. Before that, I would like to introduce Amazon Redshift, the spark-redshift library, and the integration of Spark with Redshift. It is assumed that you have a fair knowledge of programming in Apache Spark and Spark SQL. You may refer to the documentation links Continue Reading
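A minimal sketch of reading a Redshift table through the spark-redshift data source, assuming placeholder connection details (the JDBC URL, table name, and S3 temp directory below are hypothetical, not from the post):

```scala
// Requires the spark-redshift package on the classpath; all connection
// settings here are placeholders for illustration
val events = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
  .option("dbtable", "events")
  .option("tempdir", "s3n://bucket/tmp")
  .load()

events.printSchema()
```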

Apache Spark: Handle null timestamps while reading CSV in Spark 2.0.0

Reading Time: 2 minutes Hello folks, hope you all are doing well! In this blog, I will discuss a problem I faced some days back. One thing to keep in mind is that this problem is specific to Spark version 2.0.0; in other versions it does not occur. Problem: Spark code was reading a CSV file. This particular CSV file had one timestamp column that might Continue Reading
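The post's actual fix sits behind the Continue Reading link; one common workaround for this kind of issue (an assumption on my part, not necessarily the author's solution) is to supply an explicit schema, read the fragile column as a string, and cast it afterwards:

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.col

// Hypothetical schema: reading the timestamp column as StringType first
// avoids the parser failing on empty values, then we cast explicitly
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("created_at", StringType, nullable = true)
))

val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("/path/to/data.csv")            // placeholder path
  .withColumn("created_at", col("created_at").cast(TimestampType))
```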

Getting Started with Apache Spark

Reading Time: 2 minutes Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab and open sourced in 2010; it later became an Apache project. Spark has several advantages compared to other big data and MapReduce technologies like Hadoop and Storm. Apache Spark is an improvement on the original Hadoop MapReduce Continue Reading

Introduction to Hadoop!

Reading Time: 4 minutes Here I am going to write a blog on Hadoop! “Big data is not about data! The value in big data [is in] the analytics.” – Harvard Prof. Gary King. So Hadoop came into the picture! Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache Continue Reading

Apache Spark: Spark union adds up the partitions of input RDDs

Reading Time: 2 minutes Some days back, when I was doing a union of two pair RDDs, I found strange behavior in the number of partitions: the output RDD had a different number of partitions than the input RDDs. For example, suppose rdd1 and rdd2 each have two partitions; after a union of these RDDs I was expecting the same number of partitions for the output RDD, but the output RDD got the Continue Reading
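The behavior described can be reproduced on a local SparkContext; this sketch assumes rdd1 and rdd2 carry no partitioner, in which case union simply concatenates the partitions of its inputs:

```scala
// Two pair RDDs with two partitions each (no partitioner set)
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b")), numSlices = 2)
val rdd2 = sc.parallelize(Seq((3, "c"), (4, "d")), numSlices = 2)

val unioned = rdd1.union(rdd2)
// Union does not merge partitions: 2 + 2 = 4 partitions, not 2
println(unioned.getNumPartitions)
```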

Reading data from different sources using Spark 2.1

Reading Time: 2 minutes Hi all, in this blog we’ll discuss fetching data from different sources using Spark 2.1, such as CSV, JSON, text, and Parquet files. So first of all, let’s discuss what’s new in Spark 2.1. In previous versions of Spark, you had to create a SparkConf and SparkContext to interact with Spark, whereas in Spark 2.1 the same effects can be achieved through SparkSession, without explicitly Continue Reading
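The unified SparkSession entry point mentioned above can be sketched like this; the file paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession replaces the SparkConf/SparkContext pair from earlier versions
val spark = SparkSession.builder()
  .appName("multi-source-read")
  .master("local[*]")
  .getOrCreate()

// One reader, several formats (paths are hypothetical)
val csvDf     = spark.read.option("header", "true").csv("/data/sample.csv")
val jsonDf    = spark.read.json("/data/sample.json")
val textDf    = spark.read.text("/data/sample.txt")
val parquetDf = spark.read.parquet("/data/sample.parquet")
```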

Spark Cassandra Connector On Spark-Shell

Reading Time: 2 minutes Using the Spark-Cassandra-Connector on the Spark shell. Hi all, in this blog we will see how we can execute our Spark code against Cassandra on the Spark shell. This is very efficient for testing or learning, where we have to execute our code on the Spark shell rather than in an IDE. Here we will use Spark version 1.6.2; you can download Continue Reading
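A minimal sketch of what using the connector on the shell looks like; the package coordinates, keyspace, and table names below are assumptions for illustration:

```scala
// Launch the shell with the connector, e.g. (connector version is an
// assumption matching Spark 1.6.x):
//   ./bin/spark-shell \
//     --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.2 \
//     --conf spark.cassandra.connection.host=127.0.0.1
import com.datastax.spark.connector._

// Read a Cassandra table as an RDD (hypothetical keyspace/table)
val rows = sc.cassandraTable("test_keyspace", "words")
println(rows.count())
```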

Introduction to Structured Streaming

Reading Time: < 1 minute Hello!! Knoldus organized a half-hour session on Structured Streaming, covering the API changes, how it differs from the earlier stream computation paradigm (DStreams), and an example API demonstration. Hope you will enjoy it. Below are the slides and video from the session.