Category Archives: apache spark

Having Issue How To Order Streamed Dataframe?


A few days ago, I had to perform an aggregation on a streaming DataFrame. The moment I applied groupBy for the aggregation, the data got shuffled. So the question arises: how do I maintain order? Yes, I can use orderBy with a streaming DataFrame using … Continue reading

Posted in Apache Kafka, apache spark, big data, Scala, Spark, Streaming

Difference between RDD, DF and DS in Spark


In this blog I try to cover the differences between RDD, DF and DS. Many of you may be a little confused about RDD, DF and DS, but don’t worry: after this blog everything will be clear. With the Spark 2.0 release, … Continue reading

Posted in apache spark, Scala, Spark | 2 Comments

Integrating Kafka With Spark Structured Streaming


Kafka is a message broker system that facilitates the passing of messages between producers and consumers, whereas Spark Structured Streaming consumes static and streaming data from various sources such as Kafka, Flume, Twitter or any other socket, which can be processed … Continue reading

Posted in Apache Kafka, apache spark, Scala, Streaming | 1 Comment

Exploring Spark Structured Streaming


Hello Spark Enthusiasts, Streaming apps are growing more complex, and they are getting difficult to build with current distributed streaming engines. Why is streaming hard? Streaming computations don’t run in isolation. Data arriving out of time order is a … Continue reading

Posted in apache spark, Scala, Streaming

Spark Streaming vs Kafka Stream


The demand for stream processing is increasing a lot these days. The reason is that often processing big volumes of data is not enough. Data has to be processed fast, so that a firm can react to changing business conditions … Continue reading

Posted in Apache Kafka, apache spark, big data, Scala, Streaming | 1 Comment

Streaming in Spark, Flink and Kafka


There is a lot of buzz about when to use Spark, when to use Flink, and when to use Kafka. Both Spark Streaming and Flink provide an exactly-once guarantee, meaning that every record will be processed exactly once … Continue reading

Posted in Apache Flink, Apache Kafka, apache spark, Streaming

Apache Spark: Reading csv using custom timestamp format


In this blog, we consider a situation where I wanted to read a CSV through Spark, but the CSV contains some timestamp columns. Is this going to be a problem while inferring the schema at the time of reading the CSV using Spark? Well, … Continue reading

Posted in apache spark, big data, Functional Programming, Scala | 1 Comment

Apache Spark: 3 Reasons Why You Should Not Use RDDs


Apache Spark, whenever we hear these two words, the first thing that comes to our mind is RDDs, i.e., Resilient Distributed Datasets. Now, it has been more than 5 years since Apache Spark came into existence and after its arrival a lot … Continue reading

Posted in apache spark, big data, Scala, Spark | 1 Comment

Dealing With Deltas In Amazon Redshift


Hi, in this blog I would like to discuss a scenario for implementing deltas in Amazon Redshift using spark-redshift. Before that, I would like to introduce Amazon Redshift, the spark-redshift library, and the integration of Spark with Redshift. … Continue reading

Posted in Amazon, apache spark, AWS, AWS Services, database, Scala, Spark

Apache Spark: Handle null timestamp while reading csv in Spark 2.0.0


Hello folks, hope you are all doing well! In this blog, I will discuss a problem that I faced a few days back. One thing to keep in mind is that this problem is specific to Spark version 2.0.0. Other … Continue reading

Posted in apache spark, big data, Scala, Spark