Knoldus has organized a session on 08th February 2019. The topic was “Understanding Spark Structured Streaming”. Many people attended and enjoyed the session. In this blog post, I am going to share the slides & video of the session. Slides: Video: If you have any query, then please feel free to comment below.
In our last blog, we tried to understand about using the spark streaming to transform and transport data between Kafka topics. After reading that many of the readers asked us to give a brief description of RDDs in Spark which we used. So, this blog is totally dedicated to the RDDs in Spark. So let’s start with the very basic question that comes to our mind Continue Reading
In this blog I try to cover the difference between RDD, DF and DS. much of you have a little bit confused about RDD, DF and DS. so don’t worry after this blog everything will be clear. With Spark2.0 release, there are 3 types of data abstractions which Spark officially provides now to use: RDD, DataFrame and DataSet. so let’s start some discussion about it. Continue Reading
Apache Spark, whenever we hear these two words, the first thing that comes to our mind is RDD , i.e., Resilient Distributed Datasets. Now, it has been more than 5 years since Apache Spark came into existence and after its arrival a lot of things got changed in big data industry. But, the major change was dethroning of Hadoop MapReduce. I mean Spark literally replaced MapReduce and this Continue Reading
We all know that Apache Spark is a fast and general engine for large-scale data processing and it is because of its speed that Spark was able to become one of the most popular frameworks in the world of big data. Working with Spark is a pleasant experience as it has a simple API for Scala, Java, Python and R. But, some tasks, in Spark, are still tough rows Continue Reading
To write the optimize spark application you should carefully use transformation and actions, if you use wrong transformation and action will make your application slow. So when you are writing application some points should be remember to make your application more optimize. 1. Number of partitions when creating RDD By default spark create one partition for each block of the file in HDFS it is Continue Reading