RDD

CuriosityX: RDDs – The backbone of Apache Spark

In our last blog, we looked at using Spark Streaming to transform and transport data between Kafka topics. After reading it, many readers asked us for a brief description of the RDDs in Spark that we used there. So this blog is dedicated entirely to RDDs in Spark. Let's start with the very basic question that comes to mind Continue Reading
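For a quick taste before the full post, here is a minimal sketch of creating and transforming an RDD in Scala. This is only an illustration, not code from the post itself; it assumes a running SparkContext named sc, as you would have in the spark-shell:

// Minimal RDD sketch (assumes a SparkContext `sc`, e.g. in spark-shell).
val numbers = sc.parallelize(1 to 10)   // distribute a local collection as an RDD
val squares = numbers.map(n => n * n)   // lazy transformation; nothing runs yet
println(squares.reduce(_ + _))          // action: triggers the job, prints 385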

Difference between RDD, DF and DS in Spark

In this blog I try to cover the differences between RDD, DF and DS. Many of you may be a little confused about RDD, DF and DS, so don't worry: after this blog everything will be clear. With the Spark 2.0 release, there are three types of data abstractions which Spark now officially provides: RDD, DataFrame and DataSet. So let's start the discussion. Continue Reading
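As a rough illustration of the three abstractions (a sketch under assumed setup, not code from the post; the SparkSession configuration, the Person case class and the sample data are all illustrative):

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("abstractions").master("local[*]").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(Person("Asha", 30), Person("Ravi", 25))) // RDD[Person]
val df  = rdd.toDF()     // DataFrame: rows with a schema, untyped at compile time
val ds  = df.as[Person]  // Dataset[Person]: typed view over the same data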

Apache Spark: 3 Reasons Why You Should Not Use RDDs

Apache Spark: whenever we hear these two words, the first thing that comes to mind is RDD, i.e., Resilient Distributed Datasets. Now, it has been more than 5 years since Apache Spark came into existence, and after its arrival a lot changed in the big data industry. But the major change was the dethroning of Hadoop MapReduce. I mean, Spark literally replaced MapReduce, and this Continue Reading

The Dominant APIs of Spark: Datasets, DataFrames and RDDs

While working with Spark, we often come across three APIs: DataFrames, Datasets and RDDs. In this blog I will discuss the three in terms of use case, performance and optimization. It is essential to keep in mind that seamless transformation is available between the three: DataFrames, Datasets and RDDs. Implicitly, the RDD forms the basis of both DataFrames and Datasets. The inception of Continue Reading
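The seamless transformation mentioned above can be sketched as follows (again an illustration, assuming a SparkSession named spark with its implicits imported and a Person case class as before):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

val ds: Dataset[Person] = Seq(Person("Meena", 28)).toDS() // start with a typed Dataset
val df: DataFrame       = ds.toDF()                       // Dataset -> DataFrame
val rdd: RDD[Person]    = ds.rdd                          // Dataset -> RDD
val dsAgain             = df.as[Person]                   // DataFrame -> Dataset
val dfFromRdd           = rdd.toDF()                      // RDD -> DataFrame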

Congregating Spark files on S3

We all know that Apache Spark is a fast and general engine for large-scale data processing, and it is because of this speed that Spark became one of the most popular frameworks in the world of big data. Working with Spark is a pleasant experience, as it has a simple API for Scala, Java, Python and R. But some tasks in Spark are still tough rows Continue Reading

Shuffling and repartitioning of RDDs in Apache Spark

To write an optimized Spark application you should use transformations and actions carefully; using the wrong transformation or action will make your application slow. So when you are writing an application, some points should be remembered to make it more optimized. 1. Number of partitions when creating an RDD By default, Spark creates one partition for each block of the file in HDFS; it is Continue Reading
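A small sketch of inspecting and changing the number of partitions (the HDFS path and the partition counts here are illustrative; assumes a SparkContext sc):

val lines = sc.textFile("hdfs:///data/input.txt") // by default, one partition per HDFS block
println(lines.getNumPartitions)                   // inspect the current partition count

val fewer = lines.coalesce(4)      // shrink the partition count, avoiding a full shuffle
val more  = lines.repartition(16)  // grow the partition count; triggers a shuffle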
