partitioning in apache spark

Apache Spark: Repartitioning v/s Coalesce

Reading Time: 3 minutes Does partitioning help you increase/decrease the Job Performance? Spark splits data into partitions and computation is done in parallel for each partition. It is very important to understand how data is partitioned and when you need to manually modify the partitioning to run spark applications efficiently. Now, diving into our main topic i.e Repartitioning v/s Coalesce What is Coalesce? The coalesce method reduces the number Continue Reading

CuriosityX: RDDs – The backbone of Apache Spark

Reading Time: 5 minutes In our last blog, we tried to understand about using the spark streaming to transform and transport data between Kafka topics. After reading that many of the readers asked us to give a brief description of RDDs in Spark which we used. So, this blog is totally dedicated to the RDDs in Spark. So let’s start with the very basic question that comes to our mind Continue Reading

Controlling RDD Partitions in Apache Spark

Reading Time: 2 minutes In this blog, we will discuss What is RDD partitioning, why Partitioning is important and how to create and use spark Partitioners to minimize the shuffle operations across the nodes in a distributed Spark application. What is Partitioning? Partitioning is a transformation operation which is available on all key value pair RDDs  in Apache Spark. It is required when we try to group values on the basis Continue Reading