Hey guys, welcome to series of spark blogs, this blog being the first blog in this series we would try to keep things as crisp as possible, so let’s get started.
So I recently get to start learning spark about believe me and now it has made me inquisitive about it, for a brief introduction of spark, I would say that it is a pretty efficient, blazingly fast processing framework which could also process on data that is being ingested in real time and since we are living in an age where billions and trillions unit of data is generated every day, may sound unconvincing but by the time you finish reading this blog we would have 2,20,000 TB of more data give that I think components like spark are the need of the hour, don’t really worry we don’t have to handle all of that data but a part, the good thing about about spark that it can easily scale, runs on commodity hardware , more on this here so since this blog is about Actions and Transformations in Spark let’s heads towards them now.
Before starting on actions and transformations let’s look have a glance on the data structure on which this operations are applied – RDD, Resilient Distributed Datasets are the basic building block for the spark programming, programs could be made fault tolerant using RDDs, also it can be operated upon in parallel which facilitates spark to us combined processing power of multiple node which are distributed across the cluster. Now that’s enough information about RDDs for this blog but I suggest you to read more here about it here.
So wherever we create an RDD either by parallelizing a collection of loading a data set from a file storage such as hadoop, hdfs even local file system there are certain types of operations that we can perform on those RDDs, which are categories into two broad categories Transformations and Actions.
Transformations are such type of operations which are when applied on an RDD it returns a new transformed RDD, the point which is more crucial to note here is transformations on RDDs are evaluated lazily which means that even though we have got a new transformed RDD, that data that is distributed across the nodes is not yet touched, you could also chain multiple RDD’s transformation since after every transformation you would get a new transformed RDD, now lets take a look at some of the most common transformation that are used,
This transformer take a function with the signature
U => V where
U denotes the type of elements that the target RDD contains, being element by element transformer it applies this function to each element
This transformer takes a function with signature
U => Boolean and it also an element by element transformer and the elements that for which we get true would exist in resulting RDD
This transformer is also an element by element transformer which handles the flattening of the resulting RDD where the transformation operation returns multiple elements for each element in the target RDD hence the signature is
U => TraversableOnce[V]
scala> sc.textFile("file:///home/freaks/Desktop/demo.txt") res7: org.apache.spark.rdd.RDD[String] = file:///home/freaks/Desktop/demo.txt MapPartitionsRDD at textFile at <console>:25 scala> res7.map(_.toUpperCase) res8: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD at map at <console>:26 scala> res8.collect res9: Array[String] = Array("HEY THIS IS SPARK ", "HERE IS A BLOG ON SPARK ", TRANSFORMATIONS, TRANSFORMATIONS IN SPARK ARE PRETTY HANDY, BY PRETTY HANDY WE MEAN THEY REALLY ARE) scala> res7.collect res10: Array[String] = Array("hey this is spark ", "here is a blog on spark ", transformations, transformations in spark are pretty handy, by pretty handy we mean they really are) scala> res7.map(_.toUpperCase).filter(_.contains("SPARK")) res11: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD at filter at <console>:26 scala> res11.collect res12: Array[String] = Array("HEY THIS IS SPARK ", "HERE IS A BLOG ON SPARK ", TRANSFORMATIONS IN SPARK ARE PRETTY HANDY) scala> res3 flatMap(_.split(" ")) res16: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD at flatMap at <console>:26 scala> res16.collect res18: Array[String] = Array(hey, this, is, spark, here, is, a, blog, on, spark, transformations, transformations, in, spark, are, pretty, handy, by, pretty, handy, we, mean, they, really, are)
There are some more types of important transformations those we would be discussing in our next blog, and the complete list of transformations is available here. Now lets move on to Actions
Action are a methods to access the actual data available in an RDD, the result of an action can be taken into the programmatic flow for the resulting data set is large enough to fit in the memory else we also have methods to write it in to various format in the file system at hand, wherever an action is called all the transformation occurs and the data from various nodes is pulled after the transformations, we may only use actions while we deal with a small dataset almost never in production applications, now let’s take a look at some of them
This is the most commonly used action it collects all the elements of the RDD in to a single Array, the above code shows that
This action aggregate the elements of the RDD, the signature for the function is
(U, U) => U
scala> res9.collect res16: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20) scala> res9.reduce((x, y) => if(x > y) x else y) res17: Int = 20
count(), first(), take(n)
Similarly we have
take action that counts the number of elements, takes first element and take upto n elements from the RDD respectively, similarly we have the methods saveAsTextFile, saveAsSequenceFile, saveAsObjectFile to save them to filesystem
Sometimes we don’t want to return anything but wants for do an operation on each element of RDD so we can use foreach signature for that is
U => Unit().
A complete list to all the Actions is available here, so in our next blog we would look some more transformation along with some concepts like shuffling and repartitioning till then happy coding 🙂