How Does Spark Use MapReduce?

In this blog we will talk about an interesting question: does Spark use MapReduce or not? The answer is yes, it uses MapReduce, but only the idea, not the exact implementation. Let's take an example: to read a text file from Spark, what we all do is

spark.sparkContext.textFile("fileName")

But do you know how it actually works? Try to Ctrl+click on this textFile method and you will find this code:

/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 */
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

As you can see, it is calling the hadoopFile method of the Hadoop API with five parameters: the file path, the input format to use (TextInputFormat), the key class (LongWritable), the value class (Text), and the minimum number of partitions.

So it doesn't matter whether you are reading a text file from local disk, S3 or HDFS, Spark will always use the Hadoop API to read it.
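To see why the same call works for all of these sources, note that Hadoop's FileSystem layer dispatches on the scheme of the URI you pass in. Here is a minimal sketch in plain Scala (no Spark or Hadoop needed; the paths are made-up examples, not real endpoints):

```scala
import java.net.URI

// Hypothetical example paths; only the URI scheme differs between sources.
val paths = Seq(
  "hdfs://namenode:8020/data/input.txt", // HDFS
  "s3a://my-bucket/input.txt",           // S3 (via the s3a connector)
  "file:///tmp/input.txt"                // local file system
)

// Hadoop's FileSystem layer picks an implementation based on this scheme;
// the Spark code path (textFile -> hadoopFile) is identical for all of them.
val schemes = paths.map(p => new URI(p).getScheme)
println(schemes) // List(hdfs, s3a, file)
```

Spark itself never branches on the storage system here; it hands the path string to the Hadoop API and lets that layer resolve the right file system.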

hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
  minPartitions).map(pair => pair._2.toString).setName(path)

Can you understand what this code is doing? The first parameter is the path of the file, the second is the input format which should be used to read it, and the third and fourth parameters are the key and value classes produced by the record reader: the byte offset of each line, and the line itself.

Now you might be wondering why we do not get this offset back when we read a file. The reason is this part:

.map(pair => pair._2.toString)

So what it is doing is mapping over all the key-value pairs but keeping only the values, i.e. the lines, and discarding the offset keys.
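To make the offset/value idea concrete, here is a minimal sketch in plain Scala (no Spark or Hadoop needed) that simulates the (LongWritable, Text) pairs a TextInputFormat record reader would hand back, and then applies the same `map(pair => pair._2)` trick; the sample lines are invented for illustration:

```scala
val lines = Seq("hello", "world", "spark")

// Simulate TextInputFormat's keys: the byte offset at which each line
// starts (each line is followed by a 1-byte '\n' in this toy example).
val offsets = lines.scanLeft(0L)((off, line) => off + line.length + 1).init

// Pairs analogous to (LongWritable, Text): (offset, line)
val pairs = offsets.zip(lines)
println(pairs)  // List((0,hello), (6,world), (12,spark))

// The same step as in Spark's textFile: keep only the value, drop the offset.
val values = pairs.map(pair => pair._2)
println(values) // List(hello, world, spark)
```

This is exactly why textFile gives you an RDD[String] and not an RDD of pairs: the offsets exist while the file is being read, but they are mapped away before the RDD reaches your code.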

