Hello associate! Hope you are doing well. Today I am going to share some of my programming experience with Apache Spark.
So if you are getting started with Apache Spark, this blog may be helpful for you.
Prerequisites to start with Apache Spark –
- MVN / SBT
To start with Apache Spark, you first need to either download a pre-built Spark package or download the Spark source code and build it yourself.
If you downloaded pre-built Spark, you only need to extract the tar file to a location where you have read and write permission.
Otherwise, you need to extract the source code and run one of the following commands in the SPARK_HOME directory to build Spark –
- Building with Maven and Scala 2.11
    ./dev/change-scala-version.sh 2.11
    mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
- Building with SBT
    build/sbt -Pyarn -Phadoop-2.3 assembly
Now, to start Spark, go to the SPARK_HOME/bin directory and execute ./spark-shell
You will get the Scala REPL prompt (scala>).
Apache Spark provides the following two objects by default in spark-shell:
- sc : SparkContext
- spark : SparkSession
You can also create your own SparkContext (for example, when building a project outside spark-shell):
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("Demo").setMaster("local")
    val sc = new SparkContext(conf)
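The spark object from the list above can be built by hand in much the same way. Here is a minimal sketch, assuming Spark 2.0 or later is on the classpath; the app name "Demo" is reused from above and the "local[*]" master is just an assumption for running on one machine.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses an already running session (for example inside spark-shell)
// instead of starting a second one.
val spark = SparkSession.builder()
  .appName("Demo")
  .master("local[*]") // assumption: local mode using all available cores
  .getOrCreate()

// The SparkContext is reachable from the session, so no separate one is needed.
val sc = spark.sparkContext
```

With this, both sc and spark are available in your own project just like in spark-shell.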
Now you can load data with two types of dataset: RDD and DataFrame.
Keep the following in mind:
- A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.
- DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.
- An RDD, on the other hand, is merely a Resilient Distributed Dataset. It is more of a black box of data that cannot be optimized, because the operations that can be performed against it are not as constrained.
- However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
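The last point can be sketched as follows. This is only an illustration, assuming Spark 2.x or later; the sample names and ages are made up, and the column names firstName and age are chosen to match the DataFrame example in this post.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("RddDataFrameDemo") // hypothetical app name
  .master("local[*]")          // assumption: local mode
  .getOrCreate()
import spark.implicits._ // brings toDF into scope for RDDs of tuples

// Hypothetical tabular data: one (firstName, age) tuple per row.
val peopleRdd = spark.sparkContext.parallelize(Seq(("Ann", 31), ("Bob", 24)))

// RDD -> DataFrame: name the columns while converting.
val peopleDf = peopleRdd.toDF("firstName", "age")
peopleDf.printSchema()

// DataFrame -> RDD: the result is an RDD of generic Row objects.
val rowsRdd = peopleDf.rdd
val names = rowsRdd.map(row => row.getString(0)).collect()
```

Note that going back to an RDD gives you Row objects, so you read fields by position or name instead of getting your original tuple type back.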
Creating an RDD and loading data into it:
    val data = Array(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data)
    // distData: org.apache.spark.rdd.RDD[Int]
Alternatively, you can load data from a file:
    val distFile = sc.textFile("data.txt")
    // distFile: RDD[String]
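Here is a complete example of WordCount to understand RDD:
    val textFile = sc.textFile("words.txt")
    val counts = textFile
      .flatMap(line => line.split(" "))  // split each line into words
      .map(word => (word, 1))            // pair each word with a count of 1
      .reduceByKey(_ + _)                // sum the counts per word
    counts.saveAsTextFile("count.txt")   // writes a directory named count.txt containing part files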
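If you do not have a words.txt at hand, the same pipeline can be tried on a few in-memory lines. This is a small sketch with made-up input; it collects the result to the driver instead of saving it, which is only sensible for tiny data.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("WordCountDemo") // hypothetical app name
  .master("local[*]")       // assumption: local mode
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical input standing in for the lines of words.txt.
val lines = sc.parallelize(Seq("to be or not to be", "to do"))

val wordCounts = lines
  .flatMap(line => line.split(" ")) // split each line into words
  .map(word => (word, 1))           // pair each word with a count of 1
  .reduceByKey(_ + _)               // sum the counts per word
  .collectAsMap()                   // bring the small result back to the driver

wordCounts.foreach { case (word, n) => println(s"$word: $n") }
```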
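Similarly, you can create a DataFrame object:
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // In Spark 2.x the built-in session works too: spark.read.json("emp.json")
    val df = sqlContext.read.json("emp.json")
Now you can query with the DataFrame object.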
Example of DataFrame:
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.json("emp.json")
    df.printSchema()                                  // print the inferred schema
    df.show()                                         // display the first rows
    df.select("firstName").show()
    df.select(df("firstName"), df("age") + 1).show()  // age incremented by 1
    df.filter(df("age") > 25).show()
    df.groupBy("age").count().show()
    println("\n\n\nUsing Collect Method")
    df.collect().foreach(println)                     // bring all rows to the driver and print them
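The example above needs an emp.json file. By default Spark's JSON reader expects one JSON object per line, so here is a rough sketch that writes such a file to a temporary location and runs two of the queries on it. The records are made up, and the firstName and age fields are assumed from the queries above.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("EmpJsonDemo") // hypothetical app name
  .master("local[*]")     // assumption: local mode
  .getOrCreate()

// Hypothetical employee records, one JSON object per line.
val records = Seq(
  """{"firstName": "Ann", "age": 31}""",
  """{"firstName": "Bob", "age": 24}""",
  """{"firstName": "Cid", "age": 31}"""
)
val path = Files.createTempFile("emp", ".json")
Files.write(path, records.mkString("\n").getBytes("UTF-8"))

// Read it back; a file:// URI keeps the lookup on the local file system.
val df = spark.read.json(path.toUri.toString)

// Same queries as above, but collected so the results can be inspected in code.
val olderThan25 = df.filter(df("age") > 25).count()
val perAge = df.groupBy("age").count().collect()
  .map(row => (row.getLong(0), row.getLong(1))) // JSON integers are read as Long
  .toMap
```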
Stay tuned for Spark with Hive