Knolx – A Step to Programming with Apache Spark


Hello, associates! Hope you are doing well. Today I am going to share some of my programming experience with Apache Spark.
So if you are getting started with Apache Spark, then this blog may be helpful for you.

Prerequisites to start with Apache Spark –

  • MVN / SBT
  • Scala
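
If you plan to use Spark as a library in your own SBT project rather than building it from source, a minimal build.sbt might look like the sketch below (the Scala and Spark versions here are only assumptions; use the ones that match your setup):

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0",
  "org.apache.spark" %% "spark-sql"  % "2.0.0"
)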

To start with Apache Spark, you first need to either

  • download a pre-built Apache Spark package, or
  • download the source code and build it on your local machine.

Now, if you downloaded the pre-built Spark, then you only need to extract the tar file at a location where you have permission to read and write.

Else you need to extract the source code and run the following command in the SPARK_HOME directory to build Spark –

  • Building with Maven and Scala 2.11
./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
  • Building with SBT
build/sbt -Pyarn -Phadoop-2.3 assembly

Now to start Spark, go to SPARK_HOME/bin and execute ./spark-shell.

You will get the Spark shell prompt (scala>).

Hence Apache Spark provides you with the following two objects by default in the spark-shell:

  1. sc : SparkContext
  2. spark : SparkSession

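For example, you can try both of them right away in the shell (a small illustrative check; the exact values depend on your build):

sc.version               // the Spark version of your build, e.g. "2.0.0"
sc.appName               // "Spark shell"
spark.range(5).count()   // uses the SparkSession to create and count a small Dataset; returns 5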

Although you can also create your own SparkContext (if you are creating a project apart from the spark-shell):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("Demo").setMaster("local[2]")
val sc = new SparkContext(conf)
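
Similarly, if you want your own spark : SparkSession outside the shell (Spark 2.x), a minimal sketch is:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Demo")
  .master("local[2]")
  .getOrCreate()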

Now you can load data with two types of datasets:

  1. RDD
  2. DataFrame

Now you know that:

  • A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.
  • DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.
  • An RDD, on the other hand, is merely a Resilient Distributed Dataset that is more of a black box of data that cannot be optimized, because the operations that can be performed against it are not as constrained.
  • However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method, as shown in the sketch below.
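
Here is a quick sketch of the conversion in both directions (the data and column names are just an assumption for illustration, and spark refers to the shell's SparkSession):

import spark.implicits._   // needed for toDF

val rdd = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
val df = rdd.toDF("name", "age")   // RDD -> DataFrame
val backToRdd = df.rdd             // DataFrame -> RDD[Row]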

Creating an RDD and loading data into it:

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
// distData: org.apache.spark.rdd.RDD[Int]
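
Once created, the RDD can be operated on in parallel, for example:

val sum = distData.reduce(_ + _)   // 15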

Or you can load data from a file:

val distFile = sc.textFile("data.txt")
// distFile: org.apache.spark.rdd.RDD[String]
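
Again, transformations and actions work the same way, for example (assuming data.txt exists at that path):

val totalLength = distFile.map(line => line.length).reduce(_ + _)   // total number of characters in the file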

Here is a complete WordCount example to understand RDDs:

val textFile = sc.textFile("words.txt")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("count.txt")   // writes a directory named count.txt containing part files
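
In the shell you can also inspect the result directly before writing it out, for example:

counts.take(10).foreach(println)   // print the first 10 (word, count) pairs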

Similarly, you can create a DataFrame object:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.json("emp.json")
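
Note that read.json expects one JSON object per line. A hypothetical emp.json with the fields used below might look like:

{"firstName": "Alice", "age": 28}
{"firstName": "Bob", "age": 32}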

Now you can query with the DataFrame object.

Example of DataFrame operations:

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.json("emp.json")
df.printSchema()                                     // print the schema in a tree format
df.show()                                            // display the content of the DataFrame
df.select("firstName").show()                        // select only the "firstName" column
df.select(df("firstName"), df("age") + 1).show()     // select firstName, with age incremented by 1
df.filter(df("age") > 25).show()                     // select people older than 25
df.groupBy("age").count().show()                     // count people by age

println("\n\n\nUsing Collect Method")
df.collect.toList.map(aRow => println(aRow))         // bring all rows to the driver and print each one
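
You can also run plain SQL against the same DataFrame by registering it as a temporary table (a small sketch, assuming the same emp.json):

df.registerTempTable("emp")
sqlContext.sql("SELECT firstName, age FROM emp WHERE age > 25").show()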

Here are the slides for the same:

Here is the YouTube video:

Reference:

http://spark.apache.org/

Stay tuned for Spark with Hive

Thanks
