Table of contents

Reading Time: 4 minutes

Let us first start with the very first question, What is Zeppelin?

It is a web-based notebook that enables interactive data analytics. Based on the concept of an interpreter that can be bound to any language or data processing backend, Zeppelin is a web-based notebook server.

This notebook is where you can do your data analysis. It is a Web UI REPL with pluggable interpreters including Scala, Python, Angular, SparkSQL etc. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

In this blog, I will be discussing about using Zeppelin with Spark.

Before moving further, let’s go with the Installation and Setup first :

Pre-requisite :

Spark Setup
Hadoop Setup

1) You can download the binary release from http://zeppelin.apache.org/download.html

2) Extract it to a folder.

3) Go to the conf folder inside Zeppelin.

4) Make a copy of the zeppelin-site.xml template and rename it to zeppelin.site.xml. (Inside this you can configure your zeppelin.server.port, by default it runs on 8080)

5) Make a copy of zeppelin-env.sh.template and rename it to zeppelin-env.sh. Here you can set your JAVA_HOME, SPARK_HOME, and HADOOP_CONF_DIR.

export JAVA_HOME=/path/to/java
export SPARK_HOME=/path/to/spark
export HADOOP_CONF_DIR=/path/to/hadoop/conf/directory

Getting started :

1) Starting Zeppelin: After configuring zeppelin, go to zeppelin root folder and run the below command :

./bin/zeppelin-daemon.sh start

(By default this will start zeppelin at 8080 port in your browser, you can see it by typing: localhost:8080 in your web browser.)

2) Stopping Zeppelin: To stop zeppelin you can run the below command.

./bin/zeppelin-daemon.sh stop

Before diving into details, I’ll clarify some terms that are heavily used with Zeppelin.

1) Paragraph: It is a minimum unit to be executed.

2) Note: It is defined as the set of paragraphs, and also a member of the notebook.

Every note has its own interpreter’s mapping and stores it into interpreterFactory. Thus one instance has only one notebook which has many notes. You can see a notebook on the home page.

Apache Spark Integration

Supports scala, pyspark and spark sql
SparkContext injected automatically
Supports 3^rd party dependencies
Full spark Interpreter configuration.

Understanding Interpreters in zeppelin

Interpreter is a JVM process that communicates to Zeppelin daemon using thrift. Each Interpreter process can have Interpreter Groups, and each interpreter instance belongs to this Interpreter Group.

There are 3 interpreter modes available in Zeppelin.

1) Shared Mode

In Shared mode, a SparkContext and a Scala REPL is being shared among all interpreters in the group. So every Note will be sharing single SparkContext and single Scala REPL. In this mode, if NoteA defines variable ‘a’ then NoteB not only able to read variable ‘a’ but also able to override the variable.

2) Scoped Mode

In Scoped mode, each Note has its own Scala REPL. So variable defined in a Note can not be read or overridden in another Note. However, still single SparkContext serves all the Interpreter Groups. And all the jobs are submitted to this SparkContext and fair scheduler schedules the job. This could be useful when user does not want to share Scala session, but want to keep single Spark application and leverage its fair scheduler.

3) Isolated Mode

In Isolated mode, each Note has its own SparkContext and Scala REPL.

The default mode of %spark interpreter is ‘Globally Shared’.

Let us now take a closer look at using zeppelin with spark using an example:

1) Create a new note from zeppelin home page with “spark” as default interpreter.

2) Before you start with the example, you will need to download the sample csv.

3) Transform csv into RDD

val employeeText: RDD[String] = sc.textFile("hdfs://localhost:54310/path_to/Employees.csv")

4) To further analyze the logs, it is better to bind them to a schema and use Spark’s powerful SQL query capabilities. A powerful feature of Spark SQL is that you can programmatically bind a schema to a Data Source and map it into Scala case classes which can be navigated and queried in a typesafe manner.

case class Employee(name: String, jobTitle : String, department: String, annualSalary: Double, annualSalaryMinusFurloughs: Double)

import org.apache.spark.rdd.RDD

val employeeRDD: RDD[Employee] = employeeText.map (employeeRecord =>
  employeeRecord.split(",")).map{ employee =>
  Employee (employee(0).replace("\"", "") + employee(1).replace("\"",""), employee(2), employee(3), employee(4).replace("$","").toDouble, employee(5).replace("$","").toDouble)
}

employeeRDD.toDF().write.saveAsTable("employee6")

5) Query on the table to get the result: The queries should be prefixed with %sql to invoke sql interpreter.

%sql select * from employee6
%sql select department, avg(annualSalary) from employee6 group by department

This will display you results that can be visualized in the form of the bar chart, pie chart, area chart, Line chart or a scatter chart.