The Databricks Spark exam has undergone a number of recent changes. Whereas before it consisted of both multiple choice (MC) and coding challenges (CC), it is now entirely MC-based. I am writing this blog post because all of the prep material available at the time I took the exam (May 2020) was for the previous version of the exam.
You have 120 minutes to answer 60 questions and must get at least 42/60 (70%) correct to pass. See the exam details and FAQ page for more information. You’re allowed to use the official Spark documentation during the exam, but it is only accessible through a small window embedded in the exam tab, so you shouldn’t rely on it too heavily. I constantly had to resize the documentation window and scroll it horizontally. Moreover, this format prevents you from using Ctrl+F to search, so you really need to know where in the docs particular functions live if you want to make the most of it. This brings me to my first point:
Tip 1: Get Comfortable Using the Docs
The docs are the only external tool you can use during the exam, so it would be wise to know the location of common functions. The best advice I can give to help you gain familiarity with the docs is to clone the Spark GitHub repo to your local machine and run a grep -r search from the base of the cloned directory to find functions of interest (see below for details). The path of the source file that defines a function roughly matches the path of that function’s documentation.
For example, let’s say that you want to find the documentation for withColumn, but don’t know where to look. First, you might try typing withColumn into the search field of the docs.
Unfortunately, this doesn’t return any results. Your next instinct might be to google something like ‘spark withColumn docs’.
But again, none of the top links are helpful. You can always try refining your search, but I’ve found it more helpful to simply rely on grep. I ran the following from a terminal in the base of the repo that I had just cloned (the path is ~/dev/software/spark, since I ran git clone in my software directory). The trailing dot tells grep to search the current directory, which some grep versions require when -r is used:
grep -r 'def withColumn' .
Not only does this give you the path under which withColumn’s docs live (org.apache.spark.sql.Dataset, since you can ignore the sql/core/src/main/scala part of the path), it also shows the function signature. In my case, knowing the signature is often the only reason I would have gone to the docs in the first place, so I might even stop here. If you want to dive deeper, however, you can head over to the docs and search for the path we just found.
Or you can go straight into the source code and read the comments there. In either case, getting in the habit of using grep -r 'def methodName' will help you gain familiarity with the docs and speed up your learning.
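If you find yourself repeating this workflow, it can be wrapped in two small shell functions. This is only a sketch: find_def and path_to_doc_name are hypothetical helper names (not part of Spark or grep), and the second one assumes the file lives under a src/main/scala directory, as Dataset did in the version of the repo I cloned. The layout may differ in other Spark versions.

```shell
#!/bin/sh

# find_def NAME [DIR] (hypothetical helper)
# Recursively search Scala files under DIR (default: current directory)
# for "def NAME". -n prints line numbers; -w requires a whole-word match,
# so "withColumn" does not also match "withColumnRenamed".
find_def() {
  grep -rnw --include='*.scala' "def $1" "${2:-.}"
}

# path_to_doc_name PATH (hypothetical helper)
# Turn a source path printed by grep into the dotted name to look for
# in the API docs. Assumes the src/main/scala layout: everything up to
# and including that directory is dropped, the .scala suffix is removed,
# and the remaining slashes become dots.
path_to_doc_name() {
  printf '%s\n' "$1" |
    sed -e 's#^.*/src/main/scala/##' -e 's#\.scala$##' -e 's#/#.#g'
}

path_to_doc_name "sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala"
# prints: org.apache.spark.sql.Dataset
```

From the base of the cloned repo, running find_def withColumn lists the matching Scala files with line numbers and signatures, and passing one of those file paths to path_to_doc_name gives you the name to search for in the docs.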
Tip 2: Read the Definitive Guide
Go through the first 19 chapters of “Spark: The Definitive Guide: Big Data Processing Made Simple” by Bill Chambers and Matei Zaharia. Really follow along with all of the examples. Run Spark locally and try out the code yourself. I took the exam in Scala, but the authors provide Scala, Python, and SQL code where relevant. This book was my main source of prep for the exam.
Tip 3: Practice Using DataFrames
As seen in the FAQs, the vast majority of the test is focused on DataFrame API Applications:
- Spark Architecture: Conceptual understanding (~17%)
- Spark Architecture: Applied understanding (~11%)
- Spark DataFrame API Applications (~72%)
Since you only need 70% to pass and the DataFrame API accounts for roughly 72% of the exam, this is clearly where to focus your effort. Get comfortable running Spark locally, building DataFrames from sample data, and just playing around with them. Find a dataset you’re interested in, download it, and try to load it into a DataFrame. Analyze the data using the DataFrame API and see if you can make any cool observations about it. Once you have some interesting results, write them out to CSV or Parquet files.
Treating this as an interesting data science project instead of a set of facts to memorize will make the entire learning process much more enjoyable, and it will be easier to study without losing interest. If you’re a more visual learner, you might also benefit from some of the classes available for learning Spark.
Tip 4: Take Some Online Classes
I took several online Spark classes in preparation for the exam. Below is the only class I would recommend:
I also did the official Structured Streaming course offered by Databricks, but it was not relevant for passing the exam. It was a great course for learning about Structured Streaming, so you might still want to check it out.