4 Tips to Become a Databricks Certified Associate Developer for Apache Spark: June 2020

Reading Time: 4 minutes

The Databricks Spark exam has undergone a number of recent changes. Whereas before it consisted of both multiple choice (MC) questions and coding challenges (CC), it is now entirely MC-based. I am writing this post because all of the prep material available at the time I took the exam (May 2020) was for the previous version of the exam.

Exam Format

You have 120 minutes to answer 60 questions and must get at least 42/60 correct to pass. Additional exam details are available in the exam FAQ. You’re allowed to use the official Spark documentation during the exam, but it is only accessible through a small window embedded in the exam tab, so you shouldn’t rely on it too heavily. I constantly had to resize the documentation window and scroll it horizontally. Moreover, this format prevents you from using Ctrl+F to search, so you really need to know where in the docs particular functions live if you want to make the most of it. This brings me to my first point:

Tip 1: Get Comfortable Using the Docs

The docs are the only external tool you can use during the exam, so it would be wise to know the location of common functions. The best advice I can give to help you gain familiarity with the docs is to clone the Spark GitHub repo to your local machine and run a grep -r search from the base of the cloned directory to find functions of interest (see below for details). The path to a function’s source file in the repo roughly mirrors the path to that function’s documentation.

For example, let’s say that you want to know how to find the documentation for withColumn, but don’t know where to look. First, you might try searching for withColumn in the search field for the docs:

Unfortunately, this doesn’t return any results. Your next instinct might be just to Google something like ‘spark withColumn docs’:

But again, none of the top links are helpful. You can always try refining your search, but I’ve found it more helpful to simply rely on grep. Here are the results of running

grep -r 'def withColumn'

from a terminal inside the base of the repo that I just cloned (the path is ~/dev/software/spark, since I ran git clone in my software repo):

Not only does this give you the path in which withColumn’s docs live (org.apache.spark.sql.Dataset, since you can ignore the sql/core/src/main/scala part of the path), it actually shows the function signature. In my case, knowing the signature is often the only reason I would have been going to the docs in the first place, so I might even stop here. If you want to dive deeper, however, you can simply head over to the docs and search for the path we just found:

Or you can go straight into the source code and read the comments there. In either case, getting in the habit of using grep -r 'def methodName' will help you gain familiarity with the docs and accelerate your learning.
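The step of turning a grep hit into a docs path can be sketched in a few lines of Python. This is only an illustration: the helper name `source_path_to_class` is hypothetical, and it assumes the source file sits under a `src/main/scala/` or `src/main/java/` directory, as in the withColumn example above.

```python
from pathlib import PurePosixPath

# Assumption: source files live under a src/main/<language>/ directory,
# as in the sql/core/src/main/scala/... path from the withColumn example.
SOURCE_ROOTS = ("src/main/scala/", "src/main/java/")


def source_path_to_class(path: str) -> str:
    """Convert a source file path from a grep hit into a dotted class name."""
    for root in SOURCE_ROOTS:
        _, found, tail = path.partition(root)
        if found:
            # Drop the file extension and join the directories with dots.
            return ".".join(PurePosixPath(tail).with_suffix("").parts)
    raise ValueError(f"no known source root in path: {path}")


hit = "sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala"
print(source_path_to_class(hit))
```

The printed name is what you would then look up in the API docs.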

Tip 2: Read the Definitive Guide

Go through the first 19 chapters of “Spark: The Definitive Guide: Big Data Processing Made Simple” by Bill Chambers and Matei Zaharia. Really follow along with all of the examples. Run Spark locally and try out the code yourself. I took the exam in Scala, but the authors provide Scala, Python, and SQL code where relevant. This book was my main source of prep for the exam.

Tip 3: Practice Using DataFrames

As seen in the FAQs, the vast majority of the test is focused on DataFrame API Applications:

  • Spark Architecture: Conceptual understanding (~17%)
  • Spark Architecture: Applied understanding (~11%)
  • Spark DataFrame API Applications (~72%)
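To put these numbers in terms of questions and minutes, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted in this post; the per-topic counts are estimates obtained by rounding the approximate weights, not official figures.

```python
# Figures quoted in this post; the topic weights are approximate ("~") values.
TOTAL_QUESTIONS = 60
TOTAL_MINUTES = 120
PASSING_SCORE = 42

TOPIC_WEIGHTS_PERCENT = {
    "Spark Architecture: Conceptual understanding": 17,
    "Spark Architecture: Applied understanding": 11,
    "Spark DataFrame API Applications": 72,
}

minutes_per_question = TOTAL_MINUTES / TOTAL_QUESTIONS
passing_percent = 100 * PASSING_SCORE / TOTAL_QUESTIONS

# Rough question count per topic, rounded to the nearest whole question.
estimated_questions = {
    topic: round(TOTAL_QUESTIONS * percent / 100)
    for topic, percent in TOPIC_WEIGHTS_PERCENT.items()
}

print(f"Time budget: {minutes_per_question:.1f} minutes per question")
print(f"Passing mark: {passing_percent:.0f}%")
for topic, count in estimated_questions.items():
    print(f"{topic}: about {count} questions")
```

Under these estimates, the DataFrame API section alone accounts for about 43 of the 60 questions, which is roughly the number you need to answer correctly to pass.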

Since you only need 70% to pass, it is clear that focusing on the portion of the exam which comprises over 70% of the material would be a smart move. Get comfortable running Spark locally, building DataFrames with sample data, and just playing around with them. Find a dataset you’re interested in, download it, and try to load it into a DataFrame. Analyze that data using the DataFrame API and see if you can make any cool observations about it. Once you have some interesting results, write them out to CSV or Parquet files.

Treating this as an interesting data science project instead of a set of facts to memorize will make the entire learning process much more enjoyable and make it easier to study without losing interest. If you’re a more visual learner, you might also benefit from some of the classes available for learning Spark.

Tip 4: Take Some Online Classes

I took several online Spark classes in preparation for the exam. Below is the only class I would recommend:

I also did the official Structured Streaming course offered by Databricks, but it was not relevant for passing the exam. It was a great course for learning about Structured Streaming, so you might still want to check it out.

6 thoughts on “4 Tips to Become a Databricks Certified Associate Developer for Apache Spark: June 2020”

  1. Hi Peter Boyajian,

    Firstly, congrats on clearing the new version Databricks Spark certification.

    I am also interested in taking this exam, so I wanted to get a couple of doubts cleared up here.
    1. Is preparing from the Spark Definitive Guide, as you mentioned in your blog, sufficient to clear the certification? Do Datasets and RDDs play any major role in the exam apart from Spark architecture and DF?
    2. Can you share your experience of the level of difficulty? I am assuming that this is not an adaptive test where difficulty changes based on our successful/failed attempts. How difficult were the questions, and what type of questions did you find difficult or tricky?

    Best Wishes.

    1. Hi Aravind,
      Thank you, I will try to address your questions in order.

      1. a) Yes, I believe that mastery of the material in the Spark Definitive Guide would be sufficient to clear the exam. That said, it can be difficult to master the material without plenty of practice. Accordingly, I would recommend that you not only go through the examples in the book, but also attempt to go through an online Spark course or two if you have the time.
      b) No, to my recollection, Datasets and RDDs did not play any other major role on the exam.

      2. The difficulty of the exam is largely a subjective metric which depends on your past experience. In my case, I had taken a couple of Spark-related online courses (those mentioned in the blog) and then spent a few weeks going through the first 19 chapters of the Definitive Guide, and I managed to pass on my first attempt (just barely). During the exam, I was not sure whether or not I would pass, so it was quite difficult from that perspective. If you have a stronger background in Spark, then it should be nothing to worry about. If your background is similar to mine, then it might be worth spending a bit of extra time practicing until you really feel comfortable with the material. There were not many difficult reasoning questions; it was more a matter of ‘do you know these facts?’ than ‘can you figure this out?’. I remember being a little hung up on questions regarding whether a particular function should be categorized as a wide vs narrow transformation, for example.

      Let me know if you have any other questions!

  2. Hi Peter,

    Thanks for your blog. I am preparing for the certification. Please help with the following questions.

    1. The exam questions are MCQs, so are there any programming questions for Spark?

    2. Is there any practice test you can recommend, so we can take it before the exam and understand
    which topics need improvement?

    1. Hi Pankaj,

      1. There were no programming questions, but there were some questions where you had to correctly identify which code would fill in the blanks.
      2. I was unable to find any helpful practice tests for the exam.

      Good luck on the exam!

  3. Hi Peter,

    Thank you very much for providing your feedback, and congratulations on passing the exam on your first attempt.
    I am scheduled to take the exam on August 2nd and have some experience with Spark. I just want to clarify the point below with you.

    Were there any questions on Streaming, Machine Learning, and Graph?
    I have been reading other posts that say to expect 20% for the above three topics. As the exam format has evolved a lot and I could not find the breakdown, I just wanted to run it by you.

    1. Hi Ankurkumar,

      There were not any questions on Streaming, Machine Learning, or Graph when I took the exam.

      Good luck with your studies and let me know if you have any other questions.
