In this blog, we will learn how to create Databricks deployment pipelines that deploy Databricks components (notebooks, libraries, config files, and packages) via Jenkins.
Note: We will use the Databricks CLI for the deployment, which means at least one Jenkins node must have the Databricks CLI installed.
We will use the Databricks CLI to build the deployment pipelines. With it, we can easily import and export notebook directories to and from the Databricks workspace, and we can also copy libraries to DBFS and install them on a cluster.
Prerequisites on the Jenkins node:
- Databricks CLI
- Authentication token for the Databricks CLI
- Nexus plugin
- jq (command-line JSON processor)
- Git credentials to clone the repositories
- Databricks CLI credentials
- Port 443 must be open for Databricks REST API calls
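The Databricks CLI reads its credentials from `~/.databrickscfg`, which you can generate on the Jenkins node by running `databricks configure --token`. A minimal sketch of the resulting file (the host and token below are placeholders, not real values):

```ini
[DEFAULT]
host = https://<your-databricks-instance>
token = <personal-access-token>
```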
Databricks components that we need to deploy:
- Notebooks
- Libraries
- Config files and packages
Databricks Continuous Delivery Approach For Notebooks:
Consider a scenario in which there are two environment clusters, DEV and PROD, and two users, Alice and Bob, working on the current project.
Let's say Alice is working on notebooks that are run through the Databricks Job Scheduler. After developing code in her DEV workspace, Alice can export her code with databricks workspace export_dir to her Git repository and initiate a pull request.
Bob can then review and approve the PR, after which Alice merges her changes to master. This merge triggers a Continuous Delivery job that runs databricks workspace import_dir against the production cluster, bringing all the notebook changes into production.
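The Continuous Delivery job described above can be sketched as a Jenkins declarative pipeline. The repository URL, credentials ID, and workspace paths below are hypothetical placeholders; it assumes the Databricks CLI is already installed and configured on the agent:

```groovy
pipeline {
    agent any
    stages {
        stage('Checkout') {
            steps {
                // Clone the notebook repository (hypothetical URL and credentials ID)
                git branch: 'master',
                    credentialsId: 'git-credentials',
                    url: 'https://github.com/<org>/<notebook-repo>.git'
            }
        }
        stage('Deploy notebooks') {
            steps {
                // Push the notebook directory from the repo into the PROD workspace
                sh 'databricks workspace import_dir notebooks /Shared/notebooks --overwrite'
            }
        }
    }
}
```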
Export a notebook directory from the Databricks workspace to your local machine:
databricks workspace export_dir <Databricks-Workspace-Path> <Local-Path> --overwrite
Import a notebook directory into the Databricks workspace:
databricks workspace import_dir <Local-Path> <Databricks-Workspace-Path> --overwrite
Databricks Continuous Delivery Approach For Libraries:
Consider a scenario in which a large number of developers are working on a demo project. Let's say Alice is currently working on the demo: she creates a new branch from the develop branch, adds new functionality in her own branch, and then tests that functionality in the Databricks DEV environment.
Once testing is complete and everything works, Alice raises a PR against the develop branch. A reviewer reviews the PR and merges it into develop. Whenever new code lands on the develop branch, the PROD Jenkins pipeline triggers automatically: it builds the JAR from the develop branch, copies it to DBFS, and installs the library on the cluster.
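As with the notebook pipeline, the library pipeline can be sketched as a Jenkinsfile. The repository URL, build tool (Maven is assumed here), JAR name, DBFS path, and cluster ID are all hypothetical placeholders:

```groovy
pipeline {
    agent any
    stages {
        stage('Checkout') {
            steps {
                // Clone the library repository (hypothetical URL and credentials ID)
                git branch: 'develop',
                    credentialsId: 'git-credentials',
                    url: 'https://github.com/<org>/<library-repo>.git'
            }
        }
        stage('Build JAR') {
            steps {
                // Assumes a Maven build; adjust to your build tool
                sh 'mvn -B clean package'
            }
        }
        stage('Deploy and install') {
            steps {
                // Copy the JAR to DBFS, then install it on the target cluster
                sh 'dbfs cp target/demo.jar dbfs:/libraries/demo.jar --overwrite'
                sh 'databricks libraries install --cluster-id <CLUSTER-ID> --jar dbfs:/libraries/demo.jar'
            }
        }
    }
}
```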
dbfs cp <LIBRARY_PATH> <DATABRICKS-FILESYSTEM-PATH> --overwrite
databricks libraries install --cluster-id <CLUSTERID> --jar <DATABRICKS-FILESYSTEM-PATH>
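This is where the jq prerequisite comes in: the Databricks CLI can return JSON, and jq extracts fields such as the cluster ID from it. A minimal sketch against a sample response in the shape produced by `databricks clusters list --output JSON` (the cluster names and IDs below are made up for illustration):

```shell
# Sample response shaped like `databricks clusters list --output JSON` output
# (cluster names and IDs here are fabricated for illustration)
cat > clusters.json <<'EOF'
{
  "clusters": [
    {"cluster_id": "1234-567890-abcde123", "cluster_name": "dev-cluster"},
    {"cluster_id": "9876-543210-zyxwv987", "cluster_name": "prod-cluster"}
  ]
}
EOF

# Use jq to look up a cluster ID by cluster name
CLUSTER_ID=$(jq -r '.clusters[] | select(.cluster_name == "prod-cluster") | .cluster_id' clusters.json)
echo "$CLUSTER_ID"
```

The extracted ID can then be passed straight to `databricks libraries install --cluster-id`.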
- Removes human intervention, which is where mistakes happen.
- Fully automated deployment pipeline.
- Removes the library inconsistency problem.
- No one touches the PROD environment directly.
The notebook and library deployment pipelines are isolated from each other: whenever new code is pushed to the master branch of either repository (notebook or library), the corresponding deployment pipeline is triggered.
Using the Databricks CLI, we can easily create Jenkins deployment pipelines. The Databricks CLI must have permission to access the Databricks workspace, DBFS, and notebooks.
Thank you for sticking with me to the end. If you liked this blog, please show your appreciation with a thumbs up, share it, and give me suggestions on how I can improve future posts. Follow me to get updates on different technologies.