Databricks Jobs API


The Databricks Jobs API follows the guiding principles of the REST (Representational State Transfer) architecture. We can authenticate to the Databricks REST API with either a Databricks personal access token or a username and password. The Databricks Jobs API 2.1 supports jobs with multiple tasks.
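
As a minimal sketch of the authentication described above (the workspace URL, token, and credentials below are placeholders, not real values), a Python client only needs to attach the personal access token as a Bearer header, or alternatively use basic authentication:

import requests

# Placeholders: replace with your own workspace URL and personal access token.
DATABRICKS_HOST = "https://<databricks-instance>"
TOKEN = "dapiXXXXXXXXXXXXXXXX"

session = requests.Session()
# Personal access token authentication: sent as a Bearer header on every call.
session.headers.update({"Authorization": f"Bearer {TOKEN}"})
# Alternatively, basic authentication with a username and password:
# session.auth = ("user@example.com", "password")

# Sanity check: an authenticated call should return 200 instead of 401.
print(session.get(f"{DATABRICKS_HOST}/api/2.1/jobs/list").status_code)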

The main Databricks Jobs API endpoints are described below:

Creating a New Job

Users can send a request to the server to create a new job. The create endpoint uses the HTTP POST method, and the request body schema is as follows:

  • name (String): An optional name for the job. The default name is "Untitled".
  • job_clusters (Array): A list of job cluster specifications. Libraries cannot be declared in a shared job cluster; declare them in the task settings instead.
  • tags (Object): A map of tags associated with the job. The default is "{}". At most 25 tags can be added to a job.
  • tasks (Array): A list of task specifications to be executed by the job.
  • email_notifications (Object): Email addresses to notify when the job succeeds or fails.
  • timeout_seconds (Integer): An optional timeout applied to each run of the job. The default behavior is to have no timeout.
  • schedule (Object): A schedule for running the job at a user-defined time.
  • access_control_list (Array): A list of permissions to set on the job.

A request to the Databricks Jobs API results in one of the following four responses:

  1. 200: Indicates that the job was successfully created.
  2. 400: Indicates that the request was malformed.
  3. 401: Indicates that the request was unauthorized.
  4. 500: Indicates that the request was not handled correctly due to a server error.
Example

URL - https://<databricks-instance>/api/2.1/jobs/create

{
        "name": "test_job",
        "email_notifications": {
            "no_alert_for_skipped_runs": false
        },
        "webhook_notifications": {},
        "timeout_seconds": 0,
        "max_concurrent_runs": 1,
        "tasks": [
            {
            	"task_key": "test_notebook",
                "notebook_task": {
                    "notebook_path": "/aws/test",
                    "source": "WORKSPACE"
                },
                "job_cluster_key": "test_cluster",
                "timeout_seconds": 36000,
                "email_notifications": {}
            }
        ],
        "job_clusters": [
            {
                "job_cluster_key": "test_cluster",
                "new_cluster": {
                    "cluster_name": "",
                    "spark_version": "11.3.x-scala2.12",
                    "spark_conf":
                    "spark.databricks.delta.formatCheck.enabled": "false"  
                    },
                    "aws_attributes": {
                        "first_on_demand": 6,
                        "availability": "SPOT_WITH_FALLBACK",
                        "zone_id": "auto",
                        "spot_bid_price_percent": 100,
                        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
                        "ebs_volume_count": 1,
                        "ebs_volume_size": 100
                    },

                    "node_type_id": "r6g.8xlarge",
                    "custom_tags": {
                        "Function": "sparkcluster",
                        "CreatedBy": "autocluster",
                        "ManagingTeamEmail": "xyz@gmail.com,
                        "CodeMaturity": "dev"

                    },
                    "enable_elastic_disk": true,
                    "runtime_engine": "STANDARD",
                    "autoscale": {
                        "min_workers": 3,
                        "max_workers": 10
                    }
                }
            }
        ],
        "format": "MULTI_TASK"
    }
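
As a hedged sketch of sending the create request from Python (the host, token, and minimal payload below are placeholders; in practice the payload would be the full JSON body shown above), the four response codes listed earlier can be handled like this:

import requests

DATABRICKS_HOST = "https://<databricks-instance>"   # placeholder workspace URL
HEADERS = {"Authorization": "Bearer dapiXXXXXXXXXXXXXXXX"}   # placeholder token

# Minimal illustrative payload; use the full create-job body shown above in practice.
payload = {"name": "test_job", "format": "MULTI_TASK", "tasks": []}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers=HEADERS,
    json=payload,
)

if response.status_code == 200:
    print("Created job_id:", response.json()["job_id"])   # 200: job created
elif response.status_code in (400, 401):
    print("Client error:", response.text)                 # 400 malformed / 401 unauthorized
else:
    print("Unexpected error:", response.status_code)      # e.g. 500 server error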

Listing All the Jobs

Listing all jobs uses the HTTP GET method. The result is returned in JSON format and includes the details of each job.

Example 
URL - https://<databricks-instance>/api/2.1/jobs/list

{
  "jobs": [
    {
      "job_id": 12589654,
      "creator_user_name": "xyz@gmail.com",
      "settings": {
        "name": "test_job",
        "tags": {
          "cost-center": "engineering",
          "team": "jobs"
        },
        "tasks": [
          {
            "task_key": "test1",
            "description": "Running test1 job",
            "depends_on": [],
            "existing_cluster_id": "0923-164208-meows279",
            "spark_jar_task": {
            "main_class_name": "com.databricks.test",
          },
            "libraries": [
              {
                "jar": "dbfs:/mnt/test/databricks/jdbc.jar"
              }
            ],
            "timeout_seconds": 86400,
            "max_retries": 3,
            "min_retry_interval_millis": 2000,
            "retry_on_timeout": false
          },
          {
            "task_key": "test2",
            "description": "Running test2 job",
            "depends_on": [],
            "job_cluster_key": "auto_scaling_cluster",
            },
            "libraries": [
              {
                "jar": "dbfs:/mnt/databricks/jdbc.jar"
              }
            ],
            "timeout_seconds": 86400,
            "max_retries": 3,
            "min_retry_interval_millis": 2000,
            "retry_on_timeout": false
          },
          {
            "task_key": "test3",
            "description": "Matches orders with user sessions",
            "depends_on": [
              {
                "task_key": "test1"
              },
              {
                "task_key": "test2"
              }
            ],
            "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "spark_conf": {
            "spark.speculation": true
              },
              "aws_attributes": {
                "availability": "SPOT",
                "zone_id": "us-west-2a"
              },
              "autoscale": {
                "min_workers": 2,
                "max_workers": 16
              }
            },
            "notebook_task": {
              "notebook_path": "/Users/test/test_notebook",
              "source": "WORKSPACE",
              "base_parameters": {
              "name": "John Doe",
              "age": "35"
              }
            },
            "timeout_seconds": 86400,
            "max_retries": 3,
            "min_retry_interval_millis": 2000,
            "retry_on_timeout": false
          }
        ],
        "job_clusters": [
          {
            "job_cluster_key": "auto_scaling_cluster",
            "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "spark_conf": {
            "spark.speculation": true
              },
              "aws_attributes": {
                "availability": "SPOT",
                "zone_id": "us-west-2a"
              },
              "autoscale": {
                "min_workers": 2,
                "max_workers": 16
              }
            }
          }
        ],
       
        "timeout_seconds": 86400,
        "schedule": {
          "quartz_cron_expression": "20 30 * * * ?",
          "timezone_id": "Europe/London",
          "pause_status": "PAUSED"
        },
        "max_concurrent_runs": 10,
        "format": "MULTI_TASK"
      },
      "created_time": 1601370337343
    }
  ],
  "has_more": false
}

You can also get the details of a single job through its job_id: pass the job_id as a parameter to the get endpoint and it returns the details of that particular job, as sketched below.
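
A minimal sketch of both calls in Python, reusing the placeholder host and token from earlier (the job_id value is only illustrative):

import requests

DATABRICKS_HOST = "https://<databricks-instance>"   # placeholder workspace URL
HEADERS = {"Authorization": "Bearer dapiXXXXXXXXXXXXXXXX"}   # placeholder token

# List all jobs and print their ids and names.
listing = requests.get(f"{DATABRICKS_HOST}/api/2.1/jobs/list", headers=HEADERS).json()
for job in listing.get("jobs", []):
    print(job["job_id"], job["settings"]["name"])

# Fetch a single job by passing job_id as a query parameter to the get endpoint.
job = requests.get(
    f"{DATABRICKS_HOST}/api/2.1/jobs/get",
    headers=HEADERS,
    params={"job_id": 12589654},   # illustrative job_id taken from the listing above
).json()
print(job["settings"])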

Updating and Resetting Jobs

The Databricks Jobs API has an update endpoint for modifying a job. With the help of this endpoint, you can easily add, change, or remove specific settings of an existing job. (The related reset endpoint instead overwrites all of a job's settings at once.)

The update endpoint uses an HTTP POST request, and the body schema is defined below:

  • job_id (Integer): The unique ID associated with every job at the time of creation. It is a mandatory field when updating a job.
  • new_settings (Object): The new settings you want to apply to the job.
  • fields_to_remove (Array): Top-level fields to remove from the job settings. Removing nested fields is not supported. This field is optional.
Example

URL - https://<databricks-instance>/api/2.1/jobs/update

{
    "job_id": 123456789,
    "creator_user_name": "databricks_user",
    "run_as_user_name": "databricks_user",
    "run_as_owner": true,
    "new_settings": {
        "name": "test_job",
        "email_notifications": {
           "no_alert_for_skipped_runs": false
        },
        "timeout_seconds": 0,
        "max_concurrent_runs": 2,
        "tasks": [
            {
                "task_key": "test",
                "notebook_task": {
                    "notebook_path": "/User/test_notebook",
                    "source": "WORKSPACE"
                },
                "job_cluster_key": "test_job_cluster",
                "timeout_seconds": 0,
                "email_notifications": {}
            }
        ],
        "job_clusters": [
            {
                "job_cluster_key": "test_job_cluster",
                "new_cluster": {
                    "cluster_name": "",
                    "spark_version": "7.3.x-scala2.12",
                    "aws_attributes": {
                        "first_on_demand": 1,
                        "availability": "SPOT_WITH_FALLBACK",
                        "zone_id": "auto",
                        "spot_bid_price_percent": 100,
                        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
                        "ebs_volume_count": 3,
                        "ebs_volume_size": 100
                    },
                    "node_type_id": "r5.4xlarge",
                    "enable_elastic_disk": false,
                    "autoscale": {
                        "min_workers": 30,
                        "max_workers": 70
                    }
                }
            }
        ],
        "format": "MULTI_TASK",
        "access_control_list": [
        {
        "user_name": "xyz@gmail.com",
        "permission_level": "CAN_MANAGE"
        }
        ]
    }
}

As a result, you will receive any of the four responses mentioned above.
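
As a hedged sketch (placeholder host, token, and job_id), the same update can be sent from Python; it also shows the optional fields_to_remove array, which the JSON example above does not use:

import requests

DATABRICKS_HOST = "https://<databricks-instance>"   # placeholder workspace URL
HEADERS = {"Authorization": "Bearer dapiXXXXXXXXXXXXXXXX"}   # placeholder token

# Partially update a job: change max_concurrent_runs and remove the schedule setting.
body = {
    "job_id": 123456789,                          # illustrative job_id
    "new_settings": {"max_concurrent_runs": 2},
    "fields_to_remove": ["schedule"],             # optional: top-level settings to drop
}

response = requests.post(f"{DATABRICKS_HOST}/api/2.1/jobs/update", headers=HEADERS, json=body)
print(response.status_code, response.text)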

Deleting a Job and Task

To delete a job, Databricks provides a POST endpoint that deletes the job identified by its job_id. The body only needs to contain the job_id in JSON format, and the result is one of the four responses mentioned above.

Example

URL - https://<databricks-instance>/api/2.1/jobs/delete

{
  "job_id": 123456789
}
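
A minimal sketch of the same delete call from Python, again with placeholder host, token, and job_id:

import requests

DATABRICKS_HOST = "https://<databricks-instance>"   # placeholder workspace URL
HEADERS = {"Authorization": "Bearer dapiXXXXXXXXXXXXXXXX"}   # placeholder token

# Delete the job identified by job_id; a 200 response means the job was removed.
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/delete",
    headers=HEADERS,
    json={"job_id": 123456789},   # illustrative job_id
)
print(response.status_code)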

Benefits of using Databricks Jobs API

  • Through the Databricks Jobs API, we can create, modify, list, and delete jobs and check job runs using API requests, without going through the UI.
  • It can be integrated with any language that can make an HTTP request.
  • Integrating the Databricks Jobs API with other tools enables event-based triggers for Databricks jobs, creating more efficient runs.

Conclusion

In this blog, you learned about Databricks and the basic operations of the Databricks Jobs API. You now have an idea of how to create, list, update, and delete jobs, and of the body/parameters required for each request. For more such blogs, you can click here.
