Databricks jobs


Jobs

A job is a way to run non-interactive code in a Databricks cluster. For example, you can run an extract, transform, and load (ETL) workload immediately or on a schedule; this contrasts with running code interactively in the notebook UI.


Your job can consist of a single task or can be a large, multi-task workflow with complex dependencies. Databricks manages the task orchestration, cluster management, monitoring, and error reporting for all of your jobs. You can run your jobs immediately or periodically through an easy-to-use scheduling system.

Features of Databricks Jobs

  • You can specify the type of task to run. In the Type drop-down, select Notebook, JAR, Spark Submit, Python, or Pipeline.
  • You can pass parameters for your task. Each task type has different requirements for formatting and passing the parameters.
  • You can optionally allow multiple concurrent runs of the same job; to do so, click Edit concurrent runs in the Job details panel.
  • To optionally specify email addresses to receive alerts on job events, click Edit alerts in the Job details panel.
  • To optionally control permission levels on the job, click Edit permissions in the Job details panel.
  • Databricks also supports scheduling jobs to run periodically.
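The settings above come together in the JSON body sent to the Jobs API when creating a job. Below is a minimal sketch of such a payload; the job name, notebook path, cluster ID, and email address are hypothetical placeholders, not values from this post.

```python
# Sketch of a Jobs API 2.1 "create" payload illustrating the settings
# above. All concrete values here are illustrative assumptions.
job_spec = {
    "name": "nightly-etl",
    "max_concurrent_runs": 2,                # allow concurrent runs of this job
    "email_notifications": {                 # alerts on job events
        "on_failure": ["data-team@example.com"],
    },
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {               # task type: Notebook
                "notebook_path": "/Repos/etl/ingest",
                # notebook tasks take parameters as key/value pairs
                "base_parameters": {"run_date": "2022-01-01"},
            },
            "existing_cluster_id": "1234-567890-abcde123",  # hypothetical
        }
    ],
}
```

Other task types format their parameters differently; for example, a Spark Submit task passes them as a list of command-line arguments rather than a key/value map.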

Task dependencies

You can define the order of execution of tasks in a job using the Depends on drop-down. You can set this field to one or more tasks in the job.
Configuring task dependencies creates a Directed Acyclic Graph (DAG) of task execution, a common way of representing execution order in job schedulers. For example, consider the following job consisting of four tasks:

  • Task 1 is the root task and does not depend on any other task.
  • Tasks 2 and 3 depend on task 1 completing first.
  • Task 4 depends on task 2 and task 3 completing successfully.

Databricks runs upstream tasks before running downstream tasks, running as many of them in parallel as possible.
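The four-task example above can be expressed with the `depends_on` field of each task. The sketch below does that, and adds a small helper (a simple Kahn-style topological sort, written here for illustration, not part of the Databricks API) that groups tasks into "waves" that can run in parallel:

```python
# The four-task DAG from the example, expressed as "depends_on" entries.
tasks = [
    {"task_key": "task1", "depends_on": []},
    {"task_key": "task2", "depends_on": [{"task_key": "task1"}]},
    {"task_key": "task3", "depends_on": [{"task_key": "task1"}]},
    {"task_key": "task4", "depends_on": [{"task_key": "task2"},
                                         {"task_key": "task3"}]},
]

def execution_waves(tasks):
    """Group task keys into waves that may run concurrently, in order."""
    # Map each task to the set of tasks it is still waiting on.
    remaining = {t["task_key"]: {d["task_key"] for d in t["depends_on"]}
                 for t in tasks}
    waves = []
    while remaining:
        # Tasks with no unmet dependencies can all run in parallel.
        ready = sorted(k for k, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cycle detected: not a DAG")
        waves.append(ready)
        for k in ready:
            del remaining[k]
        for deps in remaining.values():
            deps -= set(ready)
    return waves
```

Running `execution_waves(tasks)` yields `[["task1"], ["task2", "task3"], ["task4"]]`: task 1 runs alone, tasks 2 and 3 run in parallel, and task 4 runs last.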

Pros of Databricks job scheduling

  • You can monitor job run results using the UI, CLI, API, and email alerts.
  • In job scheduling, you have the option to run your jobs immediately or periodically through an easy-to-use scheduling system.
  • You can specify the period, starting time, and time zone. Optionally select the Show Cron Syntax checkbox to display and edit the schedule in Quartz Cron Syntax.
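A schedule set this way corresponds to a `schedule` block in the job specification. The sketch below shows one such block; the cron expression and time zone are illustrative choices, not values from this post. Note that Quartz cron expressions begin with a seconds field.

```python
# Sketch of the "schedule" block of a job spec. Quartz cron has a
# seconds field first, so "0 30 1 * * ?" means 01:30 every day.
schedule = {
    "quartz_cron_expression": "0 30 1 * * ?",  # daily at 01:30
    "timezone_id": "America/Los_Angeles",      # hypothetical time zone
    "pause_status": "UNPAUSED",                # the job is actively scheduled
}
```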

Cons of Databricks job scheduling

  • You can create jobs only in a Data Science & Engineering workspace or a Machine Learning workspace.
  • A workspace is limited to 1000 concurrent job runs. A 429 Too Many Requests response is returned when you request a run that cannot start immediately.
  • The number of jobs a workspace can create in an hour is limited to 5000 (includes “run now” and “runs submit”). This limit also affects jobs created by the REST API and notebook workflows.
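Because a throttled request returns 429 rather than queueing the run, callers are expected to retry. The sketch below shows one way to handle that with exponential backoff; `trigger` stands in for the actual HTTP call to the run-now endpoint and is an assumption here, not a Databricks SDK function.

```python
import time

def run_now_with_retry(trigger, max_retries=5, base_delay=1.0):
    """Retry a 'run now' request while the workspace returns 429.

    `trigger` is any callable returning (status_code, body); in practice
    it would POST to the Jobs API run-now endpoint.
    """
    for attempt in range(max_retries):
        status, body = trigger()
        if status != 429:
            return status, body
        # Back off exponentially before retrying the throttled request.
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("still throttled after %d retries" % max_retries)
```

With a stub trigger that returns 429 twice and then succeeds, the helper returns the successful response on the third attempt.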

Conclusion

In this blog, we’ve learned about Databricks jobs and the features Databricks provides for them, as well as task dependencies and the pros and cons of job scheduling.
Hope you enjoyed the blog. Thanks for reading.

References

https://www.topcoder.com/thrive/articles/job-processing-with-databricks
