In this blog we will go over the core concepts basic you must understand if you want to use Apache airflow.
In this article, you will learn:
- What is Airflow
- Architecture Overview
- Dag Run
- Execution Date
Airflow was started in October 2014 and developed by Maxime Beauchemin at Airbnb. It is a platform for programmatically authoring, scheduling, and monitoring workflows. It is completely open source and is specially useful in architecting and orchestrating complex data pipelines.
Benefits of Apache Airflow include:
- Ease of use—you only need a little python knowledge to get started.
- Open-source community—Airflow is free and has a large community of active users.
- Integrations—ready-to-use operators allow you to integrate Airflow with cloud platforms (Google, AWS, Azure, etc).
- Coding with standard Python—you can create flexible workflows using Python with no knowledge of additional technologies or frameworks.
- Graphical UI—monitor and manage workflows, check the status of ongoing and completed tasks.
The Airflow platform lets you build and run workflows, which are represented as Directed Acyclic Graphs(DAGs).A sample DAG is shown in the diagram below.SEO
A DAG contains Tasks (action items) and specifies the dependencies between them and the order in which they are executed. A Scheduler handles scheduled workflows and submits Tasks to the Executor, which runs them. The Executor pushes tasks to workers.
Apache Airflow Architecture Components:
- A scheduler, which handles both triggering scheduled workflows, and submitting Tasks to the executor to run.
- An executor, which handles running tasks. In the default Airflow installation, this runs everything inside the scheduler, but most production-suitable executors actually push task execution out to workers.
- A webserver, which presents a handy user interface to inspect, trigger and debug the behaviour of DAGs and tasks.
- A folder of DAG files, read by the scheduler and executor (and any workers the executor has)
- A metadata database, used by the scheduler, executor and webserver to store state.
Read about each of these components in details here.
A DAG (Directed Acyclic Graph) is a collection of all the tasks you want to run, organized in a way that reflects their relationship and dependencies.
In simple terms, it is a graph with nodes, directed edges and no cycles and defined in a python script.
When we run a DAG all the tasks inside it will run the way we have orchestrated. It can be parallel runs, sequential runs or even runs depending on another task or external resources.
The DAG itself doesn’t care about what is happening inside the tasks; it merely concerned with how to execute them- the order to run them in, how money times to retry them, if they have timeouts, and so on.
In Airflow, a Task is the most basic unit of execution. Tasks are organized into DAGs, and upstream and downstream dependencies are established between them to define the order in which they should be executed.
There are three basic kinds of tasks:
- Operators, predefined task templates that you can string together quickly to build most parts of your DAGs.
- Sensors, a special subclass of Operators which are entirely about waiting for an external event to happen.
- A TaskFlow-decorated
@task, which is a custom Python function packaged up as a Task.
A trigger rule defines why a task gets triggered, on which condition. Airflow provides several trigger rules that can be specified in the task and based on the rule, the Scheduler decides whether to run the task or not.
List of all the available trigger rules:-
- all_success: (default) all parents must have succeeded
- all_failed: all parents are in a failed or upstream_failed state
- all_done: all parents are done with their execution
- one_failed: fires as soon as at least one parent has failed, it does not wait for all parents to be done
- one_success: fires as soon as at least one parent succeeds, it does not wait for all parents to be done
- none_failed: all parents have not failed (failed or upstream_failed) i.e. all parents have succeeded or been skipped
- none_failed_or_skipped: all parents have not failed (failed or upstream_failed) and at least one parent has succeeded.
- none_skipped: no parent is in a skipped state, i.e. all parents are in success, failed, or upstream_failed state
- dummy: dependencies are just for show, trigger at will
An Operator is basically a template for a predefined Task, that you can just defined declaratively inside your DAG.
An operator provide integration to some other service like MySQLOperator, SlackOperator, prestoOperator, etc which provides a way to access these services from airflow.
There are three main types of operators:
- Operators that perform an action or tell another system to perform an action
- Transfer Operators that move data from one system to another
- Sensors that will keep running until a specific criterion is met.
Some common operators available in Airflow are:
- BashOperator – used to execute bash command on the machine it runs on.
- PythonOperator- takes any python function as an input and calls the same (this means the function should have a specific signature as well)
- EmailOperator- sends email using SMTP server configured.
- simplehttpOperator- makes an HTTP request that can be used to trigger actions on a remote system.
- MySQLOperator, SquliteOperator, PostgresOperator, MySQLOperator, OracleOperator, jdbcOperator etc- used to run SQL commands.
A DAG Run is an object representing a instantiation of the DAG in time.
Each DAG may or may not have a schedule, which informs how DAG Runs are created.
schedule_interval is defined as a DAG argument, which can be passed a cron expression as a
datetime.timedelta object, or one of the following cron “presets”.
execution_date is the logical date and time at which the DAG Runs, and its task instances, run. This also acts as a unique identifier for each DAG Run.
There are two types of executor
Read Apache Airflow Documentation for more knowledge.
To read more tech blogs, visit Knoldus Blogs.