What is Apache Airflow?
- Apache Airflow is a workflow management system which is used to programmatically author, schedule and monitor workflows.
- In Airflow, each workflow is defined as a DAG (Directed Acyclic Graph) of tasks.
- Airflow allows users to create workflows with high granularity and track the progress as they execute.
- It makes it easy to orchestrate potentially large data operations.
- For example: suppose you want to run an SQL query every day, or every hour, and store the results of that query as a Parquet file in an S3 bucket. That sequence of operations can be done with out-of-the-box components within Airflow. The query could potentially generate terabytes of data, but Airflow doesn’t really care, because it is not looking at every single row that comes out of the query. Airflow coordinates the movement of the data rather than processing every row itself.
Four Clear Benefits of Apache Airflow
One, Apache Airflow is open source. While it’s not for everyone, many data scientists prefer to support and work with their peers in the community rather than buy commercial software. There are advantages: you can download it and start using it right away instead of enduring a long procurement cycle to get a quote, submit a proposal, secure the budget, and sign the licensing contract. It’s liberating to be in control and make the selection whenever you want to.
Two, it’s very flexible. Some commercial tools are great until you go off the main path and try to do something a little creative that goes beyond what the tool was designed to do. Airflow is designed to work within an architecture that is standard for nearly every software development environment. Dynamic pipeline generation is another attractive aspect of its flexibility: because pipelines are defined in Python code, they can be generated programmatically. You can run one big Airflow server or multiple small ones; the flexibility is there to support either approach.
Three, it’s highly scalable, both up and down. A typical deployment simply runs Airflow on a single server. It also runs very well inside a Docker container, even on your laptop, to support local development of pipelines. And you can scale it up to very large deployments with dozens of nodes running tasks in parallel in a highly available clustered configuration.
Four, it runs very well in a cloud environment. There are options to run it in a cloud-native, scalable fashion; it works with Kubernetes and with auto-scaling cloud clusters. It’s fundamentally a Python system that deploys as a couple of services. So any environment that can run one or more Linux boxes with Python and a database for state management can run Airflow, which opens a lot of options for data scientists.
Apache Airflow Basic Concepts
Airflow has some basic terms that will be used throughout the series while building and monitoring data pipelines. These terms are as follows:
Tasks
A Task is the basic unit of execution. A Task can read data from a database, process the data, store the data in a database, and so on. There are three basic types of Tasks in Airflow:
- Operators: They are predefined templates used to build most of the Tasks.
- Sensors: They are a special subclass of Operators and have only one job — to wait for an external event to take place so they can allow their downstream tasks to run.
- TaskFlow: Added in Airflow 2.0, the @task decorator turns a plain Python function into a Task and makes it easier to share data between tasks in a data pipeline.
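To make the Sensor idea concrete, here is a minimal sketch in plain Python. It is not Airflow's API: `wait_for`, `file_has_arrived` and `run_pipeline` are hypothetical names, and only the parameter names `poke_interval` and `timeout` are borrowed from the options Airflow sensors expose. It shows the pattern of polling for an external event and only then letting the downstream step run.

```python
import time
from typing import Callable, List


def wait_for(condition: Callable[[], bool],
             poke_interval: float = 0.01,
             timeout: float = 1.0) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poke_interval)  # wait before checking again
    return False


# Hypothetical external event: a "file" that shows up on the third check.
checks = {"count": 0}


def file_has_arrived() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


def run_pipeline() -> List[str]:
    """Run the downstream step only after the sensor-like wait succeeds."""
    log = []
    if wait_for(file_has_arrived):
        log.append("sensor: event seen")
        log.append("downstream: process file")
    else:
        log.append("sensor: timed out")
    return log
```

Calling `run_pipeline()` here polls three times before the downstream step is allowed to run. A real Airflow sensor does the same kind of waiting, but the scheduler and executor manage it for you.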
Directed Acyclic Graphs
In basic terms, a DAG is a graph whose nodes are connected by directed edges and which contains no cycles. In Airflow, the Tasks are the nodes and the directed edges represent the dependencies between Tasks.
Just as a DAG has directed edges connecting its nodes, an Airflow DAG has dependencies connecting its tasks. These dependencies define the order in which the tasks in the workflow are carried out.
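The idea of tasks as nodes and dependencies as directed edges can be sketched with Python's standard `graphlib` module (Python 3.9+). This is only an illustration of the graph concept, not how Airflow is implemented, and the task names are made up.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its set holds the upstream tasks it depends on.
# The task names are hypothetical.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def execution_order(deps):
    """Return one valid run order; raises CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps) -> bool:
    """Report whether the dependency graph is cyclic (and so not a DAG)."""
    try:
        execution_order(deps)
    except CycleError:
        return True
    return False


order = execution_order(dependencies)
print(order)
```

Because every task here has exactly one upstream task, there is only one valid order: extract, transform, load, notify. If two tasks depended on each other, `has_cycle` would return True, which is exactly the situation the "acyclic" in DAG rules out.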
Task Instance
A Task Instance is a single execution of a task. It also records the state of the Task, such as “running”, “success”, “failed”, “skipped”, or “up for retry”. The Airflow UI color-codes each of these states.
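The way one execution of a task moves through these states can be sketched as follows. `TaskInstanceSketch` is a hypothetical class written for this post, not Airflow's own task instance class, and the retry logic is deliberately simplified. The state strings use underscores (for example `up_for_retry`) instead of the display names quoted above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class TaskInstanceSketch:
    """Hypothetical stand-in for one execution of a task (not Airflow's class)."""
    task_id: str
    retries: int = 1                 # extra attempts allowed after a failure
    state: Optional[str] = None      # no state until the task starts
    history: List[str] = field(default_factory=list)

    def _set_state(self, state: str) -> None:
        self.state = state
        self.history.append(state)

    def run(self, work: Callable[[], None]) -> str:
        attempts_left = self.retries + 1
        while attempts_left > 0:
            self._set_state("running")
            try:
                work()
            except Exception:
                attempts_left -= 1
                # Retry if attempts remain, otherwise give up.
                self._set_state("up_for_retry" if attempts_left > 0 else "failed")
            else:
                self._set_state("success")
                break
        return self.state


# Hypothetical piece of work that fails once and then succeeds.
attempts = {"count": 0}


def flaky_work() -> None:
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("transient error")


ti = TaskInstanceSketch(task_id="load", retries=1)
final_state = ti.run(flaky_work)
print(final_state, ti.history)
```

In this run the history is running, up_for_retry, running, success. With work that always fails, the same instance would end in the failed state once its retries are used up.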
DAG Run
When a DAG is triggered in Airflow, a DAG Run object is created. A DAG Run is an instance of an executing DAG. It contains the timestamp at which the DAG was instantiated and the state (running, success, failed) of the DAG. DAG Runs can be created by an external trigger or at scheduled intervals by the scheduler.
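As a rough illustration, a DAG Run can be thought of as a small record holding the DAG's id, how the run was created, a timestamp and a state. The sketch below is plain Python with hypothetical names (`DagRunSketch`, `scheduled_runs`, `finish`); the rule used to decide the final state is a simplification, not Airflow's actual logic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class DagRunSketch:
    """Hypothetical record for one run of a DAG (not Airflow's class)."""
    dag_id: str
    run_type: str            # "scheduled" or "manual" (external trigger)
    timestamp: datetime      # when this run was instantiated
    state: str = "running"


def scheduled_runs(dag_id: str, start: datetime,
                   interval: timedelta, count: int) -> List[DagRunSketch]:
    """Create `count` runs spaced `interval` apart, as a scheduler would."""
    return [DagRunSketch(dag_id, "scheduled", start + i * interval)
            for i in range(count)]


def finish(run: DagRunSketch, task_states: List[str]) -> DagRunSketch:
    """Simplified rule: the run fails if any of its tasks failed."""
    run.state = "failed" if "failed" in task_states else "success"
    return run


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
runs = scheduled_runs("daily_report", start, timedelta(days=1), 3)
manual = DagRunSketch("daily_report", "manual", datetime.now(timezone.utc))

finish(runs[0], ["success", "success"])
finish(runs[1], ["success", "failed"])
```

Here the first scheduled run ends in success, the second in failed, and the third, like the manually triggered run, is still running.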
In this blog, we learned some of the basic concepts of Apache Airflow. The upcoming blogs in this series will cover them in more detail.