What is Apache Airflow?
Apache Airflow is a free, open-source platform for managing complex workflows and data processing pipelines. It lets us automate, schedule, and monitor workflows as recurring jobs: we configure and schedule our processes according to our needs, and Airflow simplifies and streamlines running them.
Why do we need Apache Airflow?
Let us assume a use case where we want to trigger a data pipeline every day at a given time. The pipeline might include the following steps: downloading the data, processing it, and finally storing it.
To carry out these tasks, the pipeline might rely on external APIs and databases. We therefore have to make sure these external APIs and databases are available whenever the pipeline runs, so that it can succeed.
But what happens if the database is down, or the API we use to fetch the data is unreachable? The pipeline fails. This problem multiplies when there are not one but hundreds of data pipelines operating simultaneously. This is exactly what Apache Airflow addresses: with Airflow we can manage our data pipelines and execute our tasks reliably, monitoring them and retrying them automatically.
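The download/process/store pipeline and its automatic retries can be sketched in plain Python (this is a conceptual toy, not Airflow's API; the step functions are hypothetical stand-ins for real API and database calls):

```python
import time

def run_with_retries(task, retries=3, delay=0.0):
    """Run a callable, retrying on failure, similar in spirit to Airflow's task retries."""
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # all retries exhausted; let the failure surface
            time.sleep(delay)  # back off before the next attempt

# Hypothetical pipeline steps; a real pipeline would call external APIs and databases.
def download_data():
    return [3, 1, 2]

def process_data(data):
    return sorted(data)

def store_data(data):
    return {"stored": data}

raw = run_with_retries(download_data)
processed = run_with_retries(lambda: process_data(raw))
result = run_with_retries(lambda: store_data(processed))
print(result)  # {'stored': [1, 2, 3]}
```

Airflow generalizes this idea: each step becomes a task with its own retry policy, and the scheduler handles the triggering and monitoring for hundreds of such pipelines.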
Core components of Airflow
At its core, Airflow is a queueing system built on a metadata database. The scheduler uses the state of queued tasks stored in the database to decide which tasks to add to the queue next and in what order. The main components of Apache Airflow are:
The web server is in charge of serving the user interface. It lets users track job status and read logs from remote file storage.
The scheduler handles scheduling the jobs: it decides which tasks to execute, when and where to execute them, and in what priority order.
The metastore is a database holding all the metadata for Airflow and our data. It governs how the other components interact with each other and stores the state of every task.
The executor is a process tightly coupled to the scheduler; it determines which worker process will actually run each task.
The worker is the process in which tasks are actually executed.
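The interplay between these components can be modeled in a few lines of plain Python (a toy model, not Airflow's actual implementation): a "scheduler" enqueues ready tasks, a "worker" thread executes them, and a "metastore" dict records each task's state.

```python
import queue
import threading

task_queue = queue.Queue()  # the queue the "executor" hands tasks through
metastore = {}              # task_id -> state, standing in for the metadata DB

def worker():
    """Pull tasks off the queue and execute them, recording state transitions."""
    while True:
        task_id, fn = task_queue.get()
        if task_id is None:  # sentinel value: shut the worker down
            break
        metastore[task_id] = "running"
        fn()
        metastore[task_id] = "success"
        task_queue.task_done()

t = threading.Thread(target=worker)
t.start()

# The "scheduler" decides what to run and pushes it onto the queue.
for task_id in ["download", "process", "store"]:
    metastore[task_id] = "queued"
    task_queue.put((task_id, lambda: None))  # lambda stands in for real work

task_queue.put((None, None))  # stop the worker
t.join()
print(metastore)  # every task ends in the "success" state
```

Real Airflow deployments follow the same shape, with the queue and state store replaced by production-grade backends (e.g. a message broker and a relational database).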
Basic Apache Airflow concepts
A single task in a workflow is described by an operator. Operators are typically (but not always) atomic, which means they can stand alone and do not need to share resources with other operators.
DAG (Directed acyclic graph)
A DAG is a collection of small tasks that join together to accomplish a larger piece of work. It describes how to run a workflow: all the tasks we want to run, organized in a way that defines their relationships and dependencies.
This is what a standard DAG looks like. It has four tasks, A, B, C, and D, and it defines the order in which each task executes and what its dependencies are.
A task is the basic unit of execution. Each task may have upstream or downstream dependencies defined; the key to working with tasks is defining how they relate to one another.
An operator is a template for a predefined task that we can declare inside a DAG.
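The "template" idea can be illustrated with a small class modeled loosely on Airflow's operator concept (this is not the real BaseOperator; the class name and fields here are hypothetical):

```python
class PrintOperator:
    """A toy operator: a reusable template that, once instantiated
    with a task_id inside a DAG, becomes a concrete task."""

    def __init__(self, task_id, message):
        self.task_id = task_id
        self.message = message

    def execute(self):
        # A real operator's execute() does the actual work:
        # run a script, query a database, call an API, and so on.
        return f"[{self.task_id}] {self.message}"

# The same template declared twice yields two distinct tasks.
hello = PrintOperator(task_id="hello", message="hello")
bye = PrintOperator(task_id="bye", message="goodbye")
print(hello.execute())  # [hello] hello
```

This is why the operator/task distinction matters: the operator is the reusable recipe, while each configured instance in a DAG is a task.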
Sensors are special operators that wait for an event to occur before letting the workflow proceed. Sensor modes include poke (the default), reschedule, and smart sensors.
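The default poke mode boils down to polling a condition at a fixed interval until it becomes true or a timeout expires. A minimal sketch in plain Python (not Airflow's Sensor API; the "file arrival" condition below is a made-up example):

```python
import time

def poke_sensor(condition, poke_interval=0.01, timeout=1.0):
    """Poll `condition` every `poke_interval` seconds until it returns True
    or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True  # the event occurred; downstream tasks may now run
        time.sleep(poke_interval)
    return False  # timed out waiting for the event

# Hypothetical event: a condition that becomes true after a few polls,
# standing in for e.g. "has the input file landed yet?"
polls = {"count": 0}
def file_has_arrived():
    polls["count"] += 1
    return polls["count"] >= 3

print(poke_sensor(file_has_arrived))  # True
```

Reschedule mode differs in that the sensor frees its worker slot between checks instead of sleeping in place, which is gentler on a busy cluster.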
Benefits of using Apache Airflow
Airflow pipelines are defined as code, which makes them dynamic: a pipeline can be generated and parameterized programmatically.
Another advantage of working with Airflow is extensibility: it is simple to define our own operators and executors, letting the library fit the level of abstraction a specific environment requires.
Apache Airflow is highly scalable, and we can execute as many tasks in parallel as we need.
With the help of the UI, we can monitor our data pipelines and retry failed tasks as needed.