What is Apache Airflow?
- Airflow is a platform to programmatically author, schedule, and monitor workflows. Workflows are defined as Directed Acyclic Graphs (DAGs) of tasks. It is open source and was started at Airbnb in 2014. At the time of writing it was still in the Apache Incubator (it has since graduated to a top-level Apache project) and had approximately 800 contributors and 13,000 stars on GitHub.
- Apache Airflow is a workflow (data pipeline) management system developed at Airbnb. It is used by more than 200 companies, including Airbnb, Yahoo, PayPal, Intel, and Stripe.
- In Airflow, everything revolves around workflow objects implemented as directed acyclic graphs (DAGs). For example, such a workflow can merge multiple data sources and then run an analysis script. Airflow schedules the tasks while respecting their dependencies and orchestrates the systems involved.
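To make the DAG idea concrete, here is a minimal sketch that uses only Python's standard library (graphlib, Python 3.9+), not Airflow itself. The task names are hypothetical and simply mirror the example above: two data sources are merged, then an analysis script runs.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each key maps to the set of tasks it depends on.
workflow = {
    "load_orders": set(),
    "load_customers": set(),
    "merge_sources": {"load_orders", "load_customers"},
    "run_analysis": {"merge_sources"},
}


def execution_order(dag):
    """Return one valid run order; raises graphlib.CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)  # both load tasks come first, then merge_sources, then run_analysis
```

The "acyclic" part matters: if two tasks depended on each other, no valid order would exist and the sorter would raise an error instead of returning a list.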
What is a Workflow?
A workflow is a sequence of tasks that is started on a schedule or triggered by an event. Workflows are frequently used to handle big data processing pipelines.
A typical workflow diagram
- A typical workflow like this one has five phases.
- First, we download data from the source.
- Then, we send that data somewhere else to be processed.
- When processing is complete, we get the result, a report is generated, and the report is sent by email.
Working of Apache Airflow
There are four main components that make up this robust and scalable workflow scheduling platform:
- Scheduler: The scheduler monitors all DAGs and their associated tasks. It periodically checks which tasks are ready to run and triggers them.
- Web server: The web server is Airflow’s user interface. It shows the status of jobs and allows the user to interact with the database and read log files from remote file stores, like Google Cloud Storage, Microsoft Azure Blob Storage, etc.
- Database: The state of the DAGs and their associated tasks is saved in the database so that the scheduler retains metadata information. Airflow uses SQLAlchemy and Object Relational Mapping (ORM) to connect to the metadata database. The scheduler examines all of the DAGs and stores pertinent information, like schedule intervals, statistics from each run, and task instances.
- Executor: There are different types of executors for different use cases. Examples of executors:
SequentialExecutor: This executor can run a single task at any given time. It cannot run tasks in parallel. It’s helpful in testing or debugging situations.
LocalExecutor: This executor runs tasks in parallel as separate processes on a single machine. It’s great for running Airflow on a local machine or a single node.
CeleryExecutor: This executor is the favored way to run a distributed Airflow cluster.
KubernetesExecutor: This executor calls the Kubernetes API to make temporary pods for each of the task instances to run.
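The difference between running one task at a time and running several at once can be sketched in plain Python. This is only an illustration of the idea, not Airflow's executor code: the task names and the run_task helper are hypothetical, and a thread pool stands in for the parallel case to keep the sketch short.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_task(name):
    """Hypothetical task body: wait briefly, then report completion."""
    time.sleep(0.1)
    return f"{name} done"


tasks = ["extract_a", "extract_b", "extract_c"]


def run_sequentially(names):
    # One task at a time, in the spirit of SequentialExecutor.
    return [run_task(n) for n in names]


def run_in_parallel(names, max_workers=3):
    # Several tasks at once on one machine, in the spirit of LocalExecutor.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, names))


start = time.perf_counter()
sequential = run_sequentially(tasks)
sequential_secs = time.perf_counter() - start

start = time.perf_counter()
parallel = run_in_parallel(tasks)
parallel_secs = time.perf_counter() - start

print(f"sequential: {sequential_secs:.2f}s, parallel: {parallel_secs:.2f}s")
```

Both approaches produce the same results; the parallel run should simply finish sooner because the three waits overlap.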
So, how does Airflow work?
Airflow examines all the DAGs in the background at a regular interval. This interval is set using the processor_poll_interval config, which defaults to one second. Task instances are instantiated for tasks that need to be performed, and their status is set to SCHEDULED in the metadata database.
The scheduler queries the database, retrieves tasks in the SCHEDULED state, and distributes them to the executors. Then, the state of the task changes to QUEUED. Those queued tasks are drawn from the queue by workers, which execute them. When this happens, the task status changes to RUNNING.
When a task finishes, the worker marks it as failed or successful, and then the scheduler updates the final status in the metadata database.
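The lifecycle above can be pictured as a small state machine. The sketch below is a simplified model in plain Python, not Airflow's implementation: SCHEDULED and QUEUED come from the description above, RUNNING, SUCCESS and FAILED mirror Airflow's task state names, and the advance and run_lifecycle helpers are hypothetical. Real Airflow has additional states (retries, for example) that are left out here.

```python
from enum import Enum


class State(Enum):
    SCHEDULED = "scheduled"
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


# Allowed transitions in this simplified model of the lifecycle.
TRANSITIONS = {
    State.SCHEDULED: {State.QUEUED},
    State.QUEUED: {State.RUNNING},
    State.RUNNING: {State.SUCCESS, State.FAILED},
    State.SUCCESS: set(),
    State.FAILED: set(),
}


def advance(current, new):
    """Move to a new state, rejecting transitions the model does not allow."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {new.name}")
    return new


def run_lifecycle(task):
    """Walk one task through the states and return the names it passed through."""
    history = [State.SCHEDULED]                          # task instance created
    history.append(advance(history[-1], State.QUEUED))   # handed to the executor
    history.append(advance(history[-1], State.RUNNING))  # picked up by a worker
    try:
        task()
        final = State.SUCCESS
    except Exception:
        final = State.FAILED
    history.append(advance(history[-1], final))          # final status recorded
    return [state.name for state in history]


print(run_lifecycle(lambda: None))
```

A task that raises an exception ends in FAILED instead of SUCCESS, and trying to jump straight from SCHEDULED to RUNNING is rejected.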
Features of Apache Airflow
- Easy to Use: If you have a bit of Python knowledge, you are good to go and deploy on Airflow.
- Open Source: It is free and open-source with a lot of active users.
- Robust Integrations: It gives you ready-to-use operators so that you can work with Google Cloud Platform, Amazon Web Services, Microsoft Azure, etc.
- Use Standard Python to code: You can use Python to create simple to complex workflows with complete flexibility.
- Amazing User Interface: You can monitor and manage your workflows. It will allow you to check the status of completed and ongoing tasks.
- Dynamic: Airflow pipelines are configuration as code (Python), allowing for dynamic pipeline generation. This allows for writing code that instantiates pipelines dynamically.
- Extensible: Easily define your own operators, executors and extend the library so that it fits the level of abstraction that suits your environment.
- Elegant: Airflow pipelines are lean and explicit.
- Scalable: It has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
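To illustrate what "configuration as code" buys you, here is a small sketch of dynamic pipeline generation in plain Python. It does not use Airflow's API: the build_pipeline helper, the table names, and the task naming scheme are all hypothetical. The point is that an ordinary loop can produce as many tasks as there are inputs.

```python
from graphlib import TopologicalSorter


def build_pipeline(tables):
    """Generate an extract and a load task per table, plus one final report task."""
    dag = {}
    for table in tables:
        extract, load = f"extract_{table}", f"load_{table}"
        dag[extract] = set()       # extract tasks have no dependencies
        dag[load] = {extract}      # each load waits for its own extract
    # The report waits for every load task.
    dag["build_report"] = {f"load_{table}" for table in tables}
    return dag


pipeline = build_pipeline(["orders", "customers", "payments"])
run_order = list(TopologicalSorter(pipeline).static_order())
print(len(pipeline), "tasks generated; last to run:", run_order[-1])
```

Adding a fourth table means adding one string to the list, and the extra tasks and dependencies follow automatically.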
In this blog, we covered a basic overview of Apache Airflow. In upcoming blogs, we will explore how Apache Airflow works in more detail.
Thanks for reading this blog & stay tuned until next time.