Apache Airflow is one of the most popular open-source workflow management platforms in data engineering, used to automate tasks and orchestrate workflows. It is written in Python, which makes it both flexible and robust.
What is Apache Airflow?
Apache Airflow is a robust platform for programmatically authoring, scheduling, and monitoring workflows. It is a workflow engine that schedules and runs your complex data pipelines, ensuring that each task in a pipeline executes in the correct order and gets the resources it requires.
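To see conceptually how "executed in the correct order" falls out of a dependency graph, here is a toy sketch in plain Python (not Airflow itself) using the standard library's topological sorter; the extract/transform/load task names are hypothetical:

```python
# Toy illustration (not Airflow): deriving a valid execution order
# from task dependencies in a directed acyclic graph.
from graphlib import TopologicalSorter

# Each key is a task; the set lists the tasks it depends on.
# These task names are made up for illustration.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)  # "extract" comes first, "notify" comes last
```

A scheduler like Airflow's does considerably more (retries, resources, parallelism), but respecting dependency order is the core idea.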
Industries that use Airflow:
- Big data
- Machine learning
- Computer software
- Financial services
- IT services
- Banking, etc.
Features of Apache Airflow:
- Ease of use—you only need a little Python knowledge to get started.
- Open-source community—Airflow is free and has a large community of active users.
- Integrations—ready-to-use operators allow you to integrate Airflow with cloud platforms (Google, AWS, Azure, etc).
- Coding with standard Python—you can create flexible workflows using Python with no knowledge of additional technologies or frameworks.
- Graphical UI—monitor and manage workflows, check the status of ongoing and completed tasks.
Components of Apache Airflow:
- DAG: The Directed Acyclic Graph—a collection of all the tasks you want to run, organized to show the relationships between them.
- Scheduler: As the name suggests, this component is responsible for scheduling the execution of DAGs. It retrieves and updates the status of tasks in the database.
- Web Server: The user interface, built on Flask. It allows us to monitor the status of DAGs and trigger them.
- Metadata Database: Airflow stores the status of all tasks in a database and performs all of a workflow's read/write operations from there.
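Assuming Airflow 2.x is installed locally, these components can be started from the command line roughly as follows (the port is illustrative):

```shell
# Initialize the metadata database (SQLite by default)
airflow db init

# Start the web server (the Flask-based UI) on an illustrative port
airflow webserver --port 8080

# In a separate terminal, start the scheduler
airflow scheduler
```

In production, these processes typically run as managed services rather than in foreground terminals.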
What is DAG?
DAG stands for Directed Acyclic Graph. The vertices (nodes) and edges (the arrows linking the nodes) have an order and direction associated with them. Each node in a DAG corresponds to a task, which in turn represents some unit of data processing. The DAG is the heart of Apache Airflow: it defines the series of tasks that you want to run as part of your workflow. Its main purpose is to express the dependencies between tasks—for example, loading data only after it has been extracted. A DAG is written in Python and saved as a .py file.
A DAG defines how to execute the tasks, but doesn’t define what particular tasks do.
A DAG can be specified by instantiating an object of the DAG class.
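A minimal sketch of such an instantiation, assuming Apache Airflow 2.x is installed; the dag_id and dates are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Instantiating a DAG object; tasks created inside the with-block
# are automatically registered on this DAG.
with DAG(
    dag_id="example_dag",          # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",    # run once a day
    catchup=False,                 # don't backfill missed past runs
) as dag:
    start = EmptyOperator(task_id="start")
```

Saving this file in Airflow's dags/ folder is enough for the scheduler to pick it up.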
- DAG run: A single execution of a DAG is called a DAG run.
- Tasks: Tasks are instantiations of operators; they vary in complexity.
- Operators: While DAGs define the workflow, operators define the work.
- Hooks: Hooks allow Airflow to interface with third-party systems.
- Relationships: Airflow excels at defining complex relationships between tasks.
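The concepts above fit together in a single DAG file. This sketch, assuming Airflow 2.x is installed, uses two built-in operators and the `>>` syntax to declare relationships; the task names and commands are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform():
    # Placeholder transformation step for illustration.
    print("transforming data")


with DAG(
    dag_id="etl_example",          # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,        # trigger manually
    catchup=False,
) as dag:
    # Tasks are instantiations of operators.
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load = BashOperator(task_id="load", bash_command="echo loading")

    # Relationships: extract runs first, then transform, then load.
    extract >> transform_task >> load
```

Each time this DAG is triggered, Airflow creates a DAG run and executes the three tasks in the declared order.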
Apache Airflow Pros and Cons
Pros:
- Scalable, dynamic, elegant, and extensible.
- Uses Python for creating workflows.
- Offers a useful UI.
- Easy to use if you know Python.
- Offers plenty of integrations.
Cons:
- Reliance on Python.
In this article, we covered a basic introduction to Apache Airflow and DAGs. In the upcoming article, we will discuss implementing DAGs in more detail, and will also cover defining tasks and installation from step one.