Data Engineering- Exploring Apache Airflow


Automating tasks plays a major role in today’s industries. Automation helps us achieve our goals quickly and efficiently, yet many people still fail to reap its benefits. For example, in our daily work we deal with repetitive workflows such as obtaining data, processing it, uploading it, and reporting on it. Wouldn’t it be great if this process were triggered automatically at a specific time, with all the tasks executed in order? Apache Airflow is one tool that can help with exactly this. Whether you are a Data Scientist, Data Engineer, or Software Engineer, you will likely find it useful.

What is Apache Airflow?

Apache Airflow is a platform to programmatically author, schedule, and monitor workflows. It is one of the most robust ETL (Extract, Transform, Load) workflow management tools, used by Data Engineers for orchestrating workflows or pipelines. Using Airflow you can visualize your data pipelines’ dependencies, logs, code, and progress status, and trigger tasks. Development of Airflow started at Airbnb in 2014 as an open-source project. It is now an Apache Software Foundation project, available on GitHub for community use. It is used by more than 200 companies, including Airbnb, Yahoo, PayPal, and Intel. Feel free to join the community as a developer: report or fix a bug, add new features, or improve the documentation.

The image below shows an example architecture with Apache Airflow:

Installation:

Please follow the official Quick Start guides available here:

Advantages of using Airflow:

Airflow provides huge advantages over old and aging scheduling solutions like cron jobs:

  1. Handle complex relationships between jobs:
    • Anybody who has used cron will agree that handling and managing relationships between tasks is a nightmare, whereas with Airflow we can easily visualise relationships and manage tasks from easy-to-understand DAGs (Directed Acyclic Graphs).
  2. Handle all the jobs centrally with a well-defined user interface:
    • The Airflow UI makes it easy to monitor and troubleshoot your data pipelines. Here, you can list and manage all the DAGs present in your environment. By contrast, cron requires external support to log, track, and manage tasks.
  3. Error reporting and alerting.
  4. Security (protecting credentials of databases).
  5. Scalability (by adding multiple worker nodes using Celery).
  6. Robust integrations: It supports ready-to-go integrations with Google Cloud Platform, Amazon Web Services (AWS), Microsoft Azure, etc.
  7. Plugins: Airflow has a simple plugin manager built-in that can integrate external features to its core by simply dropping files in your $AIRFLOW_HOME/plugins folder.
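
To make the first point concrete, here is a minimal sketch of what describing a workflow as a Directed Acyclic Graph buys you. It uses only Python's standard graphlib module, not Airflow itself, and the task names and the execution_order helper are hypothetical. The idea is that once dependencies are declared, a valid run order can be derived automatically and circular dependencies are rejected up front.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "obtain_data": set(),
    "process": {"obtain_data"},
    "upload": {"process"},
    "report": {"process"},
}


def execution_order(dependencies):
    """Return one valid run order, or raise ValueError if there is a cycle."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # The second element of args holds the nodes that form the cycle.
        raise ValueError(f"not a DAG, cycle found: {exc.args[1]}") from exc


order = execution_order(workflow)
print(order)  # "obtain_data" first, then "process", then the two leaf tasks
```

With cron, the same ordering would have to be approximated by hand-picked start times for each job.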

Airflow Key Components:

Scheduler

The Airflow scheduler monitors all tasks and DAGs, then triggers the task instances once their dependencies are complete. The scheduler spins up a subprocess, which monitors and stays in sync with all DAGs in the specified DAG directory. Once per minute, by default, the scheduler collects DAG parsing results and checks whether any active tasks can be triggered. The Airflow scheduler is designed to run as a persistent service in an Airflow production environment. To run the scheduler you need to execute the airflow scheduler command. It uses the configuration specified in airflow.cfg and runs the scheduler accordingly.

The scheduler uses the configured Executor to run tasks that are ready. Your DAGs will start executing once the scheduler is running successfully.
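
The rule "trigger a task once its dependencies are complete" can be sketched in a few lines. This is an illustration only, not the Airflow scheduler's code: the runnable_tasks function, the state dictionary, and the task names are all hypothetical.

```python
def runnable_tasks(dependencies, state):
    """Return tasks that have not started and whose upstream tasks all succeeded."""
    return sorted(
        task
        for task, upstream in dependencies.items()
        if state.get(task) is None
        and all(state.get(dep) == "success" for dep in upstream)
    )


# Hypothetical workflow: task -> list of upstream tasks.
dependencies = {
    "obtain_data": [],
    "process": ["obtain_data"],
    "upload": ["process"],
    "report": ["process"],
}

state = {}  # task -> "success" once it has finished
first = runnable_tasks(dependencies, state)
state["obtain_data"] = "success"
second = runnable_tasks(dependencies, state)
state["process"] = "success"
third = runnable_tasks(dependencies, state)

print(first)   # ['obtain_data']
print(second)  # ['process']
print(third)   # ['report', 'upload']
```

Each call plays the role of one scheduling pass: whatever it returns would be handed to the executor.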

Executor

Executors are the components responsible for running tasks. They share a common API, so you can swap executors based on your installation needs. Airflow can only have one executor configured at a time. It is set by the executor option in the [core] section of the configuration file. For example, to use a built-in executor, set the option to the executor's name:

[core]
executor = KubernetesExecutor
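
As a rough sketch of how such a setting is resolved, the snippet below parses the same fragment with Python's standard configparser. This is not Airflow's own configuration loader, and read_executor is a hypothetical helper. It also models the AIRFLOW__{SECTION}__{KEY} environment-variable convention, which Airflow uses to let an environment variable override the file.

```python
import configparser
import os

CONFIG_TEXT = """
[core]
executor = KubernetesExecutor
"""


def read_executor(config_text, environ=os.environ):
    """Return the configured executor name, preferring the environment override."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # An AIRFLOW__CORE__EXECUTOR variable takes precedence over the file value.
    override = environ.get("AIRFLOW__CORE__EXECUTOR")
    return override or parser.get("core", "executor")


from_file = read_executor(CONFIG_TEXT, environ={})
from_env = read_executor(CONFIG_TEXT, environ={"AIRFLOW__CORE__EXECUTOR": "LocalExecutor"})
print(from_file)
print(from_env)
```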

There are two types of built-in executors:

  • Local Executors: These executors run tasks locally, i.e. inside the scheduler process. Airflow has three such local executors:
    • Debug Executor: The DebugExecutor is meant as a debugging tool and can be used from an IDE. It is a single-process executor that queues TaskInstances and executes them by running the _run_raw_task method.
    • Local Executor: The LocalExecutor runs tasks by spawning processes in a controlled fashion in different modes.
    • Sequential Executor: The SequentialExecutor is the default executor when you first install Airflow. This executor will only run one task instance at a time.
  • Remote Executors: These executors run their tasks remotely using a pool of workers. Airflow has four such remote executors:
    • Celery Executor: Using this executor you can scale out the number of workers.
    • Dask Executor: This executor allows you to run Airflow tasks in a Dask Distributed cluster. Dask clusters can run on a single machine or on remote networks.
    • Kubernetes Executor: The KubernetesExecutor runs each task instance in its own pod on a Kubernetes cluster.
    • CeleryKubernetes Executor: The CeleryKubernetesExecutor allows users to run a CeleryExecutor and a KubernetesExecutor simultaneously. An executor is chosen to run a task based on the task’s queue.
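
The queue-based choice made by the CeleryKubernetes Executor can be pictured roughly as follows. This is a simplified sketch, not Airflow's implementation: pick_executor is a hypothetical function, and the queue name "kubernetes" is an assumed value that a real deployment can configure differently.

```python
# Assumed name of the queue reserved for Kubernetes-run tasks.
KUBERNETES_QUEUE = "kubernetes"


def pick_executor(task_queue, kubernetes_queue=KUBERNETES_QUEUE):
    """Route a task to Kubernetes if it sits on the reserved queue, else to Celery."""
    if task_queue == kubernetes_queue:
        return "KubernetesExecutor"
    return "CeleryExecutor"


print(pick_executor("default"))
print(pick_executor("kubernetes"))
```

The appeal of this setup is that most tasks share a warm pool of Celery workers, while a few resource-hungry or isolation-sensitive tasks each get their own pod.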

You can also write your own custom executors, and refer to them by their full path:

[core]
executor = knoldus.executors.CustomExecutor
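
Referring to a class by its full path works because a dotted string can be turned into a class at runtime. The sketch below shows the general technique with Python's standard importlib. The load_class helper is hypothetical, and a standard-library class stands in for the custom executor so that the example runs anywhere.

```python
import importlib


def load_class(dotted_path):
    """Import 'package.module.ClassName' and return the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected a full dotted path, got {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Stand-in for a path such as the custom executor configured above.
cls = load_class("concurrent.futures.ThreadPoolExecutor")
print(cls.__name__)
```

For this to work with a real custom executor, the module holding it must be importable from the environment in which Airflow runs.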

Web Server (Airflow’s Web UI)

Airflow comes with a user interface that lets you see what DAGs and their tasks are doing, trigger runs of DAGs, view logs, and do some limited debugging and resolution of problems with your DAGs. The views in the Airflow UI include the DAGs view, Tree View, and Graph View.

Metadata Database

Airflow supports a variety of databases for its metadata store. This database stores configurations, such as variables and connections, user information, roles, and policies. The Scheduler gathers all the metadata regarding DAGs, schedule intervals, statistics from each run, and tasks from this database.

Airflow uses SQLAlchemy and Object Relational Mapping (ORM) in Python to connect and interact with the underlying metadata database from the application layer.
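
To give a feel for what storing configuration such as variables in a relational database looks like, here is a deliberately small sketch. It uses Python's built-in sqlite3 with raw SQL in place of SQLAlchemy's ORM so that it runs without extra dependencies. The table and column names are hypothetical and do not reflect Airflow's actual schema.

```python
import sqlite3

# In-memory database standing in for the metadata store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variable_store (var_key TEXT PRIMARY KEY, var_value TEXT)"
)


def set_variable(conn, key, value):
    """Insert the variable, replacing any existing row with the same key."""
    conn.execute(
        "INSERT OR REPLACE INTO variable_store (var_key, var_value) VALUES (?, ?)",
        (key, value),
    )
    conn.commit()


def get_variable(conn, key, default=None):
    """Return the stored value for key, or default if it is not set."""
    row = conn.execute(
        "SELECT var_value FROM variable_store WHERE var_key = ?", (key,)
    ).fetchone()
    return row[0] if row else default


set_variable(conn, "report_recipient", "team@example.com")
set_variable(conn, "report_recipient", "data@example.com")  # overwrites
print(get_variable(conn, "report_recipient"))
print(get_variable(conn, "missing", default="n/a"))
```

An ORM layer wraps this kind of access in Python classes, which is part of what lets the same application code run against different database backends.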

Conclusion

We will explore more topics on Airflow such as writing DAGs and scaling Airflow in the upcoming blogs.

For more awesome Tech Blogs on various other technologies please visit Knoldus Blogs

Written by 

Agnibhas Chattopadhyay is a Software Consultant at Knoldus Inc. He is very passionate about Technology and Basketball. He is experienced in languages like C, C++, Java, and Python, frameworks like Spring/Springboot and Apache Kafka, and CI/CD tools like Jenkins and Docker. He loves sharing knowledge by writing easy-to-understand tech blogs.
