Apache Airflow is a platform to programmatically author, schedule, and monitor workflows. If you are new to Airflow, please go through my introductory blog. One of Airflow’s biggest strengths is its ability to scale. In this blog, we will find out how we can scale Airflow using executors.
Overview
In Airflow, even the Local Executor is quite powerful: it lets you start scaling and run multiple tasks in parallel with very little effort. You just need to configure PostgreSQL as the metadata database and set the executor in Airflow's configuration file to the Local Executor, and you can run as many tasks in parallel as your machine's resources allow. At some point, however, a single machine runs out of resources. So if you have a huge number of tasks to execute, you need a different executor that can scale out as far as your workload demands.
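As a quick reference, a minimal sketch of those two changes in airflow.cfg might look like the snippet below. The connection string is a placeholder; adjust it to your own PostgreSQL setup.

```
# airflow.cfg -- minimal changes for the Local Executor (sketch)
[core]
executor = LocalExecutor
# Point the metadata database at PostgreSQL instead of the default SQLite.
# (In newer Airflow versions this key lives under [database] rather than [core].)
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow
```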
Scaling
There are two executors in Airflow that can achieve this kind of scale: the Celery Executor and the Kubernetes Executor.
The default executor configured with Apache Airflow is the Sequential Executor. It runs tasks one after another on a single machine and uses SQLite to store task metadata. The Local Executor, in contrast, can run tasks in parallel and uses a database that supports parallelism (e.g. PostgreSQL) for the metadata. While this achieves some level of scale, a remote executor such as Celery is generally used for high scalability. The Celery Executor distributes work across multiple worker nodes, which can run on one or more machines. Another remote option is the Kubernetes Executor, which creates a new pod for every task instance and can be configured to scale up or down dynamically based on task requirements.
In this blog, we will look in detail at how we can scale using the Celery executor.
Using Celery
The Celery executor uses standing (long-running) workers to run tasks. With the Celery executor we can configure both the number of worker nodes and their size. The more worker nodes you have available in your environment, and the larger those workers are, the more scale you can achieve. With enough workers, you have more than enough capacity to run hundreds of tasks concurrently.
Prerequisites
First, you need to set up a database backend that supports parallelism to work with Celery. PostgreSQL is commonly used as the metadata database.
Next, we need to set the executor parameter in airflow.cfg to CeleryExecutor and provide the rest of the required settings. We also need a message broker: RabbitMQ can be used, but Redis is a popular alternative that is easier to deploy and maintain.
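A rough sketch of the switch itself is a one-line change in airflow.cfg; the broker and result-backend settings that go with it live in the [celery] section of the same file and are shown with illustrative values in the next section.

```
# airflow.cfg -- selecting the Celery Executor (sketch)
[core]
executor = CeleryExecutor

[celery]
# The message broker and result backend are configured here; concrete
# (illustrative) values are shown in the architecture walkthrough below.
# broker_url = ...
# result_backend = ...
```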
How does it work?
[Architecture diagram: Airflow Web Server and Scheduler (Node 1), PostgreSQL metadata database (Node 2), Redis message broker (Node 3), and Celery workers (Nodes 4 and 5)]
Now let's try to understand how this works, using the architecture shown in the image above.
First, we have Node 1, which runs the Airflow Web Server and Scheduler.
Node 2 contains the metadata database; we are using PostgreSQL here. Its purpose is to store the metadata of the tasks being executed.
Node 3 runs the message broker; here we are using Redis. Why do we need a broker like Redis at all? To understand this, let's recall some basics of Apache Airflow.
Tasks to be executed are first pushed onto a queue, and as soon as Airflow is ready to run them, they are pulled off the queue and executed. With the Celery Executor, instead of keeping the queue inside the executor, we use a third-party tool to manage it. That tool is Redis, an in-memory data store that works well as a queue system or message broker. Every time you execute a task with the Celery Executor, the task is first pushed into Redis, and a worker machine then pulls it from Redis in order to execute it.
Node 4 and Node 5 are the worker machines where your tasks will actually be executed (see the configuration sketch just after this list).
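To make the wiring concrete, here is a hedged sketch of the airflow.cfg entries that point the Scheduler on Node 1 and the workers on Nodes 4 and 5 at the other nodes. The hostnames (node2-postgres, node3-redis) and credentials are purely illustrative assumptions.

```
# airflow.cfg -- shared by the Scheduler (Node 1) and the workers (Nodes 4 and 5); sketch only
[core]
executor = CeleryExecutor
# Metadata database on Node 2 (in newer Airflow versions this key sits under [database]).
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@node2-postgres:5432/airflow

[celery]
# Redis on Node 3 acts as the queue / message broker.
broker_url = redis://node3-redis:6379/0
# Task state is written back to the metadata database on Node 2.
result_backend = db+postgresql://airflow:airflow@node2-postgres:5432/airflow
```

Every node that runs an Airflow component reads the same configuration, so the Scheduler knows where to push tasks and the workers know where to pull them from. (In Airflow 2.x, the worker processes themselves are started with the airflow celery worker command.)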
Example
Now let us walk through how a task is executed. Suppose we have a task in the Scheduler. When the task is ready, it is pushed into the queue in Redis (Node 3). Once the task is inside Redis, one of the workers, Node 4 or Node 5, pulls it out of Redis and executes it. This process repeats for every task in the queue.
Now suppose Node 4 and Node 5 can each process two tasks at once, so together they can process 4 tasks. To scale further and execute more tasks in parallel, we can simply add Node 6 as another worker node. We can also increase the capacity of each worker node by raising the value of worker_concurrency inside airflow.cfg. In our example each node handles at most two tasks, which means worker_concurrency is set to 2. If we set it to 3, each node can handle 3 tasks, so Node 4, Node 5 and Node 6 together can handle 9 tasks (3 nodes, each handling 3 tasks). Depending on the requirements, we can keep adding worker nodes or increasing concurrency to achieve high scalability.
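A sketch of that setting, together with the capacity arithmetic, looks like this:

```
# airflow.cfg -- on each Celery worker node (sketch)
[celery]
# Maximum number of task instances a single worker runs at the same time.
worker_concurrency = 3

# Rough capacity estimate:
#   total parallel tasks = number of worker nodes x worker_concurrency
#   3 worker nodes x 3   = 9 tasks running concurrently
```

Note that the parallelism setting in the [core] section also caps how many tasks can run across the whole installation, so it may need to be raised alongside worker_concurrency.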
Conclusion
This was a theoretical look at how Airflow can be scaled using the Celery executor. In upcoming blogs we will set it up hands-on and see in practice how to configure and use the Celery executor with Airflow.
Please check out the official documentation for more info on Apache Airflow.
For more interesting tech blogs on Airflow and various other technologies please visit Knoldus Blogs.