What is Apache Airflow?
Apache Airflow is a workflow engine that makes it simpler to schedule and run complex data pipelines. It ensures that each task in your data pipeline executes in the proper order and with the appropriate resources.
Airflow lets you define, execute, and monitor workflows. A workflow is any series of steps you take to accomplish a given goal. Airflow can run thousands of different tasks each day, which makes workflow management more efficient.
Advantages of using Apache Airflow
Let’s look at the advantages of using Apache Airflow:
- Open-source: Airflow is free to download and use, and you can collaborate with others in the community.
- Cloud integration: Airflow integrates well with cloud environments, which gives you a wide range of options.
- Scalable: Airflow scales both vertically and horizontally. It can run on a single server and can scale up to large deployments with many nodes.
- Flexible and customizable: Airflow works with the basic architecture of most software development environments, and its flexibility allows a wide range of customization.
- Monitoring abilities: Airflow offers several ways to monitor workflows. The user interface, for example, displays the status of your tasks.
- Code-first platform: Airflow relies on code, which lets us write whatever functionality we want to execute at each stage of the pipeline.
Key Concepts in Apache Airflow
In Airflow, workflows are defined as DAGs (directed acyclic graphs). A DAG consists of the tasks we want to execute and the dependencies between them.
For example, we can create a DAG to express the dependencies between tasks X, Y, and Z. We want to execute task Z only after task Y has executed, but task X can execute independently.
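The X, Y, Z example can be sketched in plain Python. This is only an illustration of dependency ordering using the standard library's `graphlib` module (Python 3.9+), not Airflow's own API; the `dependencies` mapping and the `execution_order` helper are names made up for this sketch. In Airflow itself, a dependency like this is typically declared on task objects, for example with `Y >> Z`.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it must wait for.
# Z depends on Y; X and Y have no upstream tasks.
dependencies = {
    "X": set(),
    "Y": set(),
    "Z": {"Y"},
}


def execution_order(deps):
    """Return one valid order in which the tasks could run."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Any order the sorter returns places Y before Z, while X may appear anywhere because nothing depends on it and it depends on nothing.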
Each execution of a DAG is called a DAG run. Let’s say we want to execute a DAG every 5 hours. Each instantiation of that DAG creates a DAG run, so a single DAG can have multiple DAG runs.
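To make the "every 5 hours" idea concrete, here is a small sketch that lists when the first few runs of such a schedule would fall. It uses only the standard library; the `scheduled_runs` helper and the start date are hypothetical, chosen for illustration. Airflow does accept a `datetime.timedelta` as a DAG's schedule, though the exact parameter name depends on the Airflow version.

```python
from datetime import datetime, timedelta


def scheduled_runs(start, interval, count):
    """Return the scheduled times of the first `count` runs."""
    return [start + i * interval for i in range(count)]


start = datetime(2024, 1, 1, 0, 0)  # hypothetical start date
runs = scheduled_runs(start, timedelta(hours=5), 4)
for run in runs:
    print(run.isoformat())
```

Each entry in `runs` corresponds to one DAG run of the same DAG.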
Tasks are instances of operators, which vary in complexity. They are the units of work in a DAG: each task represents the work done at one stage of your workflow, with the actual work defined by its operator.
Operators describe the job, while DAGs define the workflow. An operator is like a class that performs a particular operation. BaseOperator is the root of all operators.
Different types of operators:
- Operators that carry out an action or request a different system to carry out an action.
- Operators that move data from one system to another.
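The relationship between operators and tasks can be sketched with ordinary Python classes. The classes below (`ToyBaseOperator`, `PrintOperator`, `CopyOperator`) are hypothetical stand-ins written for this post; they are not Airflow's BaseOperator or any real Airflow operator, and only mirror the idea of one root class with action-style and transfer-style subclasses.

```python
class ToyBaseOperator:
    """Hypothetical stand-in for a root operator class (not Airflow's)."""

    def __init__(self, task_id):
        self.task_id = task_id

    def execute(self):
        raise NotImplementedError("subclasses define the actual work")


class PrintOperator(ToyBaseOperator):
    """Action-style operator: carries out a piece of work itself."""

    def __init__(self, task_id, message):
        super().__init__(task_id)
        self.message = message

    def execute(self):
        return f"{self.task_id}: {self.message}"


class CopyOperator(ToyBaseOperator):
    """Transfer-style operator: moves data from a source to a target."""

    def __init__(self, task_id, source, target):
        super().__init__(task_id)
        self.source = source
        self.target = target

    def execute(self):
        self.target.extend(self.source)
        return len(self.source)


# Tasks are instances of operators.
greet = PrintOperator("greet", "hello")
warehouse = []
load = CopyOperator("load", source=[1, 2, 3], target=warehouse)

print(greet.execute())
print(load.execute(), warehouse)
```

The operator classes say what kind of work is possible; the instances `greet` and `load` are the tasks that would sit inside a DAG.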
Hooks allow Airflow to interact with third-party systems, such as databases like MySQL and other external APIs. Sensitive information such as credentials is not stored in hooks, since that would be unsafe. Instead, it is kept in Airflow's encrypted metadata database.
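A rough way to picture this separation: the hook holds only an identifier and looks up the connection details when it needs them. The sketch below is a toy model under that assumption. `ToyHook`, the `CONNECTIONS` dictionary, and the connection values are all invented for illustration, and a plain dictionary stands in for the encrypted metadata database.

```python
# Hypothetical stand-in for connection records kept in the metadata database.
CONNECTIONS = {
    "my_mysql": {"host": "db.example.com", "port": 3306, "login": "etl_user"},
}


class ToyHook:
    """Holds only a connection id and resolves the details on demand."""

    def __init__(self, conn_id):
        self.conn_id = conn_id

    def get_connection(self):
        # Look the details up in the store instead of keeping them on the hook.
        return CONNECTIONS[self.conn_id]

    def describe(self):
        conn = self.get_connection()
        return f"{conn['login']}@{conn['host']}:{conn['port']}"


hook = ToyHook("my_mysql")
print(hook.describe())
```

Because the hook object carries nothing but `conn_id`, pipeline code can refer to a system by name without embedding its connection details.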
How does Airflow work?
There are four main components of Airflow:
- Scheduler: The scheduler keeps track of all DAGs and their tasks. It regularly checks for tasks that are ready to run and starts a task once its dependencies are met.
- Web server: The web server is Airflow's user interface. It displays the status of jobs and lets the user interact with the metadata database and read log files from remote file stores.
- Metastore: The metastore is a database that holds all the metadata related to Airflow and our data, including the state of each task. The other components interact with each other through it.
- Executor: The executor is a process that is tightly coupled to the scheduler and determines which worker process will actually execute each task.
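How the scheduler, metastore, and executor cooperate can be sketched as a small loop. This is a simplified, single-process toy and not how Airflow is implemented: the `run_workflow` function and its arguments are hypothetical, a dictionary plays the role of the metastore, and the "executor" simply calls each task's function directly.

```python
def run_workflow(dependencies, actions):
    """Run every task once all of its upstream tasks have succeeded."""
    metastore = {task: "queued" for task in dependencies}  # task state store
    finished_order = []

    def executor(task):
        # The executor decides how a task runs; here it runs in-process.
        actions[task]()
        metastore[task] = "success"
        finished_order.append(task)

    # Scheduler loop: repeatedly look for tasks whose dependencies are met.
    while any(state == "queued" for state in metastore.values()):
        ready = [
            task
            for task, upstream in dependencies.items()
            if metastore[task] == "queued"
            and all(metastore[u] == "success" for u in upstream)
        ]
        if not ready:
            raise ValueError("dependency cycle detected")
        for task in ready:
            executor(task)
    return metastore, finished_order


dependencies = {"X": [], "Y": [], "Z": ["Y"]}
actions = {name: (lambda name=name: print("running", name)) for name in dependencies}
state, order = run_workflow(dependencies, actions)
print(state)
```

On the first pass X and Y are ready, and Z only becomes ready on the second pass, after the state store records Y as successful.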
Installation of Apache Airflow
Installing Apache Airflow first requires pip, which is provided by the “python3-pip” package. To install pip, execute the command given below:
sudo apt-get install python3-pip
Installing Apache Airflow using pip:
pip3 install apache-airflow
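One way to confirm that pip actually installed the package is to ask Python for its version. This is a minimal sketch using the standard library's `importlib.metadata` (Python 3.8+); the `installed_version` helper is a name introduced here for illustration.

```python
from importlib import metadata


def installed_version(package="apache-airflow"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print("apache-airflow version:", version or "not installed")
```

If this prints "not installed", the pip3 command above most likely ran against a different Python environment than the one running this check.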
Apache Airflow requires a database to constantly run in the background:
To start the Apache Airflow web server:
airflow webserver -p 8080
Now we have to start the Apache Airflow scheduler:
To access the Airflow dashboard, open a web browser and go to localhost on the port passed to the web server above (8080).
If you see the Airflow dashboard, as in the image above, when accessing localhost, Airflow has been successfully installed on your system.
Stay tuned for more blogs on Apache Airflow at https://blog.knoldus.com/