Apache Airflow Operators and Tasks

Context:

What is Airflow?

  • Airflow is a free, open-source tool developed by Apache that is used to manage workflows
  • It is one of the most popular workflow management systems available, with great community support.

What is a DAG ?

  • DAG stands for Directed Acyclic Graph.
  • Directed means the flow is one-directional.
  • Acyclic means the flow can never loop back to the first node (vertex) / starting node, so cycles (and bi-directional flow) are ruled out, which makes it a bit like simplex communication.
  • Graph is a non-linear data structure that has two components (see the small sketch after this list):
    • Node (Vertex): a node represents an object, usually drawn as a bold point.
    • Edge: an edge connects one node to another; if there is no arrow, the flow is bi-directional, whereas an arrow means the flow goes in one direction only.
  • The node the arrow points to is said to depend on the node the arrow points from.
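
The graph idea can be sketched in plain Python with an adjacency list. This is only an illustration of the data structure, not Airflow code, and the node names are made up:

# each key is a node; its list holds the nodes its outgoing arrows point to
dag_edges = {
    'a': ['b', 'c'],  # a -> b, a -> c  (b and c depend on a)
    'b': ['d'],       # b -> d
    'c': ['d'],       # c -> d  (d depends on both b and c)
    'd': [],          # d has no outgoing edges, so the flow never loops back
}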

Operators:

What are operators and why do we need them?

  • In practice, a DAG only represents the workflow; it does not perform any computation itself. To carry out actual operations we need operators. An operator acts as a kind of placeholder that lets us define what operation we want to perform.
  • There are different kinds of operators in Airflow depending on your requirements and the programming language you are using; BashOperator and PythonOperator are the most commonly used. You can browse the official Airflow operators directory on GitHub to see all the available types. Even if the operator you need is not there yet, it is very likely to appear soon under airflow/contrib, thanks to the great community support and all the awesome developers out there.
  • BaseOperator is the parent class of all operators, i.e. every operator inherits the properties of the BaseOperator class.
  • Example: if you want to run a Python function, you will need the PythonOperator to execute it.

Classification of Operators:

  • Sensors: a sensor is a type of operator that stays in a running state until the condition it is waiting for is met. e.g. HdfsSensor (the Hadoop file system sensor keeps running until it detects a given file or folder in HDFS); a short sensor sketch follows the example below.
  • Operators: behave like functions (they simply perform a certain task). e.g. PythonOperator (helps in executing a function written in Python).
  • Transfers: transfer data from one place to another. e.g. MySqlToHiveTransfer (as the name suggests, it transfers data from MySQL to Hive).
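
Putting this together, below is a small end-to-end example using the PythonOperator. The imports and a minimal default_args dictionary (with an assumed start_date) are included so the snippet is self-contained: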
# required imports (Airflow 1.x style)
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# minimal default arguments, assumed for this demo
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2021, 1, 1),
}

# sample python function
def print_function(x):
    return x + " is best tool for workflow management"

# defining the DAG
dag = DAG(
    'python_operator_sample',
    default_args=default_args,
    description='Demo on how to use the PythonOperator',
    schedule_interval=timedelta(days=1),
)

# creating the task
t1 = PythonOperator(
    task_id='print',
    python_callable=print_function,
    op_kwargs={"x": "Apache Airflow"},
    dag=dag,
)
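
As a companion to the Sensors classification above, here is a hedged sketch using FileSensor from Airflow 1.x's contrib package (the task id, file path, and poke interval are made-up values, and the dag defined above is reused):

# a sensor keeps "poking" until its condition is met
from airflow.contrib.sensors.file_sensor import FileSensor

wait_for_file = FileSensor(
    task_id='wait_for_input_file',   # hypothetical task id
    filepath='/tmp/input/data.csv',  # hypothetical file to wait for
    poke_interval=60,                # re-check every 60 seconds
    dag=dag,
)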

Tasks:

What is a task?

  • An operator, when instantiated (assigned to a particular name), is called a task in Airflow. One task per operator is the standard way to write tasks. You must define a task_id and the dag the task belongs to when defining a task. For example:
# creating the task
t1 = PythonOperator(
    task_id='print',
    python_callable=print_function,
    op_kwargs={"x": "Apache Airflow"},
    dag=dag,
)

What are Dagruns?

  • A DagRun is an instance of a DAG executing at a particular point in time; DagRuns are created on a schedule, such as daily, hourly, or every minute.
  • Each DagRun also carries an execution_date: the first one falls on the start_date defined in the DAG, and subsequent ones follow at the interval defined in schedule_interval. A rough illustration follows this list.
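
As an illustration (the DAG id and dates here are made up): with a start_date of 2021-01-01 and a daily schedule_interval, Airflow creates one DagRun per day, with execution_dates 2021-01-01, 2021-01-02, and so on:

from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    'dagrun_demo',                       # hypothetical DAG id
    start_date=datetime(2021, 1, 1),     # execution_date of the first DagRun
    schedule_interval=timedelta(days=1), # a new DagRun every day after that
)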

What are TaskInstances?

  • A task that runs as part of a DagRun is called a TaskInstance.
  • TaskInstances and DagRuns must always have a state associated with them, e.g. "running", "failed", "queued", "skipped", etc.

What is the role of Task Dependencies?

  • After setting up DAGs and tasks we can proceed to set dependencies, i.e. the order in which we want the tasks to execute (based on the DAG we have defined).
  • Let's say we want task t2 to execute after task t1; then we set a dependency of t2 on t1 (t2 will run only after t1 has completed its work). There are two ways of setting a dependency, shown in the sketch after this list.
    • Using set_upstream / set_downstream. e.g. t1.set_downstream(t2)
    • Using bitshift operators (>> or <<). e.g. t1 >> t2
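
Here is a minimal sketch of both styles; t1, t2, and t3 are hypothetical BashOperator tasks, and the dag defined earlier is reused:

from airflow.operators.bash_operator import BashOperator

t1 = BashOperator(task_id='t1', bash_command='echo one', dag=dag)
t2 = BashOperator(task_id='t2', bash_command='echo two', dag=dag)
t3 = BashOperator(task_id='t3', bash_command='echo three', dag=dag)

t1.set_downstream(t2)  # method 1: t2 runs only after t1 finishes
t2 >> t3               # method 2: equivalent bitshift form; t3 runs after t2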

Conclusion:

  • Airflow is a free, open-source workflow management tool from Apache and one of the best tools available for the job.
  • Curious to learn more about this awesome tool? Please visit the official documentation.

For more Airflow and other tech blogs, please visit Knoldus Blogs.
