Apache Airflow: Environment Variables Best Practices

A bit about Airflow Variables (Context):

What is Airflow?

Apache Airflow is a workflow management tool. It models each workflow as a DAG (Directed Acyclic Graph) of tasks.

What are Airflow variables?

Variables are key-value pairs: the key is the variable name, and the value is the content assigned to that variable.

Where are Airflow variables stored?

Variables are stored inside the Airflow metadata database.

What is the use of Airflow variables?

Airflow variables are typically used to store and fetch settings from the metadata database. These can include, but are not limited to, configurations, tables, and other static data such as IDs. You can also use variables to keep constants and settings out of the pipeline code. It is recommended that you be able to view and change variables or configuration through the user interface (UI).

Best Practices when using Airflow variables:

  • Avoid reading variables outside an Operator's execute() method or Jinja templates. Each read opens a connection to the Airflow metadata database to fetch the value, so reads placed elsewhere (for example, at the top level of a DAG file) slow down DAG parsing and put extra load on the database.
  • When Airflow parses DAGs (in the background, at a fixed interval), any such read opens a new connection to the metadata database every time a DAG file is parsed. You can adjust the frequency with the “processor_poll_interval” configuration setting (default: 1 second), but in some cases this can still lead to many open connections.
  • Whenever possible, use a Jinja template to read variables. This ensures the value is read only when the task executes.
    Sample code (plain value): {{ var.value.<var_name> }}
    Sample code (when you need to deserialize a JSON object from the variable): {{ var.json.<var_name> }}
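
The first bullet can also be followed in plain Python by moving the read into the function a task runs. The sketch below is a minimal illustration, not a definitive pattern: the variable name batch_size and the helpers _airflow_get and process_batch are hypothetical, and only Variable.get from airflow.models is an actual Airflow call. The lookup is passed in as a parameter so the function can be exercised without a running Airflow installation.

```python
def _airflow_get(key, **kwargs):
    """Read one Airflow variable; Airflow is imported only when this is called."""
    from airflow.models import Variable
    return Variable.get(key, **kwargs)


# Anti-pattern (left commented out): a module-level read like this would run
# every time the scheduler parses the DAG file.
# batch_size = _airflow_get("batch_size")


def process_batch(get_variable=_airflow_get):
    """Task callable: the variable is read only when the task executes."""
    # "batch_size" is a hypothetical variable name used for illustration.
    # Variable values come back as strings by default, hence the int() cast.
    batch_size = int(get_variable("batch_size"))
    return list(range(batch_size))
```

A function like process_batch could be handed to a PythonOperator as its python_callable, so the database is contacted once per task run rather than once per parse.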

Best practices for working with Airflow variables:

Airflow variables in the UI
  • You can list, create, update, and delete variables in the UI in “Variables” under “Admin”.
  • You can also upload a JSON file through the UI to create variables in bulk. This lets you declare multiple variables (including nested key-value pairs) at once, and it is the recommended way to declare variables.
  • Limit the number of Airflow variables your DAG uses. Variables are stored in the metadata database, so each read opens a database connection; with many variables, the allowed number of database connections can be exhausted. To avoid this, store all of a DAG's configuration in a single Airflow variable with a JSON value.
  • For example, assume three variables: var1 = "val1", var2 = [1, 2], and var3 = {"c": 3}. Reading them separately with var1 = Variable.get("var1"), var2 = Variable.get("var2"), and var3 = Variable.get("var3") makes three database connections. Instead, store them in one JSON variable (call it vars_json) and read it once with config = Variable.get("vars_json", deserialize_json=True); then var1 = config["var1"], var2 = config["var2"], and var3 = config["var3"]. This reduces the number of database connections from three to one.
  • You can also use Airflow variables directly in a Jinja template with the syntax {{ var.value.<variable_name> }}. For example, say you have a task “task2” such that:

task2 = BashOperator(
    task_id="fetch_var_val",
    # Rendered when the task runs, so parsing the DAG file does not query the database.
    bash_command="echo {{ var.value.var2 }}",
    dag=dag,
)

Fetching values from a JSON variable using a Jinja template:

task3 = BashOperator(
    task_id="get_var_from_json",
    # var.json deserializes the vars_json variable before the key lookup.
    bash_command="echo {{ var.json.vars_json.var2 }}",
    dag=dag,
)
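
To make the "three connections versus one" argument for a single JSON variable concrete, here is a small sketch that counts lookups. CountingVariableStore is a hypothetical stand-in for the metadata database, not part of Airflow; its get() method only mirrors the shape of Variable.get(key, deserialize_json=...) so the example can run without Airflow installed.

```python
import json


class CountingVariableStore:
    """Hypothetical stand-in for the metadata database; NOT an Airflow class."""

    def __init__(self, rows):
        self._rows = rows   # values are kept as strings, as stored variables are
        self.lookups = 0

    def get(self, key, deserialize_json=False):
        self.lookups += 1   # each call models one database round trip
        raw = self._rows[key]
        return json.loads(raw) if deserialize_json else raw


# Three separate variables: one lookup each.
separate = CountingVariableStore({
    "var1": "val1",
    "var2": "[1, 2]",
    "var3": '{"c": 3}',
})
var1 = separate.get("var1")
var2 = separate.get("var2", deserialize_json=True)
var3 = separate.get("var3", deserialize_json=True)

# One JSON variable holding the same settings: a single lookup.
combined = CountingVariableStore({
    "vars_json": json.dumps({"var1": "val1", "var2": [1, 2], "var3": {"c": 3}}),
})
config = combined.get("vars_json", deserialize_json=True)

print(separate.lookups, combined.lookups)  # prints: 3 1
```

The values read either way are identical; only the number of round trips changes.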

  • You can also access variables through the Airflow command-line interface (CLI). The commands below use the Airflow 1.x flag syntax, run through docker-compose (Airflow 2 uses subcommands instead, for example airflow variables get var1):
    • To get the value of var1:
      docker-compose run --rm webserver airflow variables --get var1
    • To set the value of another variable, say “var4”:
      docker-compose run --rm webserver airflow variables --set var4 value
    • To import a JSON file containing your variables:
      docker-compose run --rm webserver airflow variables --import /usr/local/airflow/dags/config/vars_json.json
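
The file passed to the import command above can be generated with the standard library. This is a minimal sketch that assumes the import expects a top-level JSON object whose keys are variable names; the helper write_variables_file and the temporary output directory are illustrative choices, not part of Airflow.

```python
import json
import tempfile
from pathlib import Path


def write_variables_file(path, variables):
    """Write a {variable_name: value} mapping as a JSON file for bulk import."""
    path = Path(path)
    path.write_text(json.dumps(variables, indent=2, sort_keys=True))
    return path


# One variable, vars_json, holding the whole DAG configuration from the example.
variables = {
    "vars_json": {"var1": "val1", "var2": [1, 2], "var3": {"c": 3}},
}

out_dir = Path(tempfile.mkdtemp())  # stand-in for your dags/config directory
out_file = write_variables_file(out_dir / "vars_json.json", variables)

# Read the file back to confirm it round-trips before importing it.
loaded = json.loads(out_file.read_text())
```

Keeping this file under version control gives you a reviewable history of configuration changes alongside the DAG code.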

Conclusion:

  • Airflow is a free, open-source workflow management tool from Apache and, in our view, one of the best available.
  • Store Airflow variables in JSON config files; this reduces the number of database calls, which speeds up processing and eases load on the database.
  • Where possible, read variables through Jinja templates.
  • Curious to learn more about this tool? Please visit the official documentation.