Automation plays a key role in every industry and is one of the quickest ways to improve operational efficiency. Yet many of us don't know how to automate certain tasks and end up stuck in a loop of doing the same things manually, again and again.
Most of us deal with workflows like collecting data from multiple databases, preprocessing it, uploading it, and reporting on it. It would be great if our daily tasks simply triggered automatically at a defined time and all the steps ran in order. Apache Airflow is one such tool that can help with exactly this. Whether you are a Data Scientist, Data Engineer, or Software Engineer, you will find it useful.
In this article, we will discuss Apache Airflow, how to install it, and how to create a sample workflow and code it in Python.
Apache Airflow is a workflow engine that schedules and runs your complex data pipelines. It makes sure that each task of your data pipeline is executed in the correct order and that each task gets the resources it needs.
It also provides an amazing user interface to monitor your workflows and fix any issues that may arise.
Let’s start with the installation of Apache Airflow. If you already have pip installed on your system, you can skip the first command. To install pip, run the following command in the terminal.
sudo apt-get install python3-pip
Next, Airflow needs a home on your local system. The default location is ~/airflow, but you can change it as per your requirement.
export AIRFLOW_HOME=~/airflow
Now, install Apache Airflow using pip with the following command.
pip3 install apache-airflow
Airflow requires a database backend to run and maintain your workflows. To initialize the database, run the following command.
airflow initdb
We have already mentioned that Airflow has an amazing user interface. To start the webserver, run the following command in the terminal. The default port is 8080; if you are already using that port for something else, you can change it.
airflow webserver -p 8080
Now, start the Airflow scheduler using the following command in a different terminal. It runs all the time, monitors all your workflows, and triggers them as you have scheduled.
airflow scheduler
Now, create a folder named dags inside the Airflow home directory; this is where you will define your workflows, or DAGs. Then open http://localhost:8080/admin/ in your web browser and you will see something like this:
Now that you have installed the Airflow, let’s have a quick overview of some of the components of the user interface.
It is the default view of the user interface. It lists all the DAGs present in your system and gives you a summarized view of each: how many times a particular DAG ran successfully, how many times it failed, the last execution time, and some other useful links.
In the graph view, you can visualize each and every step of your workflow with their dependencies and their current status. You can check the current status with different color codes like:
The tree view also represents the DAG. If your pipeline takes longer to execute than expected, you can check which part is slow and then work on it.
In this view, you can compare the duration of your tasks run at different time intervals. You can optimize your algorithms and compare your performance here.
In this view, you can quickly view the code that was used to generate the DAG.
Let’s start and define our first DAG.
In this section, we will create a workflow whose first step is to print “Getting Live Cricket Scores” on the terminal, and which then uses an API to print the live scores on the terminal. Let’s test the API first; for that, you need to install the cricket-cli library using the following command.
sudo pip3 install cricket-cli
Now, run the following command and get the scores.
cricket scores
It might take a few seconds, depending on your internet connection, and will return an output something like this:
Now, we will create the same workflow using Apache Airflow. The code to define a DAG is written entirely in Python. Let’s start by importing the libraries that we need. We will use only the BashOperator, since our workflow only needs to run Bash commands.
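As a reference, the snippets below sketch what such a DAG file could look like; assume it is saved inside the dags folder we created earlier (the file name, IDs, and values used are only illustrative). We start with the imports, using the Airflow 1.x module paths that match the commands used earlier in this article:

from datetime import timedelta

# The DAG object and the operator we will use to run Bash commands
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago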
For each DAG, we need to pass one dictionary of default arguments. Here is a description of some of the arguments that you can pass:
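For illustration, a typical set of default arguments could look like this; the specific values below are assumptions, so adjust them to your needs:

# These arguments are applied to every task of the DAG unless overridden
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,             # do not depend on the previous run of the task
    'start_date': days_ago(2),            # the date from which the DAG is considered active
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,                         # retry a failed task once
    'retry_delay': timedelta(minutes=5),  # wait 5 minutes before retrying
}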
Now, we will create a DAG object and pass the dag_id, which is the name of the DAG and must be unique. Then pass the arguments that we defined in the last step, and add a description and a schedule_interval, which will run the DAG after the specified interval of time.
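A sketch of the DAG object, with an illustrative dag_id, description, and a daily schedule:

dag = DAG(
    'live_cricket_scores',                # dag_id - must be unique
    default_args=default_args,
    description='Print live cricket scores on the terminal',
    schedule_interval=timedelta(days=1),  # run the DAG once a day
)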
We will have 2 tasks for our workflow: printing “Getting Live Cricket Scores” on the terminal, and then fetching the live scores with the cricket-cli command we tested above.
Now, while defining a task, we first need to choose the right operator for it. Here both commands are terminal-based, so we will use the BashOperator.
We will pass the task_id, which is a unique identifier of the task; you will see this name on the nodes of the Graph View of your DAG. Then pass the bash command that you want to run and, finally, the DAG object to which you want to link this task.
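A sketch of the two tasks; the task_ids below ('print' and 'get_cricket_scores') are illustrative choices:

# Task 1: print the message on the terminal
print_message = BashOperator(
    task_id='print',
    bash_command='echo "Getting Live Cricket Scores"',
    dag=dag,
)

# Task 2: fetch the live scores using the cricket-cli library
get_scores = BashOperator(
    task_id='get_cricket_scores',
    bash_command='cricket scores',
    dag=dag,
)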
Finally, create the pipeline by adding the “>>” operator between the tasks.
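Continuing the sketch above, the second task runs only after the first one succeeds:

# print_message runs first, then get_scores
print_message >> get_scores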
Now, refresh the user interface and you will see your DAG in the list. Turn on the toggle to the left of the DAG and then trigger it.
Click on the DAG and open the Graph View, and you will see something like this. Each step in the workflow will be in a separate box, and its border turns dark green once it completes successfully.
Click on the node “get_cricket_scores” to get more details about this step. You will see something like this.
Now, click on View Log to see the output of your code.
That’s it. You have successfully created your first DAG in Apache Airflow.
I recommend you go through the following data engineering resources to enhance your knowledge-
If you have any questions related to this article do let me know in the comments section below.