
Apache Airflow is an open-source workflow scheduling platform, widely used in the data engineering field. Find out everything you need to know about this Data Engineer tool: how it works, its use cases, its main components…

The story of Apache Airflow begins in 2015, in the offices of Airbnb. At that time, the vacation rental platform, founded in 2008, was experiencing meteoric growth and was overwhelmed by an increasingly massive volume of data. The Californian company was hiring Data Scientists, Data Analysts, and Data Engineers in droves, who had to automate numerous processes by writing scheduled batch jobs. To help them, data engineer Maxime Beauchemin created an open-source tool called Airflow. This scheduling tool aims to let teams create, monitor, and iterate on batch data pipelines.

In a few years, Airflow became a standard in the data engineering field. In April 2016, the project joined the official Apache Foundation incubator, continued its development, and received the status of a "top-level" project in January 2019. By December 2020, almost two years later, Airflow had more than 1,400 contributors, 11,230 contributions, and 19,800 stars on GitHub. Airflow 2.0, available since December 17, 2020, brings new features and many improvements. The tool is used by thousands of Data Engineers around the world.

Airflow pipelines are represented as DAGs: Directed Acyclic Graphs. A graph is a structure composed of objects (nodes) in which certain pairs of objects are related. These graphs are "Directed", which means that the edges of the graph are oriented and therefore represent unidirectional links. They are "Acyclic" because the graphs contain no circuit: a node B downstream of a node A cannot also be upstream of node A. This ensures that pipelines do not have infinite loops.

Tasks

Each node in a DAG represents a task. The DAG as a whole is a representation of the sequence of tasks to be performed, which constitutes a pipeline.
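
To make these notions concrete, here is a minimal sketch of a DAG using the Airflow 2.x Python API; the dag_id, schedule, and task names are illustrative assumptions:

```python
# A minimal sketch of a DAG, assuming the Airflow 2.x Python API.
# The dag_id, schedule, and task names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extract'")
    transform = BashOperator(task_id="transform", bash_command="echo 'transform'")
    load = BashOperator(task_id="load", bash_command="echo 'load'")

    # ">>" declares the directed edges of the graph:
    # extract -> transform -> load, with no cycle.
    extract >> transform >> load
```

The >> statements only declare the edges of the graph; the Airflow scheduler then takes care of running each task once its upstream tasks have succeeded.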

The operators

The jobs represented in a DAG are defined by operators. The operators are the building blocks of the Airflow platform: they are used to determine the work to be done. An operator can be an individual task (a node of a DAG), defining how that task will be executed. The DAG ensures that the operators are scheduled and executed in a specific order, while the operators define the jobs to be executed at each step of the process.

There are three main categories of operators. First, action operators perform a function; examples are the PythonOperator or the BashOperator. Transfer operators allow the transfer of data from a source to a destination, like the S3ToRedshiftOperator. Finally, Sensors allow waiting for a condition to be verified. For example, the FileSensor operator can be used to wait for a file to be present in a given folder before continuing the execution of the pipeline.
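
As an illustration of an action operator and a Sensor working together, here is a hedged sketch in which a FileSensor blocks a PythonOperator until a file appears; the file path, dag_id, and task names are assumptions for the example:

```python
# Hedged sketch: a FileSensor blocking a PythonOperator until a file
# exists. The path /data/input.csv and the dag_id are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor

def process_file():
    print("processing /data/input.csv")

with DAG(
    dag_id="wait_for_file",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    # Poke for the file every 30 seconds and give up after one hour.
    # FileSensor resolves the path through its fs_conn_id connection
    # (the default "fs_default" is assumed here).
    wait_for_input = FileSensor(
        task_id="wait_for_input",
        filepath="/data/input.csv",
        poke_interval=30,
        timeout=60 * 60,
    )
    process = PythonOperator(task_id="process", python_callable=process_file)

    wait_for_input >> process
```

The sensor checks the condition at a fixed interval and times out after an hour, so a missing file fails the run instead of blocking it forever.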

Each operator is defined individually. However, operators can communicate information to each other using XComs.
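
A short sketch of this mechanism follows, with illustrative task names: the return value of a PythonOperator is automatically pushed to XCom, and a downstream task pulls it back:

```python
# Sketch of two tasks exchanging a value through XComs.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def produce():
    # The return value of a PythonOperator is automatically pushed
    # to XCom under the key "return_value".
    return {"rows_processed": 42}

def consume(ti):
    # In Airflow 2.x, context variables such as "ti" (the task
    # instance) are passed to the callable when named in its signature.
    value = ti.xcom_pull(task_ids="produce")
    print(f"received {value} from the upstream task")

with DAG(
    dag_id="xcom_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    producer = PythonOperator(task_id="produce", python_callable=produce)
    consumer = PythonOperator(task_id="consume", python_callable=consume)

    producer >> consumer
```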

Hooks

On Airflow, Hooks allow interfacing with third-party systems. They allow the connection to external APIs and databases like Hive, S3, GCS, MySQL, and Postgres… Confidential information, such as login credentials, is kept outside the Hooks: it is stored in an encrypted metadata database associated with the current Airflow instance.
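
As a sketch, here is how a Hook might be used inside a task. It assumes that the apache-airflow-providers-postgres package is installed and that a connection named "my_postgres" has been registered in Airflow; the connection ID, query, and table are illustrative:

```python
# Hedged example: reading from Postgres through a Hook.
# Assumes apache-airflow-providers-postgres is installed and an Airflow
# connection named "my_postgres" exists; credentials live in the
# encrypted metadata database, not in this code.
from airflow.providers.postgres.hooks.postgres import PostgresHook

def fetch_recent_users():
    # This function would typically be passed as the python_callable
    # of a PythonOperator.
    hook = PostgresHook(postgres_conn_id="my_postgres")
    rows = hook.get_records("SELECT id, email FROM users LIMIT 10;")
    for row in rows:
        print(row)
```

Note that the task only references the connection ID; the credentials themselves are resolved at run time from the encrypted connection store.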

Plugins

Airflow plugins can be described as a combination of Hooks and Operators. They are used to accomplish specific tasks involving an external application. An example would be transferring data from Salesforce to Redshift.
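
To illustrate this combination, here is a hedged sketch of a custom transfer operator built from two Hooks. The class name, parameters, and SQL are illustrative, not an official Airflow API; a real Salesforce-to-Redshift plugin would pair a Salesforce hook with a Redshift hook in the same way:

```python
# Illustrative sketch: a custom operator combining two Hooks to move
# data from a source database to a destination database.
from airflow.models.baseoperator import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

class PostgresToPostgresOperator(BaseOperator):
    def __init__(self, sql, target_table, source_conn_id, dest_conn_id, **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.target_table = target_table
        self.source_conn_id = source_conn_id
        self.dest_conn_id = dest_conn_id

    def execute(self, context):
        source = PostgresHook(postgres_conn_id=self.source_conn_id)
        dest = PostgresHook(postgres_conn_id=self.dest_conn_id)
        # Extract with one hook, load with the other; insert_rows is
        # provided by the database hook and performs batched INSERTs.
        rows = source.get_records(self.sql)
        dest.insert_rows(table=self.target_table, rows=rows)
```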