Merge pull request #126 from teamclairvoyant/teamclairvoyant/sla-miss-report

Introducing Airflow SLA Miss Report

The airflow-sla-miss-report DAG consolidates the data from the metadata tables and emails meaningful insights to subscribers to help ensure SLAs are met. It stands out by focusing on custom KPIs and indicators that measure DAG performance, and by offering a comparative view of them. Users can also adjust the timeframes and the email list to suit their requirements.

More reading: https://blog.clairvoyantsoft.com/introducing-a-new-way-to-analyze-airflow-sla-misses-2b8ac7958738
prakshalj0512 authored Oct 24, 2022
2 parents 8a0fcb6 + 57b8020 commit fe592a5
Showing 4 changed files with 851 additions and 7 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -90,3 +90,6 @@ ENV/

# IDEA
.idea

# DS-STORE REMOVAL
.DS_Store
16 changes: 9 additions & 7 deletions README.md
@@ -3,16 +3,18 @@ A series of DAGs/Workflows to help maintain the operation of Airflow

## DAGs/Workflows

* backup-configs
* [backup-configs](backup-configs)
* A maintenance workflow that you can deploy into Airflow to periodically take backups of various Airflow configurations and files.
* clear-missing-dags
* A maintenance workflow that you can deploy into Airflow to periodically clean out entries in the DAG table of which there is no longer a corresponding Python File for it. This ensures that the DAG table doesn't have needless items in it and that the Airflow Web Server displays only those available DAGs.
* db-cleanup
* [clear-missing-dags](clear-missing-dags)
* A maintenance workflow that you can deploy into Airflow to periodically clean out entries in the DAG table of which there is no longer a corresponding Python File for it. This ensures that the DAG table doesn't have needless items in it and that the Airflow Web Server displays only those available DAGs.
* [db-cleanup](db-cleanup)
* A maintenance workflow that you can deploy into Airflow to periodically clean out the DagRun, TaskInstance, Log, XCom, Job DB and SlaMiss entries to avoid having too much data in your Airflow MetaStore.
* kill-halted-tasks
* [kill-halted-tasks](kill-halted-tasks)
* A maintenance workflow that you can deploy into Airflow to periodically kill off tasks that are running in the background that don't correspond to a running task in the DB.
* This is useful because when you kill off a DAG Run or Task through the Airflow Web Server, the task still runs in the background on one of the executors until the task is complete.
* log-cleanup
* [log-cleanup](log-cleanup)
* A maintenance workflow that you can deploy into Airflow to periodically clean out the task logs to avoid those getting too big.
* delete-broken-dags
* [delete-broken-dags](delete-broken-dags)
* A maintenance workflow that you can deploy into Airflow to periodically delete DAG files and clean out entries in the ImportError table for DAGs which Airflow cannot parse or import properly. This ensures that the ImportError table is cleaned every day.
* [sla-miss-report](sla-miss-report)
* A DAG that provides an extensive analysis report of SLA misses, broken down at the daily, hourly, and task levels.
84 changes: 84 additions & 0 deletions sla-miss-report/README.md
@@ -0,0 +1,84 @@
# Airflow SLA Miss Report

- [About](#about)
- [Daily SLA Misses (timeframe: `long`)](#daily-sla-misses-timeframe-long)
- [Hourly SLA Misses (timeframe: `short`)](#hourly-sla-misses-timeframe-short)
- [DAG SLA Misses (timeframe: `short, medium, long`)](#dag-sla-misses-timeframe-short-medium-long)
- [Sample Email](#sample-email)
- [Sample Airflow Task Logs](#sample-airflow-task-logs)
- [Architecture](#architecture)
- [Requirements](#requirements)
- [Deployment](#deployment)
- [References](#references)


### About
Airflow allows users to define [SLAs](https://github.com/teamclairvoyant/airflow-maintenance-dags/blob/teamclairvoyant/sla-miss-report/sla-miss-report/README.md) at DAG & task levels to track instances where processes are running longer than usual. However, making sense of the data is a challenge.

The `airflow-sla-miss-report` DAG consolidates the data from the metadata tables and provides meaningful insights to help ensure that any defined SLAs are met.

The DAG utilizes **three (3) timeframes** (default: `short`: 1d, `medium`: 3d, `long`: 7d) to calculate the following KPIs:

#### Daily SLA Misses (timeframe: `long`)
The following details are broken down by day over the provided long timeframe (e.g. 7 days):
```
SLA Miss %: percentage of tasks that missed their SLAs out of total task runs
Top Violator (%): task that violated its SLA the most as a percentage of its total runs
Top Violator (absolute): task that violated its SLA the most on an absolute count basis during the day
```
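
The daily KPIs above can be illustrated with a minimal sketch. It is not the DAG's actual implementation (which depends on `pandas`); it uses only the standard library, and the `RUNS` sample data and the `daily_sla_summary` helper are hypothetical names introduced for illustration. It assumes each task run already carries a flag saying whether it missed its SLA.

```python
from collections import Counter, defaultdict

# Hypothetical task runs: (run date, task id, whether the run missed its SLA).
RUNS = [
    ("2022-10-01", "extract", True),
    ("2022-10-01", "load", True),
    ("2022-10-01", "load", True),
    ("2022-10-01", "load", False),
    ("2022-10-01", "load", False),
    ("2022-10-02", "extract", False),
    ("2022-10-02", "load", False),
    ("2022-10-02", "load", True),
]


def daily_sla_summary(runs):
    """Return {date: KPIs} with the three daily KPIs described above."""
    total = defaultdict(Counter)   # date -> task -> number of runs
    missed = defaultdict(Counter)  # date -> task -> number of SLA misses
    for run_date, task, sla_missed in runs:
        total[run_date][task] += 1
        missed[run_date][task] += int(sla_missed)

    summary = {}
    for run_date in sorted(total):
        runs_by_task = total[run_date]
        misses_by_task = missed[run_date]
        all_runs = sum(runs_by_task.values())
        all_misses = sum(misses_by_task.values())
        summary[run_date] = {
            "sla_miss_pct": round(100 * all_misses / all_runs, 2),
            # Task with the highest share of missed runs (ties: first seen).
            "top_violator_pct": max(
                runs_by_task, key=lambda t: misses_by_task[t] / runs_by_task[t]
            ),
            # Task with the highest raw number of missed runs (ties: first seen).
            "top_violator_abs": max(runs_by_task, key=lambda t: misses_by_task[t]),
        }
    return summary


for day, kpis in daily_sla_summary(RUNS).items():
    print(day, kpis)
```

In the sample data for the first day, `extract` missed its only run while `load` missed two of four, so the percentage-based and absolute top violators differ. This is why the report lists both.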

#### Hourly SLA Misses (timeframe: `short`)
The following details are broken down by hour over the provided short timeframe (e.g. 1 day):
```
SLA Miss %: percentage of tasks that missed their SLAs out of total task runs
Top Violator (%): task that violated its SLA the most as a percentage of its total runs
Top Violator (absolute): task that violated its SLA the most on an absolute count basis during the day
Longest Running Task: task that took the longest time to execute within the hour window
Average Task Queue Time (s): average time tasks spend in the `queued` state; can be used to detect scheduling bottlenecks
```
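
The two hourly-only KPIs (longest running task and average queue time) can be sketched as follows. This is a standard-library illustration under stated assumptions, not the DAG's code: queue time is taken to be the gap between when a task was queued and when it started, runs are bucketed by the hour in which they started, and `TASK_INSTANCES` and `hourly_runtime_summary` are hypothetical names.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical task instances: (task id, queued at, started at, ended at).
TASK_INSTANCES = [
    ("extract", "2022-10-01 09:00:00", "2022-10-01 09:00:10", "2022-10-01 09:05:10"),
    ("load", "2022-10-01 09:10:00", "2022-10-01 09:10:30", "2022-10-01 09:12:30"),
    ("extract", "2022-10-01 10:00:00", "2022-10-01 10:00:05", "2022-10-01 10:01:05"),
]


def hourly_runtime_summary(task_instances):
    """Return {hour: KPIs} with the longest running task and average queue time."""
    by_hour = defaultdict(list)
    for task, queued, started, ended in task_instances:
        queued_at = datetime.strptime(queued, FMT)
        started_at = datetime.strptime(started, FMT)
        ended_at = datetime.strptime(ended, FMT)
        # Assumption: a run belongs to the hour in which it started.
        hour = started_at.replace(minute=0, second=0)
        queue_s = (started_at - queued_at).total_seconds()
        duration_s = (ended_at - started_at).total_seconds()
        by_hour[hour].append((task, queue_s, duration_s))

    summary = {}
    for hour in sorted(by_hour):
        rows = by_hour[hour]
        longest = max(rows, key=lambda row: row[2])
        summary[hour.strftime("%Y-%m-%d %H:00")] = {
            "longest_running_task": longest[0],
            "avg_queue_time_s": round(sum(row[1] for row in rows) / len(rows), 2),
        }
    return summary


for hour, kpis in hourly_runtime_summary(TASK_INSTANCES).items():
    print(hour, kpis)
```

A rising average queue time with stable execution times would point at the scheduler or executor capacity rather than at the tasks themselves.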

#### DAG SLA Misses (timeframe: `short, medium, long`)
The following details are broken down at the task level for all timeframes:
```
Current SLA (s): current defined SLA for the task
Short, Medium, Long Timeframe SLA miss % (avg execution time): % of tasks that missed their SLAs & their avg execution times over the respective timeframes
```
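
As a rough illustration of the comparative view, the sketch below computes the SLA miss % and average execution time of a single task over the three default timeframes. It is a standard-library sketch rather than the DAG's implementation, and it assumes a run misses its SLA when its duration exceeds the SLA. `NOW`, `SLA_SECONDS`, `RUNS`, and `timeframe_summary` are hypothetical names.

```python
from datetime import datetime, timedelta

# Default timeframes from this README, in days.
TIMEFRAMES_IN_DAYS = {"short": 1, "medium": 3, "long": 7}

# Hypothetical reference time, SLA, and runs of one task: (start time, duration in s).
NOW = datetime(2022, 10, 8)
SLA_SECONDS = 300
RUNS = [
    (datetime(2022, 10, 7, 12), 400),
    (datetime(2022, 10, 7, 6), 200),
    (datetime(2022, 10, 6, 12), 100),
    (datetime(2022, 10, 5, 12), 150),
    (datetime(2022, 10, 2, 12), 350),
]


def timeframe_summary(runs, sla_seconds, now):
    """Return {timeframe: KPIs}, or None for a timeframe without any runs."""
    summary = {}
    for name, days in TIMEFRAMES_IN_DAYS.items():
        cutoff = now - timedelta(days=days)
        durations = [duration for started, duration in runs if started >= cutoff]
        if not durations:
            summary[name] = None
            continue
        # Assumption: a run misses its SLA when it runs longer than the SLA.
        misses = sum(duration > sla_seconds for duration in durations)
        summary[name] = {
            "sla_miss_pct": round(100 * misses / len(durations), 2),
            "avg_execution_time_s": round(sum(durations) / len(durations), 2),
        }
    return summary


for name, kpis in timeframe_summary(RUNS, SLA_SECONDS, NOW).items():
    print(name, kpis)
```

Reading the three timeframes side by side shows whether a task's recent behavior is better or worse than its longer-term trend.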

#### **Sample Email**
![Airflow SLA miss Email Report Output1](https://user-images.githubusercontent.com/32403237/193700720-24b88202-edae-4199-a7f3-0e46e54e0d5d.png)

#### **Sample Airflow Task Logs**
![Airflow SLA miss Email Report Output2](https://user-images.githubusercontent.com/32403237/194130208-da532d3a-3ff4-4dbd-9c94-574ef42b2ee8.png)


### Architecture
The process reads data from the Airflow metadata database and calculates SLA misses against the SLAs defined at the DAG/task level.
The following metadata tables are utilized:
- `SerializedDag`: retrieve defined DAG & task SLAs
- `DagRuns`: details about each DAG run
- `TaskInstances`: details about each task instance in a DAG run

![Airflow SLA Process Flow Architecture](https://user-images.githubusercontent.com/8946659/191114560-2368e2df-916a-4f66-b1ac-b6cfe0b35a47.png)
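
The flow above can be approximated with a small sketch that joins the three sources and flags SLA misses. Everything here is an assumption made for illustration: the in-memory structures stand in for extracts of the metadata tables, the field layout is simplified, and a miss is taken to mean that a task instance ran longer than its SLA. It does not query Airflow and is not the DAG's actual code.

```python
from datetime import timedelta

# Hypothetical stand-in for task SLAs read from the serialized DAG definitions.
TASK_SLAS = {
    ("etl", "extract"): timedelta(minutes=5),
    ("etl", "load"): timedelta(minutes=2),
}
# Hypothetical stand-in for DAG run details, keyed by (dag_id, run_id).
DAG_RUNS = {
    ("etl", "run_1"): {"state": "success"},
    ("etl", "run_2"): {"state": "failed"},
}
# Hypothetical task instances: (dag_id, run_id, task_id, duration in seconds).
TASK_INSTANCES = [
    ("etl", "run_1", "extract", 240.0),
    ("etl", "run_1", "load", 150.0),
    ("etl", "run_2", "extract", 360.0),
    ("etl", "run_2", "notify", 5.0),  # no SLA defined, so it is skipped
]


def flag_sla_misses(task_slas, dag_runs, task_instances):
    """Join the three sources and flag task instances that exceeded their SLA."""
    flagged = []
    for dag_id, run_id, task_id, duration_s in task_instances:
        sla = task_slas.get((dag_id, task_id))
        if sla is None:
            continue  # only tasks with a defined SLA are considered
        flagged.append(
            {
                "dag_id": dag_id,
                "run_id": run_id,
                "task_id": task_id,
                "run_state": dag_runs[(dag_id, run_id)]["state"],
                "sla_s": sla.total_seconds(),
                "duration_s": duration_s,
                "sla_missed": duration_s > sla.total_seconds(),
            }
        )
    return flagged


for row in flag_sla_misses(TASK_SLAS, DAG_RUNS, TASK_INSTANCES):
    print(row["run_id"], row["task_id"], "missed" if row["sla_missed"] else "met")
```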

### Requirements
- Python: 3.7 and above
- Pip packages: `pandas`
- Airflow: v2.3 and above
- Airflow metadata tables: `DagRuns`, `TaskInstances`, `SerializedDag`
- [SMTP details](https://airflow.apache.org/docs/apache-airflow/stable/howto/email-config.html#using-default-smtp) in `airflow.cfg` for sending emails

### Deployment
1. Log in to the machine running Airflow
2. Navigate to the `dags` directory
3. Copy the `airflow-sla-miss-report.py` file to the `dags` directory. Here's a fast way:
```
wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/sla-miss-report/airflow-sla-miss-report.py
```
4. Update the global variables in the DAG with the desired values:
```
EMAIL_ADDRESSES (optional): list of recipient emails to send the SLA report to
SHORT_TIMEFRAME_IN_DAYS: duration in days of the short timeframe to calculate SLA metrics (default: 1)
MEDIUM_TIMEFRAME_IN_DAYS: duration in days of the medium timeframe to calculate SLA metrics (default: 3)
LONG_TIMEFRAME_IN_DAYS: duration in days of the long timeframe to calculate SLA metrics (default: 7)
```
5. Enable the DAG in the Airflow Webserver
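
For illustration, the global variables from step 4 might be set as shown below. The timeframe values are the defaults stated above; the email addresses are placeholders, and `check_timeframes` is a hypothetical sanity check that is not part of the DAG. It assumes the timeframes should be positive whole days in increasing order.

```python
# Recipients are optional; these addresses are placeholders.
EMAIL_ADDRESSES = ["data-team@example.com", "oncall@example.com"]
SHORT_TIMEFRAME_IN_DAYS = 1
MEDIUM_TIMEFRAME_IN_DAYS = 3
LONG_TIMEFRAME_IN_DAYS = 7


def check_timeframes(short, medium, long):
    """Hypothetical check: positive whole days with short < medium < long."""
    values = (short, medium, long)
    if any(not isinstance(value, int) or value <= 0 for value in values):
        raise ValueError("timeframes must be positive whole numbers of days")
    if not short < medium < long:
        raise ValueError("expected short < medium < long")
    return values


print(
    check_timeframes(
        SHORT_TIMEFRAME_IN_DAYS, MEDIUM_TIMEFRAME_IN_DAYS, LONG_TIMEFRAME_IN_DAYS
    )
)
```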