Similar to other data-oriented companies, here at Digital Turbine we have pretty complex data pipelines. They consist of data streams coming from Kafka, aggregation jobs by Spark, data storage systems, and data analytic tools such as Druid and Trino.
Above all, we use Apache Airflow to orchestrate all these pipelines, to make sure the jobs are synced and timed correctly, that retry in case of failure and easily show the status of each phase in the pipeline.
Since these data pipelines provide the data which our clients consume, we need them to meet our high standards. Above all expectations, we pay special attention to data completeness, data correctness and data latency.
We all expect certain things from our service providers. Take your internet provider for example:
The set of expectations of a customer from the service he receives is called the Service Level Agreement, or in short, the SLA. Therefore, as professional engineers we do our best to meet these expectations!
In most cases, Apache Airflow helps us achieve this. We are provided with a full picture of the current status, what was successful and what has failed – enabling us to see the data completeness. We can also plan our tasks to check that their output is exactly as expected. If the result is a fail and retry, we can tackle the data correctness expectation as well.
What was missing on our side was the full picture of what should have already been finished, but hasn’t yet.
If we should have finished working on the 8:00am data by 10:00am and it didn’t happen, we want to know about it. Moreover, we want an alert for this incident and another alert when it finishes. We also want to view the history of the situation:
Airflow has an inherent SLA alert mechanism. When the scheduler sees such an SLA miss for some task, it sends an alert by email. The problem is, that this email is nice, but we can’t really know when each task is eventually successful. Moreover, even if there is such an email upon success following an SLA miss, it does not give us a good view of the current status at any given time. Most likely, we would have to manually trace our emails to achieve this:
We knew that we were looking for a metric, rather than just some alerts.
We normally use Prometheus to store these metrics and Grafana to display them and alert based on a predetermined rule.
Apart from sending alert emails following an SLA miss, Airflow also saves the SLA misses in its database. and we use them to create our own metrics reporting system.
In simple terms, SLAyer is an independent server application, looking at Airflow’s database. It reports the current status to Prometheus, by sending metrics per dag, task and execution date currently in violation of its SLA.
Once we know what the current status is, what the status was at 2:00am, and at what times during the day we experienced SLA misses (if any), we can receive all the relevant information via our nice and familiar Grafana board. If we want alerts as a complementary mechanism, we can easily send it from Grafana, as well as another alert when the graph has stopped displaying alerts.
We created the SLAyer with the following steps:
For the SLAyer project, we use the following requirements:
In this step, we want to understand which tasks should have already finished and which ones haven’t yet finished.
To do this, we can look at Airflow’s database, specifically the tables sla_miss, task_instance, and dag_run.
Therefore, we use the following SQL query to receive each dag, task and execution date showing in the sla_miss table, in which the task state is not valid by the task_instance table, and the dag state is not successful by the dag_run table.
In this step, we want to report the current status to Prometheus, by sending a metric per dag task and execution date that is currently in violation of its SLA. For that, we use prometheus_client API.
In this step, we want to build a docker image for our slayer server. For that, in the Dockerfile, we start from a python base image, add the slayer folder and install the project requirements.
In this step, we want to deploy the slayer on Kubernetes.
To do this, follow these steps:
After that, we create a service file to expose the slayer server as a network service.
At Digital Turbine, we use Prometheus Operator to create a Prometheus Data Source for Grafana. Prometheus Operator implements the Kubernetes Operator pattern for managing a Prometheus-based Kubernetes monitoring stack, it can automatically generate monitoring target settings based on Kubernetes label queries. We carried this out using a ServiceMonitor, which declaratively specifies how groups of services should be monitored. The Operator automatically generates Prometheus scrape configuration based on the definition.
This means, if there is a new metrics endpoint matching the ServiceMonitor criteria, this target is automatically added to all the Prometheus servers selecting that ServiceMonitor.
Now in the Prometheus-K8S Data Source, we can see our new slayer query. We create a new Grafana board with the following metric:
You can see the full implementation here :