Using Telegraf Operator Internally at DT

This blog post provides detailed instructions on how to monitor JMX metrics with Prometheus on Kubernetes (K8S) using the Telegraf operator.

Motivation

This work started with an incident on our system: we failed to connect to our Kafka brokers while attempting to write events to Kafka. Because the Kafka producer was unmonitored, it was hard to understand the problem. After the incident retrospective, I took the initiative to monitor our Kafka producers in the production environment.

Challenges

The app we are using is a Scala (JVM-based) app. A quick search online revealed that on the JVM, the Kafka producer exposes its metrics through JMX.

Our monitoring setup on K8S is based on Prometheus as the metrics database. Additionally, we use a standard ServiceMonitor layout to expose our metrics, which allows the Prometheus operator to scrape them. The challenge, therefore, was to figure out how to expose the JMX metrics to Prometheus for our K8S application.

Solution

We found several solutions for this technical challenge, but two stood out as the most reasonable:

  • Prometheus JMX Metrics Exporter
  • Telegraf operator for K8S using the Jolokia metrics agent

Prometheus JMX metrics exporter

A collector that can configurably scrape and expose MBeans of a JMX target for Prometheus.

Telegraf operator for K8S using the Jolokia metrics agent

Telegraf is a plugin-driven server agent for collecting and sending metrics and events from databases, systems, and IoT sensors.

The Telegraf operator is an application designed to create and manage individual Telegraf instances in Kubernetes clusters, whereas Jolokia is an agent-based approach for remote JMX access.

We chose the Telegraf operator because we found it more convenient to configure. More importantly, Telegraf supports various plugins beyond the JMX metrics plugin, with additional functionality such as monitoring apps written in non-JVM languages and monitoring system metrics.

The schematic flow: our JVM app runs the Jolokia agent, which collects JMX metrics and exposes them on a dedicated port. The Telegraf operator creates a sidecar container within each of our application pods; this container scrapes the JMX metrics from the Jolokia agent and exposes them for Prometheus in the appropriate format.

Implementation process

Step 1: Make the Jolokia agent available to the application container

First of all, we want the Jolokia agent jar file to be accessible to our application container. We used an init container, since init containers run before the app containers are started, which allows us to fetch the Jolokia agent jar file. To do that, we add the following to our service deployment YAML:
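A minimal sketch of that part of the deployment follows. It assumes a shared emptyDir volume (called jolokia here), a generic curl image, and illustrative Maven Central coordinates for the agent jar; the names, image, and download URL are ours to pick and should be adapted to your setup:

spec:
  template:
    spec:
      initContainers:
        - name: fetch-jolokia                # illustrative name
          image: curlimages/curl:8.5.0       # any image with curl will do
          command:
            - sh
            - -c
            # illustrative Jolokia version; verify the coordinates for the release you use
            - curl -L -o /opt/jolokia/jolokia-jvm-agent.jar https://repo1.maven.org/maven2/org/jolokia/jolokia-jvm/1.7.2/jolokia-jvm-1.7.2.jar
          volumeMounts:
            - name: jolokia
              mountPath: /opt/jolokia
      containers:
        - name: our-service                  # the application container
          volumeMounts:
            - name: jolokia
              mountPath: /opt/jolokia
      volumes:
        - name: jolokia
          emptyDir: {}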

Here we downloaded the Jolokia jar to our local file system and placed it at /opt/jolokia, so that our service container can access the agent on this path.

You can replace the command section and fetch the file from your preferred object storage.

Finally, to run our application with the Jolokia agent, we define JAVA_TOOL_OPTIONS in the YAML:
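A sketch of the relevant part of the application container spec, assuming the agent jar was saved as jolokia-jvm-agent.jar in the previous step (the jar file name and host binding are assumptions):

      containers:
        - name: our-service
          env:
            - name: JAVA_TOOL_OPTIONS
              # -javaagent attaches the Jolokia agent; 8778 is the port it listens on
              value: "-javaagent:/opt/jolokia/jolokia-jvm-agent.jar=port=8778,host=0.0.0.0"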

Notice that we configure port 8778 as the port on which the Jolokia agent exposes its metrics.

Step 2: Deploy the Telegraf operator on the K8S cluster

The Telegraf operator for K8S is an open-source project that you can find here: https://github.com/influxdata/telegraf-operator

We deploy the Telegraf operator to our K8S environment by using the dev.yml file from the GitHub repository and running:
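For example (the deploy/dev.yml path is an assumption about the repository layout; adjust it if the project has moved the file):

# apply straight from GitHub
kubectl apply -f https://raw.githubusercontent.com/influxdata/telegraf-operator/master/deploy/dev.yml

# or, from a local clone of the repository
kubectl apply -f deploy/dev.yml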

Notice how we created a new telegraf-operator namespace with the Telegraf operator pod (or pods) on the K8S cluster.

Now we define a K8S secret containing the Telegraf operator's output configurations, using a classes.yml file, as follows:
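A minimal classes.yml sketch, assuming the secret name and namespace that the dev deployment of the operator expects (telegraf-operator-classes in the telegraf-operator namespace); check the operator's configuration if you changed either:

apiVersion: v1
kind: Secret
metadata:
  name: telegraf-operator-classes      # the secret name the operator reads classes from
  namespace: telegraf-operator
stringData:
  prometheus: |
    # Telegraf output configuration (TOML): expose metrics in Prometheus format on port 9273
    [[outputs.prometheus_client]]
      listen = ":9273"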

Note that we defined a 'prometheus' class under stringData, which writes metrics in Prometheus format to port 9273. Later, we will show how we use it. Let's add this to the K8S cluster:
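For instance, with kubectl:

kubectl apply -f classes.yml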

This is how you would do it using kubectl; however, for our production environment, we created Helm charts based on these YAML files to integrate them with our K8S deployment.

Step 3: Set up our service to work with the Telegraf operator

Now we configure the application to use the Telegraf operator for exposing metrics. In the application deployment YAML file, we added the following:
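A sketch of where the annotations live in the Deployment's pod template; the TOML that goes into the inputs annotation is built up in the next snippets:

spec:
  template:
    metadata:
      annotations:
        # the value of this annotation is Telegraf input configuration in TOML
        telegraf.influxdata.com/inputs: |
          ...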

Adding the telegraf.influxdata.com/… annotations to our YAML signals the Telegraf operator to add the sidecar container to each pod launched from this deployment. Additionally, we configure Telegraf's Jolokia plugin to point at port 8778, where our Jolokia JMX metrics are exposed, with this configuration:
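Inside that inputs annotation, a sketch of the Jolokia input, assuming Telegraf's jolokia2_agent plugin and the default /jolokia endpoint on the port we configured earlier:

          [[inputs.jolokia2_agent]]
            # scrape the Jolokia agent running inside the same pod
            urls = ["http://localhost:8778/jolokia"]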

Next, we defined which JMX MBean the Jolokia plugin will expose as metrics, with this configuration:
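For example, a sketch exposing the standard Kafka producer MBean (the metric name and tag keys here are our own choices):

          [[inputs.jolokia2_agent.metric]]
            name     = "kafka_producer"
            # 'client-id=*' matches every producer client the app creates
            mbean    = "kafka.producer:type=producer-metrics,client-id=*"
            tag_keys = ["client-id"]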

You can replace the MBean with your desired Kafka producer metrics or any other MBean you’d prefer!

Finally, we used the 'prometheus' class definition from the classes.yml file, so that Telegraf exposes our JMX metrics in Prometheus format:
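This is done with the class annotation on the same pod template (the annotation key is the one the operator documents; the value references the 'prometheus' key we defined in classes.yml):

      annotations:
        telegraf.influxdata.com/class: prometheus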

Step 4: This is working!

That's it!

Once the deployment is done, our application exposes its JMX metrics on port 9273 in Prometheus format. Try it and check :9273/metrics. You should see something like this:

kafka_producer_request_rate{client_id=,host=,jolokia_agent_url="http://localhost:8778/jolokia/"} 0.2811884900178086

All that's left is to define a ServiceMonitor for scraping the metrics with Prometheus, and we are good to go: our K8S application's JMX metrics will be in Prometheus.
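A rough ServiceMonitor sketch, assuming a Service that selects the application pods and exposes container port 9273 under a port named metrics (all names and labels here are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: our-service-telegraf           # illustrative name
  labels:
    release: prometheus                # must match your Prometheus operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: our-service                 # labels on the Service fronting the application pods
  endpoints:
    - port: metrics                    # the named Service port mapped to container port 9273
      path: /metrics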

Summary

We covered how to monitor JMX metrics with Prometheus on a K8S cluster: we run our JVM app with the Jolokia agent (obtained via an init container) and deploy the Telegraf operator to scrape the Jolokia metrics and expose them for Prometheus.
