In the previous post, we monitored our Kafka metrics using Prometheus and visualized the health of Kafka in Grafana. Now we will set up an alert, so that whenever any Kafka broker is down, we receive a notification.
For Kafka, a single broker is just a cluster of size one, and we rarely run a single broker in practice. If that lone broker goes down, our whole Kafka service stops and we can't generate any metrics to alert on. So, let's get started by:
Setting up a multi-broker cluster
First, we create a config file for each of the new brokers:
> cd Downloads/kafka_2.12-2.2.0
> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties
Now edit these new files and set the following properties:
config/server-1.properties:
broker.id=1
listeners=PLAINTEXT://:9093
log.dirs=/tmp/kafka-logs-1

config/server-2.properties:
broker.id=2
listeners=PLAINTEXT://:9094
log.dirs=/tmp/kafka-logs-2
The broker.id property is the unique and permanent name of each node in the cluster. We have to override the port and log directory only because we are running these all on the same machine and we want to keep the brokers from all trying to register on the same port or overwrite each other’s data.
Now that our multi-broker cluster is set up, let's start generating metrics.
- Start ZooKeeper (e.g. ./bin/zookeeper-server-start.sh config/zookeeper.properties &).
- Start the first Kafka broker with the JMX exporter running as a Java agent.
KAFKA_OPTS="$KAFKA_OPTS -javaagent:$PWD/jmx_prometheus_javaagent-0.6.jar=7071:$PWD/kafka-0-8-2.yml" \
./bin/kafka-server-start.sh config/server.properties &
- Now, start the second Kafka broker with the JMX exporter running as a Java agent.
KAFKA_OPTS="$KAFKA_OPTS -javaagent:$PWD/jmx_prometheus_javaagent-0.6.jar=7072:$PWD/kafka-0-8-2.yml" \
./bin/kafka-server-start.sh config/server-1.properties &
- Finally, start the third Kafka broker with the JMX exporter running as a Java agent.
KAFKA_OPTS="$KAFKA_OPTS -javaagent:$PWD/jmx_prometheus_javaagent-0.6.jar=7073:$PWD/kafka-0-8-2.yml" \
./bin/kafka-server-start.sh config/server-2.properties &
Visit http://localhost:7071/ for the metrics generated by broker one, http://localhost:7072/ for broker two, and http://localhost:7073/ for broker three.
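A quick way to confirm all three exporter endpoints are serving metrics is to probe the three ports from the shell. This is a small sketch; the `check_port` helper is just for illustration and assumes the exporter ports configured above:

```shell
# probe an HTTP endpoint and report whether it responds (2-second timeout)
check_port() {
  curl -sf --max-time 2 "http://localhost:$1/" > /dev/null \
    && echo "port $1: up" \
    || echo "port $1: down"
}

# check the three JMX exporter ports from the setup above
for port in 7071 7072 7073; do
  check_port "$port"
done
```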
Start Prometheus for monitoring Kafka metrics
- Visit http://localhost:9090/graph. This is the Prometheus interface, where we can query all the data scraped from the Kafka exporters.
Provide the query in the Expression field.
- Edit the prometheus.yml file as follows:
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'kafka'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:7071','localhost:7072','localhost:7073']
- Click on Execute.
You will be able to see all the active brokers. You can also check the same under the Status menu, Targets option.
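For example, querying the per-instance `up` metric in the Expression field shows each broker's scrape status (1 = up, 0 = down); the `job="kafka"` label matches the job name defined in the scrape config:

```
up{job="kafka"}
```

Summing this series gives the number of live brokers.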
Setting up Grafana
- Start Grafana
- By default, Grafana listens on http://localhost:3000. The default login is “admin” / “admin”.
In the previous post, I explained how to add Prometheus as a data source. If you haven’t set it up, please refer here: Monitoring Kafka with Prometheus and Grafana.
Next, we need to configure a sender so Grafana can deliver alerts. For this post, I’ll send alert notifications through email, so we will set up the SMTP configuration.
cd Downloads/grafana-6.1.4/conf
gedit default.ini
In the SMTP section, make the following changes:
####################### SMTP / Emailing #####################
[smtp]
enabled = true
host = smtp.gmail.com:465
user = <your_email_id>
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
password = <your_password>
cert_file =
key_file =
skip_verify = true
from_address = <your_email_id>
from_name = Grafana
ehlo_identity =
Make the same changes in the sample.ini file.
In Grafana, create a new dashboard and, in the panel’s Queries tab, set “Queries to” as Prometheus.
- Provide the alert rule name and the evaluation interval, i.e. how often Grafana should check the status of the Kafka brokers.
- Define the condition. Since I have three brokers, I set the condition as:
WHEN sum() OF query(A, 5m, now) IS BELOW 3
which means that if the sum of my active brokers is less than 3, i.e. even if one broker is down, Grafana sends me an alert.
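Here query(A, 5m, now) refers to the panel’s query A. If, for example, query A is `sum(up{job="kafka"})` (an assumption — use whichever query your panel defines), the equivalent standalone Prometheus expression would be:

```
sum(up{job="kafka"}) < 3
```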
- Under error handling, set the state to “Alerting”, and attach a notification.
Now, in the Alerting menu, configure a Notification Channel. Choose the type of notification, i.e. the medium through which you want the alert delivered; here I use Email. Provide the recipients of the alerts and any other necessary configuration.
Now, let’s make one of our broker down to check the alerting. Either stop the broker form processing or by killing the port on which broker is running. For ex:
fuser -k 7072/tcp
We can see the status of the downed broker in Prometheus.
You will now see the alert email in the inbox of the address(es) configured in the notification channel.
Suppose one of your brokers goes down because of a bug. During that time, your producer is unable to produce messages (at least to some partitions). If the offline broker was a partition leader, a new leader is elected from the replicas that are in sync.
What happens when a broker is down depends on your configuration. If you’re using a synchronous producer (where the ordering of messages is important), you need to implement your own retries; the synchronous producer doesn’t handle the scenario where a broker is down.
The producer flags its cluster metadata as stale and re-fetches it on the next attempt (including the new leadership info). But you may need to back off a bit and allow the cluster to recover first: rescue the exception, sleep briefly, then try again.
The async producer does this automatically. Thus, getting an alert on time will be beneficial.
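The back-off-and-retry pattern described above can be sketched as a small shell wrapper. This is illustrative only: the `retry` helper and the console-producer invocation are assumptions, not part of the original setup:

```shell
# retry: run a command up to N times, sleeping between attempts
retry() {
  local attempts=$1; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0          # success: stop retrying
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep 1           # back off before the next attempt
    fi
  done
  return 1              # all attempts failed
}

# e.g. retry a produce call while the cluster recovers (illustrative command):
# retry 5 sh -c 'echo "hello" | ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test'
```

A real implementation would typically use exponential back-off rather than a fixed one-second sleep.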
This setup proves very effective for collecting metrics, preventing problems, and alerting you promptly in case of emergencies.