Building Intelligence Into Monitoring

In our last post, we discussed the pitfalls of traditional monitoring solutions and touched upon a possible solution to those issues. Let us now analyse that solution in more detail.

Intelligent Monitoring (IM) should be able to decrease the burden of system monitoring in multiple ways. An IM system should be able to listen to multiple data streams in real time and act upon them intelligently. Let us look at each of the pitfalls that we saw in the last post.

Set up and configuration: This effort should decrease significantly, since the monitoring system should be able to learn on its own. It should remove the need for the administrator to set alert thresholds for every incoming data stream. On the basis of the real-time event flow, via SNMP traps or any other mechanism, the IM system should be able to understand what the normal event flow looks like during the course of business execution. For example, it should be able to understand that usage is high between 9 AM and 12 noon EST. High usage here means that CPU usage is above 70%, memory consumption exceeds 2 GB on the machines where PDF generation takes place, the email server sends more than 3000 emails, and so on.
Apart from the alerts that the system deciphers by itself, there could be some correlations that the system administrator wants to hint to the IM system. Hence, there should be an interface through which the administrator can specify correlations to the IM system.
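
To make the self-learning idea concrete, here is a minimal sketch in Python of how a baseline could be learned per metric and per hour of day instead of being configured by hand. The class name `BaselineLearner`, the metric names and the sample values are all hypothetical and chosen only for illustration; a real IM system would likely use something more robust than a plain mean and standard deviation.

```python
from collections import defaultdict
from statistics import mean, pstdev


class BaselineLearner:
    """Learns a (mean, standard deviation) baseline per metric and hour of day.

    Hypothetical illustration: no thresholds are configured up front,
    the baseline comes only from the values that have been observed.
    """

    def __init__(self):
        # (metric name, hour of day) -> list of observed values
        self._samples = defaultdict(list)

    def observe(self, metric, hour, value):
        """Record one observation for a metric during a given hour."""
        self._samples[(metric, hour)].append(value)

    def baseline(self, metric, hour):
        """Return (mean, std dev) for the metric and hour, or None if unseen."""
        values = self._samples.get((metric, hour))
        if not values:
            return None
        return mean(values), pstdev(values)


# Assumed sample data for the 10 AM hour inside the peak window.
learner = BaselineLearner()
for cpu in (72.0, 75.0, 78.0):
    learner.observe("cpu_percent", 10, cpu)
for emails in (3100, 3300, 3200):
    learner.observe("emails_sent", 10, emails)

cpu_mean, cpu_std = learner.baseline("cpu_percent", 10)
```

Administrator hints could be layered on top of such a learner, for example as extra rules that name which metrics are expected to move together.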

Root Cause: Since the IM system is correlation based, it knows what normal behavior is and what an anomaly is. As soon as it detects an event that does not fit the normal pattern, it should generate an alert. For example, suppose that during the 9 AM – 12 noon peak window from the scenario above, CPU usage is 73% and memory consumption is 2.1 GB, but only 400 emails are sent. This clearly deviates from the normal behavior that the IM system is used to, so it generates an alert and gives the system administrator a starting point for debugging.
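
The check described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the profile is hard-coded here, whereas an IM system would learn it, and the names `PEAK_PROFILE`, `find_deviations` and `check_window` are hypothetical.

```python
# Hypothetical learned profile for the 9 AM - 12 noon peak window:
# each metric maps to the lower bound seen under normal peak load.
PEAK_PROFILE = {"cpu_percent": 70.0, "memory_gb": 2.0, "emails_sent": 3000}


def find_deviations(profile, observed):
    """Return the metrics whose observed value is below the learned bound.

    A metric missing from `observed` is treated as 0 and therefore deviates.
    """
    return sorted(
        metric
        for metric, floor in profile.items()
        if observed.get(metric, 0) < floor
    )


def check_window(profile, observed):
    """Return an alert message naming the deviating metrics, or None."""
    deviating = find_deviations(profile, observed)
    if not deviating:
        return None
    return "Anomaly in peak window, starting point: " + ", ".join(deviating)


# The scenario from the text: CPU and memory look like peak load,
# but the number of emails sent is far below the usual volume.
observed = {"cpu_percent": 73.0, "memory_gb": 2.1, "emails_sent": 400}
alert = check_window(PEAK_PROFILE, observed)
```

Because the alert names the metric that broke the usual pattern, the administrator can start by looking at the email server rather than at every machine.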

Historical Data: As we saw earlier, traditional monitoring systems let you archive loads of raw, uncorrelated data for ages without realizing any value from it. This data has little value for analysis unless you write some complicated jobs, for example on Hadoop, to analyse it.
The IM system should be able to store historical data intelligently. By this we mean that, since the system knows about the correlations and variations across data streams, it can decide which information needs to be stored in detail and which routine information can be stored only in summary. Hence, in our scenario, it should store the anomaly in more detail than the routine behavior. This historical data could be used for reporting, charting and other administrative purposes.
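
One way to picture this kind of selective storage is the sketch below, which keeps anomalous events in full and folds routine events into running totals. The `HistoryStore` class and the event layout (a dictionary with a `metrics` key) are assumptions made for this example, not a description of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class HistoryStore:
    """Keeps anomalies in full detail and routine events only as a summary."""

    detailed: list = field(default_factory=list)  # full copies of anomalous events
    summary: dict = field(default_factory=dict)   # metric -> [count, running total]

    def record(self, event, is_anomaly):
        """Store the whole event if anomalous, otherwise only update the totals."""
        if is_anomaly:
            self.detailed.append(dict(event))
            return
        for metric, value in event["metrics"].items():
            entry = self.summary.setdefault(metric, [0, 0.0])
            entry[0] += 1
            entry[1] += value

    def average(self, metric):
        """Average of the routine values recorded for a metric."""
        count, total = self.summary[metric]
        return total / count


# Assumed events: two routine peak-hour readings and one anomaly.
store = HistoryStore()
store.record({"hour": 10, "metrics": {"emails_sent": 3000}}, is_anomaly=False)
store.record({"hour": 11, "metrics": {"emails_sent": 3200}}, is_anomaly=False)
store.record({"hour": 10, "metrics": {"emails_sent": 400}}, is_anomaly=True)
```

The routine readings cost only a counter and a total per metric, while the anomaly remains available in full for later reporting.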

Once an IM system takes care of the above pain points, it could possibly eradicate 80% (by the 80/20 rule) of the issues that system administrators face today. The other 20% could possibly be solved by providing custom rules or hints to the IM system.

In the next post, I will introduce you to Premon, Inphina’s flagship product, which builds on machine learning and complex event processing based correlation techniques to deliver an intelligent monitoring solution for enterprises. Stay tuned.