Millions of lines of software are written every day. Only a subset of this code makes its way to production, but even so, a huge amount of software is being served from data centers and from private and public clouds. Monitoring the software, hardware, and other infrastructure components is a herculean task, if not an impossible one.
In the good old days, applications were not that complex. Users were not scattered across the globe, and there was virtually no cloud. Application monitoring in those days was limited to a few people observing infrastructure statistics, manually or with a few cron jobs, taking corrective action, and then making sure the application was running fine.
Slowly but surely came a flood of tools and techniques promising to make life easier by providing metrics about each and every component in your enterprise system. These tools would also alert you if certain thresholds were breached or if something was not performing as expected. It quickly became painfully obvious that all this data was itself a problem. Why? Because it is simply too much data.
We recently had a call with one of our enterprise customers who said that, in spite of having some of the best-of-breed monitoring solutions, he is still shooting in the dark when making infrastructure-related decisions. In his words,
I get a lot of alerts, in fact several thousand of them, from the multiple monitoring systems that we have. However, I am still lost. I do not know how to make sense of them. I am overwhelmed.
So what should a monitoring system look like?
An enterprise should assess a solution on two main parameters:
- The monitoring solution should alert them to a problem on the basis of correlated information, so that they can take corrective action based on that information.
- The monitoring system should provide correlated information that lets them drill down to the root cause of a problem.
With a growing number of monitoring systems come a growing number of alerts and log files. This is a lot of raw data. What is required is a way to process this information and extract the critical, relevant parts. That is the hard part, and it is where many monitoring systems struggle today.
For instance, it is easy to send an alert when CPU usage has grown to, say, 90%. That is a threshold alert. It is also easy to send an alert when free disk space falls below 15%. Again, a threshold alert.
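As a minimal sketch of the threshold alerts described above (the metric names and limits here are assumptions for illustration, not from any particular monitoring product):

```python
def check_thresholds(metrics):
    """Return alert messages for any breached thresholds.

    `metrics` is a dict of current readings, e.g. from an agent poll.
    """
    alerts = []
    # Threshold alert: CPU usage above 90%
    if metrics.get("cpu_percent", 0) > 90:
        alerts.append("CPU usage above 90%")
    # Threshold alert: free disk space below 15%
    if metrics.get("free_disk_percent", 100) < 15:
        alerts.append("Free disk space below 15%")
    return alerts

print(check_thresholds({"cpu_percent": 95, "free_disk_percent": 40}))
# → ['CPU usage above 90%']
```

Each check stands alone, which is exactly why this style scales poorly: every new condition adds another independent alert rather than a combined judgment.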
Now what about a scenario like this: if CPU usage goes up by 8%, and within the next 3 minutes disk space decreases by 15%, the log files start showing OutOfMemory exceptions, and there are more than 10 slow queries on the database, then take some action.
Or how about recognizing simple decision-making patterns like "find a pattern where free memory was less than 10 and, in the next 60 seconds, an OutOfMemory was logged".
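The free-memory pattern above can be sketched as a simple time-window correlation. This is an illustrative sketch only: the `Event` structure and event names are assumptions, and a real correlation engine would stream events rather than scan lists.

```python
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: float  # seconds since epoch
    name: str         # e.g. "low_free_memory", "out_of_memory_logged"

def find_pattern(events, first, second, window_seconds=60):
    """Return (a, b) pairs where event `second` follows event `first`
    within `window_seconds`."""
    matches = []
    firsts = [e for e in events if e.name == first]
    seconds = [e for e in events if e.name == second]
    for a in firsts:
        for b in seconds:
            if 0 <= b.timestamp - a.timestamp <= window_seconds:
                matches.append((a, b))
    return matches

events = [
    Event(100.0, "low_free_memory"),
    Event(130.0, "out_of_memory_logged"),   # 30 s later: within the window
    Event(500.0, "out_of_memory_logged"),   # too late: outside the window
]
print(find_pattern(events, "low_free_memory", "out_of_memory_logged"))
```

The point is that the alert fires on the *pair* of events in sequence, not on either event alone.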
This is monitoring on the basis of correlated events, and it is what we call intelligent monitoring: monitoring where you are able to take intelligent decisions on the basis of correlated data, not just raw data. Correlating alerts and making machine-learned decisions on the raw data would make life easier for system administrators, release managers, and anyone else who makes decisions.
Intelligent monitoring will define the next step of enterprise monitoring. Stay tuned.
3 thoughts on "Monitoring Enterprise Systems: Then, Now and Then"
Interesting thoughts, but isn't there a simpler way, instead of getting into the complex logic of correlating events and updating correlated-event-based intelligent monitoring code, similar to the SMARTS technology that EMC acquired some time ago? That technology is now passé.
How about something that lets you monitor and control the resources used by an application, that is, all processes within that application put together? So, for example, if you have an antivirus, a backup, or some other app running on your PC or server, you can make sure that its resource consumption doesn't exceed a value you specify. Even better, any unused resources (CPU, memory, disk, and network I/O) lying idle, not used by your primary app in real time, can be used by a secondary workload, for example some calculation or download. It ensures that your primary application's performance doesn't get impacted at all. You bring up resource utilization, make efficient use of your system, and get more done with the resources you have already spent money on. Want to know what technology can achieve this, for Linux as well as Windows? Check out http://silverline.librato.com — the beta is free. Works on physical, virtual, or cloud server instances.
Hi ComputeMeghadoot, thanks for your thoughts.
I think what you describe is again threshold-based monitoring and action-taking: if the threshold is breached, then do this. Nevertheless, the site you mention is interesting and I will take a look.
Think of intelligent monitoring as something that is itself intelligent and learns by the day. Perhaps it has learned that the CPU utilization of all the nodes in my cluster increases uniformly between, say, 10-12 AM EST. Then one fine day we see that for 4 out of these 5 nodes the utilization has increased as anticipated, but for the fifth node it has not. This is an anomaly in the pattern, and it warrants investigation.
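The five-node scenario above amounts to comparing each node against a learned baseline for that time window. A minimal sketch, assuming the baseline has already been learned; the node names, utilization figures, and tolerance are invented for illustration:

```python
def anomalous_nodes(baseline, current, tolerance=0.2):
    """Flag nodes whose current utilization deviates from the learned
    baseline by more than `tolerance` (as a fraction of the baseline)."""
    flagged = []
    for node, expected in baseline.items():
        observed = current.get(node, 0.0)
        if abs(observed - expected) / expected > tolerance:
            flagged.append(node)
    return flagged

# Learned pattern: all five nodes normally sit near 70% CPU in this window.
baseline = {"node1": 70, "node2": 70, "node3": 70, "node4": 70, "node5": 70}
# Today's readings: four nodes follow the pattern, one does not.
current = {"node1": 72, "node2": 68, "node3": 71, "node4": 69, "node5": 25}

print(anomalous_nodes(baseline, current))
# → ['node5']
```

No fixed threshold was breached here (25% CPU would look perfectly healthy to a threshold alert); the anomaly only appears relative to the learned pattern.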
Intelligent monitoring would also look at the infrastructure and the entire ecosystem of an enterprise. Hence an impact in one area could affect a whole block of applications, not just one. It could also happen that a series of events triggered across various pieces of infrastructure (hardware, software, firmware, etc.) has a correlated impact that forms a pattern and triggers relevant, more meaningful alerts. The avenues are boundless. The way correlation helps is that it allows you to look at the complex event rather than a series of simple events. As you might have noticed from Wikipedia:
“church bells ringing, the appearance of a man in a tuxedo with a woman in a flowing white gown, and rice flying through the air. A complex event is what this monitoring system may infer from these events: that a wedding is happening. CEP is a technique that helps discover complex events by analyzing and correlating other events: the bells, the man and woman in wedding attire and the rice flying through the air.”