How good monitoring can help the business


There are many times when I see IT managers not monitoring their users’ services at all, or setting up a generic application and considering the job done. Most of the time, they realise how much is missing when they have to report on (recurrent) issues affecting business-critical services. Monitoring should not be considered a cure but a forecasting tool. When planned and configured as such, it’ll help prevent predictable failures and drive capacity planning.

What is monitoring?

Quoting Oxford’s online dictionary, a monitor is “a device used for observing, checking, or keeping a continuous record of something”. I like this definition very much because it highlights the reason why monitoring systems are built and used as they are nowadays. For historical and cultural reasons, they are not used as proactive systems.

Two main monitoring directions can be identified: polling for UP/DOWN service status and collecting statistics. Polling is what tools like Nagios or HP OpenView do: regularly check whether a piece of hardware or a service is UP and, if not, send an alert to the designated person. Collecting statistics is what tools like MRTG and RRDtool do: for each configured metric, they connect to equipment or applications and collect a set of values and states.
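
To make the polling idea concrete, here is a minimal Python sketch of an UP/DOWN check: it tries to open a TCP connection to each monitored service and raises an alert when one does not answer. This is only the bare principle, not how Nagios or HP OpenView actually work; the hostnames and the notify() function are placeholders.

    # Minimal sketch of the polling idea: check that a TCP service answers,
    # and raise an alert when it does not. Hostnames and the alerting
    # mechanism are placeholders, not a real monitoring configuration.
    import socket

    SERVICES = [
        ("mail.example.com", 25),   # hypothetical SMTP server
        ("www.example.com", 443),   # hypothetical web server
    ]

    def is_up(host, port, timeout=5.0):
        """Return True if a TCP connection can be established."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def notify(host, port):
        # Stand-in for whatever reaches the on-call person (mail, SMS, pager).
        print(f"ALERT: {host}:{port} is DOWN")

    for host, port in SERVICES:
        if not is_up(host, port):
            notify(host, port)

A real poller would run this on a schedule, retry a couple of times before alerting, and escalate by mail or SMS rather than printing to a console.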

Why is monitoring important?

Being aware of the global and detailed health of the infrastructure services is one of the keys to success when dealing with user satisfaction and business support. Monitoring is often set up to address straightforward questions such as “Is my application up and running?”, “Is the network down?”, “Does my server have a hardware issue?”, “Do I provide poor quality of service, or is my application slow?”. Those questions can be answered immediately for the present and the past. But the answers say nothing about the future. Unless you have configured your monitoring system with proactivity in mind.

There are times when sh!t just happens. This can’t be avoided and has to be solved fast when it happens. There is no magic wand for such events. All you can do is make sure you get the information fast and can correct the issue within the Recovery Time Objective (RTO). How you deal with recovering services is not the point here, so I won’t detail it. But basic monitoring helps you see what’s wrong and identify the point of failure, so that you can correct it quickly and/or communicate about the issue and the expected recovery time.

But there are things that can be done to predict issues. The most common problem you’ll hear about is slowness: “Access to this service takes ages”, “That application is damn slow”, “Emails take tens of minutes to arrive”, etc. Monitoring won’t solve those issues by itself. But it’ll help you deal with capacity planning, capacity management and quality of service. By regularly and proactively monitoring key metrics of your hardware and software, you’ll be able to anticipate load issues and prevent slowness, or at least know when unacceptable slowness is bound to be reached if nothing changes.
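
As a toy illustration of what regular, proactive collection can look like, the following Python sketch samples the system load average at a fixed interval and appends it to a CSV file for later analysis. It is not MRTG or RRDtool, just the bare idea; the file path and the five-minute interval are arbitrary assumptions, and os.getloadavg() only exists on Unix-like systems.

    # Minimal sketch of regular metric collection (the RRDtool idea, without
    # RRDtool): sample the system load average at a fixed interval and append
    # it to a CSV file so that trends can be analysed later.
    import csv
    import os
    import time

    METRICS_FILE = "load_history.csv"   # hypothetical storage location
    INTERVAL = 300                      # sample every 5 minutes

    while True:
        # getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
        load1, load5, load15 = os.getloadavg()
        with open(METRICS_FILE, "a", newline="") as f:
            csv.writer(f).writerow([int(time.time()), load1, load5, load15])
        time.sleep(INTERVAL)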

How to monitor efficiently?

Minimum monitoring consists of polling various hardware and software for UP/DOWN status. Whenever it is implemented, basic monitoring covers most of the required metrics.

Extending basic monitoring isn’t that complex: keep an eye on response time, or latency. Because IT systems are more and more shared and/or rely on cloud computing, the overall load of a service is no longer enough to assess quality of service.
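
Here is a minimal sketch of what watching response time can look like: measure how long a service takes to answer, not just whether it answers. The URL and the two-second target are assumptions made for the example.

    # Minimal sketch of latency monitoring: time a full request to a service
    # and compare the result against a quality-of-service target.
    import time
    import urllib.request

    URL = "https://www.example.com/"    # hypothetical monitored service
    TARGET = 2.0                        # assumed acceptable response time, in seconds

    start = time.perf_counter()
    with urllib.request.urlopen(URL, timeout=10) as response:
        response.read()                 # time the full download, not just the headers
    elapsed = time.perf_counter() - start

    status = "WARNING" if elapsed > TARGET else "OK"
    print(f"{status}: {URL} answered in {elapsed:.2f}s (target {TARGET}s)")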

Both previous cases addressed monitoring from the IT point of view, that is, the technical point of view. It allows you to prove that services are up and that hardware is properly configured and sized. But it lacks the user experience point of view, which is precisely what monitoring application quality of service has to capture.

Anticipate rather than endure

The most important thing with monitoring is probably to accept that numbers by themselves have limited value. Numbers can be interpreted in multiple ways according to the context in which they appear. To put it simply, 99% CPU usage is an issue if it leads to poor service performance. But it can be perfectly fine if, for example, it only happens from time to time because of a scheduled process (like a backup or statistics generation) and the result is still delivered in an acceptable amount of time. What matters here is: will such numbers affect the quality of service, and if so, when will this become critical to the user experience?
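
The following Python sketch illustrates that kind of context-aware interpretation: a high CPU reading only triggers an alert outside a scheduled backup window. The window boundaries and the 99% threshold are illustrative assumptions, not values from any particular tool.

    # Minimal sketch of context-aware alerting: 99% CPU is only treated as
    # a problem when it is not explained by the scheduled backup job.
    from datetime import datetime, time as dtime

    CPU_THRESHOLD = 99.0                          # assumed critical CPU percentage
    BACKUP_WINDOW = (dtime(1, 0), dtime(3, 0))    # hypothetical 01:00-03:00 backup job

    def in_backup_window(now):
        start, end = BACKUP_WINDOW
        return start <= now.time() <= end

    def should_alert(cpu_percent, now):
        """High CPU is only an issue when the backup job does not explain it."""
        return cpu_percent >= CPU_THRESHOLD and not in_backup_window(now)

    # 99.5% CPU at 02:00 is expected, the same reading at 14:00 is not.
    print(should_alert(99.5, datetime(2012, 5, 1, 2, 0)))    # False
    print(should_alert(99.5, datetime(2012, 5, 1, 14, 0)))   # True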

Once you have collected the relevant metrics, you have to analyse them in order to derive forecasts about the various aspects of the infrastructure or the monitored applications. One-shot audits provide a snapshot of the monitored environment at a particular moment. Statistical projections enable a forward-looking view of the global health of the monitored environment.

“When will CPU power or RAM run short to an unacceptable degree?”, “When will storage stop serving data fast enough for acceptable performance?”, “When should I renew or buy extra hardware to ensure proper quality of service?”. Those are questions which can be addressed by a complete monitoring solution, provided that you use it as an input for supervision and capacity planning rather than as a performance-history browsing tool. This is called Metrology.
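
As a closing illustration of the metrology idea, here is a small Python sketch that fits a linear trend to collected disk-usage samples and estimates when usage will cross a critical level. The sample data and the 90% threshold are made up; a real projection would use far more data and, ideally, a better model than a straight line.

    # Minimal sketch of a statistical projection: fit a linear trend to
    # disk-usage samples and estimate when usage crosses a critical level.
    def linear_fit(xs, ys):
        """Ordinary least-squares fit; returns (slope, intercept)."""
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
                 / sum((x - mean_x) ** 2 for x in xs))
        return slope, mean_y - slope * mean_x

    days = [0, 30, 60, 90, 120]             # made-up sampling dates, in days
    usage = [52.0, 58.5, 63.0, 69.5, 75.0]  # made-up disk usage, in percent
    THRESHOLD = 90.0                        # assumed critical fill level

    slope, intercept = linear_fit(days, usage)
    days_left = (THRESHOLD - intercept) / slope
    print(f"Disk should reach {THRESHOLD}% around day {days_left:.0f}")

Used this way, the same numbers that answer “is it up?” today also tell you when to order hardware, which is exactly the shift from performance-history browsing to capacity planning.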