Go to your monitoring system right now.
What color is everything?
If you even have to look you need to rethink how you’re monitoring
The answer is that everything is green or acknowledged.
Here are my rules for making monitoring useful again by monitoring like an adult.
Monitoring is configured automatically
Monitoring configurations should be generated on the fly when a node is added to the pool of available servers. It helps if servers can be tagged as non-operational so that they do not alert until they are added to the pool. I prefer configurations generated by a configuration management system that’s version controlled to just version controlled hand edited files.
Stats collection and monitoring should be separate
Move the statistic collection jobs out of monitoring and let them be handled by something else specific to that task. This makes it easier to turn off any checks that cause trouble on either part of the system.
Checks are periodically culled based on usefulness
“If it can’t stay green, it’s gone”. This step is probably the second most useful of all. Nothing defeats the purpose of monitoring more than false positives. Alerts start getting ignored if a specific check gets a reputation as less than reliable. This slowly undermines confidence in the system as a whole. I have seen many environments where a system was implemented, then utterly ignored to the detriment of all.
End-to-end checks are only useful and should not be implemented until every step along the way is already monitored
While end to end checks can be very useful if there is no way to figure out why they are failing they can drive some extremely poor decision making. People tend to latch on to a few key metrics and drive decisions from those they see frequently. If the end to end check slows down because the monitoring box is out of memory then all of the nodes you throw into the service cluster will not improve. Make sure that every step along the way is monitored and has performance data or you will end up repeating some of my most regrettable moments. Push hard to implement them last.
Dependencies, escalation paths and response times are clear and reasonable
This is the softest and most important bit of monitoring. If there is a NOC then it should be clear how to escalate issues otherwise if one of four nodes is down and your end-to-end check is fine handle it at a reasonable hour. Also, the development team needs to be pat of the escalation procedure. Handling non-critical services in a non-critical manner means that there is more energy to be focused on revenue impacting outages.
Like many worthwhile things you only get out of monitoring what you put into it. View it as an investment into helping your and your team pinpoint problems quickly and efficiently not as a means to CYA.