We maintain many different services at the OSG Operations Center: OIM, MyOSG, GOC Ticket, Software, BDII, TWiki, Jira, and OSG Display, to name a few. We also have many auxiliary services that work somewhat behind the scenes: RSVProcess, Data Cluster, Monitor, Internal, Jump, RSV Clients, GOC TX, and so on. Many of these services have multiple instances, most also have ITB instances, and some even have dedicated development instances. To make it worse, each service usually hosts more than one application (MyOSG has half a dozen consolidator applications), and most of them were developed and are maintained by us.
Sometimes I wonder how on earth we keep up with all of these, but one piece of infrastructure that has helped us in recent years is a system we call the "GOC Alert & Service Monitoring" system.
Since there is no way to monitor every single service and instance manually, each service has a set of scripts that monitor it from within. Our services are also very dynamic: they are constantly updated, and the environment around them changes frequently, which creates new kinds of issues and new ways for something to go wrong. Unlike Nagios or RSV, where similar tests are executed across all services, each of our services has its own unique set of scripts, which are constantly updated and fine-tuned in the course of our normal operations. If anything goes wrong, we usually consider it a failure of our service monitoring scripts, and we update them to prevent any future occurrence of a similar issue. (We also have a meta-service monitor that remotely checks that our service monitors are actually being executed on each service.)
Service monitor scripts are usually simple shell scripts that send out messages if something is wrong or about to go wrong. The messages are sent directly to our off-site messaging bus, which we call "GOC Alert". Currently, GOC Alert is implemented as a Google Groups mailing list. The idea is that, if something goes wrong at the Operations Center, we want to ship that information out of our environment as soon as possible, and as far away as possible, so that the messages are not destroyed or made unavailable by the outage itself. Have you ever wondered why airplanes keep their flight data recorder on board, instead of sending the information via satellite to an off-site location so that people can analyze it as soon as an incident occurs? It is a similar idea here.
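To make the idea concrete, here is a minimal sketch of what one such check could look like. Our real monitors are shell scripts, so this Python version is only an illustration; the list address, the thresholds, the subject tag format, and all function names are assumptions made up for this example.

```python
import shutil
import smtplib
import socket
from email.message import EmailMessage

# Hypothetical off-site mailing list address; not the real GOC Alert address.
ALERT_ADDRESS = "goc-alert@example.org"


def classify_usage(used_pct, warn_pct=80.0, crit_pct=95.0):
    """Map a disk usage percentage to a severity tag, or None if healthy."""
    if used_pct >= crit_pct:
        return "CRITICAL"
    if used_pct >= warn_pct:
        return "WARNING"
    return None


def check_disk(path="/"):
    """Return a (severity, detail) pair for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    return classify_usage(used_pct), f"disk {used_pct:.0f}% full on {path}"


def build_alert(service, severity, detail, host=None):
    """Build a tagged message; the tags in the subject are what subscribers filter on."""
    host = host or socket.gethostname()
    msg = EmailMessage()
    msg["From"] = f"monitor@{host}"
    msg["To"] = ALERT_ADDRESS
    msg["Subject"] = f"[{service}] [{severity}] {detail}"
    msg.set_content(f"host: {host}\nservice: {service}\n{detail}\n")
    return msg


def send_alert(msg, smtp_host="localhost"):
    """Hand the message to a mail relay (assumed to be reachable on smtp_host)."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)


def run_check(service):
    """Stay silent when healthy; otherwise ship the alert off-site."""
    severity, detail = check_disk("/")
    if severity is not None:
        send_alert(build_alert(service, severity, detail))
```

A cron job on each service instance could call `run_check("myosg")` every few minutes. The important property is that the script says nothing when all is well and pushes a tagged message out of the local environment the moment a threshold is crossed.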
Every single service at the Operations Center posts messages with its own tags to GOC Alert. Once the messages are published there, we can subscribe to messages with specific tags or keywords in various ways. We can set up an email filter that pulls only certain tags, or let another monitoring service handle the posted messages. One such service is what we call "Central Monitor". It has a component called the "GOC-Alert forwarder", which receives all GOC Alerts and analyzes their content. Based on information such as which service is having what type of problem, and when, it either sends a new message or simply forwards the alert to some destination. For example, I have set up our forwarder so that if the RSV status of one of our high-priority services becomes critical during my off-hours (after 5, weekends, etc.), it sends an SMS message to my cell phone with basic information about the error. Central Monitor also tracks incoming critical-priority GOC tickets and sends alerts to our group chat room via XMPP, among the many other things it tracks.
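The off-hours SMS rule described above can be sketched roughly as follows. This is not the actual forwarder code; the set of high-priority services, the 8 a.m. start of the working day, and the function and field names are all assumptions for illustration.

```python
from datetime import datetime

# Hypothetical set of services treated as high priority.
HIGH_PRIORITY = {"myosg", "oim", "bdii"}

# Working hours; the 17:00 end comes from "after 5", the 08:00 start is assumed.
WORK_START, WORK_END = 8, 17


def is_off_hours(when):
    """True on weekends and outside WORK_START..WORK_END on weekdays."""
    if when.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    return not (WORK_START <= when.hour < WORK_END)


def route(service, check, status, when):
    """Decide what to do with one alert: page by SMS or just forward it."""
    if (
        service in HIGH_PRIORITY
        and check == "rsv"
        and status == "critical"
        and is_off_hours(when)
    ):
        return "sms"
    return "forward"


# A critical RSV alert for a high-priority service on a weekday evening pages;
# the same alert in the middle of a working day is only forwarded.
evening = route("myosg", "rsv", "critical", datetime(2024, 1, 3, 18, 0))
midday = route("myosg", "rsv", "critical", datetime(2024, 1, 3, 10, 0))
```

Keeping the rule as a small pure function of service, check, status, and time makes it easy to add new rules as new failure scenarios turn up.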
The diagram above shows a simplified view of the interaction between our monitoring components, which include Service Monitor, GOC Alert, and Central Monitor. Several other components work together with these to let us keep up with all of our services.