Monday, November 21, 2011

GOC holiday schedule

From 24/Nov through 27/Nov the GOC will be operating on a holiday
schedule. Staff will be available to respond to emergencies, but
routine operations will resume at the start of business Monday 28/Nov.

The GOC wishes its users and OSG staff a happy and satisfying
Thanksgiving Holiday.

Thursday, November 3, 2011

Moving Services to Bloomington

As you know, the GOC updates services on the second and fourth Tuesday of each month.
The update scheduled for November 8th marks a milestone for the infrastructure team.
After this date, all GOC services (with one exception) will be hosted exclusively in
the Bloomington, Indiana data center.

Previously, most services had two instances, one physically hosted in Indianapolis
and the other in Bloomington. These instances are in a DNS round robin, allowing users
of these services transparent use of either instance. The GOC will continue to
operate (at least) two instances and keep them in round robin, but both instances
will be in Bloomington.

So why the change? Originally, the Bloomington machine room was extremely unreliable.
Problems included a leaky roof, insufficient cooling and power and a lack
of space. In short, the systems hosted there had outgrown the facility. The machine
room in Indianapolis was larger, newer and considered more reliable. The old Bloomington
machine room went down during a thunderstorm when it was discovered that both electrical
feeds were, at one point, hung from the same utility pole. (Care to guess where the
lightning struck?) Two weeks were required to restore power, during which many of the
university's enterprise services were unavailable. This situation was clearly unacceptable,
so the university decided to invest $37.2M in a new, state-of-the-art data center.

The 92,000 sq. ft. Bloomington data center is designed to withstand category 5 tornadoes.
The facility is secured with card-key access and 7 x 24 x 365 video surveillance.
Only staff with systems or network administration privileges have access to the machine room,
and entry requires biometric identity verification. Fire suppression is provided by a double interlock
system accompanied by a Very Early Smoke Detection Apparatus (VESDA). Three circuits feed
the Data Center, traveling redundant physical paths from two different substations.
Any two circuits can fully power the building. A flywheel motor/generator set conditions
the power and provides protection against transient events, and uninterruptible power
supplies protect against outages of moderate (~1 hour) duration. Dual diesel generators
can provide power for 24 hours in the event of a longer-term power failure. In-house
chillers provide cooling; externally supplied chilled water plus city water can be used
in the event of a failure of this system.

Several advantages are realized by hosting all instances in one location. Service failures
associated with the network between Indianapolis and Bloomington are avoided. Because both
instances are on the same LAN, DNS round robin can be replaced with Linux Virtual Server (LVS),
giving control of the round robin to the GOC rather than to the DNS administrators at Indiana
University. Also avoided are failures associated with the loss of one of two data centers. It is
trivial to move virtual machines from host to host since the IP address of a VM does not change,
a property that allows detailed load balancing across all VM hosts.

The GOC looks forward to continuing to provide services with the availability OSG users
have come to expect.

Tuesday, November 1, 2011

EGI Technical Forum

Wrote this for the OSG Newsletter, but thought it would be good to drop here also.

Henry Kissinger famously asked, “Who do I call if I want to call Europe?” I was reminded of this quote when I attended the EGI Technical forum in Lyon, France and then visited CERN in September. While I don’t need to call Europe as a whole, as Operations Coordinator for OSG I may need to contact any one of the 30+ National Grid Infrastructures (NGI) that make up the European Grid Infrastructure (EGI) in times of an operational crisis.

Accompanied by Scott Teige, the OSG Operations Technical Lead, I presented material on both technical and personal communications between OSG, WLCG, and EGI, including Global Grid User Support (GGUS) ticket synchronization, availability and reliability reporting, Berkeley Database Information Index (BDII) information exchange, and various other ways to keep communication channels open between OSG and our European counterparts. We came away with several technical action items from the SAM team. In addition, OSG Operations took a seat as a non-voting member of the EGI Operations Management Board. We are also participating in the WLCG Operations Technical Evaluation Group, which will provide input to chart future activities within the WLCG. We look forward to continued collaboration with WLCG and EGI on an operational level.

~ Rob Quick

Sunday, October 2, 2011

Rob and Scott go to CERN and EGI Technical Forum

Early in the afternoon of the 15th of September, Rob Quick and I set off for Geneva to visit our collaborators at CERN and attend the European Grid Infrastructure (EGI) Technical Forum. We arrived without difficulty at 7:55 AM on the 16th and decided to complete the paperwork needed to become CERN users. Paperwork being what it is, this turned out to consume the entire day. Many thanks are due to David Collados, who picked us up and at day's end delivered us to our hotel.

We used Saturday to adjust our clocks by six time zones; working the entire day before actually helped. We hopped a bus to Ferney, just across the border in France, where there was an open-air market at which I bought some olives, bread, beer, and goat cheese. It turns out David lives near the market, and he and his wife gave us a ride back to Geneva. We visited the UN and gorged ourselves on fondue at a small local restaurant. Later in the evening Rob and I made dinner of my Ferney purchases.

On Sunday we hopped a train to Lyon, France, for the EGI conference. It was quite an enjoyable trip through the mountains. Our talks were on Monday, and we learned some interesting things and met many people we had known only by e-mail or con-call. Wednesday was a dinner at a Paul Bocuse restaurant featuring a dessert that made quite the impression on both of us: raspberries, chocolate mousse, and ice cream. On Friday we knocked off early and did tourist things, visiting the Roman ruins near the confluence of the Saone and Rhone rivers. Those folks built to last. Saturday it was back on the train to Geneva.

On Sunday we visited the old part of Geneva. As far as I could tell, "old" was defined as being contained within the original city walls. I ate a huge bucket of mussels cooked in white wine, garlic, and saffron. Rob had perch, and we both had fries, bread, and beer. Three days of very productive meetings with our CERN counterparts followed, then it was back home Thursday at noon. We had a bit of a delay between Washington and Indianapolis, but nothing serious.

We both needed Friday to recover; on Saturday I started catching up on chores neglected in my absence. Monday I will do the same at work. All in all, a productive if tiring trip.


Thursday, July 21, 2011

2011 OSG Summer School

Last month I had the privilege of attending the 2011 OSG Summer School in Madison, Wisconsin, with a group of select professionals from various countries around the world. My expectation of the class was, at the very least, to gain a better understanding of the user's perspective on the OSG. I thought that if I could fully experience the user side of OSG, I'd be able to offer even better assistance in resolving the issues that are presented to the members of the GOC on a daily basis.

One of the many benefits of attending the 2011 OSG Summer School was the opportunity to personally meet all those associated with the OSG face-to-face in Madison. Miron Livny, Alain Roy, Tim Cartwright, and Sarah Cushing were all very cordial and highly attentive to the needs of the students and visitors. I really liked the structure of the classes, with equal emphasis on lecture and hands-on sessions; it divided the day nicely, allowing for interaction with my fellow students. For me, the two most enjoyable parts of the school were the presentation by Miron Livny and the High-Throughput Computing Showcase. Miron's lecture on HTC, which included a little history lesson, helped paint a clearer picture in my mind of the OSG. The High-Throughput Computing Showcase invited four presenters (two faculty, two grad students) from different University of Wisconsin labs that make extensive use of local HTC resources and/or OSG. They shared a brief background of their work and described how OSG has aided them in achieving their goals as scientists. I found the presentations to be very exciting and informative. I've gained a deeper appreciation for the OSG and the collaborators who have worked together in making such an impact on the world we live in.

With a fuller knowledge of each step in submitting a job to the grid, I will be able to efficiently assess the many issues a user may encounter and provide a solution or direction with greater accuracy. I'll also be able to assist users with a multitude of tools and technologies, and help them utilize the OSG efficiently and properly.

The 2011 OSG Summer School was educational and fun. I highly recommend it for everyone interested in gaining a better understanding of the OSG and making connections with fellow students from around the world. Whether you're part of operations supporting the OSG or a scientist in search of better ways to handle your data, there's a lot to be gained from attending.

-Alain Deximo

Thursday, June 16, 2011

GOC Alert & Service Monitor Script

We maintain many different services at the OSG Operations Center: OIM, MyOSG, GOC Ticket, Software, BDII, TWiki, Jira, and OSG Display, to name a few. We also have many other auxiliary services that work somewhat behind the scenes: RSVProcess, Data Cluster, Monitor, Internal, Jump, RSV Clients, GOC TX, and so on. Many of these services have multiple instances, we also have ITB instances of most of them, and some even have dedicated development instances. To make matters worse, each service usually runs more than one application (MyOSG has half a dozen consolidator applications), and most of these were developed and are maintained by us.

Sometimes I wonder how on earth we keep up with all of this, but one piece of infrastructure that has helped us in recent years is a system we call the "GOC Alert & Service Monitoring" system.

Since there is no way to monitor every single service and instance manually, we have a set of scripts for each service that monitor it. Also, our services are very dynamic: they are constantly updated, and the environment that surrounds them also changes frequently, which creates new sets of issues and possible scenarios in which something could go wrong. Unlike Nagios or RSV, where similar tests are executed across all services, each of our services has its own unique set of scripts, which are constantly updated and fine-tuned during the course of normal operations. If anything goes wrong, it is usually considered a failure of our service monitoring scripts, and they will be updated to prevent any future occurrence of a similar issue. (We also have a meta-service monitor which remotely checks that the service monitors are actually being executed on each service.)

Service monitor scripts are usually simple shell scripts that send out messages if something is wrong or about to go wrong. The messages are sent directly to our off-site messaging bus, which we call "GOC Alert". Currently, GOC Alert is implemented using a Google Groups mailing list. The idea here is that if something goes wrong at the Operations Center, we want to ship that information out of our environment as soon as possible, and as far away as possible, to prevent the messages from being destroyed or becoming unavailable during the outage. Have you ever wondered why airplanes store their flight data recorders inside each airplane, instead of sending the information via satellite to an off-site location so that people can analyze it as soon as any incident occurs? A similar idea applies here.
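As a rough illustration, a monitor of this kind boils down to a health check plus a tagged message to the alert list. The sketch below is hypothetical (the service URL, alert address, and tag are invented, and the real GOC scripts are shell scripts with their own conventions):

```python
import smtplib
import urllib.request
from email.message import EmailMessage

# Hypothetical settings -- the real GOC scripts have their own
# endpoints, tags, and the actual GOC Alert list address.
SERVICE_URL = "https://myosg.example.org/"
ALERT_ADDR = "goc-alert@example.org"
TAG = "[myosg1]"

def check_service(url):
    """Return None if the service answers with HTTP 200, else an error string."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            if resp.status != 200:
                return "unexpected HTTP status %d" % resp.status
    except Exception as exc:
        return "request failed: %s" % exc
    return None

def send_alert(problem, send=None):
    """Post a tagged message to the GOC Alert bus (a mailing list)."""
    msg = EmailMessage()
    msg["Subject"] = "%s %s" % (TAG, problem)
    msg["To"] = ALERT_ADDR
    msg.set_content("Service %s reported: %s" % (SERVICE_URL, problem))
    if send is None:          # default transport: a local SMTP relay
        with smtplib.SMTP("localhost") as s:
            s.send_message(msg)
    else:                     # injectable transport (useful for testing)
        send(msg)

# A cron job would tie the two together:
#   problem = check_service(SERVICE_URL)
#   if problem:
#       send_alert(problem)
```

Because each message carries the service's tag in the subject, subscribers downstream can filter on exactly the services they care about.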

Every service at the Operations Center posts messages with its own tags to GOC Alert. Once the messages are published there, we can subscribe to messages with specific tags or keywords in various ways. We can set up an email filter which will only pull certain tags, or let another monitoring service handle the posted messages. One such service is what we call the "Central Monitor". It has a component called the "GOC-Alert forwarder" which receives all GOC Alerts, analyzes their content, and, based on information such as which service is having what type of problem and when, sends a message or simply forwards the alert to some destination. For example, I have set up our forwarder so that if the RSV status of one of our high-priority services becomes critical during my off-hours (after 5, weekends, etc.) it will send an SMS message to my cell phone with basic information about the error. The Central Monitor service also tracks incoming critical-priority GOC tickets and sends alerts to our group chat room via XMPP, among many other things.
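The routing decision such a forwarder makes can be sketched roughly as follows. The tags, cutoff hours, and destinations here are invented for illustration and do not reflect the actual forwarder configuration:

```python
from datetime import datetime

# Hypothetical set of tags treated as high priority.
HIGH_PRIORITY_TAGS = {"[myosg1]", "[oim1]", "[bdii]"}

def is_off_hours(now):
    """Off-hours: before 9 AM, after 5 PM, or a weekend."""
    return now.hour < 9 or now.hour >= 17 or now.weekday() >= 5

def route_alert(tag, status, now):
    """Decide where an incoming GOC Alert should be forwarded."""
    if status == "critical" and tag in HIGH_PRIORITY_TAGS and is_off_hours(now):
        return "sms"    # page the on-call cell phone
    if status == "critical":
        return "xmpp"   # post to the group chat room
    return "email"      # everything else stays on the mailing list
```

The point of the design is that the monitor scripts stay dumb (they just publish tagged messages), while all the "who gets woken up, and when" policy lives in one forwarder that can be changed without touching any service.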

The diagram above shows a simplified interaction between our various monitoring components, including the Service Monitor, GOC Alert, and Central Monitor. There are several other components that work together to allow us to keep up with all of our services.

Monday, June 6, 2011

Jumping off point...

Welcome to Inside OSG Ops, a place for stories revolving around the Operations services, events, and people that make up OSG Operations.

I'm going to start out the new blog by talking about the recent upgrade to the CERN Top-Level BDII, how it affected OSG Ops, and its status as of today. A few weeks ago OSG Ops got wind of a WLCG Top-Level BDII upgrade via the daily WLCG Ops call. At the time the upgrade was reported (May 12th), the change date was already quite near (May 17th). This left just a Friday and a Monday to test, and led to a quite extensive debugging ticket, opened on May 13th when FNAL did not appear in the test systems.

Due to this testing, the release was delayed until May 26th. Though the initial discrepancies were never explained, it was determined that the upgrade was not the cause of the issue reported. Because we never identified what caused the issue, or what the solution was, the GOC implemented ongoing testing to determine whether the WLCG Top-Level BDII or the OSG BDIIs drop any resources from reporting. Here is a recent visualization that shows the BDII stability over the past several days.
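At its core, a check like this compares successive snapshots of the resource list each BDII returns and flags anything that disappears. A minimal sketch, assuming the snapshots have already been collected (a real probe would fill them from periodic LDAP queries against each BDII endpoint, and the resource names below are made up):

```python
def dropped_resources(previous, current):
    """Return resources present in the previous snapshot but missing now."""
    return sorted(set(previous) - set(current))

# Hypothetical snapshots of resources advertised by a top-level BDII.
snapshot_noon = ["FNAL_GPGRID_1", "MWT2_UC", "Nebraska"]
snapshot_later = ["MWT2_UC", "Nebraska"]

missing = dropped_resources(snapshot_noon, snapshot_later)
if missing:
    print("Resources dropped from BDII:", ", ".join(missing))
```

Running such a comparison on both the WLCG Top-Level BDII and the OSG BDIIs over time is what lets a chart like the one above distinguish a resource that genuinely went down from one that was silently dropped from reporting.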

As always, we'll keep watch for any unusual activity.

Rob Quick
OSG Operations Coordinator