One advantage of taking a break, however short, from my weekly baking project is that it allows me to put some time into writing other things. Lately, I’ve been looking for better ways to do on-call notification at my workplace, and I seem to be coming up dry. I posted a “call for help” as it were on Twitter, but 140 characters is just a bit too short to explain what we are really looking for, so perhaps it’s not surprising that nobody responded. (The fact that few people in my line of work actually follow me on Twitter doesn’t help either!) So here’s a description of what we’re doing, and where I think I’d like to go, and if any of you folks out there can give me pointers, I’d greatly appreciate it. The stuff we have now is very fragile, has a lot of moving pieces, and depends on far too many manual steps to be truly reliable, but on the other side of the ledger, it has a model that actually makes sense and reflects the way we work — so we’d like to replace it, for the sake of maintainability, but we want to gain rather than lose functionality in doing so. The apparent market leader in this space, PagerDuty, is very expensive despite an extremely poor feature set; I haven’t yet gotten to the point of figuring out what the alternatives to PagerDuty are and whether any of them can come close to supporting our model. Maybe this will turn into a startup idea for someone (in which case, I’m happy to participate, but I’ll stick with my steady job, TYVM).
We are not a DevOps shop. There is a very good reason for this: we are a university research lab, not a SaaS provider or indeed a vendor of any kind; we have no devs. (Not entirely true, but take that as a given for the purposes of this discussion — the in-house apps we do have dev staff for are 9-to-5 internal business apps that do not implicate incident response.) We operate computing infrastructure for a thousand people, most of them graduate students, who use the network, compute clusters, and storage systems we provide to perform actual research, run experiments, write papers, publish data sets, and do other sciencey stuff. There are no SLAs, and the degree to which our users care about incident response is inversely proportional to the time until the next conference submission deadline for their particular field of study.
As for the current conditions: We have a small team, typically between 6 and 8 sysadmins, who are responsible for incident response. Incidents are reported two ways: by our automated monitoring system (Nagios), and by users calling a telephone hotline. Every Friday at noon, two team members are selected (by a crappy in-house app I wrote) to be primary and secondary on-call responders; these two people handle all hotline calls that make it through to voicemail, and all Nagios alerts outside of regular business hours. During business hours, all team members receive all Nagios alerts (but the hotline voicemail just goes to the on-call people because it doesn’t support scheduling — this is a bug). The on-call sysadmins are chosen by a simple algorithm: a calendar is consulted to determine when each sysadmin was most recently on call and which sysadmins are scheduled to be unavailable; the two sysadmins who will be available for the entire week and were least recently on-call (whether scheduled or in emergency substitution) will be on call for the following week, and if possible, will swap primary and secondary roles. If someone has an emergency and needs to be removed from on-call, any sysadmin can go into the app and replace them; a cron job running on the Nagios server will send out a confirmatory alert within 15 minutes.
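For anyone trying to picture the selection rule, here is a minimal Python sketch of it. This is not the code from the actual app: the `Sysadmin` record, its field names, and the `pick_on_call` function are all made up for illustration. It also reads "swap roles if possible" as "someone who was primary on their last shift takes secondary this time, when the other person can take primary", which is only one possible interpretation.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Sysadmin:
    """Hypothetical record; the field names are illustrative only."""
    name: str
    last_on_call: Optional[date] = None  # start of most recent shift (scheduled or substituted)
    last_role: Optional[str] = None      # "primary" or "secondary" on that shift
    unavailable: tuple = ()              # (start, end) date pairs, inclusive


def available_all_week(admin, week_start):
    """True if no unavailable period touches the Friday-to-Friday shift."""
    week_end = week_start + timedelta(days=7)
    return not any(start <= week_end and end >= week_start
                   for start, end in admin.unavailable)


def pick_on_call(team, week_start):
    """Return (primary, secondary) names for the shift starting on week_start."""
    candidates = [a for a in team if available_all_week(a, week_start)]
    if len(candidates) < 2:
        raise ValueError("fewer than two sysadmins are available for the whole week")
    # Least recently on call first; people who have never been on call sort earliest.
    # The name is only a tie-breaker to keep the result deterministic.
    candidates.sort(key=lambda a: (a.last_on_call or date.min, a.name))
    first, second = candidates[:2]
    # Assumed meaning of "swap roles if possible": avoid repeating primary duty
    # when the other selected person was not primary on their last shift.
    if first.last_role == "primary" and second.last_role != "primary":
        first, second = second, first
    return first.name, second.name
```

Calling `pick_on_call(team, date(2024, 2, 2))` with a list of `Sysadmin` records returns the primary and secondary names for the shift starting that Friday, and raises `ValueError` if fewer than two people are free for the whole week.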
We’re pretty satisfied with this setup, but it has some serious problems, which are motivating my search for an alternative implementation:
- Our vacation/travel schedules must be manually duplicated in (at least) three different places: our own personal calendars, our group calendar, and the on-call scheduling application. This leads to confusion, and worse, missed shifts, when sysadmins forget to sync up their schedules everywhere. (There’s also a fourth place — the HR application that tracks staff vacation allowances — but we can’t do anything about that, and not all vacations imply unavailable-for-on-call or vice versa.)
- Notification information is also manually duplicated in several places: our personal address books, our group wiki, the on-call scheduling app (it uses this to generate a “currently on call” page that can be IFRAMEd into some of our other internal Web pages), and the home-brew application that Nagios uses to send SMS.
- Sending outgoing SMS messages depends on a single point of failure, a MultiTech CDMA data terminal which is connected by serial port to our Nagios server. It falls over from time to time in strange ways, and occasionally needs to be over-the-air reprovisioned as Verizon Wireless makes changes in their network. And of course it’s limited to traditional 140-byte SMS, and essentially nobody uses these things any more, so if we have problems, Verizon is pretty clueless when it comes to helping us fix them. There are also undocumented rate limits, and the timeliness and reliability of SMS delivery (even to other VZW subscribers) are sometimes problematic.
- Our voicemail system doesn’t integrate with this at all. We use American Messaging’s “Universal Master PIN” service (a legacy from the long-ago days when we all carried SkyTel pagers) to provide a secret email address that the voicemail system can use to notify us of an incoming message. However, this depends on the sysadmins manually updating American Messaging when they go on-call, especially after mid-shift substitutions.
- The in-house on-call scheduling app is an ancient, fragile Rails 2 app that I built a decade ago, when we were still using pagers. Pretty much everything about it, other than the on-call selection algorithm, is wrong for today’s world, and it will be a challenge to get it to run on a modern system.
So what do we actually want in a replacement? Calendar-based scheduling is the most important thing. We would like two-way calendar synchronization, both so that the notification system can learn when we are unavailable (and distinguish unavailable-for-on-call from regular out-of-office) to adjust the schedule automatically, and so we can easily see who is on-call in our calendar applications. Obviously we want something that can handle our particular sort of rotation, and can deal with multiple users being notified simultaneously (not an escalation schedule) during business hours. It would be really great to be able to properly implement our primary/secondary thing with a real escalation ladder, and it would be nice to have configurable always-notify filters (so that, for example, our postmaster can be notified about mail system outages even when he’s not officially on call). Obviously, we want to get rid of American Messaging, and get rid of the CDMA data terminal and all of the cruft that goes along with hooking it into Nagios, not to mention the crufty old on-call scheduling app. (We are OK with a cloud-based app, in a way that we were not when the current system was implemented, because we now have an off-campus data center that lets us back up our alerting system in a way that does not depend on the network in our building being functional.)
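To make the routing part of that wish list concrete, here is a rough Python sketch of the recipient logic I have in mind. Everything in it is hypothetical: the function names, the service labels, and in particular the 9-to-5 weekday definition of "business hours", which is just a placeholder. It covers only simultaneous notification and always-notify filters, not the escalation ladder or calendar sync.

```python
from datetime import datetime, time

# Assumed business hours; the real boundaries would be configurable.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def is_business_hours(when):
    """Monday through Friday, between the assumed start and end times."""
    return when.weekday() < 5 and BUSINESS_START <= when.time() < BUSINESS_END


def recipients(service, when, team, on_call, always_notify):
    """Return everyone who should be notified at once for an alert.

    team          -- all sysadmin names
    on_call       -- [primary, secondary] for the current shift
    always_notify -- maps a service label to people who want its alerts regardless
    """
    # During business hours everyone is notified simultaneously;
    # otherwise only the on-call pair is.
    targets = list(team) if is_business_hours(when) else list(on_call)
    # Always-notify filters add people on top of the normal recipients.
    for person in always_notify.get(service, ()):
        if person not in targets:
            targets.append(person)
    return targets
```

With `always_notify = {"mail": ["carol"]}`, a mail outage at 3 a.m. on a Saturday would go to the two on-call people plus carol, while any other overnight alert would go to the on-call pair alone.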
I haven’t yet found anything that can do these things. PagerDuty clearly can’t, at least based on their public documentation: it can’t even cope with sending notifications to two people at the same time. Does anyone out there know of a service that would meet our needs, out of the box, without a huge investment in (probably outsourced) developer time? If so, please leave comments below.