Occasionally Coherent

In search of a usable on-call notification service

One advantage of taking a break, however short, from my weekly baking project is that it allows me to put some time into writing other things. Lately, I’ve been looking for better ways to do on-call notification at my workplace, and I seem to be coming up dry. I posted a “call for help,” as it were, on Twitter, but 140 characters is just a bit too short to explain what we are really looking for, so perhaps it’s not surprising that nobody responded. (The fact that few people in my line of work actually follow me on Twitter doesn’t help either!) So here’s a description of what we’re doing and where I think I’d like to go; if any of you folks out there can give me pointers, I’d greatly appreciate it. The stuff we have now is very fragile, has a lot of moving pieces, and depends on far too many manual steps to be truly reliable. On the other side of the ledger, it has a model that actually makes sense and reflects the way we work. So we’d like to replace it for the sake of maintainability, but we want to gain rather than lose functionality in doing so. The apparent market leader in this space, PagerDuty, is very expensive despite an extremely poor feature set; I haven’t yet gotten to the point of figuring out what the alternatives to PagerDuty are and whether any of them can come close to supporting our model. Maybe this will turn into a startup idea for someone (in which case, I’m happy to participate, but I’ll stick with my steady job, TYVM).

We are not a DevOps shop. There is a very good reason for this: we are a university research lab, not a SaaS provider or indeed a vendor of any kind; we have no devs. (Not entirely true, but take that as a given for the purposes of this discussion — the in-house apps we do have dev staff for are 9-to-5 internal business apps that do not implicate incident response.) We operate computing infrastructure for a thousand people, most of them graduate students, who use the network, compute clusters, and storage systems we provide to perform actual research, run experiments, write papers, publish data sets, and do other sciencey stuff. There are no SLAs, and the degree to which our users care about incident response is inversely proportional to the time until the next conference submission deadline for their particular field of study.

As for the current conditions: We have a small team, typically between 6 and 8 sysadmins, who are responsible for incident response. Incidents are reported two ways: by our automated monitoring system (Nagios), and by users calling a telephone hotline. Every Friday at noon, two team members are selected (by a crappy in-house app I wrote) to be primary and secondary on-call responders; these two people handle all hotline calls that make it through to voicemail, and all Nagios alerts outside of regular business hours. During business hours, all team members receive all Nagios alerts (but the hotline voicemail just goes to the on-call people because it doesn’t support scheduling — this is a bug). The on-call sysadmins are chosen by a simple algorithm: a calendar is consulted to determine when each sysadmin was most recently on call and which sysadmins are scheduled to be unavailable; the two sysadmins who will be available for the entire week and were least recently on-call (whether scheduled or in emergency substitution) will be on call for the following week, and if possible, will swap primary and secondary roles. If someone has an emergency and needs to be removed from on-call, any sysadmin can go into the app and replace them; a cron job running on the Nagios server will send out a confirmatory alert within 15 minutes.
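For concreteness, the selection rule described above could be sketched in Python roughly as follows. This is not the in-house app's actual code: the `Sysadmin` record, its field names, and the function names are all made up for illustration, the week is approximated as Friday-to-Friday calendar dates rather than noon-to-noon, and the role-swap step is just one reading of "if possible, will swap primary and secondary roles."

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional


@dataclass
class Sysadmin:
    # Hypothetical record; the real app reads this information from a calendar.
    name: str
    last_on_call: Optional[date] = None  # start of most recent on-call week
    last_role: Optional[str] = None      # "primary" or "secondary"
    unavailable: list[tuple[date, date]] = field(default_factory=list)  # inclusive ranges


def available_all_week(admin: Sysadmin, week_start: date) -> bool:
    """True if no unavailable range overlaps the Friday-to-Friday week."""
    week_end = week_start + timedelta(days=7)
    return all(end < week_start or start > week_end
               for start, end in admin.unavailable)


def pick_on_call(admins: list[Sysadmin], week_start: date) -> tuple[Sysadmin, Sysadmin]:
    """Return (primary, secondary) for the week beginning at week_start."""
    candidates = [a for a in admins if available_all_week(a, week_start)]
    if len(candidates) < 2:
        raise ValueError("fewer than two sysadmins are available for the whole week")
    # Least recently on call first; someone who was never on call sorts earliest.
    candidates.sort(key=lambda a: a.last_on_call or date.min)
    first, second = candidates[:2]
    # Assumed interpretation of the role swap: avoid making someone primary
    # twice in a row when the other pick was not primary last time.
    if first.last_role == "primary" and second.last_role != "primary":
        first, second = second, first
    return first, second
```

An emergency substitution would then amount to updating the replaced person's record and re-running the same selection with that person marked unavailable.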

We’re pretty satisfied with this setup, but it has some serious problems, which are motivating my search for an alternative implementation.

So what do we actually want in a replacement? Calendar-based scheduling is the most important thing. We would like two-way calendar synchronization, both so that the notification system can learn when we are unavailable (and distinguish unavailable-for-on-call from regular out-of-office) to adjust the schedule automatically, and so we can easily see who is on-call in our calendar applications. Obviously we want something that can handle our particular sort of rotation, and can deal with multiple users being notified simultaneously (not an escalation schedule) during business hours. It would be really great to be able to properly implement our primary/secondary thing with a real escalation ladder, and it would be nice to have configurable always-notify filters (so that, for example, our postmaster can be notified about mail system outages even when he’s not officially on call). Obviously, we want to get rid of SkyTel/American Messaging, and get rid of the CDMA data terminal and all of the cruft that goes along with hooking it into Nagios, not to mention the crufty old on-call scheduling app. (We are OK with a cloud-based app, in a way that we were not when the current system was implemented, because we now have the ability, through an off-campus data center, to back up our alerting system in a way that does not depend on the network in our building being functional.)
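To make the notification model a little more concrete, here is a rough Python sketch of the routing rule we would like a service to support: everyone at once during business hours, a primary-then-secondary ladder otherwise, plus always-notify filters. Everything here is hypothetical, including the function names, the `always_notify` mapping, and the 9-to-5 weekday definition of business hours, which is an assumption made for the example rather than our actual configuration.

```python
from datetime import datetime, time

# Assumed business hours for illustration only.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def is_business_hours(now: datetime) -> bool:
    """Monday through Friday, between the assumed start and end times."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def notification_tiers(now, service, team, primary, secondary, always_notify):
    """Return a list of tiers; each tier is a list of people notified together.

    Later tiers are only reached if nobody in an earlier tier acknowledges.
    """
    if is_business_hours(now):
        # Simultaneous notification of the whole team, no escalation.
        tiers = [list(team)]
    else:
        # Escalation ladder: primary first, then secondary.
        tiers = [[primary], [secondary]]
    # Always-notify filters: add service owners to the first tier,
    # whether or not they are officially on call.
    for person in always_notify.get(service, []):
        if person not in tiers[0]:
            tiers[0].append(person)
    return tiers
```

For example, with `always_notify = {"mail": ["postmaster"]}`, a mail outage at 3 a.m. on a Saturday would go to the primary and the postmaster together, and to the secondary only on escalation.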

I haven’t yet found anything that can do these things. PagerDuty clearly can’t, at least based on their public documentation: it can’t even cope with sending notifications to two people at the same time. Does anyone out there know of a service that would meet our needs, out of the box, without a huge investment in (probably outsourced) developer time? If so, please leave comments below.