One advantage of taking a break, however short, from my weekly baking project is that it allows me to put some time into writing other things. Lately, I’ve been looking for better ways to do on-call notification at my workplace, and I seem to be coming up dry. I posted a “call for help” as it were on Twitter, but 140 characters is just a bit too short to explain what we are really looking for, so perhaps it’s not surprising that nobody responded. (The fact that few people in my line of work actually follow me on Twitter doesn’t help either!) So here’s a description of what we’re doing, and where I think I’d like to go, and if any of you folks out there can give me pointers, I’d greatly appreciate it. The stuff we have now is very fragile, has a lot of moving pieces, and depends on far too many manual steps to be truly reliable, but on the other side of the ledger, it has a model that actually makes sense and reflects the way we work — so we’d like to replace it, for the sake of maintainability, but we want to gain rather than lose functionality in doing so. The apparent market leader in this space, PagerDuty, is very expensive despite an extremely poor feature set; I haven’t yet gotten to the point of figuring out what the alternatives to PagerDuty are and whether any of them can come close to supporting our model. Maybe this will turn into a startup idea for someone (in which case, I’m happy to participate, but I’ll stick with my steady job, TYVM).
We are not a DevOps shop. There is a very good reason for this: we are a university research lab, not a SaaS provider or indeed a vendor of any kind; we have no devs. (Not entirely true, but take that as a given for the purposes of this discussion — the in-house apps we do have dev staff for are 9-to-5 internal business apps that do not implicate incident response.) We operate computing infrastructure for a thousand people, most of them graduate students, who use the network, compute clusters, and storage systems we provide to perform actual research, run experiments, write papers, publish data sets, and do other sciencey stuff. There are no SLAs, and the degree to which our users care about incident response is inversely proportional to the time until the next conference submission deadline for their particular field of study.
As for the current conditions: We have a small team, typically between 6 and 8 sysadmins, who are responsible for incident response. Incidents are reported two ways: by our automated monitoring system (Nagios), and by users calling a telephone hotline. Every Friday at noon, two team members are selected (by a crappy in-house app I wrote) to be primary and secondary on-call responders; these two people handle all hotline calls that make it through to voicemail, and all Nagios alerts outside of regular business hours. During business hours, all team members receive all Nagios alerts (but the hotline voicemail just goes to the on-call people because it doesn’t support scheduling — this is a bug). The on-call sysadmins are chosen by a simple algorithm: a calendar is consulted to determine when each sysadmin was most recently on call and which sysadmins are scheduled to be unavailable; the two sysadmins who will be available for the entire week and were least recently on-call (whether scheduled or in emergency substitution) will be on call for the following week, and if possible, will swap primary and secondary roles. If someone has an emergency and needs to be removed from on-call, any sysadmin can go into the app and replace them; a cron job running on the Nagios server will send out a confirmatory alert within 15 minutes.
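For anyone who wants to see the selection rule spelled out, here is a minimal Python sketch of it. This is not the code from the Rails app. The `Sysadmin` class, the `last_role` field, and the way the role swap is decided (someone who was primary on their last shift takes secondary this time, if the other pick allows it) are assumptions made for illustration, as is treating a shift as the seven days from one Friday to the next.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Sysadmin:
    name: str
    last_on_call: date   # most recent shift, scheduled or emergency substitute
    last_role: str       # "primary" or "secondary" on that shift (assumed field)
    unavailable: list = field(default_factory=list)  # inclusive (start, end) dates

    def free_for(self, start, end):
        # Free only if no unavailable range overlaps the shift at all.
        return all(hi < start or lo > end for lo, hi in self.unavailable)


def pick_on_call(team, shift_start):
    """Return (primary, secondary) for the shift starting on shift_start."""
    shift_end = shift_start + timedelta(days=7)  # Friday to Friday (assumption)
    candidates = [s for s in team if s.free_for(shift_start, shift_end)]
    if len(candidates) < 2:
        raise ValueError("fewer than two sysadmins are free for the whole week")
    # Least recently on call first; the name is only a deterministic tie-break.
    candidates.sort(key=lambda s: (s.last_on_call, s.name))
    primary, secondary = candidates[:2]
    # Assumed reading of "swap roles if possible": avoid a repeat primary.
    if primary.last_role == "primary" and secondary.last_role != "primary":
        primary, secondary = secondary, primary
    return primary, secondary
```

Anyone with an unavailable range touching the week is excluded outright, which matches the "available for the entire week" rule; partial availability is never considered.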
We’re pretty satisfied with this setup, but it has some serious problems, which are motivating my search for an alternative implementation:
- Our vacation/travel schedules must be manually duplicated in (at least) three different places: our own personal calendars, our group calendar, and the on-call scheduling application. This leads to confusion, and worse, missed shifts, when sysadmins forget to sync up their schedules everywhere. (There’s also a fourth place — the HR application that tracks staff vacation allowances — but we can’t do anything about that, and not all vacations imply unavailable-for-on-call or vice versa.)
- Notification information is also manually duplicated in several places: our personal address books, our group wiki, the on-call scheduling app (it uses this to generate a “currently on call” page that can be IFRAMEd into some of our other internal Web pages), and the home-brew application that Nagios uses to send SMS.
- Sending outgoing SMS messages depends on a single point of failure, a MultiTech CDMA data terminal which is connected by serial port to our Nagios server. It falls over from time to time in strange ways, and occasionally needs to be over-the-air reprovisioned as Verizon Wireless makes changes in their network. And of course it’s limited to traditional 140-byte SMS, and essentially nobody uses these things any more, so if we have problems Verizon is pretty clueless when it comes to helping us fix them. There are also undocumented rate limits, and the timeliness and reliability of SMS delivery (even to other VZW subscribers) are sometimes problematic.
- Our voicemail system doesn’t integrate with this at all. We use American Messaging’s (formerly SkyTel’s) “Universal Master PIN” service, a legacy from the long-ago days when we all carried SkyTel pagers, to provide a secret email address that the voicemail system can use to notify us of an incoming message. However, this depends on the sysadmins to manually update American Messaging when they go on-call, especially after mid-shift substitutions.
- The in-house on-call scheduling app is an ancient, fragile Rails 2 app that I built a decade ago, when we were still using pagers. Pretty much everything about it, other than the on-call selection algorithm, is wrong for today’s world, and it will be a challenge to get it to run on a modern system.
So what do we actually want in a replacement? Calendar-based scheduling is the most important thing. We would like two-way calendar synchronization, both so that the notification system can learn when we are unavailable (and distinguish unavailable-for-on-call from regular out-of-office) to adjust the schedule automatically, and so we can easily see who is on-call in our calendar applications. Obviously we want something that can handle our particular sort of rotation, and can deal with multiple users being notified simultaneously (not an escalation schedule) during business hours. It would be really great to be able to properly implement our primary/secondary thing with a real escalation ladder, and it would be nice to have configurable always-notify filters (so that, for example, our postmaster can be notified about mail system outages even when he’s not officially on call). Obviously, we want to get rid of American Messaging, and get rid of the CDMA data terminal and all of the cruft that goes along with hooking it into Nagios, not to mention the crufty old on-call scheduling app. (We are OK with a cloud-based app, in a way that we were not when the current system was implemented, because we now have the ability (through an off-campus data center) to back up our alerting system in a way that does not depend on the network in our building being functional.)
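To make the routing part of that wish list concrete, here is a rough Python sketch of who should receive a given alert. It is only an illustration: the function names, the service labels, and the Monday-to-Friday 9-to-5 definition of business hours are all assumptions, not anything our current system or any vendor actually implements.

```python
from datetime import datetime

# Assumed business hours: Monday to Friday, 09:00 up to (not including) 17:00.
BUSINESS_START, BUSINESS_END = 9, 17


def is_business_hours(when):
    return when.weekday() < 5 and BUSINESS_START <= when.hour < BUSINESS_END


def recipients(service, when, team, on_call, always_notify):
    """Return the set of people to notify about an alert for `service`.

    team          -- everyone on the sysadmin team
    on_call       -- the current primary and secondary
    always_notify -- hypothetical filter table: service name -> set of people
    """
    # Everyone at once during business hours, only the on-call pair otherwise.
    base = set(team) if is_business_hours(when) else set(on_call)
    # Always-notify filters add people no matter who is on call.
    return base | set(always_notify.get(service, ()))
```

With a filter table like `{"mail": {"carol"}}`, a mail outage at 3 AM on a Saturday would reach the on-call pair plus carol, while a storage alert at the same hour would reach only the on-call pair.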
I haven’t yet found anything that can do these things. PagerDuty clearly can’t, at least based on their public documentation: it can’t even cope with sending notifications to two people at the same time. Does anyone out there know of a service that would meet our needs, out of the box, without a huge investment in (probably outsourced) developer time? If so, please leave comments below.
The just-in-time scheduling would drive me totally nutty! There’s a big quality of life improvement if you have 6-12 weeks of schedule lookahead, so we (app delivery, IS&T, but the other half of ops does it this way too) set up a plain rotation for a few months at a time. If it conflicts with your vacation, you’re free to ask around for a swap (and it’s rare to get a refusal to swap). But, having the default laid out lets me plan a weekend outing without having to black out the time.
We did have budget for PagerDuty, and it’s pretty life changing; in particular, something that flaps and comes back up (without intervention) in the night can be much more gracefully ack’d via the app, and you don’t lose the fact that an un-ack’d problem escalates. Having the escalations go via different pathways has saved my bacon more than once.
PagerDuty can and does notify multiple people at the same time for us, or at least some alerts notify two different teams (but I didn’t do the integration of it, so I don’t have the details handy).
Our current system does have a calendar view, which shows who is scheduled (barring changes to availability) to be on call for the next three months. As far as multiple notifications go, I’ve heard from them “oh yeah we can do that”, but what they don’t seem to be able to do (based on their online documentation that I was pointed at) is notify multiple people from the same schedule: they can only notify multiple people if they’re on different schedules, and our team isn’t big enough to have more than one schedule, unless we had two schedules with the same people (and therefore two different places to manage, enter time off, etc.).
Guessing at what problem you’re trying to solve here, but an IM integration for the notifications might serve much better. We used to put two people on call at once for training purposes, but lately it seems to work better simply to set a social expectation that the person on call will escalate very briskly.
I was unhappy with it when two teams had a big overlap in notifications; it always felt like a lot of moral hazard and extra decision making to have split responsibility.
Well, as I suggested in the original article, what we’d really like is actually fire brigade-style escalation: if one person (the “primary on-call” responder) doesn’t ack the issue within a fairly short time frame, another person (“secondary on call”) from the same team will get notified. Then a longer delay, and the whole team is notified. The problem we’re trying to solve is simply that sometimes people lose track of their phones, forget that they put them on silent, run out of battery, are unable to respond because showering, exercising, etc., and need some sort of backup short of “the world is on fire, call in the National Guard!” level of escalation. As it happens, our existing system (without escalation) has worked reasonably well, mostly because team members have different levels of expertise and different wake/sleep schedules, but I’d rather not depend on that. Having true escalation would make the on-call rotation less burdensome, since only one person would usually get annoyed in the middle of the night.
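Roughly, the ladder I have in mind looks like the Python sketch below. The two delays are placeholders I made up (the text above only says "fairly short" and "longer"), and the function name and arguments are hypothetical; this is not how PagerDuty or our Nagios setup works.

```python
from datetime import timedelta


def escalation_targets(elapsed, acked, primary, secondary, team,
                       secondary_after=timedelta(minutes=10),
                       team_after=timedelta(minutes=40)):
    """Return everyone who should have been notified `elapsed` after an alert.

    Both delays are illustrative guesses, not values from any real system.
    """
    if acked:
        return []  # an acknowledged alert stops climbing the ladder
    targets = [primary]
    if elapsed >= secondary_after:
        targets.append(secondary)
    if elapsed >= team_after:
        # Finally the rest of the team, skipping anyone already notified.
        targets += [member for member in team if member not in targets]
    return targets
```

The point of the design is that most alerts would stop at the first rung, so only one person usually gets woken up, but a dead phone battery no longer means nobody responds.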
So app-delivery has a rotation of 5-6 people for primary (manager doesn’t take primary, new guy isn’t in the rotation yet) and an escalation schedule that includes the manager (always) and a “secondary” who is simply another team member. (And no, there’s no separate schedule that’s just the manager. PagerDuty can alert to any set of schedules, plus any set of people. I think for “alert to A whenever it’s 9-5” you would need to write a schedule for that, though.)
There’s some social pressure when it comes to preventing something from escalating, and an expectation that for every “pager event” we’ll open a case and send email – so it’s irritating to go into a negotiation over every page for “do you have the tracking ticket or do I” which puts pressure on the idea that one person is on call at a time. But, automatic escalations are a very good idea, and they also keep people honest, as far as ignoring things goes.
We used to play pager-chicken, BITD. Basically, you signed up for some reasonable number of weeks that were convenient for you, and if your week was coming to an end, and nobody was signed up to take the next one, you either kept it, or you would administer guilt trips to people on the team who had signed up for fewer weeks of on-call. That was also terrible.
I would recommend throwing out the existing process, and just looking for a schedule and method that’s fair and keeps people happy. Your list of requirements reeks of “because we’ve always done it this way” and not actual business requirements.
I’d be happy to do it differently, if “differently” actually solved the problem that we do have, which is that there are too many places people have to keep in sync with their schedules. We too frequently have an issue in the current system where the rotation happens and all of a sudden it’s “oh no, I put my vacation on the group calendar but not on the oncall calendar, who can take my shift?!” at 1 PM on a Friday. And sometimes you find someone, but sometimes you don’t (because, e.g., the only person available is someone who doesn’t normally work Fridays and isn’t following work email closely). What I really want is to just have one calendar — and if the people feel that the oncall rotation is burdening them unfairly, that’s on management to take into account when granting leave.
The dependency between the vacation calendar and the on-call calendar being worked out in real time is the bug here. The two are better kept separate, and you reconcile them (and update your personal calendar, or not) when you make changes (e.g., plan vacation).
Having to poll the group on-call calendar for changes, rather than updating your references when you plan vacation (or someone asks you for a swap) is not a normal way to run things. So yes, you have too many calendars, but most people solve this by making the calendars more predictable, and thus lower maintenance.