Hi folks, it’s time for some work- and computing-related stuff. I just recently got back from the 2017 edition of the Usenix Association’s annual system administration conference, LISA’17, which was held in San Francisco. I’ve gone to most LISA conferences since 1998 (when it was in Boston), but this is the first time I can recall it actually being held in The City, even among the ones I didn’t attend. It was very expensive: membership discounts for registration have been withdrawn, the conference hotel charged a whopping $239 a night (at the discounted conference rate), and then there was the cost of airfare; my total cost (paid by my employer, thankfully) will be around $3,000 — something I could not possibly afford if I had to pay my own way — and that’s without taking any tutorials. That total only includes a couple of meals, because one major problem with this conference is the excess of “free” food, most of which is quite unhealthy, piled on buffets at every break, meal-time, and vendor BoF (Birds of a Feather) event. I gained five pounds in one week, and because of the time shift I found it nearly impossible to use the hotel gym all week. (I did spend nearly all of Tuesday walking around San Francisco, which helped a bit, except that I also went to some of the city’s famous bakeries and chocolatiers — so the conference isn’t entirely to blame for my weight gain.) Most of this excess food is paid for by the vendors, but I’d really have preferred that they find something else for the vendors to sponsor and limit “refreshment” breaks to nothing more than coffee and soda — especially in a city like San Francisco, where getting out of the hotel at mealtimes would have been much more rewarding than in many of the other venue cities.
I’ve seen a number of people, including invited speaker Tanya Reilly and tutorial instructor Tom Limoncelli, post their comments about the program, so I’m going to do the same. In general, my impression is a bit more negative than theirs, and I was left wondering if we actually attended the same conference. However, I’d also note that this conference has become more and more dominated by corporate IT and especially Web startups, whose organizations, problems, and space of feasible solutions (generally starting with “throw money and/or developers at it”) are nothing at all like mine. I’m seriously considering not attending the next LISA in Nashville, given the lack of take-home value this time around, whereas at the last few LISAs I’ve had difficulty deciding between two or three great sessions in nearly every time slot. I said as much in the official after-the-conference survey, but I’m honestly not sure how much the program committee cares at this point, or whether they even see R&E shops like ours as being within the conference’s target audience. (Hey, I’m not saying that scalability isn’t cool — but nothing that I do will ever scale higher than n=4.) The weakness of the program was a surprise to me, given that I volunteered for the Content Recruitment Team and actually had a chance to double-blind review many of the submissions; after that process I was quite excited about the program, and I didn’t bother to review what actually made it in before registering.
So anyway, let’s go through the program session by session. The opening plenary was, unusually, divided into two 45-minute slots, and I unfortunately had to race to the bathroom shortly into the first slot so I didn’t really get to see either speaker. Wednesday’s second session was the only one of the entire conference where I seriously had trouble deciding which track to attend; I ended up going to the Mini-Tutorial “Automating System Data Analysis Using R”, taught by Robert Ballance, who covered the same ground in more detail in a half-day tutorial on Monday that I didn’t attend. I felt that the compressed 90-minute format was not a good match for this material; I’ve done some elementary data analysis in R already, and it took well into the second half before he really got into the things that would actually be useful for me at work. This is the sort of thing where I suspect an interactive “lab” or “workshop” format would be much better, with a “bring your own data” element that could actually be more helpful than the synthetic datasets used to present this tutorial. I’ll have to remember to check out the materials for the full-length tutorial to see if there are techniques or packages in there that he didn’t have time for in the mini-tutorial. (Apparently I missed a great talk by attending this tutorial: the commentary on Matt Provost’s “Never Events” talk makes me think I should have gone — waiting for the video to be posted.)
The first conference lunch was held inside the vendor expo, and is effectively paid for by the vendors as an inducement to get the conference attendees to stop by their booths. I noted the absence of a number of vendors this time around: publishers like No Starch and O’Reilly, service providers like PagerDuty and DataDog, and major hardware and software vendors like Dell and Splunk were not to be found. Of the 40 total exhibitors, a quarter were there solely for recruiting purposes, and nearly as many were non-profit organizations there either to raise awareness or (in the case of Princeton University) to recruit sysadmins to participate in a study.
After lunch on Wednesday I stuck with the “Talks II” track for the rest of the day. I thought Silvia Botros’s talk “Working with DBAs in a DevOps World” was interesting enough despite having no take-home value for me (I’m actually the closest thing we have to a DBA, at least insofar as I wrote a number of internal applications that use a database and have strong opinions about how it should be done). The next talk, “Queueing Theory Practice: Performance Modeling for the Working Engineer”, was also interesting without being especially useful, although the most counterintuitive theoretical result that Eben Freeman introduced was one that I already knew (tail latency goes to hell when utilization goes over 80% in a memoryless single-server model with random arrivals). There was some other good stuff in the talk, about balancing coordination overheads against parallelism. The third talk in the session, Stella Cotton on “Distributed Tracing: From Theory to Practice”, had no plausible applicability to anything I do and I tuned out fairly quickly.
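For anyone who hasn’t seen the queueing math Freeman was drawing on, the blow-up is easy to demonstrate. This little sketch is mine, not from his talk; it uses the textbook M/M/1 mean-latency formula W = 1/(μ − λ), and in that idealized model the tail is just a constant multiple of the mean (the p99 is roughly 4.6 times it), so the mean already tells the story:

```python
# Textbook M/M/1 illustration (my numbers, not from the talk): mean time in system
# W = 1 / (mu - lambda), which explodes as utilization approaches 1.
service_rate = 100.0  # requests/second a single server can handle (assumed)

for utilization in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    arrival_rate = utilization * service_rate
    mean_latency_ms = 1000.0 / (service_rate - arrival_rate)
    print(f"utilization {utilization:.2f}: mean latency {mean_latency_ms:7.1f} ms")
```

With a 10 ms service time, the mean latency is 20 ms at 50% utilization, 50 ms at 80%, and a full second at 99%; that knee around 80% is the result I already knew, but it bears repeating.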
The second PM session on Wednesday was divided into two talks. (This confused me a bit: why were some sessions three half-hour talks and some two 45-minute talks? I found in general that the half-hour talks were too compressed: the speakers spent too much time on the motivation and not nearly enough time on the actual results or engineering they were supposed to be describing, and didn’t leave any time for the Q&A that might have brought out more interesting applications.) I sat through Daniel Barker’s “Becoming a Plumber: Building Deployment Pipelines”, but found it uninteresting and remember little of it. Then Tanya Reilly came up and gave one of the three best talks of the whole conference, “Have You Tried Turning It Off and Turning It On Again?” — which was about engineering services to survive a disaster like a power outage that takes down a whole data center. She pointed out that in a “microservices” world, our “technology stack” can easily degenerate into a “technology pile” unless careful attention is paid to avoiding circular dependencies — especially non-obvious multi-node cycles in the dependency graph. This talk really spoke to me because a big part of my responsibilities at work is specifically maintaining those services that have to be up and working before anyone else’s stuff can run — network, time, authentication, directory, database, and other services that the rest of the infrastructure needs in order to start up or to be managed by other members of my team. As one of the few people who has been around for every facility power outage going back to 1997, it falls on me in particular to worry about this dependency graph, and about what happens when we (for example) virtualize services that might be required to boot the virtualization environment. Her talk also reminded me of a Graydon Saunders blog post from last year which, it turns out, was more detailed in my memory than in its actual text. (tl;dr: Given modern global supply chains, how many people does it actually take for the global economy to function? Saunders guesses at least a billion.)
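Reilly’s “technology pile” warning is something you can actually check mechanically, provided you keep a map of which service depends on which. Here is a minimal sketch, with a made-up set of services (nothing from her talk), that walks the dependency graph depth-first and reports the first cycle it finds:

```python
# Toy cycle finder for a service dependency graph (hypothetical services, not from the talk).
deps = {
    "vm-cluster": ["storage", "dns"],
    "dns": ["ntp"],
    "ntp": [],
    "storage": ["auth"],
    "auth": ["dns", "vm-cluster"],  # auth needs VMs, VMs need storage, storage needs auth
}

def find_cycle(graph):
    visiting, done = set(), set()

    def dfs(node, path):
        if node in visiting:                      # back edge: we've looped
            return path[path.index(node):] + [node]
        if node in done:
            return None
        visiting.add(node)
        for dep in graph.get(node, []):
            cycle = dfs(dep, path + [node])
            if cycle:
                return cycle
        visiting.discard(node)
        done.add(node)
        return None

    for start in graph:
        cycle = dfs(start, [])
        if cycle:
            return cycle
    return None

print(find_cycle(deps))  # e.g. ['vm-cluster', 'storage', 'auth', 'vm-cluster']
```

Of course, the hard part in real life isn’t the graph walk; it’s keeping the dependency map honest, especially for the implicit dependencies nobody bothers to write down.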
After the last session of the day it was time for dinner — in the expo hall, because, well, “free”, and also there really isn’t sufficient time in the schedule to get together with some people and find a restaurant before the beginning of the BoF sessions, especially in a city like San Francisco where the good restaurants are not all in one place and tend to be busy enough that reservations are advisable. The BoF track was pretty uniformly disappointing this year, with nearly all of the potentially interesting BoFs scheduled for the same two time slots, Wednesday at 7 PM and Thursday at 8. In addition to the OpenZFS BoF, which I attended, there were two different monitoring BoFs — we got kicked out of the room just as the first one was getting past the usual introductions, so I went to the second one on Thursday evening as well. I honestly think having fewer and smaller rooms would have been an improvement, forcing the BoF organizers to spread their slots out more. Of course, there is also the usual problem of the “vendor BoFs” — with more free food and alcohol — sucking the life out of the actual shared-interest BoFs.
Thursday’s plenary was moved to the evening slot, rather than the usual first-thing-in-the-morning schedule. So I started out the morning in a talk by Nina Schiff of Facebook about “Disaggregating the Network”, which illustrated why things that work at Facebook are not really practical for the rest of us (see Corey Quinn’s talk below). It’s a nice idea, to commoditize top-of-rack switching in the data center and replace proprietary switch operating systems with the same Linux stack and configuration management used on the servers in those racks, but it’s not a practical exercise for those of us who don’t build a whole new data center to roll out a new service. After that talk I moved into the other session to learn about “Charliecloud: Unprivileged Containers for User-Defined Software Stacks in HPC”, because our environment, though not an HPC cluster at the scale of Los Alamos, shares a lot of the same use model and many of the same software release management problems as HPC centers have — except that in our case, nearly all of that work is done by graduate students who should be doing something else. Containers in general are supposed to abstract away a lot of those problems by allowing multiple independent (and immutable) software stacks to execute on the same machine. There’s still a pretty long way to go before our environment — with lots of GPUs requiring matching kernel drivers, not to mention storage on AFS — can really take advantage of this.
In the second morning session, I went to Trevor Vaughan’s talk “Operational Compliance: From Requirements to Reality”, and I have it marked in my program as one that I thought was good, but even after reviewing the slides posted online I’m not sure why. This was followed by two people from NEC Labs presenting a rehash of a 2016 paper (from some other venue) about a tool they developed (and which is not available outside NEC) to automatically cluster log messages and use the results to generate log-parsing patterns that will maximize the amount of useful data extracted under conditions of limited processing power per log message. The tool actually generates multiple sets of parsing patterns which vary in their coverage of the input messages; the user must then choose their desired CPU vs. completeness trade-off. The third half-hour talk in this session was Dan O’Boyle of Stack Overflow explaining why you should give all your cryptographic keys to Google or Amazon rather than allowing your operations staff to touch them. (I suppose that way you can claim you didn’t know that the government was reading all your communications because the National Security Letter wasn’t addressed to you.)
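Since the NEC tool itself isn’t something you can download, it’s worth noting that the underlying idea of collapsing log lines into templates can be demonstrated crudely in a few lines. This is only a toy version of the general technique (mask the obviously variable fields, then group by what’s left), not their clustering algorithm:

```python
import re
from collections import Counter

# Much-simplified log templating: mask obvious variable fields, then count templates.
# This shows the general idea only; the NEC tool's clustering is far more sophisticated.
def template(line):
    line = re.sub(r"\b\d{1,3}(\.\d{1,3}){3}\b", "<IP>", line)  # IPv4 addresses
    line = re.sub(r"\b\d+\b", "<NUM>", line)                   # bare numbers
    return line

logs = [
    "sshd[4121]: Failed password for invalid user admin from 185.2.3.4 port 52111 ssh2",
    "sshd[4209]: Failed password for invalid user admin from 91.7.8.9 port 49822 ssh2",
    "kernel: eth0: link up, 1000 Mbps",
]

counts = Counter(template(line) for line in logs)
for tmpl, n in counts.most_common():
    print(n, tmpl)
```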
Thursday’s lunch was again “free” in the vendor expo. After lunch, I went to Chris McEniry’s mini-tutorial on “The Ins-and-Outs of Networking in the Big Three Clouds”, which was an overview of how client networks work in AWS, Google Cloud, and Azure, with an emphasis on translations between the names used and capabilities implemented by each provider. (Most important lesson: don’t expect broadcast or multicast protocols for things like service discovery or mastership elections to work in cloud providers!) At the 3:30 PM break, I bailed completely, and did not attend the afternoon plenary (a panel on “Attracting and Retaining a Diverse Workforce”), although those who did thought it was both good and too short. Instead, I took the F-Market streetcar all the way to the “wharf” end, in a tacky, touristy area near the terminus of the Powell-Hyde cable-car line, Ghirardelli Square, and Fisherman’s Wharf. After seeing the long lines waiting to ride the cable car, I chose to walk back to the hotel instead — but following the steep hills of the Powell-Mason and California St. cable cars rather than the flat Embarcadero route of the streetcars. I returned in time for the conference reception (more “free” food that I shouldn’t have eaten), which was in the main atrium lobby area of the hotel this year. (In the distant past, they would rent a museum or some other interesting venue, but in recent years it’s been confined to a hotel ballroom, so I can’t really claim to be disappointed.) The reception was followed by more BoF sessions, of which I attended the second monitoring BoF, which went over its scheduled time by a bit, and the “DevOps Poetry Slam”.
I should say a bit more about monitoring/metrics BoFs: there has been one at every single LISA I’ve ever attended, and it’s quite clear that there is still a great deal of unhappiness with the solutions different organizations have adopted, whether over resource demands, the cost of third-party software and outsourced monitoring services, or the difficulty of building dashboards that actually collect all the business-relevant metrics; in short, there’s still a lot of work to be done. We’re not especially happy with our setup either, but we have exactly zero budget in either money or personnel for the sort of solutions that might have a chance of making us happy. Some day, one of us will take it on, and then when that person leaves it will fall into disrepair.
Friday began with a talk by Maarten Van Horenbeeck, “An Internet of Governments: How Policymakers Became Interested in ‘Cyber’”, which was exactly what it says on the tin. Following that was a talk by Evan Gilman and Doug Barth about work they had done when both were at PagerDuty. The title of their talk was “Clarifying Zero Trust: The Model, the Philosophy, the Ethos”, but I felt like that oversold the content somewhat. It was interesting to see the specific problem that they were trying to solve, how it related to their business requirement to operate across multiple public cloud providers, and their choice to use IPsec and packet filters to enforce security policies rather than the VPN offerings of each cloud provider (which would harm availability by creating single points of failure in each availability zone). I have it noted on my program that I didn’t much care for the presentation, but I’m not sure why. (The Zero Trust model is a very attractive one for us — indeed, many of the fundamental ideas were developed at MIT in the 1980s — but it fails to meet many of our users’ needs or threat models.)
For Friday’s second morning session, I started out with Peter Lega’s talk “DevOps in Regulatory Spaces: It’s only 25% What You Thought It Was”, which was basically about how you convince compliance people in a regulated industry that modern software development methodologies really can address the risks that regulators are most concerned about — in Merck’s case, by integrating the compliance documentation and procedures with the development process, reducing the adversarial relationship between developers (“move fast and break things”) and regulators (“first do no harm”). I then switched sessions to see Corey Quinn’s talk, “‘Don’t You Know Who I Am?!’ The Danger of Celebrity in Tech”, which was excellent and exactly on-point to many of the concerns I’ve had with LISA programs in the past. Quinn made the point that most of the world is not Google, Facebook, Netflix, or Twitter, and that organizations that aren’t anything like those companies should think carefully before adopting the technology or the methodologies that those companies use. Quinn gave the example of a bank IT director watching a talk about Netflix’s “Simian Army” and wanting to take that approach back home — the methodology that’s appropriate for a company that doesn’t actually do anything important (sorry, Netflixers) may not be something you want people doing when people’s lives (or money) are at stake. Quinn also lit into the other side of the problem: people who work for those companies and use their resumes to shut down discussion of techniques that actually would be appropriate for someone who isn’t working at one of the top five Web origins. I strongly urge everyone to watch this talk when the video is released by Usenix in the coming weeks. (Yes, I got up and asked a question.)
After yet another “free” lunch, the final regular session of talks started with Ben Hartshorne of Honeycomb.io talking about various techniques for sampling application trace data, with a particular emphasis on using business requirements to determine appropriate sampling rates for different events (e.g., sampling important clients at a higher rate, but sampling successful transactions at a lower rate than errors). I didn’t find this talk especially interesting, but I wanted to get a good seat for David Blank-Edelman’s talk, “Where’s the Kaboom? There Was Supposed to Be an Earth-Shattering Kaboom!” I didn’t read the abstract of his talk ahead of time, so I honestly had no idea what it was about — it turned out to be about what lessons we can take from the demolition industry. David delivered the talk with his typical flair, but I did think it was less interesting (and entertaining) than several other talks on similar themes he has done at past LISAs, some of which rated a plenary slot.
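Going back to Hartshorne’s sampling talk for a moment: the general idea is simple enough to sketch. Here’s a toy version with my own made-up rates and field names (not Honeycomb’s implementation): keep every error, keep a larger fraction of events from customers you care about, discard most routine successes, and record the sample rate on each kept event so counts can be re-weighted later.

```python
import random

# Illustrative per-event sample rates (made up; not from the talk).
# "Sample rate N" means keep roughly 1 in N events, and record N so that
# downstream aggregation can multiply counts back up.
def sample_rate(event):
    if event["status"] >= 500:
        return 1        # keep every error
    if event["customer_tier"] == "enterprise":
        return 10       # keep 1 in 10 events for important customers
    return 100          # keep 1 in 100 routine successes

def maybe_keep(event):
    rate = sample_rate(event)
    if random.randrange(rate) == 0:
        event["sample_rate"] = rate   # remember the weight for later aggregation
        return event
    return None

kept = maybe_keep({"status": 200, "customer_tier": "free", "duration_ms": 12})
print(kept)  # usually None; about 1% of the time, the event tagged with sample_rate=100
```

The reason for stamping the rate onto each kept event is that aggregate counts can then be reconstructed after the fact, rather than being silently skewed by whatever sampling policy happened to be in force.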
After another refreshment break (finally, the “free” food was bags of mass-produced snack foods, which I had no trouble resisting), the conference closed with a plenary address by Jon Kuroda of UC Berkeley. Jon avers that this exact talk was rejected in 2013, but he decided to resubmit it this time around, and the PC was so excited by it that they asked him to speak for 90 minutes rather than just 45. He went into the history of several modern engineering fields, including space flight, nuclear energy, and (the specific focus of this talk) commercial aviation: along with computing, all date to the early post-WW2 period, but unlike computing, all have developed strong protocols for reasoning about and ultimately ensuring safety. (Medical technology and pharmaceuticals, too, date from this period or just slightly later, and have very strong safety cultures now.) Jon went through a number of well-known commercial aircraft accidents, and identified how the operator’s corporate safety culture made these incidents either better or worse than they might have been, exploring what lessons we should bring into our industry as computing systems are more and more involved in making decisions that can cause serious individual harm.
I returned home on Saturday morning, but it’s taken me until now to actually write this. Later today, I’ll be off to Lake Placid to watch some athletic young people in skinsuits hurtle down a mountain face-first while balanced atop a tea tray — which is how I had enough time this evening/morning to finally write this summary.