Attention conservation notice: 6,200 words about conferences you didn’t attend and idiosyncratic constraints that make it unlikely I will attend again any time soon. My apologies that this report is so long, but it would take much too long to make it shorter. If you’re just looking for a summary of the actual program content, skip to the third section, no hard feelings.
The week of March 20–24 I attended a professional conference for the first time since the fall of 2019, Usenix’s SREcon Americas 2023. I am writing this report for several different audiences, so depending on which one you are in, you may find parts of it redundant or uninteresting; apologies in advance. I’m going to start with the background (my professional background as well as the tech-industry trends that made this conference), then talk a little about my travel experience and the venue itself, before moving on to a discussion of the conference program.
Background
For those who don’t know me, or who have only ever glanced at my social media, you might not know what I actually do. My day job is at MIT Computer Science and Artificial Intelligence Laboratory, the largest interdisciplinary research laboratory at MIT, with about 110 Principal Investigators (faculty and research scientists) and over 1,000 active members and affiliates. I work in The Infrastructure Group, a team of about a dozen people which is responsible for providing computing, storage, and networking infrastructure to the Lab. This is an unusual situation: most university departments, labs, and centers are simply not large enough to afford a support group of this size, and many have no shared computing platform other than what is provided by the university’s central IT group, if even that. For the last 26 years, I have run the network for CSAIL (and its predecessor, the Laboratory for Computer Science), a stark contrast to people in the tech industry who might expect to spend 26 months at a single employer.
“I run the network” is what I usually say, without elaboration, when asked what I do. That isn’t particularly complete, but actually explaining what that means isn’t usually required or wanted; years ago, I found that many people had a misapprehension that I was involved somehow in running Microsoft Windows servers, but as ordinary people have become more distanced from operating systems and server technologies, that’s less of an issue. What I actually think of as my primary job is literally running the network: specifying and configuring the routers, switches, firewalls, and wireless access points that provide Internet access in CSAIL’s physical building and remote data center site. That additionally means I handle our relationship with the central IT group as far as our connection to campus and to the outside world goes, but it also means I’m responsible for a bunch of servers that provide fundamental network services: DNS servers, DHCP servers, the provisioning database from which they get their data, and the antique artisanally crafted Perl scripts that provide a self-service interface to those services. I also run our network authentication service, our user account management system (of which I am also the principal developer), and one of our three network storage platforms. As the longest-tenured staff member in our group, I also help out with the budgeting and try to serve as a repository of our institutional memory.
For most of the years from 1998 through 2019, I attended Usenix’s premier winter conference, known as “LISA”, an acronym for “Large Installation System Administration” — the origin story says that when asked to define “large” for the first LISA conference back in the late 1980s, the program chair said “at least five computers”. Back then, advanced computers of the sort that Usenix members cared about, running the Unix operating system, were still thin on the ground; a company might have one or two large VAX minicomputers, with a bunch of terminals or a modem pool for remote access, but Digital and other companies like Sun, Hewlett-Packard, Apollo, and even IBM were selling numerous “engineering workstations” that were designed to run Unix and to be connected together in a local-area network, to companies, government agencies, and universities. LISA was started to help develop a vendor-neutral community and good practices around administering networks of many Unix systems. By the late 1980s, with the introduction of the 32-bit Intel 386 processor, it became practical to use regular (but high end) desktop computers to run Unix and perform many of the functions of these much more expensive workstations, and in a few short years, with the appearance of the free-to-use and free-to-copy 386BSD and Linux operating systems, inexpensive PCs came to dominate the Unix workstation and eventually server market, aside from a few application areas like high-performance storage or graphics that required specialized hardware.
This shift enabled the “Dot-Com Boom” of the late 1990s: all of the major web servers were developed on and for Unix systems; the databases and storage systems that were required for the first generation of e-commerce sites ran on Unix; and there was a huge boom in both the number of companies that had big networks of Unix systems and the number of people employed in administering those systems. There was a clear need for better system administration practices, something that would allow administration to scale, especially as new startups like Hotmail demonstrated the feasibility of serving millions of simultaneous users on thousands of small rack-mounted PCs rather than a smaller number of much larger and more expensive computers. (Hotmail at this time was one of the biggest — when I saw their data center, Google was still a small startup — and rather than buying servers in cases, Hotmail mounted bare motherboards directly to sliding metal shelves. Hotmail at the time was the largest public user of the FreeBSD operating system, of which I was then a developer.)
LISA boomed with the Internet boom, expanding to include workshops, a multi-day program of tutorials, and eventually the conference program expanded to five tracks with nearly 2,000 attendees. LISA survived the “Dot-Com Crash” following the first boom, I think mainly thanks to having hotel contracts already signed three years in advance, and became one of the few places where people engineering web services and people operating traditional data centers and engineering computing systems would regularly meet and interact.
Then in 2006, Amazon launched what became known as “cloud computing”: rather than owning their own servers, network equipment, and storage, businesses — especially web sites — could simply rent them from Amazon. This was initially less of an engineering shift, as it later came to be, and more of a finance play: just as a business will often prefer to lease an aircraft or a warehouse or a retail storefront, even though they will have to pay a premium compared to the cash cost of just buying the property, with cloud computing you can rent a server by the hour instead of having to pay for it all up front, and as with those other examples, the business doesn’t have to carry those depreciating assets on its books and doesn’t have to hold reserves for their ultimate replacement — what both Wall Street and Silicon Valley investors consider a more “efficient use of capital”. Amazon would own all the servers, would own the data center, would pay for power and cooling and network connectivity, and would allow you to change how much you use almost instantly to match your actual business needs — they took away the risk of buying more servers, more disks, or a bigger network connection than your business actually required. This was great for start-ups, because they could start out small and buy more service as and when customer revenue demanded. It was also very attractive to many large enterprises, who could outsource a “non-core” part of their business, especially if they were facing a major facility upgrade or relocation, freeing up the capital (and real estate) that would have gone into a new data center to be used for something else.
For technology companies, the calculus was different: they were competing in the marketplace for talented engineers and developers and had to offer very generous pay, benefits, and often equity compensation. There was a business imperative to optimize their productivity, by building systems and practices that would allow them to make changes on the level of an entire data center as easily as (or even more easily than) they could on their own laptop. A number of companies, most notably Google (we’ll get back to them shortly), invested heavily in developing their own services and platforms for internal use to increase developer productivity. Many of these were implemented as “web services”, or what now are mistakenly called “APIs”: services that speak web protocols to exchange small blobs of data representing requests and responses, which don’t care whether the client is a browser, a stand-alone application, an embedded device, or another web service. Still higher-level services were needed to “orchestrate” these arrays of services, to make sure that all of the required services are running, to deploy new instances and terminate old ones when the code gets updated, to increase the processing capacity when things get busy and shut idle servers down when no longer needed.
This resulted in a clear bifurcation of the LISA audience and the LISA program. There were still a good number of us in attendance who represented the old audience of education, enterprise IT, research, and government, but we were vastly outnumbered by the people working either for tech companies or for the operations groups of banks, media companies, and large industrial concerns — all of which had their own large-scale application development groups and were trying to support more and faster development with fewer and fewer operators. I can recall as early as 2010 noticing that the LISA program was less interesting and had less “business value” for me than it had ten years previously — I would not have been invited in 2010 to give the talk I did give in 2005, about CSAIL’s move to our then-new building. I got more value from the social side of the conference (the “hallway track” and the Birds-of-a-Feather sessions) than I did from the actual conference sessions. I still kept on going, because where else was I going to go?
This bifurcation was made particularly obvious by the introduction of a new term, and arguably a new concept, in the system administration literature: “DevOps”. This was the idea, which was quite radical at the time, that application developers — or at least, application development teams — ought to operate their own infrastructure, not just for testing but the actual public web services that users of their web site interact with. This was done by treating the underlying infrastructure that these services run on as yet another component of a software system that could be modified programmatically, making operations more like programming (and as a side effect, deprofessionalizing the actual maintenance and operation of the real underlying physical servers, which was now outsourced). This seems like a reasonable approach (although many developers objected) if you’re stuck in the tech mindset of “more, faster, cheaper”, but for us? We don’t have “developers”! (Well, we have a developer, for internal applications, supporting our thousand-person user base.) Another way of looking at it is that we have eight hundred “developers”, but they’re called “graduate students” and they are working individually on eight hundred individual products called “Ph.D. Theses”. There is no shared objective or profit motive, and unless they crash out, we have to live with them, and they with us, for six to eight years, during which time they would like us to please not rip the guts out from under their research, thank you very much.
Google was in a particularly influential position, as a major sponsor of LISA and Usenix conferences generally, as the employer of a large minority of LISA attendees and invited speakers, and of course as the operator of both a public cloud computing infrastructure and numerous internal platforms that support the vast reach of its services. Internally, Google had been developing a set of organizational structures and practices that came to be called “site reliability engineering” or SRE. This garnered almost as much buzz as DevOps (because if Google is doing it…). In 2014, Usenix launched a new conference, called SREcon, and two years later, O’Reilly Media published Site Reliability Engineering: How Google Runs Production Systems, which I am told has been O’Reilly’s best selling book title for seven years straight. (Looking at O’Reilly’s web site you wouldn’t even know they were a publisher of actual ink-on-dead-tree books; I had to go to Amazon to look up the title!) SREcon was so successful that Usenix ended up running it three times a year, on different continents, something they had never done before. This drew a lot of the audience away from LISA, and with it, talk and paper submissions declined precipitously (the refereed paper track was abolished shortly thereafter), and so did the sponsorship that made LISA financially practical for Usenix as an organization.
I continued to attend LISA, as did several other people I had come to know over the years, because SREcon was clearly not pitched toward our professional needs or line of business. (I often wondered in those years if anyone attended SREcon who didn’t already work for Google.) On occasion, when LISA would be held in Boston — Usenix had a policy of regularly rotating between eastern and western North America — I could use my travel budget to attend another conference, since the airfare and lodging were the most significant expenses, rather than the registration fee.
I last attended LISA in 2019, when it was at the Hyatt Regency San Francisco Embarcadero. That was the last ever in-person LISA conference, as it turned out: LISA 2020 was canceled due to the COVID-19 pandemic, and LISA’21 — shifted to the summer rather than LISA’s traditional late-autumn slot — was a virtual-only event that I didn’t feel the need to watch. After LISA’21 concluded, Usenix decided to end the conference. (A retrospective was published in the Usenix magazine ;login: in August, 2021.) By that time, I had really been attending LISA for the people more than for the technical content, and I saw little reason to expect to travel to any conference again.
So why am I writing this? Why did I think it was worth attending a conference that was very much not targeted at me?
Travel and venue
Some time in mid-February, I was sitting alone in my office, and I had opened up my phone for some reason and randomly looked at Twitter. I stumbled across a tweet from @usenix advertising the SREcon Americas 2023 program. For whatever reason, rather than ignoring it, I clicked on the link and scrolled through the listings. I think I copied and pasted a quote from one of the talk descriptions to our work Slack, because it sounded interesting, and mused about possibly going — since I hadn’t done any work travel for three years it seemed like an opportune moment. Once given the go-ahead, I started looking into the mechanics of travel.
Before the pandemic, I had an MIT Travel Card, so I was not being asked to front the Institute thousands of dollars for conference registration, airfares, and lodging. The program said I had until February 28 to get the “early bird” registration and reserve a room in the conference hotel — but when I went to Concur to try to book a flight, it failed with an odd error message. It turned out that, since I had not traveled in three years, the MIT Travel Office had decided that my (unexpired) card had been lost and canceled it. I would have two weeks to try to get a new travel card, during which time I could reconsider whether I still wanted to go or not (again, I wasn’t going to put it on my own card, for which the bill would need to be paid before I would even be eligible for reimbursement). It took a while to find the right contact address for the Travel Office (during which time I had other more important things on my plate), but they confirmed that the card had been canceled due to inactivity and they’d have to order a new one. (Unbeknownst to me, registrations were running below target and Usenix had extended the “early bird” pricing and hotel block until March 3, but this didn’t enter into my calculus at all.)
SREcon Americas is slotted into the Usenix calendar for the end of March. Normally, I would not consider going to a conference near the end of March, because the World Figure Skating Championships are held every year during the last or next-to-last week of March. This year, however, that event was held in Japan, and I won’t fly trans-pacific, so I unusually had no conflicting travel plans. (The next three years will be in Montreal, here in Boston, and Prague, so there’s no chance I’ll attend SREcon again or do any other business travel in March before 2027 at the earliest.)
One factor that had me reconsidering while waiting for my travel card to be reissued was the venue. This conference was being held at the Hyatt Regency Santa Clara and the adjacent Santa Clara Convention Center, a facility that I remembered from having previously attended FAST there. It is a dreadful location, at the intersection of two traffic sewers in the middle of low-rise Silicon Valley sprawl. The principal value in having a conference there, at least in the Before Times, is that more than half of the attendees would be driving there anyway. Transit access is about what you’d expect from Santa Clara, and although the VTA Light Rail has a stop within walking distance of the hotel, it doesn’t go to either airport, and in fact there is absolutely nothing of any value within walking distance of the hotel. (When I was there for FAST, I had to walk half a mile to the nearest sandwich shop for lunch, but for SREcon the sponsors paid for free lunch every day.) On my trip to FAST, I had flown into SFO and taken a shuttle van ride down to Santa Clara, but this time I did not want to spend so much time sitting in traffic, so I decided to stick to San Jose flights instead.
My replacement travel card arrived on February 28th, so after trudging through the rain to pick it up at the Travel Office, I set to booking. I first hit the conference registration, still not noticing that the deadline had been extended (I could have saved a few dollars if I had noticed, because there was a discount available for educational institutions had I been willing to wait for an email round-trip). I booked the (extremely expensive) hotel next, and then opened Concur to retry the flight search. Searching by schedule, all of the options looked terrible and expensive, but before I booked the least-bad expensive flight option, Concur asked me if I had considered this other itinerary that was only half the cost. It was actually better by most standards than anything I had seen, except that on the return I would have to get up at 6 AM and endure a four-hour layover in Denver. (This was still better because it didn’t involve flying hundreds of miles out of the way to make a connection in Los Angeles, Seattle, or Houston! I’m pretty sure that there is at least one direct round trip between Boston and San Jose but I didn’t find it — maybe the return flight is a red-eye?)
I ended up flying on United, as you can tell from the Denver connection, except for the outbound DEN–SJC sector which was on a United Express RJ. (I think a flying time of over two hours on a tiny CRJ700 with a single lav and inadequate overhead space is a bit much.) MIT’s policy with respect to getting a tolerable seat, or indeed to getting a seat that comes with a full carry-on allowance on United, is unfortunately unclear and apparently left up to individual departments to figure out for themselves. As a result, I ended up spending $250 out-of-pocket to get better seats and priority boarding, which I have no idea whether CSAIL will reimburse. The equipment on the other sectors was B737-800 outbound, and A320 then B737-900 on the return; only the initial BOS–DEN leg had any in-flight entertainment, leaving me to greatly miss the moving map on the other three legs.
I’ll mention at this point that starting in late February and continuing to the present, I have been suffering from some sort of condition — by the time you read this I’ll have finally seen the orthopedist, who was scheduling three weeks out — that makes sitting in a chair for any length of time quite painful, especially in the morning: not exactly the ideal conditions to fly for six hours, then sit in a chair at a conference venue for eight hours a day, then fly back home for another six hours (plus four hours in Denver airport). Of course, there’s also still a pandemic going on, but I saw very little evidence that anyone around me was paying attention: I was the only person in sight wearing a mask, whether in the airport, on the plane, or at the hotel. (I’ll give Usenix credit for including information about the hotel’s air filtration in the program, but aside from eating and drinking I kept my mask on whenever in a public space for the duration of the trip.)
Conference program
I told several people that I was justifying my travel as “professional development”: even if nothing at SREcon was remotely connected to what I actually do for my employer, it would help me keep up to date with what is going on in the tech industry, because many trends (like cloud storage, infrastructure-as-code, continuous integration/deployment, containers, “serverless”, and “microservices”) that start out there ultimately become something we have to deal with (or work around).
The conference program was fairly neatly divided into two tracks, numbered rather than named: track 1 was more “social” and track 2 was more “technical”, and I spent most of my time in track 1. The opening plenaries set the theme (historically you’d call them “keynotes” for that reason, but “keynote” in tech conferences now seems to mean some Big Name Speaker With Something To Sell Who Nonetheless Got Paid To Speak); the phrase “sociotechnical systems” was repeated frequently throughout the program, emphasizing that this conference was not just about software but also about the organizational structures that produce it and make it operational as A Thing You Can Pay For (and that’s as reliable as your Internet connection). I had to learn a bit of new jargon to understand some of the presenters, like “toil” (which is apparently what they call “actually typing stuff in a shell” now), “SLO” (“service level objective”), and “LFI” (an initialism for “learning from incidents”). Track 1 was in fact quite heavy on incident response, incident analysis, and retrospectives, which was more interesting than I expected it to be, although still short on take-home value given the size of our organization.
After the plenaries and a mid-morning snack (I had gotten up too late for the free breakfast and needed to take my medication with food), the first session I attended was about a content-delivery-network failure at Facebook, caused by a testing scenario. In this case, the CDN was ringing with requests for images, as image encoding servers were marked down, their backup servers became overloaded, the primary servers recovered and were marked up, the backup servers failed their health checks, and so on. The solution the engineers finally found was to tell their load balancers to pretend that all servers were healthy — sure, some servers were overloaded and wouldn’t be able to answer requests, but this would allow the load to be spread out evenly across the network, stopping the oscillation, and thereby becoming true.
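For readers who want to see the oscillation, here is a toy sketch of the feedback loop described above. It is not Facebook's system: the two pool names, the load and capacity numbers, and the one-tick health-check delay are all assumptions made up for illustration.

```python
def simulate(load, capacity, ticks, trust_health_checks):
    """Return the load sent to (primary, backup) on each tick of a toy model."""
    # Assumed starting point: a test has just marked the primary pool down.
    healthy = {"primary": False, "backup": True}
    history = []
    for _ in range(ticks):
        if trust_health_checks:
            targets = [name for name, up in healthy.items() if up]
        else:
            # The fix from the talk: treat every pool as healthy.
            targets = list(healthy)
        if not targets:
            # Nothing is marked up, so fall back to using every pool.
            targets = list(healthy)
        share = load / len(targets)
        sent = {name: (share if name in targets else 0.0) for name in healthy}
        history.append((sent["primary"], sent["backup"]))
        # A pool passes its next health check only if it was not overloaded.
        healthy = {name: sent[name] <= capacity for name in healthy}
    return history


flapping = simulate(load=150, capacity=100, ticks=4, trust_health_checks=True)
steady = simulate(load=150, capacity=100, ticks=4, trust_health_checks=False)
```

In this model, honoring the health checks sends the whole load to whichever pool was idle on the previous tick, so the two pools take turns being overloaded. Ignoring the checks splits the load evenly, each pool stays under capacity, and the claim that everything is healthy becomes true.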
I moved over to track 2 for the next talk, because I had a personal interest in the subject: how one Mastodon instance (hachyderm.io) handled the sudden influx of tens of thousands of new users who were leaving Twitter in response to its new Main Character For Life. Hachyderm had been serving a few thousand users on a single server in someone’s basement, but scaling to 30,000 users required significantly more resources — particularly since the all-volunteer operations team was committed to doing as much as possible in public and without depending on proprietary value-added services. The original server was also, as it turned out, failing. And oh, by the way, Mastodon is an inconsistently documented Ruby on Rails monolith. The Hachyderm team wanted to avoid putting all their eggs in one basket as far as both physical location and infrastructure providers go, which led to some interesting decisions, like running in three different clouds, and running NFS over a transatlantic network connection. (As someone who runs NFS over a wide-area network within a single small state, that’s not something I’d recommend! They eventually moved all of the media files from NFS to an object store.) The astonishing thing was that they made this work with all volunteer labor, with an average (per the presenters) of only two hours per volunteer per week.
Lunch each day was in what used to be called the “vendor expo” but is apparently now the “sponsor showcase”. I had never heard of the vast majority of the companies exhibiting; when I mentioned this to Tom Limoncelli when we chatted later, he suggested that the majority of the companies hadn’t even existed three years ago, before the pandemic. I didn’t look closely at most of the booths, but I’m given to understand a large fraction of them were actually selling products to ease incident response; another bunch were selling metrics products, either products to collect more of that data, or services to handle all of that data collection for you. The first day it was pouring rain, so I was glad enough to take advantage of the free lunch rather than walking half a mile to Togo’s for a sandwich.
After lunch on Tuesday, I was back in track 1 with a talk from two Datadog engineers about how, for once, it wasn’t DNS. This was actually one of the more technical talks in track 1, getting into the weeds over how a little oddity in their internal metrics after redeploying a Kubernetes pod led them through DNS down into the Amazon Elastic Network Adapter metrics and eventually to a configuration problem with Google RPC that was causing their DNS resolvers to get flooded. It was a great detective story and I’d recommend going back to watch the video once it’s uploaded.
That talk was followed up with Nick Travaglini’s talk about how a test of a NORAD computer system almost started World War III in 1979 — thankfully averted because some eagle-eyed staffers noticed that the timestamp on the supposed “Soviet attack” was off. This was not original research, in the sense of academic historians — everything in the talk was taken from open media and Congressional reports — but a story well told and unfamiliar even to many of us who lived through it.
The day’s final session focused on the actual practice of incident response. First, two people from jeli.io discussed the difference between being an incident commander and being an incident analyst from the perspective of having performed both jobs. Both deprecated the term “incident commander” — a collocation taken from military-style hierarchies of rank that are the norm in the public-safety field from whence the concept derives — and preferred “incident coordinator”. They noted that actually managing incident response in a large team calls for a very different set of skills compared to analyzing the event post facto, conducting interviews and retrospectives; some people might be better suited for one role over the other, and asking people to do both is a recipe for burnout. This was followed by Chad Todd from CloudStrike, who discussed his Lund University master’s thesis research about effective practices for handovers between incident responders during long-running incidents, expanding on literature that heretofore largely focused on the healthcare industry (e.g., nursing shift changes).
Wednesday morning started with an actual breakfast, to my surprise: the program said “Continental Breakfast”, but when I got down to the meeting hall, I found chafing dishes heaped high with scrambled eggs and roasted new potatoes, as well as crispy bacon and the expected pastries and beverages. I was utterly uninterested in either of the first talks of the day, and I’m not even sure what I actually did. I perked up for the second talk in track 2, although I was misled by the title and failed to read the abstract — “high priority” and “happy queues” primed me to be thinking of overbearing VIPs and ticketing systems, rather than the actual subject, correctly specifying the queues in a job queuing system. (The key, as it turns out, is that the queues should be distinguished not by programmers’ subjective notion of “priority”, but by how soon the job must be completed to meet a customer’s service level expectation.)
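The idea of naming queues for a latency promise rather than a priority can be sketched in a few lines. The queue names and time limits below are hypothetical, not the speaker's, and the selection rule (use the slowest queue that still meets the deadline) is just one plausible reading of the talk.

```python
# Hypothetical queues, each named for the longest a job may wait in it.
QUEUES = [
    ("within_1_minute", 60),
    ("within_10_minutes", 600),
    ("within_1_hour", 3600),
    ("within_1_day", 86400),
]


def choose_queue(max_wait_seconds):
    """Pick the slowest queue whose promise still meets the job's deadline."""
    # Walk from slowest to fastest so urgent queues are not clogged
    # with work that could safely wait longer.
    for name, limit in reversed(QUEUES):
        if limit <= max_wait_seconds:
            return name
    raise ValueError("no queue promises a wait this short")


report_queue = choose_queue(900)  # a job that can tolerate a 15-minute wait
```

The point of the naming is that an alert on "the within_10_minutes queue has jobs older than ten minutes" is unambiguous, whereas "the high-priority queue is slow" is a matter of opinion.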
I bailed entirely on the second session of the day, and spent the time watching the previous day’s figure skating from Japan, then headed back down for free lunch. The first talk, by Lorin Hochstein, is very difficult to summarize — and reviewing the slides didn’t help — but he titled it “Why This Stuff Is Hard”, which gives you a flavor of the level of abstraction it’s pitched at. (It was a good talk for after lunch, talking about big concepts rather than in the weeds, but you’ll need to wait for the video.) After that, I switched back over to track 2 to learn how the Wikimedia Foundation uses Network Error Logging to learn about users’ connectivity problems directly from their web browsers, often in close to real time. It was an interesting idea, and I was quite surprised to learn that many users are able to report connectivity errors when they occur, over the same (presumably unreliable) network. Although an official W3C specification, NEL is only implemented in Chromium-based browsers; Firefox has thus far resisted implementation, rightly considering it to have significant privacy implications.
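For the curious, a site opts in to Network Error Logging by sending two response headers. The sketch below builds them using the field names as I recall them from the W3C NEL and Reporting API drafts; the collector URL, group name, and default values are placeholders, and this is not Wikimedia's actual configuration.

```python
import json


def nel_headers(collector_url, max_age_seconds=86400, failure_fraction=1.0):
    """Build the two response headers that ask a browser to report network errors."""
    group = "network-errors"  # arbitrary label tying the two headers together
    report_to = {
        "group": group,
        "max_age": max_age_seconds,
        "endpoints": [{"url": collector_url}],
    }
    nel = {
        "report_to": group,  # must match the group named in Report-To
        "max_age": max_age_seconds,
        "failure_fraction": failure_fraction,  # share of failures to report
    }
    return {"Report-To": json.dumps(report_to), "NEL": json.dumps(nel)}


headers = nel_headers("https://collector.example/nel")
```

A browser that supports NEL remembers this policy for `max_age` seconds and sends a small JSON report to the collector when a later request to the site fails.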
For Wednesday’s final session, it was more incidents: a short talk about actually doing some statistical analysis on a large corpus of after-incident reports, and a longer talk about practices and skills that incident responders who are not incident commanders (same language caveat) can develop to help their teams work more effectively. This was followed by a conference banquet (again paid for by sponsors) which was loud and confusing: there was apparently Real Food that I did not find until I noped out of the noise and excessive crowding and tried to find the most convenient exit. (I had three small plates of hors d’œuvres, and each time had to sit at a different table because my seat was cleared and taken by someone else before I made it back from the catering stations — but I probably would have been happier just getting a sandwich from the hotel cafe and eating in my room.) I did not attend the Lightning Talks, indeed I did not even see them on the schedule, and in any case if I didn’t finish watching the figure skating that night, NBC would take it away. (Literally, the replays were only available for two days!)
Thursday morning began with an audience-participation talk analogizing incident response to musical improvisation, which I honestly did not pay much attention to as I was busy reading my work mail for nearly the entire talk. This was followed by a talk I can’t make sense of from the published abstract but I remember as being interesting and researchy; you’ll have to wait for the video and judge for yourself.
After the mid-morning coffee break, the second session was again divided into two short talks and one full-length talk. Austin Parker’s “The Revolution Will Not Be Terraformed” was the most direct attack on the deprofessionalization of operations in the entire conference, taking lessons from Kropotkin and Bookchin to encourage the audience to resist commoditization of their field. This was unfortunately followed by a completely useless (but at least short) talk from a bank in Singapore which repeated DevOps and SRE bromides that even I was thoroughly familiar with. The long talk of the session was from a Shopify engineer talking about both failed and successful interventions to reduce cloud resource consumption in the face of budget tightening. The biggest takeaway from this seemed to be that they were wasting a lot of money on fragmentation, the capacity they were paying for but couldn’t use because the resource requirements of their services were not integral divisors of the sizes of their cloud provider’s virtual machine instances. (Yet again this seems like a recapitulation of a long-known issue people would learn about in an undergraduate CS curriculum.)
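The fragmentation arithmetic is easy to check by hand. This sketch makes some simplifying assumptions that are mine, not Shopify's: a single resource such as vCPUs, and only one kind of service packed onto each virtual machine. The sizes are invented.

```python
def stranded_fraction(vm_size, service_size):
    """Fraction of a VM that is paid for but unusable by one kind of service."""
    instances = vm_size // service_size  # whole service instances that fit
    used = instances * service_size
    return (vm_size - used) / vm_size


# Two 6-vCPU instances fit on a 16-vCPU VM, leaving 4 vCPUs stranded.
waste_awkward = stranded_fraction(vm_size=16, service_size=6)
# A 4-vCPU service divides the VM evenly, so nothing is stranded.
waste_even = stranded_fraction(vm_size=16, service_size=4)
```

With these numbers a quarter of every VM is wasted in the first case and none in the second, which is the "integral divisors" point in miniature.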
I don’t remember anything at all about Thursday’s after-lunch session other than that I was there, and the slides aren’t up yet to tickle my memory. I may have bailed on the second talk in the session; neither track’s abstracts sound like something I would have been interested in.
Like the opening plenary session on Tuesday morning, Thursday’s closing plenary was split into two 45-minute talks. In the first talk, two engineers at a bank walked through the process of developing service-level objectives that their management and developer teams would actually believe meant something; it took some trial and error to find metrics that actually correlated well with already accepted high-level measures of customer experience. The final talk of the conference, and probably the talk that most fit the mold of “classic LISA closing plenary”, was “Hell Is Other Platforms”, a riff on Sartre’s No Exit in both title and content. I would highly recommend watching the video once it’s uploaded.
So that’s the formal program. Like other Usenix conferences, SREcon has BoFs — “Birds of a Feather” sessions — where attendees can get together in a meeting room and discuss topics of their own choosing, organized by writing session topics on a blank schedule posted in the meeting hall. There were BoFs on both Tuesday and Wednesday; while the Wednesday topics were uninteresting to me, I attended three sessions on Tuesday after our free dinner in the expo hall. The first session was actually a guy pitching his early-stage startup which proposed to use GPT-3 to analyze terminal session transcripts created by operators and incident responders to help turn them into operations “runbooks”. The language model was invoked after the fact to create a human-readable description of what the command in question was doing, and a human operator could edit these or add more annotations such as specific script fragments that could be executed. The second BoF was about PostgreSQL, a database server of which I operate several instances for both personal and CSAIL systems — but it turned out to be something of a bust when the speaker and I were the only attendees.
My final BoF was organized by Brian Sebby, a sysadmin from Argonne National Lab who, like me, was a long-time LISA attendee and had come to SREcon to see if it had any value for his organization. He titled the session “LISA Refugees”, if I recall correctly — unlike the formal program there’s no digital record of the BoF sessions — and we were joined by fellow LISA regulars Tom Limoncelli (author or coauthor of several notable O’Reilly titles on system administration) and Matthew Barr; later on, we were joined by the Usenix board delegate for SREcon, Laura Nolan. We had a wide-ranging discussion about how the needs of our (Brian’s and mine) communities were not really represented very well, and how the announcement of LISA’s end had promised that Usenix would try to make more programming that was of interest to operations-only organizations. We discussed some of the other conferences we had considered attending, and some of the areas where there was still meaningful overlap. Tom suggested that we ought to consider ourselves as being engaged in the emerging practice of “platform engineering” — so many people at so many companies are now building the technological platforms on which “product” sits that there is a movement to recognize this as a distinct discipline. We also discussed a number of other topics, including how even some big-name web shops are actually pretty “enterprisey” in their service architecture, and how at CSAIL we now have a lot of students doing summer internships in tech companies and being taken aback at the low level of abstraction provided by our long-evolved service offering. We’re not far at all from the “cloud generation” entering grad school and lacking a basic understanding of servers and files, if we haven’t already passed that point. It was a good discussion, even if it came to no particular conclusion.
So that was my trip. If I have any take-home from all of that (6,100 words!), it would be that my MIT colleagues on the academic side — particularly in systems and HCI areas — really should engage more with this community. Will I attend SREcon again? Probably not, if the “Americas” edition continues to be held in March, but if the 2027 Worlds get awarded to Australia, maybe? Assuming we all still have jobs by then, and haven’t been replaced by GPT-42.