Saturday, June 2, 2018

The Oncall Conundrum, and thoughts on the heroic last-minute push

I have never met a software engineer who liked being oncall. It makes sense: we're used to flexible hours and working conditions—we don't show up for a shift, aren't usually tied to a location—and the most glorified and interesting parts of the job typically happen in deep flow, while building things. Being oncall violates all these terms. You are on the hook for a fixed period, you can't go off into the woods and your plans may be interrupted at any time, and when shit happens, you're stressed and maybe sleep-deprived and you're not building—you're debugging and doing damage control. Your actions feel like short-term hacks, and you know you're going to have to actually clean things up later. Add on top of this: when an incident happens at night, often you are alone, you feel bad for waking other people up, and you may end up hunting around in systems you don't really understand, seeking out outdated half-done documentation. It's a special loneliness.

Some of the worst oncall situations I've been in occurred on the ads team at Pinterest. I described it to some people as feeling like being thrown in a meat grinder: I knew I'd get chewed up by the machines. Of course, the most memorable experience was caused by my own work, which left a painful but deeply ingrained lesson. (Long story short: I had to work around a flaw in my new metrics workflows by staying up all night, refreshing pages and clicking buttons every 20-30 minutes. Otherwise advertisers wouldn't get their metrics, which sounds relatively benign, but it is your whole point of view when you are working on ads data.) But when the fires are due to systems you didn't build, especially large ones that will take a long time to fix, that can cause a lot of frustration and loss of morale.

Perhaps this is why, when Li Fan first joined Pinterest as its head of engineering and the company held an eng all-hands, I remember only one question from the Q&A. Someone stood up and said: I think being oncall is one of the most under-appreciated tasks. You may stay up all night keeping the site up, yet this stuff gets little visibility and recognition. What are you going to do about this? 

I'm going to guess that fundamentally not much has changed, and not because of her or Pinterest. It has taken me the past five years of experience to realize why: the issue is tricky. You want to praise the motivation and raw effort that goes into great firefighting, but you also want to recognize firefighting as a failure case: "Thanks for doing this work, but this work should ideally not exist." You don't want to create perverse incentives by explicitly rewarding firefighting; otherwise people will be incentivized to set fires or ignore preventative measures. On the other hand, the human effort and individual sacrifice involved in firefighting is inordinate compared to other parts of the job. Unlike literal firefighters, software engineers who are oncall do not consider firefighting their primary responsibility. It's usually the activity that takes away from the parts of the job that they are hired and recognized for. Thus the conundrum.

The best thing to do, of course, is to decrease the amount of firefighting needed. One thing I've never seen tracked at any company I've worked at is the number of hours spent firefighting. At least, as an oncall, I have never reported it or been asked. Measuring has to be the first step. Then surfacing it, e.g. in managers' team health reports.

But no matter what, something will always happen. One thing that could improve how oncalls feel is to give explicit time off for people who do end up fighting fires at odd hours: i.e. something like, "if you fought a fire last night, don't come in the next day." This seems obvious, but I have never heard this policy articulated at any company I've worked at. I think it should be time off immediately after the event, so the intent is clear: to help people recover (and so people wouldn't be motivated by getting extra flex vacation days). One concern is then: but what if I have commitments the next day and have to come in anyway? One way to mitigate this is to plan ahead of time: don't have people who have time-sensitive commitments be oncall close to their deadlines. (I find that when I think about people things, often I end up with some version of "talk about it" and "do better planning.") 

There are bound to be places where the constraints are too hard and the above approaches can't work, like if there are deadlines and fires all the time. If that's you... you may work at a startup.


I now work at a startup where the flavor of oncall issues is currently less fire-this-moment, more customer-support oriented (though this is not to say there are no drop-everything fires). They also typically happen during business hours (which does not mean just US or SF business hours). The point being that oncall has actually not really been top-of-mind. At this SaaS company it does feel like part of the job.

I was thinking about the oncall mentality recently because of a pattern I've seen across my whole career, though it's more pronounced in smaller companies: the heroic last-minute push to finish and ship a product. When you're participating in one of these, it's like a longer version of an oncall shift—you are often hammered for several days or longer. In January, there was one week where I worked sixteen hours a day for seven days straight. We stopped working shortly before our demo at 10pm on a Sunday (business hours in another time zone). At the end of one of these, you want to feel good about having put in the hard work. The tough thing is to step back and say, but really, ideally this should not have happened. And on a tiny team, you know it happened in part because of you.

It's actually easier to reflect on the meta-failure if you were part of the team in the sprint: you have more contextual information and the platform to self-criticize. As an observer, it's uncomfortable to say something because it might be perceived as taking away from the hard work that other people put in. Unlike oncall, where even the participants may easily admit that the work would ideally be eliminated entirely, the participants in a product sprint are building something that should be meaningful and that they should want to feel good about. Invariably a discussion about the "how" of product execution is tied up with people's feelings about their contributions.

This is the lens through which I can understand the value of an explicit product manager, or even a project manager, at a small company that might perceive their biggest challenge as having enough hands on deck to build—there is someone whose craft is this meta, which makes it easier to talk about and critique this meta. A software engineer invites a discussion of code style, because it's part of their craft. A product or project manager invites us to talk about the craft of execution. 
