Monday, June 18, 2018

Cross-Site Request Forgery in Plain Language


Cross-Site Request Forgery, commonly known as CSRF, is one of the most well-known web security attacks. I first studied it in a college classroom, and since then I've mostly worked at big companies where safety was often baked into the frameworks most software engineers used. This week I brushed off my dusty knowledge, and in the process realized there is a lack of explanations of common attacks in plain language.

I'll start with one issue, CSRF. This isn't meant to be comprehensive, but hopefully it will augment other resources. This post outlines the problem and is meant for a general nontechnical audience.

Browser Background


To understand CSRF, you first need to know a bit about web browsers. Notice that when you log into a website, from then on, the website seems to "know" it's you who is browsing it. You can continue to click from page to page without being asked to log in again and again. How does this work?

This works because of a browser feature called cookies. On a basic level, a cookie is a piece of information that a website sends to your browser when you visit it, along with a request that your browser store that information on your device. You can configure your browser to reject cookies, but by default most browsers accept and store them.

From then on, browsers follow two rules:
  1. Whenever you access a website that sent you cookies, the browser will send back whatever cookies that website stored on your device.
  2. The browser will only send cookies that came from Website A back to Website A; it will never send cookies that came from Website A to Website B.
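The two rules above can be sketched as a toy model. This is purely an illustration for this post, not how any real browser is implemented:

```python
# Toy model of a browser's cookie storage, illustrating the two rules above.

class ToyBrowser:
    def __init__(self):
        # Cookies are stored separately for each website.
        self.cookie_jars = {}  # website -> {cookie name: value}

    def receive_cookie(self, website, name, value):
        # A website sends a cookie; the browser stores it under that site.
        self.cookie_jars.setdefault(website, {})[name] = value

    def cookies_for_request(self, website):
        # Rule 1: send back whatever cookies that website stored.
        # Rule 2: never include cookies from any other website.
        return dict(self.cookie_jars.get(website, {}))

browser = ToyBrowser()
browser.receive_cookie("mybank.com", "session", "abc123")
print(browser.cookies_for_request("mybank.com"))  # {'session': 'abc123'}
print(browser.cookies_for_request("evil.com"))    # {} -- nothing leaks across sites
```

The key point the sketch makes concrete: which cookies get attached depends only on which website the request is going *to*.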

Note: none of the above is magic. It is how browsers are expected to behave. When you use Chrome or Safari, you are trusting that Google or Apple have correctly implemented those rules.

Let's go back to the example of logging into a website. Here's what happens behind the scenes, on a high level:
  • After you send your username and password to the site, the site verifies your username/password, and sends back a set of cookies that essentially says, "This certifies we know you are YOUR-NAME-HERE." Importantly, it's hard to forge these cookies.
  • On each subsequent request to the website, your browser sends back that set of cookies, which stands in place of your username/password to identify you.
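One common way to make such cookies hard to forge is for the site to sign them with a secret key that only the server knows. A rough sketch, with invented names and an invented key, assuming an HMAC-based scheme (real sites vary in the details):

```python
import hashlib
import hmac

SECRET_KEY = b"server-only-secret"  # known only to the website, never sent to browsers

def make_session_cookie(username):
    # The signature plays the role of "This certifies we know you are USERNAME."
    # Only someone holding SECRET_KEY can produce a valid signature.
    sig = hmac.new(SECRET_KEY, username.encode(), hashlib.sha256).hexdigest()
    return f"{username}:{sig}"

def verify_session_cookie(cookie):
    # Recompute the signature and check it matches; reject anything forged.
    username, _, sig = cookie.partition(":")
    expected = hmac.new(SECRET_KEY, username.encode(), hashlib.sha256).hexdigest()
    return username if hmac.compare_digest(sig, expected) else None

cookie = make_session_cookie("alice")
print(verify_session_cookie(cookie))          # alice
print(verify_session_cookie("alice:forged"))  # None
```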

You can imagine it would be pretty bad if an attacker got access to your cookies and could then impersonate you. This is the logic behind Browser Rule #2 above, and underlines the trust you are placing in your browser.

However, the point of CSRF is that an attacker doesn't even need to get your cookies.


CSRF: The What


In CSRF, an attacker tricks you into taking an action, identified as you, without you intending to do that action. To give a physical example in the same spirit: it reminds me of this phone scam where a caller asks you, "Can you hear me?" The attacker records you saying "Yes", and then uses that "Yes" fraudulently as proof you wanted to buy some product you didn't actually agree to. You protest, "But that's not what I meant when I said 'Yes'!" However, you did say 'Yes'—that is your voice.

CSRF has a similar structure, except the thing used to identify you is your cookies, not your voice. Let's assume you are logged into a website we'll call MyBank.com. Thus, your browser has stored the cookies set by MyBank.com that say "This certifies we know you are YOUR-NAME-HERE." Remember that whenever the browser makes a request to MyBank.com, it will always send those cookies that identify you.

So if the attacker wants you to take an action on MyBank.com that you would not knowingly want to do, they need to come up with a way to make you send a request to MyBank.com without you knowing it. A plausible request you would not want to make is "Send $100 to Account-Owned-By-the-Attacker." The question is, how do they convince you to make a request you don't want to make?

Importantly, you don't need to have MyBank.com open in a browser tab in order for your browser to make a request to it. A malicious page can contain code that executes "Make a request to MyBank.com that sends the attacker money." So the short answer to the above question is: the attacker gets you to visit a site, or open an email, that contains malicious code, while you are logged into MyBank.com. They cannot guarantee that when you visit the malicious page, you are logged into MyBank.com at the same time. But if you happen to be, then the attack can succeed.

In the worst case, it is really that simple—you open a harmful website or email, and you've been hit. And most of the time you will not even know, because the harmful website or email is disguised to look harmless.
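To make the mechanics concrete, here is a toy simulation of the attack. All the names are invented, and no real browser or bank works exactly like this:

```python
# Toy simulation: the browser attaches cookies based on the request's
# destination, regardless of which page triggered the request.

cookie_jar = {"mybank.com": {"session": "certifies-you-are-alice"}}

def browser_request(destination, triggered_by):
    # The browser looks only at the destination when attaching cookies.
    cookies = cookie_jar.get(destination, {})
    return {"to": destination, "from_page": triggered_by, "cookies": cookies}

# You visit the attacker's page, whose hidden code fires this request:
forged = browser_request("mybank.com", triggered_by="evil-site.com")
print(forged["cookies"])  # {'session': 'certifies-you-are-alice'}
# The bank receives a request carrying your valid cookies -- it looks like you.
```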

Prevention


As an end-user, you can try not to click on sketchy emails or visit malicious sites, or any sites that might have embedded sketchy code. In practice, this is tough to do, and most of us simply trust that MyBank.com has implemented standard safety checks to prevent this kind of attack from happening on it.

The key behind the protections is this: Remember above, I said "you don't need to have MyBank.com open in your browser in order for your browser to make a request to it." That is true, but a website that is properly protected will only allow requests that were made on the website proper to succeed. It will detect that requests to it made from other locations are not legitimate, and cause those requests to fail. So, if MyBank.com is well-protected, an attacker can still make you visit a malicious website that makes a request to MyBank.com that says "Send $100 to Account-Owned-By-the-Attacker," but MyBank.com will recognize that this is not a request you made from MyBank.com and disregard it.
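One of the several possible protections is a so-called CSRF token: the site embeds a random, secret value in its own pages, and rejects any sensitive request that doesn't echo that value back. An attacker's page cannot read the token, so forged requests fail the check. A rough sketch with invented names, not a production implementation:

```python
import secrets

# The site generates a random token per session and embeds it only in its
# own pages. An attacker's page has no way to read it.
session_tokens = {}

def issue_token(session_id):
    token = secrets.token_hex(16)
    session_tokens[session_id] = token
    return token

def is_legitimate(session_id, submitted_token):
    # A request made from the site's own pages carries the token back;
    # a forged cross-site request does not.
    expected = session_tokens.get(session_id)
    return expected is not None and secrets.compare_digest(expected, submitted_token or "")

token = issue_token("alice-session")
print(is_legitimate("alice-session", token))  # True: came from the site's own page
print(is_legitimate("alice-session", None))   # False: forged cross-site request
```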

How does MyBank.com do this? I'll save the details for a later post. There are a few different solutions, but the summary is again: you are trusting that your web browser properly implements its security policies, and that MyBank.com properly implements these prevention techniques. As the end-user, an important thing to understand is which entities you are implicitly trusting.

Saturday, June 2, 2018

The Oncall Conundrum, and thoughts on the heroic last-minute push

I have never met a software engineer who liked being oncall. It makes sense: we're used to flexible hours and working conditions—we don't show up for a shift, aren't usually tied to a location—and the most glorified and interesting parts of the job typically happen in deep flow, while building things. Being oncall violates all these terms. You are on the hook for a fixed period, you can't go off into the woods and your plans may be interrupted at any time, and when shit happens, you're stressed and maybe sleep-deprived and you're not building—you're debugging and doing damage control. Your actions feel like short-term hacks, and you know you're going to have to actually clean things up later. Add on top of this: when an incident happens at night, often you are alone, you feel bad for waking other people up, and you may end up hunting around in systems you don't really understand, seeking out outdated half-done documentation. It's a special loneliness.

Some of the worst oncall situations I've been in occurred on the ads team at Pinterest. I described it to some people as feeling like being thrown in a meat grinder—I knew I'd get chewed up by the machines. Of course, the one most memorable experience was caused by my own work, which left a painful but deeply-ingrained lesson. (Long story short: I had to work around a flaw in my new metrics workflows by staying up all night, refreshing pages and clicking buttons every 20-30 minutes. Otherwise advertisers wouldn't get their metrics, which sounds relatively benign, but is your whole point of view when you are working on ads data.) But when the fires are due to systems you didn't build, especially large ones that will take a long time to fix, that can cause a lot of frustration and loss of morale.

Perhaps this is why, when Li Fan first joined Pinterest as its head of engineering and the company held an eng all-hands, I remember only one question from the Q&A. Someone stood up and said: I think being oncall is one of the most under-appreciated tasks. You may stay up all night keeping the site up, yet this stuff gets little visibility and recognition. What are you going to do about this? 

I'm going to guess fundamentally not much has changed, and not because of her or Pinterest. It has taken me the past five years of experience to realize this. It's because the issue is tricky: you want to praise the motivation and raw effort that goes into great firefighting, but you also want to recognize firefighting as a failure case: "Thanks for doing this work, but this work should ideally not exist." You don't want to create perverse incentives by explicitly rewarding firefighting, otherwise people will be incentivized to set fires or ignore preventative measures. On the other hand, the human effort and individual sacrifice involved in firefighting is inordinate compared to other parts of the job. Unlike literal firefighters, software engineers who are oncall do not consider firefighting their primary responsibility. It's usually the activity that takes away from the parts of the job that they are hired and recognized for. Thus the conundrum.

The best thing to do, of course, is to decrease the amount of firefighting needed. One thing I've never seen tracked at any company I've worked at is the number of hours spent firefighting. At least, as an oncall, I have never reported it or been asked. Measuring has to be the first step. Then surfacing it, e.g. in managers' team health reports.

But no matter what, something will always happen. One thing that could improve how oncalls feel is to give explicit time off for people who do end up fighting fires at odd hours: i.e. something like, "if you fought a fire last night, don't come in the next day." This seems obvious, but I have never heard this policy articulated at any company I've worked at. I think it should be time off immediately after the event, so the intent is clear: to help people recover (and so people wouldn't be motivated by getting extra flex vacation days). One concern is then: but what if I have commitments the next day and have to come in anyway? One way to mitigate this is to plan ahead of time: don't have people who have time-sensitive commitments be oncall close to their deadlines. (I find that when I think about people things, often I end up with some version of "talk about it" and "do better planning.") 

There are bound to be places where the constraints are too hard and the above approaches can't work, like if there are deadlines and fires all the time. If that's you... you may work at a startup.


I now work at a startup where the flavor of oncall issues is currently less fire-this-moment and more customer-support oriented (though this is not to say there are no drop-everything fires). They also typically happen during business hours (which does not mean just US or SF business hours). The point is that oncall has not really been top-of-mind; at this SaaS company it simply feels like part of the job.

I was thinking about the oncall mentality recently because of a pattern I've seen across my whole career, though it's more pronounced in smaller companies: the heroic last-minute push to finish and ship a product. When you're participating in one of these, it's like a longer version of an oncall shift—you are often hammered for several days or longer. In January, there was one week where I worked sixteen hours a day for seven days straight. We stopped working shortly before our demo at 10pm on a Sunday (business hours in another time zone). At the end of one of these, you want to feel good about having put in the hard work. The tough thing is to step back and say, but really, ideally this should not have happened. And on a tiny team, you know it happened in part because of you.

It's actually easier to reflect on the meta-failure if you were part of the team in the sprint—you have more contextual information and the platform to self-criticize. As an observer, it's uncomfortable to say something because it might be perceived as taking away from the hard work that other people put in. Unlike oncall, where even the participants may easily admit that the work is better off being entirely mitigated, the participants in a product sprint are building something that should be meaningful and that they should want to feel good about. Invariably a discussion about the "how" of a product execution is tied up with people's feelings about their contributions. 

This is the lens through which I can understand the value of an explicit product manager, or even a project manager, at a small company that might perceive their biggest challenge as having enough hands on deck to build—there is someone whose craft is this meta, which makes it easier to talk about and critique this meta. A software engineer invites a discussion of code style, because it's part of their craft. A product or project manager invites us to talk about the craft of execution.