A couple of months ago, I posted about leaving academia. Two weeks ago, I joined Google as a Site Reliability Engineering (SRE) manager. I’ll be working to keep bits of Google’s technical infrastructure running smoothly, at least once I’ve learned enough about how it works and what all the various switches and levers do to be dangerous. The past two weeks have been a deluge of new things to learn, but I’ve finally got my head far enough above water to reflect on things a bit.
This was my favorite informal question during the interview process(1): why Google, why SRE, why management? The leap from curmudgeonly pseudoacademic to technical manager isn’t as far as it could be, but “academia” and “industry” are often treated as completely separate tribes(2).
The most interesting part of this question is “why SRE”. I was a sysadmin for a few years in college, running UNIX (and UNIXy) things for the Civil Engineering department at Georgia Tech, and while none of my ops horror stories are truly terrible(3), the responsibility-without-authority that characterized late-90s sysadmin was not something I was eager to continue after leaving school. The SRE model is different, and actually reminds me a lot of what I’d wished that job could be. “SRE is what happens when you ask a software engineer to design an operations team”, as Benjamin Treynor Sloss says in the introduction to the book. I wouldn’t go so far as to call the process I used “software engineering”, but “why manage systems when you can write Perl(4) scripts to manage systems” was pretty much my mantra back then.
I spent the last few years of my research career trying to make (small parts of) the Internet easier to measure and manage, and SRE is an approach to measurement-driven, progressively automatic management of large-scale systems. I’m really excited to learn, hands-on, how it’s done out in the “real” world. I really get the feeling that I’ve left a university to go back to school: everything here is a little different, and looking after bits of it requires that one be an expert in how it works and how it can break. That’ll take some time.
(1): I actually liked a lot of the questions, though it’s still a hard day: after finishing my onsite I went directly home, promptly fell asleep at around seven, and didn’t move for thirteen hours.
(2): I spent an embarrassing amount of time my first week walking into doors: it’s been a long time since I had a badge.
(3): I do have a pretty good story about that time we required everyone to change their passwords (you know, for security) after I rm -rf /etc’d the production mailserver as
(4): I did say late 90s.