Love to Code (& the Goals of an Engineering Manager)

Some time ago, I published a blog post entitled “Love to Code”. The post was loosely inspired by a blog post from Jeff Atwood and another by Joel Spolsky, only my post was specifically advocating that people who *manage* Software Engineers should love to code. In hindsight it didn’t offer much in the way of interesting content; it was essentially an emotional appeal for managers to stay in the weeds and keep their hands dirty, because intuitively as a manager I felt those things made a difference.

In the time since, I’ve crystallized my thinking on both 1) the goals for an Engineering Manager (and I’m using that term loosely to mean “anyone who manages engineers”, whether it’s a small team or a large organization) and 2) why an Engineering Manager needs to love to code and stay technical to achieve those goals, and I figured that it made sense to go and touch up the old post with some of my more recent musings. In my mind the primary goals of every Engineering Manager are to increase team engineering throughput and channel that throughput in the right direction by:

  1. Mentoring and growing engineers on the team.
  2. Hiring (and occasionally firing) engineers so that the team is staffed with the right number and mix of people.
  3. Organizing engineers on the team to minimize friction and execute on the product vision.
  4. Directly contributing engineering horsepower when appropriate/necessary.

Note that I’m not claiming that this is an exhaustive list of goals, and in many contexts an Engineering Manager may have many other goals in addition to this set. For example, as an Engineering Manager on the Service Availability initiative at Riot Games, I spend a lot of time thinking about the roadmap for the technical products that we’re building and thinking holistically about our company-wide hiring strategy for engineers (which are related to, but not identical to, #3 and #2 in the list). I am claiming that this list is as close as it gets to a lean set of goals that apply to managing Software Engineers in any context. I would argue that none of those goals are fully achievable by a manager who isn’t technical or doesn’t spend some time doing engineering work, and I’ll briefly explain why.

Mentoring engineers requires some measure of empathy and understanding on the part of the mentor and respect on the part of the mentee, and those things are best established by spending some time shoulder to shoulder in the trenches. The process of hiring engineers involves making a call on a candidate’s technical chops based on a very thin slice of data, and a hiring manager will make better hiring decisions if she is doing engineering work because she understands the nuances of what it takes to be successful. The effects of Conway’s law dictate that teams will produce software that mirrors the shape of the organization, which makes it vital that managers driving the discussion on how to best organize are familiar with the technical domain. And finally, every team will go through crunches where an extra set of experienced hands on deck can make the difference between shipping the product on time or missing deadlines.

The skeptic will argue that each of these things can be outsourced to others (for example to Senior Software Engineers on the team or elsewhere in the organization). To be clear, I don’t fully disagree; goals are fluid and both the priority and ownership of the goals may change based on context. I fully acknowledge that in large cross-functional organizations at some level, a leader can’t be an expert in every function. As a brief aside, this is why I think that an organizational structure like Riot’s “matrixed” model is more effective (which probably warrants a post of its own). I will claim that over the course of a career, Engineering Managers who stay technical, love to code, and focus on this set of goals will be far more effective than those who drift into “pure people management” territory and cross their fingers, hoping that others can fill in the gaps and help them to accomplish goals that they aren’t equipped to deliver on.

Your Critical Data Isn’t Safe

I’m willing to bet that just about every working moment of your life up to this very instant has been an attempt to make money with the goal of accumulating enough wealth to live comfortably and achieve a set of objectives (retirement, travel, an increased standard of living). That nest egg is probably stored at a bank on a computer system as a set of 1’s and 0’s on a disk. Current analysis suggests that, on average, somewhere between 2% and 14% of hard disks in data centers fail every year. In other words, every month x% of my paycheck is deducted for retirement savings and stored on a disk that has roughly a 1 in 20 chance of failing sometime this year, and if that money isn’t available in 30 years I’m hosed.
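To put rough numbers on that worry, here’s a back-of-the-envelope sketch (the 5% annual failure rate and the independence assumption are mine, chosen to match the “1 in 20” figure above):

```python
# Probability that a single, unreplicated disk survives N years,
# assuming an independent annual failure rate (AFR).
def survival_probability(annual_failure_rate: float, years: int) -> float:
    return (1.0 - annual_failure_rate) ** years

# With a 5% AFR (the "1 in 20 chance" of failing this year):
print(f"P(disk survives 30 years) = {survival_probability(0.05, 30):.2f}")
```

Under those assumptions, the disk holding my nest egg has only about a 21% chance of surviving the 30 years until retirement, which is exactly why no bank stores your balance on one disk.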

A dire situation indeed, but I’m obviously omitting a few important little details. Fortunately for me the bank has a government/shareholder vested interest in seeing me cash my retirement dollars out someday, so they’ve hired a small army of programmers to design systems that guarantee that my financial data is safe. The question is, how safe? Is my blind faith that my digitally stored assets will never be lost justified?


Let’s start by considering a few basic scenarios around data storage and persistence on a single machine. Suppose that I’m typing up a document in a word processing application. What assumptions can I make about whether my data is safe from being lost? Most modern hardware splits storage between fast volatile storage whose contents are lost without power (memory) and slower non-volatile storage whose contents persist without power (disk). It’s possible that future advances in non-volatile memory will break down these barriers and completely revolutionize the way we approach programming computers, but that’s a lengthy discussion for another time. For now it’s safe to assume that my word processor is storing the contents of my document in memory to keep the application’s user interface speedy, so something as trivial as a quick power blip can cause me to lose my data.

One way to solve this problem is by adding hardware, so let’s say that I head to the store to buy a nice and beefy UPS. I’ve covered myself from the short power outage scenario, but what about when I spill my morning coffee on my computer case and short out the power supply? My critical document still only exists in memory on a single physical machine, and if that machine dies for any reason I’m in a world of hurt.

Suppose I decide to solve this by pushing CTRL+S to save my document to disk every 5 minutes. Can I even assume that my data is being stored on disk when I tell my application to save it? Technically no; it depends on the behavior of both my word processor and the operating system. When I push save, the word processor is likely making a system call to get a file descriptor (if it doesn’t already have one) and making another system call to write some data using that file descriptor. At this point the operating system still probably hasn’t written the data to disk; instead it has likely written it to a disk buffer in memory that won’t hit the disk until the buffer fills up or someone tells the operating system to flush it.
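A minimal sketch of the difference, in Python (the function names are mine; the pattern applies in any language that exposes write and fsync):

```python
import os

# A plain write() hands data to the OS, which typically parks it in an
# in-memory disk buffer (the page cache) rather than writing it to disk.
def save_unsafely(path: str, data: bytes) -> None:
    with open(path, "wb") as f:
        f.write(data)  # data may still be sitting in OS buffers here

# flush() pushes the application's userspace buffer to the OS, and
# fsync() asks the OS to push its buffers down to the disk itself.
def save_durably(path: str, data: bytes) -> None:
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
```

Even `save_durably` isn’t an absolute guarantee (some drives lie about flushing their on-board caches), but it’s the difference between hoping and asking.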

Let’s assume that I’ve actually examined the code of my word processor and I see that when I press save it is both writing data and flushing the disk buffer. Can I guarantee that my data is on disk when I press save? Probably, but it’s still possible that I will lose power before the operating system has the chance to write all of my data from the buffer to disk. People who implement file systems have to carefully consider these kinds of edge cases and define a single atomic event that constitutes crossing the Rubicon, the point of no return. In many current file systems that event is probably the writing of a particular disk segment in a journal with enough data to repeat the operation: if the write to the journal completes then the entire write is considered complete; if it isn’t written then any portion of the write that has completed should be invalidated.
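Applications can borrow the same trick at their own level with the classic write-to-temp-then-rename pattern. This sketch is my own illustration, not how any particular word processor does it:

```python
import os

# Make a single atomic rename the "point of no return": write the new
# contents to a temp file, fsync it, then rename it over the original.
# Readers see either the old document or the new one, never a torn mix.
# (POSIX rename() is atomic within one file system; a fully paranoid
# version would also fsync the containing directory.)
def atomic_save(path: str, data: bytes) -> None:
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, path)  # the atomic commit point
```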

What if I can somehow guarantee that the disk write transaction has completed and my document has been written to the disk? Now how safe is my data? I’ve already touched briefly on hard disk failure rates. My disk could die for a variety of electronic or mechanical reasons, or because of non-physical corruption to either firmware or something like the file allocation table.

Again I turn to hardware and decide to set my computer up to use RAID 1 so that my data is saved to multiple redundant disks in the same physical machine. I’ve drastically reduced the chance of losing my data due to the most common disk failure issues, but my data remains at risk of being lost in a local fire or any other event which could cause physical damage to my machine. I may be able to recover the contents of one of the disks despite the machine taking a licking, but there aren’t any guarantees, and even if I can recover the data it’s likely to take a significant effort and a lot of time.
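Continuing the earlier back-of-the-envelope math (same assumed 5% annual failure rate, same independence assumption, and deliberately ignoring exactly the correlated failures, like a fire, that make mirroring insufficient):

```python
# Data on a RAID 1 array is lost (absent a timely rebuild) only if every
# mirror fails, so independent failure probabilities multiply.
def mirrored_loss_probability(annual_failure_rate: float, mirrors: int) -> float:
    return annual_failure_rate ** mirrors

print(f"single disk, 1-year loss odds: {mirrored_loss_probability(0.05, 1):.4f}")  # 0.0500
print(f"RAID 1 pair, 1-year loss odds: {mirrored_loss_probability(0.05, 2):.4f}")  # 0.0025
```

A twenty-fold improvement on paper, which is precisely why the independence assumption matters: one fire takes out both mirrors at once.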

I’ve pretty much run out of local options, so I run to the promise of the cloud. I script a backup of my file system to some arbitrary cloud data storage every N minutes. I decide that I’m alright if I lose a few updates between backups, and the data store tells me that it will mirror my data on disks in separate machines in at least N geographically distinct locales across the globe. So what are the odds that I lose it? Obviously a world class catastrophe like a meteor striking earth could still obliterate my data, but in that scenario I probably wouldn’t be too stressed about losing my document. So what credible threats remain?
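The backup script itself is the easy part. A minimal sketch in Python, where `upload_to_cloud` is a hypothetical stand-in for whatever client your storage provider actually offers:

```python
import tarfile
import time

# Bundle a directory into a compressed archive suitable for uploading.
def make_archive(source_dir: str, archive_path: str) -> str:
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname="backup")
    return archive_path

# Archive and upload every N minutes, accepting that updates made
# between backups can be lost.
def backup_loop(source_dir: str, upload_to_cloud, interval_minutes: int) -> None:
    while True:
        archive = make_archive(source_dir, f"/tmp/backup-{int(time.time())}.tar.gz")
        upload_to_cloud(archive)  # hypothetical provider-specific call
        time.sleep(interval_minutes * 60)
```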

One of the biggest dangers for data stored in the cloud is the software that powers the cloud. A while ago I worked on a project (that I won’t name) that involved a very large scale distributed data store with geographic redundancy. We had fairly sophisticated environment management software that handled deploying our application plus data, monitoring the health of the system, and in some cases taking corrective action when anomalies like hardware failure were detected (for example, reimaging a machine when it first came online after getting a new disk drive). At one point a bug in the management software caused it to simultaneously start to reimage machines in every data center around the world. The next few days ended up being pretty wild ones as we worked to mitigate the damage, brought machines back up, and worked through various system edge cases that we had never previously considered. We lost a significant amount of data, but we were fortunate because the kind of data that our system cared about could be rebuilt from various primary data stores. If that weren’t the case we would have lost critical data with significant business impact.

Another risk to data in any cloud is people with the power to bring that cloud down: a disgruntled organization member or employee, an external hacker, or even a government. When arbitrary control of a system can be obtained via any attack vector or even by physical force, one of the potential outcomes is intentional deletion of data. I’ve focused the thread on data safety (by which I mean prevention of data loss) rather than data security (which I would take to mean both safety and the guarantee of keeping data private), but malicious access to data tends to favor the latter since stolen data is lucrative. It’s perfectly plausible that future attacks could focus on trying to delete or alter data and destroy the means of recovering from the data loss, regardless of the degree of replication. Think digital Tyler Durden. People who stored data on MegaUpload probably never envisioned that they would lose it.

My main point is that whether data is held in local memory, on disk, replicated on a few redundant local disks, or distributed across continents and data centers, there is always some degree of risk of losing the data. Based on my anecdotal experience, most people don’t associate the correct level of risk with data loss regardless of where the data lives. I think those kinds of considerations will become increasingly important as more and more data moves to both public and private clouds with varying infrastructures. There is no such thing as data that can’t be lost, only ways to make data less likely to be lost.

Being A World Class Software Development Manager

Over the past few years I’ve managed 3 software development teams at 2 very different companies and I’ve learned a lot in the process. I also read several blogs on software development somewhat religiously, and I’ve liberally borrowed good ideas and added them to my toolbox. I’ve witnessed amazing managers as well as ineffective managers, and I’ve spent time pondering what differentiates the two. I don’t consider myself a guru on effective software development management, but I thought I would share some traits that I would look for if I were hiring a software development manager to work for me, because I think that a manager who exhibits these traits has the greatest chance of success.

Be a Computer Geek

Why does it matter? Because geeks are drawn to fellow geeks. Before I went into management at Microsoft I was consistently drawn to teams led by someone I respected not only as a leader, but also for their technical chops. I knew that I could learn more from those sorts of people. I also felt more confident that my geek manager would understand and help contribute to the correct technical direction for a product and would be more prone to allocate time to do things the right way. I’ve always had a special respect for Scott Guthrie because he maintains a technical blog while operating as a VP. When ASP.Net MVC was new and I was searching for a tutorial, I stumbled on his Nerd Dinners post, which is pretty much the canonical search result for the topic, and I was in shock: here’s a dude who manages more people than I ever will and he’s really taking the time to get to know the tech! That’s the kind of person I always want to work for and it’s the kind of manager I want to be. Michael Lopp (aka Rands) does a great job providing his own thoughts on managing nerds and providing some additional context on why it matters.

So how do you become a geek if you’re not one already? Allocate time to read nerdy stuff, every day. Keep coding; if you don’t have time for it in your day job, take classes or work on a side project (even if it means waking up at 5am). Participate in team code reviews: I try to carefully review at least 1 out of every 10-15 changes that my developers submit, to let them know that I care and to encourage both thorough reviews and a high bar for code quality. Follow nerdy technocrats on Twitter or blogs. Read a book on operating systems or compilers. Technology is changing so quickly that not intentionally allocating time to immerse yourself in it will render you obsolete in a hurry. There are plenty of non-technical people who are competent managers but will lead a contented team into oblivion because they can’t see the correct technical direction to follow. There are relatively few people who are both competent managers and technical rock stars, able to lead a team and provide the correct technical direction to make it both happy and really successful.

Manage Projects that you Care Obsessively About

This one is a bit loaded because in the real world you can’t always pick what the teams you manage are delivering, but your odds of cranking out a world class product are exponentially greater if you really believe in what you’re doing. I don’t just mean you think it’s mildly interesting; I mean you think it has the potential to change your company (if not the world) in some tangible way that excites you. If you really buy into what your team is working on, you’ll spend more time working, you’ll lay awake at night iterating on the technical vision for the product your team is creating, and your enthusiasm will show through to the people who report to you. It sounds cheesy, but it’s true: people won’t work hard for someone who doesn’t believe in what the team is doing.

If you think that you’re working on something that isn’t exciting but you’re skeptical about your ability to move, you probably need to push yourself and do some shopping around. Towards the end of my time at Microsoft my team was re-org’d (and essentially re-purposed) to work on some things that I wasn’t as excited about. Despite getting good reviews my entire time at the company, I had this sort of skepticism about my ability to interview well and move around either within the company or externally. It wasn’t until a recruiter contacted me randomly after stumbling across my resume on LinkedIn and described a technology that sounded exciting that I finally pulled the trigger and went through the interview process. And guess what? I didn’t end up getting an offer. In his shameless Google plug, Steve Yegge does a great job describing something he calls the “anti-loop”: an interview loop with a set of people that even the most qualified candidate will not survive, and something that I’ve personally observed while hiring.
I bring up the anti-loop because its mere existence means that not getting an offer your very first time through the process shouldn’t in any way discourage you from continuing to hunt. Shortly after my initial failed loop, Amazon contacted me about a position to manage a team working on something that I felt could completely change the game for how software development is done, and an interview loop later I had a job offer. Software development is a weird industry in that the skills to be successful at the job aren’t necessarily directly related to the skills necessary to successfully interview for the job, so if you’re not interviewing at least once per year to keep your skills sharp, you may want to consider doing so. Regardless of how often you’re interviewing, you shouldn’t let yourself be paralyzed by fear of the unknown; it’s an eye-opening experience to immerse yourself in a totally foreign team/organization/place and it’s a huge way to both promote personal growth and position yourself to work on things that excite you.

Surround Yourself with Amazing People

Sounds obvious, but I really can’t overstate the point. One of the most important assets that you have as a manager is your network of connections in the industry. LinkedIn and the competitive nature of talent acquisition in the industry have changed the game, and if you’re not adding talented people to your network that you can call on when your team needs to quickly staff up, then you’re putting yourself in a hole. Talented people change teams and even companies fairly frequently today, and as a manager you will be faced with situations where you have to hire on a time crunch. Even the greatest sourcing strategies can’t substitute for firsthand knowledge of how good someone is and having a rapport in place, which means you need rock stars in your network who enjoy working for you and may be willing to follow you to work on new things. A side effect of this phenomenon is that if you’re not changing teams/jobs every several years then you’re probably not adding new talented folks to your network as quickly as you could be, and you may either want to look around at other opportunities or consciously try to put yourself in positions to regularly form relationships with people outside your network.

The ability to build a good team and hire effectively is the single most important trait in a software development manager and it requires both a solid network and being a great interviewer, a separate topic that has been covered in a lot of detail all over the place and that I will likely touch on in a separate post. The best manager in the world cannot succeed without an amazing team, and a bad manager who inherits a good team will do better than a good manager who can’t hire (in the short haul, until the talent leaves).

Clear a Path for Those People

After building a great team, a manager’s job comes down to clearing a path for those people to succeed. Think of your team as a pipeline for delivering goodness: your job is to point the pipe in the right direction and remove friction that can slow the speed of the goodness. Aiming the pipeline requires crafting clear and concise vision documents that the team buys into within the parameters of the organization. There is no substitute for being able to write great documents; they are one of the primary deliverables of any good manager. Documents stand the test of time in a way that a PowerPoint presentation doesn’t, because a deck can’t be read by anyone in any situation without the context of the presenter. If you can write an effective document describing what your team will be working on over the next 12 months, with specific dates for deliverables that both meet customer requirements and are realistic based on team bandwidth, then you already possess an underrated skill that many managers lack.

Removing friction from the pipeline means a variety of things depending on the team, organization, and product. It doesn’t necessarily mean getting rid of as many meetings or as much process as possible. It does generally mean things like putting an effective process in place, weighing in on technical design, and providing motivation to folks on your team. There are plenty of articles discussing why money doesn’t necessarily directly contribute to job satisfaction or effectiveness over the long run; as a manager it’s your responsibility to know how to keep your employees satisfied and motivated. Let your employees know that you genuinely care about them by asking how life is going outside of work. Set up effective 1:1s on a weekly basis, and be sure that you’re not just covering things that are tactical. Be invested in your employees’ career growth. Set up bi-weekly staff meetings where employees can comfortably ask questions about the direction of the team or provide feedback on the process.

Install a Lightweight Process and be Flexible

One of the most common mistakes that I see managers make is a “my way or the highway” type approach when it comes to team process. Scrum has obviously become the hawtness that many teams use today. Scrum was developed for manufacturing. Software development is not manufacturing. I don’t mean to say that scrum is bad by any stretch of the imagination, but it works well in some situations and fails miserably in others. On the last team that I managed at Microsoft we operated on a fairly strict scrum model. We kept a super detailed product backlog that we groomed regularly. When we pulled items from the product backlog into the sprint backlog during sprint planning we had a great idea of how much bandwidth we had, what load factor we were operating at, and exactly what we could get accomplished. We did a pretty good job scheduling demos at the end of sprints, and we always held retrospective meetings and carefully considered and integrated feedback from the team. In the context of our organization we were operating as model citizens, and it worked well.

The first thing I realized when I came to Amazon is that if I ran my team the same way, my developers would probably all commit mutiny and the team wouldn’t run effectively. The culture is different and the product that we’re working on happens to be in a very early “prototype-ish” phase, so requirements are changing very rapidly. It’s taken a bit of trial and error, but we’re starting to really hit our stride on a hybrid scrum model with mini weekly iterations within each sprint. The scrum purists would roll over in their graves, but I’m alright with that.

When a manager takes over a new team, the first step should be observing how the team has run in the past (unless it’s a brand new team). The next step should be investigating how other effective teams within the organization are operating. The last question they should ask themselves is how their past experiences can play into the process within the context of the new team. The goal in the entire exercise should be to build a process that is as lightweight as possible while effectively meeting the requirements of partners and customers and keeping engineers happy. If it sounds like a lot to ask, it is. The odds that you’ll get it right on your first try for any team are low, which is why it’s important to approach new team processes with humility and a willingness to be flexible. The last thing you want to do is create what Jeff Atwood aptly describes as micromanagement zombies by installing an overly restrictive process just because it worked at your last gig.

If you’re nailing everything mentioned in that list I’m willing to bet that you’re enjoying work and leading a highly effective software development team, and that you’re probably a better manager than I am. Alternatively if you disagree I would certainly love to hear your feedback so that I can be convinced and update my list accordingly.

Shifting Gears A Bit

In the past I’ve tended to blog (rather infrequently) about different technical solutions to problems that I’ve stubbed my toes on in hopes that I would spread the love and save others from getting stumped by the same problems, but a few things have happened recently that have impacted the kind of stuff that I will probably bother to blog about in the future. First, Stack Overflow essentially became the single source to answer technical coding questions. Joel Spolsky may claim that the primary UI for Stack Overflow is Google, but to be honest the content on the site is generally so good these days that I head straight there to unlock the deepest darkest coding mysteries and I bypass blogs and other sources of wisdom in the process. I’m sure others do the same, so the value of answering technical questions in a blog is probably diminished.

Another change inspiring factor is that about 3 months ago I quit my job at Microsoft after over 6 years working on Bing and took a position working on the WAP team at Amazon. I won’t bother with the details of what inspired the change, but I’ll just briefly comment that I really enjoyed my time at Microsoft and I’ve also loved working at Amazon thus far. I’ll also point out that I find it pretty remarkable how differently the two companies function and specifically how different the “Manager” job at Amazon is from the “Lead” job at Microsoft (which are essentially equivalent roles). In a nutshell as a Software Development Manager at Amazon you run your team as if you’re running a small startup within a big company, so you’re on the hook for everything from your team strategy and internal marketing to sourcing and hiring to product design and implementation. One of the downsides to this approach is that it doesn’t leave room for the 20-30% coding time that Microsoft typically encourages Software Development Leads to partake in. As a result I’ll probably start focusing the blog a bit more on effectively running a software development team, and a bit less on nitty gritty coding/technical problems/issues.

A third factor is that I’ve started taking classes in the UW PMP CS program, which has me daydreaming about things like compilers and operating systems, so material that I’m learning in classes or questions related to it may seep into my blog posts from time to time.

That’s all for now, just a brief explanation to the few readers who trickle by my blog that the scenery may change just a bit.