Great Documents: A By-Product of Effective Software Development Managers

If you’re managing people who develop software you should probably be spending a nontrivial portion of your time writing documents, and the quality of those documents is critical. Documents matter because there are several questions that every manager needs to answer for their management chain:

  • What’s the current state of the union?
  • Where are we headed over the next 12-36 months?
  • What level of staffing do we need to achieve that vision?
  • Are my employees compensated appropriately?

Managers also need to answer a related and partially overlapping set of questions for their employees:

  • Where is the team headed?
  • What’s my role in helping us get there?
  • How have I been performing?
  • How can I improve my performance, and grow my career?

Several tools are available to answer these questions in the modern business setting, but none are as effective as written documents. Face-to-face conversations or meetings are less efficient, more random, and can’t be archived for easy consumption after the fact. PowerPoint presentations require the context of a speaker (if they don’t, you’re abusing PowerPoint and giving bad presentations) who is only presenting to a limited audience, so they share the archive problem of conversations. Videos or audio recordings of either conversations or presentations are impossible to scan quickly, and it’s more difficult for the person consuming the information to backtrack or skip around as needed. Email or instant messenger conversations are less formal and rigorous than documents by convention, so they allow for glossing over areas where deeper thought and investment are essential. Put simply: creating documents forces a manager to codify thoughts or ideas into an artifact that is easy for others to parse at any point down the road, with an effectiveness that no other process can duplicate.

I didn’t realize the value of written documents until I started working at Amazon almost a year ago. It’s very common at Amazon to walk into an hour long meeting and spend the first 20 minutes in silence reading and marking up a hard copy of a particular document, and spend the remaining 40 minutes discussing it. Initially I found that odd, until I went through the exercise of preparing my first long range planning document for my team and getting it iteratively reviewed by my team, peers, and various levels of my management chain. It took a lot of work, but all of that work ended up being hugely beneficial. We spent extra time meeting with customers to update requirements and get product feedback, held brainstorming sessions with team members and senior engineers who were interested in the space, and did some analysis on the cost of operations and ways that we could optimize some of that overhead. The final product was a 6 page plan that I could hand to anyone and rest assured that after a few minutes of reading they would have a great feel for what my team is up to, and why.

As a quick aside, this is a great example of why I previously wrote encouraging both software developers and managers to change companies/environments over the course of their career. There are a lot of things that I learned at Microsoft that I never could have learned at Amazon. There is an equally long list of things that Amazon does really well that I wouldn’t have learned anywhere else. Throwing yourself in a totally different neck of the woods provides a unique opportunity to grow in areas that you couldn’t have developed by staying put.

Back to documents. To clarify, I’m not talking exclusively about technical documentation like functional specifications, design documents, or test plans. I’m more focused on things like vision documents (which direction things are headed), long range planning documents (the nitty gritty on how to move in that direction), and documents about things like employee growth or promotion readiness. The naysayers will argue against the value of these kinds of documents because they aren’t part of what ultimately ships to the customer: the bits and bytes that get burned to a disc or deployed to a server somewhere. I would argue the exact opposite. For example, taking the time to produce a long range plan that you can hand to engineers, customers, and partners can help you avoid building meaningless features, and help customers and partners give earlier feedback on where you’re headed. Similarly, taking the extra time to prepare a document evaluating an employee’s readiness for promotion is a great way to keep that employee apprised of growth areas, ensure that the employee is happy in their career progression, nip problems in the bud, and in the end save you from reduced productivity while back-filling for the attrition of an unhappy team member.

So without further ado, here are a few tips that I think will make you better at producing high quality documents:

Define your audience before you start.

In most cases it’s not possible to effectively address multiple audiences in a single document. Before you put pen to paper, define your audience. If the audience seems too broad, consider writing multiple documents instead of one. For example, if you’re writing a document that tells what your team will deliver over the next 12 months, it may be appropriate to have two flavors of the doc: one for your management chain, and one for your customers. Your managers may want to know some of the dirty details like how much time your team spends on operations or how much test debt you need to backfill, but your customers may only care about what new incremental features your team will deliver. I’ve also seen cases where authors don’t define their audience right out of the gate, and the end result is a document that isn’t really meaningful to any group of people.

Make bold claims. Don’t use weasel words. Be specific about dates.

Weasel words kill the impact of a document, and are a mechanism to avoid hard thinking or research that needs to happen. Consider a sentence like “Implementing this feature will represent a significant win for customers.” The sentence raises the questions: what feature, how significant, and what kind of win? Now consider the impact of rewriting it to “Implementing the features to allow multi-tenancy will allow customers to reduce the size of their fleet by 50%, resulting in a million dollar reduction in TCO.” Note that getting from A to B requires a lot of research, but the result is a statement that is much more impactful and makes it easier to gauge the value of the feature in question.

It’s equally important to be specific about dates. For example when you read something like “The feature will be completed later this year”, you should automatically ask the question: when this year? Are we talking next week, or late December? If my team has a dependency on your feature, I’ll need some more granular deets. If it’s impossible for some reason to provide a date, then provide a date by which you’ll have a date.

Finish early. Allow bake time.

This is critical. If your document is due in 3 weeks, plan to complete it in 1. Before you write the document you should identify peers that you want to read it, ping them to be sure that they block out time to do so, and then be sure to get them a copy on schedule. Consider iterative rounds of reviews with different groups of people that are stakeholders for the document. For example if you’re a line manager creating a vision document for your team you may want to start by getting it reviewed with a few of your peers and senior engineers, then take it to folks up your management chain, and then review the document with a few key customers. In my experience the resulting document is often drastically different (read: drastically better) than the original version.

Review, and review again. Use hard copy.

On a similar note, review your work often. Don’t write a document in one shot and call it good. When you finish it, step aside for a day and then read it afresh. Print the document out and review it in hard copy, pen in hand (and then go plant a tree). Staring at a piece of paper puts your brain in a different mindset than staring at Word on the computer screen. When you’re staring at your screen your mind is thinking rough draft, or work in progress. When you’re staring at ink on paper your mind is thinking finished product. You’re more likely to be a good editor of your document in the latter mode.

Conclusion

This isn’t an exhaustive list by any means, but it does include the tips that I’ve personally found to have the biggest impact on document quality. At some point I may put together a follow up list with some additional ideas on writing docs that I’ve excluded from this post. I personally apply these ideas to everything I write, including blog posts like this one. I hope that you find this helpful, and if you have additional ideas on either the value of documents or how to produce great ones I would love to hear them in the comments!

The Perl Script That May Save Your Life

I had a major “oh !@#$” moment tonight. While playing around with Maven, M2Eclipse and moving some project folders around I hastily hammered out a “sudo rm -R” and realized seconds later that I had blown away some code that I wrote last night that wasn’t in version control. All deleted. Not cool.

Fortunately I stumbled on this simple yet life-saving article + Perl script that greps the raw contents of sda* for a particular string that was in your file and prints the contents around it:

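#!/usr/local/bin/perl
use strict;
use warnings;
# Scan the raw partition in 4 KB blocks and print every block that contains
# the search string. Run as root and send the output to a different disk.
open(my $dev, '<:raw', '/dev/sda1') or die "Can't open: $!\n";
while (read($dev, my $buf, 4096)) {
  # tell() reports the byte offset just past the block that was read
  print tell($dev), "\n", $buf, "\n"
    if $buf =~ /textToSearchFor/;
}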

A quick run and a few minutes later I had my code back in one beautiful piece again. Mad props to chipmunk!

Why You’re Missing the Boat on Facebook Stock

I was about 2 hours into a 5 hour drive en route to an annual weekend golf trip when Facebook went public. That made me a captive audience for the 70 something year old family friend (who admittedly is a sharp cookie at his age and a damn good golfer) in my back seat as he lectured the rest of us on why the stock would be worthless in five years. In the weeks since I’ve heard a million flavors of the same message from people whose tech savvy ranges from expert hackers to completely clueless. I respectfully disagree, and I think that there is a compelling technical argument that can be made for why Facebook has tremendous upside as a company. So let’s consider the question: should we all be buying Facebook stock at post IPO prices?

The Completely Tangential Bit

The first answer that I get from most folks is no, because Facebook adds no real value to people’s lives. In fact in some ways the result of the company’s existence is a net negative because it causes people to waste massive amounts of time and/or productivity. The company doesn’t produce goods or real services, and some would argue that it’s just a glorified LOLcats. I actually kind of agree, but I don’t think that it matters. What Facebook does produce as a sort of byproduct is an absolutely massive repository of personal data. More on that later.

The Red Herring

The next objection that people raise is based on an assumption that the primary way to monetize the website is ads. The company has certainly toyed with all kinds of ways of putting paid content in front of users, and the early returns seem to indicate that Facebook’s ads don’t work (at least not compared to Google’s paid search advertising). It doesn’t take a rocket scientist to realize that social pages are a whole different beast than search results pages. When people visit Google their intent is to navigate to another page about a topic. They don’t particularly care whether the link that takes them there is an algorithmic search result or a paid ad; they’re just looking for the most promising place to click. When people visit their BFF’s Facebook page they aren’t looking to leave the site; they’re planning on killing some time by checking what their friends are up to. So again on this point I agree; I’m skeptical that Facebook will ever see the kind of crazy revenue growth from ads or any sort of paid content on their site that would justify even the current stock price. But advertising is just one way to skin a cat…

The Glimmer of Hope

Slightly off the topic of ads, in the related space of online sales and marketing, is where the first signs of promise can be found. Let’s get back to that data thing: Facebook has an absolute gold mine of knowledge that other companies would pay cold hard cash to access. Consider Amazon, for example. Amazon spends plenty of money mining user data to make more educated recommendations based on past purchase history. What would it be worth to them if they could find out that I have an 8 month old daughter, so I need to buy diapers on a regular basis? That I love Muse, so I may be interested in purchasing and downloading their new album? That I checked in at CenturyLink Field for a Sounders match last week, so maybe they can tempt me with a new jersey? Those are some of the more obvious suggestions, but there are actually more elaborate scenarios that could be interesting. What if you could combine Amazon purchase data with Facebook social graphs and figure out that three of my friends recently bought a book on a topic that I’m also interested in, and then offer those friends and me a discount on a future purchase if I buy the book as well?

Facebook’s current market cap as I’m writing this is sitting at 57 billion. To get to a more reasonable price-to-earnings multiple of 20, which seems relatively in line with other growth companies in the industry, they need to add around 2 billion in annual earnings. Based on the numbers that I could dig up, that’s less than 1% of online sales in the US alone. Is that possible? Consider the margins of the biggest online retailer. Amazon is legendary for operating on razor thin margins, but their US margins last year were around 3.5%. How much of that margin would they part with for ultra meaningful personalization data that could have a huge positive impact on sales volume? Also, keep in mind that these numbers are for the US only, and they don’t include the astronomical projected growth in online sales moving forward. Regardless of exactly what the model looks like, I think there is a path for Facebook to leverage their data to grab some small piece of that growing pie.
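The arithmetic behind that paragraph can be spelled out as a back-of-the-envelope check. This is only a sketch that uses the figures quoted above; the variable names are made up for illustration, and the "implied" values are simply what those figures entail, not independently sourced numbers.

```python
# Back-of-the-envelope check using only the figures quoted in the post.
market_cap = 57e9        # market cap quoted above, in dollars
target_pe = 20           # the "more reasonable" price-to-earnings multiple
added_earnings = 2e9     # extra annual earnings the post says are needed

# Earnings required for the current market cap to represent a P/E of 20.
required_earnings = market_cap / target_pe

# What the post's numbers imply about current annual earnings.
implied_current_earnings = required_earnings - added_earnings

# If 2 billion is less than 1% of US online sales, sales must exceed this.
implied_min_us_online_sales = added_earnings * 100

print(f"required earnings:        ${required_earnings / 1e9:.2f}B")
print(f"implied current earnings: ${implied_current_earnings / 1e9:.2f}B")
print(f"implied US online sales:  over ${implied_min_us_online_sales / 1e9:.0f}B")
```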

The privacy hawks out there are already sounding alarms; I can hear them from where I’m sitting. But who says that there isn’t a model of sharing data that Facebook users would be happy with? I would venture that there are arrangements where users would be happy to share certain kinds of information to get a more relevant shopping experience. Taking things one step further, there are certainly users who would expose personal information in exchange for deals or rebates that online retailers like Amazon could kick back as an incentive to get the ball rolling, and Amazon isn’t one to pass on a loss leader that drives business with a long term promise of return on investment.

The Real Diamond In The Rough

And that gets us to the crux of the matter. Online sales are just one example of a market that Facebook can get into and leverage its data to make a buck. The evolution of computer hardware, the maturity of software that makes it trivial to perform distributed computation in the cloud, and continued advances in machine learning have ushered in the age of big data. Computer scientists who specialize in machine learning and data mining are being recruited to solve problems in every field from pharmaceuticals to agriculture. And the currency that these scientists deal in is huge amounts of data. Facebook has data in spades, and it has a very valuable kind of data that nobody else has.

The model for monetizing that data isn’t clear yet, but I can think of possibilities that make me optimistic that good models exist. For example, think about the kind of money that Microsoft continues to pour into improving Bing and leapfrogging Google’s relevance to become the leader in online search. Facebook’s data could be an absolutely massive advantage in trying to disambiguate results and tailor content to a particular user. Google’s Search plus Your World (SPYW) bet and Bing’s Facebook integration are different approaches to integrating bits of social data into search, but they fall way short of the kind of gain that could be had via direct access to Facebook’s massive amount of social data.

Or suppose that a company or government body is trying to gain information about the spread of a particular disease. Maybe they have medical records that include the identities of people who are carriers, but not much more than that. If they had access to Facebook’s data they could suddenly know about the ethnicity, social network (who’s hanging out with who), and habits (through check-ins) of people in both classes: carriers and non-carriers. Applying machine learning to that training set may yield some interesting information on what traits correlate with becoming a carrier of the disease.

The One Armed Bandit

Of course, there’s a risk involved. As a friend of mine aptly pointed out, my case for Facebook’s value looks something like: 1) have a lot of important data, 2) mystery step, 3) profit. I would argue that if the mystery step was clear today, the valuation of Facebook stock would be much higher than even where it’s currently trading. I’ve given a few fictional examples to make the case that the mystery step probably exists. If you buy that argument, then you too should be buying Facebook stock. And this bar may be serving some expensive drinks in the future.

How To Automagically Classify Webpages By Topic

Ever wondered how you can automate the process of figuring out whether a webpage is about a particular topic? I’ve spent some time recently on a side project that involved solving a flavor of that exact problem. I’ve seen several related questions on Stack Overflow and other sites, so I thought I would throw together a quick post to describe bits of my implementation.

For our example, let’s suppose that we need to implement something that exposes an API like the following:

  • boolean classifyWebpage(Webpage webpage)
  • void trainClassifier(Map<Webpage, Boolean> trainingExamples)

We will mandate that consumers call the function to train the classifier once with a training set before they call the function to evaluate whether a webpage is about our topic. Our train classifier function will take a bunch of webpages, each labeled with whether or not it is about the given topic, to use as training examples. Our classify webpage function will take a webpage and return true if the webpage is about the topic and false if it isn’t. To achieve this, we’ll implement a few helper functions:

  • String cleanHtml(String html)
  • Map<String, Integer> breakTextIntoTokens(String text)
  • Map<String, Float> getTokenValues(Map<String, Integer> tokenCounts)

Let’s look at how we can implement some of these pieces in detail.

Cleaning up HTML

The first piece of infrastructure that we’ll want to build is something that strips markup from an HTML string so that it can be split into tokens, because words like “href” and “li” are about formatting and aren’t part of the true document content. A naive but decently effective and low cost way to do this is to use regular expressions to strip out everything between script and style tags, and then everything between < and >. We’ll also want to replace things like non-breaking space characters with literal spaces. Assuming that we’re working with fairly conventional webpage layouts, the blob of text that we’re left with will include the body of the webpage plus some noise from things like navigation and ads. That’s good enough for our purposes, so we’ll return that and make a mental note that our classification algorithm needs to be good at ignoring some noise.
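As a rough illustration of that regex-based cleanup (a Python sketch, not the actual implementation behind this post), something like the following would do. The name clean_html mirrors the cleanHtml helper listed earlier; the specific patterns are assumptions, and a regex approach like this will stumble on malformed or unusual markup.

```python
import html
import re

# Everything from an opening <script> or <style> tag through its closing tag.
SCRIPT_OR_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# Any remaining tag: everything between < and >.
TAG = re.compile(r"<[^>]+>")
WHITESPACE = re.compile(r"\s+")


def clean_html(raw_html: str) -> str:
    """Strip markup from an HTML string and return the leftover text."""
    text = SCRIPT_OR_STYLE.sub(" ", raw_html)  # drop script/style blocks and contents
    text = TAG.sub(" ", text)                  # drop every other tag
    text = html.unescape(text)                 # decode entities such as &nbsp; and &amp;
    text = text.replace("\u00a0", " ")         # non-breaking spaces become literal spaces
    return WHITESPACE.sub(" ", text).strip()   # collapse runs of whitespace


sample = '<p>Rooney&nbsp;scores <b>again</b></p><script>track("<p>");</script>'
print(clean_html(sample))  # prints: Rooney scores again
```

Tags are replaced with a space rather than an empty string so that words separated only by markup don’t get glued together.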

Break Text into Tokens

Once we have clean text, we’ll want to break it into tokens by splitting on spaces or punctuation and storing the results in a data structure with the number of occurrences of each token. This gives us a handy representation of the document for doing a particular kind of lexical analysis to bubble up the words that matter the most. Again, regular expressions are our friend.
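A minimal Python sketch of that tokenizing step might look like this. The function name mirrors the breakTextIntoTokens helper above; lowercasing the text and the exact token pattern (ASCII letters and digits, with an optional straight-apostrophe suffix) are assumptions made for illustration.

```python
import re
from collections import Counter

# A token is a run of letters/digits, optionally followed by an apostrophe
# suffix, so "don't" stays in one piece.
TOKEN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def break_text_into_tokens(text: str) -> Counter:
    """Lowercase the text, split on spaces/punctuation, and count each token."""
    return Counter(TOKEN.findall(text.lower()))


counts = break_text_into_tokens("The match: United won the match!")
print(counts.most_common(2))
```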

Find the Keywords

Armed with a map of tokens and a count of occurrences for each token, we want to build something that can pick the keywords for the document. Words like “the” and “to” don’t provide any clues about what a document is about, so we want to find a way to focus on keywords. The important words in a document are likely to be repeated, and they’re also not likely to be found often in most other documents about different topics. There’s a nifty algorithm called Term Frequency-Inverse Document Frequency (TF-IDF) that is both easy to implement and does a great job of finding keywords by comparing the frequency of words in a single document with the frequency of words in a corpus of documents.

To make this work we’ll need to start by building a corpus. One option is to bootstrap by crawling a bunch of websites and running the entire set through our initial functions for cleaning and tokenizing. If we’re going to go this route we need to be sure that we’ve got a good mix of pages and not just ones about our topic, otherwise the corpus will be skewed and it will see things that should be keywords as less valuable. A better option in most cases is to use an existing corpus, assuming that one is available for the desired language, and manipulate it into whatever format we want to use for classification.
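To make the TF-IDF idea concrete, here is a small Python sketch. It assumes the corpus is already available as a list of per-document token counts, and it uses one common smoothed variant of the inverse document frequency, log((1 + N) / (1 + df)); other variants exist, and the helper names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def document_frequencies(corpus: List[Counter]) -> Counter:
    """For every token, count how many corpus documents contain it."""
    df = Counter()
    for token_counts in corpus:
        df.update(token_counts.keys())  # each document counts a token once
    return df


def get_token_values(token_counts: Counter, df: Counter, corpus_size: int) -> Dict[str, float]:
    """Score each token in one document by term frequency times smoothed IDF."""
    total = sum(token_counts.values())
    values = {}
    for token, count in token_counts.items():
        tf = count / total                                     # share of this document
        idf = math.log((1 + corpus_size) / (1 + df[token]))    # rarity across the corpus
        values[token] = tf * idf
    return values


corpus = [
    Counter("the match was great".split()),
    Counter("the guitar was loud".split()),
    Counter("the show".split()),
]
page = Counter("the match the rooney".split())
values = get_token_values(page, document_frequencies(corpus), len(corpus))
for token, value in sorted(values.items(), key=lambda item: -item[1]):
    print(f"{token}: {value:.3f}")
```

With this smoothing, a word that appears in every corpus document (like “the” here) scores zero, while a word the corpus has never seen gets the largest boost.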

Classify a Webpage based on Keywords

The next bit is the secret sauce. We know that given any webpage we can extract keywords by doing some prep work and then comparing against a corpus, but given those keywords we need to decide whether a webpage is about a given topic. We need to pick an algorithm that will give us a boolean result that tells us whether a webpage is about our topic. Keep in mind that while we’re setting up our algorithm we have some training examples to work with where we’re given a class; in other words, we know whether they are about the topic or not.

The first option that most people think of is to come up with a mathematical formula to tell whether a webpage matches a topic. We could start by boiling the problem down to how well two specific webpages match each other, using a formula that compares two webpages based on shared keywords. For example we could compute a running similarity total, adding to it the product of the ranking values in each respective page for every keyword that matches. The result would be a scalar value, but we could convert it to a boolean value by coming up with some arbitrary threshold based on experimentation and saying that pages with similarity over our threshold are indeed about the same topic. In practice, this actually works decently well with some exceptions. With these building blocks we could figure out whether a given webpage is about a topic by finding how similar it is to webpages in our training set that are about that topic versus ones that aren’t, and making a decision based on which group has a higher percentage of matches. While this approach may be effective, it has several flaws. First, like Instance Based Learning, it requires comparison with the training set during classification, which is slow at runtime because we have to compare against every training example. More significantly, we would have applied a human element to the algorithm by defining the threshold for a match, and humans aren’t very good at making these kinds of determinations because they can’t process certain kinds of data as quickly as a computer can.
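The hand-rolled formula described above could be sketched in Python roughly as follows. The function names and the 0.5 threshold are made up for illustration; as the paragraph notes, the threshold is exactly the arbitrary, human-chosen part.

```python
from typing import Dict, List

Keywords = Dict[str, float]  # keyword -> ranking value for one page


def similarity(page_a: Keywords, page_b: Keywords) -> float:
    """Running total of the products of ranking values for shared keywords."""
    shared = page_a.keys() & page_b.keys()
    return sum(page_a[word] * page_b[word] for word in shared)


def match_rate(page: Keywords, examples: List[Keywords], threshold: float) -> float:
    """Fraction of examples whose similarity to the page is over the threshold."""
    if not examples:
        return 0.0
    return sum(similarity(page, ex) > threshold for ex in examples) / len(examples)


def classify_by_similarity(page: Keywords, on_topic: List[Keywords],
                           off_topic: List[Keywords], threshold: float) -> bool:
    """True if the page matches a higher share of on-topic than off-topic examples."""
    return match_rate(page, on_topic, threshold) > match_rate(page, off_topic, threshold)


on_topic = [{"manchester": 0.85, "rooney": 0.75, "united": 0.64, "match": 0.5}]
off_topic = [{"muse": 0.9, "bellamy": 0.72, "guitar": 0.72, "cool": 0.48, "show": 0.43}]
new_page = {"rooney": 0.8, "match": 0.6, "goal": 0.5}
print(classify_by_similarity(new_page, on_topic, off_topic, threshold=0.5))
```

Note that every call walks the full training set, which is the runtime cost mentioned above.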

Using machine learning, we can enlist the help of a computer to apply the data to a particular learner that will output a classifier for the domain. Frameworks like Weka offer all kinds of learners that we can try out of the box with our training data to create classifiers. Naive Bayes is one such algorithm that tends to do a great job with text classification. If we use our words with weights as attributes and each website in the training set as an example to train on, a Naive Bayes learner will find probabilistic correlation between the occurrence of words and the topic of a webpage and will output a classifier that is likely to give more accurate results than any algorithm that a human could come up with in a reasonable amount of time.
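To show what a Naive Bayes learner is doing under the hood, here is a tiny hand-rolled version in Python. To be clear, this is not Weka’s implementation or API; it’s a simplified multinomial-style sketch that treats keyword weights as fractional counts and applies add-one (Laplace) smoothing, and the class and method names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[Dict[str, float], bool]  # (keyword weights, is the page on topic?)


class NaiveBayesClassifier:
    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha  # smoothing so unseen words never get zero probability
        self.token_totals = {True: defaultdict(float), False: defaultdict(float)}
        self.class_counts = {True: 0, False: 0}
        self.vocabulary = set()

    def train(self, examples: List[Example]) -> None:
        """Accumulate per-class keyword weight totals from labeled examples."""
        for weights, label in examples:
            self.class_counts[label] += 1
            for token, weight in weights.items():
                self.token_totals[label][token] += weight
                self.vocabulary.add(token)

    def _log_score(self, weights: Dict[str, float], label: bool) -> float:
        total_examples = sum(self.class_counts.values())
        # Smoothed class prior, so a class with no examples doesn't hit log(0).
        score = math.log((self.class_counts[label] + 1) / (total_examples + 2))
        denominator = (sum(self.token_totals[label].values())
                       + self.alpha * len(self.vocabulary))
        for token, weight in weights.items():
            if token not in self.vocabulary:
                continue  # words never seen in training carry no evidence
            likelihood = (self.token_totals[label].get(token, 0.0) + self.alpha) / denominator
            score += weight * math.log(likelihood)
        return score

    def classify(self, weights: Dict[str, float]) -> bool:
        """True if the on-topic class is the more probable one."""
        return self._log_score(weights, True) > self._log_score(weights, False)


classifier = NaiveBayesClassifier()
classifier.train([
    ({"manchester": 0.85, "rooney": 0.75, "united": 0.64, "match": 0.5}, True),
    ({"muse": 0.9, "bellamy": 0.72, "guitar": 0.72, "cool": 0.48, "show": 0.43}, False),
])
print(classifier.classify({"rooney": 0.7, "match": 0.4}))
print(classifier.classify({"guitar": 0.8, "show": 0.3}))
```

Unlike the similarity approach, classification here only consults the totals learned during training, not the training examples themselves.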

Wiring it Up

So how do we wire these pieces together, and what does it look like to consume the finished product? Let’s suppose that we want to be able to tell whether a website is about soccer. We start by creating a whitelist of websites that we know produce editorial content about soccer. We’ll also want to create a blacklist of sites that produce content about world news, technology, rock and roll, ponies, and anything that isn’t soccer. We throw together a function that crawls the sites and for each example we infer a class based on the source (we may be wrong in some edge cases, but in general we’re going to assume that the sites in our whitelist/blacklist are producing content that is or isn’t soccer across the board). We run the webpages through our cleaning, tokenizing, and ranking functions and we end up with training examples that look like the following contrived ones:

  • foo.com – True. Manchester (.85), Rooney (.75), United (.64), match (.5).
  • bar.com – False. Muse (.9), Bellamy (.72), guitar (.72), cool (.48), show (.43).

Getting Weka to speak our language may require massaging the examples into ARFF or some other format that the framework understands, but at this point we can directly apply the training set to the learner. For subsequent webpages we run them through the same functions to get ranked keywords, and then we pass the new example into the classifier and we’re given a boolean result. Magic.
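As an illustration of the massaging step, the following Python sketch turns training examples like the contrived ones above into dense ARFF text. The relation name, the is_topic class attribute, and the to_arff helper are all made-up names, and the sketch assumes tokens are plain lowercase words that need no quoting.

```python
from typing import Dict, List, Tuple

Example = Tuple[Dict[str, float], bool]  # (keyword weights, is the page on topic?)


def to_arff(relation: str, examples: List[Example]) -> str:
    """Render examples as ARFF: one numeric attribute per token plus a class."""
    tokens = sorted({token for weights, _ in examples for token in weights})
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {token} numeric" for token in tokens]
    lines += ["@attribute is_topic {true,false}", "", "@data"]
    for weights, label in examples:
        # Tokens missing from a page get a weight of 0.0.
        row = [str(weights.get(token, 0.0)) for token in tokens]
        row.append("true" if label else "false")
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


examples = [
    ({"manchester": 0.85, "rooney": 0.75, "united": 0.64, "match": 0.5}, True),
    ({"muse": 0.9, "bellamy": 0.72, "guitar": 0.72, "cool": 0.48, "show": 0.43}, False),
]
arff_text = to_arff("soccer", examples)
print(arff_text)
```

With a real vocabulary of thousands of words, most values in each row would be zero, so ARFF’s sparse row format would likely be a better fit than the dense rows shown here.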

Simple Optimization

Note that we only used the words in the body of a webpage, but in the real world we would have access to more data. We could potentially look at the hints provided in the page title, meta tags, and other tags like heading/bold or bigger font sizes and make some assumptions about the importance of the words (of course we have to do this before stripping tags out). If we could get our hands on link text for sites that link to the article, that could also serve as valuable input, although search engines aren’t making it as easy to access link data from their indexes these days. We could use this additional data to either augment our Naive Bayes learner using arbitrary weights, or use more complex learners like a Perceptron or a Support Vector Machine to try to let the computer decide how important we should consider these other inputs to be. It’s certainly possible that for some topics other kinds of learners may produce better results. Or we could investigate ways to use learners in combination (via Bagging or Boosting, for example) to get better accuracy than any single learner.
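One simple way to fold in those extra hints with arbitrary weights is to count a word more heavily when it shows up in a more prominent part of the page. The Python sketch below assumes the title, headings, and body text have already been extracted into separate strings; the field names and the multipliers are purely illustrative guesses.

```python
import re
from collections import Counter
from typing import Dict

# Arbitrary, hand-picked multipliers: a word in the title counts three times
# as much as the same word in the body.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}
TOKEN = re.compile(r"[a-z0-9]+")


def weighted_token_counts(fields: Dict[str, str]) -> Counter:
    """Count tokens across page fields, scaling each count by its field weight."""
    counts = Counter()
    for field, text in fields.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)  # unknown fields count like body text
        for token in TOKEN.findall(text.lower()):
            counts[token] += weight
    return counts


page_fields = {
    "title": "Rooney scores",
    "heading": "Match report",
    "body": "Rooney scores again in a tight match",
}
weighted = weighted_token_counts(page_fields)
print(weighted.most_common(3))
```

The resulting counts can feed the same keyword ranking step as before, just with prominent words given a head start.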

Conclusion

Classifying webpages by topic is a fairly common example of a problem that can be solved by an algorithm created by either a human or a computer. My aim in this post was to provide a quick look at one way to attack the problem and to touch on some very introductory machine learning concepts. If you’re interested in reading more about machine learning specifically there are countless resources online and some great books available on the subject. Hope you found the post helpful. Classify away, and I’ll certainly look forward to hearing back from any readers who have tackled similar challenges or want to provide feedback!

How the Cloud Saved Me from Hacker News

If you’re reading this post, we probably have one thing in common: we both spend at least some of our free cycles perusing Hacker News. I know this because it has driven most of my blog traffic over the past week. I have a habit of submitting my recent blog posts, and the other day I was surprised to see one particular post climb to number three on the Hacker News homepage. My excitement quickly gave way to panic, however, as I realized that the sudden rush of traffic had taken my blog down in the middle of its shining hour.

Back up a couple of months. I started blogging back in 2009 on Blogspot. At some point I was tempted by the offering of a free AWS EC2 Micro Instance; I had been thinking about setting up a private Git server and running a few other servers in the cloud, and I decided that, like plenty of other bloggers, I would migrate my blog to self-hosted WordPress on EC2. The whole migration was rather painless. I’ll spare the monotonous details because there are quite a few blog posts out there on getting the setup up and running and on how to move content. I will say that the one issue I ran into was with the existing BitNami AMIs preinstalled with WordPress, so I ended up picking a vanilla Ubuntu AMI and installing LAMP + WordPress myself. Suffice it to say that I’m still relatively new to the Linux world, and I pulled it off without much trouble.

But now, my blog was down. Fortunately I was able to cruise over to AWS Management Console and stop my EC2 Instance, upgrade temporarily to a Large Instance, restart it, and then update my Elastic IP. Just like that I was back in business, and my blog that previously got a couple hundred hits on busy days suddenly fielded over 20k hits in a day and another 6k over the next few days.

I figured I would throw together a quick post on my experience for a few reasons. First, because some folks who post to Hacker News may not have an idea of exactly what to expect if they make the homepage. Read: If you’re EC2 hosted, upgrade your Instance size ahead of time. And second, I just wanted to marvel at the power of the cloud. A decade and a half ago I remember ordering a physical Dell rack server and hauling it over to a local ISP where I colocated it for a couple hundred bucks a month and used it to host a few websites and custom applications. The fact that I can now spin up a virtual machine in the cloud in minutes, have my software stack up and running in less than an hour, and instantly scale to accommodate huge traffic variance (and all for cheap) is a testimony to the infrastructure underneath modern cloud offerings.