The New Platform War

There’s a new battle raging for customer eyeballs, application developers, and ultimately… dollar signs. To set the stage, flash back to the first platform war: the OS. Windows sat entrenched as the unassailable heavyweight, with Linux and Mac OS barely on the scene as fringe contenders. But the ultimate demise of Windows’ platform dominance didn’t come from another OS at all; it came from the move to the browser. Microsoft initially saw the problem and nipped it in the bud by packaging IE with Windows, then tried to prolong the inevitable by locking the IE team away in a dark basement and trying to stifle browser innovation by favoring closed solutions for browser development like Silverlight instead of open standards like HTML 5. That strategy clearly wouldn’t work forever, and the net result was a big boost in the market share of competing browsers like Firefox and ultimately Chrome. Suddenly people weren’t writing native Windows apps anymore, they were writing applications that ran in the browser and could run on any OS.

The pattern of trumping a dominant platform by building at a higher level has repeated itself many times since. In some sense Google subverted platform power from the browser by becoming the only discovery mechanism for browser apps. When social burst onto the scene Facebook and Twitter became king of the hill by changing the game again. The move to mobile devices has created a bit of a flashback to the days of OS platform dominance, but it’s inevitably a temporary shift. At some point history will repeat itself as devices will continue to become more powerful, standards will prevail, and developers will insist on a way to avoid writing the same app for multiple platforms.

Which brings us to today, as the platforms du jour are again threatened. In this iteration the challengers to the dominance of Facebook and Twitter are the domain-specific social apps that are built on top of them. When social network users share their status with friends, text + images + location isn’t enough anymore. Different kinds of activities call for customized mechanisms of data entry and ways to share the data that are tailored for the experience. For instance, when I play 18 holes of golf I enter and share my data with Golfshot GPS, which makes data entry a joy by providing yardages and information about the course and gives my friends the ability to see very granular details on my round when I share. When I drink a beer I share with Untappd, when I eat at a restaurant I share a Yelp review, and if I want to share a panoramic view I use Panorama 360. Even basic functions like sharing photos and location work better with Instagram and Foursquare than with Facebook’s built-in mechanisms.

The social networks will never be able to provide this kind of rich interaction for every experience, and they shouldn’t attempt to. At the same time they run the risk of the higher level apps becoming the social network and stealing eyeballs; a position which some apps like Foursquare clearly already have their eyes on. For power users these apps have already made themselves the place to go and enter domain specific data. That trend will continue to expand into the mainstream as people continue to dream up rich ways to capture real life experiences through customized apps. To use the OS analogy: there’s no way that Microsoft can dream up everything that people want to build on top of Windows and bake it into the OS, nor would it be a good thing for consumers if they could.

It will be interesting to see how Facebook and Twitter respond to the trend. I suspect that users will continue to move towards domain specific apps for sharing, but that the social networks will remain the place to browse aggregated status for friends across specific domains. Unless, of course, the owners of the highest profile apps somehow manage to get together and develop an open standard for sharing/storing data and create an alternative browse experience across apps to avoid being limited by the whims of Facebook and Twitter and the limitations on their APIs.

A Strategy For The Dreaded Interview Coding Question

If you’re a software developer and you’re thinking about changing jobs, you’re probably at least a bit anxious (if not downright freaked out) about the prospect of facing a whiteboard armed with only a trusty dry erase marker and your wits while an interviewer fires a coding question at you. That’s not shocking, because software development interviews are weird: the skills necessary to answer the technical and behavioral/situational questions that are asked don’t necessarily map 1:1 to the skills needed to be a good developer. We’re used to developing with access to tools like IDEs and Stack Overflow, without unnatural time constraints or the pressure of a job offer hanging in the balance. I’ve interviewed literally hundreds of candidates in my roles as a manager at both Microsoft and Amazon, and I’ve seen hundreds bomb coding questions. That doesn’t shock me for the reasons previously mentioned, but what does shock me is the number of bright folks who fail on the questions simply because they don’t approach them with a solid strategy.

The anti-patterns are crystal clear and they almost always lead to a wipe, for example: diving straight in on code, assuming that language/syntax doesn’t matter, or failing to consider edge cases before implementation. To avoid these pitfalls, I recommend that every interviewing developer rehearse the following strategy before going into interviews and then apply it (without fail, no matter how simple the question seems) during the process.

Restate the problem and ask clarifying questions.

Repeating the problem in your own words and asking some follow up questions only takes a second, and it’s a good way to quickly tease out any bad assumptions that have been made. It also gives the interviewer confidence that you’re used to attacking real world coding tasks the right way: being sure that you’ve correctly interpreted requirements and thinking through questions that impact various potential approaches. Ask how important optimization is instead of just assuming that implementing the naive solution is bad. Ask what you should optimize for, for example quickest execution speed or smallest memory footprint.

Walk through a basic example in detail and consider a few edge cases.

Take the time to think through at least one straightforward case, as well as a few relevant edge cases. Talk through your thought process as you’re going through them, utilizing the whiteboard as much as you can. Consider null or zero length inputs. Consider very large inputs, and be prepared to answer questions about whether your implementation would fit in memory on specific hardware given specific inputs. The process of walking through these cases should get you very close to pseudocode.

Write up pseudocode.

Be sure that you’re not writing in a real programming language. Pick a spot on the board where you won’t have to erase your pseudocode when you start to write real code, and where you’ll still be able to read it. Lots of interview questions require thinking about recursive versus iterative implementations, so if that choice is relevant to the problem it doesn’t hurt to consider which approach is in play. Don’t abandon the pseudocode to dive into real code until you have completed the problem. Be sure to continue the dialogue with the interviewer while you’re thinking, and show that you can listen and course correct given hints.

Pick a language, and ask how important syntax is.

Always assume that for the actual implementation, the interviewer cares about the details. I’m generally not a stickler for small syntactical minutiae, but I get annoyed when an interviewer just assumes that it’s alright for the final implementation to be in pseudocode or some hodge-podge of languages. If you decide to code in a language other than the one that you indicated you’re most comfortable with on your resume, be sure to explain why. Asking how much the interviewer cares about syntax can help you decide whether to take an extra pass at the end to be sure that everything is spot on; if the interviewer doesn’t care they may see it as a waste of precious time.

Code it!

You’ve done all the hard work; getting from pseudocode to your language of choice should be fairly trivial.
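
As a contrived illustration (a hypothetical question, not one pulled from any particular company’s interview bank), suppose you were asked to return the first non-repeated character in a string. The pseudocode left on a corner of the board and the final Java might look roughly like this:

// Pseudocode:
//   count the occurrences of each character in one pass
//   walk the characters in order, return the first one with a count of 1
//   complain if every character repeats
public static char firstNonRepeated(String input) {
    if (input == null || input.isEmpty()) {
        throw new IllegalArgumentException("input must be non-empty");
    }
    java.util.Map<Character, Integer> counts = new java.util.LinkedHashMap<Character, Integer>();
    for (char c : input.toCharArray()) {
        Integer current = counts.get(c);
        counts.put(c, current == null ? 1 : current + 1);
    }
    // LinkedHashMap preserves insertion order, so this finds the first non-repeated character.
    for (java.util.Map.Entry<Character, Integer> entry : counts.entrySet()) {
        if (entry.getValue() == 1) {
            return entry.getKey();
        }
    }
    throw new IllegalArgumentException("every character repeats");
}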

It’s important to remember that a typical interview in a loop will run 45-60 minutes, and most interviewers are going to want to touch on more than a single coding question. The expectation is most likely that you can complete the question in 20-30 minutes, so be sure that you’re not spending a ton of time on each step. A lot of interviewers will tell you that they’re not necessarily looking for you to get to a fully working implementation, but don’t believe them. If you don’t get code up on the board or stall out they will definitely ding you. Don’t spend more than a few minutes restating the question and walking through edge cases. The bulk of your time should be spent in an even split between pseudocode and code.

The beauty of following this strategy is that you will come across as organized and informed even if you don’t immediately see the full solution. It also provides an opportunity to work with the interviewer through follow up questions while running through examples and pseudocoding. Remember, the interviewer knows the answer to the question and probably wants to give you hints as you move in the right direction, so engaging them and using them as a resource is critical. Hope that these ideas help; now go nail that interview loop!

The Physical Versus The Digital

I don’t want to buy things twice. I’m even more hesitant to pay again for intellectual property, which costs little or nothing to clone. I don’t want to buy Angry Birds for my iPhone, Kindle Fire, PC, and Xbox 360. I’m even crankier about buying digital goods when I’ve already bought the IP via physical media. I want the convenience of reading my old college text books on my Kindle without buying them again, and I shouldn’t have to. I hate the dilemma of trying to figure out whether to order my grad school textbooks digitally (because it’s lightweight, convenient, and portable) or not (because the pictures render properly, it’s handier to browse, and looks cooler on the shelf). Maybe I’m in the minority here, but I’m also too lazy to consider buying and setting up a DIY Book Scanner.

Anyone who reads, plays games, or listens to music has shelves or boxes of books, NES cartridges, or CDs that they probably don’t use often and don’t know what to do with. I would love the option to fire up RBI Baseball or reread Storm of Swords on modern devices with the push of a button, but it’s not worth storing the physical media and/or keeping obsolete devices around.

My frustration has caused me to conclude the relatively obvious: some company needs to offer a way to send back physical media along with a nominal fee in trade for the digital version. The physical media could be resold second hand or donated to charitable causes, and the folks ditching their physical media could access the things that they have already paid for in a more convenient format. Amazon is the one company that seems poised to make this happen given that they deal in both physical/digital and they have efficient content delivery mechanisms in place for goods of both kinds. Is there a financial model that makes swapping physical for digital work for all parties involved, and is it something that will ever happen?

Great Documents: A By-Product of Effective Software Development Managers

If you’re managing people who develop software you should probably be spending a nontrivial portion of your time writing documents, and the quality of those documents is critical. Documents matter because there are several questions that every manager needs to answer for their management chain:

  • What’s the current state of the union?
  • Where are we headed over the next 12-36 months?
  • What level of staffing do we need to achieve that vision?
  • Are my employees compensated appropriately?

Managers also need to answer a related and partially overlapping set of questions for their employees:

  • Where is the team headed?
  • What’s my role in helping us get there?
  • How have I been performing?
  • How can I improve my performance, and grow my career?

Several tools are available to answer these questions in the modern business setting, but none are as effective as written documents. Face to face conversations or meetings are less efficient, more random, and can’t be archived for easy consumption after the fact. PowerPoint presentations require the context of a speaker (if they don’t you’re abusing PowerPoint and giving bad presentations) who is only presenting to a limited audience, so they share the archive problem of conversations. Videos or audio recordings of either conversations or presentations are impossible to quickly scan, and it’s more difficult for the person consuming the information to backtrack or skip around as needed. Email or instant messenger conversations are less formal and rigorous than documents by convention, so they allow for glossing over areas where deeper thought and investment is essential. Put simply: creating documents forces a manager to codify thoughts or ideas into an artifact that is easy for others to parse at any point down the road, with an effectiveness that no other process can duplicate.

I didn’t realize the value of written documents until I started working at Amazon almost a year ago. It’s very common at Amazon to walk into an hour long meeting and spend the first 20 minutes in silence reading and marking up a hard copy of a particular document, and spend the remaining 40 minutes discussing it. Initially I found that odd, until I went through the exercise of preparing my first long range planning document for my team and getting it iteratively reviewed by my team, peers, and various levels of my management chain. It took a lot of work, but all of that work ended up being hugely beneficial. We spent extra time meeting with customers to update requirements and get product feedback, held brainstorming sessions with team members and senior engineers who were interested in the space, and did some analysis on the cost of operations and ways that we could optimize some of that overhead. The final product was a 6 page plan that I could hand to anyone and rest assured that after a few minutes of reading they would have a great feel for what my team is up to, and why.

As a quick aside, this is a great example of why I previously wrote encouraging both software developers and managers to change companies/environments over the course of their career. There are a lot of things that I learned at Microsoft that I never could have learned at Amazon. There is an equally long list of things that Amazon does really well that I wouldn’t have learned anywhere else. Throwing yourself in a totally different neck of the woods provides a unique opportunity to grow in areas that you couldn’t have developed by staying put.

Back to documents. To clarify, I’m not talking exclusively about technical documentation like functional specifications, design documents, or test plans. I’m more focused on things like vision documents (which direction things are headed), long range planning documents (the nitty gritty on how to move in that direction), and documents about things like employee growth or promotion readiness. The naysayers will argue against the value of these kinds of documents because they aren’t part of what ultimately ships to the customer: the bits and bytes that get burned to a disc or deployed to a server somewhere. I would argue the exact opposite. For example, taking the time to produce a long range plan that you can hand to engineers, customers, and partners can help you avoid building meaningless features, and help customers and partners give earlier feedback on where you’re headed. Similarly, taking the extra time to prepare a document evaluating an employee’s readiness for promotion is a great way to keep that employee apprised of growth areas, ensure that the employee is happy in their career progression and nip problems in the bud, and in the end save you from reduced productivity while back-filling for the attrition of an unhappy team member.

So without further ado, here are a few tips that I think will make you better at producing high quality documents:

Define your audience before you start.

In most cases it’s not possible to effectively address multiple audiences in a single document. Before you put pen to paper, define your audience. If the audience seems too broad, consider writing multiple documents instead of one. For example, if you’re writing a document that lays out what your team will deliver over the next 12 months, it may be appropriate to have two flavors of the doc: one for your management chain, and one for your customers. Your managers may want to know some of the dirty details like how much time your team spends on operations or how much test debt you need to backfill, but your customers may only care about what new incremental features your team will deliver. I’ve also seen cases where authors don’t define their audience right out of the gate, and the end result is a document that’s not really meaningful to any group of people.

Make bold claims. Don’t use weasel words. Be specific about dates.

Weasel words kill the impact of a document, and are a mechanism to avoid hard thinking or research that needs to happen. Consider a sentence like “Implementing this feature will represent a significant win for customers.” The sentence raises the questions: what feature, how significant, and what kind of win? Now consider the impact of rewriting it to “Implementing the features to allow multi-tenancy will allow customers to reduce the size of their fleet by 50%, resulting in a million dollar reduction in TCO.” Note that getting from A to B requires a lot of research, but the result is a statement that is much more impactful and makes it easier to gauge the value of the feature in question.

It’s equally important to be specific about dates. For example when you read something like “The feature will be completed later this year”, you should automatically ask the question: when this year? Are we talking next week, or late December? If my team has a dependency on your feature, I’ll need some more granular deets. If it’s impossible for some reason to provide a date, then provide a date by which you’ll have a date.

Finish Early, Allow Bake Time

This is critical. If your document is due in 3 weeks, plan to complete it in 1. Before you write the document you should identify peers that you want to read it, ping them to be sure that they block out time to do so, and then be sure to get them a copy on schedule. Consider iterative rounds of reviews with different groups of people that are stakeholders for the document. For example if you’re a line manager creating a vision document for your team you may want to start by getting it reviewed with a few of your peers and senior engineers, then take it to folks up your management chain, and then review the document with a few key customers. In my experience the resulting document is often drastically different (read: drastically better) than the original version.

Review, and review again. Use hard copy.

On a similar note, review your work often. Don’t write a document in one shot and call it good. When you finish it, step aside for a day and then read it afresh. Print the document out and review it in hard copy, pen in hand (and then go plant a tree). Staring at a piece of paper puts your brain in a different mindset than staring at Word on the computer screen. When you’re staring at your screen your mind is thinking rough draft, or work in progress. When you’re staring at ink on paper your mind is thinking finished product. You’re more likely to be a good editor of your document in the latter mode.

Conclusion

This isn’t an exhaustive list by any means, but it does include the tips that I’ve personally found to have the biggest impact on document quality. At some point I may put together a follow up list with some additional ideas on writing docs that I’ve excluded from this post. I personally apply these ideas to everything I write, including blog posts like this one. I hope that you find this helpful, and if you have additional ideas on either the value of documents or how to produce great ones I would love to hear them in the comments!

The Perl Script That May Save Your Life

I had a major “oh !@#$” moment tonight. While playing around with Maven, M2Eclipse and moving some project folders around I hastily hammered out a “sudo rm -R” and realized seconds later that I had blown away some code that I wrote last night that wasn’t in version control. All deleted. Not cool.

Fortunately I stumbled on this simple yet life saving article + perl script that greps the raw contents of /dev/sda* for a particular string that was in your file and prints the contents around it:

#!/usr/local/bin/perl
# Scan the raw block device for a string from the lost file (run as root).
# Note: a match that straddles a 4096-byte chunk boundary will be missed.
open(DEV, '/dev/sda1') or die "Can't open: $!\n";
while (read(DEV, $buf, 4096)) {
  # Print the byte offset and the surrounding 4KB chunk whenever the string matches.
  print tell(DEV), "\n", $buf, "\n"
    if $buf =~ /textToSearchFor/;
}
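
If you find yourself in the same spot, one sensible way to run it (script name and mount point hypothetical) is as root with output redirected to a different disk than the one you’re scanning, so you don’t overwrite the very blocks you’re trying to recover:

sudo perl recover.pl > /mnt/usb/recovered.txt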

A quick run and a few minutes later I had my code back in one beautiful piece again. Mad props to chipmunk!

Why You’re Missing the Boat on Facebook Stock

I was about 2 hours into a 5 hour drive en route to an annual weekend golf trip when Facebook went public. That made me a captive audience for the 70-something year old family friend (who admittedly is a sharp cookie at his age and a damn good golfer) in my back seat as he lectured the rest of us on why the stock would be worthless in five years. In the weeks since I’ve heard a million flavors of the same message from people whose tech savvy ranges from expert hacker to completely clueless. I respectfully disagree, and I think that there is a compelling technical argument that can be made for why Facebook has tremendous upside as a company. So let’s consider the question: should we all be buying Facebook stock at post IPO prices?

The Completely Tangential Bit

The first answer that I get from most folks is no, because Facebook adds no real value to people’s lives. In fact in some ways the result of the company’s existence is a net negative because it causes people to waste massive amounts of time and/or productivity. The company doesn’t produce goods or real services, and some would argue that it’s just a glorified LOLcats. I actually kind of agree, but I don’t think that it matters. What Facebook does produce as a sort of byproduct is an absolutely massive repository of personal data. More on that later.

The Red Herring

The next objection that people raise is based on an assumption that the primary way to monetize the website is ads. The company has certainly toyed with all kinds of ways of putting paid content in front of users, and the early returns seem to indicate that Facebook’s ads don’t work (at least not compared to Google’s paid search advertising). It doesn’t take a rocket scientist to realize that social pages are a whole different beast than search results pages. When people visit Google their intent is to navigate to another page about a topic. They don’t particularly care whether the link that takes them there is an algorithmic search result or a paid ad, they’re just looking for the most promising place to click. When people visit their BFF’s Facebook page they aren’t looking to leave the site, they’re planning on killing some time by checking what their friends are up to. So again on this point I agree; I’m skeptical that Facebook will ever see the kind of crazy revenue growth from ads or any sort of paid content on their site that would justify even the current stock price. But advertising is just one way to skin a cat…

The Glimmer of Hope

But slightly off the topic of ads, and in the related space of online sales and marketing, is where the first signs of promise can be found. Let’s get back to that data thing: Facebook has an absolute gold mine of knowledge that other companies would pay cold hard cash to access. Consider Amazon, for example. Amazon spends plenty of money mining user data to make more educated recommendations based on past purchase history. What would it be worth to them if they could find out that I have an 8 month old daughter, so I need to buy diapers on a regular basis? That I love Muse, so I may be interested in purchasing and downloading their new album? That I checked in at CenturyLink Field for a Sounders match last week, so maybe they can tempt me with a new jersey? Those are some of the more obvious suggestions, but there are actually more elaborate scenarios that could be interesting. What if you could combine Amazon purchase data with Facebook social graphs and figure out that three of my friends recently bought a book on a topic that I’m also interested in, and then offer those friends and me a discount on a future purchase if I buy the book as well?

Facebook’s current market cap as I’m writing this is sitting at $57 billion. To get to a more reasonable price-to-earnings multiple of 20, which seems relatively in line with other growth companies in the industry, they need to add around $2 billion in annual earnings ($57 billion at a multiple of 20 implies roughly $2.85 billion in annual earnings). Based on the numbers that I could dig up, that’s less than 1% of online sales in the US alone. Is that possible? Consider the margins of the biggest online retailer. Amazon is legendary for operating on razor thin margins, but their US margins last year were around 3.5%. How much of that margin would they part with for ultra meaningful personalization data that could have a huge positive impact on sales volume? Also, keep in mind that these numbers are for the US only, and they don’t include the astronomical projected growth in online sales moving forward. Regardless of exactly what the model looks like, I think there is a path for Facebook to leverage their data to grab some small piece of that growing pie.

The privacy hawks out there are already sounding alarms, I can hear them from where I’m sitting. But who says that there isn’t a model of sharing data that Facebook users would be happy with? I would venture that there are arrangements where users would be happy to share certain kinds of information to get a more relevant shopping experience. Taking things one step further, there are certainly users who would expose personal information in exchange for deals or rebates that online retailers like Amazon could kick back as an incentive to get the ball rolling, and Amazon isn’t one to pass on a loss leader that drives business with a long term promise of return on investment.

The Real Diamond In The Rough

And that gets us to the crux of the matter. Online sales are just one example of a market that Facebook can get into and leverage its data to make a buck. The evolution of computer hardware, the maturity of software that makes it trivial to perform distributed computation in the cloud, and continued advances in machine learning have ushered in the age of big data. Computer scientists who specialize in machine learning and data mining are being recruited to solve problems in every field from pharmaceuticals to agriculture. And the currency that these scientists deal in is huge amounts of data. Facebook has data in spades, and it has a very valuable kind of data that nobody else has.

The model for monetizing that data isn’t clear yet, but I can think of possibilities that make me optimistic that good models exist. For example think about the kind of money that Microsoft continues to pour into improving Bing and leapfrogging Google’s relevance to become the leader in online search. Facebook’s data could be an absolutely massive advantage in trying to disambiguate results and tailor content to a particular user. Google’s SPYW bet and Bing’s Facebook integration are different approaches on trying to integrate bits of social data into search, but they fall way short of the kind of gain that could be had via direct access to Facebook’s massive amount of social data.

Or suppose that a company or government body is trying to gain information about the spread of a particular disease. Maybe they have medical records that include the identities of people who are carriers, but not much more than that. If they had access to Facebook’s data they could suddenly know about the ethnicity, social network (who’s hanging out with who), and habits (through check-ins) of people in both classes: carriers and non-carriers. Applying machine learning to that training set may yield some interesting information on what traits correlate with becoming a carrier of the disease.

The One Armed Bandit

Of course, there’s a risk involved. As a friend of mine aptly pointed out, my case for Facebook’s value looks something like: 1) have a lot of important data, 2) mystery step, 3) profit. I would argue that if the mystery step was clear today, the valuation of Facebook stock would be much higher than even where it’s currently trading. I’ve given a few fictional examples to make the case that the mystery step probably exists. If you buy that argument, then you too should be buying Facebook stock. And this bar may be serving some expensive drinks in the future.

How To Automagically Classify Webpages By Topic

Ever wondered how you can automate the process of figuring out whether a webpage is about a particular topic? I’ve spent some time recently on a side project that involved solving a flavor of that exact problem. I’ve seen several related questions on Stack Overflow and other sites, so I thought I would throw together a quick post to describe bits of my implementation.

For our example, let’s suppose that we need to implement something that exposes an API like the following:

  • boolean classifyWebpage(Webpage webpage)
  • void trainClassifier(Map<Webpage, Boolean> trainingExamples)

We will mandate that consumers call the function to train the classifier once with a training set before they call the function to evaluate whether a webpage is about our topic. Our trainClassifier function takes a bunch of webpages, each labeled with whether or not it is about the given topic, to use as training examples. Our classifyWebpage method takes a webpage and returns true if the webpage is about the topic and false if it isn’t. To achieve this, we’ll implement a few helper functions:

  • String cleanHtml(String html)
  • Map<String, Integer> breakTextIntoTokens(String text)
  • Map<String, Float> getTokenValues(Map<String, Integer> tokenCounts)

Let’s look at how we can implement some of these pieces in detail.

Cleaning up HTML

The first piece of infrastructure that we’ll want to build is something that strips markup from an HTML string and splits it into tokens, because words like “href” and “li” are about formatting and aren’t part of the true document content. A naive but decently effective and low cost way to do this is to use regular expressions to strip out script and style tags along with their contents, and then everything between < and >. We’ll also want to replace things like non-breaking space characters with literal spaces. Assuming that we’re working with fairly conventional webpage layouts, the blob of text that we’re left with will include the body of the webpage plus some noise from things like navigation and ads. That’s good enough for our purposes, so we’ll return that and make a mental note that our classification algorithm needs to be good at ignoring some noise.
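
A rough sketch of cleanHtml along those lines (regex-based HTML stripping is famously fragile, so treat this as the quick-and-dirty approach described above rather than a robust parser):

static String cleanHtml(String html) {
    // Strip script and style blocks along with their contents.
    String text = html.replaceAll("(?is)<script.*?</script>", " ");
    text = text.replaceAll("(?is)<style.*?</style>", " ");
    // Strip any remaining tags.
    text = text.replaceAll("<[^>]+>", " ");
    // Replace non-breaking spaces and collapse runs of whitespace.
    text = text.replace("&nbsp;", " ");
    return text.replaceAll("\\s+", " ").trim();
}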

Break Text into Tokens

Once we have clean text, we’ll want to break it into tokens by splitting on spaces or punctuation and storing the results in a data structure with the number of occurrences of each token. This gives us a handy representation of the document for the kind of lexical analysis that will bubble up the words that matter the most. Again, regular expressions are our friend.
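
A corresponding sketch of breakTextIntoTokens, splitting on anything that isn’t a letter or digit and lower-casing as it goes (one of many reasonable tokenization choices):

import java.util.HashMap;
import java.util.Map;

static Map<String, Integer> breakTextIntoTokens(String text) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
        if (token.isEmpty()) {
            continue;
        }
        // Bump the occurrence count for this token.
        Integer current = counts.get(token);
        counts.put(token, current == null ? 1 : current + 1);
    }
    return counts;
}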

Find the Keywords

Armed with a map of tokens and a count of occurrences for each token, we want to build something that can pick the keywords for the document. Words like “the” and “to” don’t provide any clues about what a document is about, so we want to find a way to focus on keywords. The important words in a document are likely to be repeated, and they’re also not likely to be found often in most other documents about different topics. There’s a nifty algorithm called Term Frequency-Inverse Document Frequency (TF-IDF) that is both easy to implement and does a great job of finding keywords by comparing the frequency of words in a single document with the frequency of words in a corpus of documents.

To make this work we’ll need to start by building a corpus. One option is to bootstrap by crawling a bunch of websites and running the entire set through our initial functions for cleaning and tokenizing. If we’re going to go this route we need to be sure that we’ve got a good mix of pages and not just ones about our topic, otherwise the corpus will be skewed and it will treat things that should be keywords as less valuable. A better option in most cases is to use an existing corpus, assuming that one is available for the desired language, and manipulate it into whatever format we want to use for classification.
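
With a corpus in hand, getTokenValues can weight each token with the classic TF-IDF formula: the token’s frequency within the document times the log of the total number of corpus documents divided by the number of corpus documents containing the token. A sketch of that calculation (here I pass the corpus statistics in as explicit parameters just to keep the example self-contained, rather than using the single-argument helper listed above):

import java.util.HashMap;
import java.util.Map;

static Map<String, Float> getTokenValues(Map<String, Integer> tokenCounts,
                                         Map<String, Integer> corpusDocFrequency,
                                         int corpusDocCount) {
    int totalTokens = 0;
    for (int count : tokenCounts.values()) {
        totalTokens += count;
    }
    Map<String, Float> values = new HashMap<String, Float>();
    for (Map.Entry<String, Integer> entry : tokenCounts.entrySet()) {
        // Term frequency: how often the token appears in this document.
        float tf = entry.getValue() / (float) totalTokens;
        // Inverse document frequency: tokens that are rare in the corpus score higher.
        // Add 1 to the document frequency so unseen tokens don't divide by zero.
        Integer df = corpusDocFrequency.get(entry.getKey());
        float idf = (float) Math.log(corpusDocCount / (1.0 + (df == null ? 0 : df)));
        values.put(entry.getKey(), tf * idf);
    }
    return values;
}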

Classify a Webpage based on Keywords

The next bit is the secret sauce. We know that given any webpage we can extract keywords by doing some prep work and then comparing against a corpus, but given those keywords we need to decide whether the webpage is about a given topic. We need to pick an algorithm that will give us a boolean result that tells us whether a webpage is about our topic. Keep in mind that while we’re setting up our algorithm we have some training examples to work with that already come with a class; in other words, we know whether each one is about the topic or not.

The first option that most people think of is to come up with a mathematical formula to tell whether a webpage matches a topic. We could start by boiling the problem down to how well two specific webpages match each other by coming up with a mathematical formula to compare two webpages based on similar keywords. For example we could compute a running similarity total, adding to it the product of the ranking values in each respective page for keywords that match. The result would be a scalar value, but we could convert it to a boolean value by coming up with some arbitrary threshold based on experimentation and saying that pages with similarity over our threshold are indeed about the same topic. In practice, this actually works decently well with some exceptions. With these building blocks we could figure out whether a given webpage is about a topic by finding how similar it is to webpages in our training set that are about that topic versus ones that aren’t, and making a decision based on which group has a higher percentage of matches. While this approach can be effective, it has several flaws. First, like Instance Based Learning it requires comparison with the training set during classification, which is slow at runtime because we have to consider many permutations. More significantly, we would have applied a human element to the algorithm by defining the threshold for a match, and humans aren’t very good at making these kinds of determinations because they can’t process certain kinds of data as quickly as a computer can.

Using machine learning, we can enlist the help of a computer to apply the data to a particular learner that will output a classifier for the domain. Frameworks like Weka offer all kinds of learners that we can try out of the box with our training data to create classifiers. Naive Bayes is one such algorithm that tends to do a great job with text classification. If we use our words with weights as attributes and each website in the training set as an example to train on, a Naive Bayes learner will find probabilistic correlations between the occurrence of words and the topic of a webpage and will output a classifier that is likely to give more accurate results than any algorithm that a human could come up with in a reasonable amount of time.
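
For what it’s worth, the Weka side of this is only a few lines. Here’s a minimal sketch against the Weka 3 API (assuming the training examples have already been written out to a training.arff file with the class as the last attribute; double check the class and method names against the docs for your version):

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TopicClassifier {
    public static void main(String[] args) throws Exception {
        // Load the labeled examples and tell Weka which attribute is the class.
        Instances train = DataSource.read("training.arff");
        train.setClassIndex(train.numAttributes() - 1);

        // Train the Naive Bayes learner on the examples.
        NaiveBayes learner = new NaiveBayes();
        learner.buildClassifier(train);

        // Classify an example that has been run through the same cleaning/tokenizing/ranking pipeline.
        Instance unlabeled = train.instance(0); // stand-in for a freshly prepared webpage
        double predicted = learner.classifyInstance(unlabeled);
        System.out.println("Predicted class: " + train.classAttribute().value((int) predicted));
    }
}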

Wiring it Up

So how do we wire these pieces together, and what does it look like to consume the finished product? Let’s suppose that we want to be able to tell whether a website is about soccer. We start by creating a whitelist of websites that we know produce editorial content about soccer. We’ll also want to create a blacklist of sites that produce content about world news, technology, rock and roll, ponies, and anything that isn’t soccer. We throw together a function that crawls the sites and for each example we infer a class based on the source (we may be wrong in some edge cases, but in general we’re going to assume that the sites in our whitelist/blacklist are producing content that is or isn’t soccer across the board). We run the webpages through our cleaning, tokenizing, and ranking functions and we end up with training examples that look like the following contrived ones:

  • foo.com – True. Manchester (.85), Rooney (.75), United (.64), match (.5).
  • bar.com – False. Muse (.9), Bellamy (.72), guitar (.72), cool (.48), show (.43).

Getting Weka to speak our language may require massaging the examples into ARFF or some other format that the framework understands, but at this point we can directly apply the training set to the learner. For subsequent webpages we run them through the same functions to get ranked keywords, and then we pass the new example into the classifier and we’re given a boolean result. Magic.
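
For the contrived examples above, a toy ARFF file might look roughly like this, with one numeric attribute per keyword (zero when the keyword doesn’t appear on a page) and the class as the final attribute:

@relation webpages

@attribute manchester numeric
@attribute rooney numeric
@attribute united numeric
@attribute match numeric
@attribute muse numeric
@attribute bellamy numeric
@attribute guitar numeric
@attribute about_soccer {true, false}

@data
0.85, 0.75, 0.64, 0.5, 0, 0, 0, true
0, 0, 0, 0, 0.9, 0.72, 0.72, false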

Simple Optimization

Note that we only used the words in the body of a webpage, but in the real world we would have access to more data. We could potentially look at the hints provided in the page title, meta tags, and other tags like heading/bold or bigger font sizes and make some assumptions about the importance of the words (of course we have to do this before stripping tags out). If we could get our hands on link text for sites that link to the article that could also serve as valuable input, although search engines aren’t making it as easy to access link data from their indexes these days. We could use this additional data to either augment our Naive Bayes learner using arbitrary weights, or use more complex learners like a Perceptron or a Support Vector Machine to try to let the computer decide how important these other inputs should be. It’s certainly possible that for some topics other kinds of learners may produce better results. Or we could investigate ways to use learners in combination (via Bagging or Boosting, for example) to get better accuracy than any single learner.

Conclusion

Classifying webpages by topic is a fairly common example of a problem that can be solved by an algorithm created by either a human or a computer. My aim in this post was to provide a quick look at one way to attack the problem and to touch on some very introductory machine learning concepts. If you’re interested in reading more about machine learning specifically, there are countless resources online and some great books available on the subject. Hope you found the post helpful. Classify away, and I’ll certainly look forward to hearing back from any readers who have tackled similar challenges or want to provide feedback!

How the Cloud Saved Me from Hacker News

If you’re reading this post, we probably have one thing in common: we both spend at least some of our free cycles perusing Hacker News. I know this because it has driven most of my blog traffic over the past week. I have a habit of submitting my recent blog posts, and the other day I was surprised to see one particular post climb to number three on the Hacker News homepage. My excitement quickly gave way to panic, however, as I realized that the sudden rush of traffic had taken my blog down in the middle of its shining hour.

Back up a couple of months. I started blogging back in 2009 on Blogspot. At some point I was tempted by the offering of a free AWS EC2 Micro Instance; I had been thinking about setting up a private Git server and running a few other servers in the cloud, and I decided that like all of these guys, I would migrate my blog to Self Hosted WordPress on EC2. The whole migration was rather painless; I’ll spare you the monotonous details because there are quite a few blog posts out there on getting the setup up and running, and how to move content. The one issue that I did run into was with the existing BitNami AMIs preinstalled with WordPress, so I ended up picking a vanilla Ubuntu AMI and installing LAMP + WordPress myself. Suffice it to say that I’m still relatively new-ish to the Linux world, and I pulled it off without much trouble.

But now, my blog was down. Fortunately I was able to cruise over to AWS Management Console and stop my EC2 Instance, upgrade temporarily to a Large Instance, restart it, and then update my Elastic IP. Just like that I was back in business, and my blog that previously got a couple hundred hits on busy days suddenly fielded over 20k hits in a day and another 6k over the next few days.

I figured I would throw together a quick post on my experience for a few reasons. First, because some folks who post to Hacker News may not have an idea of exactly what to expect if they make the homepage. Read: if you’re EC2 hosted, upgrade your instance size ahead of time. And second, I just wanted to marvel at the power of the cloud. A decade and a half ago I remember ordering a physical Dell rack server and hauling it over to a local ISP where I colocated it for a couple hundred bucks a month and used it to host a few websites and custom applications. The fact that I can now spin up a virtual machine in the cloud in minutes, have my software stack up and running in less than an hour, and instantly scale to accommodate huge traffic variance (and all for cheap) is a testimony to the infrastructure underneath modern cloud offerings.

The Software Developer’s Guide to Fitness & Morning Productivity

If you’re a software developer (or frankly, if you spend a large portion of your day sitting in a chair in front of a computer) you will be more productive if you find a way to incorporate a workout into your daily routine. I truly believe that if you’re working 8 hour days today, you will get more done working 7 hours and squeezing in 30-40 minutes of physical exercise. I believe this because a couple months ago my family and I moved into the city a few blocks from where I work, and I traded long commutes sitting in traffic for some relaxing morning time with the family and a quick workout in the mornings at the fitness center down the hall. The value of living close to work and having a bit of relaxing time in the morning is probably fairly self explanatory, but for now I want to focus on why I’ve found exercising to be so valuable. I also want to call out a few things that I’ve learned in the process that I hope may make your life easier if you aren’t exercising regularly and decide at some point that you want to incorporate a workout into your day. I don’t claim to be a personal trainer or any kind of fitness expert (although I’ve consulted a few while putting together a program that’s effective and gets me in and out of the gym quickly). Don’t treat this post as a replacement for good advice from qualified health and fitness professionals; think of it as one computer geek sharing some practical tips with his fellow geeks about a particular way to get in shape and increase productivity.

Benefits of Exercise

From a pure productivity perspective, the biggest benefit to exercising for me is specific to working out in the morning. Rather than getting to the office feeling like I need another two hours of sleep and that only four cups of coffee will get me through the day, I show up feeling awake and ready to start knocking off tasks in my queue. Because many of the folks on my teams tend to show up at 10 or 11 and work late, my schedule is generally meeting free in the morning, which also makes it the most valuable time to be productive.

I don’t have evidence to support this, but anecdotally I have observed a link between fitness and career success. That’s not to say that you can’t have one without the other, but I believe that you have a better shot of being successful in your career if you work out on a regular basis. Working out makes you feel good, boosts your energy levels, helps strengthen your core muscles so that you’re comfortable sitting in a chair all day, gives you confidence, and perhaps most importantly gets you in the habit of setting goals and achieving them over long periods of time. When you’re jumping between jobs, there’s also evidence to suggest that interviewers make a hire/no hire decision that is extremely tough to overturn in the first 15 seconds of the interview process, and whether you like it or not, that first impression includes what you look like.

When to Exercise

Some people believe that working out in the morning boosts your metabolism throughout the rest of the day, but the limited research that I’ve seen seems to suggest that regardless of when you work out you get a short metabolism boost that goes away in a set amount of time. I’ve touched on why I find working out in the morning to be especially beneficial, but I would recommend working out at a time where you know you can be consistent; if you try to vary your workout daily according to your schedule you’re going to be way more likely to skip it. If the only way that you can be consistent is to take a quick jog on a treadmill in a 3 piece suit at lunch, do that… and do it consistently.

How to Exercise

Map out a routine that’s short and sweet, and ideally one that you enjoy. Get your heart rate up to your target zone and try to keep it up for 20-30 minutes. Pick a few exercises and do them in circuits with little or no rest between exercises (and a short rest between sets), at high intensity. Lean towards workouts that work large groups of muscles, for example doing push ups (or better yet, burpees) instead of bench press.

Personally I run 1-2 miles and then pick 3 different exercises and do them in a circuit. I split the exercises into upper body, lower body, and core. I try to make sure that I hit each big muscle group at least once per week. It gets me in and out of the apartment gym in around a half hour, and I’ve found it to be effective. If you’re having trouble figuring out what exercises you should incorporate into your workout, chat with a trainer or check out one of the apps (there are several Crossfit WOD specific ones if you want to go that route) that are available on any phone.

How to Eat

One of the first things that I noticed when I started working out was that after my morning burst of energy I would start getting tired right before lunch. I figured out that eating protein in the morning helped, so I ordered a big tub of whey protein and started making a quick fruit/protein shake with some yogurt/milk every morning. Remember that your body needs protein to rebuild muscles after a workout, and if you’re like me you’re probably not in the habit of eating enough protein to start your day. Protein provides energy for a longer period of time than fat or carbs, so you’ll be getting fuel from your morning snack for longer.

Hope you find this helpful, and if you figure out any workout tips of your own as you get going please do share!

Dependency Injection (& Small Furry Animals)

Dependency Injection (DI) is a game changing design pattern that most programmers should have in their toolbox. DI was initially known as Inversion of Control (IoC), but because frameworks inherently invert some sort of control Martin Fowler proposed the naming switch to DI in his canonical post on the subject to better describe which aspect of control is being inverted. There are plenty of tutorials on DI that are extremely helpful when ramping up on the topic, although some of them are either tightly coupled to a particular language or framework while others can be a bit lengthy. For that reason I thought I would take the time to write a quick post that explains the concept of DI at a high level, in a (somewhat) concise fashion. I’ll try to explain things in a way that is language agnostic, but I will use Java for my initial example. After that I’ll touch on ways to use DI in a few common languages, and then I’ll close by listing a few reasons why DI is cool.

To illustrate what DI is all about, let’s suppose that I’m writing a Java application that makes small furry animals sing and dance on the screen (because everyone loves a minstrel marmot). I begin by creating a SmallFurryAnimal Interface like so:

interface SmallFurryAnimal {
  void sing();
  void dance();
}

I then create a couple of animal classes called Marmot and Squirrel that implement SmallFurryAnimal with their own animal-specific implementations of sing() and dance(). I tie everything together in a main method that looks like this:

class Application {
  public static void main(String[] args) {
    SmallFurryAnimal marmot = new Marmot();
    SmallFurryAnimal squirrel = new Squirrel();
    while (true) {
      marmot.sing();
      marmot.dance();
      squirrel.sing();
      squirrel.dance();
    }
  }
}

When I run javac to compile my application to bytecode I need to be sure to pass the source files for Marmot and Squirrel to compile Application because they are hardcoded compile time dependencies. When I run the application using java I also need to be sure that the ClassLoader is able to load Squirrel and Marmot. This is all fine and dandy until I realize that I want to give people who use my application the ability to design, implement, and plugin their own arbitrary small furry animals without having to worry about recompiling or even knowing about the rest of my application. Essentially I want to do the following:

class Application {
  public static void main(String[] args) {
    while (true) {
      for (SmallFurryAnimal animal : animalsInjectedAtRuntime) {
        animal.sing();
        animal.dance();
      }
    }
  }
}

With these changes I’m only bound to the interface for SmallFurryAnimal, not any particular implementation. I’m free to code up animals to my heart’s content, and I can use DI to inject them into my application at runtime.

It turns out that this kind of ability to inject dependencies into an application at runtime is very common for all sorts of applications. One very common example is applications that run workflows and allow users to inject custom workflow tasks. Another similar example is a state machine application that allows users to inject custom states and the behaviors that relate to those states. In fact, almost any place where you implement an interface or derive from a base class is a candidate for injecting the dependency on the implementation at runtime, unless the number of possible or necessary implementations is relatively small, fixed, and unlikely to change.

It’s possible to use DI in Java without taking advantage of any additional frameworks by using Reflection to crack open Class Files and look for Classes that implement a specific interface. The most common way to use DI in Java is Spring, a popular Java Framework targeted at a wide array of Java deployment use cases. One of the things that Spring provides is an IoC container which provides the ability to load Java objects using Reflection (typically by specifying which objects to load in configuration) and handles managing the lifecycle of those objects.
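
If you want to see the shape of this without pulling in Spring, the JDK’s own java.util.ServiceLoader gives you a bare-bones flavor of it: list implementation class names in a META-INF/services file named after the fully qualified interface, and the loader discovers them at runtime. A minimal sketch (assuming the animal implementations are simply available on the classpath):

import java.util.ServiceLoader;

class Application {
  public static void main(String[] args) {
    // Discovers every implementation listed in
    // META-INF/services/<fully.qualified.SmallFurryAnimal> on the classpath.
    ServiceLoader<SmallFurryAnimal> animals = ServiceLoader.load(SmallFurryAnimal.class);
    while (true) {
      for (SmallFurryAnimal animal : animals) {
        animal.sing();
        animal.dance();
      }
    }
  }
}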

C# is obviously very similar to Java, and it allows you to implement DI using Reflection. .NET has also introduced the Managed Extensibility Framework which allows usage of the DI pattern without configuration by using class attributes to tell the Composition Container what to load at runtime.

If you’re using dynamic programming languages like Perl or Ruby then DI is essentially baked in; you’re probably already using it whether or not you realize it.

There are a bunch of reasons why DI is super important, and why many programmers who first stumble on the pattern feel like it’s a game changer. It improves testability by providing the ability to inject mock objects during test execution. It improves reusability of code by allowing developers to build components once and dynamically inject them into multiple applications, and by allowing application consumers to inject objects that are specific to their requirements. It makes code more readable by making dependencies explicit in configuration or metadata. It makes applications easier to deploy or upgrade by guaranteeing loose coupling with dependencies, so that either the application or any injected object can be deployed in isolation.

It’s not the only way to accomplish these objectives, but it’s certainly a great way to do so. Hopefully this quick overview has served to help introduce you to the DI pattern, and provided some useful links in case you want to dig deeper.