String Alignment, Dynamic Programming, & DNA

Most modern runtime environments provide an implementation of some basic functionality for working with strings. We’re used to relying on simple operations like splicing strings together or checking whether one string contains another, but what if we wanted to do something more complex and determine whether two strings have a shared etymology or are likely to have a similar meaning? The problem is interesting for a few reasons. First, there are some very interesting applications for predicting how closely related strings are. In terms of spoken and written language, being able to glean information about the evolutionary origin of words can serve as one building block for more complex problems like natural language processing and translation. There are also very important uses for this kind of algorithm in related fields like DNA sequencing, and even in comparing the source code of programs. Second, the algorithms at the core of the solution provide a neat example of Dynamic Programming, which is a handy tool for any developer to have in their toolbox.

To get started, let’s consider a simple real-world example. Suppose that we need to programmatically decide whether the English word “dictionary” and the Spanish word “diccionario” are related in order to draw some conclusions about the meaning of each word in the absence of a… well, dictionary. We begin by noting that in practice, words tend to evolve over time through the insertion, deletion, and mutation of characters. To capitalize on this observation, we need to find the most likely alignment between the two words by inserting gaps to make similar characters match up. The first step in finding that alignment is to establish a scoring system between characters in each word. The scoring system should be based on data and should reward similarity while penalizing differences between the strings. For our example, let’s start with a very simple scoring system where letters that match exactly are worth +2, vowels that don’t match are still worth +1 (and we’ll include the character ‘y’ with the vowels), and letters that are mismatched (or matched with gaps) are worth -3. This scoring system takes advantage of a somewhat anecdotal observation that vowels tend to turn into other vowels more frequently than other mutations occur.
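To make that scoring system concrete, here’s a minimal sketch in Python of what the scoring function might look like. The names here (score_pair, GAP, and friends) are just illustrative choices for this post, not part of any standard library:

# Toy scoring scheme from the example above (assumes lowercase input).
VOWELS = set("aeiouy")   # we're lumping 'y' in with the vowels

MATCH = 2        # identical characters
VOWEL_SWAP = 1   # two different vowels
MISMATCH = -3    # any other pair of letters
GAP = -3         # a letter paired with a gap

def score_pair(a, b):
    """Score a pair of aligned characters under the toy scheme."""
    if a == b:
        return MATCH
    if a in VOWELS and b in VOWELS:
        return VOWEL_SWAP
    return MISMATCH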

Given these words and the scoring system, our brains can actually scan the two words rather intuitively and posit a valid alignment. We can represent the alignment by showing the two words with dashes inserted for gaps, and a third row where letters indicate a direct match and plus signs indicate a non-match that was still assigned a positive score:

DICTIONAR-Y
DIC IONAR +
DICCIONARIO

Our alignment has 8 letters that match perfectly and a score of 11, and at a glance it seems like a very strong match. It also happens to be a global alignment, which I’ll touch on later. The gap in “dictionary” is also somewhat arbitrary in that the score would be identical if the gap followed the ‘y’ instead of preceding it, which means that in this case there are multiple alignments that are equally valid.

So we’ve produced an alignment, but how could we automate the alignment-finding process? To take things a step further, how can we calculate the probability that the alignment is meaningful and not just coincidence? Let’s deal with the former problem first. At first blush, it looks like we could just implement a recursive top-down algorithm that finds the optimal alignment by drilling down to individual characters and then building back up by adding together the maximum-scoring alignments of each possible combination of substrings and gaps. If we want to align “dictionary”, we could first find the possible alignments and scores of ‘d’ and ‘i’, then find the alignments and scores for “di” by picking from the individual character alignments and gaps, and so on. The problem with this approach is that the recursion pairs every prefix of the first string with every prefix of the second string, and then re-solves those same subproblems over and over again at every level of the recursion. The result is a runtime that grows exponentially with the lengths of the strings. Unfortunately that won’t scale well with bigger strings (like DNA, which we’ll touch on later), especially when we have to run a large number of alignments to try to calculate the probability that the alignment is significant (which we’ll also touch on later).
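To make the scaling problem a bit more tangible, here’s a rough sketch of what that naive top-down recursion might look like for a whole-string (global) score, reusing the score_pair and GAP names from the earlier snippet. It’s purely illustrative; without memoization it recomputes the same subproblems an enormous number of times:

# Naive top-down global alignment score (illustrative only).
# Each call branches up to three ways, and the same (i, j) pairs get
# recomputed over and over as the strings grow.
def naive_score(s, t, i=0, j=0):
    if i == len(s) and j == len(t):
        return 0
    options = []
    if i < len(s) and j < len(t):
        options.append(score_pair(s[i], t[j]) + naive_score(s, t, i + 1, j + 1))
    if i < len(s):
        options.append(GAP + naive_score(s, t, i + 1, j))  # gap in the second string
    if j < len(t):
        options.append(GAP + naive_score(s, t, i, j + 1))  # gap in the first string
    return max(options)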

An alternative is to use Dynamic Programming (DP): breaking the problem up into smaller sub-problems and solving from the bottom up. The beauty of DP is that we don’t have to redo the simple calculations that get repeated with naive recursion, and we can also take advantage of nuances of the problem in the way that we set up the algorithm. For string alignment, the particular nuance that we can capitalize on is the fact that the characters of each string must stay in order, so at each step we can only align the next two characters or insert a gap in one string or the other.

I’ll pause briefly before explaining the DP algorithm to clarify that there are two flavors of string alignment. Global alignment requires that we use each string in its entirety. Local alignment instead finds the highest-scoring alignment between substrings of the two strings. The most widely used global alignment algorithm is called Needleman-Wunsch, while the local equivalent is an algorithm called Smith-Waterman. Both algorithms are examples of DP, and both begin by building up a two-dimensional matrix of integers with dimensions of the size of each respective string plus one (because each aligned string can start with either a character or a gap). I’m going to use the local alignment algorithm in my example, but global alignment is very similar to implement in practice.

The first step in finding a local alignment is building up the matrix of integers, where each location in the matrix represents a score as we traverse the alignment. We start by initializing the values along the top and left of the matrix to zero, since a local alignment can start at any point in either string. If we were finding a global alignment we would instead initialize the edges so that each subsequent location along them incurs the gap penalty, because we would need to include the entire string in the result. Next we start at the top left corner of the region that remains and work either right or down (it doesn’t matter, as long as we stay consistent), and at each location we enter the maximum of four values: the value to the left plus the score of the character above paired with a gap, the value above plus the score of the character to the left paired with a gap, the value up and to the left plus the score of the characters above and to the left, or zero. If that explanation isn’t clear or the intuition isn’t there, consider what each decision implies with respect to the alignment that we’re scoring. The first two options imply inserting a gap in one of the strings, the third implies aligning two characters, and the fourth implies ending a local alignment. For our contrived example and scoring system, the matrix looks like the following:

    –   D   I   C   C   I   O   N   A   R   I   O
–   0   0   0   0   0   0   0   0   0   0   0   0
D   0   2   0   0   0   0   0   0   0   0   0   0
I   0   0   4   1   0   2   0   0   0   0   2   0
C   0   0   1   6   3   0   0   0   0   0   0   0
T   0   0   0   3   3   0   0   0   0   0   0   0
I   0   0   2   0   0   5   2   0   0   0   2   0
O   0   0   0   0   0   2   7   4   1   0   0   4
N   0   0   0   0   0   0   4   9   6   3   0   1
A   0   0   0   0   0   0   1   6  11   8   5   2
R   0   0   0   0   0   0   0   3   8  13  10   7
Y   0   0   0   0   0   0   0   0   5  10  10   7

Apologies if my somewhat crude tables don’t format well on your device; I was lazy and copied/pasted them from program output without applying a lot of formatting. You can see that building up the matrix takes O(nm) time for strings of length n and m, a dramatic improvement over the naive recursive approach.
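Here’s a rough sketch of what the matrix fill might look like in Python, again reusing score_pair and GAP from the earlier snippet (build_matrix is my own name for this post, not a library call). Dropping the max-with-zero option and seeding the edges with cumulative gap penalties would turn this local fill into the global variant:

# Build the local alignment (Smith-Waterman style) score matrix.
# Also tracks the highest-scoring cell so we can trace back from it later.
def build_matrix(s, t):
    rows, cols = len(s) + 1, len(t) + 1
    h = [[0] * cols for _ in range(rows)]  # top row and left column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            h[i][j] = max(
                0,                                                 # end the local alignment
                h[i - 1][j - 1] + score_pair(s[i - 1], t[j - 1]),  # align two characters
                h[i - 1][j] + GAP,                                 # gap in t
                h[i][j - 1] + GAP,                                 # gap in s
            )
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    return h, best_pos

Tracking the best cell during the fill is what lets us jump straight to the start of the traceback later without re-scanning the whole matrix.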

The next step is retrieving the alignment from the matrix. We start at the location with the maximum value, which we can store as we build up the matrix. From that location we trace backwards and figure out which action resulted in that value by considering the locations to the top, left, and top left of the current location and calculating which value was a valid part of the calculation for the current value. In some cases there will be multiple valid alignments, as indicated by either multiple starting locations with identical scores or multiple valid locations that could have contributed to the calculation for the current location. In that case it’s important to consider the context and decide whether it makes sense to return multiple alignments, pick a single alignment arbitrarily, or create criteria to pick between alignments (for example, penalizing longer alignments). Since we’re finding a local alignment, we trace back until we reach a location where the score is equal to or less than zero. If we were doing a global alignment we would use the same logic, but we would start at the bottom right location and go until we reach the top left location. The same matrix below shows our trace back, with the path that we’ve selected marked in brackets:

    –   D   I   C   C   I   O   N   A   R   I   O
–   0   0   0   0   0   0   0   0   0   0   0   0
D   0 [2]   0   0   0   0   0   0   0   0   0   0
I   0   0 [4]   1   0   2   0   0   0   0   2   0
C   0   0   1 [6]   3   0   0   0   0   0   0   0
T   0   0   0   3 [3]   0   0   0   0   0   0   0
I   0   0   2   0   0 [5]   2   0   0   0   2   0
O   0   0   0   0   0   2 [7]   4   1   0   0   4
N   0   0   0   0   0   0   4 [9]   6   3   0   1
A   0   0   0   0   0   0   1   6[11]   8   5   2
R   0   0   0   0   0   0   0   3   8[13]  10   7
Y   0   0   0   0   0   0   0   0   5  10  10   7
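A corresponding traceback might look something like the following sketch, assuming the build_matrix, score_pair, and GAP helpers from the earlier snippets. When there are ties it simply prefers the diagonal move, which is one of the arbitrary tie-breaking choices mentioned above:

# Trace a local alignment back from the highest-scoring cell.
def traceback(h, s, t, best_pos):
    i, j = best_pos
    aligned_s, aligned_t = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        if h[i][j] == h[i - 1][j - 1] + score_pair(s[i - 1], t[j - 1]):
            aligned_s.append(s[i - 1]); aligned_t.append(t[j - 1])  # two characters aligned
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + GAP:
            aligned_s.append(s[i - 1]); aligned_t.append("-")       # gap in t
            i -= 1
        else:
            aligned_s.append("-"); aligned_t.append(t[j - 1])       # gap in s
            j -= 1
    return "".join(reversed(aligned_s)), "".join(reversed(aligned_t))

In real code you’d also want to decide explicitly how ties are broken and whether to return more than one equally good alignment, for the reasons discussed above.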

At this point we’ve got an alignment and a score of 13, which is pretty rad. But how can we tell whether the score is statistically significant? Clearly we can’t make hard and fast rules, because the number that the score produces is relative to a lot of factors including the size of the matrix and the alphabet that was used. One option is to create a bunch (in practice a bunch usually means a minimum of 10^4) of random permutations of one of the strings, each reusing all of the characters so the length stays the same, align each permutation with the other string, and then see how many alignments score equal to or better than the initial match. The resulting “P-Value” can be calculated by taking the number of better matches plus one divided by the number of attempts plus one (because we include the initial match in our set). Interpretation of the value depends on the context, but in our case we may decide that a value of 0.5 indicates that the relationship is likely random, while a value of 10^-4 suggests a more compelling relationship between the strings. For the given example I tried running 10^5 different alignments while randomizing one of the strings and didn’t see a single alignment with a score that was equal to or better, which isn’t a surprise because the alignment that we produced is almost a direct match. The obvious conclusion is that the words almost certainly did evolve from the same origin at some point in human history, and thus likely share meaning.
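As a sketch, that permutation test could look something like this, reusing build_matrix from the earlier snippet (permutation_p_value is an illustrative name, and random.shuffle stands in for whatever permutation scheme you prefer):

import random

# Estimate the permutation "P-Value" described above.
def permutation_p_value(s, t, trials=10_000):
    h, (bi, bj) = build_matrix(s, t)
    observed = h[bi][bj]                      # score of the real alignment
    at_least_as_good = 0
    chars = list(t)
    for _ in range(trials):
        random.shuffle(chars)                 # same letters, same length
        hs, (ri, rj) = build_matrix(s, "".join(chars))
        if hs[ri][rj] >= observed:
            at_least_as_good += 1
    return (at_least_as_good + 1) / (trials + 1)   # count the original alignment too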

Now that we’re armed with our nifty algorithm, where else can we put it to use? Some of the more interesting applications for string alignment occur when trying to sequence and compare DNA. At the end of the day, DNA is just a long string over a 4 letter alphabet of nucleotides. DNA produces RNA, which in turn produces proteins that are made up of chains of amino acids. Each protein is likewise just a string over a 20 letter alphabet of amino acids. DNA (and as a result, the proteins that it produces) has evolved through a series of insertions, deletions, and mutations just like written or spoken language. DNA evolution is actually much purer than that of written and spoken language, because it’s much more difficult for an entire section of DNA to instantly mutate than it is for a brand new word to become adopted in a culture. A gene is the DNA equivalent of a word: it’s a section of DNA that is ultimately responsible for the production of a protein that may contribute to a physical trait. By aligning either DNA or protein sequences, we can make very educated guesses about whether genes are evolutionarily linked, and whether the proteins that they produce fold and function similarly. By looking at highly conserved regions of related proteins over time and making statistical observations about the likelihood of certain amino acids and common mutations, researchers have produced scoring systems like BLOSUM that are very effective when comparing proteins.

Another interesting application for our algorithm is in examining iterations to the source code of a program, or comparing multiple implementations of a program. A program is really just a string where the alphabet is generally defined by the context free grammar of a programming language. Suppose that we want to figure out whether a student in a computer science class copied his assignment from a peer, or did the work on his own. In this case we can simply align each pair of programs, look at the probabilities that the programs are related, and apply some manual investigation and common sense to any results that appear closely related.

That’s a lot of information to pack into a single post, but I hope you find it helpful and are able to think of other creative applications for the algorithms and concepts. As always, I’ll look forward to reading your comments!

Getting Rock Stars Excited About Working For You

The most important thing that the manager of a software development team can do is to staff their team up with rock star developers. The “10x Software Development” principle that Steve McConnell and others brought to the mainstream is the biggest reason why: the best developers aren’t just twice as effective as average ones, they’re an order of magnitude better. This is old news, and we’ve all heard it before by now (and on a side note, it’s also why I’ve previously harped on the importance of building a network of talented folks). What isn’t old news is that there are a few straightforward steps that managers can take to maximize their chances of landing those amazingly talented developers.

I’ve managed teams at both Microsoft and Amazon. I’ve recruited the cream of the software development crop against virtually every other big tech company, exciting start-ups, and even against other teams within my company. I’ve won more of those battles than I’ve lost, and I’ve learned my share in the process. I’m going to share a few very tangible suggestions that you can put into practice today that I humbly submit will make you more effective at winning your own recruiting battles. Most of the tips are equally relevant whether you’re managing a team within a huge company, or hiring employee #1 at a start-up operating out of your garage.

To set the stage, let’s start by focusing on a scenario that you’re likely to come across in the recruiting process. Suppose that you’ve been fortunate enough to find a rock star developer who’s interested in your team. She breezed through your interview process, and your HR folks (who incidentally may also be you at a small company) have reached out to her and started to negotiate compensation. The word comes back from HR that the developer is interested in joining your team, but she has offers out from two other companies with comparable compensation and she’s having a hard time making a decision. The good HR peeps inform you that you’re at the absolute top of the compensation range, so there isn’t any way to sweeten the offer and blow her away with dollar bills. They’ve lined up a sell call for you to chat with the developer, answer her last few questions, and get her jazzed about your team. This is a position that I’ve personally been in at least 10 times over the past year.

In this situation there are two sets of inputs that will determine whether the candidate accepts your offer. The first is a set of things that are mostly out of your control at this point, for example: the brand image of the company that you work for, the location of the team offices, differences in the compensation packages being offered, and the appeal of the projects that each team is working on based on that developer’s particular experiences and interests. The second set is in your control, for example: the way that you come across in the phone conversation and any other communication with the candidate, the way that you choose to paint a picture of your team, and how well you and the person connect. The following are tips that I’ve found effective at maximizing that second set of inputs that are in your control and increasing your odds of landing a rock star:

  • Prepare a one pager that sells your team, today. Leave the stuff that isn’t sexy out. Don’t talk about your ops rotation or that legacy product that your team has to keep on life support. You should field questions about those areas honestly when asked, but the one pager isn’t the place to do so. Sell the long term team vision, and make it something that good developers want to sink their teeth into. Emphasize hard technical challenges. Be sure to call out who your customers are and what the impact of the technology is for them, and consider including a powerful quote from at least one satisfied customer. Email the one pager to the candidate as soon as possible once you’ve decided to hire them, and before the sell call, to keep them thinking about your team and help them formulate follow-up questions for the phone call.
  • Come to the phone call prepared. Know the candidate’s resume and background inside and out. You studied this extensively prior to interviewing the candidate, but be sure to refresh your memory immediately before the call. Understand that when you chat, you shouldn’t feel the need to cover absolutely everything that your team is doing; in most cases that will be more than a person can digest in a single phone call. Create a list of no more than 3 high level bullet points that you want to be absolutely sure to cover. Tailor those bullet points to the kind of technologies that the candidate is likely to be excited about.
  • Spend the first 5-10 minutes of the phone call learning more about the candidate. Be sure that you’ve accurately identified their interests. Be ready to adjust your message on the fly to emphasize the points that jibe with their interests. Ask probing questions to show them that you’re really interested in what they care about, and to be sure that you’re correctly identifying their interests.
  • Walk around while you’re talking. Put the phone on speaker, or get a headset. Use your arms and other body language, even though the candidate can’t see you. It’s easier to speak naturally and get excited about what you’re talking about if you’re not chained to a desk and phone. Your enthusiasm about the team and products is absolutely key to getting the candidate excited, and it’s harder for most people to convey over the phone than in person. It’s also something that is very difficult to fake if you’re working on a product that you aren’t a fan of.
  • Try to take the phone call to a place where it turns into a technical discussion between you and the candidate. If you can get to a point where they’re excited about the technology, making suggestions, and engaging in a dialogue about where they would take it, you’re in great shape. Intentionally nurture that kind of discussion.
  • End the call by answering questions and offering to connect them with additional resources. I personally always provide my contact info and encourage the candidate to get in touch with me if any future questions come up. I also offer them a chance to schedule a follow up phone call with a senior engineer from the team (that I trust to represent the team well) to dive further into the technical weeds if the candidate is interested. They will only rarely take you up on that offer, but making the offer still goes a long way to establish the technical credibility of the team.
  • If the opportunity presents itself, follow up with something memorable. When I was contemplating Amazon my baby was just a few months old, and the hiring manager sent me an Amazon branded onesie. We sometimes send candidates Kindles to keep Amazon front and center in their mind and show them how much we want them to join us. I had a candidate a few months ago who wanted to join the team but was hesitant about the weather in Seattle, so I took a picture out of the office window the next day (which happened to be beautiful and sunny) and emailed it to him. Anything that keeps your team and company front and center in the candidate’s mind is awesome. Be sure not to force it if the opportunity doesn’t present itself, because that kind of attempt will likely come across as desperate and produce the opposite effect.

I hope you find those ideas helpful and are able to implement them effectively, unless I happen to be managing the team that you’re competing against for that next rock star developer. 🙂 I’ll certainly welcome feedback and/or additional suggestions from others in the comments below. I’ll add the usual disclaimer that these ideas are all my own, and don’t reflect any of the opinions of my current or former employers.