Your Critical Data Isn’t Safe

I’m willing to bet that just about every working moment of your life up to this very instant has been an attempt to make money with the goal of accumulating enough wealth to live comfortably and achieve a set of objectives (retirement, travel, an increased standard of living). That nest egg is probably stored at a bank on a computer system as a set of 1s and 0s on a disk. Published analyses of data center drive populations report that, on average, somewhere between 2% and 14% of hard disks fail every year. In other words, every month x% of my paycheck is deducted for retirement savings and stored on a disk that has around a 1 in 20 chance of failing sometime this year, and if that money isn’t available in 30 years I’m hosed.
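As a back-of-envelope check on that “1 in 20” figure, here is the math for a single unreplicated disk, assuming a flat 5% annual failure rate and independence between years (both simplifications; real drives fail more often when young and when old):

```python
# Odds for a single unreplicated disk, assuming a flat 5% (1 in 20)
# annual failure rate and independent years.
annual_failure_rate = 0.05

survive_one_year = 1 - annual_failure_rate
survive_30_years = survive_one_year ** 30  # 0.95 ** 30

print(f"P(disk survives 30 years) = {survive_30_years:.2f}")
print(f"P(disk fails within 30 years) = {1 - survive_30_years:.2f}")
```

Under those assumptions the disk survives 30 years with probability 0.95³⁰ ≈ 0.21, so without replication the odds actually favor losing the data before retirement.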

A dire situation indeed, but I’m obviously omitting a few important little details. Fortunately for me the bank has a government/shareholder vested interest in seeing me cash my retirement dollars out someday, so they’ve hired a small army of programmers to design systems that guarantee that my financial data is safe. The question is, how safe? Is my blind faith that my digitally stored assets will never be lost justified?

Let’s start by considering a few basic scenarios around data storage and persistence on a single machine. Suppose that I’m typing up a document in a word processing application. What assumptions can I make about whether my data is safe from being lost? Most modern hardware splits storage between fast volatile storage whose contents are lost without power (memory) and slower non-volatile storage whose contents persist without power (disk). It’s possible that in the future advances in non-volatile memory will break down these barriers and completely revolutionize the way that we approach programming computers, but that’s a lengthy discussion for another time. For now it’s safe to assume that my word processor is almost certainly storing the contents of my document in memory to keep the application’s user interface speedy, so something as trivial as a quick power blip can cause me to lose my data.

One way to solve this problem is by adding hardware, so let’s say that I head to the store to buy a nice and beefy UPS. I’ve covered myself from the short power outage scenario, but what about when I spill my morning coffee on my computer case and short out the power supply? My critical document still only exists in memory on a single physical machine, and if that machine dies for any reason I’m in a world of hurt.

Suppose I decide to solve this by pushing CTRL+S to save my document to disk every 5 minutes. Can I even assume that my data is being stored on disk when I tell my application to save it? Technically no; it depends on the behavior of both my word processor and the operating system. When I push save, the word processor is likely making a system call to get a file descriptor (if it doesn’t already have one) and making another system call to write some data using that file descriptor. At this point the operating system still probably hasn’t written the data to disk; instead it has likely written it to an in-memory disk buffer (the page cache) that won’t get written to disk until the buffer fills up or someone tells the operating system to flush it.
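A minimal sketch of that save path, using Python’s `os` module (the file name and contents are hypothetical):

```python
import os

def save_document(path, contents):
    # open() and write() hand the bytes to the operating system, but the
    # OS typically keeps them in an in-memory page cache rather than
    # writing them to the physical disk immediately.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, contents.encode("utf-8"))
        # A power loss here can still lose the data, even though
        # write() returned successfully.
    finally:
        os.close(fd)

save_document("draft.txt", "my critical document")
```

The application sees a successful write and moves on, while the bytes may still exist only in memory until the OS decides (or is told) to flush them.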

Let’s assume that I’ve actually examined the code of my word processor and I see that when I press save it is both writing data and flushing the disk buffer. Can I guarantee that my data is on disk when I press save? Probably, but it’s still possible that I will lose power before the operating system has the chance to write all of my data from the buffer to disk. People who implement file systems have to carefully consider these kinds of edge cases and define a single atomic event that constitutes crossing the Rubicon, the point of no return. In many current file systems that event is the writing of a particular disk segment in a journal with enough data to repeat the operation: if the write to the journal completes then the entire write is considered complete; if it isn’t written then any portion of the write that has completed should be invalidated.
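One common way applications make a save both durable and atomic is the write-to-temp, fsync, rename pattern. A sketch, assuming a POSIX-style file system where renaming within a directory is atomic:

```python
import os

def durable_save(path, contents):
    # Write to a temporary file first, so a crash mid-write can never
    # corrupt the existing document.
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, contents.encode("utf-8"))
        os.fsync(fd)  # ask the OS to push the buffered data to the disk
    finally:
        os.close(fd)
    # On POSIX file systems the rename is atomic: after a crash you find
    # either the old document or the new one, never a half-written mix.
    os.replace(tmp, path)

durable_save("draft.txt", "my critical document")
```

(A fully paranoid implementation would also fsync the containing directory so the rename itself is durable; even then, the guarantee only holds if the disk honors flush requests rather than lying about them from its own cache.)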

What if I can somehow guarantee that the disk write transaction has completed and my document has been written to the disk? Now how safe is my data? I’ve already touched briefly on hard disk failure rates. My disk could die for a variety of electronic or mechanical reasons, or because of non-physical corruption to either firmware or something like the file allocation table.

Again I turn to hardware and I decide to set my computer up to use RAID 1 so that my data is saved to multiple redundant disks in the same physical machine. I’ve drastically reduced the chance of losing my data to the most common disk failure issues, but my data remains at risk of being lost in a local fire or any other event which could cause physical damage to my machine. I may be able to recover the contents of one of the disks despite the machine taking a licking, but there aren’t any guarantees, and even if I can recover the data it’s likely to take significant effort and a lot of time.

I’ve pretty much run out of local options, so I run to the promise of the cloud. I script a backup of my file system to some arbitrary cloud data store every N minutes. I decide that I’m alright with losing a few updates between backups, and the data store tells me that it will mirror my data on disks in separate machines in several geographically distinct locales across the globe. So what are the odds that I lose it? Obviously a world class catastrophe like a meteor striking earth could still obliterate my data, but in that scenario I probably wouldn’t be too stressed about losing my document. So what credible threats remain?
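That backup script might look something like this sketch; the interval, directory name, and the `upload` callback are all placeholders for whatever cloud storage API is actually in use:

```python
import pathlib
import tarfile
import time

BACKUP_INTERVAL_SECONDS = 600  # "every N minutes" with N = 10; pick your own

def snapshot(source_dir, archive_path):
    # Bundle the directory into a compressed tarball. A real script would
    # then hand this file to the provider's upload API (boto3, gsutil, ...).
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=pathlib.Path(source_dir).name)

def backup_loop(source_dir, upload):
    # upload is a caller-supplied function that ships the archive to the
    # cloud data store, which promises geographic replication.
    while True:
        archive = f"backup-{int(time.time())}.tar.gz"
        snapshot(source_dir, archive)
        upload(archive)
        time.sleep(BACKUP_INTERVAL_SECONDS)
```

Losing at most one interval’s worth of updates is the explicit trade-off here: the recovery point is only as fresh as the last archive that actually made it to the data store.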

One of the biggest dangers for data stored in the cloud is the software that powers the cloud. A while ago I worked on a project (that I won’t name) that involved a very large scale distributed data store with geographic redundancy. We had fairly sophisticated environment management software that handled deploying our application plus data, monitoring the health of the system, and in some cases taking corrective action when anomalies like hardware failure were detected (for example, reimaging a machine when it first came online after getting a new disk drive). At one point a bug in the management software caused it to simultaneously begin reimaging machines in every data center around the world. The next few days ended up being pretty wild ones as we worked to mitigate the damage, brought machines back up, and worked through various system edge cases that we had never previously considered. We lost a significant amount of data, but we were fortunate because the kind of data that our system cared about could be rebuilt from various primary data stores. If that weren’t the case we would have lost critical data with significant business impact.

Another risk to data in any cloud is people with the power to bring that cloud down: a disgruntled organization member or employee, an external hacker, or even a government. When arbitrary control of a system can be obtained via any attack vector or even by physical force, one of the potential outcomes is intentional deletion of data. I’ve focused this post on data safety (by which I mean prevention of data loss) rather than data security (which I would take to mean both safety and the guarantee of keeping data private), and malicious access to data has tended to target the latter, since stolen data is lucrative. But it’s perfectly plausible that future attacks could focus on deleting or altering data and destroying the means of recovering from the loss, regardless of the degree of replication. Think digital Tyler Durden. People who stored data on MegaUpload probably never envisioned that they would lose it.

My main point is that whether data is held in local memory, on disk, replicated on a few redundant local disks, or distributed across continents and data centers, there is always some degree of risk of losing it. Based on my anecdotal experience, most people don’t associate the correct level of risk with data loss regardless of where the data lives. I think those kinds of considerations will become increasingly important as more and more data moves to both public and private clouds with varying infrastructures. There is no such thing as data that can’t be lost, only ways to make data less likely to be lost.