Friday, October 28, 2011

Correlated failure in distributed systems

Every time I start to get mad at Google for being closed and proprietary, they release something really interesting.  The paper, Availability in Globally Distributed Storage Systems, is loaded with interesting data on component failures and presents a nice framework for analyzing that data.  The big takeaway is that naive reasoning about "HA" deployments can lead to inflated expectations about overall system availability.

Suppose you have a clustered pair of servers and each is 99% available.  What availability can you expect from the clustered pair?  "Obviously," $99.99\%.$
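
To spell out the arithmetic: under the independence assumption, the pair is unavailable only when both servers happen to be down at the same time, so the expected availability is $$1 - (1 - 0.99)^2 = 1 - 0.0001 = 99.99\%.$$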

Unfortunately, that expectation assumes that the servers fail independently - i.e., that component failures are not correlated.  The Google research shows that this can be a bad assumption in practice.  Referring to storage systems designed with good contemporary architecture and replication schemes, the authors say, "failing to account for correlation of node failures typically results in overestimating availability by at least two orders of magnitude."  They go on to report that, because failures are so strongly correlated, measures like increasing the number of replicas or decreasing individual component failure probabilities improve overall availability far less than an independence model would predict.

Steve Loughran does a nice job summarizing the practical implications of the specific area covered by this research.  What is interesting to me is the model proposed for quantifying correlation in component failure and estimating its impact on overall system availability.  It's easy to come up with scenarios that can lead to correlated component failures, e.g. servers on a rack served by a single power source that fails; switch failures; bad OS patches applied across a cluster.  What is not as obvious is how to tell from component-level availability data what counts as a correlated failure, and how to adjust expectations of overall system availability based on the extent of correlation.

The first question you have to answer to get to a precise definition of correlation in component failure is what does it mean for two components to fail "at the same time" or equivalently what does it mean for two observed component failures to be part of a single failure event?  The Google authors define a "failure burst" to be a maximal sequence of node failures, all of which start within a given window-size, $w$, of one another.   They use $w=120$ seconds for their analysis, as this matches their internal polling interval and it also corresponds to an inflection point on the curve formed when they plot window size against percentage of failures that get clubbed into bursts.
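
As a concrete illustration, here is a minimal sketch of grouping failure start times into bursts.  It reads "within a given window-size of one another" as a chaining condition - each failure starts within $w$ seconds of the previous one - which is an assumption on my part, as are the function name and the example timestamps.

```python
def failure_bursts(start_times, window=120):
    """Group node failure start times (in seconds) into failure bursts.

    A burst is taken here to be a maximal run of failures in which each
    start time falls within `window` seconds of the previous one.
    """
    bursts = []
    for t in sorted(start_times):
        if bursts and t - bursts[-1][-1] <= window:
            bursts[-1].append(t)   # continues the current burst
        else:
            bursts.append([t])     # starts a new burst
    return bursts

# failure_bursts([0, 30, 100, 500, 610, 5000])
#   -> [[0, 30, 100], [500, 610], [5000]]
```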

We can define correlation among node failures by looking at how the nodes affected by bursts are distributed.  The practically relevant thing to look at is how often nodes from the same system architecture domain fail together - for example, to what extent node failures occur together in the same rack.  If failures are highly rack-concentrated, having redundancy only within a single rack is a bad idea.

Given a failure burst consisting of a set $N = \{f_0, ..., f_n\}$ of failing nodes and a partition $D = \{d_0, ..., d_m\}$ of $N$ into domains, we will define the $D$-affinity of $N$ to be the probability that a random assignment of failing nodes across domains will look less concentrated than what we are observing.  High $D$-affinity means correlation, low means dispersion or anti-correlation.  If domains are racks, high rack-affinity means failures are concentrated within-rack.

To make the above definition precise, we need a measure of domain concentration.  The Google paper proposes a definition equivalent to the following.  For each $i = 0, ..., m$ let $k_i$ be the number of nodes in $N$ included in $d_i$.  So for example if the $d_i$ are racks, then $k_0$ is the number of nodes in rack $0$ that fail, $k_1$ counts the failures in rack $1$, etc.   Then set $x = \sum_{i=0}^{m}{k_i \choose 2}$.  This makes $x$ the number of "failure pairs" that can be defined by choosing pairs of failing nodes from the same domain.  Clearly this is maximized when all of the failures are in the same domain (every pairing is possible) and minimized when all failing nodes are isolated in different domains.  Increasing domain concentration of failures increases $x$ and disaggregating failing nodes decreases it.
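
For a concrete example, consider a burst of six failing nodes with five in one rack and one in another.  Then $$x = {5 \choose 2} + {1 \choose 2} = 10 + 0 = 10,$$ compared with $x = {6 \choose 2} = 15$ if all six failures landed in a single rack and $x = 0$ if they were spread across six different racks.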

Now let $X$ be a random variable whose values are the values of $x$ above.  For each possible value $x$, define $Pr(X = x)$ to be the likelihood that $X$ takes this value when failing nodes are randomly distributed across domains.  Then for each value $x$, define $r_x = Pr(X < x) + \frac{1}{2}Pr(X = x)$.  So $r_x$ measures the likelihood that a random assignment of failing nodes to domains will come out less concentrated than the observed value $x$, with ties counted at half weight.  The $\frac{1}{2}$ is there to keep the measure from being biased, as we will see below.  A value of $r$ close to $1$ means that failures are highly correlated with respect to domain, while values close to $0$ indicate dispersion.  With domains equal to racks and $r$ called rack-affinity, the Google paper reports:
We find that, in general, larger failure bursts have higher rack affinity. All our failure bursts of more than 20 nodes have rack affinity greater than 0.7, and those of more than 40 nodes have affinity at least 0.9. It is worth noting that some bursts with high rack affinity do not affect an entire rack and are not caused by common network or power issues. This could be the case for a bad batch of components or new storage node binary or kernel, whose installation is only slightly correlated with these domains.
The authors point out that it can be shown that the expected value of $r$ is $0.5$.  To see this, let $x_0, x_1, ..., x_t$ be the values of $X$ as defined above and for each $i = 0, ..., t$ let $p_i = Pr(X = x_i)$.  Then the expected value of $r$ is $$E(r) = \sum_{i=0}^{t}\left\{p_i \left(\sum_{j=0}^{i-1}p_j + \frac{1}{2}p_i\right)\right\}.$$ Expanding, $E(r) = \sum_{j < i}p_i p_j + \frac{1}{2}\sum_i p_i^2$, which is exactly $\frac{1}{2}\left(\sum_i p_i\right)^2$; and since $\sum p_i = 1$, this gives $E(r) = \frac{1}{2}$.  Note that this applies to any discrete probability distribution - i.e., $r$ as above could be defined for any discrete distribution and its expectation will always be $0.5$.  Note also that $r$ never quite reaches the extremes: its minimum value is $\frac{1}{2}p_0$ and its maximum value is $1 - \frac{1}{2}p_t$.  For $X$ as defined above, $p_t$ is the probability that all failures land in the same domain, which is $1/B_N$ where $N$ is the number of failing nodes and $B_N$, the $N$th Bell number, is the number of ways that the $N$ nodes can be partitioned.
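
A quick numerical sanity check of $E(r) = 0.5$ - the probabilities below are randomly generated and purely illustrative:

```python
import random

# Sanity check that E(r) = 0.5 for an arbitrary discrete distribution.
raw = [random.random() for _ in range(10)]
total = sum(raw)
p = [v / total for v in raw]           # normalized probabilities p_0, ..., p_t

cumulative = 0.0                       # running Pr(X < x_i)
expected_r = 0.0
for p_i in p:
    r_i = cumulative + 0.5 * p_i       # r for this value of X
    expected_r += p_i * r_i            # weight by Pr(X = x_i)
    cumulative += p_i

print(expected_r)                      # 0.5, up to floating point error
```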

Computing the value of $r$ given the per-domain failure counts $k_0, k_1, ..., k_m$ is non-trivial.  According to the Google authors,
It is possible to approximate the metric using simulation of random bursts. We choose to compute the metric exactly using dynamic programming because the extra precision it provides allows us to distinguish metric values very close to 1.
I have not been able to figure out a straightforward way to do this computation.  Maybe the Googlers will release some code to do the computation on Google Code.  The only way that I can see to do it is to fully enumerate partitions over the node set, compute $x$ for each partition and build the distribution of $X$ using frequency counts.  Patches welcome :)
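
For what it's worth, here is a minimal brute-force sketch along those lines.  It assumes the reference model is a uniformly random partition of the failing nodes into domains (consistent with the $1/B_N$ remark above) - an interpretation of mine, not something the paper specifies - and it is only practical for small bursts since the number of partitions grows as the Bell numbers.  The function names are made up.

```python
from collections import Counter
from math import comb

def set_partitions(items):
    """Yield every partition of `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # put `first` into each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [partition[i] + [first]] + partition[i + 1:]
        # ...or into a new block of its own
        yield partition + [[first]]

def burst_score(counts):
    """x = number of same-domain failure pairs: sum of C(k_i, 2) over domains."""
    return sum(comb(k, 2) for k in counts)

def affinity(observed_counts):
    """r = Pr(X < x) + 0.5 * Pr(X = x), where X is the score of a uniformly
    random partition of the failing nodes, computed by full enumeration."""
    n = sum(observed_counts)
    x_observed = burst_score(observed_counts)
    scores = Counter(burst_score([len(block) for block in partition])
                     for partition in set_partitions(list(range(n))))
    total = sum(scores.values())              # the Bell number B_n
    less = sum(c for x, c in scores.items() if x < x_observed)
    equal = scores.get(x_observed, 0)
    return (less + 0.5 * equal) / total

# Example: a burst of six failing nodes, five in one rack and one in another
print(affinity([5, 1]))
```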

The Google paper stops short of developing a framework for using estimates of node failure correlation in end-to-end system availability modelling.  That would be an interesting thing to do.  Here are some simple observations that might be useful in this regard and that also illustrate some of the practical implications.

Correlation cuts both ways - i.e., it is possible to do better than independence if a system's deployment architecture splits over domains with high failure affinity.  Consider, for example, an application that requires at least one database node to be available for it to provide service.  Suppose that database node failures are perfectly rack-correlated (i.e., all database node failures are concentrated on single racks).  Then if the application splits database nodes over racks (i.e. has at least one node in each of two different racks) it can deliver continuous availability (assuming the database is the only thing that can fail).

End-to-end HA requires splitting over all domains with high failure correlation.  Suppose that in the example above, database node failures also show high switch affinity.  Then to deliver HA at the application level, in addition to having database nodes in two different racks, you also need nodes connected to at least two different switches.

As always, correlation does not imply causation.  The Google paper makes this point in a couple of places.  Suppose that in our simple example all database failures are in fact due to database upgrades and the operational practice is to apply these upgrades one rack at a time.  That will result in high rack affinity among failures, but the failures have nothing to do with the physical characteristics or failure modes of the racks or their supporting infrastructure. 

The observations above are basic and consistent with the conventional wisdom applied by operations engineers every day.  In an ideal world, HA systems would be designed to split over every possible failure domain (rack, switch, power supply, OS image, data center...).  This is never practical and rarely cost-effective.  What is interesting is how quantitative measurements of failure correlation can be used to help estimate the benefit of splitting over failure domains.  Just measuring correlation as defined above is a good start.

Monday, October 17, 2011

Open source community economics

Much has been written about how to make money from open source and the impact of open source on commercial software markets.  The economic agents in these analyses are “open source companies,” traditional technology companies leveraging open source and individuals trying to make a living writing software.  I have not found much that looks at open source communities as agents.  Here are some pretty obvious things that call out the contrast between the interests of an open source community and those of the more traditional economic agents that engage it.  Throughout, I am assuming a real open development, open source community - not a corporate fishbowl or commercial consortium.

Membership is the bottom line
Companies eventually go out of business if they do not make money - i.e., if their revenues do not exceed their costs over time.  Open source communities go out of business if they do not “make members” over time - i.e., if they do not succeed in attracting and retaining volunteers faster than people leave to do other things with their time.  Just as commercial companies need to relentlessly focus on making sure they remain profitable, OSS communities need to constantly strive to remain interesting, attractive and welcoming.  Infusing paid volunteers is one way to keep “the books balanced” but it is analogous to borrowing capital in the commercial world - the result better be a healthier, more sustainable community; otherwise the “cash infusion” is just postponing demise.

Maximizing downloads is less important than maximizing engagement
While commercial interests and committers’ individual career goals may be enhanced by focusing on achieving the highest possible levels of downloads and industry buzz, this by itself does nothing for the community.  The community indirectly benefits as a result of users who come to know about it via the “buzz” and later decide to engage.  But if maximizing on sheer numbers of “free beer consumers” in any way reduces user engagement or discourages new volunteers from getting involved, it does harm to the community.  A critical juncture is what happens when an OSS project becomes commercially important.  Inevitably, the need for “stability” starts popping up in community discussions and someone proposes that the project should move to “review then commit” (nothing gets committed until *after* it has been reviewed by the project committers).  Then comes the decision to have a “high bar” for commit.  This will “stabilize” the code in the short term, allowing vast hordes of free beer drinkers to download and use it “with confidence” and generate ever more positive industry buzz.  But it will kill the community over time.  I am not suggesting that this *has* to happen.  The Apache httpd and Tomcat projects, and many others, have managed to have it both ways - lots of downloads and “buzz” and healthy communities.  But they have had to work at it and stay focused on maintaining an environment where new volunteers are welcomed into the community, processes are transparent, there is genuine openness to new ideas and it is not hard for new contributors to learn about the project and find useful things to work on.

Problems are worth more than solutions
At Apache, we often repeat the mantra, “community before code.”  That means that if you ever have to decide between the interests of the community and the most expedient way to ship working software, the community wins.  We take time to talk about things and come to consensus and we make technical decisions based on consensus - even if that sometimes takes longer or results in less elegant code committed to the repository.  From the standpoint of the community as economic agent, its ability to attract and retain volunteers is paramount, so it makes sense that we *use* the code to feed the community, and not vice-versa.  An interesting consequence of this is that problems - whether they be bug reports or ideas for enhancements - are more valuable to the community than “donations” of code.  Large “code dumps” of finished code actually have negative value, because they distract the community with generally boring things like licensing, repackaging, and documentation and add an additional support burden.  Small contributions of code or ideas that raise interesting problems are sustaining food for the community.  Here again, the actual interests of the community as an economic agent do not correspond exactly to the interests of those consuming its “product.”  This is not surprising, because the real “customers” of an OSS community are the members of the community itself, who are typically a tiny subset of the user base of the software that is produced.

Diversity is the risk management strategy
Just as corporations have to worry about mismanagement, market forces or other externalities destroying their business models, OSS communities have to worry about internal or external problems forcing them to lose their volunteers.  Just as businesses diversify to spread risk, OSS communities “diversify” - but in the case of OSS communities, diversification means cultivating a diverse membership.  Having all key contributors work for the same company, for example, represents a material risk (assuming they are all paid to contribute).  Or having them all share a narrow view of *the one right way* to do whatever the code does.  Diversity in open source communities is a natural hedge against technological obsolescence and collective loss of interest.  Software technology is faddish and talented developers are fickle - easily attracted to the next new shiny object.  To survive in the market for eyeballs and interest, OSS communities have to be places where new ideas can happen.  This means they have to attract people who can have different ideas than what the community already has - i.e., they have to be constantly diversifying.

Monday, October 10, 2011

Rethinking authentication

Authentication may end up being the next big thing for mobile devices after the "killer app" that led to us all walking around with them - i.e., voice.  The whole concept of "user authentication" is due for an uplift, and all of us effectively becoming net POPs (points of presence) via our mobile devices opens up some interesting possibilities in this area.

Traditionally, authentication is something that punctuates and intrudes on our experience as we do things that only we should be able to do - e.g. access financial accounts, use credit cards, check in to hotels, flights, exclusive events, etc.  A person or automated system gating our access to something has to get to a sufficiently high level of confidence that we are who we say we are in order to let us in.  Authentication puts gates in front of us and we present credentials to get the gates to open.

Having an "always on" POP attached to us allows us to think about the problem differently.  Instead of authenticating at experience-interrupting gates, we can think about continuously updating our estimate of the probability that the person attached to a mobile device is the person who should be attached to that device (call this the right binding probability).  As I walk around and do stuff, take calls, get visually identified, etc., my device can provide a stream of "naturally authenticating information" (eventually based on biometrics, but also including behavioral information as well as the outcome of authentication / identification events).   When I want to do something that only I can do, my authentication state can be pushed in front of me, opening gates and eventually even eliminating most of them altogether in favor of challenges based on thresholds of the right binding probability.

There are obviously privacy considerations to think about here and at the end of the day, it will come down to how much "observation" we are going to allow in order to make authentication more convenient for us and our identities more secure.  Just allowing the phone to identify us via voiceprint and to report this event to an authentication service that we opt in to could provide a convenient second factor for financial transactions - again, without interrupting experience.

Updating right binding probabilities based on authenticating events presents an interesting mathematical modelling problem.  Each event should have an immediate impact, but its effect should decay over time.  A relatively strong event like voiceprint identification should create a significant bump and a weaker event like crossing a geo fence into a common haunt at a regular time should contribute less.  But how, if at all, should the second event affect the decay of the first event's effect?  It seems we need to keep a rolling window of recent events, including their times and an updating algorithm that looks at both existence of event types over backward-looking time intervals as well as sequencing.
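
To make the shape of the problem concrete, here is a toy sketch of one possible model: each event type gets a weight and a half-life, contributions decay exponentially with age, and events are combined as if they were independent pieces of evidence.  The event names and numbers are invented for illustration, and the independence assumption is exactly the simplification that dodges the sequencing question raised above.

```python
import time

# Illustrative weights and half-lives (seconds) - invented, not calibrated.
EVENT_WEIGHT = {"voiceprint": 0.6, "geo_fence": 0.1, "pin_entry": 0.3}
EVENT_HALF_LIFE = {"voiceprint": 3600, "geo_fence": 900, "pin_entry": 7200}

class RightBindingEstimator:
    """Toy model of the 'right binding probability': each authenticating
    event contributes a weight that decays exponentially with age, and
    contributions are combined as independent evidence."""

    def __init__(self, window_seconds=24 * 3600):
        self.window = window_seconds
        self.events = []                  # rolling list of (timestamp, type)

    def record(self, event_type, timestamp=None):
        self.events.append((timestamp or time.time(), event_type))

    def probability(self, now=None):
        now = now or time.time()
        # drop events that have aged out of the rolling window
        self.events = [(t, e) for t, e in self.events if now - t <= self.window]
        p_unbound = 1.0
        for t, event_type in self.events:
            decay = 0.5 ** ((now - t) / EVENT_HALF_LIFE[event_type])
            p_unbound *= 1.0 - EVENT_WEIGHT[event_type] * decay
        return 1.0 - p_unbound

# A voiceprint match an hour ago plus a geo fence crossing just now:
estimator = RightBindingEstimator()
estimator.record("voiceprint", time.time() - 3600)
estimator.record("geo_fence")
print(estimator.probability())
```

Whether a multiplicative combination like this is even the right shape is the open question; a more faithful model would look at sequences and combinations of events over time, not just their individual ages.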