Saturday, November 19, 2011

Thinking Together

"Nestor, gladly will I visit the host of the Trojans over against us, but if another will go with me I shall do so in greater confidence and comfort. When two men are together, one of them may see some opportunity which the other has not caught sight of; if a man is alone he is less full of resource, and his wit is weaker. "  Homer, Illiad 10  (Diomedes about to go out spying on the Trojans)
I have always loved the simplicity in the image above, which Plato paraphrased simply as "when two go together, one sees before the other."  Real collaboration works this way - like Diomedes and Odysseus, we think together, and do more, better, faster than we could alone.  For this to work, we have to be thinking about the same thing.  Rather than just periodically sharing fully working ideas, we need to think out loud, sharing the not-yet-sense that eventually workable ideas come from.

I remember when I was a mathematics grad student I thought I would never be capable of producing the brilliantly incomprehensible work that I saw published in professional journals.  It took me forever to read papers and there appeared to be so many little miracles embedded in them that there had to be some magical wizardry involved.  All of that changed when I started thinking together with some real mathematicians.  Seeing the thought process, the not-quite-sense at the base of the ideas, the stops and starts, the workarounds, let me see that while I would never be Diomedes, I could at least have a little fun as Odysseus.

The key here is to actually externalize and share the thought process underneath ideas in the making, rather than just handing them back and forth as "products."  I am belaboring this point because it bears on a dynamic in the open source world that I find alternately exciting and troubling - the rise of distributed version control systems (DVCS).  Thanks to Git and GitHub, the process of contributing to open source has become much easier and the number of projects and contributors is exploding.  That is the exciting part.  The thing that I sometimes worry about is that the "forking is a feature" meme that DVCS encourages can take us away from thinking together and toward "black box" exchange of completed ideas.  The traditional, central VCS, central mailing list model works more or less like this:
  1. Someone has an "itch" (idea for enhancement, bug fix, idea for a change of some kind)
  2. Post to the project development mailing list, talking about the idea
  3. Patch, maybe attached to an issue in the project issue tracker, maybe directly committed
  4. Commit diff from the central repo triggers an email to the list - everyone sees each small change
  5. Idea evolves via discussion and commits, all visible to all following the project
Using DVCS, the process can work differently:
  1. Someone has an itch
  2. Clone the repo, creating a local branch
  3. Experiment with new ideas in the local branch, possibly publishing it so others can see and play with it
  4. Submit a request for the changes in the local branch to be incorporated in the "mainline"
  5. Changes are eventually integrated in a batch
Now there is no reason that the second flow can't involve the full community and be done in a thinking together way in small, reversible steps; but it is also easy for it to just happen in relative isolation and to end with a big batch of "black box" changes integrated in step 5.  If this becomes the norm, then the advantage OSS communities get from thinking about the same thing as they evolve the code is significantly diminished.  It's also a lot less fun and interesting to review large, unmotivated pull requests than to participate in the development of ideas and work collaboratively on the code expressing them.  But most importantly, if the primary means of communication becomes patch sets developed by individuals or small subcommunities, and the only thinking together that the community does is after-the-fact review, we lose one of the biggest benefits of open source - the collaborative development of ideas by a diverse community influencing their origin and direction.

The DVCS train has left the station, and in any case tools by themselves don't determine community dynamics.  It's up to those of us who participate in OSS communities to keep engaging in the real collaboration that created the great software and communities-around-code that have made OSS what it is.  So let's take a lesson from Diomedes.  Great warrior that he was, he still thought it best to bring "the wily one" with him when he went out into new territory.

Friday, October 28, 2011

Correlated failure in distributed systems

Every time I start to get mad at Google for being closed and proprietary, they release something really interesting.  The paper, Availability in Globally Distributed Storage Systems, is loaded with interesting data on component failures and it presents a nice framework for analyzing failure data.  The big takeaway is that naive reasoning about "HA" deployments can lead to inflated expectations about overall system availability.

Suppose you have a clustered pair of servers and each is 99% available.  What availability can you expect from the clustered pair?  "Obviously," $99.99\%.$
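To spell out the arithmetic behind that "obvious" answer: if the two servers fail independently, the pair is down only when both are down at once, so the pair's availability is $$1 - (1 - 0.99)^2 = 1 - (0.01)^2 = 1 - 0.0001 = 99.99\%.$$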

Unfortunately, that expectation assumes that the servers fail independently - i.e., component failures are not correlated.  The Google research shows that this can be a bad assumption in practice.  Referring to storage systems designed with good contemporary architecture and replication schemes, the authors say, "failing to account for correlation of node failures typically results in overestimating availability by at least two orders of magnitude."  They go on to report that, because node failures are so highly correlated, measures like increasing the number of replicas or decreasing individual component failure probabilities have much weaker effects on overall availability than independence-based reasoning would suggest.

Steve Loughran does a nice job summarizing the practical implications of the specific area covered by this research.  What is interesting to me is the model proposed for quantifying correlation in component failure and estimating its impact on overall system availability.  It's easy to come up with scenarios that can lead to correlated component failures, e.g., servers on a rack served by a single power source that fails; switch failures; bad OS patches applied across a cluster.  What is not as obvious is how to tell from component-level availability data what counts as a correlated failure, and how to adjust expectations of overall system availability based on the extent of correlation.

The first question you have to answer to get to a precise definition of correlation in component failure is what does it mean for two components to fail "at the same time" or equivalently what does it mean for two observed component failures to be part of a single failure event?  The Google authors define a "failure burst" to be a maximal sequence of node failures, all of which start within a given window-size, $w$, of one another.   They use $w=120$ seconds for their analysis, as this matches their internal polling interval and it also corresponds to an inflection point on the curve formed when they plot window size against percentage of failures that get clubbed into bursts.
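Here is a minimal sketch (in Python, with names of my own invention) of how that burst definition can be applied to a list of failure start times; I am reading "within $w$ of one another" as "within $w$ of the previous failure in the sequence," which is what makes the bursts maximal.

```python
from typing import List

def failure_bursts(start_times: List[float], w: float = 120.0) -> List[List[float]]:
    """Group node-failure start times (in seconds) into bursts: a new burst
    begins whenever a failure starts more than w seconds after the previous one."""
    bursts: List[List[float]] = []
    for t in sorted(start_times):
        if bursts and t - bursts[-1][-1] <= w:
            bursts[-1].append(t)   # continues the current burst
        else:
            bursts.append([t])     # starts a new burst
    return bursts

# Three failures within two minutes of each other form one burst;
# a fourth failure an hour later starts a new one.
print(failure_bursts([0, 30, 100, 3700]))   # [[0, 30, 100], [3700]]
```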

We can define correlation among node failures by looking at how the nodes affected by bursts are distributed.  The practically relevant thing to look at is how often nodes from system architecture domains fail together - for example, to what extent do node failures occur together in the same rack.  If failures are highly rack-concentrated, for example, having system redundancy only within-rack is a bad idea.

Given a failure burst consisting of a set $N = \{f_0, \ldots, f_n\}$ of failing nodes and a partition $D = \{d_0, \ldots, d_m\}$ of $N$ into domains, we will define the $D$-affinity of $N$ to be the probability that a random assignment of failing nodes across domains will look less concentrated than what we are observing.  High $D$-affinity means correlation, low means dispersion or anti-correlation.  If domains are racks, high rack-affinity means failures are concentrated within-rack.

To make the above definition precise, we need a measure of domain concentration.  The Google paper proposes a definition equivalent to the following.  For each $i = 0, ..., m$ let $k_i$ be the number of nodes in $N$ included in $d_i$.  So for example if the $d_i$ are racks, then $k_0$ is the number of nodes in rack $0$ that fail, $k_1$ counts the failures in rack $1$, etc.   Then set $x = \sum_{i=0}^{m}{k_i \choose 2}$.  This makes $x$ the number of "failure pairs" that can be defined by choosing pairs of failing nodes from the same domain.  Clearly this is maximized when all of the failures are in the same domain (every pairing is possible) and minimized when all failing nodes are isolated in different domains.  Increasing domain concentration of failures increases $x$ and disaggregating failing nodes decreases it.
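In code (names mine), the concentration measure is a one-liner over the per-domain failure counts:

```python
from math import comb

def concentration(counts):
    """x = sum over domains of C(k_i, 2): the number of same-domain pairs
    that can be formed from the failing nodes, given per-domain counts k_i."""
    return sum(comb(k, 2) for k in counts)

print(concentration([4, 0, 0]))  # all four failures in one rack -> 6
print(concentration([2, 1, 1]))  # spread out                    -> 1
```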

Now let $X$ be a random variable whose values are the values of $x$ above.  For each possible value $x$ define $Pr(X = x)$ to be the likelihood that $X$ will take this value when failing nodes are randomly distributed across domains.  Then for each value $x$, define $r_x = Pr(X < x) + \frac{1}{2}Pr(X = x)$.  Then $r_x$ measures the likelihood that a random assignment of failing nodes to domains will result in concentration smaller than the observed $x$, with ties counted at half weight.  The $\frac{1}{2}$ is to prevent the measure from being biased, as we will see below.  A value of $r$ close to $1$ means that failures are highly correlated with respect to domain, while values close to $0$ indicate dispersion.  With domains equal to racks and $r$ called rack-affinity, the Google paper reports:
We find that, in general, larger failure bursts have higher rack affinity. All our failure bursts of more than 20 nodes have rack affinity greater than 0.7, and those of more than 40 nodes have affinity at least 0.9. It is worth noting that some bursts with high rack affinity do not affect an entire rack and are not caused by common network or power issues. This could be the case for a bad batch of components or new storage node binary or kernel, whose installation is only slightly correlated with these domains.
The authors point out that it can be shown that the expected value of $r$ is $.5$.  To see this, let $x_0 < x_1 < \ldots < x_t$ be the values of $X$ as defined above, listed in increasing order, and for each $i = 0, ..., t$ let $p_i = Pr(X = x_i)$.  Then the expected value of $r$ is $$E(r) = \sum_{i=0}^{t}\left\{p_i \left(\sum_{j=0}^{i-1}p_j + \frac{1}{2}p_i\right)\right\}.$$Since $\sum p_i = 1$, we must have $(\sum p_i)^2 = 1$.  Expanding this last sum and the sum for $E(r)$, it is easy to see that $E(r) = \frac{1}{2}(\sum p_i)^2$.  Note that this applies to any discrete probability distribution - i.e., $r$ as above could be defined for any discrete distribution and its expectation will always be $.5$.  Note also that while $r$ can get close to $0$ (its minimum value is $\frac{1}{2}p_0$), its maximum value is $1 - \frac{1}{2}p_t.$  For $X$ as defined above, $p_t$ is the probability that all failures are in the same domain, which is $1/B_N$ where $N$ is the number of failing nodes in the burst and $B_N$, the $N$th Bell number, is the number of ways that the $N$ nodes can be partitioned.
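For completeness, here is the expansion step spelled out: grouping the terms of the squared sum by their larger index gives $$1 = \left(\sum_{i=0}^{t}p_i\right)^2 = \sum_{i=0}^{t}p_i^2 + 2\sum_{i=0}^{t}\sum_{j=0}^{i-1}p_i p_j = 2\sum_{i=0}^{t}p_i\left(\sum_{j=0}^{i-1}p_j + \frac{1}{2}p_i\right) = 2E(r),$$ so $E(r) = \frac{1}{2}$.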

Computing the value of $r$ given the per-domain failure counts $k_0, k_1, \ldots, k_m$ is non-trivial.  According to the Google authors,
It is possible to approximate the metric using simulation of random bursts. We choose to compute the metric exactly using dynamic programming because the extra precision it provides allows us to distinguish metric values very close to 1.
I have not been able to figure out a straightforward way to do this computation.  Maybe the Googlers will release some code to do the computation on Google Code.  The only way that I can see to do it is to fully enumerate partitions over the node set, compute $x$ for each partition and build the distribution of $X$ using frequency counts.  Patches welcome :)
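Here is a brute-force sketch of that enumeration in Python - every partition of the failing nodes is treated as equally likely, $x$ is computed for each, and $r$ comes from the resulting frequency counts.  It only works for small bursts (the number of partitions is the Bell number $B_N$), and it is emphatically not the dynamic program the paper alludes to; all the names are mine.

```python
from math import comb
from collections import Counter

def set_partitions(items):
    """Yield all partitions of a list as lists of blocks (Bell-number many)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        for i in range(len(partition)):          # put `first` in an existing block...
            yield partition[:i] + [partition[i] + [first]] + partition[i + 1:]
        yield partition + [[first]]              # ...or in a block by itself

def concentration(block_sizes):
    # same concentration measure x as above
    return sum(comb(k, 2) for k in block_sizes)

def affinity(observed_counts):
    """r = Pr(X < x_obs) + 0.5 * Pr(X = x_obs), where X is the concentration of a
    uniformly random partition of the same number of failing nodes."""
    n = sum(observed_counts)
    x_obs = concentration(observed_counts)
    dist = Counter(concentration([len(b) for b in p])
                   for p in set_partitions(list(range(n))))
    total = sum(dist.values())
    below = sum(c for x, c in dist.items() if x < x_obs)
    return (below + 0.5 * dist.get(x_obs, 0)) / total

# A six-node burst with five failures in one rack is far more concentrated than
# a random partition; an evenly spread burst is not.
print(affinity([5, 1]))        # close to 1 -> strong rack affinity
print(affinity([2, 2, 1, 1]))  # well below 0.5 -> dispersed
```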

The Google paper stops short of developing a framework for using estimates of node failure correlation in end-to-end system availability modelling.  That would be an interesting thing to do.  Here are some simple observations that might be useful in this regard and that also illustrate some of the practical implications.

Correlation cuts both ways - i.e., it is possible to do better than independence if a system's deployment architecture splits over domains with high failure affinity.  Consider, for example, an application that requires at least one database node to be available for it to provide service.  Suppose that database node failures are perfectly rack-correlated (i.e., any given burst of database node failures is confined to a single rack).  Then if the application splits database nodes over racks (i.e., has at least one node in each of two different racks) it can deliver continuous availability (assuming the database is the only thing that can fail).

End-to-end HA requires splitting over all domains with high failure correlation. Suppose that in the example above, database node failures also show high switch affinity.  Then to deliver HA at the application level, you need to ensure that, in addition to having database nodes in two different racks, you also have nodes connected to at least two different switches.

As always, correlation does not imply causation.  The Google paper makes this point in a couple of places.  Suppose that in our simple example all database failures are in fact due to database upgrades and the operational practice is to apply these upgrades one rack at a time.  That will result in high rack affinity among failures, but the failures have nothing to do with the physical characteristics or failure modes of the racks or their supporting infrastructure. 

The observations above are basic and consistent with the conventional wisdom applied by operations engineers every day.  In an ideal world, HA systems would be designed to split over every possible failure domain (rack, switch, power supply, OS image, data center...).  This is never practical and rarely cost-effective.  What is interesting is how quantitative measurements of failure correlation can be used to help estimate the benefit of splitting over failure domains.  Just measuring correlation as defined above is a good start.

Monday, October 17, 2011

Open source community economics

Much has been written about how to make money from open source and the impact of open source on commercial software markets.  The economic agents in these analyses are “open source companies,” traditional technology companies leveraging open source and individuals trying to make a living writing software.  I have not found much that looks at open source communities as agents.  Here are some pretty obvious things that call out the contrast between the interest of an open source community and the more traditional economic agents that engage it.  Throughout, I am assuming a real open development, open source community - not a corporate fishbowl or commercial consortium.

Membership is the bottom line
Companies eventually go out of business if they do not make money - i.e., if their revenues do not exceed their costs over time.  Open source communities go out of business if they do not “make members” over time - i.e., if they do not succeed in attracting and retaining volunteers faster than people leave to do other things with their time.  Just as commercial companies need to relentlessly focus on making sure they remain profitable, OSS communities need to constantly strive to remain interesting, attractive and welcoming.  Infusing paid volunteers is one way to keep “the books balanced” but it is analogous to borrowing capital in the commercial world - the result better be a healthier, more sustainable community; otherwise the “cash infusion” is just postponing demise.

Maximizing downloads is less important than maximizing engagement
While commercial interests and committers’ individual career goals may be enhanced by focusing on achieving the highest possible levels of downloads and industry buzz, this by itself does nothing for the community.  The community indirectly benefits as a result of users who come to know about it via the “buzz” and later decide to engage.  But if maximizing sheer numbers of “free beer consumers” in any way reduces user engagement or discourages new volunteers from getting involved, it does harm to the community.  A critical juncture comes when an OSS project becomes commercially important.  Inevitably, the need for “stability” starts popping up in community discussions and someone proposes that the project should move to “review then commit” (nothing gets committed until *after* it has been reviewed by the project committers).  Then comes the decision to have a “high bar” for commit.  This will “stabilize” the code in the short term, allowing vast hordes of free beer drinkers to download and use it “with confidence” and generate ever more positive industry buzz.  But it will kill the community over time.  I am not suggesting that this *has* to happen.  The Apache httpd and Tomcat projects, and many others, have managed to have it both ways - lots of downloads and “buzz” and healthy communities.  But they have had to work at it and stay focused on maintaining an environment where new volunteers are welcomed into the community, processes are transparent, there is genuine openness to new ideas and it is not hard for new contributors to learn about the project and find useful things to work on.

Problems are worth more than solutions
At Apache, we often repeat the mantra, “community before code.”  That means that if you ever have to decide between the interests of the community and the most expedient way to ship working software, the community wins.  We take time to talk about things and come to consensus and we make technical decisions based on consensus - even if that sometimes takes longer or results in less elegant code committed to the repository.  From the standpoint of the community as economic agent, its ability to attract and retain volunteers is paramount, so it makes sense that we *use* the code to feed the community, and not vice-versa.  An interesting consequence of this is that problems - whether they be bug reports or ideas for enhancements - are more valuable to the community than “donations” of code.  Large “code dumps” of finished work actually have negative value, because they distract the community with generally boring things like licensing, repackaging, and documentation and add an additional support burden.  Small contributions of code or ideas that raise interesting problems are sustaining food for the community.  Here again, the actual interests of the community as an economic agent do not correspond exactly to the interests of those consuming its “product.”  This is not surprising, because the real “customers” of an OSS community are the members of the community itself, who are typically a tiny subset of the user base of the software that is produced.

Diversity is the risk management strategy
Just as corporations have to worry about mismanagement, market forces or other externalities destroying their business models, OSS communities have to worry about internal or external problems forcing them to lose their volunteers.  Just as businesses diversify to spread risk, OSS communities “diversify” - but in the case of OSS communities, diversification means cultivating a diverse membership.  Having all key contributors work for the same company, for example, represents a material risk (assuming they are all paid to contribute).  Or having them all share a narrow view of *the one right way* to do whatever the code does.  Diversity in open source communities is a natural hedge against technological obsolescence and collective loss of interest.  Software technology is faddish and talented developers are fickle - easily attracted to the next new shiny object.  To survive in the market for eyeballs and interest, OSS communities have to be places where new ideas can happen.  This means they have to attract people who can have different ideas than what the community already has - i.e., they have to be constantly diversifying.

Monday, October 10, 2011

Rethinking authentication

Authentication may end up being the next big thing for mobile devices after the "killer app" that led to us all walking around with them - i.e., voice.  The whole concept of "user authentication" is due for an uplift, and all of us effectively becoming network points of presence (POPs) via our mobile devices opens up some interesting possibilities in this area.

Traditionally, authentication is something that punctuates and intrudes on our experience as we do things that only we should be able to do - e.g. access financial accounts, use credit cards, check in to hotels, flights, exclusive events, etc.  A person or automated system gating our access to something has to get to a sufficiently high level of confidence that we are who we say we are in order to let us in.  Authentication puts gates in front of us and we present credentials to get the gates to open.

Having an "always on" POP attached to us allows us to think about the problem differently.  Instead of authenticating at experience-interrupting gates, we can think about continuously updating our estimate of the probability that the person attached to a mobile device is the person who should be attached to that device (call this the right binding probability).  As I walk around and do stuff, take calls, get visually identified, etc., my device can provide a stream of "naturally authenticating information" (eventually based on biometrics, but also including behavioral information as well as the outcome of authentication / identification events).   When I want to do something that only I can do, my authentication state can be pushed in front of me, opening gates and eventually even eliminating most of them altogether in favor of challenges based on thresholds of the right binding probability.

There are obviously privacy considerations to think about here and at the end of the day, it will come down to how much "observation" we are going to allow in order to make authentication more convenient for us and our identities more secure.  Just allowing the phone to identify us via voiceprint and to report this event to an authentication service that we opt in to could provide a convenient second factor for financial transactions - again, without interrupting experience.

Updating right binding probabilities based on authenticating events presents an interesting mathematical modelling problem.  Each event should have an immediate impact, but its effect should decay over time.  A relatively strong event like voiceprint identification should create a significant bump and a weaker event like crossing a geo fence into a common haunt at a regular time should contribute less.  But how, if at all, should the second event affect the decay of the first event's effect?  It seems we need to keep a rolling window of recent events, including their times, and an updating algorithm that looks at both the presence of event types over backward-looking time intervals and their sequencing.
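To make the problem concrete, here is a toy sketch of the simplest thing that could possibly work: each event type gets a weight and a half-life, each observation's contribution decays exponentially, and the total evidence is squashed into a probability.  Every name and number below is invented for illustration, and the additive, independent decay deliberately ignores the sequencing and interaction effects that make the real modelling problem interesting.

```python
import math
import time

# Illustrative weights and half-lives (seconds) per event type - not real tuning.
EVENT_WEIGHTS = {"voiceprint_match": 3.0, "geo_fence_routine": 0.5, "pin_entry": 2.0}
HALF_LIVES    = {"voiceprint_match": 3600, "geo_fence_routine": 1800, "pin_entry": 7200}

class RightBindingEstimator:
    """Toy estimate of the probability that the right person holds the device."""

    def __init__(self, window_seconds=24 * 3600):
        self.window = window_seconds
        self.events = []                     # rolling list of (event_type, timestamp)

    def observe(self, event_type, timestamp=None):
        self.events.append((event_type, timestamp or time.time()))

    def probability(self, now=None):
        now = now or time.time()
        # Drop events that have aged out of the rolling window.
        self.events = [(e, t) for e, t in self.events if now - t <= self.window]
        # Each remaining event contributes weight * 2^(-age / half_life).
        evidence = sum(EVENT_WEIGHTS[e] * 2 ** (-(now - t) / HALF_LIVES[e])
                       for e, t in self.events)
        return 1 / (1 + math.exp(-(evidence - 2.0)))   # arbitrary logistic squash

est = RightBindingEstimator()
est.observe("voiceprint_match")
est.observe("geo_fence_routine")
print(round(est.probability(), 2))   # fresh, strong evidence -> fairly high
```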

Tuesday, September 27, 2011

Client Oriented Architecture

For no particular reason, I have been thinking a fair amount recently about the CAP theorem and how the basic problem that it presents is worked around in various ways by contemporary and even ancient systems.

I remember, years ago as a freshly minted SOA zealot, being confused by the pushback that I got from mainframe developers who insisted that client applications needed to have more control over how services were activated and how they worked.  I always thought that good, clean service API design and "separation of concerns," along with developer education and evangelism, would make this resistance go away.  I was wrong.

I still think the basic idea of SOA (encapsulation and loose coupling) is correct; but once you shatter the illusion of the always-available, always-consistent central data store, you need to let the client do what it needs to do.  The whole system has to be a little more "client-oriented."

The Dynamo Paper provides a great example of what I am talking about here.  I am not sure it is still an accurate description of how Amazon's applications work; but the practical issues and approaches described in the paper are really instructive.  According to the paper, Dynamo is a key-value store designed to deliver very high availability but only "eventual consistency" (i.e., at any given time, there may be multiple, inconsistent versions of an object in circulation and the system provides mechanisms to resolve conflicts over time).  For applications that require it, Dynamo lets clients decide how to resolve version conflicts.  To do that, services maintain vector clocks of version information and surface what would in a "pure" SOA implementation be "service side" concerns to the client.  To the further horror of SOA purists, the paper also reports that in some cases, applications that have very stringent performance demands can bypass the normal service location and binding infrastructure - again, letting clients make their own decisions.  Finally, they even mention the ability of clients to tune the "sloppy quorum" parameters that determine effective durability of writes, availability of reads and incidence of version conflicts.
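The client-driven reconciliation piece can be sketched in a few lines of vector-clock bookkeeping.  This is a generic illustration of the idea, not Amazon's API: keep a clock per version, discard versions whose clocks are dominated, and when the survivors are causally concurrent, let the client's merge policy decide - the shopping-cart-style union below is just one such policy.

```python
def dominates(a, b):
    """True if clock `a` has seen at least everything clock `b` has (componentwise >=)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(versions, merge):
    """Dynamo-style client reconciliation: versions are (value, vector_clock) pairs.
    Dominated versions are discarded; if more than one survives, the versions are
    causally concurrent and the client-supplied merge() decides the new value."""
    survivors = [(v, c) for v, c in versions
                 if not any(other is not c and dominates(other, c)
                            for _, other in versions)]
    if len(survivors) == 1:
        return survivors[0][0]
    return merge([v for v, _ in survivors])

# Two replicas diverged: neither clock dominates the other, so the client merges.
v1 = ({"beer"},         {"nodeA": 2, "nodeB": 1})
v2 = ({"beer", "wine"}, {"nodeA": 1, "nodeB": 2})
print(reconcile([v1, v2], merge=lambda values: set().union(*values)))  # {'beer', 'wine'}
```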

Despite the catchy title for this post, I don't mean to suggest that SOA was a bad idea or that we should all go back to point-to-point interfaces and tight coupling everywhere.  What I am suggesting is that just having clean service APIs at the semantic, or "model" level and counting on the infrastructure to make all decisions on behalf of the client doesn't cut it in the post-CAP world.  Clients need to be allowed to be intelligent and engaged in managing their own QoS.  The examples above illustrate some of the ways that can happen.  I am sure there are lots of others.  An interesting question is how much of this it makes sense to standardize and how much ends up as part of service API definitions.  Dynamo's "context" is a concrete example.  It looks like it just rides along in the service payloads, so it is effectively standardized into the infrastructure.



Sunday, September 4, 2011

Scaling DevOps

In a great InfoQ talk, John Allspaw (Flickr, Etsy) presents a compelling argument for really aggressively breaking down the barriers between IT development and operations.  The talk offers lots of simple examples that anyone who has ever managed development and/or operations can relate to.  It also shows great tech leadership attitude, IMO.

One thing that Allspaw mentions offhand is that the organizational scalability of the model is still being proved out.  It is ironic that the secret behind manageable massive scalability in some of the largest web sites may be small team size.  When the whole tech team consists of 20 people, it is not hard to get them all in a room and develop and maintain a completely shared vision.

There are two questions that I have been thinking about related to this.  First, how do you scale it organizationally - i.e., how do you make it work in large organizations with multiple applications?  Second, while I think the basic principles of DevOps are great, how do you manage the collision with ITIL and other traditional management processes and structures?  I find myself more than a little amused by this post that basically points out the lack of any kind of blueprint or methodology worked out beyond "break down the silos" and doing the obvious in terms of config management, continuous integration and "infrastructure as software."

The scaling problem here is the same problem faced by "new management" principles such as the transition from bureaucracy to dynamic linking advocated by Steve Denning.  Part of what is needed is the management equivalent of inversion of control - some kind of "execution framework" enabling small, collaborating but decentralized teams to achieve their goals without either having a static master plan or detailed knowledge of everything everyone else is doing [1].  The other part is a scalable approach to prioritization, planning and performance measurement.  Again, without a static top-down process, we need a way to develop plans and measures that maintain a direct and strong connection between each small team and the end customer and product.

The second question is even harder.  Allspaw acknowledges for example that trying to throw out traditional change management altogether is a bad idea, even with the greatest, most well-coordinated team.  For some changes, you need to, as he puts it "get out the stack of ITIL books" and follow a more traditional process.  The problem here is both what to aim for and how to engineer the transformation, especially when the point of departure is a large group with traditional processes and detached, bureaucratic management.


[1] For another way to see the analogy here, have a look at Robert C Martin's,  Dependency Inversion Principle.  Look at the small, self-directed teams as the "lower level modules," senior management as "higher level modules" and the "abstractions" as the higher level goals and values of the enterprise.

Friday, August 19, 2011

Engineering emergent behavior

Complex adaptive systems seem to be all the rage now.  This month's Harvard Business Review has several articles on managing complexity.   It has become fashionable to use computational complexity to explain failure or lack of engineering control in diverse areas, ranging from derivatives pricing to strong AI.

All of this is fun and interesting, though not exactly new.  It was new in the early 1960's when Lorenz demonstrated the impossibility of predicting the weather.  Or in the 1970's when the "chaos cabal" began developing the ideas of self-organization and emergent behavior in nonlinear dynamical systems.

So what does all of this mean to us today?  In particular, how exactly are policy-makers and leaders supposed to "embrace complexity?"  The HBR articles make some reasonable practical recommendations that I am not going to repeat.  What I am interested in thinking about here is the general question of what it means to try to engineer or control emergent behavior and how the systems that we inhabit need to change to support meaningful attempts at this.

Unexpected outcomes are to be expected.
Emergent behavior is by definition unpredictable.  Broad patterns can be anticipated and to some extent engineered; but new and different behaviors need to be understood and exploited, rather than attempting to "minimize variance" or ensure micro-level predictability.  In a recent article on open source product evolution, Tarus Balog talks about one way to cultivate what might be called emergent value by means of what he calls the "practice effect" - release early, release often and allow the community to work with and extend the product.  What this comes down to is presenting the interacting agents that make up a complex adaptive system with opportunities for productive evolution rather than trying to push completely predetermined outcomes through the system.  Commercial companies trying to leverage open source are learning the art of how to do this.  Open source community leaders are learning broadly applicable lessons from this experience as well.

Extreme sensitivity to initial conditions means small changes can create big effects.
Lorenz famously named this the "butterfly effect."  This is the essence of why long-term behavior of chaotic dynamical systems is not practically predictable and it is what makes managing complex adaptive systems extremely difficult.  The best strategy for dealing with this again comes from open source - move things along in what Stefano Mazzocchi dubbed "small, reversible steps."  At the Apache Software Foundation, most projects follow a process that we call "commit, then review."  Committers make changes to the source code and then the community reviews the changes.  "Patches" (where Apache got its name) get applied visibly in small increments and the community provides feedback.  If the feedback is not good, the patches get rolled back.  Big bang commits of very large amounts of code or sweeping changes are rare.  Proceeding via small, reversible steps and observing effects while reversibility is still possible is a great strategy for dealing with complex adaptive systems when this is practical.
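For anyone who has not played with this, the butterfly effect takes about ten lines to see for yourself (the logistic map is just the standard toy example; it has nothing to do with OSS):

```python
def logistic(x, r=3.9):
    """One step of the logistic map x -> r*x*(1-x), chaotic at r = 3.9."""
    return r * x * (1 - x)

a, b = 0.3, 0.3 + 1e-9          # initial conditions differing by one part in a billion
for step in range(1, 51):
    a, b = logistic(a), logistic(b)
    if step % 10 == 0:
        print(f"step {step:2d}: |difference| = {abs(a - b):.6f}")
# After a few dozen steps the two trajectories bear no resemblance to each other.
```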

There is no kernel.
The traditional systems engineering model starts with a "kernel" and builds manageable complexity on top of it.  Linux and the TCP/IP protocol stack are great examples.  Starting with a stable and solid foundation, we build specialized and advanced capabilities on top.  Similarly, governments, policies and organizations can be looked at this way, with a stable, centralized core forming the basis for bureaucratically articulated extension points.  Complex adaptive systems have no kernels.  Their core dynamics are driven by networks of relationships that evolve dynamically over time.  That means that to effectively drive change in these systems, you need to focus not on top-down, centralized programs but decentralized connection-oriented interventions.  See this article for an interesting analysis of a practical example in education reform.  As a side note, check out this fascinating analysis of how TCP/IP has itself evolved (or resisted evolution).

Broad and concurrent search for solutions.
Engineering well-conditioned, deterministic systems always comes down to posing and solving optimization problems sequentially.  Public policy and strategic planning in the command and control model of corporate governance has traditionally mimicked this approach.  Form a task force or planning team to analyze a problem.  Develop a plan.  Mobilize the organization to execute.  If things go awry, revisit the plan and try something else.  In computer science terms, this could broadly be described as depth-first search.  Given the weak and only extremely local engineering control available in complex adaptive systems, this approach tends to fail miserably.  What is needed is massively concurrent, breadth-first search.  Instead of a centralized, top-down approach to engineering emergent behavior, a loosely federated approach distributed across interaction points gets to better solutions faster.  Google works this way both literally as a search technology and by some accounts as an organization as well.

Friday, July 29, 2011

Courage and humility

...my dear hope is that political leaders will have the courage and the humility as well to overcome political sensitivity and concerns and doctrines, which are perfectly legitimate, for the sake of the entire country and for the sake of the global economy.   
IMF Managing Director Christine Lagarde in a PBS News Hour interview,  on the U.S. debt ceiling crisis.
Courage and humility.  Without these qualities, leaders simply can't be counted on.  Ms. Lagarde later says,
Political courage is required. And I'm not - I'm not suggesting that it is lacking, but it has to be demonstrated in moments of crisis. I was last week in Brussels, and there was a moment of courage and solidarity amongst the European leaders, members of the eurozone area. It comes in times in crisis. And when it does, it's quite extraordinary.
Well, we are all dearly hoping to see this real soon now from our US political leaders. 

The combination of these two qualities is what makes leaders who can be counted on.  To be humble, to acknowledge mistakes, to genuinely listen, to take accountability - all of these things require courage and all are necessary to gain confidence, followership and support in difficult times.  A great leader once said, "the role of a leader is to define reality and to give hope."  In difficult times, the "defining reality" part often requires accepting responsibility, acknowledging mistakes and sometimes being clear about uncertainty and personal shortcomings.  It also means asking for help and being open and humble in accepting it.  What some leaders don't understand is that showing this kind of courage and humility in defining reality provides a solid foundation for hope and when you give people that kind of hope, amazing things can happen.  

Ms. Lagarde's gentle but firm admonishment to the US political leadership really applies to all of us.  We have all enjoyed the "exorbitant privilege" that she (quoting Giscard d'Estaing) ascribes to our currency.  That "privilege" was earned by generations of Americans who built homes, families, careers, businesses and institutions showing these key qualities of leadership.  We need to get back to that.  Now.  Not just in politics, not just in government, but all of us, every day.

Those of you who are parents, teachers or leaders, think about the example that you are setting.  Think about what behaviors you are rewarding and who you are choosing to promote.  Will your children/students/proteges respond with courage and humility or CYA, obfuscation, finger-pointing or "positioning" when real problems arise?  We can only reverse this slide into weak, self-centered, myopic leadership by calling it out when we see it and stopping the cycle of rewarding it.

Thank you, Ms. Lagarde for calling us out!

Saturday, July 23, 2011

Saving the phenomenon of OSS

I just reread Dan Pink's Drive, a book that I recommend highly for anyone who leads people or is interested in how human motivation works.  The basic premise of the book is that there is a growing gap between "what science knows and what business does" to motivate people.  He calls the traditional setup that still dominates the mainstream corporate world "Motivation 2.0" and makes a compelling case that this throwback to the industrial age is in need of a major version bump - Motivation 3.0.    Drawing on a substantial body of behavioral science research, Pink demonstrates that the "If-then" reward structure that dominates how we try to motivate people today is not well-suited to the post-industrial workplace where "work" is no longer primarily made up of boring, repetitive tasks.  Instead, he argues that intrinsic rewards are far more effective in today's world.  He calls out three primary drivers of intrinsic motivation: mastery, autonomy and purpose.

What I found very interesting, and at the same time a little troubling, after reading this book, is how well it saves the phenomenon of open source contribution.  Pink mentions OSS as an example of the beginnings of Motivation 3.0.  At least looking at my own involvement in OSS, the drivers above do a pretty good job explaining how it is that I am willing to spend so much time volunteering.  Much better than what other "anthropologists" have come up with, which generally devolves to Motivation 2.0-speak (we do it because it will enable us to make more money somehow).  Working with great developers on hard problems can provide a sense of mastery that is hard to attain otherwise.  Autonomy is also key.  We "scratch itches" as - and when - we have them.  And finally, there is a sense of contributing to something larger than ourselves or our code - a purpose.

The troubling bit is that we stand at the crossroads now in open source.  We got where we are as a result of really forward-thinking community engagement - being on the vanguard of Pink's Motivation 3.0.   People are now, in large numbers, being paid to work on OSS projects and commercial software companies are "open sourcing" their products faster than established OSS communities can absorb them.   Will we be able to maintain the powerful intrinsic reward structure that has gotten us to this point and use our elevated position to drive positive change in the Motivation 2.0-dominated establishment?  Or will the "OSS phenomenon" gradually vanish as we collectively "grow up?"

Whatever hats I may wear, I will be trying to lead the change toward environments that provide the kinds of intrinsic rewards that I have enjoyed in OSS and have tried my best to bring to the corporate world.

Tuesday, July 19, 2011

Self-organization is "supra-optimal"

Thanks to a post shared by Eric Le Goff on Google+, I ran across a very interesting article describing some fundamental research on neuronal network organization.  The authors have developed a research "platform" that enables them to grow physical neuronal networks on chips.  They are using the platform to observe self-organizing behavior of the networks.  What is becoming clear from this and other research is that the large-scale architecture and macro behavior of our brains is most likely the result of self-organizing collaboration among a hierarchy of network clusters, each "operating" with locally conditioned architecture and control.

After reading the article, I was embarrassed to notice that I had not yet heard of Plos One, where it appeared.  Poking about Plos One, I stumbled on yet another article on self-organizing behavior.  Somebody must be trying to tell me something.   The second article is a nice visceral illustration of how we need to think differently about how to manage complex systems.   The authors look at train scheduling in public transportation systems.  They show that "global" scheduling systems for train arrival and departure times can be beaten by a system that imitates ant colony behavior - trains emitting "anti-pheromones" that signal trains behind them to modify their behavior.   What is interesting here is that independent decision-making based on locally available information can result in better global outcomes than a globally engineered solution.

These ideas are not new.  What these two articles illustrate, however, is how much change is going to be required in how we think about modelling and engineering to take advantage of them.  For example, the whole concept of "control" in the pure engineering sense is going to have to be transformed to engineer the kinds of systems that the second article hints at.  Imagine trying to apply six sigma to that transportation system.   Similarly, the first article indicates that it is likely hopeless to try to get much more than we already have from macro-level decomposition of the brain or functional/reductionist analysis of isolated neuronal networks.

I also notice that Plos.org has opened up their search APIs and (jointly with Mendeley) launched "a contest to build the best apps that make science more open."  Hmm...somebody is really trying to tell me something here.

Saturday, July 16, 2011

Remembering how to count

Today, rummaging through the boxes of books that follow me from house to house, I stumbled on a first edition, 1940 hardbound copy of Mathematics and the Imagination by Edward Kasner and James Newman that my mother gave me years ago.  Kasner is probably best known for having come up with the name - or more precisely goading a child into inventing the name - "googol."  The book begins with a delightful account of how the very large but finite is sometimes confused by adults and "moderns" with the infinite, while small children and the ancient Greeks could see things more clearly.  I like this quote:
The Greeks had very definite ideas about the infinite.  Just as we are indebted to them for much of our wit and learning, so are we indebted to them for much of our sophistication about the infinite.  Indeed, had we always retained their clear-sightedness, many of the problems and paradoxes about the infinite would never have arisen.
The authors do a really nice job motivating what was to me as a student just a definition - that cardinality is based on one-to-one correspondences.  What they bring out is that this really is just counting.

Another great point that comes out in many places in the book is how mathematicians often choose what turn out to be stupid names for things, because we can name them before we know them.  A great example is "transcendental numbers."  Such a name suggests something exotic or rare.  It turns out that almost every real number is transcendental.

What does that mean, exactly?  I am going to try to explain this in less time than it takes you to get bored - in just a short walk through a truly beautiful neighborhood built by Georg Cantor.

A transcendental number is a number that cannot be expressed as a solution to a polynomial equation with rational coefficients.  The square root of two, for example, is not transcendental because it is a solution to the equation $x^2 - 2 = 0$.  It turns out that both e and π are transcendental. Initially mathematicians thought this was a very special property worthy of such an exalted name.

Cantor saw two things: 1) the algebraic numbers (the complement of the transcendental numbers) can be put into one-one correspondence with the set of natural numbers (i.e., they are countable), and 2) the whole set of real numbers is uncountable (cannot be put into 1-1 correspondence with the natural numbers).  That means that all but a comparatively tiny set of real numbers are transcendental!  You can take away all of the algebraic numbers without affecting the size of the set of real numbers (or leaving any "holes" in the real line).

Imagine a spreadsheet with an infinite number of rows and columns:
A0 A1 A2 A3 ...
B0 B1 B2 B3 ...
C0 C1 C2 C3 ...
D0 D1 D2 D3 ...
...
Cantor realized that as long as the row and column indexes were all natural numbers, the entire set of cells could be put into 1-1 correspondence with the natural numbers (i.e. "counted") by just starting in the top left corner and following a path that snakes outward along the diagonals:  A0, A1, B0, C0, B1, A2, A3, B2, C1, D0, E0, D1, C2, B3, A4, ... (E being the fifth row, hidden in the "..." above).
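For the programmers following along, the snaking walk is easy to write down.  A sketch, with rows and columns both numbered from 0 (so A0 is (0, 0), B0 is (1, 0), and so on):

```python
from itertools import islice

def diagonal_walk():
    """Enumerate the cells (row, column) of an infinite grid by snaking back and
    forth along the anti-diagonals, so every cell is reached in finitely many steps."""
    diagonal = 0
    while True:
        cells = [(row, diagonal - row) for row in range(diagonal + 1)]
        # Reverse every other diagonal so the path snakes rather than jumps.
        yield from (cells if diagonal % 2 else reversed(cells))
        diagonal += 1

# A0, A1, B0, C0, B1, A2, A3, B2, C1, D0, ...
print(list(islice(diagonal_walk(), 10)))
# [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (0, 3), (1, 2), (2, 1), (3, 0)]
```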

Since the rational numbers are just numerator/denominator pairs of integers, the technique above can be used to count them - i.e., to show that the rationals are countable.  But really, what it shows is that 1) the set of ordered pairs of any countable set is countable and 2) the union of countably many countable sets is countable.  The second is clear when you think of the rows of the spreadsheet as enumerations of the elements in the sets over which the union is taken.

Now imagine a "spreadcube" - a rubix-cube like thing that has little subcubes for cells.  Again, imagine it as countably infinite in all three dimensions.  The "pick a corner and snake out" game works there too.  So the set of ordered triples (a, b, c) from any countable set is also countable.  To get to the next dimension, you can either just test your 4-dimensional intuition (too hard for me) or just look at (a, b, c, d) as ((a, b, c), d) and use the fact that there are only countably many (a, b, c) (note that we could have done this for n = 3 instead of imagining the spreadcube).  Obviously, there is no stopping us here, so we see that for each n, the set of n-tuples of elements from any countable set is countable.   Since the coefficients of a polynomial determine the polynomial, that means that for each n, there are only countably many polynomials of degree n with rational coefficients.  By 2) above that means that there are only countably many polynomials with rational coefficients.  Each of these has only finitely many roots, so the union of all of the sets of roots of all of the polynomials with rational coefficients is countable - i.e. there are only countably many algebraic numbers.

Now the coup de grâce.  Since we seem to be able to count everything, let's suppose that via snakes, pairing or some other ruse we have managed to enumerate the real numbers - i.e., we have put them in a list that goes r0, r1, r2 ... and eventually hits every real number.  Every real number has a decimal expansion.  Some terminate, like 5.2, some repeat, like 0.333333..., and some, like π, do exotic things that keep computers busy for indefinite amounts of time; but all have decimal representations and two real numbers are equal if and only if they have exactly the same values in their decimal expansions in every digit [1].  Suppose that our enumeration looked like this:

r0 = 12.238769045
r1 =  3.1415923988888
r2 = 23.1783459823999
...
Consider the number r = 0.359 in relation to the three numbers above.  It is obviously not equal to any of them.  It was constructed to be different from any of them as follows.  Put a 3 in the first decimal digit because r0 has a 2 there; put a 5 in the second decimal digit because r1 has a 4 there; put a 9 in the third decimal digit because r2 has an 8 there.  Now imagine the list continuing.  We continue to define the decimal expansion of our "special" number r, making it different from the nth element in the list in the nth decimal place.  The result will be a number that differs from every number in the list.  This is a contradiction, proving that no such enumeration can exist.

This is not a satisfying place to leave Mr. Cantor's neighborhood, since we have not succeeded in counting the real numbers at all.  So even though you are likely, like my dog Casper, getting a little tired of this walk, I need to stop by one more place on the way home.  To count the real numbers, let's start by convincing ourselves that it suffices to count the elements of the open interval (0, 1).  Look at the middle section of the graph below of the tangent function


The middle section of this graph shows that (-π/2,π/2) can be mapped onto the full real line. Every y value in the graph corresponds to some x value in the bounded open interval above.   The graph depicts a one-to-one correspondence between the whole real line (the y axis) and the little π-length interval in the middle of the picture along the x axis. Just squishing the picture a little and moving it to the right shows that (0, 1) could similarly be mapped onto the whole real line.
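For the record, one explicit such map is $$f(x) = \tan\left(\pi\left(x - \frac{1}{2}\right)\right),$$ which sends the open interval $(0, 1)$ one-to-one onto the whole real line - it is exactly the tangent graph above, squished to width one and slid to the right.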

Now, similarly to decimal representation, every real number has a binary expansion.  The number 1/2 is 0.1 in binary, 0.11 is 1/2 + 1/4 = 3/4, 0.101 is 1/2 + 0/4 + 1/8 = 5/8 and so on.  Like decimal expansions, binary expansions are unique, some terminate and some don't, and two numbers between 0 and 1 are equal iff their binary expansions have 1's (or 0's) in all the same places.  Therefore, there is a one-one correspondence between the real numbers between 0 and 1 and the set of subsets of the natural numbers.  Let each real number correspond to the set of natural numbers where its binary expansion has a "1".  So 1/2 corresponds to {0}, 3/4 corresponds to {0, 1}, 5/8 corresponds to {0, 2} and so on.  The set of all subsets of a set is called the power set of the set.  The above shows two important things: 1) the power set of the natural numbers is uncountable and 2) the power set of the natural numbers has the same cardinality as the real numbers.  An interesting question is whether there exists a size in between the natural numbers and the real numbers.  We are not going to see an answer to that question in this neighborhood.

[1] Strictly speaking, some numbers have two infinite decimal expansions.  For example 0.9999999 (repeating indefinitely) equals 1.  We are assuming above that all of the representations selected are minimum length (i.e. if there is a finite representation, choose that).