Kudos to Dan Cohn and Tom Killalea of Amazon.com for the title and for their impressive keynote presentation at the NANOG meeting this month. It gave us a glimpse of their amazing computing infrastructure, its design principles and some of the (visualization) tools they use to manage it. However, what stunned me was how familiar their challenges were and how easily I could sympathize with their advice. My background is in managing considerably smaller systems, yet it seems the problems I have encountered scale even to a planetary-scale distributed system. :) From now on my checklist will explicitly include: massive-scale services need to be simple; policy that is not enforced by tools is useless; automate everything that does not need a human operator; use common metrics for all parts of the operation; hire good people. However, that doesn't do justice to the actual content, so a couple of verbatim quotes follow:

Simplicity isn't achievable as a passive goal. It’s a force that must be actively applied -- Charles Moore, father of FORTH

Always anticipate the next order of magnitude of growth, even if it’s a struggle.

The network is the only authoritative resource that exists.

It can take multiple iterations to get to “simple”

Applications meet the Network - Complementary instrumentation on both sides can contribute to automated recovery

It is reassuring to hear that simple is beautiful even at planetary scale. :)

-- fenris 2008-02-19 08:45:15

