The Nerdings

tales of a devops dude @ particle.io

Cloud Operations Is a Game of Pandemic

Three days ago, I had the privilege of participating in an epic 24-hour cloud-ops swarm session, troubleshooting various issues introduced with the rollout of Particle’s new pricing scheme and website. In the delirious hours before I fell asleep in between PagerDuty alerts, I thought, dang, this sucks, but it’s kind of awesome too, like a game of Pandemic…

Cloud operations is a game of Pandemic.

A contagion leaks into a complex system, creating havoc in many different places.

Source unknown, direction of causality unknown, root causes unknown, symptoms intermittent.

What do you do?

You’ve got N people each with their own unique specialized skillset

  • An Operations Expert connecting dots and bringing minds together so the team functions as a hive
  • A Dispatcher orchestrating communications and facilitating low friction group behavior
  • A Scientist rapidly perceiving causes of multi-factor failures and providing complex cures quickly
  • A Medic handling localized crises quickly without fear, killing small problems before they spread, and contributing to the broader strategy
  • A Researcher who provides valuable knowledge, precise observations, and actionable insights when partnered with the right person
  • A Contingency Planner queuing up tasks in case what’s happening now doesn’t work or the virus spreads.
  • A Quarantine Specialist: I have no metaphor for this character at this time (this is the Internet so it’s cool like that.)

There are many ways to die and 1 way to win

  • Win by stopping the contagion from killing everything.
  • Lose by getting consumed by it.

Under conditions of crisis, you don’t know what N is going to be and what the threats are

  • Is it one engineer trying to play all of these roles at 3am in response to a page?

  • Is it 5 engineers swarming on a crazy failure situation all day and night, enumerating mitigations as crisis after crisis compounds and new variables come and go?

Conclusion

You never know which roles will be available in a crisis situation. The best team capable of preventing the broadest range of failures in a complex system is the one in which the greatest number of individuals can play the greatest number of roles using the best available technology.

The cultural practice of DevOps, in the right organizational environment, gives rise to high-performing teams that can manage the routine failures of complex software systems quickly and efficiently under pressure, while having fun and feeling good about it when it’s all said and done, after a good night’s sleep.

  • Are you a spectacular, collaborative communicator who likes to play fun, intense, complex games? (read: you love what you do and like to work with smart people)

  • Do you have an insanely deep specialized skill and/or absurdly broad software engineering breadth? (read: can you play the Dispatcher and the Scientist?)

  • Wanna play Pandemic at Particle? (read: a cloud ops job)

We’re hiring for Cloud Ops/DevOps/Platform Reliability Engineers. If you answered yes to any of those questions, please get in touch with me. And if not, make sure to make time to play Pandemic with friends :).