Saturday, December 11, 2010

There's no rollback?

Preventing bugs before production is really, really hard. Your code moves from the warm, safe, controlled world of your dev and testing environments to the cold, harsh reality of production. Since you know some of your deployments are going to fail, and you can't predict which ones, you always need to be able to roll back.

Last night, a vendor completed a hardware migration that hosed a critical component of one of my applications. I woke up this morning to a flurry of crisis emails. My response was simple: "roll back." To my surprise, rolling back was not an option. How in the world can any vendor with customers release a change that can't be rolled back? The outage lasted over 3 hours. For those keeping score at home, one outage like that takes your app from 99.99% reliability for the year to slightly better than 99.95%. For a portion of our students, it meant a rescheduled class, which really sucks.
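For anyone who wants to check the score at home, here is a rough sketch of the arithmetic in Python. It assumes reliability is measured over a 365-day year and that the 3-hour outage is the only downtime in that year; the function names are mine, purely for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: reliability is measured over one non-leap year


def availability_after_outage(outage_hours, period_hours=HOURS_PER_YEAR):
    """Availability (in percent) if the outage is the only downtime in the period."""
    return 100.0 * (1 - outage_hours / period_hours)


def downtime_budget_minutes(availability_percent, period_hours=HOURS_PER_YEAR):
    """Minutes of downtime allowed in the period at a given availability target."""
    return (1 - availability_percent / 100.0) * period_hours * 60


print(round(availability_after_outage(3), 3))    # 99.966
print(round(downtime_budget_minutes(99.99), 1))  # 52.6
print(round(downtime_budget_minutes(99.95), 1))  # 262.8
```

Under those assumptions, a 99.99% target leaves you under an hour of downtime for the whole year, so a single 3-hour outage spends that budget more than three times over.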

Rollbacks require a little additional planning and sometimes additional work to support, but they are always, always, always worth the additional expense. When a bug occurs, you can't possibly know how long it will take to fix. Just ask Tumblr.
Don't release your software without rollbacks.

Here are some basic tips for ensuring rollback:

  1. Ask yourself or your engineers, "How do we roll back?"
  2. Make sure your app and database changes are backwards compatible.
  3. Deploy and validate database (and any other infrastructure) changes before any user-facing app changes.
  4. Deploy and validate one app at a time.
  5. Script all of your deployments and rollbacks. We have our own in-house deployment system, but Chef and Puppet seem to be popular tools for creating these scripts.

You can check out a reminder I wrote to myself about how to identify when to roll back here.
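To make tips 3 through 5 a bit more concrete, here is a minimal sketch in Python of a scripted deployment that validates each step and rolls back when validation fails. This is not our in-house system, and it is not how Chef or Puppet work; `Step`, `run_deployment`, and `make_step` are hypothetical names made up for this example, and the deploy, validate, and rollback callables stand in for whatever your real scripts do.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One deployable unit, paired with its own validation and rollback."""
    name: str
    deploy: Callable[[], None]
    validate: Callable[[], bool]
    rollback: Callable[[], None]


def run_deployment(steps: List[Step]) -> bool:
    """Deploy steps in order; on the first failure, undo everything in reverse."""
    attempted: List[Step] = []
    for step in steps:
        # Record the step before deploying so a half-applied step is undone too.
        attempted.append(step)
        try:
            step.deploy()
            healthy = step.validate()
        except Exception:
            healthy = False
        if not healthy:
            for done in reversed(attempted):
                done.rollback()
            return False
    return True


# Hypothetical usage: infrastructure first, then one app at a time.
log: List[str] = []


def make_step(name: str, healthy: bool = True) -> Step:
    return Step(
        name=name,
        deploy=lambda: log.append(f"deploy {name}"),
        validate=lambda: healthy,
        rollback=lambda: log.append(f"rollback {name}"),
    )


steps = [make_step("database"), make_step("app-1"), make_step("app-2", healthy=False)]
succeeded = run_deployment(steps)
print(succeeded)
print(log)
```

In this sketch a failure anywhere undoes every step, database included. If your database changes really are backwards compatible (tip 2), you could reasonably leave them in place and only roll back the app that failed.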

There is exactly one type of change (that I've encountered) that is actually hard to roll back: public DNS changes. So try really hard not to screw those up, or set your TTLs really low.
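Why low TTLs help: a resolver that cached your record can keep serving it for up to the TTL it was handed. The sketch below works through the timing in Python. It assumes resolvers actually honor TTLs (not all of them do), and the function names and the example TTL values are mine, for illustration only.

```python
from datetime import datetime, timedelta


def earliest_safe_change(ttl_lowered_at: datetime, old_ttl_seconds: int) -> datetime:
    """Earliest time by which every cache that honors TTLs has picked up the lowered TTL.

    A resolver may have cached the record just before the TTL was lowered,
    so wait out one full old TTL before making the real change.
    """
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)


def worst_case_stale_window(current_ttl_seconds: int) -> timedelta:
    """Longest a well-behaved resolver keeps serving the old answer after a change or rollback."""
    return timedelta(seconds=current_ttl_seconds)


# Hypothetical example: a record with a one-day TTL, lowered to five minutes.
lowered_at = datetime(2010, 12, 10, 12, 0)
print(earliest_safe_change(lowered_at, 86400))
print(worst_case_stale_window(300))
```

With the one-day TTL still in effect, a botched DNS change could linger in caches for up to a day after you revert it. With the five-minute TTL, the same mistake is mostly gone five minutes after the rollback.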

On behalf of all of your future customers, please make sure you have a way to roll back your software.