Building in a Burning House: Tools for Damn’ Downtime
By Chrissie Brodigan
16 December 2010 | Category: Web Apps
When you provide a web service, uptime is a big f&$king deal.
If you build a web app and grow a core and vibrant user base, at some point you might find yourself building inside a burning house.*
$hi% happens, and it can happen during the most innocuous upgrades and maintenance. Working with your designers to design for downtime (ahead of time) can be tremendously helpful and set you apart from your competitors. You can have uptime somewhere on the web; you just need to design for its delivery.

Photo By 111 Emergency
Downtime Happens to the Best & Brightest of Us
By now, I’m sure you’ve experienced downtime somewhere on the interwebs. This seems to be the season for some of my most dependable and favorite web apps (not just Twitter) to suffer some pretty disruptive downtime.
Just this past week, beloved Tumblr suffered an outage of more than 24 hours, and while there were a lot of sympathetic tweets (“hang in there guys”), there were also a lot of gripes. There’s a better way to handle downtime: design for it ahead of time.

David Karp, CEO of Tumblr, addressed one of his company’s biggest problems, rapid growth:
Frankly, keeping up with growth has presented more work than our small team was prepared for — with traffic now climbing more than 500M pageviews each month. But we are determined and focused on bringing our infrastructure well ahead of capacity as quickly as possible.
We’ve nearly quadrupled our engineering team this month alone, and continue to distribute and enhance our architecture to be more resilient to failures like today’s.
Startups would be foolish to over-staff or build out in bulk too early, but smart developers (especially those on your sys ops team) can predict what it will take to ensure a stable future 3-6 months in advance. Their designer counterparts can help create canned experiences for crises, as well as great experiences for messaging around growth and upgrades. Planning once you already have capacity issues is both harder and more expensive. It’s also a time when you expose your vulnerabilities to competitors, disgruntled hackers, and more, while stressing out your team and disappointing your users.
Twitter’s troubles represent an edge case. Most startups aren’t going to break Cassandra, but if they’re lucky, a lot of startups are going to outgrow their founding database and search structures, hardware, and hosting solutions. Some apps, like Forrst and Carbonmade, rebuild (to be better, faster, stronger, and smarter) and migrate entirely.
5 Ideas for Proactive Designers to Soften the Pain of Downtime

Screen capture from help desk software provider Zendesk’s recent migration
#1. Create Accurate & Time-Stamped Messaging
Users get truly frustrated when they don’t know why a site is down, what is wrong, or when things will be back up and running again. From the outside there’s no real way to know if something is a quick update or a full-on meltdown.
Be clear, transparent, and on top of managing user inquiries. Leaving up a single message for the full outage can cause confusion and frustration. Update your messaging as you come to understand the problem, and design it with a time stamp.
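To make the time stamp idea concrete, here is a minimal sketch of how a maintenance page might build its message. The function name `render_status_message` and its parameters are made up for illustration, and it assumes you pass in a timezone-aware time for the last update.

```python
from datetime import datetime, timezone


def render_status_message(summary, updated_at, next_update_minutes=None):
    """Build a plain-text downtime message that always carries a time stamp.

    Hypothetical helper for illustration only. `updated_at` is assumed to be
    a timezone-aware datetime; it is shown in UTC so users in any timezone
    can tell how fresh the message is.
    """
    stamp = updated_at.astimezone(timezone.utc).strftime("%d %b %Y, %H:%M UTC")
    lines = [summary, f"Last updated: {stamp}"]
    if next_update_minutes is not None:
        # Telling users when to check back saves a lot of refresh-mashing.
        lines.append(f"Next update expected within {next_update_minutes} minutes.")
    return "\n".join(lines)


if __name__ == "__main__":
    message = render_status_message(
        "We're migrating our database. Your data is safe.",
        datetime(2010, 12, 16, 14, 30, tzinfo=timezone.utc),
        next_update_minutes=30,
    )
    print(message)
```

Each time your team learns something new, call it again with a fresh summary and the current time, so nobody is left staring at a stale message.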
#2. Leverage External Communication Assets
Work with your users to prepare them for your downtime. By letting your customers know that there’s scheduled maintenance, you minimize their frustration, but you also give them a chance to create and deploy customer service solutions on their end (e.g. they can create custom maintenance messages, make announcements, manage staffing, etc.).
Also, leverage “must have” external & internal assets:

Screen capture from Zendesk’s status page
- Create & maintain a Twitter “Ops” account (you don’t necessarily need or want to market your downtime in your company’s main Twitter stream, but you can certainly use that stream to refer people to the Ops account for ongoing information)
- Create a trust or status page that tracks your uptime (this is a best practice for SaaS; you can do this with Stashboard or, like Form Spring does, host your status page elsewhere). If you’ve created an ecosystem around an API, this will be useful for your counterpart developers and their users.
- Guide your users to your support presence(s) where they can indulge in self-help or at least get a sense of the steps you’re taking to fix the problem (this might be a blog, a hosted Zendesk or help desk, a Get Satisfaction forum, or a Twitter account)
- Make sure your customers can email, phone, or comment to you, even if it’s in the comments section of a personal blog or a sympathetic pre-recorded voice message on a line. Conversation provides clarity, connection, and reassurance
- Use video well and wisely. Imagine how awesome it would have been if David Karp had created a quick video talking to his users that he put up on YouTube, addressing the issues and providing transparency and being personable at a really sucktastic moment.**
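For the status page idea in the list above, the uptime number itself is simple arithmetic. Here is a rough sketch of how it could be computed from a log of outages. The function name `uptime_percent` and the shape of the outage data are assumptions made for this example (they are not how Stashboard or any other tool does it), and it assumes the outages don’t overlap each other.

```python
from datetime import datetime, timedelta


def uptime_percent(window_start, window_end, outages):
    """Return uptime as a percentage of a reporting window.

    Hypothetical helper for illustration only. `outages` is a list of
    (start, end) datetime pairs, assumed not to overlap one another.
    Outages are clipped to the window so that one straddling the edge
    only counts for the part that falls inside it.
    """
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after window_start")
    down = 0.0
    for start, end in outages:
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (total - down) / total


if __name__ == "__main__":
    start = datetime(2010, 12, 1)
    end = start + timedelta(days=30)
    # One 24-hour outage in a 30-day window.
    outage = (datetime(2010, 12, 5), datetime(2010, 12, 6))
    print(round(uptime_percent(start, end, [outage]), 2))  # 96.67
```

A single day down in a month drops you to about 96.67%, which is a handy number to keep in mind when you decide what to promise on a trust page.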
#3. Engineers, Designers, & Customer Support Coworking
As a part of your “designing for downtime” (scheduled and unscheduled) strategy, you should have a plan in place that both customer support and development/design teams have collaborated on. Customer support team members are usually able to pull edge cases out of your user base that no one would have even thought about, and they will also be super helpful and proactive during any unusual outages. Basically, reach out to your customers rather than waiting around for them to reach out to you in distress.
Development and design teams are strongest when they work with their customer support counterparts. For example, when your site depends on an external app like Twitter’s API, you can plan for messaging specific to what isn’t working and why, as well as what is working. Be careful not to place blame; instead, create transparency, reduce tension, and recover well.
#4. Schedule Some Sleep & Slack
During a crisis, do your best to give your team members rest, good food, and moderate caffeine. Your best people are working hard, but they are working under stress, and more mistakes get made during an extended adrenaline-driven push.
Photo by NetDiva
When you have scheduled downtime, no matter how tight your plan is, something will pop up when you least expect it. If possible, aim for a “no new work” week prior to major events like migrations, and make sure your local pubs enact a #noserve ’til uptime policy.
#5. Users Notice Your Downtime, Not Your Uptime
Uptime is critical to your success. Downtime is like a ticking clock: the longer it runs, the more negative content your users can generate, the more business gets lost, and the more relationships and trust break down. You’re not the only pretty girl out there on the market.
Photo By Bart Hiddink, zoutedrop
Turns Out, Uptime is Customer Service & Uptime Can Be Messaging During Downtime
Customers don’t always get to see or hear from the development team (depending on their unique levels of user-friendliness, this is generally a good thing!), but uptime is their responsibility, and it is customer service regardless of whether downtime is planned or unplanned.
Having a plan to be “up” somewhere during scheduled and unscheduled downtime is your responsibility when it comes to loving your users.
*The burning house metaphor is credited to Zendesk’s Tim Sturge. Until meeting and working with Tim and the rest of the Zendesk team, I never understood how important it is for sys ops and design to work together on creative solutions. I owe him a tremendous “thanks” for all he has taught me. Tim and Zendesk’s other talented engineers blog here.
**David Karp and Tumblr’s team are among the most talented and devoted to the Tumblr community. If you ever get the chance to meet him, he’s so much more than content!

