Cloud outages are never going to go away. The minute the IT downtime problem is solved, we'll all have to find something else to complain about, after all. But still, unexpected cloud downtime remains one of the larger pain points skeptics point to when they get their hackles up regarding cloud computing.

These same people seem to forget about the amount of downtime they experience within the four walls of their organizations—but there are IT people to blame at that point. The faceless cloud is something else. Still, cloud services fared well last year, according to research from RightScale, which recently released an infographic with all kinds of data points regarding cloud outages in 2012.

RightScale found that there were 27 notable publicly reported cloud outages around the world. We reported on many of these outages on Talkin' Cloud. Six of them were actually caused by Hurricane Sandy. However, when we're talking about these outages, we're including hosting providers as well. Based on the RightScale information, 26 percent of outages in 2012 involved private data centers, another 26 percent involved public clouds, 7 percent involved SaaS offerings, and 41 percent involved hosting providers.

The average time to recover? A whopping 7.5 hours. Depending on what went down and how your business or your customers' businesses were affected, that's either a minor inconvenience or a serious loss of revenue. As more business-critical applications move to cloud and hosted models, there is a need to keep things up and running as reliably as possible. Frequently, public cloud providers take the largest share of blame, but by these numbers they're no more likely to cause outage issues than private data centers. And keep in mind these are publicly announced (or discovered) outages, which is information the average private business isn't about to divulge to the world.

The biggest causes included power loss or failed backup (33 percent), natural disaster (21 percent), traffic and DNS routing (21 percent), software bugs (12 percent), human error (6 percent), failed storage systems (3 percent) and network connectivity issues (3 percent).

"I think the takeaway is there are all kinds of causes," said RightScale CEO Michael Crandell in an interview with Talkin' Cloud.

Knowing the causes is little comfort to businesses suffering from downtime, but with some resiliency planning and some SLAs, businesses can give themselves the best chance of staying up.

"Everybody has outages, and the real trick is you need to plan for them and design for them," Crandell said.

Charles King, principal analyst at Pund-IT, said cloud outages came up as an important subject at this week's IBM Pulse conference.

"One problem I see (besides the continuing hype around cloud) is what might be called a misalignment between what cloud providers can deliver and what customers require," King told Talkin' Cloud. "In the case of service uptime, working with well-known and trusted vendors who are willing/able to abide by strong, binding SLAs is about the best thing customers can do. Even if things go south (and they usually do at some time or another) at least both sides know what to expect in terms of disaster response."

King also noted that the most publicized cloud outages are those from the biggest providers. When Amazon Web Services suffers an outage, it can take down the likes of Netflix or Heroku. When a smaller provider goes down, it's less noticeable, and certainly not top news.

"It's a bit like what happens in a utilities blackout—if just one building is affected, its inhabitants are likely the only ones who notice. But if the lights go out in a major city or geographic region, headlines are soon to follow," King said.

For customers, though, King said it's wise to recognize the complexities and limitations of any given service. As usual, do some due diligence.

"While cloud can provide significant benefits, including relief from the complexities and responsibilities of IT infrastructure management, its flip side is a loss of control. Before organizations can expect to enjoy the former, they need to consider and understand the implications of the latter," King said.

But what about gaining some resiliency? RightScale provided a five-point checklist:

  • Create a cost-vs.-downtime equation.
  • Validate application architecture for failover.
  • Examine public cloud vendors, internal capabilities and costs to reduce cascading dependencies.
  • Automate everything and test for workload portability.
  • Determine an appropriate approach to managing infrastructure.
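
The first item on that checklist can be made concrete with a little arithmetic. What follows is only a rough sketch, not RightScale's own formula: the function names, the one-outage-per-year rate, the $10,000-per-hour revenue loss, the 30-minute failover recovery time and the $40,000 resiliency budget are all hypothetical figures chosen for illustration. The only number taken from the research is the 7.5-hour average recovery time.

```python
def expected_downtime_cost(outages_per_year, hours_per_outage, loss_per_hour):
    """Rough annual cost of downtime: frequency x duration x hourly loss."""
    return outages_per_year * hours_per_outage * loss_per_hour


def resiliency_pays_off(baseline_cost, mitigated_cost, resiliency_spend):
    """True when the downtime cost avoided exceeds what resiliency costs."""
    return (baseline_cost - mitigated_cost) > resiliency_spend


# Hypothetical business: one outage a year, losing $10,000 per hour.
# 7.5 hours is the average recovery time reported in the RightScale data.
baseline = expected_downtime_cost(1, 7.5, 10_000)

# Assumed effect of a failover design: recovery drops to 30 minutes.
mitigated = expected_downtime_cost(1, 0.5, 10_000)

# Assumed yearly cost of running the extra failover capacity.
worth_it = resiliency_pays_off(baseline, mitigated, 40_000)

print(f"Baseline downtime cost:  ${baseline:,.0f}")
print(f"Mitigated downtime cost: ${mitigated:,.0f}")
print(f"Resiliency spend justified: {worth_it}")
```

Plug in your own outage history and revenue numbers; the point is simply that the decision to pay for failover should fall out of a comparison like this rather than a gut feeling.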

Unlike the Dungeons & Dragons magic system, cloud computing is not "fire and forget." Solution providers should keep that in mind when working with their customers on their cloud strategies.