Jun 16, 2010

"They said it couldn't happen"

One of our data centers has suffered repeated incidents over the last few months: far more than expected, and far more than our other data centers. To sort things out, we spent the better part of the day running a full audit: infrastructure, procedures, etc.

Here are a few random thoughts about what we learned. Nothing extraordinary, but does it ever hurt to go back to basics? I think not.
  • A data center is a complex beast. Make sure you understand at least the basic concepts of power management, HVAC management, etc., or you will be fooled. There are plenty of good resources on the web: read and learn!
  • Trust nothing, especially redundancy claims (e.g. "n+1", "n+n"). Check EVERYTHING in person.
  • Beware of fake redundancy. Can 2 power lines coming into the building at the exact same spot be called redundant? Nope. The same goes for fiber/network connections. Once again, check everything.
  • Ask for technical diagrams. Look for them in all rooms. Check that they tell the truth.
  • Check if all equipment is really online. An extra transformer / generator won't be very useful if it's shut down, will it?
  • Ask for maintenance logs. Look for maintenance tags / stickers on all equipment: they usually have handwritten information about the last / next maintenance date.
  • If a water pipe reads "cold water" and feels warm to the touch, what does it tell you?
  • Are technical rooms clean? Dusty? Greasy?
  • Does your hosting contract mention support / quality procedures? Did you ever receive them? And if so, did you read them? No? BIG mistake.
  • Talk to the security guard. Ask some basic questions about access control. Ask for the access log to your suite for the last week.
  • As a matter of fact, talk to every staff member you come across. You will be surprised.
  • Check the Meet Me Room / Operator Room. If it's a mess of fiber cables, what does it tell you?
  • Try to wander in the data center. Try to open all doors that you shouldn't be able to open.
  • Try to get in without any ID. Try with drinks and food. Does anyone stop you?
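
The "fake redundancy" check above is easy to do mechanically once you have walked the paths yourself. Here is a minimal sketch, assuming you write down every physical element each feed passes through (entry point, vault, transformer, conduit...). The `Feed` type, the function names and the sample inventory are all made up for illustration; they don't come from any real tool or from our audit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feed:
    name: str
    path: frozenset  # every physical element the feed passes through


def shared_elements(feeds):
    """Return the physical elements common to every feed."""
    if not feeds:
        return frozenset()
    common = feeds[0].path
    for feed in feeds[1:]:
        common = common & feed.path
    return common


def is_truly_redundant(feeds):
    """Redundant here means: at least two feeds and nothing shared by all."""
    return len(feeds) >= 2 and not shared_elements(feeds)


# Hypothetical inventory, as collected during a walk-through.
power_feeds = [
    Feed("utility A", frozenset({"substation north", "entry vault east", "transformer 1"})),
    Feed("utility B", frozenset({"substation south", "entry vault east", "transformer 2"})),
]

print(sorted(shared_elements(power_feeds)))  # ['entry vault east']
print(is_truly_redundant(power_feeds))       # False
```

Two feeds, one shared vault: one backhoe takes out both. The same check works for fiber paths. It is only as good as the inventory you feed it, which is one more reason to check everything in person.
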
The list goes on. This may sound harsh or paranoid... and maybe it is.

However, can you afford NOT to consider the worst scenarios and see how well the data center would survive them?

What will you tell your CEO and your customers when power fails at the busiest time of the year? Or when planned maintenance goes wrong? Or when public construction cuts your "redundant" fibers?

"They said it couldn't happen"? That just doesn't work.

Remember: the more you sweat in training, the less you bleed in combat. Make sure you and your team sweat a lot.

PS: kudos to the audit team today (you know who you are). You made me proud :)