Knight-time Ramblings: Failure in "the Cloud" for Some Australians

The Primus Data Centre in Melbourne lost power earlier today - around 2pm AEDT. No big deal you say, it's got UPSes and gensets. What's the problem?

The problem is that either the gensets or UPSes failed big time. It's now 6pm AEDT and the data centre is still having power issues. Thanks to Internode (via their network status page) and PIPE Networks (via their CEO Bevan, who posted a status report at Whirlpool) for letting us know what's going on.

The even bigger problem is the number of customers affected. Netspace - an ISP - were down nationally for around an hour. A number of hosting companies are still down, and a number of ISPs servicing Tasmania transit through this data centre, so their Tasmanian customers are isolated from the Internet.

It's understandable that the data centre lost power, given the recent heat wave in Victoria and the associated infrastructure problems (power, rail) it has caused. It's also understandable that UPSes - even arrays - fail, as do generators. What's not understandable is the number of hosting companies and ISPs that don't provide redundancy for their own infrastructure. Assuming that a data centre is always going to be up and running is a really bad assumption.

The simplest form of redundancy is having an offsite DNS server. That way you can at least respond to DNS queries, which gives you the option of swinging in temporary services at short notice to explain to customers what is going on. The same can be done for mail with an offsite MX, and for the Web with some offsite presence, especially for support services.
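As a rough illustration of the offsite DNS idea, here is a minimal Python sketch that checks whether a set of name servers all sit in the same IPv4 network. The function names, the host names and the /24 grouping are my own assumptions for the example, and the addresses come from the reserved documentation ranges. Being in different networks is only a crude proxy for being in different data centres, so treat the result as a hint, not proof.

```python
import ipaddress
from collections import defaultdict


def group_by_network(servers, prefix_len=24):
    """Group server names by the IPv4 network they fall in.

    `servers` maps a host name to its IPv4 address (as a string).
    `prefix_len` is an assumed grouping size, not a rule: a /24 is
    just a convenient stand-in for "probably the same site".
    """
    groups = defaultdict(list)
    for name, addr in servers.items():
        # strict=False lets us pass a host address and get its network back
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        groups[str(net)].append(name)
    return dict(groups)


def has_offsite_server(servers, prefix_len=24):
    """Return True if the servers span more than one network."""
    return len(group_by_network(servers, prefix_len)) > 1


# Hypothetical name servers, using documentation-only address ranges.
same_site = {
    "ns1.example.com": "192.0.2.10",
    "ns2.example.com": "192.0.2.11",
}
two_sites = {
    "ns1.example.com": "192.0.2.10",
    "ns2.example.com": "198.51.100.5",
}

print(has_offsite_server(same_site))
print(has_offsite_server(two_sites))
```

In practice you would feed this the addresses behind your domain's NS (or MX) records, then confirm with your provider where those machines physically live.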

So if you're expecting 24x7 from a hosting provider, you probably need to ask them how many data centres they're running in and how much redundancy there is across those data centres. Even a whole data centre is a single point of failure.

Sunday, February 01, 2009
