Disaster Recovery - Perceived Risk vs Cost

Posted by Optimation Editor on 16 March 2012

Disaster Recovery is about ensuring your business can continue running after a disaster strikes. It is important to understand the different types of disaster your systems may suffer and decide whether they are a significant risk to your business or not.

There are many different types of disaster possible. Some examples include:

1) Data loss requiring recovery from a backup
Data loss could be caused by data corruption or a disk failure, either of which means recovering your data from a backup. Assuming the backups are readily available and up-to-date, a restore should get your systems back up and running within a matter of hours.

2) Hardware loss requiring hardware replacement
Depending on your infrastructure, the recovery time from a hardware loss could be quite similar to that of a data loss. Recovery options range from moving the services to other hardware through to having the vendor supply replacement hardware and reinstalling the services on it (probably from your backups).

3) Data centre outage e.g. an outage of power, cooling or network connectivity
Data centres spend a lot of time and money engineering redundancy into their cooling, power supply and network infrastructure to reduce the likelihood of outages. Unfortunately, the best laid plans do sometimes fail, whether through diesel fuel contamination in the backup generators or a contractor somehow managing to cut through the redundant fibre links for internet access. Accidents do happen! Outages can run from minutes to days in this scenario.

4) Physical loss such as fire, theft or sabotage
Physical losses to fire and theft are also normally mitigated by the hosting provider, through the physical security and fire suppression systems at their data centre. These types of disaster are comparatively rare.

5) Physical loss such as an earthquake or tsunami
Physical loss due to a natural disaster like an earthquake is considered the worst type of disaster. Generally the downtime after an event of this type is measured in weeks rather than hours or days and, as with fire and theft, there may be little chance of recovering data from the original hardware.

In all of these scenarios, your first line of defence is your backups. The importance of a sound backup and recovery strategy cannot be stressed enough! Without a good backup of your systems (including not just your data but your configuration and system installs), the ability to recover from any other type of disaster is GREATLY reduced. A discussion on backup techniques is outside the scope of this blog entry but the main point to bear in mind is that if you're worried about either type of physical loss disaster, offsite backups are key.

To reduce the risk of hardware failures affecting your systems, consider deployment on enterprise-class hardware. This hardware tends to have built-in redundancy, e.g. redundant power supplies, disk controllers etc.

Another option is duplicate/parallel system deployments on additional hardware or even running your systems in a redundant virtualised environment (e.g. Amazon AWS, Iconz Versa etc). These strategies rely on having your system running on more physical hardware (or being able to be moved to other hardware quickly), thereby removing the risk of a single hardware point of failure (e.g. a power supply) bringing everything down. While this does introduce more cost (hardware, licensing, development and management effort), it can also introduce more flexibility and increase your system capacity.
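To put a rough number on what duplicate deployments buy you, here is a minimal sketch in Python. It assumes each deployment fails independently of the others, which is optimistic (an outage affecting the whole data centre would take out every copy at once), and the 0.99 availability figure is purely illustrative rather than taken from any vendor. The function names are made up for this example.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` parallel deployments, assuming independent failures."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


def expected_downtime_hours(availability: float) -> float:
    """Average hours of downtime per year implied by an availability figure."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for copies in (1, 2, 3):
        # 0.99 is an illustrative per-deployment availability, not a measured one.
        a = combined_availability(0.99, copies)
        hours = expected_downtime_hours(a)
        print(f"{copies} deployment(s): about {hours:.2f} hours down per year")
```

Each extra copy cuts the expected downtime sharply under this assumption, but the cost of the extra hardware, licensing and management rises with every copy as well.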

To reduce the risk of a physical loss type disaster, your systems must be deployed at more than one data centre. From a hardware perspective, this involves having your system deployed on duplicate hardware. The additional costs come in keeping the installations between the two data centres synchronised. There can also be additional overhead in maintaining a relationship with more than one hosting provider etc.

Finally, in order to reduce your business risk when a natural disaster strikes, your data centres need to be far enough apart geographically that a single event cannot affect both. If your data centres are spread across significant distances, e.g. more than one country, keeping them in sync becomes even trickier and potentially more expensive.

There are some critical questions to be considered before improving your disaster recovery plans. They are:

1) What is your business's appetite for risk?
2) What is the longest your business can survive with its IT infrastructure unavailable?
3) What is the perceived likelihood of each type of disaster?

and finally

4) In the event of a natural disaster (earthquake, tsunami etc), is your business able, and does it need, to keep operating?

In conclusion, all these risks can be engineered for, to reduce your business risk in the event of a disaster. The "million dollar" question is, does the perceived risk justify the cost?
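
One rough way to frame that question is to compare the expected yearly loss from each type of disaster with the yearly cost of mitigating it. The Python sketch below does this under some loud assumptions: every probability, outage length and dollar figure is invented for illustration, the loss is treated as a flat cost per hour of downtime, and the names (Scenario, expected_annual_loss, mitigation_justified) are hypothetical. Your own answers to the questions above would supply the real inputs.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    annual_probability: float  # perceived chance of this disaster in any one year
    outage_hours: float        # expected downtime if it does happen


def expected_annual_loss(scenarios, cost_per_hour: float) -> float:
    """Sum of probability x downtime x hourly cost over all scenarios."""
    return sum(
        s.annual_probability * s.outage_hours * cost_per_hour for s in scenarios
    )


def mitigation_justified(loss_before: float, loss_after: float,
                         annual_cost: float) -> bool:
    """True when the loss avoided each year exceeds what the mitigation costs."""
    return (loss_before - loss_after) > annual_cost


if __name__ == "__main__":
    # All figures below are made up purely to show the arithmetic.
    scenarios = [
        Scenario("data loss", 0.20, 8),
        Scenario("hardware loss", 0.10, 24),
        Scenario("data centre outage", 0.05, 48),
        Scenario("natural disaster", 0.01, 336),
    ]
    loss = expected_annual_loss(scenarios, cost_per_hour=1000.0)
    print(f"Expected annual loss: ${loss:,.0f}")
    print("Worth $5,000 a year to cut it to $2,000?",
          mitigation_justified(loss, 2000.0, 5000.0))
```

A calculation like this will not capture everything (reputation damage and your appetite for risk do not reduce neatly to an hourly rate), but it does make the trade-off between perceived risk and cost explicit.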