A Worrying Situation
On the 14th of January, the offices of the Brisbane branch of the Milton Financial Group (MFG) were flooded, destroying furniture and telecommunications equipment and causing structural damage to the office itself. Luckily, MFG's 40+ employees work primarily remotely on notebook computers, and the operations manager had the foresight to remove the email and data server (a three-year-old IBM x3500 tower) from the office when flood warnings were issued, so damage to computer equipment was minimal. MFG was very fortunate that the server had been removed from the site: had it been destroyed, recovery would likely have taken weeks, if not months, as Australian hardware stocks were quickly drained while IT suppliers scrambled to supply replacement hardware to affected businesses.
The implications of this were clear. MFG could not rely on the operations manager being in the right place at the right time if a disaster were to occur again. Whilst backups were completed and maintained nightly, in the event of a fire or a theft the only data MFG would have access to would be on the offsite removable hard drive. The CEO of MFG approached CCSiT with these concerns, requesting a revisit of a Disaster Recovery Plan (DRP) proposal we had submitted some 8 months earlier. Our engineers immediately began formulating an updated DRP for MFG.
“Knowing without a doubt that no matter what disaster befalls our office will only cause a minor inconvenience to our operational capacity is worth the investment one hundred fold. Being informed that a disaster has occurred at our office, and being informed in the same call that the operation is already back up and running with no loss of data is undoubtedly the most enjoyable disaster information I have ever received.”
After consultation with MFG management, it was decided that the following conditions for the DRP must be met:
- The period of downtime following a loss of data must be as short as possible. MFG provides services 24/7, so any downtime becomes costly very quickly. The maximum allowable downtime was determined to be 4 hours.
- Regular backups need to be maintained and verified at least once daily. It was deemed acceptable for these backups to run overnight.
- Historical backups need to be maintained monthly, for up to 6 months.
- All backup data is to be housed offsite at a suitably tiered DataCentre. It is expected that MFG staff will not be relied upon to maintain or verify backup data.
- As replacement server hardware may not be available in a timely fashion, virtualisation technologies will be utilised to provide temporary solutions in the event of a total disaster.
As virtually all backup software packages can run backups on flexible schedules, daily and monthly historical backups could be configured with ease, satisfying the scheduling requirements of Recovery Conditions (RC) 2 and 3, the second and third conditions listed above.
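The retention side of RC2 and RC3 can be illustrated with a short sketch. This is not how Veeam implements retention. It is a hypothetical helper, `backups_to_keep`, that assumes one backup per night, keeps the seven most recent nightly backups and six monthly copies (the counts used in the storage sizing for this DRP), and assumes that the first backup of each calendar month serves as that month's historical copy.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, daily_count=7, monthly_count=6):
    """Return the set of backup dates to retain (hypothetical policy sketch)."""
    dates = sorted(set(backup_dates))

    # Most recent nightly backups.
    dailies = dates[-daily_count:] if daily_count else []

    # Assumption: the first backup taken in each calendar month
    # is treated as that month's historical copy.
    first_of_month = {}
    for d in dates:
        first_of_month.setdefault((d.year, d.month), d)
    monthly_candidates = sorted(first_of_month.values())
    monthlies = monthly_candidates[-monthly_count:] if monthly_count else []

    return set(dailies) | set(monthlies)


# Illustrative data only: one backup per night over an assumed period.
start = date(2010, 6, 1)
end = date(2011, 2, 22)
nightly = [start + timedelta(days=i) for i in range((end - start).days + 1)]

kept = backups_to_keep(nightly)
print(len(kept))  # 13 restore points: 7 daily + 6 monthly
```

Anything not returned by the helper would be eligible for deletion, which keeps the number of stored restore points constant over time.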
The Veeam software suite for backup and recovery was determined to be the most appropriate solution for the requirements of MFG and was configured for use on the Primary Server located in the MFG Brisbane office.
Appropriate storage still needed to be sourced, however. This was addressed together with RC4, which requires offsite backups to be housed in a datacentre: a Qnap Network Attached Storage (NAS) device was installed in a datacentre and linked via a secure connection directly to the MFG Brisbane office.
To address RC5, the Primary MFG Server was migrated to a virtualised platform on the same physical hardware, allowing Veeam to provide recovery options within a very short timeframe (typically around 1-2 hours). As this timeframe falls within the maximum allowable downtime of 4 hours, Veeam also satisfies the requirements of RC1.
With all RCs satisfied, the overall network configuration would look like this:
The required storage capacity of the Qnap NAS was determined to be 7.8TB (or 7800GB), comprising 6 monthly historical backups plus 7 daily backups, with each backup requiring 600GB of disk space. To allow for significant growth, the Qnap NAS was installed with a storage capacity of 12TB (or 12000GB).
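The sizing arithmetic can be reproduced in a few lines. The sketch below uses only the figures quoted above (600GB per backup, 6 monthly plus 7 daily backups, 12TB installed) and the same decimal convention of 1TB = 1000GB. The function and variable names are illustrative and are not part of any Veeam or Qnap tooling.

```python
BACKUP_SIZE_GB = 600            # size of each backup, from the sizing above
MONTHLY_BACKUPS = 6             # historical backups retained
DAILY_BACKUPS = 7               # nightly backups retained
INSTALLED_CAPACITY_GB = 12_000  # 12TB NAS, using 1TB = 1000GB
GB_PER_TB = 1000


def required_capacity_gb(monthly, daily, size_gb):
    """Total space needed if every retained backup is a full-size copy."""
    return (monthly + daily) * size_gb


required_gb = required_capacity_gb(MONTHLY_BACKUPS, DAILY_BACKUPS, BACKUP_SIZE_GB)
headroom_gb = INSTALLED_CAPACITY_GB - required_gb

# How many more backups of the current size would still fit.
extra_backups = headroom_gb // BACKUP_SIZE_GB

# How large each backup could grow before 13 of them fill the NAS.
max_backup_size_gb = INSTALLED_CAPACITY_GB / (MONTHLY_BACKUPS + DAILY_BACKUPS)

print(f"Required: {required_gb / GB_PER_TB:.1f}TB")   # 7.8TB
print(f"Headroom: {headroom_gb / GB_PER_TB:.1f}TB")   # 4.2TB
print(f"Extra backups that fit: {extra_backups}")
print(f"Max size per backup: {max_backup_size_gb:.0f}GB")
```

On these figures the installed capacity leaves 4.2TB spare, so each backup could grow from 600GB to roughly 920GB before the retention set no longer fits.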
In the event of file corruption or accidental file deletion, the Veeam software retrieves the data for the specified backup date directly from the Qnap NAS installed in the DataCentre. It then restores the recovered data onto the Primary MFG Server for use by staff on the Internal MFG Network. In the event of a total system failure, a virtual instance of the server can be made available for direct access by staff.
This backup and recovery system, as per the negotiated DRP, went live on 22nd February 2011.
In The Nick Of Time
Only a few months after the new DRP was implemented, disaster struck once again. On Saturday, 14th May 2011, severe thunderstorms hit the city of Brisbane and its surrounding suburbs. The MFG Brisbane building was struck by lightning at 1:30pm, sending an electrical surge throughout the entire complex. Once again, staff notebooks were located offsite and were unaffected. The Primary MFG Server, however, suffered a critical hit and its componentry was irreparably damaged. The server was connected to an Uninterruptible Power Supply (UPS), but while these devices provide protection against 99% of power-related threats, they simply cannot protect connected devices from direct lightning strikes.
The CCSiT After-Hours Emergency Service was contacted and an engineer was immediately dispatched. The engineer arrived onsite at 2:15pm and found that, in addition to the server, the surge had also destroyed the onsite backup devices (removable hard drives), the ADSL2+ modem and the network switch. Fortunately, the physical phone lines were found to be unaffected.
The engineer remotely accessed the DataCentre via a notebook computer and initiated the first stage of the Veeam recovery protocols, as defined in the MFG DRP.
A virtual instance of the MFG Primary Server was booted and went live. The engineer replaced the onsite ADSL2+ modem to allow connection to the remote virtual server instance, and the network switch to allow connections internally for staff to use. The overall network configuration was altered to look like this:
It was confirmed that emails and data were now available to all remote staff members, as well as to users on the Internal MFG Network. This phase of the DRP protocol was completed at 3:15pm, less than 2 hours after the initial incident was reported. Whilst the performance of the temporary virtual server instance in the DataCentre was relatively slow (when compared to a physical onsite server), it did allow access to all emails and data within a very short timeframe, ensuring that the disruption to MFG day-to-day operations was minimal.
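The recovery timeline can be checked against the 4 hour maximum allowable downtime with a short calculation. The sketch below uses the times quoted in this account. Because the exact time the incident was reported is not stated, it assumes the clock starts at the lightning strike (1:30pm), which gives a conservative upper bound on downtime. All names are illustrative.

```python
from datetime import datetime, timedelta

MAX_ALLOWABLE_DOWNTIME = timedelta(hours=4)  # RC1 limit from the DRP

# Timeline on Saturday, 14th May 2011, as described in the account.
lightning_strike = datetime(2011, 5, 14, 13, 30)   # 1:30pm
engineer_onsite = datetime(2011, 5, 14, 14, 15)    # 2:15pm
service_restored = datetime(2011, 5, 14, 15, 15)   # 3:15pm

# Assumption: downtime is measured from the strike itself.
response_time = engineer_onsite - lightning_strike
recovery_time = service_restored - engineer_onsite
total_downtime = service_restored - lightning_strike

within_limit = total_downtime <= MAX_ALLOWABLE_DOWNTIME
margin = MAX_ALLOWABLE_DOWNTIME - total_downtime

print(f"Response time:  {response_time}")
print(f"Recovery time:  {recovery_time}")
print(f"Total downtime: {total_downtime}")
print(f"Within 4 hour limit: {within_limit} (margin {margin})")
```

Measured this way, total downtime was 1 hour 45 minutes, leaving more than 2 hours of margin against the limit agreed in the DRP.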
Now that MFG could operate as per normal (albeit at slower data transfer speeds), the final phase of the DRP was initiated. Replacement IBM hardware was procured by CCSiT and installed less than two weeks later. A recent backup of data from the DataCentre was retrieved and restored to the new Primary MFG Server. For all intents and purposes, the MFG network was now restored to 100% operational capability. Backups using the Veeam backup and recovery system were re-initiated as per the original configuration.
Having invested the time and budget to prepare for a disaster – prior to actually experiencing a disaster – allowed MFG to enjoy the following benefits:
- Operations were offline for a total of less than 2 hours after a complete disaster struck,
- The disaster became a step by step process rather than a frenzied panic,
- ‘Suitable’ rather than ‘Available’ replacement hardware was purchased,
- There was no reliance on having to contact one key person to make decisions, and
- Not a single item of data was lost.
The investment made by MFG in an appropriate DRP was recouped in a single use of the recovery features. According to the CEO, that is not the best part.
In conclusion, whilst investing in a suitable DRP may seem to be an exorbitant ‘insurance’ against an unlikely disaster, the cost of not investing in, and regularly testing and maintaining, a DRP could be catastrophic. Any corporate entity that relies on electronic data storage is strongly advised to invest in a suitable DRP without further delay.