Should there be a ‘gold standard’ for disaster recovery testing?

Absolute reliability in your systems and in your data recovery is non-negotiable in today’s business environment where cyberattacks, human error and systems failure are all too common.

Any doubt about recoverability is an open door for problems, and it needs to be identified and addressed before disaster strikes.

Putting in place a well-structured recovery environment, one that allows data recovery testing to be conducted with the least disruption to the business, is absolutely critical. Unstructured testing of disaster recovery plans, without the ability to perform a full failover during the test, could conceal severe weaknesses that only surface in a genuine disaster scenario.

Unfortunately, testing is often pushed down the list as organisations struggle to stay on top of other priorities. In our recent study, we found that one in five IT professionals in the UK admit to testing their data and disaster recovery systems once a year or less. Just 5% test monthly.

Non-disruptive

The challenge is that many technologies deployed today to recover a company’s systems and data after a disaster do not allow for non-disruptive testing. This means that while testing can be carried out, it can never be thorough enough without significant disruption to the business and, as a result, delivers a compromised test.

The introduction of uncertainty about recoverability then places the commercial viability of a business in jeopardy if a significant data disaster does occur.

Every day that passes after a test has been conducted, there’s the possibility of a corruption, error or something more malicious being introduced to the systems and data, which could go completely unidentified by the team responsible. If the tests are not thorough and frequent, this risk increases significantly.

We advocate for a ‘gold standard’ for disaster recovery testing – twice-yearly, non-invasive full failover tests supported by monthly system boot tests and data integrity checks. In addition to rigorous data validation, testing the ability of workloads (applications and data) to fail over needs to be designed into the recovery plan. This should also allow for network and connectivity testing, a critical and often overlooked component in the testing process.
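For teams that record their tests, the cadence described above can be checked automatically. The Python sketch below is illustrative only and rests on assumptions that are not part of the standard itself: the record type, the three test labels and the function names are hypothetical, "twice-yearly" is read as at least two full failover tests in the trailing 12 calendar months, and "monthly" is read as at least one boot test and one integrity check in each of those months.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical labels for the three kinds of test named in the article.
FULL_FAILOVER = "full_failover"
BOOT_TEST = "boot_test"
INTEGRITY_CHECK = "integrity_check"


@dataclass(frozen=True)
class RecoveryTest:
    """One completed test, as it might appear in a test log (assumed shape)."""
    kind: str
    run_on: date


def months_back(today, count):
    """Return (year, month) pairs for the `count` calendar months ending with today's month."""
    pairs = []
    year, month = today.year, today.month
    for _ in range(count):
        pairs.append((year, month))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return pairs


def gold_standard_gaps(records, today):
    """List every way the log falls short of the assumed cadence; an empty list means no gaps."""
    window = months_back(today, 12)
    in_window = [
        r for r in records
        if r.run_on <= today and (r.run_on.year, r.run_on.month) in window
    ]
    gaps = []

    # Assumption: twice-yearly means at least two in the trailing 12 months.
    failovers = sum(1 for r in in_window if r.kind == FULL_FAILOVER)
    if failovers < 2:
        gaps.append(f"only {failovers} full failover test(s) in the last 12 months, expected 2")

    # Assumption: monthly means at least one of each kind per calendar month.
    for kind in (BOOT_TEST, INTEGRITY_CHECK):
        covered = {(r.run_on.year, r.run_on.month) for r in in_window if r.kind == kind}
        for year, month in window:
            if (year, month) not in covered:
                gaps.append(f"no {kind} recorded for {year}-{month:02d}")
    return gaps
```

A check like this says nothing about how thorough each test was. It only makes missed tests visible, which is useful when the results need to be recorded and reported.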

Different skillsets

Often, technical teams in an organisation tasked with maintaining the IT infrastructure for business-as-usual services have skillsets aligned to the daily demands of the business. The skills and experience required to bring IT and the business back online after a disaster differ from the day-to-day, so it’s not uncommon for IT staff to be unaccustomed to the demands suddenly placed on them in an extremely stressful situation.

This sentiment is supported by our survey of CIOs, CTOs, and IT managers, where close to 40% described a lack of technical skills as a major concern.

When the stakes are high, and the disaster is an aggressive ransomware attack, a very different approach is required, one that demands experience, confidence and agility to adapt as the problem unfolds. The attack may have compromised production data and, increasingly often, backup data as well (on average, 94% of respondents to a recent industry survey indicated that their backups were also attacked and 57% of those backup compromises were successful).

Readiness to tackle a multi-faceted attack of this nature, and successfully bring systems back online, can only be achieved if teams are fully conversant with the recovery systems deployed and confident in their ability to recover their systems and data. Once again, our survey indicated that more than half of respondents were not confident in the recovery systems deployed.

Experience and confidence come from testing frequently and thoroughly, so that surprises and weaknesses do not surface when they are least expected.

Stress-testing

Amidst an unfolding disaster, very few businesses get a second chance for recovery when there are serious flaws in the technology and the planning and orchestration for recovery have not been fully stress-tested. Setting a standard that can be recorded and reported ensures that technically the recovery can be delivered when needed. From a senior reporting perspective, it also reassures stakeholders that the business is fully protected.

This gold standard in testing is not always achievable with the technology deployed within a business. But what it does is set a metric which, when accomplished, puts the business in a much better state of readiness to recover from a ransomware or other cyberattack, or indeed any other disaster.

The data recovery technology is available today, with many solutions allowing non-disruptive and frequent testing. Whether the obstacle is technology that prevents non-disruptive testing, the available budget or a recovery plan that does not factor in this crucial phase of the process, readiness for absolute recovery is a choice.


About the Author

Stephen Young is Executive Director at Assurestor. Stephen is a seasoned business owner and entrepreneur, and innovation in technology has been central to his career for over 30 years. Across varying facets of IT, his experience covers infrastructure, software development, data centres, service and support, and IT governance, combined with management, finance and business development. With roots in software development and service and support, Stephen’s commitment to detail, thoroughness and uncompromising customer support has been a continuous thread through his businesses and has been a major factor in their success.

Featured image: Adobe
