Your data is critical – do you have the right strategy in place for resilience?

Companies today rely on data to run their operations and to support their customers.

This reliance drives heavy spending on analytics technologies that aim to harness all this data. According to KPMG and Ipsos, 77 percent of retailers have put more emphasis on using their data to understand their customers’ behaviour after the pandemic, while PwC found that 38 percent of manufacturers plan to roll out analytics to improve their profitability amid rising energy costs. These investments are all part of a wider trend of digital transformation, which IDC estimates will be worth $3.4 trillion by 2026.

As all this data becomes more valuable, it also needs to be protected. The systems that create, manage and use data have to be resilient against failures or the value they provide will disappear. According to the Business Continuity Institute, more than three-quarters (76.6 percent) of organisations have an operational resilience programme, but many also find it difficult to get the support they need – just over half (52.4 percent) found that embedding operational resilience into the fabric of the organisation was a major issue, while having the required staff and resources to implement policy was also a problem for 50.9 percent of respondents.

To solve these problems, companies have to understand the principles behind business continuity. By applying them, organisations can improve their resiliency and reduce downtime.

Business continuity, high availability and disaster recovery

Several terms are used in this area, and the nuances between them matter when picking the right approach. Business continuity is the overall approach and refers to planning and processes that deliver more resiliency for the organisation as a whole. Alongside the technology, it covers the people and processes that should be in place to keep things running smoothly even when there is a problem.

Underneath this, there are the technical steps that you can take to harden your systems and protect your data. High availability (HA) covers any steps taken to prevent problems from affecting an application or its components. This approach should ensure that there are no single points of failure that can lead to an application failing or losing data. Related to this is disaster recovery (DR), which covers all the processes and technologies used to get things back to normal after a disaster occurs.

Traditionally, HA and DR implementations were used by large enterprises that had millions of customers to serve. Critical national infrastructure services and key companies in areas like banking and financial services have to invest in these areas due to regulations. However, as more companies have moved their businesses online and serve ever larger volumes of customers, they too face the challenges of supporting applications that have to be available around the clock and that have to work through any issues.

Today’s IT services are more resilient than in the past. Options like cloud computing operate at a huge scale and provide service level agreements for availability – for example, AWS aims for 99.95 to 99.99 percent uptime across its cloud and infrastructure services in its agreements. In practice, a 99.99 percent monthly commitment leaves room for only roughly four minutes 21 seconds of downtime in a given month before service credits apply. Similarly, hard drives used within servers are more reliable now than they were in the past – cloud backup provider Backblaze found that it saw an annualised failure rate (AFR) of 1.54 percent in Q1 2023 compared to 5.74 percent AFR in 2015. AFR provides insight into the percentage of devices that fail within a given time period, and can be influenced by how old the devices were at the beginning of any measurement, but looked at over time, it shows how reliability has gone up in general.
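To make these figures concrete, here is a minimal Python sketch of the arithmetic behind an availability target and an annualised failure rate. The 30-day month, the function names and the sample fleet are illustrative assumptions rather than anything taken from AWS or Backblaze; the exact downtime allowance shifts by a few seconds depending on the month length used, which is why published figures such as four minutes 21 seconds vary slightly.

```python
# Rough arithmetic behind availability targets and annualised failure rates.
# The 30-day month and the sample fleet below are illustrative assumptions.


def downtime_budget_seconds(availability_percent: float, days: float = 30.0) -> float:
    """Seconds of downtime allowed in a period at a given availability target."""
    period_seconds = days * 24 * 60 * 60
    return period_seconds * (100.0 - availability_percent) / 100.0


def annualised_failure_rate(failures: int, drive_days: float) -> float:
    """AFR as a percentage: failures per drive-year of operation."""
    drive_years = drive_days / 365.0
    return failures / drive_years * 100.0


def format_duration(seconds: float) -> str:
    """Render a number of seconds as minutes and seconds."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs:02d}s"


if __name__ == "__main__":
    for target in (99.95, 99.99):
        print(f"{target}% ->", format_duration(downtime_budget_seconds(target)))
    # Hypothetical fleet: 1,000 drives running for 90 days with 4 failures.
    print(f"AFR: {annualised_failure_rate(4, 1000 * 90):.2f}%")
```

The gap between the two targets is the point to notice: moving from 99.95 to 99.99 percent cuts the monthly allowance from over twenty minutes to under five.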

What does this mean in practice? Overall, reliability for technology has improved. However, we can’t assume that this will provide the level of reliability that we need for our critical applications.

Preparing your approach around data

For data, HA and DR implementations should provide more resiliency during normal operations, and provide the tools to recover if disaster strikes. At the same time, roles like Data Scientists, Site Reliability Engineers, DevOps and Software Developers are taking on elements of data management. This means that responsibility for databases is spread out, rather than being the preserve of specific database administrators (DBAs). Many of the essential tasks that a DBA would be responsible for, like HA and DR planning, can therefore be overlooked or assumed to be in place when they are not.

For example, implementing databases is now easier than ever. Developers can use open source databases and automate their implementations as part of wider software deployment pipelines. However, ease of deployment does not cover areas like performance or HA/DR support. Implementing HA functionality like clustering helps databases operate through a failure. Running a database cluster with multiple nodes can help the application continue running when a node fails, but it can also complicate recovery and the return to normal operations.
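As a rough illustration of why extra nodes help, the sketch below estimates the chance that at least one node in a cluster is up. It assumes node failures are independent and ignores failover time and quorum requirements, so it is an optimistic simplification rather than a model of any particular database; the function name and the 99 percent figure are hypothetical.

```python
# Simplified model: the cluster counts as "up" if at least one node is up,
# and nodes are assumed to fail independently of each other.


def cluster_availability(node_availability: float, nodes: int) -> float:
    """Probability that at least one of `nodes` independent nodes is up."""
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # All nodes must be down at once for the cluster to be unavailable.
    return 1.0 - (1.0 - node_availability) ** nodes


if __name__ == "__main__":
    for count in (1, 2, 3):
        print(count, f"{cluster_availability(0.99, count):.6f}")
```

In real deployments, shared causes such as a network fault or a bad configuration change can take out several nodes together, so the practical gain is smaller than this calculation suggests.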

This is an area where you need to think through and understand your DR processes. Recovering multi-master databases requires specialist skills and understanding to prevent problems around concurrency. In effect, this means having one agreed list of transactions rather than multiple conflicting lists that might contradict each other. Similarly, you have to ensure that any recovery brings back the right data, rather than any corrupted records. Planning ahead for this process makes it much easier, but it also requires skills and experience to ensure that DR processes will work effectively. Alongside this, any DR plan has to be tested to prove that it works, and works consistently when it is most needed.
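The idea of conflicting transaction lists can be sketched in a few lines of Python. This is only an illustration of the problem, not how any specific database resolves conflicts: the `Write` record, the log format and the key names are all hypothetical, and real multi-master systems rely on their own replication and conflict-handling mechanisms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    """One hypothetical committed write taken from a node's transaction list."""
    txn_id: str
    key: str
    value: str


def final_values(log):
    """Replay a node's writes in order; later writes overwrite earlier ones."""
    state = {}
    for write in log:
        state[write.key] = write.value
    return state


def find_conflicts(log_a, log_b):
    """Return the keys that both nodes wrote but ended with different values."""
    state_a, state_b = final_values(log_a), final_values(log_b)
    shared_keys = state_a.keys() & state_b.keys()
    return sorted(key for key in shared_keys if state_a[key] != state_b[key])


if __name__ == "__main__":
    node_a = [Write("a1", "balance:42", "100"), Write("a2", "email:7", "x@example.com")]
    node_b = [Write("b1", "balance:42", "80"), Write("b2", "email:7", "x@example.com")]
    print(find_conflicts(node_a, node_b))
```

Every key this check returns is a record where someone has to decide which node holds the truth, which is why agreeing those rules before a disaster matters.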

Any plan around data has to take three areas into account – availability, restoration and cost. Availability planning covers how much work the organisation is willing to do to keep services up and running, while restoration covers how quickly services have to be restored and how much data has to be recovered in the event of a disaster. Lastly, cost covers the amount of budget available to cover these two areas, and how much has to be spent in order to meet those requirements.
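One way to keep these three areas in view is to write the targets down and compare them with what is currently in place. The Python sketch below does this under stated assumptions: the class and field names are hypothetical, the numbers are made up, and it assumes that in the worst case everything written since the last backup is lost, which would not hold where continuous log archiving is in use.

```python
from dataclasses import dataclass


@dataclass
class Targets:
    availability_percent: float   # uptime the business expects
    max_data_loss_minutes: float  # how much recent data may be lost
    max_restore_minutes: float    # how long a restore may take
    annual_budget: float          # spend available for HA and DR


@dataclass
class CurrentSetup:
    measured_availability_percent: float
    backup_interval_minutes: float
    tested_restore_minutes: float  # taken from an actual restore test
    annual_cost: float


def find_gaps(targets: Targets, setup: CurrentSetup) -> list:
    """List the planning areas where the current setup misses its target."""
    gaps = []
    if setup.measured_availability_percent < targets.availability_percent:
        gaps.append("availability")
    # Assumption: worst case, everything since the last backup is lost.
    if setup.backup_interval_minutes > targets.max_data_loss_minutes:
        gaps.append("restoration: data loss")
    if setup.tested_restore_minutes > targets.max_restore_minutes:
        gaps.append("restoration: recovery time")
    if setup.annual_cost > targets.annual_budget:
        gaps.append("cost")
    return gaps


if __name__ == "__main__":
    targets = Targets(99.95, 15, 60, 50_000)
    setup = CurrentSetup(99.97, 60, 45, 40_000)
    print(find_gaps(targets, setup))
```

In this made-up case the setup is within budget and meets its uptime and restore-time targets, yet hourly backups cannot meet a 15-minute data-loss target, so closing that gap is where the next spend would go.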

To justify your investments, look at the value of the business that you are processing and the impact that any downtime would have. For companies that rely on data flowing through their pipelines to create value for customers, those systems have to be up and running to make money. Protecting your core asset – your data – therefore involves looking at how to protect and run your core systems around that asset. For databases, this will include clustering for availability and testing for recovery. Most importantly, it involves helping colleagues to understand that the value of data can only be realised with preparation and planning to keep things running smoothly.

About the Author

Charly Batista is PostgreSQL Tech Lead at Percona. He has more than twelve years of experience in various areas of IT including Database Administration, Data Analysis, Systems Analysis and Software Development, based on strong analytical skills combined with experience in object-oriented programming techniques. Charly has held the role of Technical Leader for more than four years for the Developer Team. Prior to Percona, Charly held technical and software development roles for companies in China and Brazil.

Percona is widely recognized as a world-class open source database software, support, and services company. The organization is dedicated to helping businesses make databases and applications run better through a unique combination of expertise and open source software. Percona works with numerous global brands across many industries creating a unified experience to monitor, manage, secure, and optimize database environments on any infrastructure.
