Disaster recovery is the process of getting services back online after a failure or outage has occurred, which may have been caused by anything from a natural disaster to a cyber attack. This can be a largely manual process that requires a high level of human orchestration and has long been the way most banks deal with major system outages. However, more recently there has been a significant shift towards focusing on resilience rather than recovery.
The resilience approach focuses on protecting core services and preventing issues before they occur, rather than dealing with them after the fact. This can involve identifying the risks and vulnerabilities associated with the services that support critical business processes and performing a detailed risk assessment of the impact of an outage. Measures can then be taken to mitigate these risks, such as removing single points of failure or adding the ability to scale services automatically, e.g. load-balancing servers.
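As a rough illustration of removing a single point of failure, the sketch below shows a round-robin load balancer that skips servers marked unhealthy, so losing one server does not take the whole service down. It is a minimal toy model, not production code: the backend names are hypothetical, and a real system would update health status via automated health checks rather than manual flags.

```python
import itertools

class LoadBalancer:
    """Round-robin load balancer that skips backends marked unhealthy,
    so a single failed server does not become an outage."""

    def __init__(self, backends):
        # Health status per backend; a real system would refresh this
        # via periodic automated health checks.
        self.health = {b: True for b in backends}
        self._cycle = itertools.cycle(backends)

    def mark_down(self, backend):
        self.health[backend] = False

    def mark_up(self, backend):
        self.health[backend] = True

    def next_backend(self):
        # Try each backend at most once per call, so we fail fast
        # instead of looping forever when everything is down.
        for _ in range(len(self.health)):
            candidate = next(self._cycle)
            if self.health[candidate]:
                return candidate
        raise RuntimeError("no healthy backends available")

# Hypothetical server pool: traffic flows around the failed server.
lb = LoadBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")
picks = [lb.next_backend() for _ in range(4)]
# picks alternates between the two remaining healthy servers.
```

The same idea generalises: any component whose failure would stop a critical business process is a candidate for this kind of redundancy.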
The shift in focus towards resilience rather than recovery has become increasingly important due to changes in customer demands and habits. The digital banking customer is ‘always on’ and expects uninterrupted access to their bank at any time, so they are more likely to notice and be significantly affected by an outage. The increased rate of change driven by these digital customer demands also means there is now arguably a greater risk of change-related outages than before. This is a critical focus of the DevOps movement: ensuring that automated test and release processes remain robust, particularly around regression. The threshold for acceptable levels of service is higher, and most banks are updating their strategies accordingly by focusing their efforts on resilience.
The provision of robust resilience and recovery tools and processes is underpinned by the business risk assessments that directly point to the acceptable levels of service that need to be maintained. Knowing the system, its weaknesses, and both the resilience and recovery requirements is essential for understanding what needs to be done to make the system truly resilient. This is also an opportunity to identify which processes are the most critical to the business and should take priority.
The resilience processes also need to be constantly reviewed so they can be improved and updated to increase the level of resilience over time. There may be new threats to the system, new weaknesses as technology changes, or increasing demands as new products and services are launched. Monitoring the system and collecting data can help to continuously assess possible weaknesses and gradually make the system as resilient as possible.
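One simple way to turn that monitoring data into a prioritised list of weaknesses is to aggregate raw health-check results into per-service availability and flag anything below an agreed threshold. The sketch below assumes a hypothetical stream of `(service, check_passed)` observations and a 99.9% target; the service names and figures are illustrative only.

```python
from collections import defaultdict

def weakest_services(check_results, threshold=0.999):
    """Compute per-service availability from raw health-check results
    and flag services falling below the agreed threshold."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for service, ok in check_results:
        totals[service] += 1
        if ok:
            successes[service] += 1
    availability = {s: successes[s] / totals[s] for s in totals}
    flagged = sorted(s for s, a in availability.items() if a < threshold)
    return availability, flagged

# Hypothetical sample: 2 failed checks out of 100 for "payments".
results = ([("payments", True)] * 98 + [("payments", False)] * 2
           + [("statements", True)] * 100)
availability, flagged = weakest_services(results)
# "payments" (98% available) is flagged; "statements" (100%) is not.
```

Reviewing such a flagged list regularly is one concrete form the continuous assessment described above can take.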
Resilience is the main focus for most organisations at the moment as it presents the most desirable option — avoiding outages entirely rather than having to invoke recovery processes to fix them in the event of a major incident. However, while prevention is better than cure, no system can be 100% resilient, so there will always be disaster recovery events that require high levels of human orchestration.
When these disaster recovery events have to be invoked, Cutover can be used to test specific pre-prepared service recovery plans for disasters such as a data centre going down, so that when a disaster like this does occur the recovery can run as efficiently as a planned event. The tool can also be used to store more general disaster recovery plans that can be invoked in the event of a real disaster and updated to fit the specific scenario. Cutover facilitates the human orchestration involved in these events and provides real-time status visualisation.
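To make the idea of a pre-prepared, rehearsable recovery plan concrete, the sketch below models one as an ordered list of owned tasks whose completion can be tracked step by step. This is an illustrative data structure only — it is not Cutover's actual data model or API, and the scenario, task names, and teams are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    owner: str          # the team orchestrating this step
    done: bool = False

@dataclass
class RecoveryPlan:
    """Illustrative pre-prepared recovery plan: an ordered list of tasks
    that can be rehearsed in tests, then invoked in a real incident."""
    scenario: str
    tasks: list = field(default_factory=list)

    def next_task(self):
        # The first outstanding step gives a real-time view of progress.
        return next((t for t in self.tasks if not t.done), None)

    def complete(self, name):
        for t in self.tasks:
            if t.name == name:
                t.done = True
                return
        raise KeyError(name)

plan = RecoveryPlan("data centre outage", [
    Task("fail over DNS to secondary site", "network team"),
    Task("promote standby database", "DBA team"),
    Task("verify core payment flows", "operations team"),
])
plan.complete("fail over DNS to secondary site")
# next_task() now points at the database promotion step.
```

Because the plan is data rather than tribal knowledge, the same structure can be exercised in rehearsals and then adapted to fit the specific scenario on the day.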
While focusing on resilience is the smart thing for banks to do at the moment, having a good backup recovery process is still essential for protecting the business and its customers.