The problem: Uncoordinated and manual failover procedures
A financial services company had to perform data center disaster recovery failover procedures every six months involving around 50 applications that supported their loans and fees services. They ran simulated failovers to test these procedures around once a week, but not always consistently. The highly manual failover simulation process involved over 14 teams that needed to be coordinated for successful failover recovery.
Challenges with manual failover processes
The process and the execution tools were separate: information such as the list of steps required to carry out the failover testing was held in Confluence, while the steps themselves were completed using various software tools such as Jenkins and Rundeck. Preparing for the test meant switching between the source of information and the tools to check that everything was planned correctly.
Uncoordinated manual processes
The Application Owner then had to set up a bridge call with the Application Team and spend three to four hours running the plan, during which time everyone involved had to be on the call and couldn’t work on anything else.
Post-event reporting was also highly manual and therefore time-consuming, taking a further three to four hours after the test was complete and carrying a high risk of human error.
The solution: Automated runbooks and comprehensive dashboards for data center failovers
Now, Cutover acts as the central planning and execution hub for simulated and real data center failovers. Cutover hosts the process and connects to the technologies that are used during the event.
Cutover automated runbooks: Replacing manual failover procedures
The Application Team built their failover recovery plans in Cutover's automated runbooks to replace the manual failover procedures. They can now kick off an automated runbook and, rather than spending the entire duration of the test on a bridge call, the people involved can work on other things and are notified when they need to log in and complete their tasks.
Thanks to integrations with Jenkins and Rundeck, all the information in those systems, such as code and scripts, is available in the Cutover runbook, so it can be accessed without leaving the platform and is still mastered in one place. They have also integrated their Cutover account with Microsoft Teams to further improve communication and visibility across the organization during a failover recovery.
Comprehensive dashboards for easy failover management and monitoring
Cutover’s dashboards provided huge value for measuring recovery time objectives (RTOs) against recovery time actuals (RTAs). To do this, the Application Team built a stream containing all the tasks they wanted to measure and the dashboard of the stream summary provided a forecast duration which they used to make a plan that fit their RTOs. At the end of the failover simulation, they compared this to the actual duration of the runbook, which is the RTA, to see if they were able to meet their desired RTO in practice.
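The RTO-versus-RTA comparison described above comes down to simple arithmetic, which can be sketched in a few lines of Python. This is an illustrative sketch only, not Cutover's API or the team's actual tooling: the Task class, the task names, the durations, and the 60-minute RTO are all hypothetical, and it assumes the tasks in the stream run one after another so that their durations simply add up.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Iterable, Optional


@dataclass
class Task:
    """One measured task in the stream (hypothetical structure)."""
    name: str
    forecast: timedelta                 # planned duration, known before the test
    actual: Optional[timedelta] = None  # recorded duration, known after the test


def total(durations: Iterable[timedelta]) -> timedelta:
    """Add up durations, assuming the tasks run sequentially."""
    return sum(durations, timedelta())


def forecast_duration(tasks: list[Task]) -> timedelta:
    """Planned duration of the stream, used to check the plan against the RTO."""
    return total(t.forecast for t in tasks)


def recovery_time_actual(tasks: list[Task]) -> timedelta:
    """Actual duration (RTA), counting only tasks that recorded an actual time."""
    return total(t.actual for t in tasks if t.actual is not None)


def meets_rto(duration: timedelta, rto: timedelta) -> bool:
    """True when the given duration fits within the recovery time objective."""
    return duration <= rto


# Hypothetical example values, not taken from the case study.
rto = timedelta(minutes=60)
tasks = [
    Task("Stop primary database", timedelta(minutes=10), timedelta(minutes=12)),
    Task("Promote standby", timedelta(minutes=20), timedelta(minutes=18)),
    Task("Validate application", timedelta(minutes=15), timedelta(minutes=25)),
]

planned = forecast_duration(tasks)   # checked before the test
rta = recovery_time_actual(tasks)    # checked after the test

print(f"Forecast {planned} fits RTO: {meets_rto(planned, rto)}")
print(f"RTA {rta} fits RTO: {meets_rto(rta, rto)}")
```

In a runbook with parallel tasks, the forecast would follow the longest dependency path rather than a plain sum, so the sequential assumption here is a simplification.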
Post-event, the Application Team has to record the RTA and the preparation and validation steps that were carried out in Fusion. The Application Owner can now download the task list from Cutover and import the files straight into Fusion, reducing post-event reporting from a three-to-four-hour process to five to ten minutes.
Enhanced coordination and automation
“The ability to leave comments in a centralized place and access performance metrics post-event has changed the game for our post-mortem activities and preparations for the next event.” - Application Owner
Centralizing all communications, information, and data in one place has been hugely impactful for the Application Team and they are now running tests once a week or more.
The outcome: Accelerated recovery and improved efficiency
Substantial reduction in failover recovery time
Many of the activities for the failovers were run by scripts with some manual intervention. Cutover now enables the Application Team to integrate with those scripts and mature them to the point where manual intervention is no longer needed, reducing the time taken to complete those tasks.
Operational efficiency and error reduction
The Application Team estimated that Cutover is ten times better than their previous manual way of working. The dashboards and post-event review capabilities have saved them around three hours per event of post-execution evidence gathering.
Positive impact on failover procedures
They now have a version-controlled source of truth for the failover runbook and the ability to review and approve the entire process within Cutover.
Let Cutover runbooks help you with your own failover strategy
Find out more about Cutover for IT disaster recovery or book a tailored demo to see the platform in action and find out how we can help you with your specific failover needs.