The problem: A fragile, hero-based recovery model
An American investment firm lacked a formal, dedicated IT Disaster Recovery team. Instead, the entire DR function rested on the shoulders of two software engineers, who supported various application teams in building recovery templates.
Before adopting Cutover, the firm’s recovery process was defined by:
- Manual orchestration: Relying on massive Excel spreadsheets that were difficult to track in real time.
- Long recovery times: Planned DR tests typically took over 3 hours, with unplanned outages expected to take significantly longer due to the "chaos factor."
- Hero dependency: Recovery required constant "hand-holding" from the two core engineers to ensure steps weren't missed.
While they had built recovery templates for their IT DR testing, the true test came on a Friday at 8:00 PM, when a critical service supporting 70+ applications failed. This was the first time they had used Cutover for an unscripted outage.
The solution: Automated, executable runbooks
When monitoring tools flagged the outage, the application team didn't panic or get the DR specialists on the phone. Instead, they relied on the work they had already invested: a highly automated Cutover Recover runbook.
Key technical integrations
The firm had moved beyond simple checklists by integrating Cutover with their technical stack:
- Jenkins and Ansible: Used to trigger automated scripts for database failover, storage migration, and application down-scaling.
- Automated workflows: The team simply used the pre-approved template for recovering the affected applications, verified the pre-recorded process, and pressed play.
- Human-in-the-loop monitoring: While scripts handled the heavy lifting, the app owners acted as orchestrators, monitoring the Cutover dashboard to ensure each automated step validated successfully.
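The pattern described above, automated steps that each have to validate before the next one starts, can be sketched in a few lines of Python. This is an illustration only and not Cutover's API or the firm's actual runbook: the `Step`, `RunResult`, and `run_runbook` names are hypothetical, and the in-memory `state` dictionary stands in for the real Jenkins and Ansible jobs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """One runbook task: an automated action plus a check that it worked."""
    name: str
    action: Callable[[], None]    # hypothetical stand-in for a Jenkins/Ansible trigger
    validate: Callable[[], bool]  # hypothetical post-step health check


@dataclass
class RunResult:
    completed: List[str] = field(default_factory=list)
    failed_step: Optional[str] = None


def run_runbook(steps: List[Step]) -> RunResult:
    """Run steps in order and halt at the first failed validation."""
    result = RunResult()
    for step in steps:
        step.action()
        if not step.validate():
            # Stop here so the human orchestrator can decide what to do next.
            result.failed_step = step.name
            break
        result.completed.append(step.name)
    return result


# Simulated environment (assumption: real steps would call external tooling).
state = {"db_primary": "site-a", "storage": "site-a", "app_replicas": 4}

steps = [
    Step("database failover",
         lambda: state.update(db_primary="site-b"),
         lambda: state["db_primary"] == "site-b"),
    Step("storage migration",
         lambda: state.update(storage="site-b"),
         lambda: state["storage"] == "site-b"),
    Step("application down-scaling",
         lambda: state.update(app_replicas=1),
         lambda: state["app_replicas"] == 1),
]

result = run_runbook(steps)
print(result.completed, result.failed_step)
```

Halting on the first failed validation, rather than pushing on, is what leaves room for the app owner to act as orchestrator: the scripts do the work, and a person only steps in when a check does not pass.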
The "unseen" response
Because the runbook was so clearly defined, the primary DR lead wasn't even paged. The application team had the confidence and the tools needed to execute the recovery entirely on their own.
The outcome
The results of the first "real-life" execution far exceeded the performance of previous manual attempts and even improved upon recent planned tests.
- Drastic speed increase: The recovery was completed in just 1 hour and 28 minutes, a significant reduction from the 3+ hours required by the old manual methods.
- Zero escalation: The recovery was so seamless that the DR owners were not needed. They woke up to a notification that the event had already been resolved.
- Regulatory readiness: Cutover automatically generated a parent runbook report, capturing the exact recovery time and providing an immutable audit trail for compliance and regulatory proof of recovery.
- Cultural shift: The team moved from a "fear-based" response to a "certified" one. They are now preparing for a "failback" (returning to the original site) using the same predictable, automated process.
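To put the speed increase in concrete terms, the arithmetic can be worked through as below. Because the baseline is reported only as "3+ hours", the sketch treats 3 hours as a lower bound, so the figures it prints are minimums rather than exact savings. The variable names are illustrative.

```python
# Baseline: "3+ hours" for the manual process, taken here as a lower bound.
baseline_minutes = 3 * 60
# Actual recovery: 1 hour and 28 minutes.
actual_minutes = 1 * 60 + 28

saved_minutes = baseline_minutes - actual_minutes
saved_percent = 100 * saved_minutes / baseline_minutes

print(f"At least {saved_minutes} minutes saved, "
      f"a reduction of at least {saved_percent:.0f}%")
```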
