June 17, 2024

Key components to include in an IT disaster recovery plan checklist: 6 critical points

As with any important business plan, an IT disaster recovery plan needs to be reviewed and updated to account for key operational, personnel and technology changes expected within the next year. While you can’t anticipate every scenario before the next IT disaster recovery (IT DR) event, this article provides a checklist of critical points to consider in your IT or cloud-based disaster recovery plan.

6 critical points in a disaster recovery plan checklist

Every business has some nuance to consider in its disaster recovery (DR) plan; however, there are critical points that every DR plan needs. Here are six disaster recovery checklist items to consider.

Inventory applications by criticality


When a large-scale disaster event or outage happens, it’s critical to have a comprehensive and up-to-date inventory of all of your applications. Each application should be categorized by priority: mission-critical, business-critical or non-critical. This sets the order of recovery based on which applications matter most for getting the business back up and running.

If your workloads are in the cloud, it’s important to know how to bring the workloads and services to full recovery after any automatic failover or backups are complete.

Once you have organized your applications by tier, you can structure disaster recovery plans around each scenario and category.
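As a rough illustration of tiering, the sketch below sorts an inventory into recovery order. The three tiers come from the categories above; the application names, the `location` field and the `recovery_order` helper are hypothetical and stand in for whatever your real inventory holds.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    # Lower value = recovered first. Names follow the three categories above.
    MISSION_CRITICAL = 1
    BUSINESS_CRITICAL = 2
    NON_CRITICAL = 3


@dataclass(frozen=True)
class Application:
    name: str       # hypothetical application name
    tier: Tier
    location: str   # assumed field, e.g. "cloud" or "on-premises"


def recovery_order(inventory: list[Application]) -> list[Application]:
    """Sort applications by tier, then by name so the order is stable."""
    return sorted(inventory, key=lambda app: (app.tier, app.name))


inventory = [
    Application("intranet-wiki", Tier.NON_CRITICAL, "cloud"),
    Application("payments-api", Tier.MISSION_CRITICAL, "cloud"),
    Application("hr-portal", Tier.BUSINESS_CRITICAL, "on-premises"),
]

for app in recovery_order(inventory):
    print(f"{app.tier.name:<18} {app.name} ({app.location})")
```

In practice the inventory would more likely be pulled from a CMDB than hard-coded, but the ordering logic stays the same.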

Build automated and executable runbooks

A comprehensive IT disaster recovery plan includes both the technical and business steps that will need to be taken to recover important business systems and services. As part of your disaster recovery plan checklist, you should build automated and executable runbooks. The runbook provides a centralized source of execution for all automated and manual tasks so you can accurately monitor and manage recovery activities. 

The runbook becomes your dynamic and executable DR plan. This helps to ensure that all tasks are completed, in the right order, to reduce risk and accelerate the recovery process.
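To make the idea of an executable runbook concrete, here is a minimal sketch, not a description of any particular product. The `Task` and `Runbook` classes, the task names and the `manual` flag are all assumptions for illustration; real tasks would call out to your infrastructure or wait for a human sign-off instead of returning `True` immediately.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    name: str
    action: Callable[[], bool]  # returns True on success
    manual: bool = False        # assumed flag: a person performs this step
    status: str = "pending"


@dataclass
class Runbook:
    title: str
    tasks: list[Task] = field(default_factory=list)

    def execute(self) -> bool:
        """Run tasks in order and stop at the first failure."""
        for task in self.tasks:
            task.status = "done" if task.action() else "failed"
            print(f"[{task.status}] {task.name}")
            if task.status == "failed":
                return False
        return True


runbook = Runbook(
    "Recover payments-api",
    [
        Task("Confirm failover to secondary region", lambda: True),
        Task("Restore database from latest backup", lambda: True),
        Task("Business owner signs off on service", lambda: True, manual=True),
    ],
)
completed = runbook.execute()
```

Stopping at the first failed task is one possible design choice; it keeps later steps from running against a half-recovered system and leaves their status as "pending" for the review afterwards.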

Manage and track recovery metrics


To ensure you can recover, tracking critical recovery metrics should also be a part of every IT disaster recovery checklist. Specifically, recovery time objectives (RTOs) and recovery point objectives (RPOs) are two critical metrics to understand, manage and track. The RTO is the maximum acceptable time to restore a service; the RPO is the maximum acceptable amount of data loss, measured in time.

If your workloads are in the cloud, cloud service providers offer various automatic failover strategies that differ in RTO, RPO and cost. As mentioned above, tiering your applications by criticality organizes and prioritizes recoveries. It also enables you to stagger RTOs so that the most important, mission-critical applications have the smallest or near-zero RTOs. The remaining business-critical applications can have longer recovery times, allowing your teams to focus their efforts on what’s most important.

During a live recovery or test event, track your recovery time actual (RTA), the time it actually took to complete the recovery. Then compare this result (RTA) against your target (RTO). This provides concrete evidence of whether you can achieve your RTO.
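The RTA-versus-RTO comparison is simple arithmetic. The sketch below assumes you record a start and an end timestamp for each recovery; the function names, the one-hour target and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def recovery_time_actual(started: datetime, recovered: datetime) -> timedelta:
    """RTA: elapsed time from the start of the recovery to service restoration."""
    return recovered - started


def met_rto(rta: timedelta, rto: timedelta) -> bool:
    """True when the actual recovery time is within the target."""
    return rta <= rto


rto = timedelta(hours=1)  # hypothetical target for one application
rta = recovery_time_actual(
    datetime(2024, 6, 17, 9, 0),
    datetime(2024, 6, 17, 9, 48),
)
print(f"RTA {rta} vs RTO {rto}: {'met' if met_rto(rta, rto) else 'missed'}")
```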

If you’re in the middle of a cloud migration, this adds an extra layer of complexity to a cloud disaster recovery plan: you need to know where every application runs, whether in the cloud (region, availability zone, etc.) or on-premises, along with its RTO and RPO. You also need to consider any interdependencies between applications and how they affect the recovery plan.
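One way to reason about interdependencies is to treat them as a graph and derive a recovery order in which every application comes after the things it depends on. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+); the application names and the dependency map are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical map: each application -> the applications it depends on.
dependencies = {
    "payments-api": {"auth-service", "orders-db"},
    "auth-service": {"identity-db"},
    "orders-db": set(),
    "identity-db": set(),
}

# static_order() yields dependencies before the applications that need them.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

If the map contains a circular dependency, `static_order()` raises `graphlib.CycleError`, which is itself a useful finding to resolve before a real recovery.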

Integrate the technology recovery stack


When undergoing a live disaster recovery or test scenario, you will likely need to use data from various IT service management (ITSM), business continuity management (BCM), infrastructure as code (IaC), and communication platforms. For your IT disaster recovery plan checklist, integrating across the technology recovery stack can add automation to the DR process, increasing efficiency and productivity. This level of integration provides:  

  • Enhanced or streamlined communication between teams
  • Reduction or removal of manual, repetitive tasks
  • Easier access to critical data from the configuration management database (CMDB)
  • Faster provisioning of new infrastructure, applications or virtual servers in the cloud
  • Automatic health checks that monitor recovery activities and application health

Reference the organizational design and personnel plan


Automation is important, but it’s critical to also keep people in the loop. As part of your disaster recovery checklist, document how your teams are structured so that operations, communications, roles and responsibilities, and decision-making models are clear.

While DevSecOps teams often run application operations, a resilience role will likely provide authority and guidance on how to make an application or service resilient. It’s important to have clear guidelines on who does what between these and any adjacent teams involved in the recovery process. Otherwise, confusion can set in and delay the recovery.

It’s also important to understand the personnel plan for the upcoming year and how anticipated new hires/teams or reductions in staff impact the disaster recovery plan. If your teams scale up or down, those additions or removals can cause breakdowns in processes which can be detrimental during a recovery.

It’s not just full-time employees; you also need to consider part-time staff, contractors and consultants. If you work with an IT contractor, they need to be available should a recovery occur and have the appropriate credentials and access to systems. They also need to be included on relevant communications and involved in any post-event debriefs.

Conduct post recovery event reviews


After any disaster recovery event or test, it’s crucial to understand what worked well and what needs improvement. As part of your disaster recovery checklist, reviewing post-event metrics with real data helps you:

  • Understand if specific tasks or teams took longer than expected
  • Pinpoint potential breakdowns in the process
  • Compare RTAs against RTOs
  • Validate if recovery timelines are realistic
  • Identify areas for improvement and implement automation
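As a small illustration of the first point, the sketch below flags tasks that ran over their planned duration. It assumes you log a planned and an actual duration for each task; the task names, durations and the `overruns` helper are hypothetical.

```python
from datetime import timedelta

# Hypothetical task log from a test event: (task, planned duration, actual duration).
task_log = [
    ("Fail over database", timedelta(minutes=10), timedelta(minutes=9)),
    ("Restore application tier", timedelta(minutes=20), timedelta(minutes=35)),
    ("Validate with business owner", timedelta(minutes=15), timedelta(minutes=15)),
]


def overruns(log):
    """Return (task, overrun) pairs for tasks that exceeded plan, largest overrun first."""
    late = [(name, actual - planned) for name, planned, actual in log if actual > planned]
    return sorted(late, key=lambda item: item[1], reverse=True)


for name, extra in overruns(task_log):
    print(f"{name}: over plan by {extra}")
```

Sorting by the size of the overrun puts the likeliest bottlenecks at the top of the review.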

Regular audits, testing and updates


An IT disaster recovery plan only proves fruitful if it is regularly reviewed, tested and updated with any lessons learned. 

With the six critical points in the IT disaster recovery checklist above in mind, it’s important to ensure your plan accurately reflects the current state of your business. A rule of thumb is to review and test your IT disaster recovery plan at least once per year. Regular audits, testing and updates enable you to:

  • Understand the overall process flow of the IT DR plan 
  • Pinpoint areas of weakness and bottlenecks
  • Identify if individual tasks were missed or overrun 
  • Brainstorm areas for improvement

Audit the IT disaster recovery plan checklist itself as well, and make any necessary adjustments and updates.

Automate your IT disaster recovery plan with Cutover

It’s difficult to track and maintain up-to-date IT disaster recovery plans. With hundreds or thousands of applications, keeping accurate plans in automated, executable runbooks helps standardize and accelerate IT disaster recoveries. To learn more about how Cutover can help, schedule a demo today.

Kimberly Sack
IT Disaster Recovery