The shift from applications residing in highly redundant data centers to distributed cloud architectures brings resiliency and high availability, but also complexity. Managing cloud-native architectures differs substantially from managing on-premises data centers, and there’s a learning curve in understanding how to handle failures in large cloud networks that span availability zones and multiple regions.
When dealing with such complexity, the risk of human error increases. Significant IT outages are caused by an array of factors, including power, network, and IT system failures, as well as cooling, security, and third-party issues. Uptime Institute estimates that human error plays a role in two-thirds of all outages, and that 40% of human error-related outages are caused by staff incorrectly following processes or procedures.
This is where automation comes into play. A recent Harvard Business Review Pulse Survey, “Taking the Lead on IT Automation: IT Leaders as Evangelists for Their Automation Strategies,” found that:
- 80% of respondents say adopting IT automation is “extremely important” or “very important” to the future success of their organization
- 68% of respondents agree that in the past 12 months, IT automation has shifted from “nice to have” to “must have”
Cutover’s recent webinar ‘Automating cloud disaster recovery’ explores how automation across teams and technology can benefit enterprises. Here are three key takeaways from the webinar.
1. Understand the shared responsibility of cloud disaster recovery
Before you plan your disaster recovery (DR) procedures and define recovery time objectives, it’s critical to fully understand who owns the recovery of your cloud applications and services.
When using a public cloud provider for infrastructure-as-a-service (IaaS), your provider protects their infrastructure, storage, and network. However, you, as the enterprise, manage the workloads, security, middleware, and guest operating systems.
This means that you own the availability and recovery (including recovery time objectives and recovery time actuals) of the workloads, security, middleware, and guest operating systems.
Figure 1 below illustrates the responsibility of managing applications and services in the cloud. As you migrate to the cloud, your disaster recovery procedures require updates.
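One way to make this ownership split explicit is to record it as data and check your DR documentation against it. The sketch below is illustrative only: the layer names come from the IaaS description above, while the constant and function names (`IAAS_RESPONSIBILITY`, `layers_owned_by`, `missing_dr_plans`) are hypothetical and not part of any provider's tooling.

```python
PROVIDER = "cloud provider"
ENTERPRISE = "enterprise"

# IaaS split as described above; other service models divide ownership differently.
IAAS_RESPONSIBILITY = {
    "infrastructure": PROVIDER,
    "storage": PROVIDER,
    "network": PROVIDER,
    "workloads": ENTERPRISE,
    "security": ENTERPRISE,
    "middleware": ENTERPRISE,
    "guest operating systems": ENTERPRISE,
}


def layers_owned_by(owner: str) -> list:
    """Return the layers a given party is responsible for, in table order."""
    return [layer for layer, who in IAAS_RESPONSIBILITY.items() if who == owner]


def missing_dr_plans(documented: set) -> list:
    """Return enterprise-owned layers that have no documented recovery plan yet."""
    return [layer for layer in layers_owned_by(ENTERPRISE) if layer not in documented]


# Example: only two of the four enterprise-owned layers have a plan so far.
gaps = missing_dr_plans({"workloads", "middleware"})
```

A check like this could run whenever DR procedures are updated during a migration, so that gaps in enterprise-owned layers surface early.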
2. Increase visibility with a cloud disaster recovery template repository and automated recovery engine
Regardless of whether your applications are in the cloud or on-premises, you should apply the same engineering discipline to your technology resilience.
With a source-of-truth repository, such as GitHub, and an automated recovery engine, you can store and edit files in the CI/CD pipeline and provide visibility to everyone involved, bridging the gap between engineering and non-engineering teams. Source control helps you understand how and what your automation is delivering to your service. If something goes wrong, you can reference the log, pinpoint the configuration file, and make any necessary fixes.
Technology resilience procedures should be captured in scripted runbooks, tested, and their execution automated to respond to threats when appropriate. This can help you achieve operational efficiencies, such as reducing cost and removing configuration drift, across different cloud domains.
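To make the idea of a scripted runbook concrete, here is a minimal sketch of one expressed as code that could live in source control. It is not Cutover's implementation: the `Step` and `Runbook` classes and the step names are hypothetical, and the actions are placeholders where real calls to your cloud provider would go.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True on success


@dataclass
class Runbook:
    name: str
    steps: List[Step]
    log: List[str] = field(default_factory=list)

    def execute(self) -> bool:
        """Run steps in order, log each outcome, and stop at the first failure."""
        for step in self.steps:
            ok = step.action()
            self.log.append(f"{step.name}: {'ok' if ok else 'FAILED'}")
            if not ok:
                return False
        return True


# Hypothetical steps; the lambdas stand in for real recovery actions.
runbook = Runbook(
    name="fail over orders service",
    steps=[
        Step("promote standby database", lambda: True),
        Step("redirect traffic to secondary region", lambda: True),
        Step("verify health checks", lambda: True),
    ],
)
succeeded = runbook.execute()
```

Because the runbook is plain code, changes to it can be reviewed and tested like any other change, and the log gives non-engineering teams a readable record of what ran.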
3. Automating cloud DR requires a strategy, not just a tool
Gartner recently shared, “Without automation, you can’t manage cloud at scale.” While automation is becoming a “must have”, it can’t be a one-off. There needs to be a clearly defined strategy that considers the enterprise’s overall requirements, not just the tools used to automate manual tasks.
Focus on people and processes over technology early on in your cloud migration journey. Build an automation strategy that includes a portfolio of initiatives across both operations and deployment domains. Recognize that successful automation requires knowledge of automation value possibilities and attention to the related people and processes. Drive prioritization decisions with a long-term perspective to produce a flexible and interconnected portfolio of initiatives.
Remember, just because you can automate a task doesn’t mean you should. The more you automate, the more software you need to manage. Get started by identifying the weakest link in your resilience process. Automate repetitive processes like:
- Alerting to turn services off
- Provisioning databases
- Copying data
- Deploying code
- Maintaining canaries that constantly monitor and test applications
- Performing regular automated failover recovery testing to ensure that each part of an application performs properly under all conditions
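As an illustration of the last item, the following sketch times a simulated failover and compares the recovery time actual (RTA) with a recovery time objective (RTO). Everything here is an assumption for demonstration: `run_failover_test` and `FakeService` are hypothetical names, and in practice the two callables would trigger a real failover and query a real health check.

```python
import time
from typing import Callable


def run_failover_test(
    trigger_failover: Callable[[], None],
    is_healthy: Callable[[], bool],
    rto_seconds: float,
    poll_interval: float = 0.01,
) -> dict:
    """Trigger a failover, poll until healthy or the RTO elapses, and report the RTA."""
    start = time.monotonic()
    trigger_failover()
    recovered = False
    while time.monotonic() - start <= rto_seconds:
        if is_healthy():
            recovered = True
            break
        time.sleep(poll_interval)
    rta = time.monotonic() - start
    return {
        "recovered": recovered,
        "rta_seconds": rta,
        "met_rto": recovered and rta <= rto_seconds,
    }


class FakeService:
    """Stand-in for a real application: healthy again after a few checks."""

    def __init__(self, checks_until_healthy: int):
        self.remaining = checks_until_healthy
        self.failed_over = False

    def fail_over(self) -> None:
        self.failed_over = True

    def healthy(self) -> bool:
        if not self.failed_over:
            return False
        self.remaining -= 1
        return self.remaining <= 0


service = FakeService(checks_until_healthy=3)
result = run_failover_test(service.fail_over, service.healthy, rto_seconds=5.0)
```

Running a test like this on a schedule gives you a history of recovery time actuals to compare against your objectives, rather than a single point-in-time exercise.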
Learn more about automating cloud DR
Cutover works with enterprises to turn complex disaster recovery processes into automated, executable runbooks that provide visibility and control of recovery processes.
Watch the on-demand webinar recording below or contact us to learn more.