Don’t let Cyber Monday bring you down: Ensure IT resilience with Cutover Recover

It’s the busiest online shopping day of the year: Cyber Monday. Adobe Analytics expects online sales to reach $13.2 billion, so it’s critical for retailers to ensure that their IT systems can handle the surge in online shopping traffic.

Should an IT outage or disruption occur, you need to be able to seamlessly recover your affected applications and fail over to a standby or secondary site, minimizing any downtime or disruptions to the business or consumers.

This article walks through a multi-region cloud disaster recovery (DR) process for an enterprise retail application. We’ll show how Cutover’s automated runbooks provide complete visibility across the entire process and accelerate your recovery to increase IT resiliency.

The manual nature of multi-region disaster recovery

During a recovery, coordinating across teams, services, and components is complex, with hundreds of manual tasks and plenty of room for human error. Additionally, keeping DR documentation up to date and testing DR plans is time-consuming and difficult.

Introducing: Cutover Recover for multi-region cloud failovers

Cutover Recover helps automate, accelerate and scale multi-region cloud disaster recovery processes. A SaaS platform with dynamic, automated runbooks, Cutover enables enterprises to:

Simplify and accelerate a multi-region cloud (or any) DR process with complete visibility, streamlined communications, and advanced automation.
Orchestrate the DR process with added control by combining human decision making and automated tasks in one platform.
Map and execute the DR process in runbooks for one source of execution to manage the entire process.

Define your DR plan in runbooks

To help you codify and standardize your application DR process, Cutover Recover includes runbook templates to address multiple disaster recovery strategies and infrastructure types. You can then customize your DR runbooks for your unique needs - for example, runbooks enable you to build more organized DR processes with workstreams and application infrastructure layers. Common workstreams for a multi-region DR strategy for an application are:

Pre-failover
Failover
Post-failover validation
Pre-failback
Failback
Post-failback validation
Cleanup

*Figure 1: Get a centralized view of the entire multi-region DR process for an application*
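As a rough illustration of what codifying the plan can look like, the sketch below models the seven workstreams as an ordered sequence in which each completed workstream counts as one milestone. The `Runbook` class and its methods are hypothetical names chosen for this example; they are not Cutover's data model or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Workstream names come from the list above; everything else is illustrative.
WORKSTREAMS = [
    "Pre-failover",
    "Failover",
    "Post-failover validation",
    "Pre-failback",
    "Failback",
    "Post-failback validation",
    "Cleanup",
]


@dataclass
class Runbook:
    """Hypothetical in-memory model of a DR runbook (an assumption, not Cutover's)."""

    name: str
    completed: list[str] = field(default_factory=list)

    def next_workstream(self) -> str | None:
        """Return the next workstream to run, or None once all are complete."""
        if len(self.completed) == len(WORKSTREAMS):
            return None
        return WORKSTREAMS[len(self.completed)]

    def complete(self, workstream: str) -> int:
        """Mark a workstream complete and return the milestone number reached."""
        expected = self.next_workstream()
        if workstream != expected:
            raise ValueError(f"expected {expected!r}, got {workstream!r}")
        self.completed.append(workstream)
        return len(self.completed)
```

Enforcing the order in code means a team cannot, for example, start failback before post-failover validation has been signed off.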

Automate manual, repetitive tasks

During the busiest online shopping day of the year, if you have to recover your applications, you want to do it as quickly and efficiently as possible. With Cutover, you can integrate with multiple cloud services and other applications in your technology stack via simple REST API calls and authorization methods. The Cutover platform then acts as the single point of execution, so you avoid managing tasks in multiple consoles, reducing “ClickOps” (and headaches). Cutover runbooks also let you organize complex technical DR tasks across teams, so no one person becomes a single point of failure.
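To make the idea of an automated task calling a REST API concrete, here is a minimal sketch that builds (but does not send) an authenticated request using only the Python standard library. The base URL, the `/actions` path, the payload fields, and the bearer-token scheme are all assumptions for illustration; a real integration would use whatever endpoint and authorization method the target service documents.

```python
import json
import urllib.request


def build_task_request(
    base_url: str, token: str, action: str, region: str
) -> urllib.request.Request:
    """Build an authenticated POST for a hypothetical automation endpoint.

    The request is only constructed here; nothing is sent over the network.
    """
    payload = json.dumps({"action": action, "region": region}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/actions",  # hypothetical path
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )
```

Passing the resulting object to `urllib.request.urlopen` would send it, which is the point at which timeouts and retries would need to be handled.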

Execute a multi-region failover with Cutover

Communication during a multi-region cloud failover and recovery is essential to making sure the process goes smoothly and there is no lag in task handoffs between teams (Platform, Network, Database, Services, etc.). Cutover sends automatic communications via email and, through integrations, to collaboration platform channels (MS Teams, Slack, etc.).

In a Cutover runbook, the Platform Team starts the pre-failover, failover and post-failover validation process with one click.

Pre-failover tasks:

The runbook outlines all tasks for the Platform and Network teams to confirm the standby region is available for failover. Then, they “hand over” responsibility to the Database, Application, and Service Teams to start completing their tasks in parallel. With the runbook, all teams can monitor the pre-failover progress in real time.
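The handoff pattern described here, sequential confirmation followed by parallel team tasks, can be sketched as follows. The function name, the team labels used as dictionary keys, and the callables are illustrative assumptions, not part of any Cutover interface.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pre_failover(gate_checks, parallel_tasks):
    """Run gate checks in order, then fan out team tasks in parallel.

    gate_checks: list of (team, callable) pairs; each callable must return True.
    parallel_tasks: dict of team -> callable, started only after the handoff.
    Returns a dict of team -> result once every parallel task has finished.
    """
    # Platform and Network confirm the standby region before anything else runs.
    for team, check in gate_checks:
        if not check():
            raise RuntimeError(f"{team} could not confirm the standby region")

    # Handoff: the remaining teams work at the same time.
    workers = max(1, len(parallel_tasks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {team: pool.submit(task) for team, task in parallel_tasks.items()}
        return {team: future.result() for team, future in futures.items()}
```

Because the results are only returned once every future has resolved, the caller can treat the return as the signal that the pre-failover milestone has been reached.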

Once all pre-failover tasks are completed, the first runbook milestone is complete and Cutover sends an automated email and, via the integrations, messages to the collaboration platform channels.

Failover tasks:

Next, the Platform Team is notified to make a go/no-go (GnG) decision about failing over to the standby location. The Cutover runbook includes a manual validation task for the GnG, which ensures that people are always in the loop for crucial decision making.
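One way to picture a manual gate like this is as a small object that automation has to consult before it proceeds. The sketch below is a hypothetical model, with `Decision` and `GoNoGoGate` as invented names; it simply shows that nothing passes until a named person records a decision.

```python
from enum import Enum


class Decision(Enum):
    PENDING = "pending"
    GO = "go"
    NO_GO = "no-go"


class GoNoGoGate:
    """Hypothetical manual gate: automation cannot pass until a person decides."""

    def __init__(self):
        self.decision = Decision.PENDING
        self.decided_by = None

    def record(self, decision: Decision, decided_by: str) -> None:
        """Record a human decision; PENDING and anonymous decisions are rejected."""
        if decision is Decision.PENDING:
            raise ValueError("a decision must be GO or NO_GO")
        if not decided_by:
            raise ValueError("a decision must be attributed to a person")
        self.decision = decision
        self.decided_by = decided_by

    def may_fail_over(self) -> bool:
        """Automated failover tasks should start only when this returns True."""
        return self.decision is Decision.GO
```

Requiring `decided_by` keeps the decision attributable, which matters later when the recovery is reviewed.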

Once the failover decision is made, the following sub-stream of tasks can be kicked off and communications sent out (similar to pre-failover):

The Network Team is notified to stop accepting traffic in the current active region.
The Services Team checks that active and failover (standby) regions are completely synced.
Then, the Services Team stops cross-cluster replication between the active region (leader domain) and the standby region (follower domain). The follower domain can then be promoted to leader and write traffic routed to it, which helps avoid data loss for new changes and updates.
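The ordering of these steps matters: replication should only be stopped once the regions are confirmed to be in sync. The sketch below models that sequence with a simple state object. `RegionPair`, its fields, and the integer "positions" used to represent replication progress are assumptions for illustration, not how any particular database or Cutover represents them.

```python
from dataclasses import dataclass


@dataclass
class RegionPair:
    """Hypothetical state of an active/standby region pair."""

    active_accepting_traffic: bool = True
    replication_running: bool = True
    active_position: int = 0   # last change written in the active region
    standby_position: int = 0  # last change applied in the standby region
    leader: str = "active"


def fail_over(pair: RegionPair) -> list[str]:
    """Apply the failover steps in order and return a log of what was done."""
    steps = []

    # 1. Network Team: stop accepting traffic in the active region.
    pair.active_accepting_traffic = False
    steps.append("stopped traffic in active region")

    # 2. Services Team: refuse to continue if the standby is behind.
    if pair.standby_position != pair.active_position:
        raise RuntimeError("standby is behind the active region; wait for sync")
    steps.append("confirmed regions are in sync")

    # 3. Services Team: stop replication, then promote the follower.
    pair.replication_running = False
    steps.append("stopped cross-cluster replication")
    pair.leader = "standby"
    steps.append("promoted standby follower to leader")
    return steps
```

If the sync check fails, replication is left running and the leader is unchanged, so the standby can catch up before the step is retried.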

Post-failover validation

The Platform Team begins the post-failover validation process, monitoring application availability and accessibility and performance metrics including CPU utilization, memory usage and response time.
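A validation task of this kind often comes down to comparing observed metrics against agreed limits. The following sketch shows one way to express that check; the metric names and the threshold values are placeholders chosen for the example, and real limits would come from your own service-level objectives.

```python
# Placeholder limits for illustration only; not recommended values.
THRESHOLDS = {
    "cpu_percent": 80.0,
    "memory_percent": 85.0,
    "response_ms": 500.0,
}


def validate_metrics(metrics, thresholds=THRESHOLDS):
    """Return a list of failure messages; an empty list means validation passed."""
    failures = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # Missing data is treated as a failure rather than silently passing.
            failures.append(f"{name}: no data")
        elif value > limit:
            failures.append(f"{name}: {value} exceeds {limit}")
    return failures
```

Treating a missing metric as a failure is a deliberate choice here: after a failover, an absent data point can mean monitoring itself did not move to the new region.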

If your teams need additional descriptions on the post-failover validation tasks (or any tasks), you can pre-configure them with links to monitoring dashboards or additional details.

Once the validation activities are completed, the runbook reaches another major milestone and the application is confirmed to be working as expected in the failover region with documented metrics and dashboards in each task.

Return to home (pre-failback, failback, validation and cleanup)

Before the failback can happen, the Platform Team ensures the failback region is scaled appropriately to handle the current application traffic. This is often a manual stage in the process as the team needs to:

Match the scale of the two clusters
Sync the delta from the follower domain in the failover region to the leader domain in the failback region

Pre-failback

During the pre-failback stage, the Services Team prepares the active region for failback and the synchronization of data. Once complete, stakeholders are automatically notified via Cutover notifications and communication alerts that pre-failback is complete and failback can begin.

Failback

The failback stage is typically similar to failover activities, just in the reverse order. Communications will automatically be sent out both at the start and completion of failback. Once complete, the runbook reaches another major milestone and the application is back and operating in its original location, while scaled to handle the current demands.

Similar to the post-failover validation stage, validation still needs to be performed post-failback to ensure that the application is working as expected in its original region. Once these validations are complete, the runbook reaches its sixth major milestone, which then leads us into the final stream of tasks for the Platform Team.

*Figure 3: After turning off routing control in the failover region, confirm that the data has synced between the two regions before starting the failback.*

Cleanup

The cleanup of a failover location is often one of the most neglected parts of a DR strategy. A good cleanup strategy ensures that you are keeping your costs in check by avoiding running any unnecessary services in the standby cloud region.

Cutover runbooks ensure that cleanup activities are performed after post-failback validation or after you choose to abandon failover activities and before the DR runbook is considered complete.
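The guarantee described here, that cleanup runs whether the process completes or is abandoned partway, maps naturally onto a try/finally structure. The sketch below is a simplified illustration; `FailoverAbandoned` and `run_with_cleanup` are hypothetical names, and it says nothing about how Cutover implements this internally.

```python
class FailoverAbandoned(Exception):
    """Raised by a step when the team decides to abandon the failover."""


def run_with_cleanup(workstreams, cleanup):
    """Run (name, step) pairs in order; always run cleanup before returning.

    Returns which workstreams completed and whether the run was abandoned.
    """
    completed = []
    abandoned = False
    try:
        for name, step in workstreams:
            step()
            completed.append(name)
    except FailoverAbandoned:
        abandoned = True
    finally:
        # Runs after a full pass, after an abandon, and on unexpected errors.
        cleanup()
    return {"completed": completed, "abandoned": abandoned}
```

Unexpected errors other than `FailoverAbandoned` still propagate to the caller, but only after cleanup has run.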

Because there isn’t a one-button failover, the Services Team once again needs to carry out specific tasks for the application. This includes setting up and starting cross-region replication tasks from active (failback) to standby (failover). These tasks are the same automated tasks from the pre-failback stream.

Once the cleanup activities are complete, the standby region returns to its original state as shown in Figure 1 and the runbook reaches its seventh and final milestone and is complete after the communications are sent out.

Benefits of using Cutover for multi-region disaster recovery

If an outage hits on a day like today, you need to ensure that your teams are prepared with comprehensive, well-documented and executable DR runbooks. With Cutover, you can standardize and automate DR plans with runbooks and:

Keep all teams (Application, Database, Platform, Services, etc.) on track for enhanced governance
Orchestrate all tasks, both manual and automated, from one centralized platform
Automatically capture every event and task with the immutable audit log to satisfy regulatory and compliance requirements, without any manual overhead
Improve reliability, gain confidence in your DR plan, and recover up to 50% faster

Don’t wait until next year to test your recovery

Ideally, before an IT disaster event, you’ve tested your DR process to ensure that your plan is up to date and accurate.

Cutover helps streamline a multi-region failover and failback simulation that mirrors a real recovery. Test and validate a multi-region application DR process, then prove your ability to recover to regulators with real-time dashboards and an automatically generated audit log that captures every action within the platform.

Ensure IT resilience with Cutover automated runbooks

On Cyber Monday, you can’t risk any website downtime - every second counts!

Contact Cutover to learn how we help you simplify the complexity of a multi-region disaster recovery process. To see how Cutover Recover can help with your multi-region DR, book a demo.

Kimberly Sack

