
Guidance for Deploying Elastic Disaster Recovery for Cross-Availability Zone Disaster Recovery

Summary: This implementation guide provides an overview of Guidance for Deploying Elastic Disaster Recovery for Cross-Availability Zone Disaster Recovery, its reference architecture and components, considerations for planning the deployment, and configuration steps for deploying this Guidance on Amazon Web Services (AWS). It introduces core concepts and step-by-step instructions for designing, deploying, and managing an Elastic Disaster Recovery implementation.


Overview

This implementation guide details how to deploy AWS Elastic Disaster Recovery for customers who want to protect applications that currently operate within an AWS Availability Zone (AZ) and recover to another AZ in the same Region for disaster recovery. It complements the publicly available service documentation in the AWS documentation library. This document introduces important concepts, provides specific guidance on configuring Elastic Disaster Recovery for a cross-AZ use case, and gives step-by-step instructions on how to design, deploy, and manage your Elastic Disaster Recovery implementation.

By following this guide, you will be able to:

  • Familiarize yourself with the concepts of Elastic Disaster Recovery.
  • Learn where Elastic Disaster Recovery fits in your overall disaster recovery design.
  • Deploy Elastic Disaster Recovery to protect applications across AZs.
  • Recover source servers in an AWS Region using best practices.
  • Examine your Elastic Disaster Recovery testing process.
  • Understand how to recover your servers during a disaster.
  • Understand how to fail back your servers after the disaster in your source AZ has been resolved.
  • If needed, properly clean up and remove servers from Elastic Disaster Recovery.

What is AWS Elastic Disaster Recovery?

Elastic Disaster Recovery minimizes downtime and data loss with fast, reliable recovery of on-premises and cloud-based applications into Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Block Store (Amazon EBS), using affordable storage, minimal compute, and point-in-time recovery. Elastic Disaster Recovery continuously replicates source servers to your recovery target within AWS, allowing you to prepare your environment to recover within minutes from unexpected infrastructure or application outages, human error, data corruption, ransomware, or other disruptions.

Elastic Disaster Recovery provides a unified process to test, recover, and fail back any application running on a supported operating system (OS). Elastic Disaster Recovery supports large, heterogeneous environments with mission-critical workloads. Additionally, this service can support recovery point objectives (RPO) of seconds and recovery time objectives (RTO) of minutes, reducing overall disaster recovery costs.

Core Concepts

Below is a high-level overview of the core concepts incorporated in Elastic Disaster Recovery. We recommend that you also familiarize yourself with core AWS functionality such as AWS Identity and Access Management (IAM), AWS networking, Amazon EC2, and general disaster recovery concepts.

The main goal of disaster recovery is to help your business prepare for and recover from unexpected events in an acceptable amount of time. This means you need to determine which applications deliver the core functionality your business requires, and define the appropriate RTO and RPO for those applications.

Availability Zone (AZ)

An Availability Zone (AZ) is one or more discrete data centers with redundant power, networking, and connectivity in an AWS Region. AZs give customers the ability to operate production applications and databases that are more highly available, fault tolerant, and scalable than would be possible from a single data center. All AZs in an AWS Region are interconnected with high-bandwidth, low-latency networking over fully redundant, dedicated metro fiber, providing high-throughput networking between AZs. All traffic between AZs is encrypted at the physical layer. The network performance is sufficient to accomplish synchronous replication between AZs. AZs make partitioning applications for high availability easy. If an application is partitioned across AZs, companies are better isolated and protected from issues such as power outages, lightning strikes, tornadoes, earthquakes, and more. Each AZ is physically separated from any other AZ by a meaningful distance (many kilometers), although all are within 100 km (60 miles) of each other.

Elastic Disaster Recovery for cross-AZ

In AWS, an AZ is an isolated location within an AWS Region, with redundant power, networking, and connectivity. AZs give users the ability to operate production applications and databases that are more highly available, fault tolerant, and scalable than would be possible from a single data center. A disaster recovery strategy across multiple AZs within a single AWS Region can provide mitigation against disruptions such as fires, floods, and major power outages. If you need protection against the unlikely event that an entire AWS Region becomes unavailable, you can opt for a DR strategy that uses multiple AWS Regions. This trade-off should be considered in the context of the overall business continuity plan, objectives, and risks that you are trying to mitigate.

One benefit of a cross-AZ disaster recovery solution is its reduced complexity, achieved by leveraging localized networking constructs across AZs. This simplifies the overall configuration and deployment of Elastic Disaster Recovery by allowing data replication to occur over a shorter distance. Additionally, using cross-AZ replication can help reduce costs associated with data transfer between Regions, as data transfer within the same Region (across AZs) typically incurs lower charges.

In certain situations, parts of an application can only run in a single AZ and are unable to take advantage of a multi-AZ design pattern to provide high availability. In these scenarios, customers can use Elastic Disaster Recovery to provide cross-AZ disaster recovery for the parts of the application that are constrained to run in a single AZ, while the other parts of the application use multi-AZ for high availability.

The deployment of Elastic Disaster Recovery across multiple AZs also provides enhanced resilience and flexibility for disaster recovery operations. By designating the source environment in AZ-1, the staging environment in AZ-2, and the recovery environment in AZ-3, customers can ensure that a failure in any single AZ does not impact the overall disaster recovery capabilities. This cross-AZ configuration helps isolate the different components of the Elastic Disaster Recovery infrastructure, thereby reducing the blast radius in the event of an AZ impairment. Additionally, this multi-AZ approach can address regulatory requirements around data residency, as the data remains within the same AWS Region but is distributed across multiple AZs.

Recovery point objective (RPO)

RPO defines how much data loss your application can tolerate and determines what is considered an acceptable loss of data between the last recovery point and the interruption of service.

Recovery time objective (RTO)

RTO is defined by the organization on a per application or workload level. RTO is the maximum acceptable time between the interruption of service and restoration of service. This determines what is considered an acceptable time window when service is unavailable after a disaster.

Source server

The source server refers to the instance or server that you want to protect and recover in the event of a disaster. Elastic Disaster Recovery can be used to recover Amazon EC2 instances (referred to as Recovery Instances on Elastic Disaster Recovery) in a different AZ within the same AWS Region or a different AWS Region. Elastic Disaster Recovery can also protect applications hosted on physical infrastructure, VMware vSphere, Microsoft Hyper-V, and cloud infrastructure from other cloud providers.

Recovery subnet

The recovery subnet is the virtual network segment, hosted within an AZ, that hosts the recovered source servers in the event of a disaster. For a cross-AZ deployment, the recovery subnet range should not overlap with the source subnet range. If overlapping ranges are required, review the information in the Advanced Topics section.

AWS Replication Agent

The AWS Replication Agent is a lightweight software package. It must be installed on each source EC2 instance that you want to protect using Elastic Disaster Recovery. The agent performs two main tasks: (1) initial block-level replication of disks, copying the state of each disk on the source server and transmitting this data to the staging environment, where it is persisted on EBS volumes that logically map to the source disks; and (2) real-time monitoring and replication of all block-level changes once the agent has completed the initial synchronization process.

Staging area subnet

In the selected AWS account and Region, the subnet selected to host the replication infrastructure is referred to as the staging area subnet. Elastic Disaster Recovery uses low-cost compute and storage hosted on the staging area subnet to keep the data in sync with the source environment. Replication resources consist of replication servers, staging volumes, and EBS snapshots.

Replication server

The replication server is responsible for receiving and storing the replicated data from the source server. The replication server is an EC2 instance to which the staging EBS volumes are attached. The AWS Replication Agent sends data from the source server to the replication server during the initial synchronization process or when blocks change on the source server. Replication servers will take frequent snapshots of the staging EBS volumes attached to them.

Point in time snapshots (PiT snapshots)

These are periodic backups taken by the replication server at specific intervals to capture the state of the source server and its data. The intervals are:

  1. Once every 10 minutes for the last hour
  2. Once an hour for the last 24 hours
  3. Once a day for the last 7 days (unless a different retention period is configured, 1-365 days).

These point-in-time (PiT) snapshots are used during a recovery or recovery drill when you do not wish to recover the most recent data, such as in the case of ransomware.
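
As an illustration of how the retention schedule above surfaces through the API, the following minimal boto3 sketch lists the recovery snapshots currently available for each protected source server. The Region is an example, and the method and response field names reflect the drs API as exposed by boto3 at the time of writing; verify them against the current SDK reference before relying on them.

```python
import boto3

drs = boto3.client("drs", region_name="eu-west-1")  # example Region

# Walk the protected source servers and print their most recent PiT snapshots.
servers = drs.describe_source_servers(filters={}, maxResults=50)
for server in servers["items"]:
    server_id = server["sourceServerID"]
    snaps = drs.describe_recovery_snapshots(
        sourceServerID=server_id, order="DESC", maxResults=5
    )
    timestamps = [s.get("timestamp", s.get("expectedTimestamp")) for s in snaps["items"]]
    print(server_id, timestamps)
```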

Conversion server

The conversion server is a component that makes all the necessary modifications to allow the target instance to boot and run, including pre- and post-boot scripts.

Drills

Drills refer to scheduled or ad-hoc tests performed to validate the effectiveness of your disaster recovery plan, without affecting the production environment. Elastic Disaster Recovery allows you to conduct drills to simulate recovery scenarios without impacting the production environment or replication state.

Recovery instance

During an actual recovery, a recovery instance is provisioned in the recovery subnet. The recovery instance is an EC2 instance and a fully functional copy of the source server that allows you to recover operations in the recovery AZ.

Drill Instance

A drill instance is an instance that has been launched using Elastic Disaster Recovery for the purpose of a drill or “test”. The goal of launching a drill instance is to test and validate your disaster recovery plan before an actual disaster. This instance is meant to be launched while your source server remains active. Customers may choose to activate this instance for production use by shutting down the source server and redirecting traffic to this instance.

Failover

Failover is the process of initiating a recovery in Elastic Disaster Recovery to launch an EC2 instance and restore your data to EBS volumes from the selected PiT snapshot. This process includes failing over to a new AZ, launching the recovery instances, and validating that the application is ready to receive traffic. Additional steps are often required to prepare the recovery environment for a failover, and these are often documented and executed as part of a disaster recovery runbook.

Failback

Failback is the process of returning to normal operations at your source AZ. This includes replicating data back to the source environment, bringing the source servers back online, and redirecting user traffic back to these machines (redirection of traffic, as well as other configuration operations, are handled outside of Elastic Disaster Recovery).

Planning

Elastic Disaster Recovery is only one part of a larger disaster recovery strategy, and being prepared for these unforeseen events requires proper coordination across people, process, and technology. The recovery plan should be documented, with clearly defined stakeholders, their roles and responsibilities, and the steps that should be taken in the event of a real disaster. Below is a checklist of key concepts to consider as part of the planning process.

Who are the stakeholders?

Identify all individuals and stakeholders who should be involved and informed when a disaster occurs. Consider using tools such as a responsibility matrix that provide a method to define who is responsible, accountable, consulted, and informed during a disaster. In many situations, we tend to focus on technical stakeholders, who are involved with responding to the actual disaster, but we should also consider other stakeholders, such as vendors, third-party suppliers, public relations, marketing teams, and even key customers. We recommend keeping a registry of all stakeholders with their defined responsibilities and contact information. One of the most critical roles when preparing for a disaster is defining the individual(s) who will make the final decision on declaring a disaster and initiating the Business Continuity or Disaster Recovery Plan.

Establish communication channels

Once you have identified and documented all relevant stakeholders, it will be necessary to define the proper communication channels to keep everyone informed. Part of this process should be establishing a chain of command and defining well understood escalation paths. We generally recommend the use of dedicated communication channels and hubs, such as an onsite situation room where everyone will gather to respond to the disaster. Video conferencing and instant messaging can also be used to facilitate virtual meeting rooms. It is highly advised that executive leadership is kept informed throughout the process.

Maintain up to date documentation

Disasters might be hard to predict, but how you respond to these types of events should be predictable. Once it has been determined that you will be activating your disaster recovery response, it is critical to follow procedures that have been tried and tested. In all cases, this should start with up-to-date documentation detailing all steps to be followed. Although your operations and engineering teams are skilled and knowledgeable, the pressure that comes with a disaster is high.

The documentation should include:

  • Configuration state: mapped network connections, the devices involved, and their configurations.
  • The complete setup of systems and their usage: operating system (OS) and configuration, and application versions.
  • Storage and databases: how and where the data is saved, how backups are restored, and how the data is verified for accuracy.
  • Architecture diagrams, vendor support contacts, and the responsibility matrix.

It should cover everything IT-related that your business relies upon. Keep hard copies of the documentation, as outages may knock your internal systems offline.

When to activate the disaster recovery plan

It is critical to quickly detect when your workloads are not meeting business objectives. In this way, you can quickly declare a disaster and recover from an incident. For aggressive recovery objectives, this response time, coupled with appropriate information, is critical to meeting recovery objectives. If your RTO is one hour, then you need to detect the incident, notify appropriate personnel, engage your escalation processes, evaluate information (if you have any) on expected time to recovery (without executing the disaster recovery plan), declare a disaster, and recover within that hour.

Key performance indicators (KPIs) are quantifiable measurements that help you understand how well you’re performing. It is critical to define and track KPIs in order to determine when your business processes are impaired and determine the cause. In this way, you can quickly declare a disaster and recover from an unexpected event. For aggressive recovery objectives, the time to detect an event, declare a disaster, and respond with your recovery plan will determine if your recovery objectives can be met.

Define action response procedure and verification process

After declaring a disaster, the recovery environment should be activated as soon as possible. An action response procedure outlines all of the necessary steps for recovering at the disaster recovery site. Ensure that your action response procedure is documented and provides details on how the necessary services will be started, verified, and controlled. It is recommended that automation be used whenever possible to minimize the impact of human error. Having all services up in the recovery site is not enough to declare success. It is critical to have a verification process that tests that all of the required data is in place, network traffic has been redirected, and all of the required business applications are functioning properly.

Perform regular disaster recovery drills

Many organizations do not perform disaster recovery drills on a regular basis, because their failover procedures are too complex and they have concerns that failover tests will lead to a disruption of their production environment (and possibly data loss). Despite these concerns, it is important to schedule frequent disaster recovery drills, to build confidence in the plan, build comfort within the team, and identify gaps. People will play a large part in any disaster recovery plan, and only by rehearsing the steps and procedures can we ensure that they can respond quickly and accurately to a real event. Furthermore, as the state and configuration of our systems change over time, only by conducting such exercises can we identify unexpected impact. In many cases, planned drills can be scoped down to focus on specific parts of the response plan. When using Elastic Disaster Recovery, these drills can be conducted in an isolated manner, in such a way that production is not impacted.

Stay up to date

Many companies maintain a risk register that tracks and quantifies potential risks to the business. They often include an analysis of current threats, previous disasters, and lessons learned. The risk register should have stakeholders that extend outside of the technology and operations teams and include the business, risk and executive leadership. It is important to be aware of how you handled previous disasters, as well as how you performed during more recent drills. All documentation should be up to date, reflecting the current environment, processes, and procedures.

Recovery operations

In cross-AZ recovery scenarios, once the disaster has been resolved, it will be important to determine whether you can operate in the recovered AZ as your primary site or whether you will need to fail back to the source AZ. In most situations, you should be able to operate in the recovered AZ without any impact, as the AZs are hosted in the same Region. In these situations, you would then need to re-protect these recovered servers so they could fail over if any future disaster occurs. If you choose to recover to the source AZ, then you would need to fail back using Elastic Disaster Recovery. Elastic Disaster Recovery supports both options to allow for flexibility.

Technical Prerequisites

Implementing Elastic Disaster Recovery is a critical step in ensuring business continuity and resilience against unexpected disruptions. To achieve a successful deployment, it is essential to meet specific technical requirements that encompass various aspects of the system. These requirements range from network settings and communication protocols to supported operating systems, Regions, and installation prerequisites.

The following sections provide a detailed overview of the technical requirements necessary for the implementation of Elastic Disaster Recovery. They include guidelines for staging area subnets, network requirements, S3 bucket access, operational subnets, supported AWS Regions, general installation requirements, and specific considerations for Windows and Linux systems.

  1. Administrative rights - Elastic Disaster Recovery can only be initialized with AWSElasticDisasterRecoveryConsoleFullAccess permissions for the AWS account in the target Region.
    1. If you are using Single Sign On (SSO), please use this link for more information
  2. Multi-Account Requirements Reference

    • Staging Account Planning and Limitations: Due to AWS account-wide API limitations, Elastic Disaster Recovery is limited to protecting 300 source servers per AWS account. To replicate more than 300 servers, you must create multiple staging area AWS accounts. It is still possible to recover all of your servers into a single recovery environment; Elastic Disaster Recovery can recover up to 3,000 servers into a single target AWS account.
  3. Network Requirements Reference

    • Preparation: Create a dedicated staging subnet for data replication from source servers to AWS.
      • This subnet should have a Classless Inter-Domain Routing (CIDR) range that meets the following criteria:
        • Does not overlap with the source server CIDR ranges.
        • Has enough IP addresses for one replication server per 15 source volumes, or for dedicated replication servers for highly transactional sources.
        • Supports one conversion server per source server to be launched.
    • Staging subnet access requirements: The staging area subnet requires outbound access to the Amazon EC2, Amazon S3, and Elastic Disaster Recovery service endpoints within the target Region. Customers can create private link endpoints or use public internet access to communicate with these AWS services.
    • Communication over TCP Port 443: All communication is encrypted with TLS. All control plane traffic is handled over TCP port 443 and should be permitted for the following:
      • Between the source servers and Elastic Disaster Recovery.
      • Between the staging area subnet and Elastic Disaster Recovery.
      • The Elastic Disaster Recovery AWS Region-specific console address:
        • drs.{region}.amazonaws.com (for example: drs.eu-west-1.amazonaws.com)
      • Amazon S3 service URLs (required for downloading Elastic Disaster Recovery software).
      • The AWS Replication Agent installer should have access to the S3 bucket URL of the AWS Region you are using with Elastic Disaster Recovery.
      • The staging area subnet should have access to the regional S3 endpoint.
      • The staging area subnet requires outbound access to the Amazon EC2 endpoint of its AWS Region.
    • Communication over TCP Port 1500: All data replication traffic is transmitted between the source servers and the staging area subnet using TCP Port 1500; this communication is also encrypted. (A minimal security group sketch covering ports 443 and 1500 follows this list.)
    • Bandwidth Requirements: The average network bandwidth must exceed the peak write rate of the source servers to ensure successful replication in Elastic Disaster Recovery. Adequate network capacity is critical to maintain continuous data protection and meet your recovery point objectives.
  4. S3 Buckets Reference

  5. Operational Subnets Reference

    • Drill and Recovery Subnets: Customers should create recovery subnets (and optionally drill subnets) before attempting to launch recovery instances. Instances are launched in a subnet specified in the Amazon EC2 launch template associated with each source server.
  6. Supported Elastic Disaster Recovery AWS Regions Reference

  7. Supported Operating Systems Reference

    • Elastic Disaster Recovery supports many versions of Windows and Linux operating systems, some of which are not natively supported by Amazon EC2. Please see an up-to-date version of supported operating systems here.
  8. Windows Installation Requirements Reference

    • Supported Operating Systems: Ensure that your source server operating system is supported.
    • Free Disk Space: At least 4 GB of free disk space on the root directory (C:\Windows by default).
    • Free RAM: At least 300 MB of free RAM.
    • MAC Address Stability: Ensure that the MAC addresses of the source servers do not change upon a reboot or any other common changes in your network environment. The AWS Replication Agent may use the MAC address in its process to link the source server to its replication infrastructure.
  9. Linux Installation Requirements Reference
    1. Supported Operating Systems: Ensure that your source server operating system is supported (referenced above).
    2. MAC Address Stability: Ensure that the MAC addresses of the source servers do not change upon a reboot or any other common changes in your network environment. The AWS Replication Agent may use the MAC address in its process to link the source server to its replication infrastructure.
    3. Python: Python 2 (2.4 or above) or Python 3 (3.0 or above) must be installed on the server.
    4. Free Disk Space: At least 4 GB on the root directory (/), 500 MB on the /tmp directory.
    5. GRUB Bootloader: The active bootloader software must be GRUB 1 or 2.
    6. tmp Directory: Mounted as read+write and with the exec option.
    7. Sudoers List: The Linux account that is installing Elastic Disaster Recovery needs to be in the sudoers list.
    8. dhclient Package: Ensure that the dhclient package is installed.
    9. Kernel Headers: Verify that kernel-devel/linux-headers are installed and match the running kernel version.
    10. Symbolic Link Considerations: Ensure that the content of the kernel-devel/linux-headers is not a symbolic link.
      1. Sometimes, the content of the kernel-devel/linux-headers directory that matches the version of the kernel is actually a symbolic link. In this case, you will need to remove the link before installing the required package.
        1. To verify that the folder that contains the kernel-devel/linux-headers is not a symbolic link, run the following command:
          1. On RHEL/CENTOS/Oracle: ls -l /usr/src/kernels
          2. On Debian/Ubuntu/SUSE: ls -l /usr/src
      2. If you find that the content of the kernel-devel/linux-headers that matches the version of the kernel is a symbolic link, you need to delete the link.
        1. Run the following command: rm /usr/src/<kernel-headers-directory>
          1. For example: rm /usr/src/linux-headers-4.4.1
    11. Kernel Headers Installation: For the agent to operate properly, you need to install a kernel headers package with the exact same version number as the running kernel.
      1. To install the correct kernel-devel/linux-headers, run the following command:
        1. On RHEL/CENTOS/Oracle/SUSE: sudo yum install kernel-devel-$(uname -r)
        2. On Debian/Ubuntu: sudo apt-get install linux-headers-$(uname -r)
      2. If no matching package was found on the repositories configured on your server, you can download it manually from the Internet and then install it.
        1. To download the matching kernel-devel/linux-headers package, navigate to the following sites:
          1. RHEL, CENTOS, Oracle, and SUSE package directory
          2. Debian package directory
          3. Ubuntu package directory
  10. AWS Specific Considerations
    1. Number of disks per server
      1. Elastic Disaster Recovery uses Amazon EBS and Amazon EC2 for the replication infrastructure.
        1. Because of this, Elastic Disaster Recovery is limited by the number of disks that can be attached to the replication servers.
          1. For Nitro replication instances (such as t3.small), source servers are limited to fewer than 26 volumes.
          2. For Xen replication instances (such as t2.small), the limit is 40 volumes per source server.
    2. Maximum source disk size
      1. Elastic Disaster Recovery uses Amazon EBS and Amazon EC2 for the replication infrastructure.
        1. Because of this, Elastic Disaster Recovery is limited to 16 TB for each disk on the source machines being protected.
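
The following is a minimal, hedged sketch of how the port requirements above could be expressed as a security group for the replication servers in the staging area subnet, using boto3. The Region, VPC ID, and source CIDR are placeholders; adapt them to your environment and security baseline.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # example Region

STAGING_VPC_ID = "vpc-0123456789abcdef0"  # placeholder staging VPC
SOURCE_CIDR = "10.0.1.0/24"               # placeholder source subnet range

sg = ec2.create_security_group(
    GroupName="drs-staging-replication",
    Description="Inbound replication traffic for DRS replication servers",
    VpcId=STAGING_VPC_ID,
)

# Allow replication data from the source servers over TCP 1500.
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 1500,
        "ToPort": 1500,
        "IpRanges": [{"CidrIp": SOURCE_CIDR, "Description": "DRS replication traffic"}],
    }],
)

# Outbound TCP 443 to the Elastic Disaster Recovery, EC2, and S3 endpoints is
# covered by the default allow-all egress rule; restrict it here if required.
```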

Design Guidance

Cross-AZ

We recommend deploying Elastic Disaster Recovery components across multiple AZs to enhance resilience and provide greater flexibility. Using different AZs for the source, staging, and recovery environments isolates components and reduces the blast radius if any single AZ is impaired.

For example, if the AZ hosting the staging environment experiences an issue, customers can recreate the staging infrastructure, including the replication servers and staging volumes, in a new AZ with no disruption to the source environment or the replication process. Depending on the Region selected, this could mean using a new AZ or reusing the recovery AZ. We would not recommend using the source AZ, as this is the environment we are trying to protect.

Another benefit of separating the source, staging, and recovery environments is that it avoids the anti-pattern in which recovery and staging share an AZ: during failback replication, the new source would then sit in the same AZ as the staging area. Keeping them separate means you do not have to redeploy the staging area infrastructure to a different AZ during failback.

When considering the cross-AZ design pattern, we recommend that you have a unique subnet range for the recovery subnet. If you choose to maintain the same IP address in the recovery subnet, please review the guidance provided in the Advanced Topics section found at the end of the document.

Security

Security is a critical consideration when implementing a disaster recovery solution like Elastic Disaster Recovery. Protecting your sensitive data and ensuring the integrity of your recovery environment is paramount, as a security breach or misconfiguration during a disaster event could have severe consequences.

Elastic Disaster Recovery includes several built-in security features to help mitigate risks. All data replicated by the service is encrypted in transit using TLS 1.2 or later, providing a secure channel for transmitting your critical information. Additionally, all EBS volumes created in the staging area are automatically encrypted at rest using AWS Key Management Service (AWS KMS), with the option to use customer managed keys (CMKs).

While these encryption capabilities are a strong foundation, Elastic Disaster Recovery does not provide a comprehensive security solution on its own. It’s essential to work closely with your organization’s security team to validate the overall security posture of your disaster recovery implementation. This may include reviewing network configurations, access controls, logging and monitoring, and compliance requirements to ensure the solution aligns with your broader security policies and best practices.

Encryption In transit

All data replicated by Elastic Disaster Recovery is encrypted in transit, using TLS 1.2 or later.

Encryption at Rest

All EBS volumes that Elastic Disaster Recovery creates in the staging area are automatically encrypted by default with a KMS key the service creates in your account. You can also choose an existing CMK or create one for this purpose if needed. The chosen key must be selected in the EBS encryption section of the replication settings for Elastic Disaster Recovery to use it. EBS volumes that are launched during a drill or recovery will be encrypted using the same key, unless otherwise specified in the EC2 launch template.

If you have specific compliance requirements, you can also use CMKs instead of the default keys created by Elastic Disaster Recovery to handle the encryption of the staging volumes, as well as the volumes of drill or recovery instances.
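
If you prefer to configure this through the API rather than the console, the following hedged boto3 sketch points a source server's replication configuration at a customer managed key. The Region, source server ID, and key ARN are placeholders, and the ebsEncryption/ebsEncryptionKeyArn parameter names reflect the drs API as exposed by boto3 at the time of writing; confirm them against the current SDK reference.

```python
import boto3

drs = boto3.client("drs", region_name="eu-west-1")  # example Region

SOURCE_SERVER_ID = "s-0123456789abcdef0"  # placeholder source server ID
CMK_ARN = "arn:aws:kms:eu-west-1:111122223333:key/11111111-2222-3333-4444-555555555555"  # placeholder CMK

# Encrypt the staging EBS volumes for this source server with the chosen CMK
# instead of the service-created default key.
drs.update_replication_configuration(
    sourceServerID=SOURCE_SERVER_ID,
    ebsEncryption="CUSTOM",
    ebsEncryptionKeyArn=CMK_ARN,
)
```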

Separate disaster recovery account

The best practice for Elastic Disaster Recovery is to use separate AWS accounts for the Elastic Disaster Recovery staging network (VPC and subnet) and recovery network. Using a separate AWS account specifically for your disaster recovery solution allows for better segmentation and separation of your critical replicated data.

Networking

Network Connectivity

Elastic Disaster Recovery can replicate over public or private connections, including the public internet, virtual private network (VPN), AWS Direct Connect, and AWS Transit Gateway, and this replication network path must be determined before installing the AWS Replication Agent. While public internet replication is supported, using private connectivity options is recommended for enhanced security and performance.

For cross-AZ replication within a single VPC, the setup is straightforward, and data transfer is always private. In multi-VPC scenarios, using VPC peering or Transit Gateway ensures efficient, secure, and stable data transfer. As the number of source servers grows, private connectivity options like Direct Connect can provide greater stability and performance for high-throughput requirements.

Network bandwidth

AWS Elastic Disaster Recovery will use as much of the network as possible when replicating the data from your source environment. Due to this, you will want to ensure you have enough bandwidth to support your source change rate (ensuring you can maintain continuous data protection), and you will want to monitor your network to ensure there is no congestion being caused by the replication process. If you need to throttle Elastic Disaster Recovery, you can do so at the service or machine level. In order to calculate the bandwidth required for your particular workloads, please see the Elastic Disaster Recovery documentation here.
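
As a rough illustration of the bandwidth requirement described above, the short calculation below converts an aggregate peak write rate into a minimum replication link size. The figures are examples only; substitute your own measured write rates.

```python
# Rough sizing check: the replication link must sustain more than the peak
# aggregate write rate of all protected source servers. Example figures only.
peak_write_mib_per_s = 25     # aggregate peak block-write rate across source servers
safety_margin = 1.25          # headroom for bursts and retransmits

required_mbps = peak_write_mib_per_s * 1.048576 * 8 * safety_margin
print(f"Provision at least ~{required_mbps:.0f} Mbps for replication traffic")
# For these example figures, that is roughly 262 Mbps.
```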

Installation

The recommended guidance for installing Elastic Disaster Recovery on EC2 instances is to use instance profile roles. This approach uses AWS Systems Manager to install the agent on all of the requested EC2 instances. If there are instances that are not managed by Systems Manager, you will need to install the Systems Manager agent (SSM agent) on those instances manually and then attach the appropriate instance profile before adding them to Elastic Disaster Recovery.
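
As a hedged illustration of these prerequisites, the sketch below attaches an instance profile to an instance and checks that it reports to Systems Manager. The Region, instance ID, and profile name are placeholders; in practice, the DRS console's "Install default IAM role" step creates the profile for you.

```python
import boto3

REGION = "eu-west-1"  # example Region
ec2 = boto3.client("ec2", region_name=REGION)
ssm = boto3.client("ssm", region_name=REGION)

INSTANCE_ID = "i-0123456789abcdef0"                 # placeholder instance
INSTANCE_PROFILE_NAME = "drs-ec2-instance-profile"  # placeholder profile name

# Attach the instance profile so Systems Manager can push the AWS Replication Agent.
ec2.associate_iam_instance_profile(
    IamInstanceProfile={"Name": INSTANCE_PROFILE_NAME},
    InstanceId=INSTANCE_ID,
)

# Confirm the instance is reporting to Systems Manager before adding it to DRS.
info = ssm.describe_instance_information(
    Filters=[{"Key": "InstanceIds", "Values": [INSTANCE_ID]}]
)
print("Managed by Systems Manager:", bool(info["InstanceInformationList"]))
```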

Ensure Instance Profile Permissions:

  • Verify that the instances you want to protect have an instance profile with the following policies:
    • AmazonSSMManagedInstanceCore
    • AWSElasticDisasterRecoveryEC2InstancePolicy
  • If the instance profile is not present, you can create the default instance profile by following these steps:
    • Go to the “Instance profile role installation” section.
    • Click the “Install default IAM role” button to create the default instance profile.

Assign Instance Profiles:

  • In the “Instance profiles” section, verify that all the instances you want to protect have the required instance profile assigned.
  • If any instances do not have an instance profile, you can assign the default instance profile by clicking the “Attach profiles to all instances” button.

Set Target Disaster Recovery Region:

  • In the “Target disaster recovery region” section, select the AWS Region where you want to set up the disaster recovery.
  • If the selected Region is not initialized for Elastic Disaster Recovery, click the “Initialize and configure Elastic Disaster Recovery” button to set it up.
  • **NOTE:** As this guide is based on a cross-AZ deployment pattern, the recovery Region you select should be the same Region in which your source EC2 instances are deployed.

Protect instances with Elastic Disaster Recovery:

  • In the “Add instances” section, click the “Add instances” button.
  • Elastic Disaster Recovery will list all the instances that are currently managed by Systems Manager and will attempt to install the AWS Replication Agent on them.
  • Once the AWS Replication Agent is successfully installed, the instances will be added as source servers to Elastic Disaster Recovery.

Monitor the Process:

  • In the “Add instances result” page, you can view the progress and status of the AWS Replication Agent installation on the instances.
  • For instances where the installation was successful, you can find a link to the source servers page in the “Details” column.
  • For instances where the installation failed, you can find a link to the run log on the Systems Manager console.

Verify Instance Management by AWS SSM:

  • After attaching the instance profile, allow a few minutes for Elastic Disaster Recovery to detect if the instances are managed by Systems Manager.
  • The marker near the instance ID will change to indicate if the instance is currently managed by Systems Manager.

Remember, if there are instances that are not managed by Systems Manager, you will need to install the SSM agent on those instances and then attach the appropriate instance profile before adding them to Elastic Disaster Recovery. Should you not wish to use Systems Manager for this process, please see the following link for instructions on how to manually install the AWS Replication Agent.

  • Manual Link
  • If you are using a third-party software deployment process, please consult with the team that manages it to find if it can be used to deploy Elastic Disaster Recovery.

  • When installing the AWS Replication Agent, you may run into unforeseen installation issues, based on multiple factors. Please see the troubleshooting guide in the Resources section of this guide for guidance on common issues that can be encountered during the installation process. If you are unable to resolve the issue using the information provided, please create an AWS support ticket and include the following:
    • What part of the installation process is failing
    • Confirmation that you have followed the troubleshooting guide
    • Attach the agent log from that specific server. The agent log locations are documented at https://docs.aws.amazon.com/drs/latest/userguide/agent-logs-location.html:
      • Linux: /var/log/awsdrs-agent/agent.log
      • Windows: C:\ProgramData\AWSDRS\Logs\agent.log
    • Once the installation has completed, the Elastic Disaster Recovery console will show the following stages:
      • Initiating
        • This shows that the agent has been installed successfully on the source server, and Elastic Disaster Recovery is now moving on to the next steps of configuring replication for that server. To see what step the service is currently on, please select the server name, and check under Data replication status, as shown here
      • Initial sync, xx% done, [time] left
        • This percentage is based on the amount of known blocks that will be replicated.
          • Please note that you may see the time left for replication fluctuate by large margins. This is due to how block storage is read, and we are unable to predict how many future blocks may need to be replicated.
          • You can estimate the amount of time required to complete this step by comparing the amount of storage that needs to be replicated with the bandwidth available to transmit this data.
        • During this initial sync process, you may see backlog in the same line.
          • Backlog is the amount of new data that has been written, waiting to be added after initial sync. Once initial sync has completed, you will see the backlog amount start to reduce as the agent replicates those newer blocks.
      • Initial sync 100% done, Creating Snapshot
        • All blocks have been replicated from the source machine to the staging area, and we are now creating the baseline EBS snapshot for that volume.
          • Please note, if the service is stuck in this stage for a long time, please confirm that the replication server has outbound access on port 443 to the regional EC2 endpoint.
      • Healthy
        • All data has been replicated to the staging area, and the replication server has enough bandwidth to replicate the changes being generated at the source environment.
    • Should there be an issue with the replication process after you have completed the initial sync phase, you will see an error in the same location.
    • Other states you might see are:
      • Rescan
        • This means that something has interrupted the agent’s ability to validate the block map, usually caused by an unplanned reboot of the source machine (such as a power outage, pulling the plug, or terminating an EC2 instance).
      • Lag
        • Lag is the amount of time since the server was last in continuous data protection (CDP) mode. Lag typically leads to backlog, which is the amount of data that has accumulated and still needs to be replicated. The longer the lag, the larger the backlog that needs to be cleared.
        • This can be caused by many items, and troubleshooting steps can be found here (https://docs.aws.amazon.com/drs/latest/userguide/Other-Troubleshooting-Topics.html#Replication-Lag-Issues).
        • Potential solutions:
        • Make sure that the source server is up and running.
        • Make sure that Elastic Disaster Recovery services are running on the source server.
        • Make sure that TCP Port 1500 is not blocked outbound from the source server to the replication server.
        • If the MAC address of the source had changed, that would require a reinstallation of the AWS Replication Agent.
        • If the source machine had a spike of write operations, the lag will grow until the Elastic Disaster Recovery service manages to flush all of the written data to the replication server.
      • Backlog
        • Backlog is the amount of data that was written to the disk and still needs to be replicated in order to reach CDP mode. Backlog can also occur without lag. This can happen for various reasons, such as:
          • Temporary network interruptions or bandwidth limitations that prevent the data from being replicated in real-time.
          • Spikes in data volume that exceed the processing capacity of the system, leading to a backlog.
          • Scheduled maintenance or other operational activities that temporarily pause the replication process.
        • Even if there is no lag, meaning the server or service is in the desired state, a backlog of data can still build up that needs to be processed. For example, a server may generate writes at a lower rate than the available network bandwidth, resulting in no lag, yet a backlog of data that still needs to be replicated can exist.
  • Once the installation process has been completed across all needed servers, you can move on to the next section, where you will configure monitoring and notifications.

Monitoring

Monitoring plays a critical role when defining a disaster recovery strategy. The ability to observe, monitor, and alert on resources and system performance at multiple levels is required to operationalize your plan.

Configure replication monitoring and alerting

Elastic Disaster Recovery can utilize Amazon CloudWatch to assist with monitoring of the disaster recovery solution. CloudWatch is a monitoring service that helps you monitor AWS resources as they are being consumed within your account. When these two services are integrated, you can monitor Elastic Disaster Recovery events with CloudWatch to build a customizable and detailed dashboard for Elastic Disaster Recovery. These services can be extended further with Amazon EventBridge and Amazon Simple Notification Service (Amazon SNS), to get real time alerts and automate responses.

Creating CloudWatch Dashboards to Monitor Elastic Disaster Recovery

You can visualize and share your metrics using CloudWatch dashboards. There are many metrics available within CloudWatch to help you monitor and manage the state of your disaster recovery operations. With CloudWatch, you can include metrics to monitor your source server count, time since last successful test, and lag of source servers (when the Elastic Disaster Recovery service is no longer in continuous data protection mode and should be investigated for root cause). We recommend using CloudWatch to set up dashboards and notifications to alert you to any possible replication issues. Please follow the steps below; a minimal metric alarm sketch follows these steps:

  1. Navigate to CloudWatch dashboard.
  2. Under Dashboards, select Automatic dashboards.
  3. Filter for and select Elastic Disaster Recovery.
    1. You will be taken to a default dashboard that monitors several aspects of Elastic Disaster Recovery. These metrics are based on the replication instances you have running in the AWS Region you currently have selected.
      1. LagDuration: Average
        1. This is the average “lag” time on your replication servers. Anything higher than 0 should be investigated for possible issues, but we recommend monitoring for lag larger than an hour (or your RPO, if it is close to an hour).
      2. Backlog: Average
        1. This is the average amount of “backlog”. Backlog is generated when the service is unhealthy but data is still being written to the source that it is unable to replicate.
      3. DurationSinceLastSuccessfulRecoveryLaunch: Maximum
        1. This is the maximum amount of time since the last successful launch of Elastic Disaster Recovery machines.
      4. ElapsedReplicationDuration: Maximum
        1. This is the amount of time Elastic Disaster Recovery has been replicating data.
      5. ActiveSourceServerCount: Average
        1. This is how many source servers have had Elastic Disaster Recovery installed on them and are currently replicating data.
      6. TotalSourceServerCount: Average
        1. This is how many source servers have had Elastic Disaster Recovery installed on them.
  4. Choose Add to dashboard.
    1. You can either select an existing dashboard, or choose Create new.
      1. If you decide to create a new one, you will be taken to the next screen to enter a name, and select Create.
    2. Select Add to dashboard.
  5. You will now have a dashboard monitoring Elastic Disaster Recovery under your Custom dashboards section in CloudWatch.
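
The following is a minimal sketch of a CloudWatch alarm built on the LagDuration metric described above. The Region, the "AWS/DRS" namespace, the assumption that LagDuration is reported in seconds, and the SNS topic ARN are assumptions to verify against your account; adjust the threshold to your RPO.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # example Region

# Alarm when average replication lag exceeds one hour (assuming seconds).
cloudwatch.put_metric_alarm(
    AlarmName="drs-lag-over-one-hour",
    Namespace="AWS/DRS",              # assumed DRS metric namespace
    MetricName="LagDuration",
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=3600,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    # Placeholder topic ARN -- see the SNS topic created in the next section.
    AlarmActions=["arn:aws:sns:eu-west-1:111122223333:drs-replication-monitoring"],
)
```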

Configuring your Amazon SNS topic

Amazon SNS will be used to alert a specific inbox or distribution list when any Elastic Disaster Recovery source machines are experiencing stalled replication that must be addressed. Doing so helps you identify and remediate issues more quickly, so that your RPO goals can be maintained. Stalled replication is the main indicator of replication problems and can have multiple causes. A boto3 sketch of the same topic setup follows the console steps below.

  • Navigate to Amazon SNS.
  • Choose Create Topic.
  • Under Details and Type, choose Standard.
  • Under Name, enter a name for this topic (for example, “drs-replication-monitoring”).
  • (Optional) – Enter a display name for SMS messages to mobile devices.
    • Note: As of June 1, 2021, US telecom providers no longer support person-to-person long codes for application-to-person communications. See the Amazon SNS Developer Guide for more information.
  • (Optional) – For Tags, enter a key-value pair for easy identification later.
  • Select Create topic.
  • Once the topic is created, select drs-replication-monitoring from the list.
    • Choose Create subscription.
    • Validate that the Topic ARN under Details matches the drs-replication-monitoring topic created above.
    • From the Protocol dropdown, choose email.
    • Under Endpoint, enter the email or distribution list to receive these alerts.
  • Choose Create subscription.
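
The following hedged sketch performs the same topic creation and email subscription through boto3. The Region and email address are examples; the recipient must confirm the subscription before notifications are delivered.

```python
import boto3

sns = boto3.client("sns", region_name="eu-west-1")  # example Region

# Create the monitoring topic and subscribe a mailbox or distribution list.
topic = sns.create_topic(Name="drs-replication-monitoring")
sns.subscribe(
    TopicArn=topic["TopicArn"],
    Protocol="email",
    Endpoint="dr-team@example.com",  # example address; confirmation required
)
print("Topic ARN:", topic["TopicArn"])
```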

Create a rule using the console

The next step is to configure EventBridge to monitor for specific Elastic Disaster Recovery events related to replication health. Should EventBridge receive an event for unhealthy replication status for Elastic Disaster Recovery, it will notify the Amazon SNS topic. This, in turn, notifies the subscribers of that topic.

  • Open EventBridge.
  • Choose Create rule.
    • Under Name and description, enter the name for this rule (for example, “drs-replication-monitoring”).
    • Under Define pattern, choose Event pattern.
      • Select Pre-defined pattern by service.
        • From the dropdown menu for Service provider, choose AWS.
        • Under the Service name dropdown, choose Elastic Disaster Recovery Service.
        • Under Event type, choose DRS Source Server Data Replication Stalled Change.
      • Under Select targets and Target, choose SNS topic.
        • For Topic, choose the SNS topic created earlier, drs-replication-monitoring.
      • Choose Create.

You have now created a dashboard that monitors your Elastic Disaster Recovery replication infrastructure and a rule that notifies you of any stalled replication that would cause you to miss your RPO. The sketch below shows an equivalent rule and target created through the API.
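
This is a minimal sketch of the same rule and target using boto3. The detail-type string mirrors the event type selected in the console steps above, the Region and topic ARN are placeholders, and the SNS topic's access policy must allow events.amazonaws.com to publish to it.

```python
import json
import boto3

events = boto3.client("events", region_name="eu-west-1")  # example Region

TOPIC_ARN = "arn:aws:sns:eu-west-1:111122223333:drs-replication-monitoring"  # placeholder

# Match the stalled-replication event emitted by Elastic Disaster Recovery.
events.put_rule(
    Name="drs-replication-monitoring",
    EventPattern=json.dumps({
        "source": ["aws.drs"],
        "detail-type": ["DRS Source Server Data Replication Stalled Change"],
    }),
)

# Send matching events to the SNS topic created earlier.
events.put_targets(
    Rule="drs-replication-monitoring",
    Targets=[{"Id": "drs-sns-target", "Arn": TOPIC_ARN}],
)
```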

Cost Monitoring

Configure cost monitoring

There are several configuration strategies possible with Elastic Disaster Recovery. Understanding what makes up the associated costs of using Elastic Disaster Recovery is an important consideration when deciding how to further optimize the system for performance versus cost, while maintaining your resilience objectives. This may include decisions on retention periods, region selection, network design, and infrastructure configurations.

The following section provides the steps to activate cost allocation tags, create and save a custom report, and export the report data. This will provide insight into the overall costs of the Amazon EC2, Amazon EBS, and EBS snapshot resources provisioned by Elastic Disaster Recovery.

Activate cost allocation tags

This section walks through the process of enabling user-defined cost allocation tags for Elastic Disaster Recovery. Once enabled, you can use these tags on your cost allocation report to track costs.

  1. Log in to the AWS Management Console, and search for Billing.
  2. Select cost allocation tags on the left.
  3. Under User-defined cost allocation tags, find the AWSElasticDisasterRecoveryManaged tag.
  4. Select the checkbox for this tag, and choose Activate in the top right.
  5. Choose Activate from the pop up. It may take a couple of hours before the tags are available.
  6. Where to optimize:
    1. Use Default instance types for replication servers unless source servers are often in lag.
    2. Use automated disk type.
    3. Lower snapshot retention to minimal needs.
      1. Note that Elastic Disaster Recovery should not be used for long-term retention; use a backup solution for long-term storage.

Create cost categories

This section walks through the process of creating cost categories. This allows you to map Elastic Disaster Recovery costs and usage into meaningful categories using a rules-based engine.

  1. Log in to the AWS Management Console, and search for Billing.
  2. Select Cost Categories on the left, and select Create Cost Category.
  3. Provide a Name (for example, DRSCost) to the cost category, and select Next.
  4. Choose Rule type as Inherited value, and Dimension as Cost Allocation Tag, Tag key as AWSElasticDisasterRecoveryManaged, and select Add rule.
  5. Choose Rule type as Regular and DRSCost Value as License.
  6. Under the Dimension 1 section, choose Service. For Operator select Is and for Service Code choose AWSElasticDisasterRecovery. Then, select Next.
  7. Select Create cost category. It will take up to 24 hours for the cost category to be available in AWS Cost Explorer.

Create Cost Explorer report

This section walks through the steps to create a customized Cost Explorer report for Elastic Disaster Recovery. It uses the filters and cost categories created in the preceding section. An equivalent query using the Cost Explorer API is sketched after these steps.

  1. Log in to the AWS Management Console and search for AWS Cost Explorer. Open the AWS Cost Management dashboard, and select Cost Explorer.
  2. Under FILTERS, select Cost Category.
    1. Select the cost category (for example, DRSCost) which was created in the Create Cost Categories section.
    2. Select two checkboxes: License and drs.amazonaws.com. Select Apply Filters.
  3. On the top left, select the Group by: Usage Type.
  4. Go to the time ranges and select the time for which you would like to see the data. In the following example, we set it to Last 7 Days with the time granularity as Daily.
  5. Select Bar style type for the chart.
  6. You will see the cost breakdown of the staging area. This includes the cost of replication servers, Amazon EBS volumes, Amazon EBS snapshots, other services, and AWS resources.
  7. Select Save as, which is located near the top left corner and assign the new report a name. For example, Elastic Disaster Recovery Service Costs_<date>. Select Save Report.
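
The sketch below retrieves a similar breakdown programmatically, filtered by the DRSCost cost category created earlier. The dates are examples, and the cost category values mirror the two checkboxes selected in the console steps.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-06-01", "End": "2024-06-08"},  # example 7-day window
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    Filter={"CostCategories": {"Key": "DRSCost",
                               "Values": ["License", "drs.amazonaws.com"]}},
)

# Print the daily cost per usage type (replication servers, EBS volumes, snapshots, ...).
for day in response["ResultsByTime"]:
    for group in day["Groups"]:
        print(day["TimePeriod"]["Start"], group["Keys"][0],
              group["Metrics"]["UnblendedCost"]["Amount"])
```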

View and export saved Cost Explorer report

This section walks through the steps to view the Cost Explorer report and export it to a CSV file that can be shared with stakeholders.

  1. Log in to the AWS Management Console.
  2. Search for AWS Cost Explorer, and open the AWS Cost Management dashboard.
  3. Select Reports.
    1. Select the report that was previously saved.
    2. The total cost of Elastic Disaster Recovery is included in that report.
    3. The report can be further customized by using the ‘Group by’ options near the top or any of the other filters available in AWS Cost Explorer.
  4. You can export your data for further analysis by choosing the Download CSV button. Download the CSV file to a location on your computer.

How can you optimize costs?

  1. Utilize the default replication server instance types at first. Allow Elastic Disaster Recovery to replicate your initial dataset, then ensure no source servers are reporting that they are in “Lag”. If any source servers are in lag, please follow the troubleshooting section at the end of this document; it may conclude that you need to increase the size or performance of the replication server, or provide a dedicated replication server.
  2. Use the “Auto volume type selection” option for your replication servers.
    1. When choosing “Auto volume type selection”, the service will dynamically switch between performance- and cost-optimized volume types according to the replicated disk's write throughput.
  3. Lower snapshot retention to minimal length requirements. Based on the change rate of your dataset, this can have a large impact on overall Elastic Disaster Recovery costs. (A hedged API sketch for adjusting retention follows this list.)
    1. Note that if you have compliance requirements and require your snapshots for long-term retention, you should use a long-term storage and backup solution like AWS Backup. Elastic Disaster Recovery is not intended to act as a backup or archive storage service, and hence is not a suitable solution for long-term retention of your snapshots and other data.
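
The following is a hedged sketch of trimming the daily retention rule through the API, assuming the replication configuration's pitPolicy field accepts the minute/hour/day rules described in the Point in time snapshots section; verify the field names against the current drs API reference before use. The Region and source server ID are placeholders.

```python
import boto3

drs = boto3.client("drs", region_name="eu-west-1")  # example Region

SOURCE_SERVER_ID = "s-0123456789abcdef0"  # placeholder source server ID

# Keep the 10-minute and hourly rules, but reduce daily retention to 3 days.
drs.update_replication_configuration(
    sourceServerID=SOURCE_SERVER_ID,
    pitPolicy=[
        {"ruleID": 1, "units": "MINUTE", "interval": 10, "retentionDuration": 60, "enabled": True},
        {"ruleID": 2, "units": "HOUR", "interval": 1, "retentionDuration": 24, "enabled": True},
        {"ruleID": 3, "units": "DAY", "interval": 1, "retentionDuration": 3, "enabled": True},
    ],
)
```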

Cost Optimization

There are multiple configuration strategies possible with Elastic Disaster Recovery. Understanding what makes up the associated costs of using Elastic Disaster Recovery is a key step in targeting efforts to reduce cost without sacrificing resilience. This includes choosing the most relevant resilience strategy and retention periods, the right level of redundancy, the Region, and right-sized infrastructure.

To reduce operational costs when using Elastic Disaster Recovery, apply a combination of the following:

  • Evaluate the retention period required for point-in-time snapshots. How far into the past do you need to retain the ability to do a full server restore, as opposed to restoring from a backup? Make sure to consider applicable compliance and regulatory requirements.
  • For those servers being covered by Elastic Disaster Recovery, consider whether there are redundant drives mounted that are no longer in use and do not need to be replicated. These can either be unmounted or excluded when installing the replication agent.
  • Right-size the target failover infrastructure by selecting the appropriate EC2 instance type in the EC2 launch template. You can use the instance right-sizing feature to map to an instance type that closely follows the source infrastructure; however, you should use operational data from the source environment to right-size these resources.

The size of the underlying disks (that is, the entire disk and not just its partitions) directly dictates the amount of data that is replicated into AWS during the initial sync process. As a result, right-sizing and being selective about workloads, in line with RPO/RTO objectives, provides both cost and time savings.


Drill Planning

Drills versus Planned Disaster Recovery Events: In some situations, the disaster recovery test will be a planned failover of the production environment, with applications then running in production from the recovery location. It is advisable to do a full production test annually, to capture any blockers and to become familiar with the process before an actual disaster.

Testing your disaster recovery implementation is the only way to validate that your RPO and RTO objectives can be met when a real disaster occurs. Elastic Disaster Recovery natively supports the ability to launch drills without affecting your production environment. However, conducting a drill and launching a server as an EC2 instance is not adequate to declare success. It’s important to test at an application or business process level, to ensure that the end-to-end service can be delivered when the disaster recovery plan is activated. It is a best practice to perform drills regularly. There are a few things to note before launching an Elastic Disaster Recovery drill:

  • When launching a drill or recovery, you can launch up to 500 source servers in a single operation. Additional source servers can be launched in subsequent operations.
  • After launching drill instances, use SSH (Linux) or RDP (Windows) to connect to your instances and confirm that everything is working correctly.
  • Take into consideration that once a drill instance is launched, actual resources are created in your AWS account and you will be billed for them. You can terminate launched drill instances once you verify that they are working properly, without any impact to data replication.
  • We recommend that customers test as often as possible. Customers should test once a year at a minimum, even if it means reducing scope and testing a portion of the application or business function portfolio. This ensures the team is comfortable with the disaster recovery plan, while also allowing them to identify any issues or required changes.

When preparing for a disaster recovery test, it is critical to ensure that your drill environment is configured properly. A drill is conducted while the production environment remains intact. In order to minimize impact to the production environment, we recommend the following:

  • Network Considerations
    • Subnet configuration
      • CIDR range
        • You will want to ensure that your drill subnet is configured with the same CIDR range size as your failover subnet. This ensures that the subnets are sized properly and that any IP adjustments to the drill or failover machines remain the same.
        • When conducting a drill, we recommend that you configure some mechanism to provide network isolation to ensure your drill subnet is isolated from your production environment. This will ensure there are no IP address conflicts or routing collisions during testing. This can be accomplished by utilizing a separate VPC or configuring security groups and access control lists.
    • Routing
      • If your drill requires access to services or dependencies outside the drill subnet, you should ensure the appropriate routing policies and rules are configured in the drill subnet to support this connectivity.
      • Updating Launch Template to the drill subnet
        • By default, you will want to have the Launch Templates configured for your failover subnet. During a drill, you will need to change that section of the Launch Template to the drill subnet; refer to the Elastic Disaster Recovery documentation for the steps to complete this (a scripted example also follows this list). Additionally, launch settings can be changed for a single server or for multiple servers through the Elastic Disaster Recovery console, which allows you to quickly make changes to multiple servers at once. Refer to the Elastic Disaster Recovery documentation for more details on making bulk changes to your Launch Templates.
  • Infrastructure Services (Active Directory, DNS, and more)
    • Depending on the criteria for a successful drill, you may need your drill servers to connect to services such as Active Directory (AD) or other infrastructure services in order to complete a drill. This might require additional scripting (or the use of appropriate Systems Manager documents to automate AD-dependent tasks after launch).
      • With Elastic Disaster Recovery, you can replicate all applications and services, including AD. With this approach, it is recommended to launch the drill version of AD first, and wait until the service is up and running. Once the service is up, you can start to launch the other applications or servers. This will ensure that the AD servers are ready to provide critical functions and services, like authentication and authorization.
      • An alternative approach is to extend AD to the drill subnet. It is advised to work with your system administrators to define the best method for your use case.
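The launch template update described in the list above can also be scripted. The sketch below assumes you have already looked up the source server's EC2 launch template ID (for example, from the launch settings in the console or the GetLaunchConfiguration API); the template ID, drill subnet ID, and security group IDs are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Placeholders: the launch template associated with the source server, the
# isolated drill subnet, and the security groups appropriate for the drill.
LAUNCH_TEMPLATE_ID = "lt-0abc1234def567890"
DRILL_SUBNET_ID = "subnet-0aaa1111bbb2222cc"
DRILL_SECURITY_GROUPS = ["sg-0123456789abcdef0"]

# Create a new template version that points the primary network interface at the drill subnet.
new_version = ec2.create_launch_template_version(
    LaunchTemplateId=LAUNCH_TEMPLATE_ID,
    SourceVersion="$Latest",
    LaunchTemplateData={
        "NetworkInterfaces": [
            {
                "DeviceIndex": 0,
                "SubnetId": DRILL_SUBNET_ID,
                "Groups": DRILL_SECURITY_GROUPS,
            }
        ]
    },
)

# Make the new version the default so the next drill launch picks it up.
version_number = new_version["LaunchTemplateVersion"]["VersionNumber"]
ec2.modify_launch_template(
    LaunchTemplateId=LAUNCH_TEMPLATE_ID,
    DefaultVersion=str(version_number),
)
print(f"Launch template {LAUNCH_TEMPLATE_ID} now defaults to version {version_number}")
```

Remember to repeat the same step after the drill to point the template back at your failover subnet.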

Prior to launching a drill instance, ensure that your source servers are ready for testing by looking for the following indicators on the Source servers page:

  1. Under the Ready for Recovery column, the server should show Ready. This means that initial sync has been completed and all data from the source server has been replicated to AWS.
  2. Under the Data Replication Status column, the server should show the Healthy status. You can still launch the source server if it is undergoing Lag or even Stall, but in that case the data may not be up to date; you can instead launch a drill instance from a previous point in time.
  3. Under the Pending Actions column, the server should show Initiate recovery drill if no drill instances have ever been launched for the server. Otherwise, the column will be blank. This helps you identify whether the server has had a recent drill launch. (A scripted readiness check follows this list.)
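If you prefer to script this readiness check rather than inspect the console, a minimal boto3 sketch is shown below. The field name (dataReplicationInfo.dataReplicationState) and the “CONTINUOUS” state value reflect our understanding of the DescribeSourceServers response; treat them as assumptions to verify against the SDK.

```python
import boto3

drs = boto3.client("drs")

# List source servers and report their replication state.
response = drs.describe_source_servers(filters={})
for server in response["items"]:
    server_id = server["sourceServerID"]
    # Field name and state values are assumptions based on the DRS API model;
    # "CONTINUOUS" is typically what the console surfaces as "Healthy".
    state = server.get("dataReplicationInfo", {}).get("dataReplicationState", "UNKNOWN")
    ready = "ready for drill" if state == "CONTINUOUS" else f"not ready ({state})"
    print(f"{server_id}: {ready}")

# Note: results may be paginated via "nextToken"; loop if you protect many servers.
```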

Launching drill instances

To launch a drill instance for a single source server or multiple source servers, go to the Source servers page and check the box to the left of each server for which you want to launch a drill instance.

Open the Initiate recovery job menu, and select Initiate drill.

Select the PiT snapshot from which to launch the drill instance for the selected source server. You can either select the Use most recent data option to use the latest snapshot available or select an earlier specific PiT snapshot. You may opt to select an earlier snapshot in case you wish to return to a specific server configuration before a disaster occurred. After you have selected the PiT snapshot, select Initiate drill.

The Elastic Disaster Recovery Console will indicate Recovery job is creating drill instance for X source servers when the drill has started.

Choose View job details on the dialog to view the specific Job for the test launch in the Recovery job history tab.
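The same drill launch can also be initiated through the API. The sketch below calls StartRecovery with isDrill set to True for two hypothetical source server IDs and uses the most recent data; to launch from an earlier PiT snapshot, you would add a recoverySnapshotID per server (field name per our reading of the API).

```python
import boto3

drs = boto3.client("drs")

# Hypothetical source server IDs to include in the drill.
SOURCE_SERVER_IDS = ["s-1111aaaa2222bbbb3", "s-4444cccc5555dddd6"]

response = drs.start_recovery(
    isDrill=True,  # launch drill instances without affecting ongoing replication
    sourceServers=[{"sourceServerID": sid} for sid in SOURCE_SERVER_IDS],
    # To launch from an earlier snapshot, add "recoverySnapshotID" to each entry (assumed field name).
)

job = response["job"]
print("Started drill job:", job["jobID"])
```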

Successful drill instance launch indicators

You can tell that the drill instance launch started successfully through several indicators on the Source servers page.

  1. The Last recovery result column will show the status of the recovery launch and the time of the launch. A successful drill instance launch will show the Successful status. A launch that is still in progress will show the Pending status.
  2. The launched drill instance will also appear on the Recovery instances page.

Recovery planning

In order to be able to launch your recovery instances quickly, you should preconfigure how those instances are to be launched and perform drills in order to make sure that all of your network and application settings are properly configured. You can configure how your instances will be launched by editing the Launch settings for each source server. Launch settings can be configured immediately when a source server has been added to Elastic Disaster Recovery; there is no need to wait for the initial sync process to finalize. Performing frequent drills is key for failover preparedness. Elastic Disaster Recovery makes it easy for you to launch drill instances as frequently as you want. Drills are non-disruptive and do not impact the source server or ongoing data replication. If you experience a disaster in the middle of a drill, you can launch a new recovery instance from the source server’s current state or keep the instance you launched during the drill.

Preparing for recovery

  1. Configure your launch templates for each server you want to protect.
  2. Under the Ready for recovery column, the server should show Ready. This means that the initial sync has been completed and all data from the source server has been replicated to AWS.
  3. Under the Data replication status column, the server should show the Healthy status. You can still launch the source server if it is undergoing Lag or even Stall, but in that case the data may not be up to date; you can instead launch a drill instance from a previous point in time.
  4. Under the Pending actions column, the server should show Initiate recovery drill if no drill instances have ever been launched for the server. Otherwise, the column will be blank. This helps you identify whether the server has had a recent drill or recovery launch.

Performing recovery

Prior to launching a recovery instance, ensure that your source servers are ready for a drill or recovery by looking for the following indicators on the Source Servers page:

  1. Under the Ready for recovery column, the server should show Ready.
  2. Under the Data replication status column, the server should show Healthy status.
  3. Under the Last recovery result column, there should be an indication of a successful drill or recovery instance launch sometime in the past. The column should state Successful and show when the last successful launch occurred. This column may be empty if a significant amount of time passed since your last drill instance launch.

To launch a recovery instance for a single source server or multiple source servers, go to the Source servers page, and check the box to the left of each server for which you want to launch a recovery instance.

  1. Open the Initiate recovery job menu, and select Initiate recovery.
  2. Select the PiT snapshot from which to launch the recovery instance for the selected source server. You can either select the Use most recent data option to use the latest snapshot available or select an earlier specific PiT snapshot. You may opt to select an earlier snapshot in case you wish to return to a specific server configuration before a disaster occurred. After you have selected the PiT snapshot, choose Initiate recovery. Learn more about PiT snapshots.
  3. The Elastic Disaster Recovery Console will indicate Recovery job is creating recovery instance for X source servers when the recovery has started.

Select View job details on the dialog to view the specific Job for the recovery launch in the Recovery job history tab.
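To monitor a recovery (or drill) job outside the console, the job can also be polled through the API. A minimal sketch follows; the describe_jobs filter shape and the job status values are our assumptions based on the DRS API and should be confirmed against the SDK.

```python
import time
import boto3

drs = boto3.client("drs")

JOB_ID = "drsjob-0123456789abcdef0"  # hypothetical job ID returned by StartRecovery

while True:
    # Filter shape and status values ("PENDING", "STARTED", "COMPLETED") are assumptions.
    jobs = drs.describe_jobs(filters={"jobIDs": [JOB_ID]})["items"]
    if not jobs:
        raise RuntimeError(f"Job {JOB_ID} not found")
    status = jobs[0]["status"]
    print("Job status:", status)
    if status == "COMPLETED":
        break
    time.sleep(30)

# Optionally inspect per-server events for troubleshooting.
for item in drs.describe_job_log_items(jobID=JOB_ID)["items"]:
    print(item.get("event"), item.get("logDateTime"))
```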

Please note, Elastic Disaster Recovery is only one part of your disaster recovery plan. There are likely to be many other dependencies and services that will play a role in recovering from a disaster, and this should be factored in when conducting a drill or actual failover.

Group launch templates

Amazon EC2 launch templates control how instances are launched in AWS and each source server has its own launch template. You can edit the launch templates for multiple source servers at once by selecting the relevant servers on the Source servers page, then choosing “Edit EC2 launch template” from the Actions dropdown.

To edit the launch template manually, the automated launch settings (such as Instance type right-sizing) inside the Elastic Disaster Recovery launch settings must first be set to Inactive; otherwise, you will receive an error.

DRS Template Manager is an open-source solution, available on GitHub, that can automate management of launch templates using a single JSON file as a baseline template. This file can be replicated, edited, and used for each source server tagged with a corresponding key in the Elastic Disaster Recovery console.

Protecting Your Recovered Instance

In cross-AZ scenarios, where the source and recovery Regions are the same, the process of failback is referred to as “Protect recovery instance”. Customers using Elastic Disaster Recovery for cross-AZ disaster recovery can protect their recovered instances by reversing the replication direction back to the original AZ. This ensures continuous protection during a recovery event.

Prerequisites:

  1. EC2 instances that have failed over must resolve through DNS to the regional Elastic Disaster Recovery endpoint. The resolved endpoint must be accessible from the EC2 Instance via TCP 443.
  2. Ensure that AWSElasticDisasterRecoveryRecoveryInstancePolicy is attached to the instance profile role of the failed over EC2 instance.

Notes:

  1. Protecting your recovered instance also stops the replication of the original EC2 instance.
  2. Starting the replication for a recovered instance only initiates a rescan of the differences between the latest PiT snapshot and the current source server data, instead of a full synchronization. This approach saves time and resources while retaining all PiT snapshots, configurations, and job logs.

Once a recovery instance has been successfully launched in the recovery Availability Zone and failed over, this recovery instance should be protected by Elastic Disaster Recovery. Follow these steps to protect the recovered instance:

  1. Replication Settings: Replication settings can remain unchanged, meaning you can continue using the same staging area subnet.
  2. Launch Settings:
    1. Create a new version of the Amazon EC2 launch template for the source server to launch a recovery instance in the source AZ subnet, instead of the current recovery AZ subnet.
    2. Update the security groups as necessary and save changes.
  3. Protect Your Recovered Instance:
    1. Select the source server, and from the Replication drop-down, choose Protect recovered instance (a scripted sketch follows these steps).
    2. This action will change the status of the source server to “re-scanning,” which means it will only replicate the changes (deltas) made since the last snapshot, rather than performing a complete replication.
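If you want to script this reversal, the DRS API exposes a ReverseReplication operation for recovery instances, which we believe corresponds to the console's Protect recovered instance action; treat that mapping, and the field names below, as assumptions to verify against the SDK. The recovery instance ID is a placeholder.

```python
import boto3

drs = boto3.client("drs")

# Placeholder recovery instance ID (the instance launched during failover).
RECOVERY_INSTANCE_ID = "i-0123456789abcdef0"

# Reverse the replication direction so the recovered instance becomes the new
# protected source, replicating back toward the original AZ.
# (We assume ReverseReplication backs the console's "Protect recovered instance" action.)
response = drs.reverse_replication(recoveryInstanceID=RECOVERY_INSTANCE_ID)
print("Reversed-direction source server:", response.get("reversedDirectionSourceServerArn"))
```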

Conclusion

This guide walks through design patterns for provisioning key Elastic Disaster Recovery resources, such as staging area subnets, recovery subnets, and EBS volumes. It also provides the necessary technical background to understand the core principles and implementation process for the various stages of Elastic Disaster Recovery. This ensures the provisioned resources remain robust, even when facing unforeseen obstacles.

The guide also covers observability best practices, highlighting the importance of leveraging the right tools and services. This includes monitoring replication health through metrics and alerts, as well as optimizing costs by analyzing usage data. These observability practices enable proactive actions based on the insights gained. Furthermore, the guide emphasizes the critical role of testing your disaster recovery strategy through regular drills. It discusses key considerations when performing an actual failover in response to a real disaster, as well as the process of restoring normal operations using the failback functionality.

Overall, this guide aims to serve as a comprehensive foundation for deploying and managing Elastic Disaster Recovery across the entire disaster recovery lifecycle, empowering businesses to be prepared for, and resilient to, a wide range of disruptions.

Advanced Topics

Recovery plans (Step Functions, Elastic Disaster Recovery API, and Lambda)

When performing a disaster recovery at scale, there are often servers that have dependencies on other servers in the environment. For example, application servers that connect to a database on boot or servers that require authentication and need to connect to a domain controller on boot to start services. With AWS Lambda, AWS Step Functions, and the Elastic Disaster Recovery API, you can sequence your disaster recovery launch.

You can sequence your disaster recovery launch based on a single API call that executes the state machine. In this architecture, Lambda functions call the Elastic Disaster Recovery API to launch the recovery instances, and Step Functions uses the tags on the protected source servers to drive the launch sequence.

Step Functions is a serverless orchestration service that lets you combine AWS Lambda functions and other AWS services to build business-critical applications. Through Step Functions’ graphical console, you see your application’s workflow as a series of event-driven steps.
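A minimal sketch of the Lambda piece of such a state machine is shown below: the function receives a wave of source server IDs from the Step Functions input and starts a recovery job for them. The event shape (a "sourceServerIDs" list and an "isDrill" flag) is an assumption of this sketch, not a prescribed contract.

```python
import boto3

drs = boto3.client("drs")

def handler(event, context):
    """Launch one recovery wave. Invoked by a Step Functions Task state.

    Expected (assumed) input shape:
        {"sourceServerIDs": ["s-...", "s-..."], "isDrill": false}
    """
    source_server_ids = event["sourceServerIDs"]
    is_drill = bool(event.get("isDrill", False))

    response = drs.start_recovery(
        isDrill=is_drill,
        sourceServers=[{"sourceServerID": sid} for sid in source_server_ids],
    )

    # Return the job ID so a later state (for example, a polling loop with a
    # Wait state) can check completion before launching the next wave.
    return {"jobID": response["job"]["jobID"]}
```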

Network replication

The network replication feature in Elastic Disaster Recovery automatically tracks and replicates changes to your network configurations, such as security groups, network ACLs, and routing tables, between your source and recovery environments. This helps prevent configuration mismatches during recovery, ensuring your recovery instances are launched with the correct network settings.

For example, if you update a security group to allow additional access, the network replication feature will automatically apply that change to the corresponding security group in your recovery environment. This maintains consistency between your source and recovery environments, enhancing security and reducing the risk of issues during failover.

Beyond security groups, the feature also replicates changes to other network resources, like network ACLs and routing tables. By automating these updates, Elastic Disaster Recovery helps you maintain compliance and avoid the need to manually configure individual launch templates for your recovery instances. Steps to implement the replication of your source network can be found in the Elastic Disaster Recovery documentation.

Post Launch Validation Automation

Post-launch Actions in Elastic Disaster Recovery allow you to automate actions after a drill or recovery instance is launched. These settings are based on the Default post-launch actions. Available post-launch actions include:

  • Process status validation: Ensures critical processes (for example, database and application services) are in a running state after the instance is launched. You can specify a list of processes to verify and how long the system should wait before testing.
  • EC2 connectivity checks: Conducts network connectivity checks to a predefined list of ports and hosts to ensure the instance can communicate as expected.
  • Volume integrity validation: Ensures the launched EBS volumes are the same size as the source (rounded up), properly mounted on the EC2 instance, and accessible.

You can also run any available Systems Manager document, including public, custom, or shared documents. To create, edit or delete custom actions, make sure post-launch actions are activated for the source server. Custom actions are automatically added to new source servers.
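To make the process status validation concrete, the sketch below shows the kind of check a custom Systems Manager document could run on a launched Linux instance. The process names are hypothetical and this is illustrative only; it is not the service's built-in implementation.

```python
import subprocess
import sys

# Hypothetical list of processes that must be running after launch.
REQUIRED_PROCESSES = ["postgres", "nginx"]

failures = []
for name in REQUIRED_PROCESSES:
    # pgrep -x matches the exact process name (Linux-only check).
    result = subprocess.run(["pgrep", "-x", name], capture_output=True)
    if result.returncode != 0:
        failures.append(name)

if failures:
    print("Missing processes:", ", ".join(failures))
    sys.exit(1)  # a non-zero exit marks the post-launch check as failed

print("All required processes are running")
```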

Maintaining same IP in the cross-AZ use case

It is recommended to use a unique CIDR range when recovering your EC2 instances with Elastic Disaster Recovery. In certain situations, however, it might be required or preferred to maintain the same IP address on the recovery instance. This capability can be supported with Elastic Disaster Recovery, but requires a specific network configuration. AWS does not allow subnets with the same CIDR range to exist in two different AZs within the same VPC. To support this design, you need to configure a separate VPC to host the recovery subnet, which allows the same CIDR range to exist across different AZs.

Furthermore, you’ll need to plan how to route traffic to the recovered instance, which will be using the same IP address as the original source server. This could require updating DNS, configuring network ACLs and security groups, or making other network changes to direct traffic properly. Careful IP address planning is crucial to minimize downtime and ensure a successful failover to the disaster recovery environment. The specific steps required will depend on your network architecture, application dependencies, and other factors.
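A minimal sketch of provisioning the separate recovery VPC and a subnet that reuses the source subnet's CIDR range is shown below; all CIDR ranges, the AZ name, and the tag values are hypothetical.

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical values: the recovery subnet reuses the source subnet's CIDR range,
# which is only possible because it lives in a separate VPC.
RECOVERY_VPC_CIDR = "10.20.0.0/16"
SOURCE_SUBNET_CIDR = "10.20.1.0/24"   # same range as the production subnet
RECOVERY_AZ = "us-east-1b"

vpc = ec2.create_vpc(CidrBlock=RECOVERY_VPC_CIDR)
vpc_id = vpc["Vpc"]["VpcId"]
ec2.create_tags(Resources=[vpc_id], Tags=[{"Key": "Name", "Value": "drs-recovery-vpc"}])

subnet = ec2.create_subnet(
    VpcId=vpc_id,
    CidrBlock=SOURCE_SUBNET_CIDR,
    AvailabilityZone=RECOVERY_AZ,
)
print("Recovery subnet:", subnet["Subnet"]["SubnetId"])
```

Keep in mind that VPCs with overlapping CIDR ranges cannot be peered directly, so traffic to the recovered instances must be routed through DNS updates or other mechanisms described above.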

Authors

  • Daniel Covey
  • Sravan Rachiraju

Contributors

  • Stuart Lupton
  • Priyam Reddy
  • Dusty Poole
  • Nick Kniveton
  • Sonia Mahankali
  • Jason Perry

Notices

Customers are responsible for making their own independent assessment of the information in this document. This document: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers.