Recovering AWS purpose-built databases: a guide to business continuity and disaster recovery strategies

Source: AWS Databases

Original text

Modern applications often use multiple AWS purpose-built databases to handle different workloads efficiently. When disruptions occur, maintaining data consistency and alignment across these databases becomes complex.

This post addresses recovery challenges in multi-database architectures, focusing on both low-consistency and mission-critical scenarios. We explore practical strategies for implementing resilient recovery mechanisms across Amazon DynamoDB, Amazon Aurora, Amazon Neptune, Amazon OpenSearch Service, and other AWS database services.

The purpose-built database approach

At the core of AWS database offerings lies a comprehensive suite of purpose-built database services, each designed to meet the unique requirements of modern applications. These purpose-built databases provide a high-performance, secure, and reliable foundation to power a wide range of workloads, from generative AI solutions to data-driven applications that drive value for businesses and their customers.

While a one-size-fits-all database approach might suffice for straightforward scenarios, modern microservice architectures benefit significantly from using purpose-built databases. You can use this approach to select the optimal database and data model for each service based on its specific requirements – whether that's transaction consistency, query patterns, scalability needs, or security constraints. Each database can be independently optimized and scaled according to its workload characteristics.

However, this architectural flexibility introduces complexity when disruptions occur. The challenge lies in coordinating recovery across multiple databases while maintaining data consistency and alignment. This is where AWS database services’ built-in high availability and data protection features become critical.

Features like automated backups, point-in-time recovery, cross-Region replication, and automated failover provide the foundation for building resilient multi-database architectures. Understanding how to use and coordinate these capabilities across your purpose-built databases is essential for effective business continuity and disaster recovery planning.

Different levels of requirements

Not all purpose-built database architectures are created equal. The requirements for data consistency and recovery strategies can vary significantly, depending on the nature and criticality of the application. In this post, we explore two distinct scenarios that illustrate the different levels of requirements that organizations might face when aligning their purpose-built databases:

  • Use case 1: Applications with no consistency requirements – In this scenario, the applications have no strict requirements in terms of data consistency among the multiple backend databases. These could be low-traffic, non-critical applications where the databases operate in relative isolation, and their synchronization isn’t a significant concern. For such use cases, the focus might be more on maintaining individual database reliability rather than maintaining strict data alignment across the entire ecosystem. The recovery strategies in this case would prioritize simplicity and operational efficiency over complex data reconciliation processes.
  • Use case 2: Applications with high consistency requirements – This category includes mission-critical applications that demand strict data consistency and integrity across the purpose-built databases. These could be applications that handle sensitive financial data, healthcare records, or other mission-critical workloads where data alignment is paramount. For such use cases, the recovery strategies must prioritize data consistency and provide robust mechanisms for seamless data reconciliation, even in the face of complex disruptions or failures.

In this post, we provide you with the necessary insights and best practices to recover and align your purpose-built databases and navigate through potential recovery scenarios, regardless of your specific application requirements.

Use case 1: Applications with no data consistency requirements among the multiple backends

In this scenario, we consider applications where strict data consistency across multiple purpose-built databases isn’t a critical requirement. These are typically low-traffic, non-mission-critical applications where the databases operate relatively independently, and their synchronization isn’t a significant concern. For such use cases, the focus is more on maintaining individual database reliability and operational efficiency rather than maintaining strict data alignment across the entire ecosystem.

Let’s explore a use case that illustrates this scenario: a multi-channel customer feedback system. It incorporates different database services such as Amazon Aurora, Amazon DynamoDB, Amazon Redshift, and Amazon Simple Storage Service (Amazon S3) as object storage. Imagine a large retail company that wants to capture and analyze customer feedback from various channels: in-store surveys, online reviews, social media comments, and customer service interactions. The company aims to gain insights into customer satisfaction and product performance without requiring real-time consistency across all data sources.

Architecture overview

The following diagram shows the architecture of the first use case.

In this architecture, each database serves a specific purpose, designed for its particular workload. The lack of strict consistency requirements allows for a more flexible and scalable system.

The solution consists of the following key components and relationships:

  • Web application – Uses AWS Elastic Beanstalk and serves as the frontend for collecting customer feedback from various channels. It manages the flow of data to appropriate storage services.
  • Aurora – Stores structured data from in-store surveys and online reviews. This relational database is recommended for handling complex queries and transactions related to customer feedback.
  • DynamoDB – Captures and stores real-time social media comments and sentiment analysis results. Its low-latency and high-throughput capabilities make it a good fit for handling the constant stream of social media data.
  • Amazon S3 – Acts as a data lake, storing raw data from all sources, including unstructured data like customer service call recordings and chat logs.
  • Amazon Redshift – Serves as the data warehouse for analytical queries and reporting, combining data from all sources for long-term trend analysis and business intelligence (BI).
  • Amazon QuickSight – Connects to Amazon Redshift for creating dashboards, reports, and performing ad hoc analyses.

Data flow

The data flow consists of the following steps:

  1. Customer feedback is collected through various channels (in-store, online, social media, customer service).
  2. The web application routes the data to appropriate storage services:
    1. Structured survey data goes to Amazon Aurora.
    2. Real-time social media data goes to DynamoDB.
    3. Raw and unstructured data is stored in Amazon S3.
  3. Data from Aurora and DynamoDB is periodically extracted and loaded into Amazon Redshift, which can be done using zero-ETL integration.
  4. Amazon S3 data is directly queried by Amazon Redshift using Amazon Redshift Spectrum when needed.
  5. BI tools such as Amazon QuickSight connect to Amazon Redshift for analysis and reporting.
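The routing in step 2 can be sketched as a small dispatch function. The channel names and return values below are illustrative assumptions, not part of the actual application:

```python
def route_feedback(record: dict) -> str:
    """Pick the storage backend for a feedback record, mirroring
    the data flow above (channel names are hypothetical)."""
    channel = record.get("channel")
    if channel in ("in_store_survey", "online_review"):
        return "aurora"      # structured survey and review data
    if channel == "social_media":
        return "dynamodb"    # high-throughput, real-time comments
    return "s3"              # raw or unstructured data (calls, chat logs)
```

Keeping this decision in one place makes it straightforward to add a new channel without touching the downstream stores.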

Key terms and metrics

Before going deeper into the details of this post, it’s important to define two key concepts for restore and recovery scenarios:

  • Business continuity – This refers to the ability of an organization to maintain essential functions during and after a disruption. It encompasses planning and preparation to make sure that an organization can operate its critical business functions during emergency events.
  • Disaster recovery – This can be considered as a subset of business continuity. It focuses on the IT systems that support business functions, specifically the policies, tools, and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a disaster.

Two key metrics in business continuity and disaster recovery planning are:

  • Recovery Time Objective (RTO) – The maximum acceptable length of time that an application can be down after a failure or disaster occurs.
  • Recovery Point Objective (RPO) – The maximum acceptable amount of data loss measured in time. It’s the age of the files or data in backup storage that must be restored to resume normal operations.
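To make RPO concrete: the data loss actually incurred by a failure is the window between the most recent usable backup and the failure itself. A minimal sketch of that calculation:

```python
from datetime import datetime, timedelta

def achieved_rpo(backup_times: list[datetime], failure_time: datetime) -> timedelta:
    """Data loss window: time from the last backup taken before
    the failure up to the failure itself."""
    eligible = [t for t in backup_times if t <= failure_time]
    if not eligible:
        raise ValueError("no backup precedes the failure")
    return failure_time - max(eligible)
```

For example, with snapshots every 8 hours, a failure just before the next snapshot loses up to 8 hours of data, so the schedule itself bounds the achievable RPO.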

The following sections describe how these concepts apply to different database services and the options available to meet your business continuity and disaster recovery requirements.

Recovery strategies

Because the multiple backends have no strict consistency requirements, you can tailor recovery strategies using each database service’s high availability (HA), business continuity, and disaster recovery features. The following summarizes these options, organized by service.

Amazon Aurora

  • Built-in high availability and failover – RTO: can be as low as a few seconds using Aurora’s fast crash recovery and Amazon RDS Proxy. RPO: zero, due to Aurora’s enhanced durability and storage auto-repair capabilities.
  • Restoring from the most recent snapshot – RTO: minutes to hours, depending on database size and configuration; regular restore testing in production-like environments with representative datasets is recommended to establish accurate recovery time baselines. RPO: can be as low as a few minutes, equal to the age of the most recent snapshot.
  • PITR using automated backups – RTO: minutes to hours; restore time scales with the database size and the number of incremental changes since the last snapshot. RPO: between EarliestRestorableTime and LatestRestorableTime, depending on the configured backup retention window; normally, LatestRestorableTime is within the last 5 minutes of the current time.
  • PITR using automated backups and AWS Backup – RTO: minutes to hours; restore time scales with the database size. RPO: between EarliestRestorableTime and LatestRestorableTime, depending on the configured backup retention window and backup frequency; normally, LatestRestorableTime is within the last 5 minutes of the current time.
  • AWS Backup in a cross-Region disaster recovery scenario – RTO: minutes to hours; restore time scales with the database size. RPO: minutes, depending on the database size and network throughput (the distance between the primary and secondary AWS Regions matters).
  • Amazon Aurora Global Database to improve multi-Region resilience – RTO: minutes. RPO: typically seconds, and can be managed with Aurora’s managed RPO feature.

Amazon DynamoDB

  • Enable PITR to protect against accidental writes or deletes (effectively enabling nearly continuous backups of your table data), along with enabling deletion protection on the table – RTO: service metrics show that 95 percent of customers’ table restores complete in less than one hour, but restore times are directly related to the configuration of your tables and other variables; a best practice is to regularly document average restore completion times and establish how they affect your overall RTO. RPO: up to 5 minutes; PITR continuously backs up your data, and you can restore it to any point in time within the last 35 days.
  • Use on-demand backups for long-term retention and compliance purposes – RTO: same as PITR restores above. RPO: determined by backup frequency.
  • Implement global tables for multi-Region resilience – RTO: zero with multi-Region strong consistency (MRSC); there’s no need for a restore process, because the data is already replicated and available in the other replica Regions. RPO: MRSC global tables support an RPO of zero.

Amazon S3

  • Enable S3 Versioning to protect against accidental deletions or overwrites – RTO: near zero, with immediate access to previous object versions. RPO: zero, because all object versions are retained.
  • Implement S3 Cross-Region Replication to provide geographical redundancy for disaster recovery – RTO: near zero, because the data is already available in the destination Region. RPO: typically less than 15 minutes, but can vary based on object size and network conditions.
  • Use S3 Intelligent-Tiering for cost-effective long-term storage of infrequently accessed data; although it’s primarily a cost optimization feature, it can play a role in recovery strategies by making sure that data is available in the most appropriate tier for quick access when needed – RTO: milliseconds for data in the Frequent Access and Infrequent Access tiers, minutes to hours for data in the Archive Access tier, and within 12 hours for data in the Deep Archive Access tier. RPO: S3 Intelligent-Tiering doesn’t directly affect RPO, but it can be used in conjunction with features like S3 Versioning and Cross-Region Replication.

Amazon Redshift

  • Use automated snapshots (enabled by default; taken every 8 hours or after every 5 GB of data changes) for PITR – RTO: typically 10–45 minutes, depending on the cluster size and data volume. RPO: up to 8 hours, or less if 5 GB of data changes occur sooner.
  • Implement automated cross-Region snapshot copy for disaster recovery – RTO: depends on cluster size, but typically within 1–2 hours, including snapshot restore time. RPO: depends on the frequency of snapshot creation and the copy process.
  • Use Redshift Spectrum as part of a recovery strategy to query historical data directly from Amazon S3, reducing the need for frequent data loads – RTO: near zero for querying, because it doesn’t require data loading. RPO: depends on how frequently data is updated in Amazon S3.

In this use case, each database can be recovered independently without concerns about strict synchronization. The focus is on making sure that each service maintains its own data integrity and availability. Periodic data reconciliation processes can be implemented to address any potential discrepancies, but they are not critical for the system’s day-to-day operations.
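For instance, an independent point-in-time restore of the DynamoDB table can be issued with boto3’s `restore_table_to_point_in_time`. A minimal sketch that builds and validates the request parameters before the call is made (table names are hypothetical; the actual AWS call is shown commented out):

```python
from datetime import datetime, timezone

def pitr_restore_request(source_table: str, target_table: str,
                         restore_time: datetime) -> dict:
    """Build kwargs for DynamoDB restore_table_to_point_in_time.

    PITR always restores into a new table, so the target name
    must differ from the source.
    """
    if target_table == source_table:
        raise ValueError("PITR restores into a new table")
    if restore_time > datetime.now(timezone.utc):
        raise ValueError("restore_time must be in the past")
    return {
        "SourceTableName": source_table,
        "TargetTableName": target_table,
        "RestoreDateTime": restore_time,
    }

# req = pitr_restore_request("feedback", "feedback-restored",
#                            datetime(2024, 1, 1, tzinfo=timezone.utc))
# boto3.client("dynamodb").restore_table_to_point_in_time(**req)
```

Separating request construction from the call keeps the validation logic easy to test without touching AWS.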

This approach enables a more flexible, scalable architecture, where each purpose-built database can be optimized for its specific workload without maintaining strict consistency across the entire system. It’s particularly suitable for applications where near real-time data alignment is not a requirement, and where the benefits of using specialized databases outweigh the need for immediate cross-database consistency.

Use case 2: Applications with high data consistency requirements among the multiple backends

In this scenario, we consider applications that demand strict data consistency and integrity across multiple purpose-built databases. These are typically mission-critical applications that handle sensitive data where maintaining data alignment is paramount. For such use cases, recovery strategies must prioritize data consistency and provide robust mechanisms for seamless data reconciliation, even in the face of complex disruptions or failures.

An example of a use case that illustrates this scenario is described in Build a Modern Application with Purpose-Built AWS Databases, which describes an online bookstore demo application. This is a modern full-stack web application that serves as an online bookstore with some core features like product catalog management, search functionality, bestsellers tracking, and social recommendations.

Architecture overview

The following diagram shows the architecture of the second use case.

We focus on the services that store data, and we describe the techniques that you can use to make this architecture robust and resilient, implementing effective strategies to maintain the consistency of the data.

The solution consists of the following key components and relationships:

  • Product catalog implemented with Amazon DynamoDB – Stores core product information such as IDs, descriptions, and prices. Provides fast key-value lookups and scales from hundreds to billions of products.
  • Enterprise Search system implemented with Amazon OpenSearch Service – Handles full-text product searches through the main search bar, so users can find books naturally with additional filtering capabilities. Updates flow in real time through Amazon DynamoDB Streams and AWS Lambda. A way to improve this pattern is by implementing the DynamoDB zero-ETL integration with Amazon OpenSearch Service.
  • Best sellers feature implemented with Amazon ElastiCache for Valkey – Tracks the top 20 purchased books using in-memory sorted sets for microsecond-latency reads. Updates flow in real time through DynamoDB Streams and Lambda functions.
  • Social Media recommendations implemented with Amazon Neptune – The graph database is used to store the social connections and track purchase patterns; the goal is to provide traversal of relationships for book recommendations in the online bookstore.

The following are additional considerations not shown in the diagram:

  • Backup and recovery – You can use AWS Backup for consistent, coordinated backups across services (for the complete list of supported services, see the AWS Backup documentation).
  • Monitoring and alerting – You can use Amazon CloudWatch for comprehensive monitoring and alerting.
  • Multi-Region deployment – You can replicate the entire architecture in a secondary Region for disaster recovery. For detailed guidance on multi-Region architectures, refer to AWS multi-Region fundamentals, 300-level prescriptive guidance designed for cloud architects and senior leaders who build workloads on AWS. This comprehensive guide helps technical decision-makers understand and implement multi-Region architectures to improve workload resilience and high availability.

Data flow

The data flow consists of the following steps:

  1. The web frontend interfaces with the database systems through API calls and Lambda functions.
  2. Changes in the product catalog automatically sync to the search index through DynamoDB Streams and Lambda.
  3. Purchase events update both the best sellers list and social recommendations in Amazon ElastiCache for Valkey and Neptune.
  4. All systems work together to provide a seamless user experience.
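Step 2 above (catalog changes syncing to the search index) is typically a small Lambda transform over DynamoDB Streams records. A sketch of that mapping, assuming a hypothetical `books` index and only string/number attributes in the item:

```python
def to_search_action(record: dict) -> dict:
    """Map a DynamoDB Streams record to an OpenSearch-style action.

    INSERT/MODIFY upsert the document; REMOVE deletes it. Only the
    S (string) and N (number) attribute types are handled here.
    """
    book_id = record["dynamodb"]["Keys"]["id"]["S"]
    if record["eventName"] == "REMOVE":
        return {"op": "delete", "index": "books", "id": book_id}
    image = record["dynamodb"]["NewImage"]
    doc = {k: v.get("S") or v.get("N") for k, v in image.items()}
    return {"op": "index", "index": "books", "id": book_id, "doc": doc}
```

In a real handler the returned actions would be batched into an OpenSearch bulk request; the action shape here is a simplified stand-in for illustration.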

Recovery strategies

To provide seamless recovery for data alignment across the different purpose-built databases used in this use case, you can implement the following strategies, organized by service:

  • Amazon DynamoDB – The DynamoDB availability SLA is 99.99% with standard tables and 99.999% with global tables. However, we still recommend enabling point-in-time recovery (PITR), which provides an RPO of minutes, an RTO of minutes or hours, and backup retention of up to 35 days. Optionally, as a secondary protection, DynamoDB global tables enable efficient cross-Region disaster recovery; with multi-Region strong consistency, they provide an RPO of zero. With these approaches, you can set up your DynamoDB tables to be highly resilient, and you can implement Lambda function error handling with a dead-letter queue (DLQ) mechanism. In case of prolonged DynamoDB unavailability, the recommendation is to block access to the frontend instead of managing a huge number of DLQ entries.
  • Amazon DynamoDB Streams – A DynamoDB stream is an ordered flow of information about changes to items in a DynamoDB table. When you enable a stream on a table, DynamoDB captures information about every modification to data items in the table. The stream itself records each item-level modification exactly once and in order, and retains this information in a log for up to 24 hours. Combining DynamoDB Time to Live (TTL), DynamoDB Streams, and Lambda can help simplify archiving data, reduce DynamoDB storage costs, and reduce code complexity. If the Lambda function returns an error, Lambda retries the batch until it processes successfully or the data expires. You can also configure Lambda to retry with a smaller batch, limit the number of retries, discard records after they become too old, and other options. Optionally, you can implement a DLQ mechanism, as described before. Finally, both OpenSearch Service and ElastiCache offer native backup mechanisms for recovery. For OpenSearch index reconstruction, you can use these backups, perform a full scan of DynamoDB, or use the DynamoDB export to Amazon S3 feature. ElastiCache caches running Valkey, Redis OSS, or serverless Memcached can back up their data by creating a snapshot, and you can use the backup to restore a cache or seed data to a new cache. The backup consists of the cache’s metadata along with all of the data in the cache, and all backups are written to Amazon S3. When it comes to rebuilding your DynamoDB-based architecture, the choice depends on factors like data volume, performance requirements, and recovery time objectives; for large DynamoDB tables, using DynamoDB export to S3 is generally more efficient than performing a full table scan. Note that although the stream records each change exactly once, a Lambda consumer of a DynamoDB stream doesn’t guarantee exactly-once processing and might occasionally process duplicates. Make sure that your Lambda function code is idempotent to prevent unexpected issues arising from duplicate processing.
  • Amazon Kinesis Data Streams – To capture changes in DynamoDB, you can use Amazon Kinesis Data Streams, which offers longer data retention (up to 365 days compared to 24 hours of DynamoDB Streams), support for up to 20 simultaneous consumers per shard with enhanced fan-out (compared to 2 with DynamoDB Streams), and the ability to combine DynamoDB changes with other data sources in the same stream for unified processing. However, there are important considerations regarding data consistency: unlike DynamoDB Streams, Kinesis Data Streams might occasionally produce duplicate records, requiring application-level deduplication logic. Additionally, while DynamoDB Streams guarantees in-order delivery of item-level modifications, with Kinesis Data Streams, applications need to implement ordering logic by comparing the ApproximateCreationDateTime timestamp attribute on stream records. For a detailed comparison of these change data capture options, see Choose the right change data capture strategy for your Amazon DynamoDB applications.
  • Amazon Neptune – Neptune is fault tolerant by design and is fed using Lambda functions, so the same Lambda error handling strategies can be applied in case of transient or persistent unavailability of the backend system. In terms of availability SLA, Neptune provides a 99.9% SLA for Multi-AZ DB instances, Multi-AZ DB clusters, and Multi-AZ graphs, whereas a single DB instance or Single-AZ graph provides a 99.5% SLA. Neptune automated backups, with retention of up to 35 days, provide PITR capability, achieving an RPO of seconds to minutes and an RTO of hours. Optionally, you can use Neptune global databases to build a strong and reliable cross-Region disaster recovery solution, with a lower RTO and less data loss (lower RPO) than traditional replication solutions.
  • Amazon OpenSearch Service – OpenSearch Service comes with a 99.99% uptime SLA when Multi-AZ with standby is implemented, and 99.9% when Multi-AZ is implemented without standby. Although you can create snapshots of your OpenSearch indexes, the best approach is implementing an active-active architecture with failover.
  • Amazon ElastiCache for Valkey – ElastiCache for Valkey offers an availability SLA of 99.99% when using a Multi-AZ configuration with automatic failover. ElastiCache for Valkey also supports snapshot-based backups for data recovery; you can restore your data by creating a new ElastiCache for Valkey cache directly from a snapshot. When automatic backups are enabled, ElastiCache creates a backup of the cache on a daily basis, with a maximum retention of 35 days. As previously described for OpenSearch Service, the recommended approach to maintain consistency is to rely on DynamoDB Streams and Lambda resiliency, including the DLQ mechanism, because the cache can be rebuilt from scratch.
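Because Lambda processes stream batches at least once, the handlers feeding OpenSearch Service and ElastiCache should be idempotent. One common sketch is to key on the stream record’s `eventID`; in production the seen-keys set would live in a durable store (for example, a DynamoDB table with conditional writes), but it is in-memory here for illustration:

```python
class IdempotentConsumer:
    """Skips duplicate stream records before applying side effects."""

    def __init__(self, handler):
        self.handler = handler   # the real side effect (e.g. an index update)
        self._seen = set()       # would be a durable store in production

    def process(self, record: dict) -> bool:
        key = record["eventID"]  # unique per DynamoDB Streams record
        if key in self._seen:
            return False         # duplicate delivery: skip the side effect
        self.handler(record)
        self._seen.add(key)
        return True
```

With this guard in place, a redelivered batch after a Lambda retry updates each downstream store at most once.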

The core of the architecture in this use case is the product catalog implemented with Amazon DynamoDB. The Enterprise Search system and the Best Sellers feature are implemented with Amazon OpenSearch Service and Amazon ElastiCache for Valkey, respectively, and are both fed from the product catalog. The social media recommendations system, in contrast, is implemented with Amazon Neptune and fed directly from the frontend interfaces through the Amazon API Gateway and AWS Lambda integration. One way to make the architecture more resilient is to introduce additional streaming/queueing layers, as shown in the following diagram:

This way, you can store the incoming application frontend stream in Amazon Kinesis Data Streams, restore the product catalog and the social media recommendation engine to the latest available point-in-time recovery point, and reconcile by rebuilding only the last mile from the streaming mechanism, which effectively acts as a caching layer.
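The reconciliation step can be sketched as follows: after restoring a backend to a PITR point, only the changes captured after that point are replayed from the stream, in timestamp order. The field name follows the Kinesis record attribute mentioned earlier; because these timestamps are approximate, the replay itself must still be idempotent:

```python
from datetime import datetime

def records_to_replay(stream_records: list[dict],
                      restore_point: datetime) -> list[dict]:
    """Select the 'last mile' of changes to re-apply after a PITR restore."""
    pending = [r for r in stream_records
               if r["ApproximateCreationDateTime"] > restore_point]
    pending.sort(key=lambda r: r["ApproximateCreationDateTime"])
    return pending
```

Combined with the idempotent consumer pattern, this bounds the reconciliation work to the gap between the restore point and the present.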

This architecture provides patterns for maintaining data consistency across multiple purpose-built databases while providing robust recovery mechanisms. The system can recover from failures while minimizing potential data loss in this mission-critical application.

The approach is to rely on the specific service reliability and built-in features, such as automated backups. DLQs and streaming mechanisms can help make the entire architecture more robust and resilient, with more straightforward and more automated recovery procedures. In different scenarios, when you must maintain stronger data consistency among different backend systems, a distributed transaction system can be required, where applicable.

Remember to regularly test and refine your recovery processes, simulating various failure scenarios to make sure that your system can maintain data consistency and recover seamlessly in real-world situations.

Conclusion

In this post, we discussed how aligning purpose-built databases requires a thoughtful approach based on your specific application requirements. Whether you’re dealing with applications that have minimal consistency requirements or mission-critical systems demanding strict data alignment, AWS provides a comprehensive suite of tools and features to build resilient architectures.

Our exploration reveals that understanding your consistency requirements is crucial before designing your recovery strategy. The foundation of data resilience can be built upon AWS built-in features such as DynamoDB Streams, PITR, and global tables, and implementing proper error handling mechanisms like DLQs helps maintain data consistency during failures. Remember that different scenarios might require different approaches, from straightforward independent recoveries to complex distributed transaction systems.

As modern applications continue to evolve, the ability to effectively manage and recover multiple purpose-built databases will become increasingly important. The strategies and best practices outlined in this post provide a framework for addressing these challenges while making sure that your applications remain resilient and reliable.

We’d love to hear what you think! If you have questions or suggestions, leave a comment.


About the authors


Domenico di Salvia

Domenico is a Senior Database Specialist Solutions Architect at AWS. In his role, Domenico works with customers in the EMEA region to provide guidance and technical assistance on database projects, helping them optimize the value of their solutions when using or migrating to AWS and designing scalable, secure, performant, sustainable, cost-effective, and robust database architectures in the AWS Cloud.

Marco Tamassia

Marco is a Principal Technical Instructor based in Milan, Italy. He delivers a wide range of technical trainings to AWS customers across EMEA. Marco has a deep background as a database administrator for companies of all sizes, including AWS. This allows him to bring his database knowledge into the classroom, presenting real-world examples to his students.