Announcing durability for Amazon ElastiCache for Valkey

Source: AWS Database

Amazon ElastiCache serves hundreds of thousands of customers and processes billions of requests per second with microsecond latency across Valkey, Memcached, and Redis OSS workloads. Many organizations find that Multi-AZ replication and automatic failover in ElastiCache meet their resilience requirements, but as customers increasingly adopt ElastiCache as a persistent data store, as well as a cache, data loss becomes a primary concern.

Today, we are announcing durability for Amazon ElastiCache for Valkey, so that you can use ElastiCache for workloads that cannot tolerate data loss.

In this post, we explain how durability works, walk through the architecture, and share performance results to show that durability doesn’t compromise the microsecond latency customers expect from ElastiCache.

How durability works

Durability in ElastiCache uses a Multi-AZ transactional log to protect data and to provide fast recovery and restart during infrastructure failures. ElastiCache offers two durability options: synchronous writes designed for zero data loss and asynchronous writes for microsecond write latency.

Synchronous writes is the right choice when data loss is unacceptable. ElastiCache persists data across at least two Availability Zones (AZs) in the Multi-AZ transactional log before responding to the client. Every acknowledged write is durable, and write latency is single-digit milliseconds. Primary nodes are strongly consistent, meaning read operations on primaries always return the most up-to-date data. This consistency is preserved across failovers. Synchronous writes is ideal for knowledge bases for RAG applications, AI agent long-term memory, AI agent workflow state, payment tokenization, streaming metadata, gaming player state, and real-time inventory management where a lost write produces incorrect application behavior.

Asynchronous writes is the right choice when data is recoverable but rebuilding from source is slow or operationally expensive. With asynchronous writes, data is persisted in the Multi-AZ transactional log after responding to the client, maintaining microsecond write latency at no additional cost. Up to 10 seconds of uncommitted data can be lost in the unlikely event of a failure. To bound potential data loss, ElastiCache monitors the durability lag, which is the age of the oldest write not yet persisted to the log. If this lag reaches 10 seconds, the primary node stops accepting writes until the log catches up. Asynchronous writes is ideal for session stores, gaming leaderboards, real-time analytics, and pre-loaded datasets where losing a few seconds of recent writes is acceptable but a larger gap would be costly to reconcile.

ElastiCache without durability enabled is the right choice when your data is trivially reconstructable on demand. Use it for read-through caches backed by an origin database, rate-limit counters, or any workload where a missing entry can be fetched or recomputed live.

Both synchronous and asynchronous writes maintain microsecond read latencies. Under both options, replica nodes are eventually consistent and read operations from replicas may not always reflect the most recent writes. The following table summarizes the two durability options:

| | Synchronous writes | Asynchronous writes |
|---|---|---|
| Typical read latency | Microseconds | Microseconds |
| Typical write latency | Single-digit milliseconds | Microseconds |
| Data loss guarantees | Zero data loss; every acknowledged write is durable across at least two Availability Zones | Up to 10 seconds of acknowledged writes may be lost in the unlikely event of a failure |
| Typical use cases | Knowledge bases for RAG applications, AI agent long-term memory and workflow state, payment tokenization, and real-time inventory management | Session stores, gaming leaderboards, real-time analytics, pre-loaded datasets |
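The guidance above can be condensed into a small decision helper. This is only a sketch: the function name and its two boolean inputs are hypothetical, and the returned strings simply mirror the `--durability` values used in the CLI examples in this post.

```python
from typing import Optional


def choose_durability(loss_tolerable: bool, cheap_to_rebuild: bool) -> Optional[str]:
    """Return 'sync', 'async', or None (no durability) for a workload.

    loss_tolerable:   a lost acknowledged write does NOT break the application
    cheap_to_rebuild: missing entries can be fetched or recomputed on demand
    """
    if not loss_tolerable:
        return "sync"    # zero data loss, single-digit millisecond writes
    if not cheap_to_rebuild:
        return "async"   # up to 10 seconds of writes at risk, microsecond writes
    return None          # plain cache: reconstruct entries from the origin


print(choose_durability(loss_tolerable=False, cheap_to_rebuild=False))  # sync
print(choose_durability(loss_tolerable=True, cheap_to_rebuild=False))   # async
print(choose_durability(loss_tolerable=True, cheap_to_rebuild=True))    # None
```

Real workloads rarely reduce to two booleans, so treat the helper as a checklist rather than a rule.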

Architecture

The following diagram illustrates how durability for ElastiCache works with the Multi-AZ transactional log.

Synchronous writes

When a client sends a write command to a cluster configured with synchronous writes:

  1. The primary node receives and executes the write command in memory.
  2. The write is persisted to the Multi-AZ transactional log across at least two Availability Zones.
  3. Once persistence is confirmed, the primary returns a success response to the client.

This means that once your client receives a successful response, the write is durable. It will not be lost even if the primary node fails immediately afterward, and any future read from the primary, including after a failover to a new primary, reflects that write. The tradeoff is write latency: each write incurs a cross-AZ network round trip to the transactional log, resulting in single-digit millisecond write latency.
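The ordering of the three steps is what provides the guarantee. The following in-process model is a minimal sketch of that ordering, not the service implementation: `TransactionalLog` and `SyncPrimary` are hypothetical names, and the log is a plain Python list standing in for the Multi-AZ transactional log.

```python
class TransactionalLog:
    """Stand-in for the Multi-AZ transactional log (hypothetical, in-process)."""

    def __init__(self):
        self.entries = []

    def append(self, key, value):
        # In the real service this is the cross-AZ round trip that dominates
        # synchronous write latency.
        self.entries.append((key, value))


class SyncPrimary:
    def __init__(self, log):
        self.memory = {}
        self.log = log

    def write(self, key, value):
        self.memory[key] = value       # step 1: execute in memory
        self.log.append(key, value)    # step 2: persist to the log
        return "OK"                    # step 3: acknowledge the client


log = TransactionalLog()
primary = SyncPrimary(log)
reply = primary.write("order:42", "paid")

# By the time the client sees the reply, the write is already in the log,
# so a new primary rebuilt from the log alone still has it.
new_primary_memory = dict(log.entries)
print(reply, new_primary_memory)
```

Because the acknowledgment is returned only after `append` completes, there is no point at which the client holds a success response for a write that the log does not contain.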

Asynchronous writes

When a client sends a write command to a cluster configured with asynchronous writes:

  1. The primary node receives and executes the write command in memory.
  2. The primary returns a success response to the client immediately at microsecond latency.
  3. In the background, the write is streamed to the Multi-AZ transactional log.

At the moment the client receives the success response, the write exists only in the primary node’s memory. It has not yet been written to the transactional log. If the primary node fails before the write is persisted, that write is lost. This is the fundamental tradeoff of asynchronous writes: microsecond write latency in exchange for a bounded window of potential data loss.
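A small simulation can make the loss window concrete. This is an illustrative sketch only: `AsyncPrimary`, its `pending` queue, and `flush_one` are hypothetical names, and the background streaming is modeled as an explicit method call rather than a real background task.

```python
from collections import deque


class AsyncPrimary:
    def __init__(self):
        self.memory = {}
        self.pending = deque()   # acknowledged but not yet persisted
        self.log = []            # stand-in for the Multi-AZ transactional log

    def write(self, key, value):
        self.memory[key] = value            # step 1: execute in memory
        self.pending.append((key, value))   # queued for background streaming
        return "OK"                         # step 2: acknowledge immediately

    def flush_one(self):
        # step 3: background streaming to the log, one entry at a time
        if self.pending:
            self.log.append(self.pending.popleft())


primary = AsyncPrimary()
primary.write("session:1", "alice")
primary.flush_one()                  # first write reaches the log
primary.write("session:2", "bob")    # acknowledged, still only in memory

# Simulate a primary failure: only what reached the log survives.
survivors = dict(primary.log)
lost = [key for key, _ in primary.pending]
print(survivors, lost)
```

In this run the client received `OK` for both writes, but only `session:1` survives the simulated failure.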

Durability buffer for asynchronous writes

To bound potential data loss with asynchronous writes, ElastiCache enforces a durability buffer of up to 10 seconds. The primary node continuously tracks the age of the oldest write that has been accepted but not yet persisted to the Multi-AZ transactional log, and publishes this value to Amazon CloudWatch as the DurabilityLag metric.
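The lag measurement and the write gate it drives can be sketched as follows. This is a simplified model under stated assumptions: `DurabilityBuffer` is a hypothetical name, the comparison uses `>=` to match the "reaches 10 seconds" wording used earlier in this post, and a fake clock replaces real time so the example runs instantly.

```python
import time
from collections import deque

MAX_LAG_SECONDS = 10.0  # threshold taken from the text above


class DurabilityBuffer:
    """Tracks accepted-but-unpersisted writes (illustrative only)."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.pending = deque()   # (accepted_at, key, value), oldest first

    def lag(self):
        # Age of the oldest write not yet persisted; 0 when nothing is pending.
        if not self.pending:
            return 0.0
        return self.clock() - self.pending[0][0]

    def accept(self, key, value):
        # Reject new writes once the lag reaches the threshold.
        if self.lag() >= MAX_LAG_SECONDS:
            return False
        self.pending.append((self.clock(), key, value))
        return True

    def persisted(self, count=1):
        # Called when the log confirms the oldest `count` writes.
        for _ in range(min(count, len(self.pending))):
            self.pending.popleft()


# Drive the buffer with a fake clock so the example runs instantly.
now = [0.0]
buf = DurabilityBuffer(clock=lambda: now[0])
buf.accept("a", 1)               # accepted at t=0
now[0] = 4.0
lag_at_4s = buf.lag()            # 4.0, still accepting writes
now[0] = 10.0
rejected = buf.accept("b", 2)    # False, lag has reached 10 s
buf.persisted()                  # the log catches up
resumed = buf.accept("b", 2)     # True, writes resume
print(lag_at_4s, rejected, resumed)
```

Note that the lag is driven by the oldest pending write, so a single stuck entry is enough to pause writes even if newer entries are only milliseconds old.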

As long as this age remains under 10 seconds, the node continues accepting new writes normally. If the buffer grows beyond 10 seconds, for example because of transient network congestion to the transactional log, the primary temporarily rejects incoming write commands until it catches up. Read operations continue to be served at microsecond latency throughout this period. After the transactional log catches up and the durability lag drops below the threshold, writes resume automatically with no manual intervention required.

In practice, most writes are persisted well within the 10-second threshold, and most clusters will never enter the rejection state under normal operating conditions.

When configuring your client to send traffic to an asynchronous durable cluster, we recommend enabling automatic retry with exponential backoff for any write commands that are temporarily rejected. We recommend Valkey GLIDE, one of the official open source client libraries for Valkey, designed for reliability and high availability. GLIDE supports automatic retry with exponential backoff and availability-zone-aware routing. For client configuration best practices, see Best practices: Valkey/Redis OSS clients and Amazon ElastiCache.
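If your client library does not retry for you, the retry pattern can be sketched with the standard library alone. This is not the GLIDE API: `WriteRejected` is a hypothetical exception standing in for whatever error your client raises on a rejected write, and the delay values are illustrative defaults. Random jitter is added on top of the exponential delay so that many clients do not retry in lockstep.

```python
import random
import time


class WriteRejected(Exception):
    """Hypothetical error type standing in for a rejected write."""


def write_with_backoff(send, attempts=5, base_delay=0.05, max_delay=2.0,
                       sleep=time.sleep):
    """Call send() until it succeeds, backing off exponentially between tries."""
    for attempt in range(attempts):
        try:
            return send()
        except WriteRejected:
            if attempt == attempts - 1:
                raise                        # out of retries, surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter spreads out retries


# Usage with a fake server that rejects the first two writes.
calls = {"n": 0}
delays = []


def fake_send():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise WriteRejected("durability lag above threshold")
    return "OK"


result = write_with_backoff(fake_send, sleep=delays.append)
print(result, calls["n"], len(delays))
```

Only retry commands that are safe to repeat, or make them idempotent, since a retried write may be applied after other clients have modified the same key.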

Failure scenarios

Durability for ElastiCache protects against the following failure types:

  • Primary node failure. In the event of a primary node failure, ElastiCache automatically triggers a failover to a replica. The replica catches up from the transactional log and only then begins accepting writes as the new primary. The failed node is replaced and syncs from the log. With synchronous writes, no data is lost. With asynchronous writes, up to 10 seconds of acknowledged writes can be lost because the transactional log may not have all writes acknowledged before the primary failed.
  • Read replica failure. In the event of a read replica failure, the failed node is replaced and syncs from the Multi-AZ transactional log regardless of the durability option selected. No data loss occurs.
  • Full shard failure (all nodes in a shard). In the event of a full shard failure, all nodes are replaced and sync from the Multi-AZ transactional log. With synchronous writes, no data is lost. With asynchronous writes, up to 10 seconds of acknowledged writes can be lost. After committed data is restored, one of the replaced nodes is automatically elected as the new primary.
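All three scenarios rely on the same step: a node rebuilds its state from the transactional log. A minimal sketch of that replay is shown below. The entry format `(op, key, value)` and the `replay` function are assumptions made for illustration; the actual log format is not described in this post.

```python
def replay(log_entries):
    """Rebuild a node's in-memory state by applying log entries in order."""
    state = {}
    for op, key, value in log_entries:
        if op == "SET":
            state[key] = value
        elif op == "DEL":
            state.pop(key, None)
    return state


# Entries that reached the transactional log before the failure.
committed = [
    ("SET", "cart:7", "2 items"),
    ("SET", "cart:8", "1 item"),
    ("DEL", "cart:8", None),
    ("SET", "cart:7", "3 items"),
]

# A replica being promoted, a replaced replica, and every node in a replaced
# shard all run the same catch-up, so they converge on the same state.
promoted_replica = replay(committed)
replacement_node = replay(committed)
print(promoted_replica)
```

With synchronous writes every acknowledged write is in `committed`, so the rebuilt state is complete. With asynchronous writes, anything acknowledged but not yet streamed to the log is absent from the replay.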

Performance analysis

We measured throughput and read and write latencies of ElastiCache with and without durability to compare the options and to demonstrate that enabling durability does not compromise the microsecond latency customers expect from ElastiCache.

Test methodology

We launched ElastiCache version 9.0 for Valkey clusters with no durability, synchronous writes, and asynchronous writes using r7g.4xlarge nodes. Each cluster consisted of one primary node with one read replica and was pre-populated with sample data prior to the test runs. We ran the Valkey default performance measurement tool (valkey-benchmark) with three million keys, without any command pipelining, and used ten Amazon Elastic Compute Cloud (Amazon EC2) instances in the same AZ as the primary node to direct traffic to the clusters. We tested at two throughput levels (50K and 100K TPS) using a mixed workload (80% reads, 20% writes) representative of common customer workload patterns. Because ElastiCache clusters are Multi-AZ distributed systems, you may observe some variance from the numbers in the table below, even in identical setups.

The following table compares read and write latencies across all ElastiCache options on r7g.4xlarge nodes.

| Workload (80% read, 20% write) | ElastiCache option | Node type | Read P50 | Read P90 | Write P50 | Write P90 |
|---|---|---|---|---|---|---|
| 50K TPS | ElastiCache without durability | r7g.4xlarge | 260 µs | 301 µs | 147 µs | 185 µs |
| 50K TPS | Asynchronous writes | r7g.4xlarge | 245 µs | 289 µs | 112 µs | 152 µs |
| 50K TPS | Synchronous writes | r7g.4xlarge | 245 µs | 288 µs | 2.15 ms | 2.36 ms |
| 100K TPS | ElastiCache without durability | r7g.4xlarge | 263 µs | 301 µs | 160 µs | 196 µs |
| 100K TPS | Asynchronous writes | r7g.4xlarge | 245 µs | 286 µs | 128 µs | 158 µs |
| 100K TPS | Synchronous writes | r7g.4xlarge | 879 µs | 992 µs | 2.72 ms | 3.12 ms |

Key takeaways:

  • All options maintain microsecond read latency. Whether synchronous or asynchronous, durability preserves microsecond read performance across both throughput levels, making it suitable for read-heavy workloads which represent the majority of real-world use cases.
  • Asynchronous writes deliver comparable latency to ElastiCache without durability. At both 50K and 100K TPS, P50 and P90 read and write latencies are within 40 µs of each other. You can add durability with no additional charge and negligible impact on throughput at typical workload levels. We recommend asynchronous writes as the default for all workloads that do not require zero data loss, as this option provides durability with no latency penalty.
  • Synchronous writes preserve microsecond read latency at moderate throughput. At 50K TPS, read latency remains sub-300 µs. At 100K TPS, read latency increases but stays sub-millisecond (879 µs at P50) as the system handles higher concurrency to the transactional log. Write latency remains in the single-digit millisecond range at both throughput levels, the expected tradeoff for persisting data across at least two Availability Zones before acknowledging the write. You should use synchronous writes when your application cannot tolerate any data loss.
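The takeaways can be checked against the table with a few lines of arithmetic. The sketch below copies the published numbers into a dictionary (the short option labels `none`, `async`, and `sync` are ours) and derives two figures: the largest gap between asynchronous writes and no durability, and the write P50 of synchronous writes relative to no durability.

```python
# (read_p50, read_p90, write_p50, write_p90) in microseconds, from the table.
RESULTS = {
    ("50K", "none"):   (260, 301, 147, 185),
    ("50K", "async"):  (245, 289, 112, 152),
    ("50K", "sync"):   (245, 288, 2150, 2360),
    ("100K", "none"):  (263, 301, 160, 196),
    ("100K", "async"): (245, 286, 128, 158),
    ("100K", "sync"):  (879, 992, 2720, 3120),
}

# Largest gap between asynchronous writes and no durability, in any column.
max_async_gap = max(
    abs(a - b)
    for tps in ("50K", "100K")
    for a, b in zip(RESULTS[(tps, "async")], RESULTS[(tps, "none")])
)

# Write P50 of synchronous writes relative to no durability.
sync_write_ratio = {
    tps: round(RESULTS[(tps, "sync")][2] / RESULTS[(tps, "none")][2], 1)
    for tps in ("50K", "100K")
}

print(max_async_gap)      # 38
print(sync_write_ratio)   # {'50K': 14.6, '100K': 17.0}
```

In these runs the asynchronous option never differs from the no-durability baseline by more than 38 µs, while the median synchronous write takes roughly 15 to 17 times as long as a non-durable write.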

Getting started with durability for ElastiCache

Prerequisites

Before you begin, verify you have:

  • An active AWS account
  • AWS CLI version 2.x or later installed and configured
  • IAM permissions for elasticache:CreateReplicationGroup, elasticache:ModifyReplicationGroup, elasticache:DescribeReplicationGroups, and elasticache:DeleteReplicationGroup

Creating a durable cluster

To get started with durability, you need to create a new ElastiCache cluster and select your preferred durability option using the AWS Management Console, AWS Software Development Kit (SDK), or AWS Command Line Interface (CLI).

Using the AWS Management Console

When creating a new cluster, select Valkey 9.0 or later, then choose your preferred durability option in the cluster settings.

Using the AWS CLI

To create a new durable cluster with synchronous writes:

aws elasticache create-replication-group \
--replication-group-id my-durable-cluster \
--replication-group-description "ElastiCache durable cluster" \
--engine valkey --engine-version 9.0 \
--num-node-groups 2 --replicas-per-node-group 1 \
--cache-node-type cache.r7g.large \
--multi-az-enabled \
--transit-encryption-enabled \
--durability sync \
--region us-east-1

To create a cluster with asynchronous writes, set --durability async:

aws elasticache create-replication-group \
--replication-group-id my-durable-cluster \
--replication-group-description "ElastiCache durable cluster" \
--engine valkey --engine-version 9.0 \
--num-node-groups 2 --replicas-per-node-group 1 \
--cache-node-type cache.r7g.large \
--multi-az-enabled \
--transit-encryption-enabled \
--durability async \
--region us-east-1

Verify your cluster

After creating the cluster, you can verify it is running with durability enabled:

aws elasticache describe-replication-groups \
--replication-group-id my-durable-cluster \
--query 'ReplicationGroups[0].[Status,Durability]' --region us-east-1

The output should show status as available and your selected durability option.
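Cluster creation takes time, so scripts usually poll until the status is available. The following is a rough sketch of such a loop. `describe_status` shells out to the same describe command shown above, and `wait_until_available` is a hypothetical helper; the timeout and interval are arbitrary, and the exact string reported for the durability option (shown here as `sync`) is an assumption.

```python
import json
import subprocess
import time


def describe_status(group_id, region):
    """Run the describe command shown above and return [status, durability]."""
    out = subprocess.run(
        ["aws", "elasticache", "describe-replication-groups",
         "--replication-group-id", group_id,
         "--query", "ReplicationGroups[0].[Status,Durability]",
         "--region", region, "--output", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def wait_until_available(describe, timeout=1800, interval=30, sleep=time.sleep):
    """Poll describe() until status is 'available' or the timeout elapses."""
    waited = 0
    while True:
        status, durability = describe()
        if status == "available":
            return durability
        if waited >= timeout:
            raise TimeoutError(f"cluster still {status!r} after {waited}s")
        sleep(interval)
        waited += interval


# Dry run against canned responses, so no AWS call is made here.
# Real usage would be:
#   wait_until_available(lambda: describe_status("my-durable-cluster", "us-east-1"))
responses = iter([["creating", "sync"], ["creating", "sync"], ["available", "sync"]])
naps = []
final_option = wait_until_available(lambda: next(responses), sleep=naps.append)
print(final_option, naps)
```

Passing the describe function and the sleep function as arguments keeps the loop easy to test without touching a real cluster.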

Switching durability options

To switch an existing cluster between synchronous and asynchronous writes, use modify-replication-group:

aws elasticache modify-replication-group \
--replication-group-id my-durable-cluster \
--durability async \
--region us-east-1

Cleaning up

To avoid ongoing charges, delete the ElastiCache cluster you created:

aws elasticache delete-replication-group \
--replication-group-id my-durable-cluster \
--region us-east-1

Note: This operation permanently deletes the cluster and all data. Verify you have backed up any data you need before proceeding.

Conclusion

With durability in ElastiCache, you can use ElastiCache for both caching and persistent data store use cases. Synchronous writes are designed for zero data loss with microsecond read latency and single-digit millisecond write latency, suitable for workloads that cannot tolerate data loss. Asynchronous writes deliver comparable performance to ElastiCache without durability at no additional charge, suitable for workloads that can tolerate up to 10 seconds of potential data loss in the unlikely event of a failure. ElastiCache without durability remains the right choice for classic caching workloads where data can be reconstructed from an origin source and full write availability is paramount.

Durability for ElastiCache is available in all AWS commercial Regions, the AWS China Regions, and the AWS GovCloud (US) Regions starting with Valkey 9.0. For pricing details, visit the Amazon ElastiCache pricing page. To learn more, visit the ElastiCache documentation.


About the authors

Jules Lasarte

Jules is a Software Development Engineer in the Amazon In-Memory Databases team. They lead the engineering effort behind ElastiCache durability, focusing on high-performance distributed systems and data protection for in-memory workloads. They’re based in Vancouver, Canada.

Karthik Konaparthi

Karthik is a Principal Product Manager in the Amazon In-Memory Databases team and is based in Seattle, WA. He’s passionate about all things data and enjoys turning customer problems into products they love. Outside of work, he enjoys exploring new places with his family and is always looking for the next great restaurant.