Amazon ElastiCache adds durability for Valkey

AWS adds durability to ElastiCache for Valkey. The new feature enables the service to be used for workloads that cannot tolerate data loss. Two write modes offer flexibility: zero data loss with synchronous writes or microsecond latency with a limited risk of loss via asynchronous writes.

Amazon ElastiCache serves hundreds of thousands of customers and processes billions of requests per second with microsecond latency. The service supports Valkey, Memcached, and Redis OSS workloads. More and more organizations are using ElastiCache not only as a cache but also as a persistent data store. This makes data loss a serious concern.

Valkey is an open-source fork of Redis, launched after Redis changed its licensing model in 2024. The project is hosted by the Linux Foundation and released under a BSD license. AWS is actively involved in the project’s development. ElastiCache for Valkey is now the recommended choice for new AWS deployments.

Synchronous and asynchronous write operations

The new durability feature works via a Multi-AZ transactional log. Data is distributed across at least two Availability Zones to protect against infrastructure failures. AWS offers two modes.

With synchronous writes, the primary node commits a write only after the data has been recorded in the transactional log across at least two AZs. Every confirmed write operation is therefore durable. Write latency in this mode is a few milliseconds. Read operations on primary nodes are strongly consistent, even after a failover. This mode is suitable for knowledge bases for RAG apps, AI agent workflow status, and real-time inventory management.
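The commit rule described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not how ElastiCache is implemented: the `AzLog` and `SyncPrimary` classes are hypothetical, and the only detail taken from the article is that a write is confirmed once it has been logged in at least two AZs.

```python
from dataclasses import dataclass, field

MIN_AZS = 2  # from the article: a write must be logged in at least two AZs


@dataclass
class AzLog:
    """Hypothetical stand-in for one Availability Zone's copy of the log."""
    name: str
    available: bool = True
    entries: list = field(default_factory=list)

    def append(self, entry) -> bool:
        # An unreachable AZ cannot record the entry.
        if not self.available:
            return False
        self.entries.append(entry)
        return True


@dataclass
class SyncPrimary:
    """Toy primary node that confirms a write only once it is durable."""
    logs: list
    data: dict = field(default_factory=dict)

    def set(self, key, value) -> bool:
        # Count how many AZ logs recorded the write.
        acks = sum(log.append((key, value)) for log in self.logs)
        if acks < MIN_AZS:
            return False  # not durable, so the client gets no confirmation
        self.data[key] = value
        return True
```

In this model a write still succeeds with one of three AZs down, but is refused once only a single AZ can record it, which is the trade-off that costs the extra milliseconds of latency.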

With asynchronous writes, the primary node responds to the client immediately, with microsecond latency. In the background, the write operation is streamed to the transactional log. In the event of a failure, up to ten seconds of acknowledged write operations may be lost. ElastiCache publishes the DurabilityLag metric to Amazon CloudWatch to track this. If the lag exceeds ten seconds, the node temporarily stops accepting new write operations until the log catches up. Read operations remain available during that period. Asynchronous write operations are suitable for session stores, gaming leaderboards, and real-time analytics.
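The behavior around the ten-second limit can be sketched as a toy model. Everything here is an assumption for illustration: the `AsyncPrimary` class is hypothetical, and lag is modeled as the age of the oldest write not yet in the log, which may differ from how AWS computes DurabilityLag.

```python
from collections import deque

MAX_LAG_SECONDS = 10.0  # limit stated in the article


class AsyncPrimary:
    """Toy primary that acknowledges at once and streams to the log later."""

    def __init__(self):
        self.data = {}
        self.pending = deque()  # (timestamp, key, value) not yet in the log
        self.log = []

    def durability_lag(self, now: float) -> float:
        # Assumed definition: age of the oldest write that is not yet durable.
        return now - self.pending[0][0] if self.pending else 0.0

    def set(self, key, value, now: float) -> bool:
        if self.durability_lag(now) > MAX_LAG_SECONDS:
            return False  # writes are paused until the log catches up
        self.data[key] = value
        self.pending.append((now, key, value))
        return True

    def get(self, key):
        # Reads keep working while writes are paused.
        return self.data.get(key)

    def flush(self, count: int) -> None:
        # Background streaming: move up to `count` pending writes to the log.
        for _ in range(min(count, len(self.pending))):
            _, key, value = self.pending.popleft()
            self.log.append((key, value))
```

Whatever is still in `pending` when the node fails is what would be lost, which is why capping the lag also caps the window of possible data loss.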

Choice by scenario

ElastiCache without durability remains the right choice when data can easily be retrieved again from its source, such as with read-through caches or rate-limit counters. For clusters that use asynchronous writes, AWS recommends configuring automatic retry with exponential backoff via the Valkey GLIDE client, the official open-source client library with built-in support for Availability Zone-aware routing.
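For readers unfamiliar with the pattern, retry with exponential backoff can be sketched in a few lines. This is a generic illustration and not the Valkey GLIDE configuration API: the `retry_with_backoff` helper, its parameters, and the use of `ConnectionError` as the retryable failure are all assumptions made for the example.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.05,
                       max_delay=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            # Assumed failure type; a real client raises its own errors.
            if attempt == max_attempts - 1:
                raise
            # The delay ceiling doubles on each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
```

A caller would wrap a write in it, for example `retry_with_backoff(lambda: client.set("key", "value"))`, where `client` is whatever client object the application already uses.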

In the event of a node failure, ElastiCache automatically triggers a failover to a replica. The replica fetches the state from the transactional log before accepting write operations. Even in the event of a complete shard failure, all nodes are replaced and synchronized from the log.
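The recovery sequence can be illustrated with another toy model. The `Replica` class and the list-based log are hypothetical; the sketch only captures the ordering described above, in which a node replays the log before it accepts writes.

```python
class Replica:
    """Toy replica that rebuilds its state from a shared log."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries already applied
        self.is_primary = False

    def catch_up(self, log):
        # Apply only the entries this node has not seen yet.
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)

    def promote(self, log):
        # Fetch the latest state from the log before accepting writes.
        self.catch_up(log)
        self.is_primary = True

    def set(self, key, value, log):
        if not self.is_primary:
            raise RuntimeError("replica is read-only until promoted")
        log.append((key, value))
        self.data[key] = value
        self.applied = len(log)
```

A brand-new `Replica()` that calls `promote(log)` ends up with the same data as one that was already partly caught up, which mirrors the complete-shard-failure case where every node is replaced and rebuilt from the log.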
