We frequently upgrade our Amazon ElastiCache fleet, with patches and upgrades being applied to instances seamlessly. However, from time to time we need to relaunch your ElastiCache nodes to apply mandatory OS updates to the underlying host. These replacements are required to apply upgrades that strengthen security, reliability, and operational performance.
You also have the option to manage these replacements yourself at any time prior to the scheduled maintenance window. When you manage a replacement yourself, your instance will receive the OS update when you relaunch the node and your scheduled maintenance window will be cancelled.
Valkey features
What is Valkey?
Valkey is a Linux Foundation-led, open-source evolution of Redis OSS, built by long-standing Redis OSS contributors and maintainers, that supports a variety of use cases such as caching, leaderboards, and session stores. Valkey is backed by 40+ companies and has seen rapid adoption since the project was created in March 2024.
Why should I use ElastiCache for Valkey?
With ElastiCache for Valkey, you get a fully managed experience built on open-source technology, along with the security, operational excellence, 99.99% availability SLA, and reliability that we provide. ElastiCache Serverless for Valkey reduces costs further: it is priced 33% lower, and its minimum data storage of 100 MB is 90% lower, than ElastiCache Serverless for Redis OSS. On node-based ElastiCache for Valkey clusters that you design yourself, you can benefit from up to 20% lower cost per node.
How do I upgrade from ElastiCache for Redis OSS to ElastiCache for Valkey?
You can upgrade an existing ElastiCache for Redis OSS cache to ElastiCache for Valkey without downtime, in just a few clicks. You can get started using the Management Console, Software Development Kit (SDK), or Command Line Interface (CLI). For more information, please visit the ElastiCache user guide.
Does ElastiCache support Multi-AZ operation?
Yes. With ElastiCache, you can create a read replica in another AZ. When using ElastiCache Serverless, data is automatically stored redundantly across multiple AZs for high availability. When you design your own ElastiCache cache and a node fails, we provision a replacement node. If the primary node fails, ElastiCache automatically promotes an existing read replica to the primary role.
How do I upgrade to a newer engine version?
You can quickly upgrade to a newer engine version by using the ElastiCache APIs and specifying your preferred engine version. In the ElastiCache console, you can select a cache and select Modify. The engine upgrade process is designed to retain your existing data.
Can I downgrade to an earlier engine version?
No, downgrading to an earlier engine version is not supported.
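To make the upgrade-only rule concrete, here is a minimal Python sketch that validates a requested engine version before building an upgrade request. The helper names are hypothetical; the dictionary keys are assumed to mirror the ModifyReplicationGroup API parameters (ReplicationGroupId, EngineVersion, ApplyImmediately), and the sketch does not call AWS itself.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn "7.2" or "5.0.6" into a comparable (major, minor, patch) tuple."""
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (3 - len(parts))  # pad "7.2" to 7.2.0 so it compares equal
    return tuple(parts[:3])


def build_upgrade_request(group_id: str, current: str, target: str) -> dict:
    """Return keyword arguments for an engine-version upgrade (hypothetical helper).

    Raises ValueError if target is not newer than current, since downgrades
    are not supported.
    """
    if parse_version(target) <= parse_version(current):
        raise ValueError(f"{current} -> {target} is not an upgrade")
    return {
        "ReplicationGroupId": group_id,
        "EngineVersion": target,
        # Assumption: apply now instead of waiting for the maintenance window.
        "ApplyImmediately": True,
    }
```

You could pass the result to an ElastiCache client, for example boto3's modify_replication_group(**params). Rejecting a downgrade locally simply surfaces the problem before any request is sent.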
Can I have cross-Region replicas with ElastiCache?
Yes. You can create cross-Region replicas using the Global Datastore feature in ElastiCache. Global Datastore provides fully managed, fast, reliable, and security-focused cross-Region replication. It allows you to write to your ElastiCache cluster in one Region and have the data available to be read from up to two other cross-Region replica clusters, thereby enabling low-latency reads and disaster recovery across Regions.
Performance
What are the performance benefits of ElastiCache?
ElastiCache provides enhanced I/O threads that deliver significant improvements to throughput and latency at scale through multiplexing, presentation layer offloading, and more. Enhanced I/O threads improve performance by using more cores for processing I/O and dynamically adjusting to the workload. ElastiCache improves the throughput of TLS-enabled clusters by offloading encryption to the same enhanced I/O threads. ElastiCache version 7.2 for Valkey introduced enhanced I/O multiplexing, which combines many client requests into a single channel and improves thread efficiency.
In ElastiCache version 7.2 for Valkey and above, we extended the enhanced I/O threads functionality to also handle the presentation layer logic. Enhanced I/O threads not only read client input but also parse it into a binary command format, which is then forwarded to the main thread for execution. With ElastiCache version 7.2 for Valkey, you can achieve up to 100% more throughput and 50% lower P99 latency compared to the prior version. On r7g.4xlarge nodes or larger, you can achieve over 1 million requests per second (RPS) per node.
How do I monitor Valkey CPU utilization?
ElastiCache provides two different sets of metrics to measure the CPU utilization of your cache depending on your cache deployment choice. When using ElastiCache Serverless, you can monitor the CPU utilization with the ElastiCache Processing Units (ECPU) metric. The number of ECPUs consumed by your requests depends on the vCPU time taken and the amount of data transferred. Each read and write, like the Valkey GET and SET commands or the Memcached get and set commands, requires 1 ECPU for each kilobyte (KB) of data transferred. Some commands that operate on in-memory data structures can consume more vCPU time than a GET or SET command. ElastiCache calculates the number of ECPUs consumed based on the vCPU time taken by the command compared to a baseline of the vCPU time taken by a SET or GET command. If your command takes additional vCPU time and transfers more data than the baseline of 1 ECPU, then ElastiCache calculates the ECPUs required based on the higher of the two dimensions.
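The "higher of the two dimensions" rule can be illustrated with a small Python estimate. This is a sketch, not the billing formula: it assumes ECPUs scale linearly with kilobytes transferred (so 3.2 KB counts as 3.2 ECPUs), that every request consumes at least 1 ECPU, and that you already know the command's vCPU time as a multiple of a baseline GET or SET. The function name is hypothetical.

```python
def estimate_ecpus(kb_transferred: float, vcpu_ratio: float = 1.0) -> float:
    """Rough ECPU estimate for one request.

    kb_transferred: kilobytes moved by the request.
    vcpu_ratio: the command's vCPU time divided by that of a baseline GET/SET.
    """
    # Assumption: at least 1 ECPU per request on each dimension.
    data_dimension = max(1.0, kb_transferred)
    compute_dimension = max(1.0, vcpu_ratio)
    # The request is counted on whichever dimension is higher.
    return max(data_dimension, compute_dimension)
```

Under these assumptions, a 3.2 KB SET counts as 3.2 ECPUs, while a small command that needs 2.5 times the baseline vCPU time counts as 2.5 ECPUs.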
When designing your own cluster, you can monitor EngineCPUUtilization and CPUUtilization. The CPUUtilization metric measures the CPU utilization for the instance (node), and the EngineCPUUtilization metric measures the utilization at the engine process level. You need the EngineCPUUtilization metric in addition to the CPUUtilization metric because the main engine process is single-threaded and uses only one of the CPU cores available on an instance. Therefore, the CPUUtilization metric does not provide precise visibility into CPU utilization at the process level. We recommend that you use the CPUUtilization and EngineCPUUtilization metrics together to get a detailed understanding of CPU utilization for your Valkey clusters.
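Why host-level CPUUtilization can hide a saturated engine thread becomes clear with a little arithmetic: on a 4-vCPU node, 25% CPUUtilization is enough busy time to keep one core at 100%. The hypothetical helper below computes that upper bound. It is only a rough reasoning aid, assuming CPUUtilization is averaged across all vCPUs; EngineCPUUtilization remains the metric to watch.

```python
def max_engine_thread_utilization(cpu_utilization_pct: float, vcpus: int) -> float:
    """Upper bound on one core's utilization implied by host-wide CPUUtilization."""
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    # Host CPUUtilization averages over all vCPUs, so the total busy time equals
    # cpu_utilization_pct * vcpus "percent-cores". A single core (such as the
    # engine's main thread) can absorb all of it, up to 100%.
    return min(100.0, float(cpu_utilization_pct) * vcpus)
```

A node reporting only 25% CPUUtilization on 4 vCPUs could therefore have its engine thread fully saturated.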
Both sets of metrics are available in all Amazon Web Services Regions, and you can access these metrics using Amazon CloudWatch or in the console.
Read replica
What does it mean to run a node as a read replica?
Read replicas serve two purposes:
- Failure handling
- Read scaling
When you run a cache with a read replica, the primary serves both writes and reads. The replica serves exclusively read traffic and is also available as a warm standby in the event that the primary becomes impaired.
When would I want to consider using a Valkey read replica?
With ElastiCache Serverless, read replicas are automatically maintained by the service. When designing your own cache, there are a variety of scenarios where deploying one or more read replicas for a given primary node might make sense. Common reasons for deploying a read replica include:
- Scaling beyond the compute or I/O capacity of a single primary node for read-heavy workloads: This excess read traffic can be directed to one or more read replicas.
- Serving read traffic while the primary is unavailable: If your primary node cannot take I/O requests (for example, due to I/O suspension for backups or scheduled maintenance), you can direct read traffic to your read replicas. For this use case, keep in mind that the data on the read replica might be stale, since the primary instance is unavailable. A read replica can also serve as a warm copy of the data for restarting a failed primary.
- Data protection scenarios: In the unlikely event that your primary node fails or the AZ in which it resides becomes unavailable, you can promote a read replica in a different AZ to become the new primary.
How do I connect to my read replicas?
You can connect to a read replica just as you would connect to a primary cache node. If you have multiple read replicas, it is up to your application to determine how read traffic will be distributed amongst them. Here are more details:
- For Valkey or Redis OSS (cluster mode disabled) clusters, use the individual node endpoints for read operations. (In the API and CLI these are referred to as read endpoints.)
- For Valkey or Redis OSS (cluster mode enabled) clusters, use the cluster's configuration endpoint for all operations. You can still read from individual node endpoints. (In the API and CLI these are referred to as read endpoints.)
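Because distributing reads across replicas is left to your application, one common choice is simple round-robin. The Python sketch below assumes a cluster-mode-disabled group, with writes going to the primary endpoint and reads rotating across the individual replica endpoints. The class name and endpoint strings are hypothetical placeholders, and a real client would also handle replica failures.

```python
from itertools import cycle
from typing import Iterator, List


class ReadRouter:
    """Send writes to the primary endpoint and rotate reads across replicas."""

    def __init__(self, primary: str, replicas: List[str]) -> None:
        self.primary = primary
        # With no replicas configured, reads fall back to the primary.
        self._reads: Iterator[str] = cycle(replicas) if replicas else cycle([primary])

    def endpoint_for(self, is_write: bool) -> str:
        return self.primary if is_write else next(self._reads)
```

Round-robin is only one policy; weighting by replica size or preferring a replica in the client's own AZ are equally valid choices.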
How many read replicas can I create for a given primary node?
ElastiCache allows you to create up to five (5) read replicas for a given primary cache node.
What happens to read replicas if failover occurs?
In the event of a failover, any associated and available read replicas should automatically resume replication once failover has completed (acquiring updates from the newly promoted read replica).
How does ElastiCache keep my read replica up-to-date with its primary node?
Updates to a primary cache node are automatically replicated to any associated read replicas. However, because Valkey and Redis OSS replication is asynchronous, a read replica can fall behind its primary cache node for a variety of reasons. Typical reasons include:
- Write I/O volume to the primary cache node exceeds the rate at which changes can be applied to the read replica.
- Network partitions or latency between the primary cache node and a read replica.
Read replicas are subject to the strengths and weaknesses of Valkey or Redis OSS replication. If you are using read replicas, you should be aware of the potential for lag, or “inconsistency,” between a read replica and its primary cache node. ElastiCache emits the ReplicationLag metric in Amazon CloudWatch to help you track this lag.
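If your application can tolerate only a bounded amount of staleness, it can skip replicas whose lag exceeds a budget. This Python sketch assumes you have already collected per-replica lag in seconds (for example, from the ReplicationLag metric); the function name and replica names are hypothetical.

```python
from typing import Dict, List


def fresh_replicas(lag_seconds: Dict[str, float], max_lag: float) -> List[str]:
    """Return replicas within the staleness budget, least-lagged first."""
    eligible = [name for name, lag in lag_seconds.items() if lag <= max_lag]
    return sorted(eligible, key=lambda name: lag_seconds[name])
```

When the returned list is empty, the caller would read from the primary instead.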
How much do read replicas cost? When does billing begin and end?
A read replica is billed as a standard cache node and at the same rates. Just like a standard cache node, the rate per cache node hour for a read replica is determined by the cache node class of the read replica: visit the ElastiCache pricing page for up-to-date pricing. You are not charged for the data transfer incurred in replicating data between your primary cache node and read replica. Billing for a read replica begins as soon as the read replica has been successfully created (when the status is listed as active). The read replica will continue being billed at standard ElastiCache cache node hour rates until you issue a command to delete it.
What happens during failover and how long does it take?
ElastiCache supports initiated failover so that you can resume cache operations as quickly as possible. When failing over, ElastiCache flips the DNS record for your cache node to point at the read replica, which is in turn promoted to become the new primary. We encourage you to follow best practices and implement cache node connection retry at the application layer. Typically, the five events listed below complete, start to finish, within six minutes.
These are the automatic failover events, listed in order of occurrence:
- Replication group message: Test Failover API called for node group <node-group-id>
- Cache cluster message: Failover from primary node <primary-node-id> to replica node <node-id> completed
- Replication group message: Failover from primary node <primary-node-id> to replica node <node-id> completed
- Cache cluster message: Recovering cache nodes <node-id>
- Cache cluster message: Finished recovery for cache nodes <node-id>
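The connection-retry advice above can be implemented with capped exponential backoff and jitter, so that clients do not all reconnect at the same instant while DNS switches to the promoted replica. This Python sketch is generic: it assumes your client raises the built-in ConnectionError on a dropped connection (real client libraries define their own exception types), and the delay values are illustrative rather than recommended settings.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(
    op: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op, retrying on ConnectionError with capped exponential backoff and full jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles each attempt (0.1 s, 0.2 s, 0.4 s, ...) up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

For example, with_retry(lambda: client.get("session:42")), where client stands for whatever Valkey client your application uses.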
Can I create a read replica in a different Region from my primary?
No, a read replica can only be provisioned in the same AZ as the primary cache node or in a different AZ within the same Region. You can, however, use Global Datastore for fully managed, fast, reliable, and security-focused replication across Amazon Web Services Regions. With this feature, you can create cross-Region read replica clusters for ElastiCache to enable low-latency reads and disaster recovery across Amazon Web Services Regions.
Can I add and remove read replica nodes for my cluster environment?
Yes. You can add or remove a read replica across one or more shards in a cluster environment. The cluster continues to stay online and serve incoming I/O during this operation.
Multi-AZ
What is Multi-AZ for ElastiCache?
Multi-AZ is a feature that allows you to run in a more highly available configuration when designing your own ElastiCache cache. All ElastiCache Serverless caches are automatically run in a Multi-AZ configuration. An ElastiCache replication group consists of a primary and up to five read replicas. If Multi-AZ is enabled, then at least one replica is required per primary. During certain types of planned maintenance, or in the unlikely event of an ElastiCache node failure or AZ failure, ElastiCache will automatically detect the failure of a primary, select a read replica, and promote it to become the new primary. ElastiCache also propagates the DNS changes of the promoted read replica, so if your application is writing to the primary node endpoint, no endpoint change will be needed.
What are the benefits of using Multi-AZ and when should I use it?
The main benefits of running ElastiCache in Multi-AZ mode are enhanced availability and reduced administration. When running ElastiCache in a Multi-AZ configuration, your caches are eligible for the 99.99% availability SLA. If an ElastiCache primary node failure occurs, the impact on your ability to read and write to the primary is limited to the time it takes for automatic failover to complete. When Multi-AZ is enabled, ElastiCache node failover is automatic and requires no administration.
How does Multi-AZ work?
You can use Multi-AZ if you are using ElastiCache and have a replication group consisting of a primary node and one or more read replicas. If the primary node fails, ElastiCache will automatically detect the failure, select one of the available read replicas, and promote it to become the new primary. ElastiCache will propagate the DNS changes of the promoted replica so that your application can keep writing to the primary endpoint. ElastiCache will also spin up a new node to replace the promoted read replica in the same AZ as the failed primary. If the primary failed due to a temporary AZ disruption, the new replica will be launched once that AZ has recovered.
Can I have replicas in the same AZ as the primary?
Yes. Note that placing both the primary and the replicas in the same AZ will not make your ElastiCache replication group resilient to an AZ disruption.
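The replica limits and AZ placement guidance in this section can be checked before you create a replication group. The Python sketch below is a hypothetical pre-flight check, not an ElastiCache API: it encodes the rules stated above (at most five replicas, at least one replica when Multi-AZ is enabled, and at least one replica outside the primary's AZ for AZ resilience). The AZ names are examples only.

```python
from typing import List

MAX_REPLICAS = 5  # up to five read replicas per primary


def check_replication_group(primary_az: str, replica_azs: List[str], multi_az: bool) -> List[str]:
    """Return the problems found in a proposed layout (an empty list means none)."""
    problems: List[str] = []
    if len(replica_azs) > MAX_REPLICAS:
        problems.append(f"{len(replica_azs)} replicas exceeds the limit of {MAX_REPLICAS}")
    if multi_az and not replica_azs:
        problems.append("Multi-AZ requires at least one replica per primary")
    if replica_azs and all(az == primary_az for az in replica_azs):
        problems.append("all replicas are in the primary's AZ, so the group cannot survive an AZ disruption")
    return problems
```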
What events would cause ElastiCache to fail over to a read replica?
ElastiCache will fail over to a read replica in the event of any of the following:
- Loss of availability in primary’s AZ
- Loss of network connectivity to primary
- Compute unit failure on primary
Which read replica will be promoted in case of a primary node failure?
If there is more than one read replica, the read replica with the smallest asynchronous replication lag to the primary will be promoted.
Will I be alerted when automatic failover occurs?
Yes, ElastiCache will create an event to inform you that automatic failover occurred. You can use the DescribeEvents API to return information about events related to your ElastiCache node, or select the Events section in the ElastiCache Management Console.
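If you poll DescribeEvents, you can pick out completed failovers from the event messages. This Python sketch assumes the message text matches the failover event format listed earlier in this FAQ and that you have already extracted the Message strings from the response; treat the pattern as an assumption to verify against real events. Because both the cache cluster and the replication group log the same failover, duplicates are collapsed.

```python
import re
from typing import Iterable, List, Tuple

# Assumed message format, taken from the failover events listed in this FAQ.
FAILOVER_DONE = re.compile(
    r"Failover from primary node (?P<old>\S+) to replica node (?P<new>\S+) completed"
)


def completed_failovers(messages: Iterable[str]) -> List[Tuple[str, str]]:
    """Return unique (old_primary, new_primary) node ID pairs, in the order first seen."""
    found: List[Tuple[str, str]] = []
    for message in messages:
        match = FAILOVER_DONE.search(message)
        if match:
            pair = (match.group("old"), match.group("new"))
            if pair not in found:  # cluster and replication group both log it
                found.append(pair)
    return found
```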
After failover, my primary is now located in a different AZ than my other Amazon Web Services resources (for example, Amazon EC2 instances). Should I be concerned about latency?
AZs are engineered to provide low latency network connectivity to other AZs in the same Region. You should consider architecting your application and other Amazon Web Services resources with redundancy across multiple AZs so your application will be resilient in the event of an AZ disruption.
Where can I get more information about Multi-AZ?
For more information about Multi-AZ, see the ElastiCache documentation.
Backup and restore
What is Backup and Restore?
Backup and Restore is a feature that allows you to create snapshots of your ElastiCache caches. ElastiCache stores the snapshots, allowing you to use them later to restore caches. This is currently supported with ElastiCache for Valkey, ElastiCache for Redis OSS, and ElastiCache Serverless.
Why would I need snapshots?
Creating snapshots can be useful in case of data loss caused by node failure, as well as the unlikely event of a hardware failure. Another common reason to use backups is for archiving purposes. Snapshots are stored in Amazon S3.
Can I export ElastiCache snapshots to an Amazon S3 bucket owned by me?
Yes, you can export your ElastiCache snapshots to an authorized S3 bucket in the same Region as your cache.
I have multiple Amazon Web Services accounts using ElastiCache. Can I use ElastiCache snapshots from one account to warm start an ElastiCache cluster in a different one?
Yes. You must first copy your snapshot into an authorized S3 bucket of your choice in the same Region and then grant cross-account bucket permissions to the other account.
How much does it cost to use Backup and Restore?
ElastiCache provides storage space for one snapshot free of charge for each active ElastiCache cache. Additional storage is charged based on the space used by the snapshots, at ¥0.53 per GB per month (the same price in all China Regions). Data transfer for using the snapshots is free of charge.
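As a worked example of the pricing above: a single active cache with three 10 GB snapshots has one snapshot covered, leaving 20 GB billable, or 20 × ¥0.53 = ¥10.60 per month. The Python sketch below generalizes this. It assumes the free allowance covers exactly one snapshot of an active cache and conservatively exempts the smallest one, and that snapshots of a deleted cache get no allowance; the function name is hypothetical, and your bill remains the authoritative figure.

```python
from typing import Iterable

PRICE_PER_GB_MONTH = 0.53  # ¥ per GB per month, as stated above


def monthly_snapshot_cost(snapshot_sizes_gb: Iterable[float], cache_active: bool = True) -> float:
    """Estimate monthly snapshot storage charges (¥) for one cache's snapshots."""
    sizes = sorted(snapshot_sizes_gb)
    # Assumption: an active cache gets one snapshot free; exempting the smallest
    # one gives a conservative (higher) estimate. Deleted caches get no allowance.
    billable = sizes[1:] if cache_active else sizes
    return round(sum(billable) * PRICE_PER_GB_MONTH, 2)
```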
What happens to my snapshots if I delete my ElastiCache cache?
When you delete an ElastiCache cache, your manual snapshots are retained. You will also have an option to create a final snapshot before the cache is deleted. Automatic cache snapshots are not retained.
Enhanced engine
How is the engine within ElastiCache different from Valkey or Redis OSS?
The engine within ElastiCache is fully compatible with Valkey and Redis OSS but also comes with enhancements that improve performance, robustness, and stability. Some of the enhancements include:
- More usable memory: You can now safely allocate more memory for your application without risking increased swap usage during syncs and snapshots.
- Improved synchronization: More robust synchronization under heavy load and when recovering from network disconnections. Additionally, syncs are faster as both the primary and replicas no longer use the disk for this operation.
- Smoother failovers: In the event of a failover, your shard now recovers faster as replicas no longer flush their data to do a full resync with the primary.
- TLS offload and I/O multiplexing: ElastiCache is designed to better use available CPU resources by handling certain network-related processes on dedicated threads.
Do I need to change my application code to use the enhanced engine on ElastiCache?
The enhanced engine is fully compatible with Valkey or Redis OSS, thus you can enjoy its improved robustness and stability without the need to make any changes to your application code.
How much does it cost to use the enhanced engine?
There is no additional charge for using the enhanced engine.
Encryption
How can I use encryption in transit, at rest, and Valkey or Redis OSS AUTH?
Encryption in transit, encryption at rest, Valkey or Redis OSS AUTH, and Role-Based Access Control (RBAC) are features you can select when creating your ElastiCache cache. If you enable in-transit encryption, you can choose to use AUTH or RBAC for additional security and access control.
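To see why AUTH pairs naturally with in-transit encryption, note that the AUTH command carries the password as plain bytes inside the Valkey/Redis OSS wire protocol (RESP). The Python sketch below, using only the standard library, builds a TLS context that verifies the server certificate and encodes an RBAC-style AUTH username password command. The username, password, and helper names are hypothetical, and in practice your client library performs both steps for you.

```python
import ssl
from typing import Tuple


def encode_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        parts.append(f"${len(data)}\r\n".encode() + data + b"\r\n")
    return b"".join(parts)


def auth_over_tls(username: str, password: str) -> Tuple[ssl.SSLContext, bytes]:
    """Return a certificate-verifying TLS context and the AUTH command to send after connecting."""
    context = ssl.create_default_context()  # hostname and certificate checks enabled
    return context, encode_command("AUTH", username, password)
```

Without TLS, the encoded bytes, password included, would cross the network readable by anyone on the path.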
What does encryption at rest for ElastiCache provide?
Encryption at rest provides mechanisms to guard against unauthorized access of your data. When enabled, it encrypts the following aspects:
- Disk during sync, backup, and swap operations
- Backups stored in Amazon S3
ElastiCache offers default (service-managed) encryption at rest, as well as the ability to use your own symmetric customer-managed keys in Amazon KMS.
What does encryption in transit for ElastiCache provide?
The encryption in transit feature facilitates encryption of communications between clients and ElastiCache, as well as between the servers (primary and read replicas). Read more about ElastiCache in-transit encryption.
Is there any action needed to renew TLS certificates?
No, ElastiCache manages certificate expiration and renewal behind the scenes. No user action is necessary for ongoing certificate maintenance.
Are there additional costs for using encryption?
No, there are no additional costs for using encryption.
Global Datastore
What is ElastiCache Global Datastore?
Global Datastore is a feature of ElastiCache that provides fully managed, fast, reliable and security-focused cross-Region replication. With Global Datastore, you can write to your cache in one Region and have the data available for read in up to two other cross-Region replica clusters, thereby enabling low-latency reads and disaster recovery across Regions.
Designed for real-time applications with a global footprint, Global Datastore typically replicates data across Regions in under one second, increasing the responsiveness of your applications by providing geolocal reads closer to end users. In the unlikely event of Regional degradation, one of the healthy cross-Region replica caches can be promoted to become the primary with full read and write capabilities. Once initiated, the promotion typically completes in less than a minute, allowing your applications to remain available.
Which engine versions support Global Datastore?
Global Datastore is supported on ElastiCache version 7.2 for Valkey and on ElastiCache version 5.0.6 onward for Redis OSS.
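A quick compatibility check can be scripted from the minimum versions above. This Python sketch assumes that every later version of each engine also supports Global Datastore, and it uses the engine names valkey and redis as ElastiCache reports them; the function name is hypothetical.

```python
MIN_GLOBAL_DATASTORE_VERSION = {"valkey": (7, 2, 0), "redis": (5, 0, 6)}


def supports_global_datastore(engine: str, version: str) -> bool:
    """True if the engine/version meets the minimum stated above (later versions assumed supported)."""
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (3 - len(parts))  # pad "7.2" to 7.2.0
    minimum = MIN_GLOBAL_DATASTORE_VERSION.get(engine.lower())
    return minimum is not None and tuple(parts[:3]) >= minimum
```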
How many Amazon Web Services Regions can I replicate to?
You can replicate to up to two secondary Regions within a Global Datastore. The caches in secondary Regions can be used to serve low-latency local reads and for disaster recovery in the unlikely event of Regional degradation.
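Geolocal reads amount to picking the closest of the primary and up to two secondary Regions, while writes still go to the primary Region. The Python sketch below assumes you measure round-trip latency to each Region yourself; the function name, Region choices, and latency figures are hypothetical examples.

```python
from typing import Dict, List

MAX_SECONDARY_REGIONS = 2


def pick_read_region(primary: str, secondaries: List[str], latency_ms: Dict[str, float]) -> str:
    """Return the lowest-latency Region for reads; unmeasured Regions are treated as unreachable."""
    if len(secondaries) > MAX_SECONDARY_REGIONS:
        raise ValueError("a Global Datastore has at most two secondary Regions")
    candidates = [primary, *secondaries]
    # min() keeps the first candidate on ties, so the primary wins when nothing is measured.
    return min(candidates, key=lambda region: latency_ms.get(region, float("inf")))
```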
How can I create a Global Datastore?
You can set up a Global Datastore by using an existing cache or creating a new cache to serve as the primary. You can create a Global Datastore in just a few steps in the ElastiCache Management Console or with the latest Amazon Web Services SDK or Amazon CLI. Global Datastore is also supported in CloudFormation.
Does ElastiCache automatically fail over a Global Datastore to promote a secondary cluster when the primary cluster (Region) is degraded?
No, ElastiCache doesn’t automatically promote a secondary cluster when the primary cluster (Region) is degraded. You can manually initiate the failover by promoting a secondary cluster to become the primary. Failover and promotion of a secondary cluster typically complete in less than one minute.
What is the pricing for Global Datastore?
ElastiCache does not charge any premium to use Global Datastore. You pay for the primary and secondary caches in your Global Datastore and for the cross-Region data transfer traffic.
What Recovery Point Objective (RPO) and Recovery Time Objective (RTO) can I expect with Global Datastore?
ElastiCache doesn’t provide an SLA for RPO and RTO. The RPO varies with replication lag between Regions, which depends on network latency between Regions and cross-Region network traffic congestion. The RPO of Global Datastore is typically under one second, so data written in a primary Region is typically available in secondary Regions within one second. The RTO of Global Datastore is typically under a minute: once a failover to a secondary cluster is initiated, ElastiCache typically promotes the secondary to full read and write capabilities in under a minute.
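Since RPO tracks cross-Region replication lag, a rough way to size potential data loss is to multiply your write rate by the worst lag you observe. For instance, at 5,000 writes per second and a worst-case lag of 0.5 seconds, up to about 2,500 writes could be unreplicated at the moment of a Regional failure. This Python sketch assumes a steady write rate and that you collect lag samples from your own monitoring; the function name is hypothetical, and the result is an estimate, not a guarantee.

```python
from typing import Sequence


def writes_at_risk(writes_per_second: float, lag_samples_seconds: Sequence[float]) -> float:
    """Estimate writes that could be unreplicated if the primary Region failed at the worst observed lag."""
    if not lag_samples_seconds:
        raise ValueError("need at least one lag sample")
    return writes_per_second * max(lag_samples_seconds)
```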