Explore etcd Defragmentation in Amazon EKS
Introduction
Amazon Elastic Kubernetes Service (Amazon EKS) manages the Kubernetes control plane, including etcd, on your behalf. In this post, we look at how etcd organizes its data, why its database becomes fragmented over time, how defragmentation works, and what you can do to minimize its impact on your clusters.
Understanding etcd in Amazon EKS
Etcd serves as the primary data store in Amazon EKS, storing cluster configuration, state, and metadata. It maintains a consistent, highly available record of every object in the cluster.
One primary function of etcd is to store and organize Kubernetes API data as key-value pairs. It keeps track of the current state of objects and configurations in the cluster, making sure that the actual state aligns with the desired state specified by cluster administrators, application developers, or automated controllers.
Multi-Version Concurrency Control
Etcd is a persistent key-value store. It employs a Multi-Version Concurrency Control (MVCC) mechanism to ensure data consistency while also allowing concurrent read and write operations. Each key-value pair is associated with an increasing version number. When a new value is written for a key, it receives a higher version number than the previous one. Versions are unique and strictly ordered. The previous version is retained, which allows for historical tracking of changes. This append-only nature of the store causes the database size to grow indefinitely.
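The append-only MVCC behavior described above can be illustrated with a minimal sketch. This is not etcd's actual implementation, just a toy store showing how each write gets a new, strictly increasing revision while old versions remain readable (and keep consuming space):

```python
# Toy MVCC store: an append-only log of (revision, key, value) entries.
# Every write receives a new, strictly increasing revision number.
class MVCCStore:
    def __init__(self):
        self.revision = 0
        self.history = []  # append-only: old versions are never overwritten

    def put(self, key, value):
        self.revision += 1
        self.history.append((self.revision, key, value))
        return self.revision

    def get(self, key, revision=None):
        """Read the newest value of `key` at or before a given revision."""
        rev = revision if revision is not None else self.revision
        for r, k, v in reversed(self.history):
            if k == key and r <= rev:
                return v
        return None

store = MVCCStore()
r1 = store.put("/pods/web", "v1")
r2 = store.put("/pods/web", "v2")
assert store.get("/pods/web") == "v2"               # latest revision wins
assert store.get("/pods/web", revision=r1) == "v1"  # history is retained
assert len(store.history) == 2                      # old versions still use space
```

The last assertion is the key point: even a single key that is updated repeatedly grows the store, which is why compaction and defragmentation are needed.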
Data fragmentation
As data is updated and deleted over time, the etcd database becomes fragmented: freed space is scattered in small, non-contiguous regions throughout the database file. The file stays large on disk even though much of it holds no live data, and storage is used inefficiently.
Solution overview
How defragmentation works in etcd
Compaction
Compaction in etcd focuses on identifying and removing obsolete data to avoid eventual storage space exhaustion. Each key-value pair in etcd is assigned a unique index number, known as a revision. Compaction works on these revision numbers to identify data that's no longer needed or has expired. A retention policy determines the revision range of data to be retained. Data outside this range is considered eligible for removal during compaction. After deletion, storage space previously occupied by that data is released and can be used for new data. In Amazon EKS, the API server triggers compaction every five minutes.
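A simplified sketch of revision-based compaction, assuming the same (revision, key, value) log as before: everything below the compaction revision is dropped, except each key's newest entry at or below that revision, which represents the key's state at the compaction point (this is a simplification of etcd's actual behavior, which also handles tombstones for deleted keys):

```python
# Compact an append-only history at `compact_rev`: keep all revisions above
# the threshold, plus each key's state as of the compaction revision.
def compact(history, compact_rev):
    state_at_compaction = {}
    for rev, key, value in history:
        if rev <= compact_rev:
            # Later entries overwrite earlier ones, leaving the newest
            # revision <= compact_rev per key.
            state_at_compaction[key] = (rev, key, value)
    recent = [e for e in history if e[0] > compact_rev]
    return sorted(state_at_compaction.values()) + recent

history = [(1, "a", "x1"), (2, "b", "y1"), (3, "a", "x2"), (4, "a", "x3")]
# Compacting at revision 3 discards the superseded revision 1 of key "a".
assert compact(history, 3) == [(2, "b", "y1"), (3, "a", "x2"), (4, "a", "x3")]
```

Note that compaction frees space logically inside the database file; it does not shrink the file on disk. That is what defragmentation does.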
Defragmentation
Defragmentation consists of rewriting the data into contiguous files, effectively eliminating fragmentation and enhancing data locality. Etcd analyzes the data storage to identify fragmentation levels and areas where data reorganization is beneficial. During analysis, etcd identifies contiguous storage space where related data can be moved to. Etcd then moves data to consolidate related key-value pairs. Data that is scattered is relocated to these contiguous storage locations. Unused space is released back to the filesystem.
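Conceptually, defragmentation can be sketched as copying the live pages of the database file, in order, into a new contiguous file and releasing the rest. This toy model (free pages marked as None) is an illustration, not etcd's storage engine:

```python
# Toy defragmentation: rewrite live pages contiguously into a new "file";
# the space previously taken by free pages is returned to the filesystem.
def defragment(pages):
    """pages: list of database pages, where None marks a free page."""
    return [p for p in pages if p is not None]  # contiguous, smaller file

fragmented = ["kv-1", None, None, "kv-2", None, "kv-3"]
compacted = defragment(fragmented)
assert compacted == ["kv-1", "kv-2", "kv-3"]   # data locality is restored
assert len(compacted) < len(fragmented)         # on-disk size shrinks
```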
Impact on API availability
It is important to note that defragmentation is a blocking operation, which means that while the process is in progress, it prevents any read or write operations from taking place. This can impact the API server's communication with etcd to serve read and write requests to clients. The time taken for defragmentation depends on the amount of compacted data that needs to be copied into the new database file. As a rough guide, it can take up to 10 seconds per gigabyte of data to be reorganized. We recommend monitoring the etcd database size and deleting unwanted objects to minimize the performance impact on the API server. This topic is covered in greater detail in the Amazon EKS best practices guide.
On larger etcd databases, it is common for the API server to return request timeout errors during defragmentation.
The Amazon EKS development team is actively working on improvements to minimize the impact of defragmentation on API availability, including optimizations to the gRPC (gRPC Remote Procedure Calls) communication path between the API server and etcd.
Handling API timeouts
Intermittent timeouts from the API server are to be expected during defragmentation. Therefore, it is a best practice to design your client applications to handle these situations gracefully. By building robust error handling mechanisms and incorporating retry strategies, client applications can mitigate the impact of intermittent timeouts and maintain reliability. When a timeout occurs, your application should retry with exponential backoff and jitter rather than failing immediately or retrying in a tight loop.
Walkthrough
Managing defragmentation in Amazon EKS
The underlying etcd cluster and its defragmentation process are handled transparently by the Amazon EKS control plane.
Amazon EKS employs automated maintenance processes to ensure the health and stability of etcd. Amazon Web Services takes care of provisioning, scaling, and managing the etcd instances for you. This proactive approach helps ensure that the etcd cluster nodes remain healthy and that potential fragmentation issues are mitigated before they impact the overall Amazon EKS cluster's performance.
Minimizing impact of defragmentation
The size of the database is a crucial factor that influences the defragmentation time in etcd. As the etcd database grows larger, the defragmentation process becomes more time-consuming due to the increased volume of data that needs to be reorganized and compacted. To minimize the impact of defragmentation, consider the following practices:
1. Remove unused or orphaned objects
Regularly audit your cluster to identify and remove unused or orphaned objects. These may include old Deployments, ReplicaSets, or Services that are no longer in use. Deleting unnecessary objects reduces the storage footprint in etcd and minimizes the impact of fragmentation. Open source cleanup tools can help automate this audit.
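One hypothetical starting point for such an audit (the column position is an assumption based on default `kubectl` output): ReplicaSets scaled to zero replicas are often leftovers from old Deployment rollouts and are candidates for cleanup once confirmed unused.

```shell
# List ReplicaSets whose desired replica count is 0 (column 3 in the
# default --all-namespaces output); verify before deleting anything.
kubectl get replicasets --all-namespaces --no-headers | awk '$3 == 0'
```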
2. Sparing use of ConfigMaps and Secrets
Avoid storing large amounts of data in ConfigMaps and Secrets. Use these resources sparingly, keeping them concise and organized to reduce the number of large objects stored in etcd. Alternatively, consider storing large payloads outside the cluster (for example, in object storage or a dedicated secrets manager) and referencing them from your workloads.
3. Avoid large Pod specs
Pod specifications with sizable amounts of embedded metadata (512 KB or more) can quickly inflate an etcd database. This is especially problematic when a Deployment enters a crash loop and keeps generating new revisions, eventually consuming all available etcd storage.
4. Implement object lifecycle management
Define and enforce object lifecycle management policies. Set expiration dates or implement retention policies for objects that have a limited lifespan. Automate the removal of expired objects to prevent unnecessary data accumulation in etcd.
Clean up finished Jobs automatically by specifying the .spec.ttlSecondsAfterFinished field in the Job spec.
Limit the number of old ReplicaSets retained for each Deployment by setting the .spec.revisionHistoryLimit field.
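The two settings above can be sketched in manifests like the following (names and images are illustrative):

```yaml
# A Job that is garbage-collected 100 seconds after it finishes.
apiVersion: batch/v1
kind: Job
metadata:
  name: cleanup-example            # hypothetical name
spec:
  ttlSecondsAfterFinished: 100     # delete the Job object after completion
  template:
    spec:
      containers:
        - name: main
          image: busybox
          command: ["sh", "-c", "echo done"]
      restartPolicy: Never
---
# A Deployment that retains only two old ReplicaSets in etcd.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-example                # hypothetical name
spec:
  revisionHistoryLimit: 2          # default is 10; lower values keep etcd smaller
  replicas: 1
  selector:
    matchLabels:
      app: web-example
  template:
    metadata:
      labels:
        app: web-example
    spec:
      containers:
        - name: web
          image: nginx
```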
5. Regularly monitor etcd storage usage
Monitor etcd storage usage to gain insights into resource utilization and identify any abnormal growth patterns. This helps you proactively address storage-related issues and take corrective actions, such as optimizing object usage, if required. See the Amazon EKS best practices guide for recommended monitoring approaches.
- Amazon CloudWatch: Amazon EKS integrates with Amazon CloudWatch, which allows you to monitor various cluster metrics, including etcd disk usage. You can use CloudWatch Insights to write custom queries and extract the relevant etcd metrics.
- kubectl: You can use the kubectl command-line tool to fetch etcd-related metrics directly. For example, on Amazon EKS v1.26+ you can query the API server's /metrics endpoint for database size metrics.
- Prometheus and Grafana: Amazon Web Services Distro for OpenTelemetry (ADOT) has built-in support for Amazon EKS API server monitoring. To learn more about ADOT, see monitoring Amazon EKS API server.
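For the kubectl option above, a query like the following can surface the database size. The metric names are an assumption based on Kubernetes version (the etcd_ prefixed name was deprecated around v1.26 in favor of the apiserver_storage_ name); verify which your cluster exposes:

```shell
# Scrape control-plane metrics through the API server and filter for the
# database size. Newer clusters report apiserver_storage_db_total_size_in_bytes;
# older ones report etcd_db_total_size_in_bytes.
kubectl get --raw=/metrics | grep -E "apiserver_storage_db_total_size_in_bytes|etcd_db_total_size_in_bytes"
```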
Conclusion
In this post, we showed you how defragmentation, as a key functionality within etcd, plays a crucial role in optimizing Amazon EKS performance and ensuring cluster stability. By proactively reorganizing data and optimizing storage, defragmentation improves overall cluster efficiency and enhances resource utilization. As Amazon EKS continues to empower organizations with scalable and resilient Kubernetes deployments, understanding the nuances of etcd becomes essential for administrators who want to unlock the full potential of their Amazon EKS clusters.
For additional reading, review the upstream etcd documentation on maintenance and the Amazon EKS best practices guide.