Patterns for updating Amazon OpenSearch Service index settings and mappings
Customers have been seeking guidance on best practices and patterns for changing index settings without an index maintenance window or affecting overall performance of the OpenSearch Service domain. This is part one of a two-part series, in which we show how to make settings changes to OpenSearch Service indexes with little to no downtime while supporting active producers and consumers of the data.
Indexes in OpenSearch Service
In OpenSearch Service,
OpenSearch Service indexes have two types of settings that periodically need adjustments as the profile of your workload changes:
- Dynamic – Settings that can be changed on the index at any time
- Static – Settings that can only be defined at the index creation time and can’t be changed throughout the index lifecycle
Dynamic index settings can be changed at any time using the
index.number_of_replicas or index.auto_expand_replicas, and depending on the domain’s configuration, can cause a temporary increase in resource utilization while the domain adds replicas. We recommend maintaining at least one replica for redundancy reasons, and multiple replicas for high query throughput use cases.
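As a minimal sketch of a dynamic settings change, the following Python builds the request that adjusts the replica count through the update settings endpoint (PUT <index>/_settings). The index name my-index and the helper name are hypothetical, and sending the request with your own signed HTTP client is left out.

```python
import json


def replica_update_request(index_name, replica_count):
    """Build the method, path, and JSON body for a dynamic replica-count change."""
    if replica_count < 0:
        raise ValueError("replica_count must be zero or greater")
    body = {"index": {"number_of_replicas": replica_count}}
    return "PUT", f"/{index_name}/_settings", json.dumps(body)


# "my-index" is a hypothetical index name used only for illustration.
method, path, body = replica_update_request("my-index", 2)
print(method, path, body)
```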
Static index settings such as mapping and shard count are defined at index creation time and can’t be changed throughout the index lifecycle. In this post, we focus on patterns and best practices for working with static index settings, such as changing shard count and patterns for updating index mappings.
All operations and procedures that we cover in this post are issued directly to the
As with any use case, there is a spectrum of solutions and constraints to be considered. We start with a few simple foundational patterns and build on them, accounting for use cases with more operational constraints and working with large datasets.
Solution overview
OpenSearch Service has a default sharding strategy of 5:1, where each index is divided into five primary shards. Within each index, each primary shard also has its own replica. OpenSearch Service automatically assigns primary shards and replica shards to separate data nodes.
It’s not possible to increase the primary shard number of an existing index, meaning an index must be recreated if you want to increase the primary shard count.
The _reindex operation is resource intensive. We recommend disabling replicas in your destination index by setting number_of_replicas to 0 and re-enabling replicas when the reindex process is complete. If you have your data in a second, durable store, the simplest thing to do is pause updates and reindex from the source. But that’s not always possible. In this post, we share several patterns that enable you to update even static index settings like shard count.
One of the major advantages of using the _reindex operation is that it doesn’t require placing the source index in a read-only mode (data producers may continue to write the data while reindexing is in progress). Also, the _reindex operation enables reprocessing. With the _reindex operation, you can copy all or a subset of documents that you select through a query to another index. In its most basic form, the _reindex operation requires you to specify a source and a destination index and configuration parameters.
The following are some of the use cases supported by the reindex API:
- Reindexing all documents
- Reindexing from a remote cluster when transferring data between clusters
- Reindexing a subset of documents that match a search query
- Combining one or more indexes
- Transforming documents during reindexing
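Several of these use cases can be expressed in a single request body. The sketch below is illustrative only: it combines two source indexes, selects a subset with a query, and applies a Painless script during the copy. The index names, the status field, and the internal_notes field are all hypothetical.

```python
import json


def build_reindex_body(sources, destination, query=None, script_source=None):
    """Assemble a _reindex request body from optional query and script parts."""
    body = {"source": {"index": sources}, "dest": {"index": destination}}
    if query is not None:
        # Restrict the copy to documents matching the query.
        body["source"]["query"] = query
    if script_source is not None:
        # Transform each document as it is copied.
        body["script"] = {"lang": "painless", "source": script_source}
    return body


# Hypothetical index and field names, for illustration only.
reindex_body = build_reindex_body(
    sources=["orders-2022", "orders-2023"],
    destination="orders-combined",
    query={"term": {"status": "shipped"}},
    script_source="ctx._source.remove('internal_notes')",
)
print(json.dumps(reindex_body, indent=2))
```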
To increase the shard count, you create a new index, set number_of_shards to your desired primary shard count, set number_of_replicas to 0, update the new index mapping based on your requirement, and run the reindex API operation. After the _reindex operation is complete, we recommend updating number_of_replicas in the destination index settings to achieve your desired level of replica shards.
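The sequence above can be sketched as an ordered list of requests. This is a rough outline under stated assumptions, not a complete procedure: the index names and shard counts are hypothetical, and the code only builds the requests, so you would still send each one with your own client and wait for the reindex to finish before the last step.

```python
def shard_increase_plan(source, destination, shards, final_replicas):
    """Return the ordered (method, path, body) steps for a reindex-based shard increase."""
    return [
        # 1. Create the destination with the new shard count and no replicas.
        ("PUT", f"/{destination}",
         {"settings": {"index": {"number_of_shards": shards, "number_of_replicas": 0}}}),
        # 2. Copy the documents from the source index.
        ("POST", "/_reindex",
         {"source": {"index": source}, "dest": {"index": destination}}),
        # 3. Re-enable replicas once the copy is complete.
        ("PUT", f"/{destination}/_settings",
         {"index": {"number_of_replicas": final_replicas}}),
    ]


# Hypothetical names and counts, for illustration only.
plan = shard_increase_plan("source_index_name", "destination_index_name", 10, 1)
for method, path, body in plan:
    print(method, path, body)
```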
In the following sections, we provide a walkthrough of the reindex API operation. Note that the patterns and procedures presented in this post have been validated on Amazon OpenSearch Service version 1.3.
Prerequisites
The source of the documents must be stored in the index (the "_source" setting at the index mappings level must be set to "enabled": true, which is the default). The _reindex operation can’t be used without source documents.
Create the destination index with your desired mapping (field or data type). For demonstration purposes, our source index has a field ratings defined as long, and we want the same field to use the float data type in the destination index:
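A request body for such a destination index might look like the following sketch. Only the ratings field and its float type come from the description above. The helper name, the index name, and the shard count are assumptions chosen for illustration.

```python
import json


def destination_index_request(index_name, shard_count):
    """Build a create-index request that maps the ratings field as float."""
    body = {
        "settings": {
            "index": {
                "number_of_shards": shard_count,  # hypothetical value
                "number_of_replicas": 0,          # replicas are re-enabled after reindexing
            }
        },
        "mappings": {
            "properties": {
                # The source index defines ratings as long; the destination uses float.
                "ratings": {"type": "float"}
            }
        },
    }
    return "PUT", f"/{index_name}", body


create_method, create_path, create_body = destination_index_request("destination_index_name", 10)
print(create_method, create_path)
print(json.dumps(create_body, indent=2))
```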
Ensure that you have sufficient disk space on each hot tier data node to house the new index primary shards and, depending on your configuration, replica shards. If disk space is insufficient,
The following screenshot shows the output.
Check the disk.avail metric for hot storage tier nodes to validate your available disk space.
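One way to automate this check is sketched below. It assumes the rows come from GET _cat/allocation?format=json&bytes=b, which reports disk.avail per node as a string of bytes. The node names and sizes are made-up sample data, the helper is hypothetical, and the sketch does not distinguish hot tier nodes from other nodes.

```python
GIB = 1024 ** 3


def nodes_short_on_disk(allocation_rows, required_bytes_per_node):
    """Return the names of nodes whose disk.avail is below the required size."""
    short = []
    for row in allocation_rows:
        avail = row.get("disk.avail")
        if avail is None:
            # Rows without disk values (for example, unassigned shards) are skipped.
            continue
        if int(avail) < required_bytes_per_node:
            short.append(row["node"])
    return short


# Made-up sample rows in the shape assumed for _cat/allocation JSON output.
sample_rows = [
    {"node": "node-1", "disk.avail": str(120 * GIB)},
    {"node": "node-2", "disk.avail": str(30 * GIB)},
    {"node": "UNASSIGNED", "disk.avail": None},
]
short_nodes = nodes_short_on_disk(sample_rows, 50 * GIB)
print(short_nodes)
```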
Use the reindex API operation
The _reindex operation snapshots the index at the beginning of its run and performs processing on a snapshot to minimize impact on the source index. The source index can still be used for querying and processing the data. Although the _reindex operation can run both synchronously and asynchronously, we recommend using an asynchronous run. You can monitor the progress of the _reindex operation, cancel its run, or throttle its run using the _tasks, _cancel, and _rethrottle operations, respectively.
Because the _reindex operation doesn’t require the source index to be placed in a read-only mode, query and index update operations are free to continue.
Use the reindex API with the following command:
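A minimal sketch of an asynchronous reindex request follows. It uses the wait_for_completion=false query parameter so that the call returns a task ID instead of blocking, and optionally sets requests_per_second as a throttle. The helper name and the throttle value are assumptions. The index names match the placeholders used in this post.

```python
from urllib.parse import urlencode


def async_reindex_request(source, destination, requests_per_second=None):
    """Build an asynchronous _reindex request as (method, path, body)."""
    params = {"wait_for_completion": "false"}
    if requests_per_second is not None:
        # Optional throttle, expressed in sub-requests per second.
        params["requests_per_second"] = str(requests_per_second)
    path = "/_reindex?" + urlencode(params)
    body = {"source": {"index": source}, "dest": {"index": destination}}
    return "POST", path, body


reindex_method, reindex_path, reindex_request_body = async_reindex_request(
    "source_index_name", "destination_index_name", requests_per_second=500
)
print(reindex_method, reindex_path, reindex_request_body)
```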
The source indexes as part of the _reindex API operation can be supplemented with a query for
Note that the _reindex operation can be throttled via the _rethrottle API or settings passed as a parameter. You can cancel the task with the _cancel operation:
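The three task operations map to the request paths sketched below. The task ID shown is a made-up value in the node_id:task_number form returned by an asynchronous _reindex call, and the helper names are hypothetical.

```python
def task_status_request(task_id):
    """Check progress of a running task."""
    return "GET", f"/_tasks/{task_id}"


def task_cancel_request(task_id):
    """Cancel a running task."""
    return "POST", f"/_tasks/{task_id}/_cancel"


def reindex_rethrottle_request(task_id, requests_per_second):
    """Change the throttle of a running reindex task."""
    return "POST", f"/_reindex/{task_id}/_rethrottle?requests_per_second={requests_per_second}"


# Made-up task ID, for illustration only.
task_id = "oTUltX4IQMOUUVeiohTt8A:12345"
print(task_status_request(task_id))
print(task_cancel_request(task_id))
print(reindex_rethrottle_request(task_id, 100))
```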
The following screenshot shows the output of the _reindex operation for reindexing from source_index_name to destination_index_name.
When the operation is complete, both consumers and producers of the source indexes or aliases need to re-point to the destination index, and the same _reindex operation needs to run again to catch up on any create, update, or delete operations performed on the source indexes while the initial _reindex operation was running. This step is required because the _reindex operation runs on a snapshot of the index. This time, the _reindex operation needs to run with "op_type": "create" to realign missing and out-of-version documents. See the following API command:
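A catch-up request body could be sketched as follows. The op_type setting comes from the text above. Adding "conflicts": "proceed" is an assumption made here so that documents already present in the destination are counted as version conflicts rather than aborting the run. In this sketch, such documents are skipped and only documents missing from the destination are created.

```python
import json


def catch_up_reindex_body(source, destination):
    """Build a _reindex body that only creates documents missing from the destination."""
    return {
        # Assumption: continue past version conflicts instead of aborting.
        "conflicts": "proceed",
        "source": {"index": source},
        "dest": {"index": destination, "op_type": "create"},
    }


catch_up_body = catch_up_reindex_body("source_index_name", "destination_index_name")
print(json.dumps(catch_up_body, indent=2))
```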
After the operation is complete and data integrity in the destination index is confirmed, you can delete the source index to reclaim disk space.
Increase index shard count using the split index API
The split index API and shrink index API cover a large array of use cases and present with low resource utilization in the domain. However, these APIs require closing the index for write operations and don’t address use cases that require changes to the mapping settings.
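For comparison with the reindex approach, a split can be sketched as two requests: one that blocks writes on the source index and one that calls _split with the new shard count. The index names and the target shard count are hypothetical, and the sketch does not validate the shard count. The split API only accepts target counts that are compatible with the source index, so check the API documentation for the exact constraints.

```python
def split_index_requests(source, target, target_shards):
    """Return the (method, path, body) steps for splitting an index."""
    return [
        # 1. Block writes on the source index, as the split API requires.
        ("PUT", f"/{source}/_settings",
         {"index": {"blocks": {"write": True}}}),
        # 2. Split the source into a new index with more primary shards.
        ("POST", f"/{source}/_split/{target}",
         {"settings": {"index": {"number_of_shards": target_shards}}}),
    ]


# Hypothetical names and shard count, for illustration only.
split_steps = split_index_requests("source_index_name", "split_index_name", 10)
for method, path, body in split_steps:
    print(method, path, body)
```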
In OpenSearch Service, the number_of_shards index setting is immutable and defined at the time when the index is created. However, although this setting is immutable, there are several patterns to increase or decrease index shard count without needing to explicitly reindex the data. You can alternatively use the
In OpenSearch Service, an
Although the majority of use cases focus on increasing a number of shards on an existing index due to data growth, there are also instances where you need to reduce the number of shards on an existing index. Such cases occasionally happen when an actual index size is less than what was anticipated when the index was created, and you want to align with a
Conclusion
In this post, we reviewed best practices when
In our next post, we’ll explore patterns for remote indexing to alleviate load and resource utilization on the source OpenSearch Service domain.
About the Authors
Mikhail Vaynshteyn is a Solutions Architect with Amazon Web Services. Mikhail works with healthcare and life sciences customers to build solutions that help improve patients’ outcomes. Mikhail specializes in data analytics services.
Sukhomoy Basak is a Solutions Architect at Amazon Web Services, with a passion for data and analytics solutions. Sukhomoy works with enterprise customers to help them architect, build, and scale applications to achieve their business outcomes.