10.2. Optimizing the handling of Lucene segments

Once Elasticsearch receives documents from your application, it indexes them in memory in inverted indices called segments. From time to time, these segments are written to disk. Recall from chapter 3 that these segments can’t be changed—only deleted—to make it easy for the operating system to cache them. Also, bigger segments are periodically created from smaller segments to consolidate the inverted indices and make searches faster.

There are lots of knobs to influence how Elasticsearch handles these segments at every step, and configuring them to fit your use case often gives important performance gains. In this section, we’ll discuss these knobs and divide them into three categories:

How often to refresh and flush— Refreshing reopens Elasticsearch’s view on the index, making newly indexed documents available for search. Flushing commits indexed data from memory to the disk. Both refresh and flush operations are expensive in terms of performance, so it’s important to configure them correctly for your use case.

Merge policies— Lucene (and by inheritance, Elasticsearch) stores data in immutable groups of files called segments. As you index more data, more segments are created. Because a search across many segments is slow, small segments are merged in the background into bigger segments to keep their number manageable. Merging is performance intensive, especially for the I/O subsystem. You can adjust the merge policy to influence how often merges happen and how big segments can get.

Store and store throttling— Elasticsearch limits the impact of merges on your system’s I/O to a certain number of bytes per second. Depending on your hardware and use case, you can change this limit. There are also other options for how Elasticsearch uses the storage. For example, you can choose to store your indices only in memory.

We’ll start with the category that typically gives you the biggest performance gain of the three: choosing how often to refresh and flush.

10.2.1. Refresh and flush thresholds

Recall from chapter 2 that Elasticsearch is often called near real time; that’s because searches are often not run on the very latest indexed data (which would be real time) but close to it.

This near-real-time label fits because normally Elasticsearch keeps a point-in-time view of the index opened, so multiple searches would hit the same files and reuse the same caches. During this time, newly indexed documents won’t be visible to those searches until you do a refresh.

Refreshing, as the name suggests, refreshes this point-in-time view of the index so your searches can hit your newly indexed data. That’s the upside. The downside is that each refresh comes with a performance penalty: some caches will be invalidated, slowing down searches, and the reopening process itself needs processing power, slowing down indexing.

When to refresh

By default, Elasticsearch automatically refreshes every index every second. You can change this interval per index by updating its settings, which can be done at runtime. For example, the following command sets the automatic refresh interval to 5 seconds:

% curl -XPUT localhost:9200/get-together/_settings -d '{
  "index.refresh_interval": "5s"
}'

Tip

To confirm that your changes were applied, you can get all the index settings by running curl localhost:9200/get-together/_settings?pretty.

As you increase the value of refresh_interval, you’ll have more indexing throughput because you’ll spend fewer system resources on refreshing.

Alternatively, you can set refresh_interval to -1 to effectively disable automatic refreshes and rely on manual refresh. This works well for use cases where indices change only periodically in batches, such as for a retail chain where products and stocks are updated every night. Indexing throughput is important because you want to consume those updates quickly, but data freshness isn’t, because you don’t get the updates in real time, anyway. So you can do nightly bulk index/updates with automatic refresh disabled and refresh manually when you’ve finished.

To refresh manually, hit the _refresh endpoint of the index (or indices) you want to refresh:

% curl localhost:9200/get-together/_refresh
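Putting the two settings together, a nightly batch load might look like the following sketch. It reuses the get-together index from the earlier examples; the bulk-indexing step itself is omitted, and restoring the one-second interval afterward is optional (an assumption on our part; keep it at -1 if you only ever load in batches):

% curl -XPUT localhost:9200/get-together/_settings -d '{
  "index.refresh_interval": "-1"
}'
# ... run the nightly bulk indexing here ...
% curl localhost:9200/get-together/_refresh
% curl -XPUT localhost:9200/get-together/_settings -d '{
  "index.refresh_interval": "1s"
}'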

When to flush

If you’re used to older versions of Lucene or Solr, you might be inclined to think that when a refresh happens, all data that was indexed (in memory) since the last refresh is also committed to disk.

With Elasticsearch (and Solr 4.0 or later), the process of refreshing and the process of committing in-memory segments to disk are independent. Indeed, data is indexed first in memory, but after a refresh, Elasticsearch will happily search the in-memory segments as well. The process of committing in-memory segments to the actual Lucene index you have on disk is called a flush, and it happens whether the segments are searchable or not.

To make sure that in-memory data isn’t lost when a node goes down or a shard is relocated, Elasticsearch keeps track of the indexing operations that weren’t flushed yet in a transaction log. Besides committing in-memory segments to disk, a flush also clears the transaction log, as shown in figure 10.2.

Figure 10.2. A flush moves segments from memory to disk and clears the transaction log.
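You can also trigger a flush manually by hitting the _flush endpoint of an index, for example right before shutting a node down. A minimal sketch, reusing the get-together index from the earlier examples:

% curl -XPOST localhost:9200/get-together/_flush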

A flush is triggered in one of the following conditions, as shown in figure 10.3:

Figure 10.3. A flush is triggered when the memory buffer or transaction log is full or at an interval.

The memory buffer is full.

A certain amount of time passed since the last flush.

The transaction log hit a certain size threshold.

To control how often a flush happens, you have to adjust the settings that control those three conditions.

The memory buffer size is defined in the elasticsearch.yml configuration file through the indices.memory.index_buffer_size setting. This controls the overall buffer for the entire node, and the value can be either a percent of the overall JVM heap, like 10%, or a fixed value, like 100 MB.
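As a sketch, the corresponding line in elasticsearch.yml might look like this (10% is the commonly documented default, shown here only as an example):

# elasticsearch.yml: node-level indexing buffer, shared by all shards on the node
# (read at node startup; can be a heap percentage or an absolute value like 100mb)
indices.memory.index_buffer_size: 10%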

Transaction log settings are index specific and control both the size at which a flush is triggered (via index.translog.flush_threshold_size) and the time since the last flush (via index.translog.flush_threshold_period). As with most index settings, you can change them at runtime:

% curl -XPUT localhost:9200/get-together/_settings -d '{
  "index.translog": {
    "flush_threshold_size": "500mb",
    "flush_threshold_period": "10m"
  }
}'

When a flush is performed, one or more segments are created on the disk. When you run a query, Elasticsearch (through Lucene) looks in all segments and merges the results into an overall shard result. Then, as you saw in chapter 2, per-shard results are aggregated into the overall results that go back to your application.

The key thing to remember here about segments is that the more segments you have to search through, the slower the search. To keep the number of segments under control, Elasticsearch (again, through Lucene) merges multiple sets of smaller segments into bigger segments in the background.

10.2.2. Merges and merge policies

We first introduced segments in chapter 3 as immutable sets of files that Elasticsearch uses to store indexed data. Because they don’t change, segments are easily cached, making searches fast. Also, changes to the dataset, such as the addition of a document, won’t require rebuilding the index for data stored in existing segments. This makes indexing new documents fast, too—but it’s not all good news. Updating a document can’t change the actual document; it can only index a new one. This requires deleting the old document, too. Deleting, in turn, can’t remove a document from its segment (that would require rebuilding the inverted index), so it’s only marked as deleted in a separate .del file. Documents are only actually removed during segment merging.

This brings us to the two purposes of merging segments: to keep the total number of segments in check (and with it, query performance) and to remove deleted documents.

Segment merging happens in the background, according to the defined merge policy. The default merge policy is tiered, which, as illustrated in figure 10.4, divides segments into tiers, and if you have more than the set maximum number of segments in a tier, a merge is triggered in that tier.

Figure 10.4. Tiered merge policy performs a merge when it finds too many segments in a tier.

There are other merge policies, but in this chapter we’ll focus only on the tiered merge policy, which is the default, because it works best for most use cases.

Tip

There are some nice videos and explanations of different merge policies on Mike McCandless’s blog (he’s a co-author of Lucene in Action, Second Edition [Manning Publications, 2010]): http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html.

Tuning merge policy options

The overall purpose of merging is to trade I/O and some CPU time for search performance. Merging happens when you index, update, or delete documents, so the more you merge, the more expensive these operations get. Conversely, if you want faster indexing, you’ll need to merge less and sacrifice some search performance.

In order to have more or less merging, you have a few configuration options. Here are the most important ones:

index.merge.policy.segments_per_tier— The higher the value, the more segments you can have in a tier. This will translate to less merging and better indexing performance. If you have little indexing and you want better search performance, lower this value.

index.merge.policy.max_merge_at_once— This setting limits how many segments can be merged at once. You’d typically make it equal to the segments_per_tier value. You could lower the max_merge_at_once value to force less merging, but it’s better to do that by increasing segments_per_tier. Make sure max_merge_at_once isn’t higher than segments_per_tier because that will cause too much merging.

index.merge.policy.max_merged_segment— This setting defines the maximum segment size; bigger segments won’t be merged with other segments. You’d lower this value if you wanted less merging and faster indexing because larger segments are more difficult to merge.

index.merge.scheduler.max_thread_count— Merging happens in the background on separate threads, and this setting controls the maximum number of threads that can be used for merging. This is the hard limit of how many merges can happen at once. You’d increase this setting for an aggressive merge policy on a machine with many CPUs and fast I/O, and you’d decrease it if you had a slow CPU or I/O.

All those options are index-specific, and, as with transaction log and refresh settings, you can change them at runtime. For example, the following snippet forces more merging by reducing segments_per_tier to 5 (and with it, max_merge_at_once), lowers the maximum segment size to 1 GB, and lowers the thread count to 1 to work better with spinning disks:

% curl -XPUT localhost:9200/get-together/_settings -d '{
  "index.merge": {
    "policy": {
      "segments_per_tier": 5,
      "max_merge_at_once": 5,
      "max_merged_segment": "1gb"
    },
    "scheduler.max_thread_count": 1
  }
}'

Optimizing indices

As with refreshing and flushing, you can trigger a merge manually. A forced merge call is also known as optimize, because you’d typically run it on an index that isn’t going to be changed later, optimizing it down to a specified (low) number of segments for faster searching.

As with any aggressive merge, optimizing is I/O intensive and invalidates lots of caches. If you continue to index, update, or delete documents from that index, new segments will be created and the advantages of optimizing will be lost. Thus, if you want fewer segments on an index that’s constantly changing, you should tune the merge policy.

Optimizing makes sense on a static index. For example, if you index social media data and you have one index per day, you know you’ll never change yesterday’s index until you remove it for good. It might help to optimize it to a low number of segments, as shown in figure 10.5, which will reduce its total size and speed up queries once caches are warmed up again.

Figure 10.5. Optimizing makes sense for indices that don’t get updates.

To optimize, you’d hit the _optimize endpoint of the index or indices you need to optimize. The max_num_segments option indicates how many segments you should end up with per shard:

% curl localhost:9200/get-together/_optimize?max_num_segments=1

An optimize call can take a long time on a large index. You can send it to the background by setting wait_for_merge to false.
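Combining the two options might look like the following sketch; the URL is quoted because of the & character in the query string:

% curl 'localhost:9200/get-together/_optimize?max_num_segments=1&wait_for_merge=false'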

One possible reason for an optimize (or any merge) being slow is that Elasticsearch, by default, limits the amount of I/O throughput merge operations can use. This limiting is called store throttling, and we’ll discuss it next, along with other options for storing your data.

10.2.3. Store and store throttling

In early versions of Elasticsearch, heavy merging could slow down the cluster so much that indexing and search requests would take unacceptably long, or nodes could become unresponsive altogether. This was all due to the pressure of merging on the I/O throughput, which would make the writing of new segments slow. Also, CPU load was higher due to I/O wait.

As a result, Elasticsearch now limits the amount of I/O throughput that merges can use through store throttling. By default, there’s a node-level setting called indices.store.throttle.max_bytes_per_sec, which defaults to 20mb as of version 1.5.

This limit is good for stability in most use cases but won’t work well for everyone. If you have fast machines and lots of indexing, merges won’t keep up, even if there’s enough CPU and I/O to perform them. In such situations, Elasticsearch makes internal indexing work only on one thread, slowing it down to allow merges to keep up. In the end, if your machines are fast, indexing might be limited by store throttling. For nodes with SSDs, you’d normally increase the throttling limit to 100–200 MB.
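If you decide to raise the limit for SSD-backed nodes, the node-level setting in elasticsearch.yml might look like the following sketch; the 100mb value is simply the low end of the range mentioned above, not a universal recommendation:

# elasticsearch.yml: per-node merge throttling limit
# (the same setting can also be changed at runtime through the
#  Cluster Update Settings API, as shown later in this section)
indices.store.throttle.max_bytes_per_sec: 100mb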

Changing store throttling limits

If you have fast disks and need more I/O throughput for merging, you can raise the store throttling limit.

You can also remove the limit altogether by setting indices.store.throttle.type to none. On the other end of the spectrum, you can apply the store throttling limit to all of Elasticsearch’s disk operations, not just merges, by setting indices.store.throttle.type to all.

Those settings can be changed from elasticsearch.yml on every node, but they can also be changed at runtime through the Cluster Update Settings API. Normally, you’d tune them while monitoring how much merging and other disk activities are actually happening—we’ll show you how to do that in chapter 11.

Tip

Elasticsearch 2.0, which will be based on Lucene 5.0, will use Lucene’s auto-io-throttle feature,[1] which will automatically throttle merges based on how much indexing is going on. If there’s little indexing, merges will be throttled more so they won’t affect searches. If there’s lots of indexing, there will be less merge throttling, so that merges won’t fall behind.

1

For more details, check the Lucene issue, https://issues.apache.org/jira/browse/LUCENE-6119, and the Elasticsearch issue, https://github.com/elastic/elasticsearch/pull/9243.

The following command would raise the throttling limit to 500 MB/s but apply it to all operations. It would also make the change persistent, so it survives full cluster restarts (as opposed to transient settings, which are lost when the cluster is restarted):

% curl -XPUT localhost:9200/_cluster/settings -d '{
  "persistent": {
    "indices.store.throttle": {
      "type": "all",
      "max_bytes_per_sec": "500mb"
    }
  }
}'

Tip

As with index settings, you can also get cluster settings to see if they’re applied. You’d do that by running curl localhost:9200/_cluster/settings?pretty.

Configuring store

When we talked about flushes, merges, and store throttling, we said “disk” and “I/O” because that’s the default: Elasticsearch will store indices in the data directory, which defaults to /var/lib/elasticsearch/data if you installed Elasticsearch from an RPM/DEB package, or the data/ directory inside the unpacked tar.gz or ZIP archive if you installed it manually. You can change the data directory through the path.data property of elasticsearch.yml.
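For example, pointing the data directory somewhere else in elasticsearch.yml might look like the following sketch; the paths are hypothetical:

# elasticsearch.yml: where this node stores its indices (read at startup)
path.data: /mnt/ssd1/elasticsearch/data
# a comma-separated list is also accepted, as discussed in the tip below
# path.data: /mnt/ssd1/elasticsearch/data,/mnt/ssd2/elasticsearch/data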

Tip

You can specify multiple directories in path.data, which—in version 1.5, at least—will put different files in different directories to achieve striping (assuming those directories are on different disks). If that’s what you’re after, you’re often better off using RAID0, in terms of both performance and reliability. For this reason, the plan is to keep all of a shard’s files in the same directory instead of striping them across data paths.[2]

2

More details can be found on Elasticsearch’s bug tracker: https://github.com/elastic/elasticsearch/issues/9498.

The default store implementation stores index files in the file system, and it works well for most use cases. To access Lucene segment files, the default store implementation uses Lucene’s MMapDirectory for files that are typically large or need to be randomly accessed, such as term dictionaries. For the other types of files, such as stored fields, Elasticsearch uses Lucene’s NIOFSDirectory.

MMapDirectory

MMapDirectory takes advantage of file system caches by asking the operating system to map the needed files in virtual memory in order to access that memory directly. To Elasticsearch, it looks as if all the files are available in memory, but that doesn’t have to be the case. If your index size is larger than your available physical memory, the operating system will happily take unused files out of the caches to make room for new ones that need to be read. If Elasticsearch needs those uncached files again, they’ll be loaded in memory while other unused files are taken out, and so on. The virtual memory used by MMapDirectory works similarly to the system’s virtual memory (swap), where the operating system uses the disk to page out unused memory in order to be able to serve multiple applications.

NIOFSDirectory

Memory-mapped files also imply an overhead because the application has to tell the operating system to map a file before accessing it. To reduce this overhead, Elasticsearch uses NIOFSDirectory for some types of files. NIOFSDirectory accesses files directly, but it has to copy the data it reads into a buffer in the JVM heap. This makes it good for small, sequentially accessed files, whereas MMapDirectory works well for large, randomly accessed files.

The default store implementation is best for most use cases. You can, however, choose other implementations by changing index.store.type in the index settings to values other than default:

mmapfs— This will use the MMapDirectory alone and will work well, for example, if you have a relatively static index that fits in your physical memory.

niofs— This will use NIOFSDirectory alone and would work well on 32-bit systems, where virtual memory address space is limited to 4 GB, which will prevent you from using mmapfs or default for larger indices.

Store type settings need to be configured when you create the index. For example, the following command creates an mmap-ed index called unit-test:

% curl -XPUT localhost:9200/unit-test -d '{
  "index.store.type": "mmapfs"
}'

If you want to apply the same store type for all newly created indices, you can set index.store.type to mmapfs in elasticsearch.yml. In chapter 11 we’ll introduce index templates, which allow you to define index settings that would apply to new indices matching specific patterns. Templates can also be changed at runtime, and we recommend using them instead of the more static elasticsearch.yml equivalent if you often create new indices.
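Jumping ahead a little, a minimal sketch of such a template might look like this; the template name and the logs-* pattern are hypothetical, and chapter 11 covers templates in detail:

% curl -XPUT localhost:9200/_template/store-type-example -d '{
  "template": "logs-*",
  "settings": {
    "index.store.type": "mmapfs"
  }
}'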

Open files and virtual memory limits

Lucene segments that are stored on disk can be spread across many files, and when a search runs, the operating system needs to be able to open many of them. Also, when you’re using the default store type or mmapfs, the operating system has to map some of those stored files into memory—even though these files aren’t in memory, to the application it looks as if they are, and the kernel takes care of loading and unloading them in the cache. Linux has configurable limits that prevent applications from opening too many files at once or mapping too much memory. These limits are typically more conservative than needed for Elasticsearch deployments, so it’s recommended to increase them. If you’re installing Elasticsearch from a DEB or RPM package, you don’t have to worry about this, because the limits are increased by default. You can find these variables in /etc/default/elasticsearch or /etc/sysconfig/elasticsearch:

MAX_OPEN_FILES=65535

MAX_MAP_COUNT=262144

To increase those limits manually, you have to run ulimit -n 65535 as the user who starts Elasticsearch for the open files and run sysctl -w vm.max_map_count=262144 as root for the virtual memory.
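If you installed from a tar.gz or ZIP archive, a quick way to check the current limits before raising them is sketched below; the values mirror the package defaults shown above:

# check the current limits
% ulimit -n
% sysctl vm.max_map_count
# raise them: ulimit as the user that starts Elasticsearch, sysctl as root
# (going above the hard limit may also require editing /etc/security/limits.conf)
% ulimit -n 65535
% sysctl -w vm.max_map_count=262144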

The default store type is typically the fastest because of the way the operating system caches files. For caching to work well, you need to have enough free memory.

Tip

From Elasticsearch 2.0 on, you’ll be able to compress stored fields (and _source) further by setting index.codec to best_compression.[3] The default (named default, as with store types) still compresses stored fields by using LZ4, but best_compression uses deflate.[4] Higher compression will slow down operations that need _source, like fetching results or highlighting. Other operations, such as aggregations, should be at least equally fast because the overall index will be smaller and easier to cache.

3

For more details, check the Elasticsearch issue, https://github.com/elastic/elasticsearch/pull/8863, and the main Lucene issue, https://issues.apache.org/jira/browse/LUCENE-5914.

4 https://en.wikipedia.org/wiki/DEFLATE
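As a sketch, enabling the best_compression codec from the tip above at index-creation time might look like this, assuming an Elasticsearch 2.0 or later node; the index name is hypothetical:

% curl -XPUT localhost:9200/logs-archive -d '{
  "settings": {
    "index.codec": "best_compression"
  }
}'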

We mentioned how merge and optimize operations invalidate caches. Managing caches for Elasticsearch to perform well deserves more explanation, so we’ll discuss that next.