11.3. Monitoring for bottlenecks
Elasticsearch provides a wealth of information via its APIs: memory consumption, node membership, shard distribution, and I/O performance. The cluster and node APIs help you gauge the health and overall performance metrics of your cluster. Understanding cluster diagnostic data and being able to assess the overall status of the cluster will alert you to performance bottlenecks, such as unassigned shards and missing nodes, so you can easily address them.
11.3.1. Checking cluster health
The cluster health API provides a convenient yet coarse-grained view of the overall health of your cluster, indices, and shards. This is normally the first step in being alerted to and diagnosing a problem that may be actively occurring in your cluster. The next listing shows how to use the cluster health API to check overall cluster state.
Listing 11.3. Cluster health API request
curl -XGET 'localhost:9200/_cluster/health?pretty';
And the response:
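On a healthy cluster, the output looks roughly like the following (the values shown here are illustrative; compare them with the yellow-status response later in this section):

{
"cluster_name" : "elasticiq",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 10,
"active_shards" : 20,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}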
Taking the response shown here at face value, you can deduce a lot about the general health and state of the cluster, but there’s much more to reading this simple output than what’s obvious at first glance. Let’s look a little deeper into the meaning of the last three indicators in the code: relocating_shards, initializing_shards, and unassigned_shards.
relocating_shards— A number above zero means that Elasticsearch is moving shards of data across the cluster to improve balance and failover. This ordinarily occurs when adding a new node, restarting a failed node, or removing a node, thereby making this a temporary occurrence.
initializing_shards— This number will be above zero when you’ve just created a new index or restarted a node.
unassigned_shards— The most common reason for this value to be above zero is having unassigned replicas. The issue is common in development environments, where a single-node cluster has an index defined with the default of five shards and one replica. In this case, there’ll be five unassigned replica shards.
As you can see in the response, the cluster status is green. There are times when this may not be so, as in the case of nodes not being able to start or falling away from the cluster, and although the status value gives you only a general idea of the health of the cluster, it’s useful to understand what those status values mean for cluster performance:
Green— Both primary and replica shards are fully functional and distributed.
Yellow— Normally this is a sign of a missing replica shard. The unassigned_shards value is likely above zero at this point, making the distributed nature of the cluster unstable. Further shard loss could lead to critical data loss. Look for any nodes that aren’t initialized or functioning correctly.
Red— This is a critical state, where a primary shard in the cluster can’t be found, prohibiting indexing operations on that shard and leading to inconsistent query results. Again, likely a node or several nodes are missing from the cluster.
Armed with this knowledge, you can now take a look at a cluster with a yellow status and attempt to track down the source of the problem:
curl -XGET 'localhost:9200/_cluster/health?pretty';
{
"cluster_name" : "elasticiq",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 10,
"active_shards" : 10,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 5
}
Given this API call and response, you see that the cluster is now in yellow status, and as you’ve already learned, the likely culprit is the unassigned_shards value being above 0. The cluster health API provides a more fine-grained operation that will allow you to further diagnose the issue. In this case, you can look deeper at which indices are affected by the unassigned shards by adding the level parameter:
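A sketch of such a request (the same health endpoint with level=indices added; index names and counts will vary):

curl -XGET 'localhost:9200/_cluster/health?level=indices&pretty';

The response then contains an indices section with a per-index status and shard counts, pointing you at the index whose replicas are unassigned.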
The single-node cluster is experiencing some problems because Elasticsearch is trying to allocate replica shards across the cluster, but it can’t do so because there’s only one node running. This leads to the replica shards not being assigned anywhere and therefore a yellow status across the cluster, as figure 11.3 shows.
Figure 11.3. Yellow status solved by making nodes accessible
As you can see, an easy remedy is to add a node to the cluster so Elasticsearch can then allocate the replica shards to that location. Making sure that all of your nodes are running and accessible is the easiest way to solve the yellow status issue.
11.3.2. CPU: slow logs, hot threads, and thread pools
Monitoring your Elasticsearch cluster may from time to time expose spikes in CPU usage or performance bottlenecks caused by constantly high CPU utilization or by blocked/waiting threads. This section will help demystify some of these possible performance bottlenecks and provide you with the tools needed to identify and address these issues.
Slow logs
Elasticsearch provides two logs for isolating slow operations that are easily configured within your cluster configuration file: slow query log and slow index log. By default, both logs are disabled. Log output is scoped at the shard level. That is, one operation can represent several lines in the corresponding log file. The advantage to shard-level logging is that you’ll be better able to identify a problem shard and thereby a node with the log output, as shown here. It’s important to note at this point that these settings can also be modified using the '{index_name}/_settings' endpoint:
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 1s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms

index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 1s
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 200ms
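Because these thresholds are dynamic index settings, you can also change them on a live index through the settings endpoint mentioned earlier; a minimal sketch (the index name myindex and the 5s value are only examples) looks like this:

curl -XPUT 'localhost:9200/myindex/_settings' -d '{
"index.search.slowlog.threshold.query.warn" : "5s"
}'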
As you can see, you can set thresholds for both phases of a search: the query and the fetch. The log levels (warn, info, debug, trace) allow you finer control over which level will be logged, something that comes in handy when you simply want to grep your log file. The actual log file you’ll be outputting to is configured in your logging.yml file, along with other logging functionality, as shown here:
index_search_slow_log_file:
  type: dailyRollingFile
  file: ${path.logs}/${cluster.name}_index_search_slowlog.log
  datePattern: "'.'yyyy-MM-dd"
  layout:
    type: pattern
    conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"
Typical output in a slow log file looks like this:
[2014-11-09 16:35:36,325][INFO ][index.search.slowlog.query] [ElasticIQ-Master] [streamglue][4] took[10.5ms], took_millis[10], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[10], source[{"query":{"filtered":{"query":{"query_string":{"query":"test"}}}},...]

[2014-11-09 16:35:36,339][INFO ][index.search.slowlog.fetch] [ElasticIQ-Master] [streamglue][3] took[9.1ms], took_millis[9], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[10], ...
Slow query log
The important parts you’re interested in for identifying performance issues are the query times: took[##ms]. Additionally, it’s helpful to know the shards and indices involved, and those are identifiable by the [index][shard_number] notation; in this case it’s [streamglue][4].
Slow index log
Equally useful in discovering bottlenecks during index operations is the slow index log. Its thresholds are defined in your cluster configuration file, or via the index update settings API, much like the previous slow log:
index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.trace: 500ms
As before, the output of any index operation meeting the threshold values will be written to your log file, and you’ll see the [index][shard_number] ([bitbucket][2]) and duration (took[4.5ms]) of the index operation:
[2014-11-09 18:28:58,636][INFO ][index.indexing.slowlog.index] [ElasticIQ-Master] [bitbucket][2] took[4.5ms], took_millis[4], type[test], id[w0QyH_m6Sa2P-juppUy3Tw], routing[], source[] ...
Discovering where your slow queries and index calls are happening will go a long way in helping remedy Elasticsearch performance problems. Allowing slow performance to grow unbounded can cause a cascading failure across your entire cluster, leading to it crashing entirely.
Hot_threads API
If you’ve ever experienced high CPU utilization across your cluster, you’ll find the hot_threads API helpful in identifying specific processes that may be blocked and causing the problem. The hot_threads API provides a list of blocked threads for every node in your cluster. Note that unlike other APIs, hot_threads doesn’t return JSON but instead returns formatted text:

curl -XGET 'http://127.0.0.1:9200/_nodes/hot_threads';
Here’s the sample output:
::: [ElasticIQ-Master][AtPvr5Y3ReW-ua7ZPtPfuQ][loki.local][inet[/127.0.0.1:9300]]{master=true}
   37.5% (187.6micros out of 500ms) cpu usage by thread 'elasticsearch[ElasticIQ-Master][search][T#191]'
   10/10 snapshots sharing following 3 elements
   ...
The output of the hot_threads API requires some parsing to understand correctly, so let’s have a look at what information it provides on CPU performance:
::: [ElasticIQ-Master][AtPvr5Y3ReW-ua7ZPtPfuQ][loki.local][inet[/127.0.0.1:9300]]{master=true}
The top line of the response includes the node identification. Because the cluster presumably has more than one node, this is the first indication of which node the thread information belongs to:
37.5% (187.6micros out of 500ms) cpu usage by thread 'elasticsearch[ElasticIQ-Master][search][T#191]'
Here you can see that 37.5% of CPU processing is being spent on a search thread. This is key to your understanding, because you can then fine-tune your search queries that may be causing the CPU spike. Expect that the search value won’t always be there. Elasticsearch may present other values here like merge, index, and the like that identify the operation being performed on that thread. You know this is CPU-related because of the cpu usage identifier. Other possible output identifiers here are block usage, which identifies threads that are blocked, and wait usage for threads in a WAITING state:
10/10 snapshots sharing following 3 elements
The final line before the stack trace tells you that Elasticsearch found that this thread with the same stack trace was present in 10 out of the 10 snapshots it took within a few milliseconds.
Of course, it’s worth learning how Elasticsearch gathers the hot_threads API information for presentation.
Every few milliseconds, Elasticsearch collects information about each thread’s duration, its state (WAITING/BLOCKED), and the duration of the wait or block. After a set interval (500 ms by default), Elasticsearch does a second pass of the same information-gathering operation. During each of these passes, it takes snapshots of each stack trace. You can tune the information-gathering process by adding parameters to the hot_threads API call:
curl -XGET 'http://127.0.0.1:9200/_nodes/hot_threads?type=wait&interval=1000ms&threads=3';

type— One of cpu, wait, or block. The type of thread state to take snapshots for.
interval— Time to wait between the first and second checks. Defaults to 500 ms.
threads— Number of top “hot” threads to display.
Thread pools
Every node in a cluster manages thread pools for better management of CPU and memory usage.
Elasticsearch will seek to manage thread pools to achieve the best performance on a given node. In some cases, you’ll need to manually configure and override how thread pools are managed to avoid cascading failure scenarios. Under a heavy load, Elasticsearch may spawn thousands of threads to handle requests, causing your cluster to fail. Knowing how to tune thread pools requires intimate knowledge of how your application is using the Elasticsearch APIs. For instance, an application that uses mostly the bulk index API should be allotted a larger set of threads. Otherwise, the bulk thread pool can become overloaded, and new requests will be rejected.
You can tune the thread pool settings within your cluster configuration. Thread pools are divided by operation and configured with a default value depending on the operation type. For brevity, we’re listing only a few of them:
bulk— Defaults to a fixed size based on the number of available processors for all bulk operations.
index— Defaults to a fixed size based on the number of available processors for index and delete operations.
search— Defaults to a fixed size that’s three times the number of available processors for count and search operations.
Looking at your elasticsearch.yml configuration, you can see that you can increase the size of the thread pool queue and number of thread pools for all bulk operations. It’s also worth noting here that the Cluster Settings API allows you to update these settings on a running cluster as well:
# Bulk thread pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 40
threadpool.bulk.queue_size: 200
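Because the cluster settings API mentioned above can update these values on a running cluster, here’s a minimal sketch of such an update (the queue size of 500 is only an example value):

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
"transient" : {
"threadpool.bulk.queue_size" : 500
}
}'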
Note that there are two thread pool types, fixed and cache. A fixed thread pool type holds a fixed number of threads to handle requests, backed by a queue for pending requests. Here the size parameter controls the number of threads, and the queue_size parameter controls how many pending requests the backing queue can hold; the defaults for both vary by operation type. A cache thread pool type is unbounded, meaning that a new thread will be created if there are any pending requests.
Armed with the cluster health API, slow query and index logs, and thread information, you can diagnose CPU-intensive operations and bottlenecks more easily. The next section will cover memory-centric information, which can help in diagnosing and tuning Elasticsearch performance issues.
11.3.3. Memory: heap size, field, and filter caches
This section will explore efficient memory management and tuning for Elasticsearch clusters. Many aggregation and filtering operations are memory-bound within Elasticsearch, so knowing how to effectively improve the default memory-management settings in Elasticsearch and the underlying JVM will be a useful tool for scaling your cluster.
Heap size
Elasticsearch is a Java application that runs on the Java Virtual Machine (JVM), so it’s subject to memory management by the garbage collector. The concept behind the garbage collector is a simple one: it’s triggered when memory is running low, clearing out objects that have been dereferenced and thus freeing up memory for other JVM applications to use. These garbage-collection operations are time consuming and cause system pauses. Loading too much data in memory can also lead to OutOfMemory exceptions, causing failures and unpredictable results—a problem that even the garbage collector can’t address.
For Elasticsearch to be fast, some operations are performed in memory because of improved access to field data. For instance, Elasticsearch doesn’t just load field data for documents that match your query; it loads values for all the documents in your index. This makes your subsequent query much faster by virtue of having quick access to in-memory data.
The JVM heap represents the amount of memory allocated to applications running on the JVM. For that reason, it’s important to understand how to tune its performance to avoid the ill effects of garbage collection pauses and OutOfMemory exceptions. You set the JVM heap size via the ES_HEAP_SIZE environment variable, as shown in the brief example after the following list. The two golden rules to keep in mind when setting your heap size are as follows:
Maximum of 50% of available system RAM— Allocating too much system memory to the JVM means there’s less memory allocated to the underlying file-system cache, which Lucene makes frequent use of.
Maximum of 32 GB RAM— Above roughly 32 GB of heap, the JVM can no longer use compressed ordinary object pointers (OOPs), so object references take twice the space. Keeping the heap size under 32 GB therefore makes much more efficient use of memory.
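As a concrete sketch, on a machine with 16 GB of RAM you might give Elasticsearch half of it, staying well under the 32 GB ceiling (the value is only an example):

export ES_HEAP_SIZE=8g
./bin/elasticsearch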
Filter and field cache
Caches play an important role in Elasticsearch performance, allowing for the effective use of filters, facets, and index field sorting. This section will explore two of these caches: the filter cache and the field data cache.
The filter cache stores the results of filters and query operations in memory. This means that an initial query with a filter applied will have its results stored in the filter cache. Every subsequent query with that filter applied will use the data from the cache and not go to disk for the data. The filter cache effectively reduces the impact on CPU and I/O and leads to faster results of filtered queries.
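As an illustration, a query like the following combines a full-text query with a term filter; the index and field names here are hypothetical. On repeated executions, the term filter’s results can be served from the filter cache instead of being recomputed:

curl -XGET 'localhost:9200/get-together/_search' -d '{
"query" : {
"filtered" : {
"query" : { "match" : { "title" : "elasticsearch" } },
"filter" : { "term" : { "organizer" : "lee" } }
}
}
}'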
Two types of filter caches are available in Elasticsearch:
Index-level filter cache
Node-level filter cache
The node-level filter cache is the default setting and the one we’ll be covering. The index-level filter cache isn’t recommended because you can’t predict where the index will reside inside the cluster and therefore can’t predict memory usage.
The node-level filter cache is an LRU (least recently used) cache type. That means that when the cache becomes full, the entries that have gone unused the longest are evicted first to make room for new entries. Choose this cache type by setting index.cache.filter.type to node, or don’t set it at all; it’s the default value. Now you can set the size with the indices.cache.filter.size property. It will take either a percentage of memory to allocate (20%) or a static value (1024 MB) within your elasticsearch.yml configuration for the node. Note that a percentage property uses the maximum heap for the node as the total value to calculate from.
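Put together, a node-level filter cache configuration in elasticsearch.yml might look like this sketch (the 20% figure is only an example):

index.cache.filter.type: node
indices.cache.filter.size: 20%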
Field-data cache
The field-data cache is used to improve query execution times. Elasticsearch loads field values into memory when you run a query and keeps those values in the field-data cache for subsequent requests to use. Because building this structure in memory is an expensive operation, you don’t want Elasticsearch performing this on every request, so the performance gains are noticeable. By default, this is an unbounded cache, meaning that it will grow until it trips the field-data circuit breaker (covered in the next section). By specifying a value for the field-data cache, you tell Elasticsearch to evict data from the structure once the upper bound is reached.
Your configuration should include an indices.fielddata.cache.size property that can be set to either a percentage value (20%) or a static value (16 GB). These values represent the percentage or static segment of node heap space to use for the cache.
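For example, capping the field-data cache at a fifth of the heap is a single line in elasticsearch.yml (again, the percentage is only an example):

indices.fielddata.cache.size: 20%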
To retrieve the current state of the field-data cache, there are some handy APIs available:
Per-node:

curl -XGET 'localhost:9200/_nodes/stats/indices/fielddata?fields=*&pretty=1';

Per-index:

curl -XGET 'localhost:9200/_stats/fielddata?fields=*&pretty=1';

Per-node, per-index:

curl -XGET 'localhost:9200/_nodes/stats/indices/fielddata?level=indices&fields=*&pretty=1';
Specifying fields=* will return all field names and values. The output of these APIs looks similar to the following:

"indices" : {
  "bitbucket" : {
    "fielddata" : {
      "memory_size_in_bytes" : 1073741824,
      "evictions" : 200,
      "fields" : { ... }
    }
  },
  ...
These operations will break down the current state of the cache. Take special note of the number of evictions. Evictions are an expensive operation and a sign that the field-data cache may be set too small.
Circuit breaker
As mentioned in the previous section, the field-data cache may grow to the point that it causes an OutOfMemory exception. This is because the field-data size is calculated after the data is loaded. To avoid such events, Elasticsearch provides circuit breakers.
Circuit breakers are artificial limits imposed to help reduce the chances of an OutOfMemory exception. They work by introspecting data fields requested by a query to determine whether loading the data into the cache will push the total size over the cache size limit. Two circuit breakers are available in Elasticsearch, as well as a parent circuit breaker that sets a limit on the total amount of memory that all circuit breakers may use:
indices.breaker.total.limit— Defaults to 70% of heap. Doesn’t allow the field-data and request circuit breakers to surpass this limit.
indices.breaker.fielddata.limit— Defaults to 60% of heap. Doesn’t allow the fielddata cache to surpass this limit.
indices.breaker.request.limit— Defaults to 40% of heap. Controls the size fraction of heap allocated to operations like aggregation bucket creation.
The golden rule with circuit breaker settings is to be conservative in their values because the caches the circuit breakers control have to share memory space with memory buffers, the filter cache, and other Elasticsearch memory use.
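Assuming your Elasticsearch version exposes these limits as dynamic cluster settings, a conservative adjustment can be made on a running cluster with a sketch like the following (the 50% value is only an example):

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
"persistent" : {
"indices.breaker.fielddata.limit" : "50%"
}
}'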
Avoiding swap
Operating systems use the swapping process as a method of writing memory pages to disk. This process occurs when the amount of memory isn’t enough for the operating system. When the swapped pages are needed by the OS, they’re loaded back in memory for use. Swapping is an expensive operation and should be avoided.
Elasticsearch keeps a lot of runtime-necessary data and caches in memory, as shown in figure 11.4, so expensive write and read disk operations will severely impact a running cluster. For this reason, we’ll show how to disable swapping for faster performance.
Figure 11.4. Elasticsearch keeps runtime data and caches in memory, so writes and reads can be expensive.
The most thorough way to disable Elasticsearch swapping is to set bootstrap.mlockall to true in the elasticsearch.yml file. Next, you need to verify that the setting is working. With Elasticsearch running, you can either check the log for a warning or simply query for the live status:
Sample error in the log:

[2014-11-21 19:22:00,612][ERROR][common.jna] Unknown mlockall error 0

API request:

curl -XGET 'localhost:9200/_nodes/process?pretty=1';

Response:
...
"process" : {
  "refresh_interval_in_millis" : 1000,
  "id" : 9809,
  "max_file_descriptors" : 10240,
  "mlockall" : false
}
...
If either the warning is visible in the log or the status check shows mlockall set to false, your settings didn’t work. Insufficient access rights for the user running Elasticsearch are the most common reason for the new setting not taking effect. This is normally solved by running ulimit -l unlimited from the shell as the root user before starting Elasticsearch. It will be necessary to restart Elasticsearch for the new settings to be applied.
11.3.4. OS caches
Elasticsearch and Lucene leverage the OS file-system cache heavily due to Lucene’s immutable segments. Lucene is designed to leverage the underlying OS file-system cache for in-memory data structures. Lucene segments are stored in individual immutable files. Immutable files are considered to be cache-friendly, and the underlying OS is designed to keep “hot” segments resident in memory for faster access. The end effect is that smaller indices tend to be cached entirely in memory by your OS and become diskless and fast.
Because of Lucene’s heavy use of the OS file-system cache and the previous recommendation to set the JVM heap at half the physical memory, you can count on Lucene using much of the remaining half for caching. For this simple reason, it’s considered best practice to keep the indices that are most often used on faster machines. The idea is that Lucene will keep hot data segments in memory for really fast access, and this is easiest to accomplish on machines with more non-heap memory allocated. But to make this happen, you’ll need to assign specific indices to your faster nodes using routing.
First, you need to assign a specific attribute, tag, to all of your nodes. Every node has a unique value assigned to the attribute tag; for instance, node.tag: mynode1 or node.tag: mynode2. Using the node’s individual settings, you can create an index that will deploy only on nodes that have specific tag values. Remember, the point of this exercise is to make sure that your new, busy index is created only on nodes with more non-heap memory that Lucene can make good use of. To achieve this, your new index, myindex, will now be created only on nodes that have tag set to mynode1 and mynode2, with the following command:
curl -XPUT localhost:9200/myindex/_settings -d '{
"index.routing.allocation.include.tag" : "mynode1,mynode2"
}'
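For reference, each of those faster nodes would carry its tag in its own elasticsearch.yml, roughly like this sketch (using the names from the example above):

node.tag: mynode1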
Assuming these specific nodes have a higher non-heap memory allocation, Lucene will cache segments in memory, resulting in a much faster response time for your index than the alternative of having to seek segments on disk.
11.3.5. Store throttling
Apache Lucene stores its data in immutable segment files on disk. Immutable files are by definition written only once by Lucene but read many times. Merge operations combine these segments, because many smaller segments are read at once when a new, larger one is written. Although these merge operations normally don’t task a system heavily, systems with low I/O can be impacted negatively when merges, indexing, and search operations are all occurring at the same time. Fortunately, Elasticsearch provides throttling features to help control how much I/O is used.
You can configure throttling at both the node level and the index level. At the node level, throttling configuration settings affect the entire node, but at the index level, throttling configuration takes effect only on the indices specified.
Node-level throttling is configured by use of the indices.store.throttle.type property with possible values of none, merge, and all. The merge value instructs Elasticsearch to throttle I/O for merging operations across the entire node, meaning every shard on that node. The all value will apply the throttle limits to all operations for all of the shards on the node. Index-level throttling is configured much the same way but uses the index.store.throttle.type property instead. Additionally, it allows for a node value to be set, which tells the index to defer to the node-level throttling limits.
Whether you’re looking to implement node- or index-level throttling, Elasticsearch provides a property for setting the maximum bytes per second that I/O will use. For node-level throttling, use indices.store.throttle.max_bytes_per_sec, and for index-level throttling, use index.store.throttle.max_bytes_per_sec. Note that the values are expressed in megabytes per second:

indices.store.throttle.max_bytes_per_sec : "50mb"

or

index.store.throttle.max_bytes_per_sec : "10mb"
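Assuming your version also exposes the node-level limit as a dynamic cluster setting, you can adjust it without a restart; a sketch might look like this (the 20mb value is only an example):

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
"transient" : {
"indices.store.throttle.max_bytes_per_sec" : "20mb"
}
}'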
We leave as an exercise for you to configure the correct values for your particular system. If the frequency of I/O wait on a system is high or performance is degrading, lowering these values may help ease some of the pain.
Although we’ve explored ways to curtail a disaster, the next section will look at how to back up and restore data from/to your cluster in the event of one.