6.9. Field data detour

The inverted index is great when you want to look for a term and get back the matching documents. But when you need to sort on a field or return some aggregations, Elasticsearch needs to quickly figure out, for each matching document, the terms that will be used for sorting or aggregations.

Inverted indices don’t perform well for such tasks, and this is where field data becomes useful. When we talk about field data, we’re talking about all of the unique values for a field. These values are loaded by Elasticsearch into memory. If you have three documents that look like these

{"body": "quick brown fox"}

{"body": "fox brown fox"}

{"body": "slow turtle"}

the terms that would get loaded into memory would be quick, brown, fox, slow, and turtle.

Elasticsearch loads these in a compressed manner into the field data cache, which we’ll look at next.

6.9.1. The field data cache

The field data cache is an in-memory cache that Elasticsearch uses for a number of things. This cache is usually (but not always) built the first time the data is needed and then kept around to be used for various operations. This loading can take a lot of time and CPU if you have lots of data, slowing down that first search.

This is where warmers, queries that Elasticsearch runs automatically to make sure internal caches are filled, can come in handy to preload data used for queries before it’s needed. We’ll discuss warmers more in chapter 10.

Why the field data cache is so necessary

Elasticsearch needs this cache because a lot of comparison and analytic operations operate on a large amount of data, and the only way these operations can be accomplished in a reasonable amount of time is if the data is accessible in memory. Elasticsearch goes to great lengths to minimize the amount of memory that this cache takes up, but it still ends up being one of the largest users of heap space in the Java virtual machine.

Not only should you be aware of the memory used by the cache, but you should also be aware that the initial loading of this cache can take a nontrivial amount of time. You may notice this when performing aggregations and seeing that the first aggregation takes 2–3 seconds to complete, whereas subsequent aggregation requests return in 30 milliseconds.

If this loading time becomes problematic, you can pay the price at index time instead and have Elasticsearch load the field data automatically whenever it makes a new segment available for search. To do this for a field you sort or aggregate on, set fielddata.loading to eager in the mapping. With this setting, Elasticsearch won’t wait until the first search to load the field data but will load it as soon as it’s available to be loaded.

For example, to make the verbatim tags of a get-together group (on which you run a terms aggregation to get the top 10 tags) eagerly loaded, you can use the mapping shown in the following listing.

Listing 6.20. Eagerly loaded field data for the tags field
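A minimal sketch of such a mapping, assuming an index named get-together with a group type and a not_analyzed tags field (the exact field layout here is illustrative):

curl -XPUT 'localhost:9200/get-together' -d '{
  "mappings": {
    "group": {
      "properties": {
        "tags": {
          "type": "string",
          "index": "not_analyzed",
          "fielddata": {
            "loading": "eager"
          }
        }
      }
    }
  }
}'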

6.9.2. What field data is used for

As previously mentioned, field data is used for a number of things in Elasticsearch. Here are some of the uses of field data:

Sorting by a field

Aggregating on a field

Accessing the value of a field in a script with the doc['fieldname'] notation

Using the field_value_factor function in the function_score query

Using the decay functions in the function_score query

Returning fields from field data using fielddata_fields in a search request

Caching the IDs of a parent/child document relationship

Probably the most common of these uses is sorting or aggregating on a field. For example, if you sort the get-together results by the organizer field, all of the unique values of that field must be loaded into memory in order for them to be efficiently compared to provide a sorting order.

Right behind sorting on a field is aggregating on a field. When a terms aggregation is performed, Elasticsearch needs to count each unique term, so those unique terms and their counts must be held in memory in order to generate these analytic results. Likewise, in the case of a statistical aggregation, the numeric data for the field has to be loaded in order to calculate the resulting values.
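As a sketch, assuming the get-together index with its organizer and tags fields, a single search request can trigger both kinds of field data loading:

curl 'localhost:9200/get-together/group/_search?pretty' -d '{
  "sort": [ { "organizer": "asc" } ],
  "aggs": {
    "top_tags": {
      "terms": { "field": "tags", "size": 10 }
    }
  }
}'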

Not to fear, though; as I mentioned, although this may sound like a lot of data to load (and it certainly can be), Elasticsearch does its best to load the data in a compressed manner. That said, you do need to be aware of it, so let’s talk about how to manage field data in your cluster.

6.9.3. Managing field data

There are a few ways to manage field data in an Elasticsearch cluster. Now, what do we mean when we say “manage”? Well, managing field data means avoiding issues in the cluster where JVM garbage collection is taking a long time or so much memory is being loaded that you get an OutOfMemoryError; it would also be beneficial to avoid cache churn, so data isn’t constantly being loaded into and unloaded from memory.

We’re going to talk about three different ways to do such management:

Limiting the amount of memory used by field data

Using the field data circuit breaker

Bypassing memory altogether with doc values

Limiting the amount of memory used by field data

One of the easiest ways to make sure your data doesn’t take up too much space in memory is to limit the field data cache to a certain size. If you don’t specify this, Elasticsearch doesn’t limit the cache at all, and data isn’t automatically expired from the cache after a set time.

There are two different options when it comes to limiting the field data cache: you can limit by a size amount, or you can set an expiration time after which the field data in the cache will be invalidated.

To set these options, specify the following in your elasticsearch.yml file; these settings can’t be updated through the cluster update settings API and therefore require a restart when changed:

indices.fielddata.cache.size: 400mb
indices.fielddata.cache.expire: 25m

But when setting these, it makes more sense to set the indices.fielddata.cache.size option instead of the expire option. Why? Because when field data is loaded into the cache, it stays there until the limit is reached, and then it’s evicted in a least-recently-used (LRU) manner. By setting just the size limit, you remove only the oldest data from the cache once the limit has been reached.

When setting the size, you can also use a relative size instead of an absolute, so instead of the 400mb from our example, you can specify 40% to use 40% of the JVM’s heap size for the field data cache. This can be useful if you have machines with differing amounts of physical memory but want to unify the elasticsearch.yml configuration file between them without specifying absolute values.
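For example, in elasticsearch.yml:

indices.fielddata.cache.size: 40%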

Using the field data circuit breaker

What happens if you don’t set the size of the cache? Well, in order to protect against loading too much data into memory, Elasticsearch has the concept of a circuit breaker, which monitors the amount of data being loaded into memory and “trips” if a certain limit is reached.

In the case of field data, every time a request happens that would load field data (sorting on a field, for example), the circuit breaker estimates the amount of memory required for the data and checks whether loading it would exceed the maximum size. If it does exceed the size, an exception is returned and the operation is prevented.

This has a number of benefits. When you limit the field data cache, the size of the field data can be calculated only after it has been loaded into memory, so it’s still possible to load too much data and run out of memory. The circuit breaker, on the other hand, estimates the size of the data before it’s loaded, so it can avoid loading it at all if doing so would cause the system to run out of memory.

Another benefit of this approach is that the circuit breaker limit can be dynamically adjusted while the node is running, whereas the size of the cache must be set in the configuration file and requires restarting the node to change. The circuit breaker is configured by default to limit the field data size to 60% of the JVM’s heap size. You can configure this by sending a request like this:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.breaker.fielddata.limit": "350mb"
  }
}'

Again, this setting supports either an absolute value like 350mb or a percentage such as 45%. Once you’ve set this, you can see the limit and how much memory is currently tracked by the breaker with the Nodes Stats API, which we’ll talk about in chapter 11.
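For example, a quick way to see the breaker limits and current usage on each node is the breaker section of the nodes stats API:

curl 'localhost:9200/_nodes/stats/breaker?pretty'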

Note

As of version 1.4, there is also a request circuit breaker, which helps you make sure that other in-memory data structures generated by a request don’t cause an OutOfMemoryError, by limiting them to a default of 40% of the heap. There’s also a parent circuit breaker, which makes sure that the field data and request breakers together don’t exceed 70% of the heap. Both limits can be updated via the Cluster Update Settings API through indices.breaker.request.limit and indices.breaker.total.limit, respectively.
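As a sketch, those limits are adjusted the same way as the field data breaker limit; the values here are illustrative:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.breaker.request.limit": "35%",
    "indices.breaker.total.limit": "65%"
  }
}'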

Bypassing memory and using the disk with doc values

So far you’ve seen that you should use circuit breakers to make sure outstanding requests don’t crash your nodes, and if you fall consistently short of field data space, you should either increase your JVM heap size to use more RAM or limit the field data size and live with bad performance. But what if you’re consistently short on field data space, don’t have enough RAM to increase the JVM heap, and can’t live with bad performance caused by field data evictions? This is where doc values come in.

Doc values take the data that needs to be loaded into memory and instead prepare it when the document is indexed, storing it on disk alongside the regular index data. This means that when field data would normally be used and read out of memory, the data can be read from disk instead. This provides a number of advantages:

Performance degrades smoothly— Unlike default field data, which needs to live in the JVM heap all at once, doc values are read from the disk, like the rest of the index. If the OS can’t fit everything in its RAM caches, more disk seeks will be needed, but there are no expensive loads and evictions, no risk of OutOfMemoryErrors, and no circuit-breaking exceptions because the circuit breaker prevented the field data cache from using too much memory.

Better memory management— When used, doc values are cached in memory by the kernel, avoiding the cost of garbage collection associated with heap usage.

Faster loading— With doc values, the uninverted structure is calculated at index time, so even when you run the first query, Elasticsearch doesn’t have to uninvert on the fly. This makes the initial requests faster, because the uninverting process has already been performed.

As with everything in this chapter, there’s no such thing as a free lunch. Doc values come with disadvantages, too:

Bigger index size— Storing all doc values on disk inflates the index size.

Slightly slower indexing— The need to calculate doc values at index time slows down the process of indexing.

Slightly slower requests that use field data— Disk is slower than memory, so some requests that would usually use an already-loaded field data cache in memory will be slightly slower when reading doc values from disk. This includes sorting, facets, and aggregations.

Works only on non-analyzed fields— As of version 1.4, doc values don’t support analyzed fields. If you want to build a word cloud of the words in event titles, for example, you can’t take advantage of doc values. Doc values can be used for numeric, date, Boolean, binary, and geo-point fields, though, and work well for large datasets on non-analyzed data, such as the timestamp field of log messages that are indexed into Elasticsearch.

The good news is that you can mix and match fields that use doc values with those that use the in-memory field data cache, so although you may want to use doc values for the timestamp field in your events, you can still keep the event’s title field in memory.

How are doc values used? Because they’re written out at indexing time, configuring doc values has to happen in the mapping for a particular field. If you have a string field that’s not analyzed and you’d like to use doc values on it, you can configure the mapping when creating an index, as shown in the next listing.

Listing 6.21. Using doc values in the mapping for the title field
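A minimal sketch of such a mapping, assuming a group type with a not_analyzed title field (the field layout here is illustrative):

curl -XPUT 'localhost:9200/get-together' -d '{
  "mappings": {
    "group": {
      "properties": {
        "title": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}'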

Once the mapping has been configured, indexing and searching will work as normal without any additional changes.