10.4. Other performance tradeoffs

In previous sections, you might have noticed that to make an operation fast, you need to pay with something. For example, if you make indexing faster by refreshing less often, you pay with searches that may not “see” recently indexed data. In this section we’ll continue looking at such tradeoffs, especially those that occur in more specific use cases, by answering questions on the following topics:

Inexact matches— Should you get faster searches by using ngrams and shingles at index time? Or is it better to use fuzzy and wildcard queries?

Scripts— Should you trade some flexibility by calculating as much as possible at index time? If not, how can you squeeze more performance out of them?

Distributed search— Should you trade some network round-trips for more accurate scoring?

Deep paging— Is it worth trading memory to get page 100 faster?

By the time this chapter ends, we’ll have answered all these questions and lots of others that will come up along the way. Let’s start with inexact matches.

10.4.1. Big indices or expensive searches

Recall from chapter 4 that to get inexact matches—for example, to tolerate typos—you can use a number of queries:

Fuzzy query— This query matches terms at a certain edit distance from the original. For example, omitting or adding an extra character would make a distance of 1.

Prefix query or filter— These match terms starting with the sequence you provide.

Wildcards— These allow you to use ? and * to substitute one or many characters. For example, "e*search" would match “elasticsearch.”

These queries offer lots of flexibility, but they’re also more expensive than simple queries, such as term queries. For an exact match, Elasticsearch has to find only one term in the term dictionary, whereas fuzzy, prefix, and wildcard queries have to find all terms matching the given pattern.
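For reference, here’s roughly what the three query types look like as request bodies (a sketch; the name field and the misspelled term are assumptions based on the get-together example used throughout the book):

  "query": { "fuzzy": { "name": { "value": "elasticsaerch", "fuzziness": 1 } } }

  "query": { "prefix": { "name": "elastic" } }

  "query": { "wildcard": { "name": "e*search" } }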

There’s also another solution for tolerating typos and other inexact matches: ngrams. Recall from chapter 5 that ngrams generate tokens from each part of the word. If you use them at both index and query time, you’ll get similar functionality to a fuzzy query, as you can see in figure 10.9.

Figure 10.9. Ngrams generate more terms than you need with fuzzy queries, but they match exactly.
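A minimal index-time setup for the ngram approach might look like the following sketch (the index name, filter name, and gram sizes are arbitrary choices):

% curl -XPUT 'localhost:9200/get-together-ngram' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "trigrams": {
          "type": "nGram",
          "min_gram": 3,
          "max_gram": 3
        }
      },
      "analyzer": {
        "trigram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "trigrams"]
        }
      }
    }
  },
  "mappings": {
    "group": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "trigram_analyzer"
        }
      }
    }
  }
}'

Because the same analyzer runs at index and search time, a match query for a misspelled term still shares most of its trigrams with the correct term, which gives you typo tolerance without a fuzzy query.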

Which approach is best for performance? As with everything in this chapter, there’s a tradeoff, and you need to choose where you want to pay the price:

Fuzzy queries slow down your searches, but your index is the same as with exact matches.

Ngrams, on the other hand, increase the size of your index. Depending on the ngram and term sizes, the index can grow to several times its original size. Also, if you want to change ngram settings, you have to re-index all data, so there’s less flexibility; in exchange, searches are typically faster overall with ngrams.

The ngram method is typically better when query latency is important or when you have lots of concurrent queries to support, so each one needs to take less CPU. Ngrams make indices bigger, so the index still needs to fit in the OS caches, or you need fast disks; otherwise performance will degrade because your index is too big.

The fuzzy approach, on the other hand, is better when you need higher indexing throughput, when index size is an issue, or when you have slow disks. Fuzzy queries also help if you need to change them often, such as by adjusting the edit distance, because you can make those changes without re-indexing any data.

Prefix queries and edge ngrams

For inexact matches, you often assume that the beginning of the word is right. For example, a search for “elastic” might be looking for “elasticsearch.” Like fuzzy queries, prefix queries are more expensive than regular term queries because there are more terms to look through.

The alternative could be to use edge ngrams, which were introduced in chapter 5. Figure 10.10 shows edge ngrams and prefix queries side by side.

Figure 10.10. A prefix query has to match more terms but works with a smaller index than edge ngrams.

As with the fuzzy queries and ngrams, the tradeoff is between flexibility and index size, which are better in the prefix approach, and query latency and CPU usage, which are better for edge ngrams.
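If you go the edge-ngram route, the index-time setup mirrors the ngram analyzer shown earlier; only the filter changes so that it emits prefixes, and it’s typically paired with a plain analyzer at query time via search_analyzer. A sketch, with made-up names and sizes:

  "filter": {
    "name_prefixes": {
      "type": "edgeNGram",
      "min_gram": 1,
      "max_gram": 10
    }
  }

The query-time alternative stays a plain prefix query on the unchanged field, such as "prefix": { "name": "elastic" }.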

Wildcards

A wildcard query where you always put a wildcard at the end, such as elastic*, is equivalent in terms of functionality to a prefix query. In this case, you have the same alternative of using edge ngrams.

If the wildcard is in the middle, as with e*search, there’s no real index-time equivalent. You can still use ngrams to match the provided letters e and search, but if you have no control over how wildcards are used, then the wildcard query is your only choice.

If the wildcard is always at the beginning, the wildcard query is typically more expensive than with trailing wildcards because there’s no prefix to hint at which part of the term dictionary to look in for matching terms. In this case, the alternative can be to use the reverse token filter in combination with edge ngrams, as you saw in chapter 5. This alternative is illustrated in figure 10.11.

Figure 10.11. You can use the reverse and edge ngram token filters to match suffixes.
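Here’s a sketch of such an analyzer chain (the names are made up): the reverse filter flips each token, the edge ngram filter keeps the prefixes of the flipped token (which are really its suffixes), and a second reverse flips them back:

  "analysis": {
    "filter": {
      "suffixes": {
        "type": "edgeNGram",
        "min_gram": 1,
        "max_gram": 10
      }
    },
    "analyzer": {
      "suffix_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "reverse", "suffixes", "reverse"]
      }
    }
  }

If you analyze the query with just lowercase at search time (for example, through a separate search_analyzer), a suffix such as “search” then matches the indexed suffix term of “elasticsearch” directly.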

Phrase queries and shingles

When you need to account for words that are next to each other, you can use the match query with type set to phrase, as you saw in chapter 4. Phrase queries are slower because they have to account not only for the terms but also for their positions in the documents.

Note

Positions are enabled by default for all analyzed fields because index_options is set to positions. If you don’t use phrase queries, only term queries, you can disable indexing positions by setting index_options to freqs. If you don’t care about scoring at all—for example, when you index application logs and you always sort results by timestamp—you can also skip indexing frequencies by setting index_options to docs.
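For example, here’s a sketch of a mapping for application logs that skips positions and frequencies entirely (the logs index and the field names are made up):

% curl -XPUT 'localhost:9200/logs' -d '{
  "mappings": {
    "log": {
      "properties": {
        "message": {
          "type": "string",
          "index_options": "docs"
        },
        "timestamp": {
          "type": "date"
        }
      }
    }
  }
}'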

The index-time alternative to phrase queries is to use shingles. As you saw in chapter 5, shingles are like ngrams but for terms instead of characters. A text that was tokenized into Introduction, to, and Elasticsearch with a shingle size of 2 would produce the terms “Introduction to” and “to Elasticsearch.”

The resulting functionality is similar to phrase queries, and the performance implications are similar to the ngram situations we discussed earlier: shingles will increase the index size and slow down indexing in exchange for faster queries.

The two approaches are not exactly equivalent, in the same way wildcards and ngrams aren’t equivalent. With phrase queries, for example, you can specify a slop, which allows for other words to appear in your phrase. For example, a slop of 2 would allow a sequence like “buy the best phone” to match a query for “buy phone.” That works because at search time, Elasticsearch is aware of the position of each term, whereas shingles are effectively single terms.

The fact that shingles are single terms allows you to use them for better matching of compound words. For example, many people still refer to Elasticsearch as “elastic search,” which can be a tricky match. With shingles, you can solve this by using an empty string as a separator instead of the default white space, as shown in figure 10.12.

Figure 10.12. Using shingles to match compound words
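A sketch of such a shingle filter (the filter and analyzer names are made up); with token_separator set to an empty string, the input “elastic search” produces the terms elastic, search, and elasticsearch:

  "analysis": {
    "filter": {
      "glued_shingles": {
        "type": "shingle",
        "max_shingle_size": 2,
        "output_unigrams": true,
        "token_separator": ""
      }
    },
    "analyzer": {
      "compound_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "glued_shingles"]
      }
    }
  }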

As you’ve seen in our discussion of shingles, ngrams, and fuzzy and wildcard queries, there’s often more than one way to search your documents, but that doesn’t mean those ways are equivalent. Choosing the best one in terms of performance and flexibility depends a lot on your use case. Next we’ll look more deeply at scripts, where you’ll find more of the same: multiple ways to achieve the same result, but each method comes with its own advantages and disadvantages.

10.4.2. Tuning scripts or not using them at all

We first introduced scripts in chapter 3 because they can be used for updates. You saw them again in chapter 6, where you used them for sorting. In chapter 7 you used scripts again, this time to build virtual fields at search time using script fields.

You get a lot of flexibility through scripting, but this flexibility has an important impact on performance. Results of a script are never cached because Elasticsearch doesn’t know what’s in the script. There can be something external, like a random number, that will make a document match now but not match for the next run. There’s no choice for Elasticsearch other than running the same script for all documents involved.

When used, scripts are often the most time- and CPU-consuming part of your searches. If you want to speed up your queries, a good starting point is to try skipping scripts altogether. If that’s not possible, the general rule is to get as close to native code as you can to improve their performance.

How can you get rid of scripts or optimize them? The answer depends heavily on the exact use case, but we’ll try to cover the best practices here.

Avoiding the use of scripts

If you’re using scripts to generate script fields, as you did in chapter 7, you can do this at index time. Instead of indexing documents directly and counting the number of group members in a script by looking at the array length, you can count the number of members in your indexing pipeline and add it to a new field. In figure 10.13, we compare the two approaches.

Figure 10.13. Counting members in a script or while indexing

As with ngrams, this approach to doing the computation at index time works well if query latency is a higher priority than indexing throughput.
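For example, instead of having a script compute the group’s member count (or an event’s attendee count) for every matching document at search time, your indexing code can add the count up front as its own field. The document below is hypothetical, but the attendees_count field is the one used in the function_score example later in this section:

% curl -XPUT 'localhost:9200/get-together/event/1100' -d '{
  "title": "Elasticsearch performance tuning",
  "attendees": ["Lee", "Radu", "Roy"],
  "attendees_count": 3
}'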

Besides precomputing, the general rule for performance optimization for scripting is to reuse as much of Elasticsearch’s existing functionality as possible. Before using scripts, can you fulfill the requirements with the function score query that we discussed in chapter 6? The function score query offers lots of ways to manipulate the score. Let’s say you want to run a query for “elasticsearch” events, but you’ll boost the score in the following ways, based on these assumptions:

Events happening soon are more relevant. You’ll make events’ scores drop exponentially the farther in the future they are, up to 60 days.

Events with more attendees are more popular and more relevant. You’ll increase the score linearly the more attendees an event has.

If you calculate the number of event attendees at index time (name the field attendees_count), you can achieve both criteria without using any script:

"function_score": {
  "functions": [
    {
      "linear": {
        "date": {
          "origin": "2013-07-25T18:00",
          "scale": "60d"
        }
      }
    },
    {
      "field_value_factor": {
        "field": "attendees_count"
      }
    }
  ]
}

Native scripts

If you want the best performance from a script, writing native scripts in Java is the best way to go. Such a native script would be an Elasticsearch plugin, and you can look in appendix B for a complete guide on how to write one.

The main disadvantage with native scripts is that they have to be stored on every node in Elasticsearch’s classpath. Changing a script implies updating it on all the nodes of your cluster and restarting them. This won’t be a problem if you don’t have to change your scripts often.

To run a native script in your query, set lang to native and pass the name of the script as the script content. For example, if you have a plugin with a script called numberOfAttendees that calculates the number of event attendees on the fly, you can use it in a stats aggregation like this:

"aggregations": {
  "attendees_stats": {
    "stats": {
      "script": "numberOfAttendees",
      "lang": "native"
    }
  }
}

Lucene expressions

If you have to change scripts often or you want to be prepared to change them without restarting all the nodes in your cluster, and your scripts work with numerical fields, Lucene expressions are likely to be the best choice.

With Lucene expressions, you provide a JavaScript expression in the script at query time, and Elasticsearch compiles it into native code, making it as quick as a native script. The big limitation is that you have access only to indexed numeric fields. Also, if a document is missing the field, a value of 0 is used, which might skew results in some use cases.

To use Lucene expressions, you’d set lang to expression in your script. For example, you might have the number of attendees already, but you know that only half of them usually show up, so you want to calculate some stats based on that number:

"aggs": {
  "expected_attendees": {
    "stats": {
      "script": "doc['attendees_count'].value/2",
      "lang": "expression"
    }
  }
}

If you have to work with non-numeric or non-indexed fields and you want to be able to easily change scripts, you can use Groovy—the default language for scripting since Elasticsearch 1.4. Let’s see how you can optimize Groovy scripts.

Term statistics

If you need to tune the score, you can access Lucene-level term statistics without having to calculate the score in the script itself—for example, if you only want to compute the score based on the number of times that term appears in the document. Unlike Elasticsearch’s defaults, you don’t care about the length of the field in that document or the number of times that term appears in other documents. To do that, you can have a script score that only specifies the term frequency (number of times the term appears in the document), as shown in the following listing.

Listing 10.9. Script score that only specifies term frequency
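A minimal sketch of such a query, assuming Groovy scripting and the _index term-statistics API, with the hypothetical choice of scoring groups on the term elasticsearch in their name field:

"query": {
  "function_score": {
    "query": {
      "match": {
        "name": "elasticsearch"
      }
    },
    "script_score": {
      "script": "_index['name']['elasticsearch'].tf()"
    }
  }
}

The tf() call returns the term’s frequency within the current document, so the score ignores field length and how rare the term is in other documents, exactly as described above.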

Accessing field data

If you need to work with the actual content of a document’s fields in a script, one option is to use the _source field. For example, you’d get the organizer field by using _source['organizer'].

In chapter 3, you saw how you can store individual fields instead of, or in addition to, _source. If an individual field is stored, you can access the stored content, too. For example, the same organizer field can be retrieved with _fields['organizer'].

The problem with _source and _fields is that going to the disk to fetch the field content of that particular field is expensive. Fortunately, this slowness is exactly what made field data necessary when Elasticsearch’s built-in sorting and aggregations needed to access field content. Field data, as we discussed in chapter 6, is tuned for random access, so it’s best to use it in your scripts, too. It’s often orders of magnitude faster than the _source or _fields equivalent, even if field data isn’t already loaded for that field when the script is first run (or if you use doc values, as explained in chapter 6).

To access the organizer field via field data, you’d refer to doc['organizer']. For example, you can return groups where the organizer isn’t a member, so you can ask them why they don’t participate in their own groups:

% curl 'localhost:9200/get-together/group/_search?pretty' -d '{
  "query": {
    "filtered": {
      "filter": {
        "script": {
          "script": "return doc.organizer.values.intersect(doc.members.values).isEmpty()"
        }
      }
    }
  }
}'

There’s one caveat for using doc['organizer'] instead of _source['organizer'] or the _fields equivalent: you’ll access the terms, not the original field of the document. If an organizer is 'Lee', and the field is analyzed with the default analyzer, you’ll get 'Lee' from _source and 'lee' from doc. There are tradeoffs everywhere, but we assume you’ve gotten used to them at this point in the chapter.

Next, we’ll take a deeper look at how distributed searches work and how you can use search types to find a good balance between having accurate scores and low-latency searches.

10.4.3. Trading network trips for less data and better distributed scoring

Back in chapter 2, you saw how when you hit an Elasticsearch node with a search request, that node distributes the request to all the shards that are involved and aggregates the individual shard replies into one final reply to return to the application.

Let’s take a deeper look at how this works. The naïve approach would be to get N documents from all shards involved (where N is the value of size), sort them on the node that received the HTTP request (let’s call it the coordinating node), pick the top N documents, and return them to the application. Let’s say that you send a request with the default size of 10 to an index with the default number of 5 shards. This means that the coordinating node will fetch 10 whole documents from each shard, sort them, and return only the top 10 from those 50 documents. But what if there were 10 shards and you asked for 100 results? The network overhead of transferring the documents and the memory overhead of handling them on the coordinating node would explode, much as specifying large shard_size values hurts the performance of aggregations.

How about returning only the IDs of those 50 documents and the metadata needed for sorting to the coordinating node? After sorting, the coordinating node can fetch only the required top 10 documents from the shards. This would reduce the network overhead for most cases but will involve two round-trips.

With Elasticsearch, both options are available by setting the search_type parameter on the search. The naïve implementation of fetching all involved documents is query_and_fetch, whereas the two-trip method is called query_then_fetch, which is also the default. A comparison of the two is shown in figure 10.14.

Figure 10.14. Comparison between query_and_fetch and query_then_fetch

The default query_then_fetch (shown on the right of the figure) gets better as you hit more shards, as you request more documents via the size parameter, and as documents get bigger because it will transfer much less data over the network. query_and_fetch is only faster when you hit one shard—that’s why it’s used implicitly when you search a single shard, when you use routing, or when you only get the counts (we’ll discuss this later). Right now you can specify query_and_fetch explicitly, but in version 2.0 it will only be used internally for these specific use cases.[11]

11 https://github.com/elastic/elasticsearch/issues/9606

Distributed scoring

By default, scores are calculated per shard, which can lead to inaccuracies. For example, if you search for a term, one of the factors is the document frequency (DF), which shows in how many of all documents the term you’re searching for appears. Those “all documents” are by default “all documents in this shard.” If the DF of a term is significantly different between shards, scoring might not reflect reality. You can see this in figure 10.15, where doc 2 gets a higher score than doc 1, even though doc 1 has more occurrences of “elasticsearch,” because there are fewer documents with that term in its shard.

Figure 10.15. Uneven distribution of DF can lead to incorrect ranking.

You can imagine that with a high enough number of documents, DF values would naturally balance across shards, and the default behavior would work just fine. But if score accuracy is a priority or if DF is unbalanced for your use case (for example, if you’re using custom routing), you’ll need a different approach.

That approach could be to change the search type from query_then_fetch to dfs_query_then_fetch. The dfs part will tell the coordinating node to make an extra call to the shards in order to gather document frequencies of the searched terms. The aggregated frequencies will be used to calculate the score, as you can see in figure 10.16, ranking your doc 1 and doc 2 correctly.

Figure 10.16. dfs search types use an extra network hop to compute global DFs, which are used for scoring.

You probably already figured out that DFS queries are slower because of the extra network call, so make sure that you actually get better scores before switching. If you have a low-latency network, this overhead can be negligible. If, on the other hand, your network isn’t fast enough or you have high query concurrency, you may see a significant overhead.
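Switching is only a matter of the search_type parameter; for example (the query body itself is an arbitrary example):

% curl 'localhost:9200/get-together/group/_search?pretty&search_type=dfs_query_then_fetch' -d '{
  "query": {
    "match": {
      "name": "elasticsearch"
    }
  }
}'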

Returning only counts

But what if you don’t care about scoring at all and you don’t need the document content, either? For example, you need only the document count or the aggregations. In such cases, the recommended search type is count. count asks the involved shards only for the number of documents that match and adds up those numbers.

Tip

In version 2.0, adding size=0 to a query will automatically do the same logic that search_type=count currently does, and search_type=count will be deprecated. More details can be found here: https://github.com/elastic/elasticsearch/pull/9296.
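For example, a counts-only request that runs just an aggregation could look like the following sketch (it reuses the hypothetical attendees_count field from earlier in this section):

% curl 'localhost:9200/get-together/event/_search?pretty&search_type=count' -d '{
  "query": {
    "match": {
      "title": "elasticsearch"
    }
  },
  "aggregations": {
    "attendees_stats": {
      "stats": {
        "field": "attendees_count"
      }
    }
  }
}'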

10.4.4. Trading memory for better deep paging

In chapter 4, you learned that you’d use size and from to paginate the results of your query. For example, to search for “elasticsearch” in get-together events and get the fifth page of 100 results, you’d run a request like this:

% curl 'localhost:9200/get-together/event/_search?pretty' -d '{
  "query": {
    "match": {
      "title": "elasticsearch"
    }
  },
  "from": 400,
  "size": 100
}'

This will effectively fetch the top 500 results, sort them, and return only the last 100. You can imagine how inefficient this gets as you go deeper with pages. For example, if you change the mapping and want to re-index all existing data into a new index, you might not have enough memory to sort through all the results in order to return the last pages.

For this kind of scenario you can use the scan search type, as you’ll do in listing 10.10, to go through all the get-together groups. The initial reply returns only the scroll ID, which uniquely identifies this request and will remember which pages were already returned. To start fetching results, send a request with that scroll ID. Repeat the same request to fetch the next page until you either have enough data or there are no more hits to return—in which case the hits array is empty.

Listing 10.10. Use scan search type
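A minimal sketch of the flow (the size, the scroll timeout, and the placeholder scroll ID are assumptions): the first request only opens the scroll, and each follow-up request passes back the _scroll_id from the previous reply until the hits array comes back empty.

% curl 'localhost:9200/get-together/group/_search?pretty&search_type=scan&scroll=1m&size=50' -d '{
  "query": {
    "match_all": {}
  }
}'

% curl 'localhost:9200/_search/scroll?scroll=1m&pretty' -d '<_scroll_id from the previous reply>'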

As with other searches, scan searches accept a size parameter to control the page size. But this time, the page size is calculated per shard, so the actual returned size would be size times the number of shards. The timeout given in the scroll parameter of each request is renewed each time you get a new page; that’s why you can have a different timeout with every new request.

Note

It may be tempting to use big timeouts so that you’re sure a scroll doesn’t expire while you’re processing it. The problem is that a scroll that’s active but not used wastes resources: it takes up some JVM heap to remember the current page, and disk space, because Lucene segments can’t be deleted by merges until the scroll is completed or expires.

The scan search type always returns results in the order in which it encounters them in the index, regardless of the sort criteria. If you need both deep paging and sorting, you can add a scroll parameter to a regular search request. Sending a GET request with the scroll ID will get the next page of results. This time, size works accurately, regardless of the number of shards. You also get the first page of results with the first request, just like you get with regular searches:

% curl 'localhost:9200/get-together/event/_search?pretty&scroll=1m' -d '{
  "query": {
    "match": {
      "title": "elasticsearch"
    }
  }
}'

From a performance perspective, adding scroll to a regular search is more expensive than using the scan search type because there’s more information to keep in memory when results are sorted. That being said, deep paging is much more efficient than the default because Elasticsearch doesn’t have to sort all previous pages to return the current page.

Scrolling is useful only when you know in advance that you want to do deep paging; it’s not recommended for when you need only a few pages of results. As with everything in this chapter, you pay a price for every performance improvement. In the case of scrolling, that price is to keep information about the current search in memory until the scroll expires or you have no more hits.