7.1. Understanding the anatomy of an aggregation

All aggregations, no matter their type, follow some rules:

You define them in the same JSON request as your queries, and you mark them by the key aggregations, or aggs. You need to give each one a name and specify the type and the options specific to that type.

They run on the results of your query. Documents that don’t match your query aren’t accounted for unless you include them with the global aggregation, which is a bucket aggregation that will be covered later in this chapter.

You can further filter down the results of your query without influencing aggregations by using post filters, which are covered in section 7.1.3. For example, when searching for a keyword in an online shop, you can build statistics on all items matching the keyword but use post filters to show only results that are in stock.

Let’s take a look at the popular terms aggregation, which you’ve already seen in the intro to this chapter. The example use case was getting the most popular subjects (tags) for existing groups of your get-together site. We’ll use this same terms aggregation to explore the rules that all aggregations must follow.

7.1.1. Structure of an aggregation request

In listing 7.1, you’ll run a terms aggregation that will give you the most frequent tags in the get-together groups. The structure of this terms aggregation will apply to every other aggregation.

Note

For this chapter’s listings to work, you’ll need to index the sample dataset from the code samples that come with the book, located at https://github.com/dakrone/elasticsearch-in-action.

Listing 7.1. Using the terms aggregation to get top tags

At the top level there’s the aggregations key, which can be shortened to aggs.

On the next level, you have to give the aggregation a name. You can see that name in the reply. This is useful when you use multiple aggregations in the same request, so you can easily see the meaning of each set of results.

Finally, you have to specify the aggregation type, terms, and the options specific to that type. In this case, that option is the field name.
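The three levels described above can be sketched as a nested structure. The snippet below is a minimal illustration in Python, not a reproduction of the listing: the aggregation name top_tags and the field name tags are assumptions chosen for the example, and the body is only serialized to JSON, not sent to a cluster.

```python
import json

# Level 1: the "aggregations" key (it can be shortened to "aggs").
# Level 2: a name you choose; "top_tags" is a hypothetical name.
# Level 3: the aggregation type ("terms") and its options; the
#          field name "tags" is assumed for illustration.
request_body = {
    "aggregations": {
        "top_tags": {
            "terms": {
                "field": "tags"
            }
        }
    }
}

# This JSON is what you would send as the payload of a _search request.
payload = json.dumps(request_body, indent=2)
print(payload)
```

Because no query key is present, a request with this body would aggregate over every document the endpoint covers.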

The aggregation request from listing 7.1 hits the _search endpoint, just like the queries you’ve seen in previous chapters. Because no query was specified, the request effectively runs the match_all query you saw in chapter 4, so your aggregation runs on all the group documents. Running a different query will make the aggregation run through a different set of documents. Either way, you also get back 10 group results alongside the aggregation results, because size defaults to 10. As you saw in chapters 2 and 4, you can change size from either the URI or the JSON payload of your query.

Field data and aggregations

When you run a regular search, it goes fast because of the nature of the inverted index: you have a limited number of terms to look for, and Elasticsearch will identify documents containing those terms and return the results. An aggregation, on the other hand, has to work with the terms of each document matching the query. It needs a mapping from document IDs to terms, which is the opposite of the inverted index, where terms map to documents.

By default, Elasticsearch un-inverts the inverted index into field data, as we explained in chapter 6, section 6.10. The more terms it has to deal with, the more memory the field data will use. That’s why you have to make sure you give Elasticsearch a large enough heap, especially when you’re doing aggregations on large numbers of documents or if you’re analyzing fields and you have more than one term per document. For not_analyzed fields, you can use doc values to have this un-inverted data structure built at index time and stored on disk. More details about field data and doc values can be found in chapter 6, section 6.10.
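To make the idea of un-inverting concrete, here is a conceptual sketch in Python. It is not how Elasticsearch implements field data internally; the terms, document IDs, and the uninvert helper are all made up for illustration.

```python
from collections import defaultdict

# A toy inverted index: term -> IDs of the documents containing it.
# The terms and document IDs are invented for this example.
inverted_index = {
    "big data": [1, 3],
    "open source": [1, 2, 3],
    "denver": [2],
}

def uninvert(index):
    """Build the document -> terms mapping that an aggregation needs."""
    doc_to_terms = defaultdict(list)
    for term, doc_ids in index.items():
        for doc_id in doc_ids:
            doc_to_terms[doc_id].append(term)
    return dict(doc_to_terms)

field_data = uninvert(inverted_index)

# Every (document, term) pair becomes an entry to keep around, so more
# terms per document means a larger un-inverted structure.
total_entries = sum(len(terms) for terms in field_data.values())
print(field_data, total_entries)
```

The entry count grows with both the number of documents and the number of terms per document, which is why analyzed fields with many terms put more pressure on memory.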

7.1.2. Aggregations run on query results

Computing metrics over the whole dataset is just one of the possible use cases for aggregations. Often you want to compute metrics in the context of a query. For example, if you’re searching for groups in Denver, you probably want to see the most popular tags for those groups only. As you’ll see in the next listing, this is the default behavior for aggregations. Unlike in listing 7.1, where the implied query was match_all, in the following listing you query for “Denver” in the location field, and aggregations will only be about groups from Denver.

Listing 7.2. Getting top tags for groups in Denver

Recall from chapter 4 that you can use the from and size parameters of your query to control the pagination of results. These parameters have no influence on aggregations because aggregations always run on all the documents matching a query.
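The following Python sketch mimics this behavior on a few in-memory documents. It only illustrates the semantics and is not Elasticsearch code: the sample groups, the search helper, and its parameters are hypothetical.

```python
from collections import Counter

# Hypothetical group documents; names, locations, and tags are invented.
groups = [
    {"name": "A", "location": "denver", "tags": ["big data", "open source"]},
    {"name": "B", "location": "denver", "tags": ["open source"]},
    {"name": "C", "location": "boulder", "tags": ["big data"]},
    {"name": "D", "location": "denver", "tags": ["big data", "open source"]},
]

def search(docs, location, size=10, from_=0):
    """Return one page of hits plus tag counts over all matching docs."""
    matching = [d for d in docs if d["location"] == location]
    # The aggregation sees every document matching the query ...
    tag_counts = Counter(tag for d in matching for tag in d["tags"])
    # ... while from/size only trim the list of returned hits.
    hits = matching[from_:from_ + size]
    return hits, tag_counts.most_common()

hits, top_tags = search(groups, "denver", size=1)
print(len(hits), top_tags)
```

Even with size set to 1, the tag counts cover all three Denver groups, and the Boulder group is left out because it doesn't match the query.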

If you want to restrict query results more without also restricting aggregations, you can use post filters. We’ll discuss post filters and the relationship between filters and aggregations in general in the next section.

7.1.3. Filters and aggregations

In chapter 4 you saw that for most query types there’s a filter equivalent. Because filters don’t calculate scores and are cacheable, they’re faster than their query counterparts. You’ve also learned that you should wrap filters in a filtered query, like this:

% curl 'localhost:9200/get-together/group/_search?pretty' -d '{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "location": "denver"
        }
      }
    }
  }
}'

Using the filter this way is good for overall query performance because the filter runs first. Then the query, which is typically more performance-intensive, runs only on documents matching the filter. As far as aggregations are concerned, they run only on documents matching the overall filtered query, as shown in figure 7.3.

Figure 7.3. A filter wrapped in a filtered query runs first and restricts both results and aggregations.

“Nothing new so far,” you might say. “The filtered query behaves like any other query when it comes to aggregations,” and you’d be right. But there’s also another way of running filters: by using a post filter, which will run after the query and independent of the aggregation. The following request will give the same results as the previous filtered query:

% curl 'localhost:9200/get-together/group/_search?pretty' -d '{
  "post_filter": {
    "term": {
      "location": "denver"
    }
  }
}'

As illustrated in figure 7.4, the post filter differs from the filter in the filtered query in two ways:

Figure 7.4. Post filter runs after the query and doesn’t affect aggregations.

Performance: The post filter runs after the query, so the query runs on all documents and the filter runs only on the documents matching the query. The overall request is typically slower than the filtered query equivalent, where the filter runs first.

Document set processed by aggregations: If a document doesn’t match the post filter, aggregations still account for it.
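The second difference is easy to see in a small simulation. The Python sketch below is an illustration under stated assumptions, not Elasticsearch code: the sample groups and the helper functions filtered_search and post_filter_search are hypothetical, and the query is taken to be the implied match_all.

```python
from collections import Counter

# Hypothetical group documents; names, locations, and tags are invented.
groups = [
    {"name": "A", "location": "denver", "tags": ["big data", "open source"]},
    {"name": "B", "location": "denver", "tags": ["open source"]},
    {"name": "C", "location": "boulder", "tags": ["hiking"]},
    {"name": "D", "location": "denver", "tags": ["big data", "open source"]},
]

def top_tags(docs):
    """Count how many documents carry each tag."""
    return dict(Counter(tag for d in docs for tag in d["tags"]))

def is_denver(doc):
    return doc["location"] == "denver"

def filtered_search(docs, filter_fn):
    # The filter runs first, so hits and aggregations both see
    # only the documents that pass it.
    candidates = [d for d in docs if filter_fn(d)]
    return candidates, top_tags(candidates)

def post_filter_search(docs, post_filter_fn):
    # The implied match_all query selects every document and the
    # aggregation runs on that set; the post filter trims only the hits.
    matched = list(docs)
    hits = [d for d in matched if post_filter_fn(d)]
    return hits, top_tags(matched)

f_hits, f_aggs = filtered_search(groups, is_denver)
p_hits, p_aggs = post_filter_search(groups, is_denver)
print(f_aggs)
print(p_aggs)
```

Both calls return the same three Denver groups as hits, but only the post filter variant counts the hiking tag from the Boulder group in its aggregation.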

Now that you understand the relationships between queries, filters, and aggregations, as well as the overall structure of an aggregation request, we can dive deeper into Aggregations Land and explore different aggregation types. We’ll start with metrics aggregations and then go to bucket aggregations, and then we’ll discuss how to combine them to get powerful insights from your data in real time.