7.4. Nesting aggregations

The real power of aggregations is the fact that you can combine them. For example, if you have a blog and you record each access to your posts, you can use the terms aggregation to show the most-viewed posts. But you can also nest a cardinality aggregation under this terms aggregation and show the number of unique visitors for each post; you can even change the sorting in the terms aggregation to show posts with the most unique visitors.
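As a minimal sketch of the blog example, the request body below nests a cardinality aggregation under a terms aggregation and sorts the term buckets by it. The body is built as a Python dictionary and printed as JSON, which you could pass to curl with -d. The field names post_id and visitor_id are hypothetical, and the aggregation names are arbitrary.

```python
import json

# Hypothetical fields: "post_id" identifies the post, "visitor_id" the visitor.
body = {
    "aggregations": {
        "top_posts": {
            "terms": {
                "field": "post_id",
                # Sort buckets by the nested metric instead of by document count.
                "order": {"unique_visitors": "desc"},
            },
            "aggregations": {
                # Approximate count of distinct visitors within each post bucket.
                "unique_visitors": {"cardinality": {"field": "visitor_id"}}
            },
        }
    }
}

print(json.dumps(body, indent=2))
```

The key used in order must match the name of the nested aggregation, here unique_visitors.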

As you may imagine, nesting aggregations opens a whole new range of possibilities for exploring data. Nesting is the main reason aggregations emerged in Elasticsearch as a replacement for facets, because facets couldn’t be combined.

Multi-bucket aggregations are typically the point where you start nesting. For example, the terms aggregation allows you to show the top tags for get-together groups; this means you’ll have a bucket of documents for each tag. You can use sub-aggregations to show more metrics for each bucket. For example, you can show how many groups are being created each month, for each tag, as illustrated in figure 7.10.

Figure 7.10. Nesting a date histogram aggregation under a terms aggregation

Later in this section, we’ll discuss one particular use case for nesting: result grouping, which, unlike a regular search that gives you the top N results by relevance, gives you the top N results for each bucket of documents generated by the parent aggregation. Say you have an online shop and someone searches for “Windows.” Normally, relevance-sorted results will show many versions of the Windows operating system first. This may not be the best user experience, because at this point it’s not 100% clear whether the user is looking to buy a Windows operating system, some software built for Windows, or some hardware that works with Windows. This is where result grouping, illustrated in figure 7.11, comes in handy: you can show the top three results from each of the operating systems, software, and hardware categories and give the user a broader range of results. The user may also want to click on the category name to narrow the search to that category only.

Figure 7.11. Nesting the top_hits aggregation under a terms aggregation to get result grouping

In Elasticsearch, you’ll be able to get result grouping by using a special aggregation called top_hits. It retrieves the top N results, sorted by score or a criterion of your choice, for each bucket of a parent aggregation. That parent aggregation can be a terms aggregation that’s running on the category field, as suggested in the online shop example of figure 7.11; we’ll go over this special aggregation in the next section.

The last nesting use case we’ll talk about is controlling the document set on which your aggregations run. For example, regardless of the query, you might want to show the top tags for get-together groups created in the last year. To do this, you’d use the filter aggregation, which creates a bucket of documents that match the provided filter, in which you can nest other aggregations.

7.4.1. Nesting multi-bucket aggregations

To nest an aggregation within another one, use the aggregations or aggs key on the same level as the parent aggregation type and put the sub-aggregation definition as its value. For multi-bucket aggregations, you can repeat this to any depth. For example, in the following listing you’ll use the terms aggregation to show the top tags. For each tag, you’ll use the date_histogram aggregation to show how many groups were created each month. Finally, for each bucket of such groups, you’ll use the range aggregation to show how many groups have fewer than three members and how many have at least three.

Listing 7.15. Nesting multi-bucket aggregations three times
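A request body with this three-level shape might look like the following sketch, built as a Python dictionary and printed as JSON that you could pass to curl with -d. The field names tags and created_on are assumptions about the group mapping, member_count is a hypothetical numeric field holding the number of members, and the aggregation names are arbitrary.

```python
import json

# Assumed fields: "tags", "created_on"; "member_count" is hypothetical.
body = {
    "aggregations": {
        "top_tags": {
            "terms": {"field": "tags"},
            # Level 2: sits next to "terms", on the same level as the parent type.
            "aggregations": {
                "groups_per_month": {
                    "date_histogram": {"field": "created_on", "interval": "1M"},
                    # Level 3: nested under each monthly bucket.
                    "aggregations": {
                        "number_of_members": {
                            "range": {
                                "field": "member_count",
                                # "to" is exclusive, "from" is inclusive:
                                # fewer than three, and at least three.
                                "ranges": [{"to": 3}, {"from": 3}],
                            }
                        }
                    },
                }
            },
        }
    }
}

print(json.dumps(body, indent=2))
```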

You can always nest a metrics aggregation within a bucket aggregation. For example, if you wanted the average number of group members instead of the 0–2 and 3+ ranges that you had in the previous listing, you could use the avg or stats aggregation.
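For instance, a sketch of the average-members variant could replace the range aggregation with avg under each tag bucket. As before, tags is an assumed field name and member_count is a hypothetical numeric field; swapping "avg" for "stats" would return the minimum, maximum, sum, and count as well.

```python
import json

# Assumed field "tags"; "member_count" is a hypothetical numeric field.
body = {
    "aggregations": {
        "top_tags": {
            "terms": {"field": "tags"},
            "aggregations": {
                # A metrics aggregation computed separately for every tag bucket.
                "avg_members": {"avg": {"field": "member_count"}}
            },
        }
    }
}

print(json.dumps(body, indent=2))
```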

One particular aggregation we promised to cover earlier is top_hits. It gets you the top N results, sorted by the criteria you choose, for each bucket of its parent aggregation. Next, we’ll look at how you’ll use the top_hits aggregation to get result grouping.

7.4.2. Nesting aggregations to get result grouping

Result grouping is useful when you want to show the top results grouped by a certain category. Google does something similar: when many results come from the same site, you sometimes see only the top three or so before the list moves on to the next site. You can always click the site’s name to get all the results from it that match your query.

That’s what result grouping is for: it gives the user a better idea of what else is in there. Say you want to show the user the most recent events, and to make results more diverse you’ll show the most recent event for each of the most frequent attendees. You’ll do this in the next listing by running the terms aggregation on the attendees field and nesting the top_hits aggregation under it.

Listing 7.16. Using the top hits aggregation to get result grouping
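A request body along these lines might look like the following sketch, built as a Python dictionary and printed as JSON for use with curl -d. It uses the attendees and date fields mentioned in the text; the aggregation names are arbitrary, and sorting by date in descending order is one assumed way to express "most recent."

```python
import json

body = {
    "aggregations": {
        "frequent_attendees": {
            # One bucket per attendee, most frequent first by default.
            "terms": {"field": "attendees"},
            "aggregations": {
                "recent_events": {
                    "top_hits": {
                        # Newest event first within each attendee bucket.
                        "sort": [{"date": {"order": "desc"}}],
                        # Keep only the single most recent event per bucket.
                        "size": 1,
                    }
                }
            },
        }
    }
}

print(json.dumps(body, indent=2))
```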

At first, it may seem strange to use aggregations for getting result grouping. But now that you’ve learned what aggregations are all about, you can see that these concepts of buckets and nesting are powerful and enable you to do much more than gather some statistics on query results. The top_hits aggregation is an example of a non-statistical outcome of aggregations.

You’re not limited to only query results when you run aggregations; this is the default behavior, as you learned in section 7.1, but you can work around that if you need to. For example, let’s say that you want to show the most popular blog post tags on your blog somewhere on a sidebar. And you want to show that sidebar no matter what the user is searching for. To achieve this, you’d need to run your terms aggregation on all blog posts, independent of your query. Here’s where the global aggregation becomes useful: it produces a bucket with all the documents of your search context (the indices and types you’re searching in), making all other aggregations nested under it work with all these documents.

The global aggregation is one of the single-bucket aggregations that you can use to change the document set other aggregations run on, and that’s what we’ll explore next.

7.4.3. Using single-bucket aggregations

As you saw in section 7.1, Elasticsearch will run your aggregations on the query results by default. If you want to change this default, you’ll have to use single-bucket aggregations. Here we’ll discuss three of them:

global creates a bucket with all the documents of the indices and types you’re searching on. This is useful when you want to run aggregations on all documents, no matter the query.

filter and filters aggregations create buckets with all the documents matching one or more filters. This is useful when you want to further restrict the document set—for example, to run aggregations only on items that are in stock, or separate aggregations for those in stock and those that are promoted.

missing creates a bucket with documents that don’t have a specified field. It’s useful when you have another aggregation running on a field, but you want to do some computations on documents that aren’t covered by that aggregation because the field is missing. For example, you might show the average price of items across multiple stores and also the number of stores not listing a price for those items.

Global

Using your get-together site from the code samples, assume you’re querying for events about Elasticsearch, but you want to see the most frequent tags overall. For example, as we described earlier, you want to show those top tags somewhere on a sidebar, independent of what the user is searching for. To achieve this, you need to use the global aggregation, which can alter the flow of data from query to aggregations, as shown in figure 7.12.

Figure 7.12. Nesting aggregations under the global aggregation makes them run on all documents.

In the following listing you’ll nest the terms aggregation under the global aggregation to get the most frequent tags on all documents, even if the query looks for only those with “elasticsearch” in the title.

Listing 7.17. Global aggregation helps show top tags overall regardless of the query
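A request of this kind might be sketched as follows, with the body built as a Python dictionary and printed as JSON for use with curl -d. The field names title and tags are assumptions about the mapping, and the aggregation names are arbitrary. Note that global takes an empty object: it has no options of its own.

```python
import json

# Assumed fields: "title" for the query, "tags" for the terms aggregation.
body = {
    # The query still decides which hits are returned.
    "query": {"match": {"title": "elasticsearch"}},
    "aggregations": {
        "all_documents": {
            # Empty body: the bucket holds every document in the search context.
            "global": {},
            "aggregations": {
                # Runs on all documents, not only on those matching the query.
                "top_tags": {"terms": {"field": "tags"}}
            },
        }
    },
}

print(json.dumps(body, indent=2))
```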

When we say “all documents,” we mean all the documents from the search context defined in the search URI. In this case you’re searching in the group type of the get-together index, so all the groups will be taken into account. If you searched in the whole get-together index, both groups and events would be included in the aggregation.

Filter

Remember the post filter from section 7.1? It’s used when you define a filter directly in the JSON request, instead of wrapping it in a filtered query; the post filter restricts the results you get without affecting the aggregations.

The filter aggregation does the opposite: it restricts the document set your aggregations run on, without affecting the results. This is illustrated in figure 7.13.

Figure 7.13. The filter aggregation restricts query results for aggregations nested under it.

Suppose you’re searching for events with “elasticsearch” in the title and want to create a word cloud from words in the description, but only from documents that are recent enough, say, those dated after July 1, 2013.

To do that, in the following listing you’ll run a query as usual, but with aggregations. First comes a filter aggregation restricting the document set to events after July 1, 2013, and under it you’ll nest the terms aggregation that generates the word-cloud information.

Listing 7.18. filter aggregation restricts the document set coming from the query
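Such a request might be sketched as follows, with the body built as a Python dictionary and printed as JSON for use with curl -d. The field names title, date, and description are assumptions about the event mapping, the date string assumes the field accepts the default ISO format, and the aggregation names are arbitrary.

```python
import json

# Assumed fields: "title", "date", "description".
body = {
    # Hits are decided by the query alone; the filter below doesn't touch them.
    "query": {"match": {"title": "elasticsearch"}},
    "aggregations": {
        "since_july": {
            # Single bucket: query results dated after July 1, 2013.
            "filter": {"range": {"date": {"gt": "2013-07-01"}}},
            "aggregations": {
                # Word-cloud data, computed only on the filtered bucket.
                "description_cloud": {"terms": {"field": "description"}}
            },
        }
    },
}

print(json.dumps(body, indent=2))
```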

There’s also a filters (plural) aggregation, which allows you to define multiple filters. It works similarly to the filter aggregation, except that it generates multiple buckets, one for each filter, much as the range aggregation generates one bucket for each range. For more information about the filters aggregation, go to www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-filters-aggregation.html.
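As a rough sketch of the in-stock and promoted example mentioned earlier, the body below defines two named filters and nests an avg aggregation that is computed once per filter bucket. It is built as a Python dictionary and printed as JSON for use with curl -d. The fields in_stock, promoted, and price are hypothetical, as are the aggregation and bucket names.

```python
import json

# Hypothetical fields: boolean "in_stock" and "promoted", numeric "price".
body = {
    "aggregations": {
        "stock_and_promotions": {
            "filters": {
                # Each named filter produces its own bucket in the response.
                "filters": {
                    "in_stock": {"term": {"in_stock": True}},
                    "promoted": {"term": {"promoted": True}},
                }
            },
            "aggregations": {
                # Computed separately for the in_stock and promoted buckets.
                "avg_price": {"avg": {"field": "price"}}
            },
        }
    }
}

print(json.dumps(body, indent=2))
```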

Missing

Most of the aggregations we’ve looked at so far make buckets of documents and get metrics from values of a field. If a document is missing that field, it won’t be part of the bucket and it won’t contribute to any metrics.

For example, you might have a date_histogram aggregation on event dates, but some events have no date set yet. You can count them, too, through the missing aggregation:

% curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "event_dates": {
      "date_histogram": {
        "field": "date",
        "interval": "1M"
      }
    },
    "missing_date": {
      "missing": {
        "field": "date"
      }
    }
  }
}'

As with other single-bucket aggregations, the missing aggregation allows you to nest other aggregations under it. For example, you can use the max aggregation to show the maximum number of people who intend to participate in a single event that doesn’t have a date set yet.
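A minimal sketch of that nesting follows, with the body built as a Python dictionary and printed as JSON for use with curl -d. The date field comes from the request above; attendee_count is a hypothetical numeric field holding the number of people who plan to attend, and the aggregation names are arbitrary.

```python
import json

# "attendee_count" is a hypothetical numeric field.
body = {
    "aggregations": {
        "missing_date": {
            # Single bucket: events that have no value for "date".
            "missing": {"field": "date"},
            "aggregations": {
                # Largest attendee count among the undated events.
                "max_attendees": {"max": {"field": "attendee_count"}}
            },
        }
    }
}

print(json.dumps(body, indent=2))
```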

There are other important single-bucket aggregations that we didn’t cover here, like the nested and reverse_nested aggregations, which allow you to use all the power of aggregations with nested documents.

Using nested documents is one of the ways to work with relational data in Elasticsearch. The next chapter provides all you need to know about relations among documents, including nested documents and nested aggregations.