7.3. Multi-bucket aggregations

As you saw in the previous section, metrics aggregations are about taking all your documents and generating one or more numbers that describe them. Multi-bucket aggregations are about taking those documents and putting them into buckets—like the group of documents matching each tag. Then, for each bucket, you’ll get one or more numbers that describe the bucket, such as counting the number of groups for each tag.

So far you’ve run metrics aggregations on all documents matching the query. You can think of those documents as one big bucket. Other aggregations generate such buckets: for example, if you’re indexing logs and have a country code field, you can do a terms aggregation on it to create one bucket of documents for each country. As you’ll see in section 7.4, you can nest aggregations: for example, a cardinality aggregation could run on the buckets created by the terms aggregation to give you the number of unique visitors per country.

For now, let’s see what kinds of multi-bucket aggregations are available and where they’re typically useful:

Terms aggregations let you figure out the frequency of each term in your documents. The terms aggregation, which you’ve seen a couple of times already, gives you back the number of times each term appears; it’s useful for figuring out things like frequent posters on a blog or popular tags. There’s also the significant_terms aggregation, which gives you back the difference between the occurrence of a term in the whole index and its occurrence in your query results. This is useful for suggesting terms that are significant for the search context, like “elasticsearch” would be for the context of “search engine.”

Range aggregations create buckets based on which numerical, date, or IP address range each document falls into. This is useful when analyzing data where users have fixed expectations. For example, if someone is searching for a laptop in an online shop, you know which price ranges are most popular.

Histogram aggregations, either numerical or date, are similar to range aggregations, but instead of requiring you to define each range, you define an interval, and Elasticsearch builds buckets based on that interval. This is useful when you don’t know where the user is likely to look. For example, you could show a chart of how many events occur each month.

Nested, reverse nested, and children aggregations allow you to perform aggregations across document relationships. We’ll discuss them in chapter 8 when we talk about nested and parent-child relations.

Geo distance and geohash grid aggregations allow you to create buckets based on geolocation. We’ll show them in appendix A, which is focused on geo search.

Figure 7.5 shows an overview of the types of multi-bucket aggregations we’ll discuss here.

Figure 7.5. Major types of multi-bucket aggregations

Next, let’s zoom into each of these multi-bucket aggregations and see how you can use them.

7.3.1. Terms aggregations

We first looked at the terms aggregation in section 7.1 as an example of how all aggregations work. The typical use case is to get the top frequent X, where X would be a field in your document, like the name of a user, a tag, or a category. Because the terms aggregation counts every term and not every field value, you’ll normally run this aggregation on a non-analyzed field, because you want “big data” to be counted once and not once for “big” and once for “data.”

You could use the terms aggregation to extract the most frequent terms from an analyzed field, like the description of an event. You can use this information to generate a word cloud, like the one in figure 7.6. Just make sure you have enough memory for loading all those terms if you have many documents or the documents contain many terms.

Figure 7.6. A terms aggregation can be used to get term frequencies and generate a word cloud.

By default, the order of terms is by their count, descending, which fits all the top frequent X use cases. But you can order terms ascending, or by other criteria, such as the term name itself. The following listing shows how to list the group tags ordered alphabetically by using the order property.

Listing 7.8. Ordering tag buckets by name
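A minimal request of this shape, using the tags.verbatim field from earlier examples, might look like the following sketch:

```shell
URI=localhost:9200/get-together/group/_search
curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "tags": {
      "terms": {
        "field": "tags.verbatim",
        "order": { "_term": "asc" }
      }
    }
  }
}'
```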

If you’re nesting a metric aggregation under your terms aggregation, you can order terms by the metric, too. For example, you could use the average metric aggregation under your tags aggregation from listing 7.8 to get the average number of group members per tag. You could then order tags by the number of members by referring to the name of your metric aggregation, like avg_members: desc (instead of _term: asc, as in listing 7.8).
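A sketch of such a request follows; it assumes a numeric field (here hypothetically called members) that holds the member count, so the nested avg aggregation has something to average:

```shell
URI=localhost:9200/get-together/group/_search
curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "tags": {
      "terms": {
        "field": "tags.verbatim",
        "order": { "avg_members": "desc" }
      },
      "aggregations": {
        "avg_members": { "avg": { "field": "members" } }
      }
    }
  }
}'
```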

Which terms to include in the reply

By default, the terms aggregation will return only the top 10 terms by the order you selected. You can, however, change that number through the size parameter. Setting size to 0 will get you all the terms, but this is dangerous with a high-cardinality field, because returning a very large result is CPU-intensive to sort and might saturate your network.

To get back the top 10 terms—or the number of terms you configure with size—Elasticsearch has to get a number of terms (configurable through shard_size) from each shard and aggregate the results. The process is shown in figure 7.7, with shard_size and size set to 2 for clarity.

Figure 7.7. Sometimes the overall top X is inaccurate, because only the top X terms are returned per shard.

This mechanism implies that you might get inaccurate counters for some terms if those terms don’t make it to the top of each individual shard. This can even result in missing terms, like in figure 7.7 where lucene, with a total value of 7, isn’t returned in the top 2 overall tags because it didn’t make the top 2 for each shard.

You can get more accurate results by setting a large shard_size, as shown in figure 7.8. But this will make aggregations more expensive (especially if you nest them) because there are more buckets that need to be kept in memory.
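As a sketch, a request trading some memory for accuracy could combine size and shard_size like this (the values here are arbitrary, for illustration only):

```shell
URI=localhost:9200/get-together/group/_search
curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "tags": {
      "terms": {
        "field": "tags.verbatim",
        "size": 10,
        "shard_size": 50
      }
    }
  }
}'
```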

Figure 7.8. Reducing inaccuracies by increasing shard_size

To get an idea of how accurate results are, you can check the values at the beginning of the aggregation response:

"tags" : {

"doc_count_error_upper_bound" : 0,

"sum_other_doc_count" : 6,

The first number is the worst-case scenario error margin. For example, if the minimum count for a term returned by a shard is 5, it could be that a term occurring four times in that shard has been missed. If that term should have appeared in the final results, that’s a worst-case error of 4. The total of these numbers for all shards makes up doc_count_error_upper_bound. For our code samples, that number is always 0, because we have only one shard—the top terms for that shard are the same as the global top terms.

The second number is the total count of the terms that didn’t make the top.

You can get a doc_count_error_upper_bound value for each term by setting show_term_doc_count_error to true. This takes the worst-case scenario per term: for example, if “big data” is returned by a shard, you know that shard’s count is exact. But if another shard doesn’t return “big data” at all, the worst case is that “big data” actually exists there with a count just below that shard’s last returned term. Adding up these error numbers for the shards not returning that term makes up the per-term doc_count_error_upper_bound.

At the other end of the accuracy spectrum, you could consider terms with low frequency irrelevant and exclude them from the result set entirely. This is especially useful when you sort terms by something other than frequency, which makes it likely that low-frequency terms will appear, but you don’t want to pollute the results with irrelevant entries like typos. To do that, change the min_doc_count setting from its default value of 1. If you want to cut these low-frequency terms at the shard level, use shard_min_doc_count.
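For example, a sketch of a request that drops terms appearing in fewer than two documents (the threshold of 2 is arbitrary, for illustration):

```shell
URI=localhost:9200/get-together/group/_search
curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "frequent_tags": {
      "terms": {
        "field": "tags.verbatim",
        "min_doc_count": 2
      }
    }
  }
}'
```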

Finally, you can include and exclude specific terms from the result. You’d do that by using the include and exclude options and providing regular expressions as values. Using include alone will include only terms matching the pattern; using exclude alone will include terms that don’t match. Using both will have exclude take precedence: included terms will match the include pattern but won’t match the exclude pattern.

The following listing shows how to return counters only for tags containing “search.”

Listing 7.9. Creating buckets only for terms containing “search”

URI=localhost:9200/get-together/group/_search
curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "tags": {
      "terms": {
        "field": "tags.verbatim",
        "include": ".*search.*"
      }
    }
  }
}'

The reply looks like this:

"aggregations" : {
  "tags" : {
    "buckets" : [ {
      "key" : "elasticsearch",
      "doc_count" : 2
    }, {
      "key" : "enterprise search",
      "doc_count" : 1

Collect mode

By default, Elasticsearch does all aggregations in a single pass. For example, if you had a terms aggregation and a cardinality aggregation nested in it, Elasticsearch would make a bucket for each term, calculate the cardinality for each bucket, sort those buckets, and return the top X.

This works well for most use cases, but it will take lots of time and memory if you have lots of buckets and lots of sub-aggregations, especially if a sub-aggregation is also a multi-bucket aggregation with lots of buckets. In such cases, a two-pass approach will be better: first create the buckets of the top-level aggregation, sort and cache the top X, and then calculate sub-aggregations on only those top X.

You can control which approach Elasticsearch uses by setting collect_mode. The default is depth_first, and the two-pass approach is breadth_first.
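A sketch of a request that asks for the two-pass approach; the nested cardinality aggregation on a hypothetical members field is here only to give the collect mode something to defer:

```shell
URI=localhost:9200/get-together/group/_search
curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "tags": {
      "terms": {
        "field": "tags.verbatim",
        "collect_mode": "breadth_first"
      },
      "aggregations": {
        "unique_members": {
          "cardinality": { "field": "members" }
        }
      }
    }
  }
}'
```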

Significant terms

The significant_terms aggregation is useful if you want to see which terms have higher frequencies than normal in your current search results. Let’s take the example of get-together groups: in all the groups out there, the term clojure may not appear frequently enough to count. Let’s assume that it appears 10 times out of 1,000,000 terms (0.0001%). If you restrict your search to Denver, let’s say it appears 7 times out of 10,000 terms (0.07%). The percentage is significantly higher than before and indicates a strong Clojure community in Denver, compared to the rest of the search area. It doesn’t matter that other terms such as programming or devops have a much higher absolute frequency.

The significant_terms aggregation is much like the terms aggregation in the sense that it’s counting terms. But the resulting buckets are ordered by a score, which represents the difference in percentage between the foreground documents (that 0.07% in the previous example) and the background documents (0.0001%). The foreground documents are those matching your query, and the background documents are all the documents from the index.

In the following listing, you’ll try to find out which users of the get-together site have a similar preference to Lee for events. To do that, you’ll query for events where Lee attends and use the significant_terms aggregation to see which event attendees participate in more, compared to the overall set of events they attend.

Listing 7.10. Finding attendees attending similar events to Lee
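A sketch of such a request follows; the attendees field is from the get-together sample data, while the aggregation name and the min_doc_count and exclude settings (to drop rare attendees and Lee himself) are illustrative assumptions:

```shell
URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
  "query": {
    "match": { "attendees": "lee" }
  },
  "aggregations": {
    "significant_attendees": {
      "significant_terms": {
        "field": "attendees",
        "min_doc_count": 2,
        "exclude": "lee"
      }
    }
  }
}'
```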

As you might have guessed from the listing, the significant_terms aggregation has the same size, shard_size, min_doc_count, shard_min_doc_count, include, and exclude options as the terms aggregation, which lets you control the terms you get back. In addition to those, it allows you to change the background documents from all the documents in the index to only those matching a filter defined in the background_filter parameter. For example, you may know that Lee participates only in technology events, so you can filter on those to make sure that events irrelevant to him aren’t taken into account.
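As a sketch, a background_filter could be added like this; the term filter on title is purely hypothetical, standing in for whatever field marks technology events in your data:

```shell
URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
  "query": {
    "match": { "attendees": "lee" }
  },
  "aggregations": {
    "significant_attendees": {
      "significant_terms": {
        "field": "attendees",
        "background_filter": {
          "term": { "title": "technology" }
        }
      }
    }
  }
}'
```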

Both the terms and significant_terms aggregations work well for string fields. For numeric fields, range and histogram aggregations are more relevant, and we’ll look at them next.

7.3.2. Range aggregations

The terms aggregation is most often used with strings, but it works with numeric values, too. This is useful when you have low cardinality, like when you want to give counts on how many laptops have two years of warranty, how many have three, and so on.

With high-cardinality fields, such as ages or prices, you’re most likely looking for ranges. For example, you may want to know how many of your users are between 18 and 39, how many are between 40 and 60, and so on. You can still do that with the terms aggregation, but it’s going to be tedious: in your application, you’d have to add up counters for ages 18, 19, and so on until you get to 39 just to fill the first bucket. And if you want to add sub-aggregations, like the ones you’ll see later in this chapter, things get even more complicated.

To solve this problem for numerical values, you have the range aggregation. As the name suggests, you give the numerical ranges you want, and it will count the documents with values that fall into each bucket. You can use those counters to represent the data in a graphical way—for example, with a pie chart, as shown in figure 7.9.

Figure 7.9. range aggregations give you counts of documents for each range. This is good for pie charts.

Recall from chapter 3 that date strings are stored as type long in Elasticsearch, representing the UNIX time in milliseconds. To work with date ranges, you have a variant of the range aggregation called the date_range aggregation.

Range aggregation

Let’s get back to our get-together site example and do a breakdown of events by their number of attendees. You’ll do it with the range aggregation and give it an array of ranges. The thing to keep in mind here is that the minimum value of a range (the from key) is included in the bucket, whereas the maximum value (the to key) is excluded.

Ranges don’t have to be adjacent; they can be separated or they can overlap. In most cases it makes sense to cover all values, but you don’t have to.

Listing 7.11. Using a range aggregation to divide events by the number of attendees
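A sketch of such a request follows; it assumes the attendee count can be aggregated numerically from the attendees field, and the aggregation name is illustrative. Note the first and last ranges omit from and to, respectively:

```shell
URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "attendees_breakdown": {
      "range": {
        "field": "attendees",
        "ranges": [
          { "to": 4 },
          { "from": 4, "to": 6 },
          { "from": 6 }
        ]
      }
    }
  }
}'
```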

You can see from the listing that you don’t have to specify both from and to for every range in the aggregation. Omitting one of these parameters will remove the respective boundary, and this enables you to search for all events with fewer than four members or with at least six.

Date range aggregation

As you might imagine, the date_range aggregation works just like the range aggregation, except you put date strings in your range definitions. And because of that, you should define the date format so Elasticsearch will know how to translate the string you give it into the numerical UNIX time, which is how date fields are stored.

In the following listing, you’ll divide events into two categories: before July 2013 and starting with July 2013. You can use a similar approach to count future events and past events, for example.

Listing 7.12. Using a date range aggregation to divide events by scheduled date
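A sketch of such a request, assuming the events’ date field from the sample data; the format value tells Elasticsearch how to parse the range boundaries:

```shell
URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "dates_breakdown": {
      "date_range": {
        "field": "date",
        "format": "YYYY.MM",
        "ranges": [
          { "to": "2013.07" },
          { "from": "2013.07" }
        ]
      }
    }
  }
}'
```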

If the value of the format field looks familiar, it’s because it’s the same Joda-Time notation that you saw in chapter 3 when you defined date formats in the mapping. For the complete syntax, you can look at the DateTimeFormat documentation: http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html.

7.3.3. Histogram aggregations

For dealing with numeric ranges, you also have histogram aggregations. These are much like the range aggregations you just saw, but instead of manually defining each range, you’d define a fixed interval, and Elasticsearch would build the ranges for you. For example, if you want age groups from people documents, you can define an interval of 10 (years) and you’ll get buckets like 0–10 (excluding 10), 10–20 (excluding 20), and so on.

Like the range aggregation, the histogram aggregation has a variant that works with dates, called the date_histogram aggregation. This is useful, for example, when building histogram charts of how many emails were sent on a mailing list each day.

Histogram aggregation

Running a histogram aggregation is similar to running a range aggregation. You just replace the ranges array with an interval, and Elasticsearch builds ranges starting with the minimum value and adding the interval until the maximum value is included. For example, in the following listing, you specify an interval of 1 to show how many events have three attendees, how many have four, and how many have five.

Listing 7.13. Histogram showing the number of events for each number of attendees
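A sketch of such a request; as before, it assumes the attendee count comes from the attendees field, and the aggregation name is illustrative:

```shell
URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "attendees_histogram": {
      "histogram": {
        "field": "attendees",
        "interval": 1
      }
    }
  }
}'
```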

Like the terms aggregation, the histogram aggregation lets you specify a min_doc_count value, which is helpful if you want buckets with few documents to be ignored. min_doc_count is also useful if you want to show empty buckets. By default, if there’s an interval between the minimum and maximum values that has no documents, that interval is omitted altogether. Set min_doc_count to 0 and those intervals will still appear with a document count of 0.

Date histogram aggregation

As you might expect, you’d use the date_histogram aggregation like the histogram one, but you’d put a time interval in the interval field, with values such as 1M or 1.5h. For example, the following listing gives the breakdown of events happening in each month.

Listing 7.14. Histogram of events per month
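A sketch of such a request on the events’ date field; the aggregation name is illustrative, and 1M buckets the events by month:

```shell
URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "event_dates": {
      "date_histogram": {
        "field": "date",
        "interval": "1M"
      }
    }
  }
}'
```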

Like the regular histogram aggregation, you can use the min_doc_count option to either show empty buckets or omit buckets containing just a few documents.

You probably noticed that the date_histogram aggregation has two things in common with all the other multi-bucket aggregations:

It counts documents having certain terms.

It creates buckets of documents falling into each category.

The buckets themselves become even more useful when you nest other aggregations under a multi-bucket aggregation. This allows you to have deeper insights into your data, and we’ll look at nesting aggregations in the next section. First, take time to look at table 7.2, which gives you a quick overview of the multi-bucket aggregations and what they’re typically used for.

Table 7.2. Multi-bucket aggregations and typical use cases

Aggregation type Example use case
terms Show top tags on a blogging site; hot topics this week on a news site.
significant_terms Identify new technology trends by looking at what’s used/downloaded a lot this month compared to overall.
range and date_range Show entry-level, medium-priced, and expensive laptops. Show archived events, events this week, upcoming events.
histogram and date_histogram Show distributions: how much people of each age exercise. Or show trends: items bought each day.

The list isn’t exhaustive, but it does include the most important aggregation types and their options. You can check the documentation[1] for a complete list. Also, geo aggregations are dealt with in appendix A, and nested and children aggregations in chapter 8.

1 www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations.html