Chapter 7. Exploring your data with aggregations

This chapter covers

Metrics aggregations

Single and multi-bucket aggregations

Nesting aggregations

Relations among queries, filters, and aggregations

So far in this book, we’ve concentrated on the use case of indexing and searching: you have many documents and the user wants to find the most relevant matches to some keywords. There are more and more use cases where users aren’t interested in specific results. Instead, they want to get statistics from a set of documents. These statistics might be hot topics for news, revenue trends for different products, the number of unique visitors to your website, and much more.

Aggregations in Elasticsearch solve this problem by loading the documents matching your search and running computations on them, such as counting the terms of a string field or calculating the average of a numeric field. To look at how aggregations work, we’ll use an example from the get-together site you’ve worked with in previous chapters: a user entering your site may not know what groups to look for. To give the user something to start with, you could make the UI show the most popular tags for existing groups of your get-together site, as illustrated in figure 7.1.

Figure 7.1. Example use case of aggregations: top tags for get-together groups

Those tags would be stored in a separate field of your group documents. The user could then select a tag and filter down to only documents containing that tag. This makes it easier for users to find groups relevant to their interests.

To get such a list of popular tags in Elasticsearch, you’d use aggregations. In this specific case, you’d use the terms aggregation on the tags field, which counts occurrences of each term in that field and returns the most frequent terms. Many other types of aggregations are also available, and we’ll discuss them later in this chapter. For example, you can use a date_histogram aggregation to show how many events happened in each month of the last year, use the avg aggregation to show the average number of attendees per event, or even use the significant_terms aggregation to find users whose taste in events is similar to yours.
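As a rough sketch of what such a request could look like, the following Python snippet builds the JSON body for a terms aggregation on the tags field. The aggregation name top_tags and the bucket limit of 10 are assumptions made for this illustration, and the snippet only prints the body; it doesn’t send anything to a cluster.

```python
import json

# Sketch of a search body that asks for the most frequent tags.
# "top_tags" is an arbitrary label chosen for this example; the
# aggregation results would come back under that same name.
terms_request = {
    "size": 0,  # no search hits needed, only the aggregation results
    "aggs": {
        "top_tags": {
            "terms": {
                "field": "tags",  # the field holding each group's tags
                "size": 10,       # assumed limit: return the 10 top terms
            }
        }
    },
}

# Print the body as the JSON you would send to the _search endpoint.
print(json.dumps(terms_request, indent=2))
```

A body like this would go to the _search endpoint of the index holding your group documents, alongside or instead of a query.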

What about facets?

If you’ve used Lucene, Solr, or even Elasticsearch for some time, you might have heard about facets.

Facets are similar to aggregations, because they also load the documents matching your query and perform computations in order to return statistics. Facets are still supported in versions 1.x but are deprecated and will be removed in version 2.0.

The main difference between aggregations and facets is that you can’t nest multiple types of facets in Elasticsearch, which limits the possibilities for exploring your data. For example, if you had a blogging site, you could use the terms facet to find out the hot topics this year, or you could use the date histogram facet to find out how many articles are posted each day, but you couldn’t find the number of posts per day, separately for each topic (at least not in one request). You’d be able to do that if you could nest the date histogram facet under the terms facet.

Aggregations were born to remove this limit and allow you to get deeper insights from your documents. For example, if you store your online shop logs in Elasticsearch, you can use aggregations to find not only the best-selling products but also the best-selling products in each country, the trends for each product in each country, and so on.

In this chapter, we’ll first discuss the common traits of all aggregations: how you run them and how they relate to the queries and filters you learned in previous chapters. Then we’ll dive into the particularities of each type of aggregation, and in the end, we’ll show you how to combine different aggregation types.

Aggregations are divided into two main categories: metrics and bucket. Metrics aggregations refer to the statistical analysis of a group of documents, resulting in metrics such as the minimum value, maximum value, standard deviation, and much more. For example, you can get the average price of items from an online shop or the number of unique users logging on to it.
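To make the idea of a metric concrete, here is a small plain-Python sketch that computes the same kinds of numbers over made-up data. The prices and user names are invented for the illustration, and the code mimics the concept locally rather than calling Elasticsearch.

```python
import statistics

# Made-up item prices standing in for a numeric field of the
# documents that matched a search.
prices = [10.0, 20.0, 30.0, 40.0]

metrics = {
    "min": min(prices),
    "max": max(prices),
    "avg": statistics.mean(prices),
    # Population standard deviation of the sample prices.
    "std_deviation": statistics.pstdev(prices),
}

# Made-up login records; counting distinct values mirrors the
# "number of unique users" example.
user_ids = ["ann", "bob", "ann", "cy"]
unique_users = len(set(user_ids))

print(metrics)
print(unique_users)
```

Each metric reduces the whole set of matching documents to a single number, which is what distinguishes metrics aggregations from the bucket aggregations described next.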

Bucket aggregations divide matching documents into one or more containers (buckets) and then give you the number of documents in each bucket. The terms aggregation, which would give you the most popular tags in figure 7.1, makes a bucket of documents for each tag and gives you the document count for each bucket.
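The bucketing idea can be mimicked locally in a few lines of Python. The groups below are made up for this sketch; the code counts how many groups fall into the bucket for each tag, which is the document count a terms aggregation would report per bucket.

```python
from collections import Counter

# Made-up group documents, each with a list of tags.
groups = [
    {"name": "group-1", "tags": ["big data", "open source"]},
    {"name": "group-2", "tags": ["big data", "search"]},
    {"name": "group-3", "tags": ["big data", "open source"]},
]

# One bucket per tag; a group is counted once in each of its tags' buckets.
tag_counts = Counter(tag for group in groups for tag in set(group["tags"]))

# Keep only the two most popular tags, most frequent first.
top_tags = tag_counts.most_common(2)

print(top_tags)
```

Note that a group with several tags lands in several buckets, so the bucket counts can add up to more than the number of groups.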

Within a bucket aggregation, you can nest other aggregations, making the sub-aggregation run on each bucket of documents generated by the top-level aggregation. You can see an example in figure 7.2.

Figure 7.2. The terms bucket aggregation allows you to nest other aggregations within it.

Looking at the figure from the top down, you can see that if you’re using the terms aggregation to get the most popular group tags, you can also get the average number of members for groups matching each tag. You could also ask Elasticsearch to give you, per tag, the number of groups created per year.
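The sketch below illustrates this nesting in two ways. First it builds a request body with an avg aggregation nested under a terms aggregation; the names top_tags and avg_members and the numeric field member_count are hypothetical and chosen only for this example. Then it mimics the computation locally on made-up groups, including the per-tag count of groups created each year.

```python
from collections import defaultdict

# Hypothetical request shape: an avg sub-aggregation nested under terms.
# "member_count" is an assumed numeric field, not one defined in the text.
nested_request = {
    "size": 0,
    "aggs": {
        "top_tags": {
            "terms": {"field": "tags"},
            "aggs": {
                "avg_members": {"avg": {"field": "member_count"}}
            },
        }
    },
}

# Made-up groups used to mimic locally what the nesting computes.
groups = [
    {"tags": ["big data"], "member_count": 10, "created_year": 2012},
    {"tags": ["big data", "search"], "member_count": 30, "created_year": 2013},
    {"tags": ["search"], "member_count": 50, "created_year": 2013},
]

# Top-level step: put each group into the bucket of every tag it has.
buckets = defaultdict(list)
for group in groups:
    for tag in group["tags"]:
        buckets[tag].append(group)

# Nested step: run the sub-computations separately on each bucket.
summary = {}
for tag, members in buckets.items():
    per_year = defaultdict(int)
    for group in members:
        per_year[group["created_year"]] += 1
    summary[tag] = {
        "doc_count": len(members),
        "avg_members": sum(g["member_count"] for g in members) / len(members),
        "groups_per_year": dict(per_year),
    }

print(summary)
```

The key point is the order of operations: the bucket aggregation splits the documents first, and every nested aggregation then runs once per bucket.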

As you may imagine, you can combine many types of aggregations in many ways. To get a better view of the available options, we’ll go through metrics and bucket aggregations and then discuss how you can combine them. But first, let’s see what’s common to all types of aggregations: how to write them and how they relate to your queries.