7.2. Metrics aggregations

Metrics aggregations extract statistics from groups of documents, or, as we’ll explore in section 7.4, buckets of documents coming from other aggregations.

These statistics are typically calculated on numeric fields: for example, the minimum or average price. You can get each such statistic separately, or you can get them together via the stats aggregation. More advanced statistics, such as the sum of squares or the standard deviation, are available through the extended_stats aggregation.

For both numeric and non-numeric fields you can get the number of unique values using the cardinality aggregation, which will be discussed in section 7.2.3.

7.2.1. Statistics

We’ll begin looking at metrics aggregations by getting some statistics on the number of attendees for each event.

From the code samples, you can see that event documents contain an array of attendees. You can calculate the number of attendees at query time through a script, which we’ll show in listing 7.3. We discussed scripting in chapter 3, when you used scripts for updating documents. In general, with Elasticsearch queries you can build a script field, where you put a typically small piece of code that returns a value for each document. In this case, the value will be the count of elements of the attendees array.

The flexibility of scripts comes with a price

Scripts are flexible when it comes to querying, but you have to be aware of the caveats in terms of performance and security.

Even though most aggregation types allow you to use them, scripts slow down aggregations because they have to be run on every document. To avoid running a script at query time, you can do the calculation at index time instead: extract the number of attendees for every event and add it to a separate field before indexing. We'll talk more about performance in chapter 10.
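As a rough sketch of that index-time approach (the attendees_count field and the sample document here are made up for illustration; they aren't part of the chapter's code samples), the application doing the indexing adds the precomputed count, and the aggregation then uses the field directly instead of a script:

curl -XPUT 'localhost:9200/get-together/event/9999' -d '{
  "title": "An example event",
  "attendees": ["Ann", "Bob", "Cindy"],
  "attendees_count": 3
}'
curl 'localhost:9200/get-together/event/_search?pretty&search_type=count' -d '{
  "aggregations": {
    "attendees_stats": {
      "stats": {
        "field": "attendees_count"
      }
    }
  }
}'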

In most Elasticsearch deployments, the user specifies a query string, and it's up to the server-side application to construct the query out of it. But if you allow users to specify any kind of query, including scripts, someone might exploit this and run malicious code. That's why, depending on your Elasticsearch version, running scripts inline as in listing 7.3 (called dynamic scripting) may be disabled by default. To enable it, set script.disable_dynamic: false in elasticsearch.yml.
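If you want to run the scripted listings in this section as they are, a minimal sketch of that setting in elasticsearch.yml looks like the following; it has to be added on every node, and each node has to be restarted for the change to take effect:

script.disable_dynamic: false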

In the following listing, you’ll request statistics on the number of attendees for all events. To get the number of attendees in the script, you’ll use doc['attendees'].values to get the array of attendees. Adding the length property to that will return their number.

Listing 7.3. Getting stats for the number of event attendees
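A sketch of that request and reply follows. It uses the same curl pattern as listing 7.4; the aggregation name attendees_stats is an arbitrary choice, and the reply values assume the chapter's sample data set (they line up with the extended_stats output in listing 7.5):

URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "attendees_stats": {
      "stats": {
        "script": "doc['"'attendees'"'].values.length"
      }
    }
  }
}'

reply

  "aggregations" : {
    "attendees_stats" : {
      "count" : 15,
      "min" : 3.0,
      "max" : 5.0,
      "avg" : 3.8666666666666667,
      "sum" : 58.0
    }
  }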

You can see that you get back the minimum number of attendees per event, as well as the maximum, the sum, and the average. You also get the number of documents these statistics were computed on.

If you need only one of those statistics, you can get it separately. For example, you’ll calculate the average number of attendees per event through the avg aggregation in the next listing.

Listing 7.4. Getting the average number of event attendees

URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "attendees_avg": {
      "avg": {
        "script": "doc['"'attendees'"'].values.length"
      }
    }
  }
}'

reply

[...]
  "aggregations" : {
    "attendees_avg" : {
      "value" : 3.8666666666666667
    }
  }
}

Similar to the avg aggregation, you can get the other metrics through the min, max, sum, and value_count aggregations. You'd have to replace avg in listing 7.4 with the needed aggregation name. The advantage of separate statistics is that Elasticsearch won't spend time computing metrics that you don't need.
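For example, here's a sketch of the max variant; only the aggregation type (and, here, its name) changes compared to listing 7.4:

curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "attendees_max": {
      "max": {
        "script": "doc['"'attendees'"'].values.length"
      }
    }
  }
}'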

7.2.2. Advanced statistics

In addition to statistics gathered by the stats aggregation, you can get the sum of squares, variance, and standard deviation of your numeric field by running the extended_stats aggregation, as shown in the next listing.

Listing 7.5. Getting extended statistics on the number of attendees

URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "attendees_extended_stats": {
      "extended_stats": {
        "script": "doc['"'attendees'"'].values.length"
      }
    }
  }
}'

reply

  "aggregations" : {
    "attendees_extended_stats" : {
      "count" : 15,
      "min" : 3.0,
      "max" : 5.0,
      "avg" : 3.8666666666666667,
      "sum" : 58.0,
      "sum_of_squares" : 230.0,
      "variance" : 0.38222222222222135,
      "std_deviation" : 0.6182412330330462
    }
  }

All these statistics are calculated by looking at all the values in the document set matching the query, so they’re 100% accurate all the time. Next we’ll look at some statistics that use approximation algorithms, trading some of the accuracy for speed and less memory consumption.

7.2.3. Approximate statistics

Some statistics can be calculated with good precision—though not 100%—by looking at some of the values from your documents. This will limit both their execution time and their memory consumption.

Here we’ll look at how to get two types of such statistics from Elasticsearch: percentiles and cardinality. Percentiles are values below which you can find x% of the total values, where x is the given percentile. This is useful, for example, when you have an online shop: you log the value of each shopping cart and you want to see in which price range most shopping carts fall. Perhaps most of your users only buy an item or two, but the upper 10% buy a lot of items and generate most of your revenue.

Cardinality is the number of unique values in a field. This is useful, for example, when you want the number of unique IP addresses accessing your website.

Percentiles

For percentiles, think about the number of attendees for events once again and determine the maximum number of attendees you’ll consider normal and the number you’ll consider high. In listing 7.6, you’ll calculate the 80th and the 99th percentiles. You’ll consider numbers under the 80th percentile to be normal and numbers under the 99th percentile high, and you’ll ignore the upper 1%, because they’re exceptionally high.

To accomplish this, you’ll use the percentiles aggregation, and you’ll set the percents array to 80 and 99 in order to get these specific percentiles.

Listing 7.6. Getting the 80th and the 99th percentiles from the number of attendees
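A sketch of that request follows, using the same pattern as the other listings; the aggregation name attendees_percentiles is an arbitrary choice. The reply contains one value for each requested percentile:

URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "attendees_percentiles": {
      "percentiles": {
        "script": "doc['"'attendees'"'].values.length",
        "percents": [80, 99]
      }
    }
  }
}'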

For small data sets like the code samples, you have 100% accuracy, but this may not happen with large data sets in production. With the default settings, you have over 99.9% accuracy for most data sets for most percentiles. The specific percentile matters, because accuracy is at its worst for the 50th percentile, and as you go toward 0 or 100 it gets better and better.

You can trade memory for accuracy by increasing the compression parameter from the default 100. Memory consumption increases proportionally to the compression, which in turn controls how many values are taken into account when approximating percentiles.
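For example, in the Elasticsearch version this chapter targets, compression can be passed directly in the percentiles body. Only that part of the request changes; the value of 200 here is just an illustration, and the shell quoting around attendees is omitted in this JSON fragment:

"percentiles": {
  "script": "doc['attendees'].values.length",
  "percents": [80, 99],
  "compression": 200
}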

There’s also a percentile_ranks aggregation that allows you to do the opposite—specify a set of values—and you’ll get back the corresponding percentage of documents having up to those values:

% curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "attendees_percentile_ranks": {
      "percentile_ranks": {
        "script": "doc['"'attendees'"'].values.length",
        "values": [4, 5]
      }
    }
  }
}'

Cardinality

For cardinality, let’s imagine you want the number of unique members of your get-together site. The following listing shows how to do that with the cardinality aggregation.

Listing 7.7. Getting the number of unique members through the cardinality aggregation

URI=localhost:9200/get-together/group/_search
curl "$URI?pretty&search_type=count" -d '{
  "aggregations": {
    "members_cardinality": {
      "cardinality": {
        "field": "members"
      }
    }
  }
}'

reply

  "aggregations" : {
    "members_cardinality" : {
      "value" : 8
    }
  }

Like the percentiles aggregation, the cardinality aggregation is approximate. To understand the benefit of such approximation algorithms, let’s take a closer look at the alternative. Before the cardinality aggregation was introduced in version 1.1.0, the common way to get the cardinality of a field was to run the terms aggregation you saw in section 7.1. Because the terms aggregation returns counts for the top N terms, where N is the configurable size parameter, specifying a large enough size gets you all the unique terms back; counting them gives you the cardinality.

Unfortunately, this approach only works for fields with relatively low cardinality and a low number of documents. Otherwise, running a terms aggregation with a huge size requires a lot of resources:

Memory— All the unique terms need to be loaded in memory in order to be counted.

CPU— Those terms have to be returned in order; by default they’re ordered by how many times each term occurs.

Network— From each shard, the large array of sorted unique terms has to be transferred to the node that received the client request. That node also has to merge per-shard arrays into one big array and transfer it back to the client.
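For reference, here’s a sketch of that terms-based workaround on the members field; the size of 1000 is arbitrary and would have to exceed the number of unique members. Counting the returned buckets gives you the cardinality:

curl 'localhost:9200/get-together/group/_search?pretty&search_type=count' -d '{
  "aggregations": {
    "members_terms": {
      "terms": {
        "field": "members",
        "size": 1000
      }
    }
  }
}'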

This is where approximation algorithms come into play. The cardinality aggregation works with an algorithm called HyperLogLog++ that hashes values from the field you want to examine and uses the hashes to approximate the cardinality. It loads only some of those hashes into memory at once, so the memory usage will be constant no matter how many terms you have.

Note

For more details on the HyperLogLog++ algorithm, have a look at the original paper from Google:

http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archi

Memory and cardinality

We said that the memory usage of the cardinality aggregation is constant, but how large would that constant be? You can configure it through the precision_threshold parameter. The higher the threshold, the more precise the results, but more memory is consumed. If you run the cardinality aggregation on its own, it will take about precision_threshold times 8 bytes of memory for each shard that gets hit by the query.
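For example, here’s a sketch of the aggregation from listing 7.7 with an explicit threshold; the value of 1000 is just an illustration, and by the rule of thumb above it budgets roughly 8KB per shard:

"members_cardinality": {
  "cardinality": {
    "field": "members",
    "precision_threshold": 1000
  }
}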

The cardinality aggregation, like all other aggregations, can be nested under a bucket aggregation. When that happens, the memory usage is further multiplied by the number of buckets generated by the parent aggregations.
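For example, here’s a sketch of the cardinality aggregation from listing 7.7 nested under a terms aggregation (nesting is covered in section 7.4). The organizer field is assumed to exist on group documents; each resulting organizer bucket then gets its own cardinality count:

curl 'localhost:9200/get-together/group/_search?pretty&search_type=count' -d '{
  "aggregations": {
    "groups_by_organizer": {
      "terms": {
        "field": "organizer"
      },
      "aggregations": {
        "members_cardinality": {
          "cardinality": {
            "field": "members"
          }
        }
      }
    }
  }
}'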

Tip

For most cases, the default precision_threshold will work well, because it provides a good tradeoff between memory usage and accuracy, and it adjusts itself depending on the number of buckets.

Next, we’ll look at the choice of multi-bucket aggregations. But before we go there, table 7.1 gives you a quick overview of each metrics aggregation and the typical use case.

Table 7.1. Metrics aggregations and typical use cases

Aggregation type | Example use case
stats | Same product sold in multiple stores. Gather statistics on the price: how many stores have it and what the minimum, maximum, and average prices are.
individual stats (min, max, sum, avg, value_count) | Same product sold in multiple stores. Show “prices starting from” and then the minimum price.
extended_stats | Documents contain results from a personality test. Gather statistics from that group of people, such as the variance and the standard deviation.
percentiles | Access times on your website: what the usual delays are and how long the longest response times are.
percentile_ranks | Checking whether you meet SLAs: if 99% of requests have to be served under 100ms, you can check what the actual percentage is.
cardinality | Number of unique IP addresses accessing your service.