6.6. Custom scoring with function_score

Finally, we come to one of the coolest queries that Elasticsearch has to offer: function_score. The function_score query allows you to take control over the relevancy of your results in a fine-grained manner by specifying any number of arbitrary functions to be applied to the score of the documents matching an initial query.

Each function in this case is a small snippet of JSON that influences the score in some way. Sound confusing? Well, we’ll clear it up by the end of this section. We’ll start with the basic structure of the function_score query; the next listing shows an example that doesn’t perform any fancy scoring.

Listing 6.11. Function_score query basic structure

Simple enough—it looks just like a regular match query inside a function_score wrapper. There’s a new key, functions, that’s currently empty, but don’t worry about that yet; you’ll put things into that array in just a second. This listing is intended to show that the results of this query are going to be the documents that the function_score functions operate on. For example, if you have 30 total documents in the index and the match query for “elasticsearch” in the description field matches 25 of them, the functions inside the array will be applied to those 25 documents.
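As a rough sketch of the structure just described, the request body can be built as a plain Python dictionary and serialized to JSON. The helper name basic_function_score is hypothetical; the description field and the “elasticsearch” search term come from the example above.

```python
import json


def basic_function_score(field, text):
    """Wrap a match query in a function_score query with no functions yet."""
    return {
        "query": {
            "function_score": {
                # The inner query selects the documents that will be scored.
                "query": {"match": {field: text}},
                # Empty for now; scoring functions are added to this list.
                "functions": [],
            }
        }
    }


body = basic_function_score("description", "elasticsearch")
print(json.dumps(body, indent=2))
```

Only the documents matched by the inner query are ever passed to the entries in the functions list.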

The function_score query has a number of different functions, and in addition to the original query, each function can take another filter element. You’ll see examples of this as we go into the details about each function in the next sections.

6.6.1. weight

The weight function is the simplest of the bunch; it multiplies the score by a constant number. Note that unlike a regular boost field, which increases the score by a value that gets normalized, weight really does multiply the score by the value.

The previous example already matches all of the documents that have “elasticsearch” in the description, so the next listing also boosts the documents that contain “hadoop” in the description.

Listing 6.12. Using weight function to boost documents containing “hadoop”

The only change to the example was adding the following snippet to the functions array:

{
  "weight": 1.5,
  "filter": {"term": {"description": "hadoop"}}
}

This means that documents that match the term query for “hadoop” in the description will have their score multiplied by 1.5.
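To make the effect concrete, here is a minimal sketch that mimics this one function in plain Python rather than calling Elasticsearch. The helper apply_weight and the sample scores are made up for illustration, and the term filter is approximated by a simple word check.

```python
def apply_weight(query_score, description, term="hadoop", weight=1.5):
    """Multiply the score by `weight` only when the filter term is present."""
    # Crude stand-in for a term filter: lowercase and split on whitespace.
    matches_filter = term in description.lower().split()
    return query_score * weight if matches_filter else query_score


print(apply_weight(2.0, "Elasticsearch and Hadoop meetup"))  # 3.0
print(apply_weight(2.0, "Elasticsearch meetup"))             # 2.0
```

Documents that do not match the filter keep the score they received from the original query.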

You can have as many of these as you’d like. For example, to also increase the score of get-together groups that mention “logstash,” you could specify two different weight functions, as in the following listing.

Listing 6.13. Specifying two weight functions

6.6.2. Combining scores

Let’s talk about how these scores get combined. There are two different factors we need to discuss when talking about scores:

How the scores from each of the individual functions should be combined, called the score_mode

How the score of the functions should be combined with the original query score (searching for “elasticsearch” in the description in our example), known as boost_mode

The first factor, known as the score_mode parameter, deals with how each of the different functions’ scores are combined. In the previous cURL request you have two functions: one with a weight of 2, the other with a weight of 3. You can set the score_mode parameter to multiply, sum, avg, first, max, or min. If not specified, the scores from each function will be multiplied together.

If first is specified, only the first function with a matching filter will have its score taken into account. For example, if you set score_mode to first and had a document with both “hadoop” and “logstash” in the description, only a boost factor of 2 would be applied, because that’s the first function that matches the document.

The second score-combining setting, known as boost_mode, controls how the score of the original query is combined with the scores of the functions themselves. If not specified, the new score will be the original query score and the combined function’s score multiplied together. You can change this to sum, avg, max, min, or replace. Setting this to replace means that the original query’s score is replaced by the score of the functions.
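The two settings can be mimicked with a few lines of Python. This is a sketch of the arithmetic only, not Elasticsearch code: the function name combine is hypothetical, the query score of 0.5 is invented, and the avg score_mode is left out to keep the example short. It also assumes that a document matching none of the functions gets a neutral function score of 1.

```python
import math


def combine(query_score, function_scores, score_mode="multiply", boost_mode="multiply"):
    """Combine the matching function scores, then merge with the query score."""
    if not function_scores:
        # Assumption: no matching function means a neutral function score.
        combined = 1.0
    elif score_mode == "multiply":
        combined = math.prod(function_scores)
    elif score_mode == "sum":
        combined = sum(function_scores)
    elif score_mode == "first":
        combined = function_scores[0]
    elif score_mode == "max":
        combined = max(function_scores)
    elif score_mode == "min":
        combined = min(function_scores)
    else:
        raise ValueError(f"unsupported score_mode: {score_mode}")

    if boost_mode == "multiply":
        return query_score * combined
    if boost_mode == "sum":
        return query_score + combined
    if boost_mode == "avg":
        return (query_score + combined) / 2
    if boost_mode == "max":
        return max(query_score, combined)
    if boost_mode == "min":
        return min(query_score, combined)
    if boost_mode == "replace":
        return combined
    raise ValueError(f"unsupported boost_mode: {boost_mode}")


# A document matching both weight functions (2 and 3), query score 0.5.
print(combine(0.5, [2, 3]))                       # defaults: 0.5 * (2 * 3) = 3.0
print(combine(0.5, [2, 3], score_mode="first"))   # 0.5 * 2 = 1.0
print(combine(0.5, [2, 3], "sum", "replace"))     # 2 + 3 = 5
```

With the defaults, both steps multiply, so a document matching both functions ends up with six times its original query score.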

Armed with these settings, you can tackle the remaining function_score functions, starting with one that modifies the score based on a field’s value. The functions we’ll cover are field_value_factor, script_score, and random_score, as well as the three decay functions: linear, gauss, and exp. We’ll start with the field_value_factor function.

6.6.3. field_value_factor

Modifying the score based on other queries is quite useful, but a lot of people want to use the data inside their documents to influence the score of a document. In this example, you might want to use the number of reviews an event has received to increase the score for that event; this is possible to do by using the field_value_factor function inside a function_score query.

The field_value_factor function takes the name of a field containing a numeric value, optionally multiplies that value by a constant number, and then finally applies a math function such as taking the logarithm of the value. Look at the example in the next listing.

Listing 6.14. Using field_value_factor inside a function_score query

The score that comes out of the field_value_factor function here will be

ln(2.5 * doc['reviews'].value)

For a document with a value of 7 in the reviews field, the score would be

ln(2.5 * 7) -> ln(17.5) -> 2.86
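The same arithmetic can be checked in Python. In this sketch the helper field_value_factor_ln is a hypothetical name; the dictionary shows how the function entry for this calculation might look, using the field, factor, and modifier parameters with the values from the example.

```python
import math


def field_value_factor_ln(field_value, factor=1.0):
    """Multiply the field value by `factor`, then take the natural log."""
    return math.log(factor * field_value)


# Function entry matching the formula above: ln(2.5 * doc['reviews'].value)
function = {
    "field_value_factor": {"field": "reviews", "factor": 2.5, "modifier": "ln"}
}

score = field_value_factor_ln(7, factor=2.5)
print(round(score, 2))  # 2.86
```

Because the logarithm of a number between 0 and 1 is negative, and the logarithm of 0 is undefined, a modifier like ln needs some care on fields that can hold very small values.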

Besides ln there are other modifiers: none (default), log, log1p, log2p, ln1p, ln2p, square, sqrt, and reciprocal. One more thing to remember when using field_value_factor: it loads all the values of whichever field you’ve specified into memory, so the scores can be calculated quickly; this is part of the field data, which we’ll discuss in section 6.10. But before we talk about that, we’ll cover another function, which can give you finer-grained control over influencing the score by specifying a custom script.

6.6.4. Script

Script scoring gives you complete control over how to change the score. You can perform any sort of scoring inside a script.

As a brief refresher, scripts are written in the Groovy language, and you can access the original score of the document by using _score inside a script. You can access the values of a document using doc['fieldname']. An example of scoring using a slightly more complex script is shown in the next listing.

Listing 6.15. Scoring using a complex script

In this example, you’re using the size of the attendee list to influence the score by multiplying it by a weight and taking the logarithm of it.
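The calculation such a script performs can be mimicked in Python. This is only a sketch: the attendee names, the weight of 3, and the helper name attendee_score are invented for illustration, and the natural logarithm is assumed.

```python
import math


def attendee_score(attendees, weight):
    """Natural log of (number of attendees * weight)."""
    return math.log(len(attendees) * weight)


score = attendee_score(["ana", "bo", "cy"], weight=3)
print(round(score, 3))  # ln(9), about 2.197
```

An empty attendee list would make this fail, because the logarithm of 0 is undefined; a real script needs to decide what score such documents should get.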

Scripting is extremely powerful because you can do anything you’d like inside it, but keep in mind that scripts will be much slower than regular scoring because they must be executed dynamically for each document that matches your query. When using the parameterized script as in listing 6.15, caching the script helps performance.

6.6.5. random

The random_score function gives you the ability to assign random scores to your documents. The advantage of being able to sort documents randomly is the ability to introduce a bit of variation into the first page of results. When searching for get-togethers, sometimes it is nice to not always see the same result at the top.

You can also optionally specify a seed, a number passed with the query that is used to generate the random scores. With the same seed, the documents are still sorted in a random manner, but they come back in the same order if the same request is performed again. That’s the only option it supports, so that makes this a simple function.
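The effect of a seed can be illustrated with Python’s own random module. This sketch does not reproduce how Elasticsearch generates its random scores; the helper random_order, the document IDs, and the seed value 1234 are made up. The dictionary shows how a function entry with a seed might look.

```python
import random


def random_order(doc_ids, seed=None):
    """Give every document a random score and sort by it, highest first."""
    rng = random.Random(seed)  # the same seed yields the same score sequence
    scored = [(rng.random(), doc_id) for doc_id in doc_ids]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Function entry with a fixed seed.
function = {"random_score": {"seed": 1234}}

docs = ["group-1", "group-2", "group-3", "group-4"]
first = random_order(docs, seed=1234)
second = random_order(docs, seed=1234)
print(first == second)  # True: same seed, same order
```

Changing the seed, for example once per user session, gives each session its own stable ordering.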

The next listing shows an example of using it to sort get-togethers randomly.

Listing 6.16. Using random_score function to sort documents randomly

Don’t worry if this doesn’t seem useful yet. Once we’ve covered all of the different functions, we’ll come up with an example that ties them all together at the end of this section. Before we do that, though, there’s one more set of functions we need to discuss: decay functions.

6.6.6. Decay functions

The last set of functions for function_score is the decay functions. They allow you to apply a gradual decay in the score of a document based on some field. There are a number of ways this can be useful. For example, you may want to make get-togethers that occurred more recently have a higher score, with the score gradually tapering off as the get-togethers get older. Another example is with geolocation data; using the decay functions, you can increase the score of results that are closer to a geo point (a user’s location, for example) and decrease the score the farther the group is from the point.

There are three types of decay functions: linear, gauss, and exp. Each decay function follows the same sort of syntax:

{
  "TYPE": {
    "FIELD_NAME": {
      "origin": "...",
      "offset": "...",
      "scale": "...",
      "decay": "..."
    }
  }
}

The TYPE can be one of the three types, and FIELD_NAME is the field the decay is calculated on. Each of the types corresponds to a differently shaped curve, shown in figures 6.4, 6.5, and 6.6.

Figure 6.4. Linear curve—scores decrease from the origin at the same rate.

Figure 6.5. Gauss curve—scores decrease more slowly until the scale point is reached and then they decrease faster.

Figure 6.6. Exponential curve—scores drastically drop from the origin.

6.6.7. Configuration options

The configuration options define what the curve will look like; there are four configuration options for each of the three decay curves:

The origin is the center point of the curve, so it’s the point where you’d like the score to be the highest. In the geo-distance example, the origin is most likely a person’s current location. In other situations the origin can also be a date or a numeric value.

The offset is the distance away from the originating point, before the score starts to be reduced. In our example, if the offset is set to 1km, it means the score will not be reduced for points within one kilometer from the origin point. It defaults to 0, meaning that scores immediately start to decay as the numeric value moves away from the origin.

The scale and decay options go hand in hand; by setting them, you can say that at the scale value for a field, the score should be reduced to the decay. Sound confusing? It’s much simpler to think of it with actual values. If you set the scale to 5km and the decay to 0.25, it’s the same as saying “at 5 kilometers from my origin point, the score should be 0.25 times the score at the origin.”
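How the four options interact can be sketched in Python. The formulas below are the ones given in the Elasticsearch reference documentation for the three curves, as far as this sketch assumes; the helper name decay_score is hypothetical, and distances are plain numbers (kilometers here) rather than geo points.

```python
import math


def decay_score(kind, distance, scale, decay=0.5, offset=0.0):
    """Score multiplier for a value `distance` away from the origin."""
    # Inside the offset nothing is reduced; beyond it, only the excess counts.
    d = max(0.0, abs(distance) - offset)
    if kind == "linear":
        s = scale / (1.0 - decay)
        return max((s - d) / s, 0.0)
    if kind == "exp":
        return math.exp(math.log(decay) / scale * d)
    if kind == "gauss":
        sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
        return math.exp(-d ** 2 / (2.0 * sigma_sq))
    raise ValueError(f"unknown decay type: {kind}")


# scale 5 km, decay 0.25: every curve gives 0.25 at 5 km from the origin.
for kind in ("linear", "exp", "gauss"):
    print(kind, round(decay_score(kind, 5.0, scale=5.0, decay=0.25), 4))
```

The three curves agree at the origin and at the scale distance; they differ only in how they get from one to the other and in how they behave beyond it.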

The next listing shows an example of Gaussian decay with the get-together data.

Listing 6.17. Using Gaussian decay on the geo point location

Let’s look at what’s going on in this listing:

You use a match_all query, which will return all results.

Then you score each result using a Gaussian decay on the score.

The origin point is set in Boulder, Colorado, so the results that come back have the get-togethers in Boulder scored the highest, then results in Denver (a city near Boulder), and so on, as the different get-togethers get farther and farther away from the point of origin.
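A request body along these lines could be assembled as in the following sketch. Everything specific in it is an assumption made for illustration: the field name location, the approximate coordinates for Boulder, and the offset, scale, and decay values.

```python
import json


def gauss_geo_query(field, origin, offset, scale, decay):
    """match_all query scored by a Gaussian decay on a geo point field."""
    return {
        "query": {
            "function_score": {
                # Return every document; the decay function does the ranking.
                "query": {"match_all": {}},
                "functions": [
                    {
                        "gauss": {
                            field: {
                                "origin": origin,
                                "offset": offset,
                                "scale": scale,
                                "decay": decay,
                            }
                        }
                    }
                ],
            }
        }
    }


# Hypothetical field name and illustrative values.
body = gauss_geo_query(
    field="location",
    origin="40.018528,-105.275806",  # roughly Boulder, Colorado
    offset="100m",
    scale="2km",
    decay=0.5,
)
print(json.dumps(body, indent=2))
```

With these values, get-togethers within 100 meters of the origin keep their full score, and those about 2 kilometers beyond that are scored at half.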