6.2. Other scoring methods

Although the practical scoring model from the previous section, a combination of TF-IDF and the vector space model, is arguably the most popular scoring mechanism for Elasticsearch and Lucene, that doesn’t mean it’s the only model. From now on we’ll call the default scoring model TF-IDF, though we mean the practical scoring model based on TF-IDF. Other models include the following:

Okapi BM25

Divergence from randomness, or DFR similarity

Information based, or IB similarity

LM Dirichlet similarity

LM Jelinek Mercer similarity

We’ll briefly cover one of the most popular alternative options here (BM25) and how to configure Elasticsearch to use it. When we talk about scoring methods, we’re talking about changing the similarity module inside Elasticsearch.

Before we look at BM25 itself (a probabilistic scoring framework), let’s see how to configure Elasticsearch to use it. There are two ways to specify the similarity for a field. The first is to set the similarity parameter in the field’s mapping, as shown in the following listing.

Listing 6.1. Changing the similarity parameter in a field’s mapping
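A minimal sketch of what such an index-creation request body might look like. The type name mytype and the field name title are hypothetical, and the string field type assumes the mapping syntax of the Elasticsearch releases this chapter targets; newer releases use different field types and no longer nest fields under a mapping type.

```json
{
  "mappings": {
    "mytype": {
      "properties": {
        "title": {
          "type": "string",
          "similarity": "BM25"
        }
      }
    }
  }
}
```

Fields that don’t set the similarity parameter keep using the default scoring model.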

The second way to configure Elasticsearch to use an alternate scoring method builds on the first. You define the similarity under a name in the index settings, much as you would define a custom analyzer, and then reference that name in the mapping for a field. This approach lets you configure the settings of the similarity algorithm itself. The next listing shows an example of configuring advanced settings for the BM25 similarity and using that scoring algorithm for a field in the mappings.

Listing 6.2. Configuring advanced settings for BM25 similarity
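A sketch of such a request body, under the same assumptions as before: the similarity name my_custom_similarity, the type name mytype, and the field name title are all hypothetical, the parameter values are only illustrative, and the string field type assumes the mapping syntax of the releases this chapter targets.

```json
{
  "settings": {
    "index": {
      "similarity": {
        "my_custom_similarity": {
          "type": "BM25",
          "k1": 1.2,
          "b": 0.75,
          "discount_overlaps": false
        }
      }
    }
  },
  "mappings": {
    "mytype": {
      "properties": {
        "title": {
          "type": "string",
          "similarity": "my_custom_similarity"
        }
      }
    }
  }
}
```

The field refers to the similarity by the name it was given in the settings, not by the algorithm’s type, so several differently tuned BM25 variants can coexist in one index.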

Additionally, if you’ve decided you want to always use a particular scoring method, you can configure it globally by adding the following setting to your elasticsearch.yml configuration file:

index.similarity.default.type: BM25

Great! Now that you’ve seen how to specify an alternative similarity, let’s talk about this alternate similarity and how it differs from TF-IDF.

6.2.1. Okapi BM25

Okapi BM25 is probably the second most popular scoring method behind TF-IDF for Lucene. It’s a probabilistic relevance algorithm, which means the score is derived from an estimate of how likely a given document is to be relevant to the query (the score itself isn’t a probability between 0 and 1). BM25 is also reputed to work better for shorter fields, though you should always test to make sure that holds for your dataset! BM25 maps each document into an array of values, one for each term in the dictionary, and uses a probabilistic model to determine the document’s ranking.

While discussing the full scoring formula for BM25 is beyond the scope of this book, you can read more about how BM25 is implemented in Lucene at http://arxiv.org/pdf/0911.5046.pdf.

BM25 has three main settings—k1, b, and discount_overlaps:

k1 and b are numeric settings used to tweak how the scoring is calculated.

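k1 controls how important term frequency is to the score (how often the term occurs in the document, or TF from earlier in this chapter). The higher the value, the longer repeated occurrences of a term keep raising the score before their effect levels off.

b is a number between 0 and 1 that controls how much the length of the document affects the score. A value of 0 turns length normalization off, and a value of 1 applies it fully.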

k1 is set to 1.2 and b is set to 0.75 by default.

The discount_overlaps setting tells Elasticsearch whether multiple tokens occurring at the same position within a field (synonyms, for example) should count toward the field length used for normalization. It defaults to true, meaning those overlapping tokens are ignored when the length is computed.
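Without going into the full derivation, it can help to see where k1 and b sit in the scoring function. The following is the commonly cited textbook form of BM25, offered as a sketch rather than as a term-for-term description of what Lucene computes. The symbols are introduced here for illustration: f(t, d) is the number of times term t occurs in document d, |d| is the length of the document, and avgdl is the average document length in the index.

```latex
\[
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot
  \frac{f(t, d)\,(k_1 + 1)}
       {f(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}
\]

% Term-frequency factor for a document of average length (|d| = avgdl),
% using the defaults k_1 = 1.2 and b = 0.75:
\[
f = 1:\ \frac{1 \cdot 2.2}{1 + 1.2} = 1.0, \qquad
f = 2:\ \frac{2 \cdot 2.2}{2 + 1.2} \approx 1.38, \qquad
f = 10:\ \frac{10 \cdot 2.2}{10 + 1.2} \approx 1.96
\]
```

In this form the term-frequency factor can never exceed k1 + 1 (2.2 with the defaults), which is why additional occurrences of a term matter less and less. Setting b to 0 removes the document length from the denominator altogether, and setting it to 1 scales the penalty in direct proportion to how much longer than average the document is.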

Testing your scoring

Keep in mind that if you do tweak these settings, you need to be sure to have a good testing infrastructure with which to judge changes in the ranking and scoring of your documents. It makes no sense at all to change relevancy algorithm settings without a way to evaluate your changes in a reproducible manner; anything less is just guessing!

Now that you’ve seen the default TF-IDF scoring formula as well as an alternative, BM25, let’s talk about how you can influence the scoring of documents in a more fine-grained manner, with boosting.