6.1. How scoring works in Elasticsearch

It’s natural to first think of documents matching queries in a binary sense: either “Yes, it matches” or “No, it doesn’t match.” It’s more useful, though, to think about matching in terms of relevancy, so that you can say document A is a better match for a query than document B. For example, when you use your favorite search engine to search for “elasticsearch,” it’s not enough to say that a particular page contains the term and therefore matches; you want the results ranked so that the most relevant ones come first.

The process of determining how relevant a document is to a query is called scoring, and although it isn’t necessary to understand exactly how Elasticsearch calculates the score of a document in order to use Elasticsearch, it’s quite useful.

6.1.1. How scoring documents works

Scoring in Lucene (and by extension, Elasticsearch) is a formula that takes the document in question and uses a few different pieces to determine the score for that document. We’ll first cover each piece and then combine them in the formula to better explain the overall scoring. As we mentioned previously, we want documents that are more relevant to be returned first, and in Lucene and Elasticsearch this relevancy is called the score.

To begin calculating the score, Elasticsearch looks at two things: how often the term being searched for occurs in a document and how common the term is across the index. The short explanation is that the more times a term occurs in a document, the more relevant that document is. But the more documents the term appears in, the less weight that term carries. This is called TF-IDF (TF = term frequency, IDF = inverse document frequency), and we’ll talk about each of these types of frequency in more detail now.

6.1.2. Term frequency

The first way to think of scoring a document is to look at how often a term occurs in the text. For example, if you were searching for get-togethers in your area that are about Elasticsearch, you would want the groups that mention Elasticsearch more frequently to show up first. Consider the following text snippets, shown in figure 6.1.

Figure 6.1. Term frequency is how many times a term appears in a document.

The first sentence mentions Elasticsearch a single time, and the second mentions Elasticsearch twice, so a document containing the second sentence should have a higher score than a document containing the first. If we were to speak in absolute numbers, the first sentence would have a term frequency (TF) of 1, and the second sentence would have a term frequency of 2.
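To make the counting concrete, here is a minimal Python sketch of term frequency. The two snippets are hypothetical stand-ins for the sentences in figure 6.1, and the simple lowercase word tokenizer is an assumption for illustration; it is not how Elasticsearch analyzes text.

```python
import re
from collections import Counter

def term_frequency(term, text):
    """Count how many times `term` occurs in `text`, ignoring case."""
    tokens = re.findall(r"\w+", text.lower())  # naive tokenizer (assumption)
    return Counter(tokens)[term.lower()]

# Hypothetical snippets standing in for the sentences in figure 6.1
first = "We will discuss Elasticsearch at the next Big Data group."
second = ("Tuesday the Elasticsearch team will gather to answer "
          "questions about Elasticsearch.")

print(term_frequency("elasticsearch", first))   # 1
print(term_frequency("elasticsearch", second))  # 2
```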

6.1.3. Inverse document frequency

Slightly more complicated than the term frequency for a document is the inverse document frequency (IDF). What this fancy-sounding description means is that a token (usually a word, but not always) is less important the more times it occurs across all of the documents in the index. This is easiest to explain with a few examples. Consider the three documents shown in figure 6.2.

Figure 6.2. Inverse document frequency checks to see if a term occurs in a document, not how often it occurs.

In the three documents in the figure, note the following:

The term “Elasticsearch” has a document frequency of 2 (because it occurs in two documents). The inverse part of the document frequency comes from the score being multiplied by 1/DF, where DF is the document frequency of the term. This means that the higher a term’s document frequency, the lower its weight.

The term “the” has a document frequency of 3 because it occurs in all three documents. Note that the document frequency of “the” is still 3, even though “the” occurs twice in the last document, because the inverse document frequency only checks whether a term occurs in a document, not how often it occurs there; that’s the job of the term frequency!
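The difference between counting documents and counting occurrences can be sketched in a few lines of Python. The three documents below are hypothetical stand-ins for those in figure 6.2, and the tokenizer is again a naive assumption.

```python
import re

def document_frequency(term, documents):
    """Number of documents that contain `term` at least once."""
    term = term.lower()
    return sum(
        1 for doc in documents
        if term in set(re.findall(r"\w+", doc.lower()))  # presence only
    )

# Hypothetical documents standing in for figure 6.2
docs = [
    "We use Elasticsearch to power the search for our website.",
    "The developers like Elasticsearch so far.",
    "The scoring of documents is calculated by the scoring formula.",
]

print(document_frequency("elasticsearch", docs))  # 2
print(document_frequency("the", docs))            # 3, although doc 3 has it twice

# The simplified "inverse" weight described above: 1/DF
print(1 / document_frequency("elasticsearch", docs))  # 0.5
```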

Inverse document frequency is an important factor in balancing out the frequency of a term. For instance, consider a user who searches for “the score”; the word “the” is likely to be in almost all regular English text, so if it were not balanced out, its frequency would overwhelm the frequency of the word “score.” The IDF reduces the relevancy impact of common words like “the,” so the actual relevancy score gives a more accurate sense of the query’s terms.

Once the TF and the IDF have been calculated, you’re ready to calculate the score of a document using the TF-IDF formula.

6.1.4. Lucene’s scoring formula

Lucene’s default scoring formula, known as TF-IDF, as discussed in the previous section, is based on both the term frequency and the inverse document frequency of a term. First let’s look at the formula, shown in figure 6.3, and then we’ll tackle each part individually.

Figure 6.3. Lucene’s scoring formula for a score given a query and document

Reading this in human English, we would say “The score for a given query q and document d is the sum (for each term t in the query) of the square root of the term frequency of the term in document d, times the inverse document frequency of the term squared, times the normalization factor for the field in the document, times the boost for the term.”

Whew, that’s a mouthful! Don’t worry; you don’t need to have this formula memorized to use Elasticsearch. We’re providing it here so you can understand how the formula is computed. The important part is to understand how the term frequency and the inverse document frequency of a term affect the score of the document and how they’re integral in determining the score for a document in an Elasticsearch index.
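Written out in notation, the sentence-form reading of the formula corresponds to the following. This is a transcription of that description; the symbol names (tf, idf, norm, boost) are chosen here for illustration.

```latex
\mathrm{score}(q, d) = \sum_{t \in q}
    \sqrt{\mathrm{tf}(t, d)} \cdot \mathrm{idf}(t)^{2}
    \cdot \mathrm{norm}(t, d) \cdot \mathrm{boost}(t)
```

Here tf(t, d) is the term frequency of term t in document d, idf(t) is the inverse document frequency of t, norm(t, d) is the normalization factor for the field in the document, and boost(t) is the boost for the term.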

The higher the term frequency, the higher the score; similarly, the inverse document frequency is higher the rarer a term is in the index. Although we’re now finished with TF-IDF, we’re not finished with the default scoring function of Lucene. Two things are missing: the coordination factor and the query normalization. The coordination factor takes into account how many of the query’s terms were found in a document, so a document that matches more of the terms is rewarded. The query norm is an attempt to make the results of queries comparable. It turns out that this is difficult, and in reality you shouldn’t compare scores among different queries. This default scoring method is a combination of the TF-IDF and the vector space model.
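As a rough illustration of how the TF-IDF pieces combine, here is a small Python sketch of the summation described earlier. It deliberately leaves out the coordination factor and the query norm. Several details are assumptions rather than anything stated in this chapter: the naive tokenizer, the sample documents, the IDF form 1 + ln(N / (DF + 1)), and the length normalization 1 / sqrt(number of terms). Don’t expect it to reproduce the scores Elasticsearch reports.

```python
import math
import re

def tokenize(text):
    """Naive lowercase word tokenizer (assumption, not a real analyzer)."""
    return re.findall(r"\w+", text.lower())

def idf(term, docs_tokens):
    """Assumed IDF form: rarer terms get a higher value."""
    df = sum(1 for tokens in docs_tokens if term in tokens)
    return 1.0 + math.log(len(docs_tokens) / (df + 1))

def score(query, doc_tokens, docs_tokens, boost=1.0):
    """Sum over query terms of sqrt(tf) * idf^2 * norm * boost."""
    norm = 1.0 / math.sqrt(len(doc_tokens))  # assumed field-length norm
    total = 0.0
    for term in tokenize(query):
        tf = doc_tokens.count(term)
        if tf == 0:
            continue  # a term missing from the document adds nothing
        total += math.sqrt(tf) * idf(term, docs_tokens) ** 2 * norm * boost
    return total

# Hypothetical documents for illustration
docs = [
    "We use Elasticsearch to power the search for our website.",
    "The developers like Elasticsearch so far.",
    "The scoring of documents is calculated by the scoring formula.",
]
docs_tokens = [tokenize(d) for d in docs]

for doc, tokens in zip(docs, docs_tokens):
    print(round(score("elasticsearch", tokens, docs_tokens), 3), doc)
```

In this toy index, the third document scores 0 for “elasticsearch” because it never mentions the term, and the shorter second document scores higher than the first because of the length normalization.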

If you’re interested in learning more about this, we recommend checking out the Javadocs for the org.apache.lucene.search.similarities.TFIDFSimilarity Java class in the Lucene documentation.