Chapter 6. Searching with relevancy

This chapter covers

How scoring works inside Lucene and Elasticsearch

Boosting the score of a particular query or field

Understanding term frequency, inverse document frequency, and relevancy scores with the explain API

Reducing the impact of scoring by rescoring a subset of documents

Gaining ultimate power over scoring using the function_score query

The field data cache and how it affects Elasticsearch instances

In the world of free text, being able to match a document to a query is a feature touted by many different storage and search engines. What really makes an Elasticsearch query different from doing a SELECT * FROM users WHERE name LIKE 'bob%' is the ability to assign a relevancy, also known as a score, to a document. From this score you know how relevant the document is to the original query.
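
To make the contrast concrete, here is a minimal Python sketch. It isn't Elasticsearch code, and the scoring function is a deliberately naive stand-in (the share of a document's tokens that equal the query term), not the formula Lucene uses. The sample documents and the function names like_prefix, toy_score, and ranked_search are made up for illustration.

```python
# Hypothetical sample data: document id -> text.
docs = {
    1: "bob likes elasticsearch",
    2: "bob met bob",
    3: "alice prefers sql",
}


def like_prefix(docs, prefix):
    """Boolean matching, in the spirit of LIKE 'bob%'.

    A document either matches or it doesn't; nothing says which
    match is better, so the ids are simply returned in id order.
    """
    return sorted(doc_id for doc_id, text in docs.items()
                  if text.startswith(prefix))


def toy_score(text, term):
    """Toy relevancy: fraction of tokens equal to the term.

    This is an assumption made for the example, NOT Lucene's scoring.
    """
    tokens = text.lower().split()
    return tokens.count(term) / len(tokens) if tokens else 0.0


def ranked_search(docs, term):
    """Return (doc_id, score) pairs for matching documents, best first."""
    scored = [(doc_id, toy_score(text, term)) for doc_id, text in docs.items()]
    hits = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(like_prefix(docs, "bob"))    # matching ids only, no ranking
    print(ranked_search(docs, "bob"))  # matching ids with scores, best first
```

Both functions find the same two documents, but only the second one can say that document 2, where the term makes up a larger share of the text, is the closer match.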

When users type a query into a search box on a website, they expect to find not only results matching their query but also those results ranked based on how closely they match the query’s criteria. As it turns out, Elasticsearch is quite flexible when it comes to determining the relevancy of a document, and there are a lot of ways to customize your searches to provide more relevant results.

Don’t fret if you find yourself in a position where you don’t particularly care about how well a document matches a query but only that it does or does not match. This chapter also deals with some flexible ways to filter out documents. It’s also important to understand the field data cache, the in-memory cache where Elasticsearch stores the values of fields from the documents in the index so that it can sort, script, or aggregate on those values.

We’ll start the chapter by talking about the scoring Elasticsearch does, as well as an alternative to the default scoring algorithm. We’ll then move on to affecting the score directly using boosting and to understanding how the score was computed using the explain API. After that we’ll cover how to reduce the impact of scoring using query rescoring, how to extend queries for ultimate control over scoring with the function_score query, and how to do custom sorting using a script. Finally, we’ll talk about the in-memory field data cache, how it affects your queries, and an alternative to it called doc values.

Before we get to the field data cache, though, let’s start at the beginning with how Elasticsearch calculates the score for documents.