C.2. Highlighting options

Besides choosing which fields to work with, you can configure highlighting with other options, like these:

Adjusting the size of highlighted fragments and their number

Changing highlighting tags and encoding

Specifying a different query for highlighting, instead of the main query

We’ll discuss all of these next.

C.2.1. Size, order, and number of fragments

By default, highlighting elasticsearch in an event’s description field shows only a fragment of about 100 characters around the highlighted terms. As you might have noticed from listings C.1 and C.2, a fragment doesn’t always contain the whole field, and the default amount of context may be too large or too small for your needs:

"description" : [ "can meet and greet and I will present on some <em>Elasticsearch</em> basics and how we use it." ]

We say about 100 characters because Elasticsearch tries to make sure that words aren’t truncated.

Fragment size

Naturally, there’s a fragment_size option to change the default fragment size. Setting it to 0 will show the entire field content, which works nicely for short fields like names.

You can set fragment size globally for all fields and individually for each field. Individual settings override global settings, as shown in the next listing, where you’ll search for “Elasticsearch,” “Logstash,” and “Kibana” in the description field.

Listing C.4. Field-specific fragment_size setting overrides the global setting

You can see from this listing that if the fragment size is small enough and there are enough occurrences of the term, multiple fragments are generated.
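As a rough sketch of how a global and a field-specific setting can sit side by side, a request body might look like the following. The global value of 100, the field-specific value of 40, and the choice to also highlight the title field are illustrative assumptions, not values taken from the listing:

```json
{
  "query": {
    "match": {
      "description": "elasticsearch logstash kibana"
    }
  },
  "highlight": {
    "fragment_size": 100,
    "fields": {
      "title": {},
      "description": {
        "fragment_size": 40
      }
    }
  }
}
```

Here title would inherit the global fragment_size, while description would use its own, smaller value.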

Order of fragments

By default, fragments are returned in the order in which they appear in the text, as you saw in listing C.4. This works well for short texts, where the natural order of fragments gives a better overview of the whole content. For example, the description fragments you got back in listing C.4 do a good job of showing the description.

For large documents, such as books, the natural order doesn’t work so well because fragments can be far apart, so the user won’t see any link. For example, if you searched for “elasticsearch parent child” in this book, the top two fragments might look like this:

"we will discuss how Elasticsearch works and"

"the child aggregation works on buckets generated by"

Not terribly relevant, assuming you were looking for parent-child relationships in Elasticsearch. Even though the book itself is relevant because it discusses the topic, it would have been nicer to show a fragment that appears later in the book:

"parent-child relationships work with different Elasticsearch documents"

When you’re highlighting large fields, it makes sense to arrange fragments in the order of their relevance to the query because users are likely to be interested in seeing those relevant parts in order, so they can decide if the result is what they expected.

The highlighter calculates a TF-IDF score for each fragment, much as it calculates scores for documents within the index. To order fragments by this score, you have to set order to score in the highlight part of the request. As is done with fragment sizes, you can set the order individually and/or globally. For example, the following highlight section will change the order of fragments for the “elasticsearch logstash kibana” query you ran in listing C.4:

"highlight": {
  "fields": {
    "description": {
      "fragment_size": 40,
      "order": "score"
    }
  }
}

You can see that the fragment matching more terms appears first because it has a higher score:

"description" : [ "logging with <em>Logstash</em> as well as <em>Kibana</em>!", "dive for what <em>Elasticsearch</em> is and how it" ]

Number of fragments

With big documents such as books, it makes sense to show only one large, relevant fragment. Multiple small fragments work better for describing smaller fields, like the event descriptions you’ve worked with so far. You can adjust the number of fragments by setting number_of_fragments (shocker!), which defaults to 5:

"highlight": {
  "fields": {
    "description": {
      "number_of_fragments": 1
    }
  }
}

For really small fields, such as names or short descriptions, you can set number_of_fragments to 0. This will skip using fragments altogether and return the whole field as a single fragment, ignoring the value of fragment_size.
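A minimal sketch of that setting, assuming a short field called name (the field name is only an example), could look like this:

```json
{
  "highlight": {
    "fields": {
      "name": {
        "number_of_fragments": 0
      }
    }
  }
}
```

With number_of_fragments set to 0, any fragment_size you specify for the same field has no effect.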

With the size, order, and number of fragments figured out, let’s move on to configuring how those fragments are returned.

C.2.2. Highlighting tags and fragment encoding

You can change the <em> and </em> tags that are used by default through the pre_tags and post_tags options. In the following listing, you’ll use <b> and </b> instead.

Listing C.5. Custom highlighting tags
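A request body using these options might look roughly like the sketch below. The query and the highlighted field are assumptions chosen for illustration; only pre_tags and post_tags matter here, and each takes an array of tags:

```json
{
  "query": {
    "match": {
      "description": "elasticsearch"
    }
  },
  "highlight": {
    "pre_tags": ["<b>"],
    "post_tags": ["</b>"],
    "fields": {
      "description": {}
    }
  }
}
```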

If your custom tags are HTML like the default ones, you probably want to render the fragments in HTML to show them in some user interface. Here you might encounter a problem: by default, Elasticsearch returns fragments without any encoding, so they won’t render properly if there are special characters, such as the ampersand (&). For example, a fragment that’s highlighted as <em>select</em>&copy would appear as shown in figure C.2, because the &copy sequence is interpreted as the copyright character.

Figure C.2. The lack of fragment encoding can make the browser interpret HTML incorrectly.

The ampersand needs to be escaped as &amp;amp;. You can do that by setting encoder to html:

"highlight": {
  "encoder": "html",
  "fields": {
    "title": {}
  }
}

The HTML encoder will make the text render properly, as shown in figure C.3.

Figure C.3. Using the HTML encoder avoids parsing mistakes.

Now that we’ve gone through customizing the contents of fragments, let’s take a step back and look at the query that generated the highlighted fragments in the first place. By default, terms from the main query are used, but you can define a custom query.

C.2.3. Highlight query

Using the main query for highlighting works for most use cases, but there are some that require special care—for example, if you use rescore queries.

You first met rescoring in chapter 6, when we discussed relevancy. Rescoring allows you to improve the ranking of results by running alternative (often expensive) queries only on the top N of the overall result set. Elasticsearch then combines the original score with the score from the rescore queries to get the final ranking. The problem: terms from rescore queries aren’t used for highlighting.

This is where custom highlight queries become useful—for example, if the main query is looking for groups with elasticsearch or simply search in their name, and you also want to boost the presence of tags that end with search, like enterprise search. A wildcard query for *search is expensive, as you saw in chapter 10, section 10.4.1, so you can put this criterion in a rescore query that runs on only the top 200 documents.

In the listing that follows, you’ll see how you can put elasticsearch and search names plus *search tags in the highlight query to highlight all the terms involved in the search. You can see that wildcards are expanded and highlight matching tags like enterprise search.

Listing C.6. Highlight query contains terms from the main and the rescore query
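The overall shape of such a request can be sketched as follows. Treat it as an approximation under stated assumptions: the field names name and tags, the exact query types, and the placement of highlight_query at the top level of the highlight section are illustrative choices rather than a reproduction of the listing. The window_size of 200 mirrors the top 200 documents mentioned earlier:

```json
{
  "query": {
    "match": {
      "name": "elasticsearch search"
    }
  },
  "rescore": {
    "window_size": 200,
    "query": {
      "rescore_query": {
        "wildcard": {
          "tags": "*search"
        }
      }
    }
  },
  "highlight": {
    "highlight_query": {
      "bool": {
        "should": [
          { "match": { "name": "elasticsearch search" } },
          { "wildcard": { "tags": "*search" } }
        ]
      }
    },
    "fields": {
      "name": {},
      "tags": {}
    }
  }
}
```

The highlight query repeats the criteria from both the main query and the rescore query, so terms matched by either one can be highlighted.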

Now let’s take a deeper look at how highlighting works under the hood. This will allow you to choose the implementation that works best for your use case.