C.3. Highlighter implementations

So far we’ve assumed that you’re using the default highlighter implementation called Plain. The Plain Highlighter works by re-analyzing the text from each field to identify terms to highlight and where those terms are located in the text. This is good for most use cases and only requires highlighted fields to be stored, either independently or in the _source field. Because it has to analyze the text again, the Plain Highlighter can be slow for large fields; for example, when you index books or blog post contents.

For such use cases, two other implementations come in handy:

Postings Highlighter

Fast Vector Highlighter

Both are faster than the Plain Highlighter on large fields, but both require additional data to be stored in the index, and that data is what makes them fast. Each also comes with its own unique features, which are discussed next.

If it’s not obvious which one is best for you, we suggest starting with the Plain Highlighter and moving on to the Postings Highlighter for fields where the Plain Highlighter proves to be too slow, because the Postings Highlighter adds little overhead in terms of index size and also works well if fields are smaller. If the Postings Highlighter doesn’t give you the needed functionality, try the Fast Vector Highlighter.

C.3.1. Postings Highlighter

The Postings Highlighter requires you to set index_options to offsets for highlighted fields, which will store each term’s location (position and offset) in the index. As you can see in listing C.7, offsets indicate the exact position of a certain term in the text, and with this information, the Postings Highlighter is able to identify which terms to highlight without having to re-analyze the text.

In this listing you’ll use the Analyze API, which you first encountered in chapter 5 on analysis.

Listing C.7. Analyze API showing offsets

When analyzing the text, Elasticsearch is able to extract each term’s offsets in order to store its exact location. With offsets stored, Elasticsearch doesn’t have to analyze the text again during highlighting in order to locate each term. Adding term offsets to the index is a typical tradeoff where you allow slower indexing and a bigger index in order to get better query latency. You saw many such performance tradeoffs in chapter 10.
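
To make positions and offsets concrete, here is a minimal Python sketch that imitates the kind of information an analyzer reports for each term. It assumes a very simple tokenizer (runs of word characters, lowercased) and zero-based positions, which only approximates real analyzers. The function name analyze_with_offsets is hypothetical and not part of Elasticsearch.

```python
import re


def analyze_with_offsets(text):
    """Return one dict per token with its position and character offsets.

    Assumption: tokens are runs of word characters, lowercased. Real
    analyzers apply more rules than this.
    """
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),
            "position": position,          # ordinal of the token in the text
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
        })
    return tokens


for token in analyze_with_offsets("Introduction to Elasticsearch"):
    print(token)
```

With offsets like these stored in the index, a highlighter can wrap tags around text[start_offset:end_offset] without tokenizing the text again.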

When you set index_options to offsets, the Postings Highlighter is used automatically. For example, in the next listing you’ll enable offsets for the content field of a new index, add two documents, and highlight them.

Listing C.8. Using the Postings Highlighter

You can see from this listing that the highlighted samples are sentences, whether large or small. The Postings Highlighter will ignore the fragment_size option if you set it; fragments will always be sentences unless you set number_of_fragments to 0, in which case the whole field is treated as one fragment.
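
As a rough illustration of the settings involved, the following Python sketch builds the request bodies as plain dictionaries, without sending them anywhere. The type name doc and the helper highlight_request are placeholders chosen for this sketch. The index_options and number_of_fragments settings are the ones described in the text.

```python
import json

# Mapping body: store offsets in the index for the content field.
# The type name "doc" is a placeholder chosen for this sketch.
mapping_body = {
    "mappings": {
        "doc": {
            "properties": {
                "content": {
                    "type": "string",
                    "index_options": "offsets",
                }
            }
        }
    }
}


def highlight_request(query_text, whole_field=False):
    """Build a search body that highlights the content field."""
    field_options = {}
    if whole_field:
        # number_of_fragments = 0 treats the whole field as one fragment.
        field_options["number_of_fragments"] = 0
    return {
        "query": {"match": {"content": query_text}},
        "highlight": {"fields": {"content": field_options}},
    }


print(json.dumps(mapping_body, indent=2))
print(json.dumps(highlight_request("elasticsearch", whole_field=True), indent=2))
```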

Tip

If you want to set the highlighter implementation manually, you can do so by setting type to plain (for the Plain Highlighter), postings (for the Postings Highlighter), or fvh (for the Fast Vector Highlighter). This can be done globally or per field and is useful if you change your mind about the implementation and you don’t want to re-index. For example, you index offsets but don’t like the sentence-as-fragment approach of the Postings Highlighter, so you need a way to get back to using the Plain Highlighter.
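
The following Python sketch shows one way the type setting could be placed in a highlight section, both globally and per field. The helper build_highlight and the field names are hypothetical. Only the type values plain, postings, and fvh come from the text.

```python
ALLOWED_TYPES = {"plain", "postings", "fvh"}


def build_highlight(fields, global_type=None):
    """Build a highlight section.

    fields maps a field name to a highlighter type, or to None to inherit
    the global setting (or the automatically selected highlighter).
    """
    highlight = {"fields": {}}
    if global_type is not None:
        if global_type not in ALLOWED_TYPES:
            raise ValueError("unknown highlighter type: %s" % global_type)
        highlight["type"] = global_type
    for name, field_type in fields.items():
        options = {}
        if field_type is not None:
            if field_type not in ALLOWED_TYPES:
                raise ValueError("unknown highlighter type: %s" % field_type)
            options["type"] = field_type  # per-field override
        highlight["fields"][name] = options
    return highlight


# Force the Plain Highlighter on content, even if offsets are indexed.
print(build_highlight({"content": "plain", "title": None}))
```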

Internally, the Postings Highlighter breaks the field into sentences (which then become fragments) and treats those sentences as separate documents, scoring them by using BM25 similarity. As we discussed in chapter 6, BM25 is a TF-IDF–based similarity that works well for short fields, which is what your sentences are expected to be.
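
To give a feel for sentence-level scoring, here is a simplified Python sketch that treats each sentence as a tiny document and scores it with a common form of the BM25 formula. It is an illustration under stated assumptions, not the highlighter’s actual code: the tokenizer, the parameter values k1 and b, and the function name bm25_scores are all choices made for this sketch.

```python
import math
import re


def bm25_scores(sentences, query_terms, k1=1.2, b=0.75):
    """Score each sentence against the query terms with a basic BM25 formula."""
    docs = [re.findall(r"\w+", s.lower()) for s in sentences]
    if not docs:
        return []
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    scores = []
    for doc in docs:
        score = 0.0
        for term in query_terms:
            tf = doc.count(term)
            if tf == 0:
                continue
            # Number of sentences containing the term.
            df = sum(1 for d in docs if term in d)
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            # Longer sentences are penalized through the length normalization.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


sentences = [
    "Elasticsearch is great",
    "Elasticsearch is a distributed search engine built on Lucene",
    "Nothing here",
]
print(bm25_scores(sentences, ["elasticsearch"]))
```

In this model, a short sentence that contains the term scores higher than a long one with the same term frequency, and a sentence without the term scores zero.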

Because of the way it creates and scores fragments, the Postings Highlighter works well when you’re indexing natural language, such as books or blogs. It might not work so well when you’re indexing code, for example, because the concept of a sentence often doesn’t work, and you can end up with the entire field as a single fragment and no options to reduce the fragment size.

Another downside of the Postings Highlighter is that, at least in version 1.4, it doesn’t work well with phrase queries because it only accounts for individual terms. For example, in the next listing you’ll look for the phrase "Elasticsearch intro" by using a match_phrase query.

Listing C.9. Postings Highlighter matches all the terms and discounts phrases

You get individual terms highlighted even if they don’t belong to the phrase, which doesn’t happen with the Plain Highlighter. On the upside, although indexing offsets increases your index size and slows down indexing a bit, the overhead is lower than what you get when adding term vectors, which are needed by the Fast Vector Highlighter.

C.3.2. Fast Vector Highlighter

To enable the Fast Vector Highlighter for a field, you have to set term_vector to with_positions_offsets in the mapping. This will allow Elasticsearch to identify terms as well as their location in the text without re-analyzing the field content. For large fields (for example, those over 1 MB), the Fast Vector Highlighter is faster than the Plain Highlighter.

What are term vectors?

Term vectors are a way to represent documents by using terms as dimensions. For example, the following diagram represents a document with the Elasticsearch and Logstash terms and another document

You can also represent a query as another vector and rank documents based on the distance between the query vector and each document’s vector. Another application is to add other metadata to each document (for example, the field’s total size) that will influence ranking. For more information about term vectors and their use, go to https://en.wikipedia.org/wiki/Vector_space_model.
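
As a small illustration of the vector space idea, the Python sketch below represents documents and a query as term-frequency dictionaries and ranks the documents by cosine similarity. Cosine similarity is one common choice for comparing vectors and is used here as an assumption. The sample documents are invented for the sketch, and this is not how Elasticsearch itself computes scores.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors (dicts)."""
    terms = set(a) | set(b)
    dot = sum(a.get(t, 0) * b.get(t, 0) for t in terms)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Each dimension is a term; each value is how often the term occurs.
documents = {
    "doc1": {"elasticsearch": 1, "logstash": 1},
    "doc2": {"elasticsearch": 2},
}
query = {"logstash": 1}

ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```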

For highlighting, this metadata has to be the list of positions and offsets for each term. This is why the Fast Vector Highlighter needs the with_positions_offsets setting. Alternative settings are no (default), yes, with_positions, and with_offsets.

Both highlighters need positions and offsets, but only the Fast Vector Highlighter also has to compute and store the term vectors themselves, which are disabled by default. As a result, the Fast Vector Highlighter takes up more space and requires more computation during indexing than the Postings Highlighter.

When term_vector is set to with_positions_offsets for a field, Elasticsearch automatically uses the Fast Vector Highlighter for that field. For example, the get-together event and group descriptions from the code samples use this highlighter by default. Here’s a relevant snippet from the mapping:

"group" : {
  "properties" : {
    "description" : {
      "type" : "string",
      "term_vector": "with_positions_offsets"
    }
  }
}

Compared to the Postings Highlighter, this offers better phrase highlighting. Instead of highlighting every matching term, the Fast Vector Highlighter highlights only terms belonging to the phrase—as the Plain Highlighter did in listing C.9.

The Fast Vector Highlighter also comes with unique functionality:

It works nicely with multi-fields, because it’s able to combine matches from multi-fields into the same set of fragments.

If there are multiple words to highlight, you can highlight them with different tags.

You can configure how the boundaries of a fragment are selected.

Let’s take a deeper look at each of these features.

Highlighting multi-fields

You met multi-fields in chapter 3, section 3.3.2, as a way to index the same text in multiple ways. Multi-fields are a great way to refine your searches, but highlighting them properly may be tricky if variations of the same field produce different matches. Take the following listing, for example, where the description field is analyzed in two ways. The default is the english analyzer, which uses stemming to match search with searching. The suffix subfield uses a custom analyzer that makes use of edge ngrams to match words with common suffixes, such as elasticsearch and search. When you do a multi_match query on both of them, the Plain Highlighter can match only one field at a time.

Listing C.10. Plain Highlighter doesn’t work well with multi-fields

Here’s where the Fast Vector Highlighter comes to the rescue, because it can combine both multi-fields into one and highlight all the matches. It only requires term_vector to be set to with_positions_offsets on all the fields you need to highlight (which is the requirement for the Fast Vector Highlighter to work in the first place). You already added this setting in listing C.10. To combine multiple subfields into one, you have to indicate which subfields you want to highlight with the matched_fields option:

"highlight": {
  "fields": {
    "description": {
      "matched_fields": ["description", "description.suffix"]
    }
  }
}
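
For context, here is a Python sketch of what a complete request body combining a multi_match query with matched_fields might look like. The helper name and the query text search are assumptions made for this sketch. The field names description and description.suffix come from the text.

```python
import json


def multi_field_highlight_request(query_text, main_field, subfields):
    """Build a multi_match query that highlights all variants of a field."""
    matched = [main_field] + ["%s.%s" % (main_field, s) for s in subfields]
    return {
        "query": {"multi_match": {"query": query_text, "fields": matched}},
        "highlight": {
            "fields": {
                # Matches from every listed field are merged into the
                # fragments returned for the main field.
                main_field: {"matched_fields": matched}
            }
        },
    }


body = multi_field_highlight_request("search", "description", ["suffix"])
print(json.dumps(body, indent=2))
```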

With the document and the query from listing C.10, you’ll have the highlighting that you’d expect:

"highlight": {
  "description": ["<em>elasticsearch</em> is about <em>searching</em>"]
}

Using different tags for different fragments

To bold the first highlighted word and italicize the second, you can specify an array of tags:

"highlight": {
  "fields": {
    "description": {
      "pre_tags": ["<b>", "<em>"],
      "post_tags": ["</b>", "</em>"]
    }
  }
}

If there are more than two words to highlight, the Fast Vector Highlighter starts over: bold the third, italicize the fourth, and so on. If you have many words to highlight, you might want to keep track of their number. You can do that by setting tags_schema to styled, like in this query:

"query": {
  "match": {
    "description": "elasticsearch logstash kibana"
  }
},
"highlight": {
  "tags_schema": "styled",
  "fields": {
    "description": {}
  }
}

If you run it on the documents from the code samples, you’ll get the first hit highlighted like this:

"highlight": {
  "description": [
    "for what <em class=\"hlt1\">Elasticsearch</em> is and how it can be used for logging with <em class=\"hlt2\">Logstash</em> as well as <em class=\"hlt3\">Kibana</em>!"
  ]
}

This allows you to take the class name (hltX) and figure out which words matched first, second, and so on.
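
On the client side, the class names can be read back with a small amount of parsing. The Python sketch below assumes the styled tags have the form <em class="hltN">...</em> and uses a hand-written fragment as input. The function name match_order is hypothetical.

```python
import re

# Assumed shape of a styled tag: <em class="hltN">text</em>
STYLED_TAG = re.compile(r'<em class="hlt(\d+)">(.*?)</em>')


def match_order(fragment):
    """Return (tag number, highlighted text) pairs sorted by tag number."""
    pairs = [(int(number), text) for number, text in STYLED_TAG.findall(fragment)]
    return sorted(pairs)


fragment = (
    'used for logging with <em class="hlt2">Logstash</em> '
    'and <em class="hlt1">Elasticsearch</em>'
)
print(match_order(fragment))
```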

Configuring boundary characters

Recall from section C.2.1 that we said fragment_size is approximate because Elasticsearch tries to make sure words aren’t truncated. If you thought then that the explanation is a bit vague, it’s because the behavior depends on the highlighter implementation.

With the Postings Highlighter, fragment size is irrelevant because it breaks the text down into sentences. The Plain Highlighter adds terms around the highlighted term until it gets close to the fragment size, which means the boundary is always a term. As you’ve seen in the listings of this chapter, this works well for natural language, but it might become problematic in other use cases where the word and term concepts don’t overlap. For example, if you’re indexing code, you may have variable definitions like this: variable_with_a_very_very_very_very_long_name = 1

To search this kind of text effectively, you’ll need an analyzer that can break this long variable and allow you to search for terms within it.

Tip

You can do this with the Pattern Tokenizer, where you specify a pattern that includes underscores—for example, (\ |_)—which will tokenize on spaces and underscores. In chapter 5 you’ll find more information about analyzers and tokenizers.

If the analyzer breaks the variable into tokens, the Plain Highlighter will break it, too, even if you don’t want it to. For example, a search for long with a fragment size of 20 would give you this:

very_very_very_very_<em>long</em>_name = 1

The Fast Vector Highlighter works differently because words aren’t the same as terms. Words are strings delimited by the following characters: .,!? \t\n. You can change the list through the boundary_chars option. When it builds fragments, it scans for those characters for up to boundary_max_scan characters (defaults to 20) from the limits that are normally set by fragment_size. If it doesn’t find such boundary characters while scanning, the fragment is truncated. By default, the Fast Vector Highlighter will truncate the code sample while highlighting long: ry_very_<em>long</em>_name = 1

You can fix this by changing the defaults in two ways. One is to add the underscore to the list of boundary characters. This will still truncate the variable but in a more predictable way:

"highlight": {
  "fields": {
    "description": {
      "fragment_size": 20,
      "boundary_chars": ".,!? \t\n_"
    }
  }
}

# will yield very_very_<em>long</em>_name = 1

The other option is to leave boundary_chars set to the default and extend boundary_max_scan instead, which will increase the chances of having the whole variable included in the fragment, even if it implies a higher fragment size for this particular fragment: variable_with_a_very_very_very_very_<em>long</em>_name = 1

Issues with fragment boundaries are typically visible when you need small fragments. For bigger chunks, inaccurate boundaries are less likely to be visible to users because their attention tends to focus on the highlighted bits and the words around them, not on the fragment as a whole. Another parameter to configure for the Fast Vector Highlighter is the fragment_offset. With this parameter you can control the margin to start the highlighting from.

Limiting the number of matches for the Fast Vector Highlighter

The final configuration option we discuss is the phrase_limit parameter. If the Fast Vector Highlighter matches many phrases, it could consume a lot of memory. By default, only the first 256 matches are used. You can change this number with the phrase_limit parameter.