F.1. Did-you-mean suggesters

Term and phrase suggesters can help you avoid those nasty “0 results found” pages by correcting typos and/or showing more popular variations of the original keywords. For example, you may suggest Lucene/Solr for Lucene/Solar.

You can leave it up to the user to run the suggester query:

(Did you mean Lucene/Solr?)

Or you can run it automatically:

(Showing results for Lucene/Solr. Click here for Lucene/Solar).

Typically, you’d run the suggested query automatically if the original query produces no results or just a few results with tiny scores.

Before we dive into the details of how you’d use the term and phrase suggesters, let’s look at how they compare:

The term suggester is basic and fast, working well when you care only about the occurrence of each word, like when you search code or short texts.

The phrase suggester, on the other hand, takes the input text as a whole. This is slower and more complicated, as we’ll see in a bit, but also works much better for natural languages or other places where you need to consider the sequence of words, like product names. For example, apple iphone is probably a better suggestion than apple phone, even if the word phone appears more often in the index.

Both the term and the phrase suggester use Lucene’s SpellChecker module at their core. They look at terms from the index to come up with suggestions, so you can easily add DYM functionality on top of existing data if your data can be trusted. Otherwise, if your data would often contain typos—for example, if you’re indexing social media content—you might be better off maintaining a separate index with suggestions as a “dictionary.” That separate index could contain queries that are run often and return results that are typically clicked on.

F.1.1. Term suggester

The term suggester takes the input text, analyzes it into terms, and then provides a list of suggestions for each term. This process is best illustrated in listing F.1, where you provide suggestions for group members of the get-together site example you’ve been running throughout the book.

The term suggester’s structure applies to other types of suggesters as well:

Suggest options go under a suggest element at the root of the JSON—at the same level as query or aggregations, for example.

You can have one or more suggestions, each having a name, as you can with the aggregations we discussed in chapter 7. In listing F.1 you have dym-members.

Under each suggestion, you provide the text and the suggestion type; in this case, term. Under it, you’d put type-specific options. In the term suggester’s case, the only mandatory option is the field to use for getting suggestions. In this case, you’ll use the members field.

Note

For listing F.1 to work properly, you must download the code samples from https://github.com/dakrone/elasticsearch-in-action and run populate.sh to index some sample data.

Listing F.1. Using the term suggester to correct member typos
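A sketch of such a request follows: suggest sits at the root of the JSON, next to query, and the dym-members suggestion uses the term type on the members field. The match query itself is an illustrative assumption, since the text above only describes the suggest part:

```shell
% curl 'localhost:9200/get-together/_search?pretty' -d '{
  "query": {
    "match": {
      "members": "leee daneil"
    }
  },
  "suggest": {
    "dym-members": {
      "text": "leee daneil",
      "term": {
        "field": "members"
      }
    }
  }
}'
```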

If you need only suggestions and not the query results, you can use the _suggest endpoint, skip the query object, and only pass the suggest object as a payload without the surrounding suggest keyword:

% curl localhost:9200/get-together/_suggest?pretty -d '{
  "dym-members": {
    "text": "leee daneil",
    "term": {
      "field": "members"
    }
  }
}'

This is useful when you want to check for missing terms before running the query, allowing you to correct the keywords instead of returning a potential “no results found” page.

Ranking suggestions

By default, the term suggester offers a number of suggestions (up to the value of size) for each provided term. Suggestions are sorted by how close they are to the provided text. For example, if you provide Willian you’ll get back William and then Williams. Of course, you can get back these two values only if they exist as terms in the index. Also, Elasticsearch will provide suggestions only if the initial term Willian doesn’t exist in the index.

This won’t be ideal if you’re searching through documents about Formula 1, where Williams is more likely to be searched for than either William or Willian. And you probably want to show Williams even if Willian actually exists in the index.

As you might expect, you can change all of this. You can rank popular words higher by changing sort to frequency instead of the default score. You can also change suggest_mode to decide when to show suggestions: compared to the default value of missing, popular comes up only with terms that have higher frequencies than the one provided, and always comes up with suggestions regardless.

In the next listing, you’ll get only the most popular suggestion for the event attendee mick.

Listing F.2. Getting the most popular suggestion for a term
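A sketch of such a request, assuming event attendees are indexed in a field named attendees (an illustrative assumption): size limits the result to a single suggestion, suggest_mode and sort are set as described above.

```shell
% curl 'localhost:9200/get-together/_suggest?pretty' -d '{
  "dym-attendees": {
    "text": "mick",
    "term": {
      "field": "attendees",
      "suggest_mode": "popular",
      "sort": "frequency",
      "size": 1
    }
  }
}'
```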

Choosing which terms are considered

In listing F.2 you got the winning suggestion, but who competed for that one spot? Let’s see how the term suggester works in order to understand which suggestions were considered in the first place.

As we mentioned before, the term suggester uses Lucene’s SpellChecker module. This returns terms from the index at a maximum edit distance from the provided term. You saw an example of how edit distance works in the fuzzy query in chapter 4; for example, to get from mik to mick you need to add a letter, so the edit distance between them is 1.

Like the fuzzy query, the term suggester has a number of options that let you balance flexibility and performance:

max_edits— This limits the edit distance from the provided term to the terms that might be suggested. For performance reasons, this is limited to values of 1 and 2, with 2 being the default value.

prefix_length— How much of the beginning of the word to assume is correct. The bigger the prefix, the faster Elasticsearch will find suggestions, but you also have a higher risk of typos in that prefix. The default for prefix_length is 1.

If you’re concerned about performance, you might also want to tweak these options:

min_doc_freq, which limits candidate suggestions to popular enough terms

max_term_freq, which excludes popular terms in the input text from being corrected in the first place

You can find more details about them in the documentation at www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-term.html.
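For example, here’s a sketch combining these options on the request from listing F.1. The specific values are illustrative, not recommendations:

```shell
% curl 'localhost:9200/get-together/_suggest?pretty' -d '{
  "dym-members": {
    "text": "leee daneil",
    "term": {
      "field": "members",
      "max_edits": 1,
      "prefix_length": 2,
      "min_doc_freq": 2,
      "max_term_freq": 10
    }
  }
}'
```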

If you’re more concerned about accuracy, take a look at the phrase suggester as well. It should provide better suggestions, especially on larger fields.

F.1.2. Phrase suggester

The phrase suggester also provides did-you-mean functionality, like the term suggester, but instead of giving suggestions for individual terms, it gives suggestions for the overall text. This has a couple of advantages when you have multiple words in your search.

First, there’s less client-side logic to apply. For example, if you’re using the term suggester for the input text abut using elasticsarch, you’ll probably get about as a suggestion for abut and elasticsearch for elasticsarch. Your application has to figure out that using has no suggestion and build up a message like “did you mean: about using elasticsearch.”

As you’ll see in the following listing, the phrase suggester gives you about using elasticsearch out of the box. Plus, you can use highlighting to show the user which of the original terms have been corrected.

Listing F.3. Phrase suggester working with highlighting
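A sketch of such a request: the description field is an assumption for illustration, and pre_tag/post_tag control how corrected terms get wrapped in the highlighted suggestion text.

```shell
% curl 'localhost:9200/get-together/_suggest?pretty' -d '{
  "dym": {
    "text": "abut using elasticsarch",
    "phrase": {
      "field": "description",
      "highlight": {
        "pre_tag": "<em>",
        "post_tag": "</em>"
      }
    }
  }
}'
```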

Second, you can expect suggestions to be ranked better, especially if you’re searching natural language, such as book content. The phrase suggester does that by adding new logic on top of the term suggester to weigh candidate phrases based on how terms occur together in the index. This ranking technique is called an n-gram language model, and it works if you have a shingle field with the same content as the field you’re searching on. You can get shingles by using the shingle token filter that we discussed in chapter 5; remember, this means you have to configure shingles in the mapping when you create the index.

More on n-grams, shingles, and n-gram models

An n-gram is defined as a contiguous sequence of n items from a given sequence of text or speech.[1] These items could be letters or words, and in Elasticsearch we say n-grams for letter n-grams and shingles for word n-grams.

1 https://en.wikipedia.org/wiki/N-gram

An n-gram model uses frequencies of existing word n-grams (shingles, in Elasticsearch lingo) to determine the likelihood of different words existing next to each other. For example, a speech recognition device is more likely to encounter yellow fever than hello fever, assuming it finds more yellow fever than hello fever shingles in the training data.

The phrase suggester uses n-gram models to score candidate phrases based on the occurrence of consecutive words in a shingle field. You can expect a phrase suggestion like John has yellow fever to be scored higher than John has hello fever.

The shingles field is used for ranking suggestions by checking how often suggested words occur next to each other, as shown in figure F.3.

Figure F.3. Candidate suggestions are ranked based on the shingles field.

As you might expect, there are many options that allow you to configure much of this process, and we’ll discuss the most important ones here:

How candidate generators come up with candidate terms

How overall phrases get scored based on the shingles field

How shingles of different sizes influence a suggestion’s score

How to include and exclude suggestions based on various criteria, such as score or whether they’ll actually return results

Candidate generators

The responsibility of the candidate generators is to come up with a list of possible terms based on the terms in the provided text. As of version 1.4, there’s only one type of candidate generator, called direct_generator. It works in a similar way to the term suggester in that it finds suggestions for every term of the input text.

The direct generator has similar options to the term suggester, like max_edits or prefix_length. But the phrase suggester supports more than one generator, and it also allows you to specify an analyzer that is applied to input terms before they get spell checked (pre-filter), and one that is applied to suggested terms before they are returned.

Having multiple generators and filters lets you do some neat tricks. For instance, if typos are likely to happen both at the beginning and end of words, you can use multiple generators to avoid expensive suggestions with low prefix lengths by using the reverse token filter, as shown in figure F.4.

Figure F.4. Using filters and two direct generators to correct both prefix and suffix typos

You’ll implement what’s shown in figure F.4 in listing F.4:

First, you’ll need an analyzer that includes the reverse token filter.

Then you’ll index the correct product description in two fields: one analyzed with the standard analyzer and one with the reverse analyzer.

When you run the suggester, you can specify two candidate generators: one running on the standard field and one on the reversed field, which will make use of the reverse pre- and post-filters.

Listing F.4. Using the reverse token filter in one of the two generators
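A sketch of the three steps above, assuming a shop index with a product field (names chosen to mirror the later shop2 examples; the analyzer definition and option values are illustrative). First the index with a reverse analyzer and a reversed multi-field, then the suggest request with two generators, where the second one reverses the input, corrects the suffix, and reverses the result back:

```shell
% curl -XPUT 'localhost:9200/shop?pretty' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverse_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "reverse"]
        }
      }
    }
  },
  "mappings": {
    "products": {
      "properties": {
        "product": {
          "type": "string",
          "fields": {
            "reversed": {
              "type": "string",
              "analyzer": "reverse_analyzer"
            }
          }
        }
      }
    }
  }
}'
% curl 'localhost:9200/shop/_suggest?pretty' -d '{
  "dym": {
    "text": "ifone accesories",
    "phrase": {
      "field": "product",
      "max_errors": 2,
      "direct_generator": [{
        "field": "product",
        "prefix_length": 2
      }, {
        "field": "product.reversed",
        "prefix_length": 2,
        "pre_filter": "reverse_analyzer",
        "post_filter": "reverse_analyzer"
      }]
    }
  }
}'
```

Because each generator only needs a short correct prefix on its own field (one normal, one reversed), you avoid the cost of running a single generator with prefix_length set to 0.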

Using a shingles field for scoring candidates

Now that you have good candidates, you’ll use a shingles field for ranking. In listing F.5, you’ll use the shingle token filter to define another multi-field for the shop product descriptions.

You’ll have to decide how many consecutive words to allow in a shingle, or the shingle size. This is usually a tradeoff between performance and accuracy: lower-level shingles are needed in order to get partial matches, like boosting a suggestion for United States based on an indexed text saying United States of America. Higher-level shingles are good for boosting exact matches of longer texts such as United States of America above United States of Americas. The problem is, the more shingle sizes you add, the bigger your index gets, and suggestions will take longer.

A good balance for most use cases is to index sizes from 1 to 3. You can do it by setting min_shingle_size to 2 and max_shingle_size to 3, because the shingle filter outputs unigrams by default.

With the shingles field in place, you need to specify it as the field of your phrase suggester, whereas the regular description field will go under each candidate generator.

Listing F.5. Using a shingles field to get better ranking for suggestions
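A sketch of the shingles multi-field and the matching suggest request. The shop2 index and product.shingled field match the example shown later; the analyzer and filter names are illustrative. Note that the shingled field is used for ranking (the field of the phrase suggester), while the plain field feeds the candidate generator:

```shell
% curl -XPUT 'localhost:9200/shop2?pretty' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_shingle"]
        }
      }
    }
  },
  "mappings": {
    "products": {
      "properties": {
        "product": {
          "type": "string",
          "fields": {
            "shingled": {
              "type": "string",
              "analyzer": "shingle_analyzer"
            }
          }
        }
      }
    }
  }
}'
% curl 'localhost:9200/shop2/_suggest?pretty' -d '{
  "dym": {
    "text": "ifone accesories",
    "phrase": {
      "field": "product.shingled",
      "direct_generator": [{
        "field": "product"
      }]
    }
  }
}'
```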

Using smoothing models to score shingles of different sizes

Take two possible suggestions: Elasticsearch in Action and Elasticsearch is Auction. If the index contains the trigram Elasticsearch in Action, you’d expect this suggestion to rank higher. But if term frequencies are the only criterion, and the unigram auction appears many times in the index, Elasticsearch is Auction might win.

In most use cases, you want the score to be given not only by the frequency of a shingle but also by the shingle’s size. Luckily, there are smoothing models that do just that. By default, Elasticsearch uses an algorithm called Stupid Backoff in the phrase suggester. The name implies that it’s simple, but it works well:[2] it takes the highest-order shingles as the reference (trigrams in the case of listing F.5). If no trigrams are found, it looks for bigrams but multiplies the score by 0.4. If no bigrams are found, it falls back to unigrams but multiplies the score by 0.4 once more. The whole process is shown in figure F.5.

2 Stupid Backoff was the original name, because the authors assumed such a simple algorithm couldn’t possibly work. It turns out it does work, but the name stuck. More details here: www.aclweb.org/anthology/D07-1090.

Figure F.5. Stupid Backoff discounts the score of lower-order shingles.

That 0.4 multiplier can be configured through the discount parameter:

% curl localhost:9200/shop2/_suggest?pretty -d '{
  "dym": {
    "text": "ifone accesories",
    "phrase": {
      "field": "product.shingled",
      "smoothing": {
        "stupid_backoff": {
          "discount": 0.5
        }
      },
      "direct_generator": [{
        "field": "product"
      }]
    }
  }
}'

Note

Usually, Stupid Backoff works well, but there are other smoothing models available, such as Laplace smoothing or linear interpolation. For more information about them, go to www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-phrase.html#_smoothing_models.

Excluding suggestions based on different criteria

Besides ranking suggestions based on n-gram language models, you can include or exclude them based on certain criteria. Back in listing F.4, you saw max_errors, which limits how many terms a suggestion is allowed to correct. It’s usually recommended to keep max_errors low (it defaults to 1); otherwise, the suggest request will take a long time because it has to score too many suggestions.

You can also include or exclude possible suggestions based on their score or whether they would actually produce results, should you run a query with the suggested text.

For filtering by score, the main option is confidence—the higher the value, the more confident you are that the input text doesn’t need suggestions. It works like this: the phrase suggester scores the input text as well as possible suggestions. Suggestions with a score less than the input text’s score multiplied by confidence (which defaults to 1) are eliminated. Increasing the value improves performance and helps you get rid of embarrassing suggestions like “Did you mean lucene/solar?” On the other hand, a value that’s too high might miss providing suggestions for “solr panels.”

confidence works hand in hand with real_word_error_likelihood, which should describe the proportion of misspelled words in the index itself (defaults to 0.95). Possible suggestions have their score multiplied by this value, so lowering it reduces the chances of returning a misspelled word as a suggestion, because the score of that suggestion is more likely to be lower than that of the input text (multiplied by confidence). If you set it too low, though, good suggestions might be missed as well, so it’s usually best to set real_word_error_likelihood to a value that describes the actual likelihood of a misspelling in the index.
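For instance, here’s a sketch tweaking both options on the shop2 example (the values are illustrative only):

```shell
% curl 'localhost:9200/shop2/_suggest?pretty' -d '{
  "dym": {
    "text": "ifone accesories",
    "phrase": {
      "field": "product.shingled",
      "confidence": 1.5,
      "real_word_error_likelihood": 0.9,
      "direct_generator": [{
        "field": "product"
      }]
    }
  }
}'
```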

Finally, what happens if the query you suggest won’t return any results? That would be pretty bad, but luckily you can have Elasticsearch verify that for each suggestion. In the following listing, you’ll use the collate option to have Elasticsearch return only suggestions that return results. You need to specify a query, and in that query you’ll refer to the suggestion itself as the variable. Note how suggestions such as ifone accessories are removed from the list.

Listing F.6. Using collate to see which suggestions would return results
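A sketch of such a request: the match query inside collate is an assumption for illustration, and {{suggestion}} is the variable that gets replaced with each candidate suggestion’s text before the check query runs.

```shell
% curl 'localhost:9200/shop2/_suggest?pretty' -d '{
  "dym": {
    "text": "ifone accesories",
    "phrase": {
      "field": "product.shingled",
      "direct_generator": [{
        "field": "product"
      }],
      "collate": {
        "query": {
          "match": {
            "product": "{{suggestion}}"
          }
        }
      }
    }
  }
}'
```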

These are Mustache templates (more details at https://mustache.github.io) and can also be used for predefining regular queries. You can find more details on query templates here: www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-template-query.html.

Collating works well for ironing out a few bad suggestions. If you have a high rate of bad suggestions, consider running the phrase suggester against a separate index with successful queries. This requires a lot of maintenance, but you should get much more relevant suggestions. And because that index will probably be much smaller, you’ll get better performance, too.

Next, we’ll move on to autocomplete suggesters. You’re very likely to run them on separate indices, because they typically have to be very fast and relevant.