5.7. Summary

You should now understand how Elasticsearch breaks apart a field’s text before indexing or querying.

Text is broken into tokens, and then filters are applied to create, delete, or modify those tokens:

Analysis is the process of making tokens out of the text in fields of your documents. The same process is applied to your search string in queries such as the match query. A document matches when its tokens match tokens from the search string.
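To see exactly which tokens a piece of text produces, you can run it through the _analyze API. The following is a minimal sketch, assuming a recent Elasticsearch version and the Kibana Console request format; the sample text is made up:

    POST _analyze
    {
      "analyzer": "standard",
      "text": "Elasticsearch breaks text into tokens"
    }

The response lists the tokens (elasticsearch, breaks, text, into, tokens) with their positions and offsets, which is the same token stream a match query would compare against for a field using the standard analyzer.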

Each field is assigned an analyzer through the mapping. That analyzer can be defined in your Elasticsearch configuration or in the index settings, or you can rely on a default analyzer.
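A sketch of setting the analyzer per field in the mapping; the index and field names are hypothetical, and the syntax assumes Elasticsearch 7 or later:

    PUT my-index
    {
      "mappings": {
        "properties": {
          "description": { "type": "text", "analyzer": "english" },
          "title":       { "type": "text" }
        }
      }
    }

Fields with no explicit analyzer, like title here, fall back to the index default, which is the standard analyzer unless you configure another one.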

Analyzers are processing chains made up of a tokenizer, which can be preceded by one or more char filters and followed by one or more token filters.
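A sketch of a custom analyzer that shows the whole chain, with a char filter before the tokenizer and a token filter after it; the names my-index, and_char_filter, and my_analyzer are made up:

    PUT my-index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "and_char_filter": {
              "type": "mapping",
              "mappings": ["& => and"]
            }
          },
          "analyzer": {
            "my_analyzer": {
              "type": "custom",
              "char_filter": ["and_char_filter"],
              "tokenizer": "standard",
              "filter": ["lowercase"]
            }
          }
        }
      }
    }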

Char filters are used to process strings before passing them to the tokenizer. For example, you can use the mapping char filter to change “&” to “and.”
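You can try a char filter on its own by defining it inline in an _analyze request, which recent Elasticsearch versions support; the sample text is made up:

    POST _analyze
    {
      "tokenizer": "standard",
      "char_filter": [
        { "type": "mapping", "mappings": ["& => and"] }
      ],
      "text": "tea & coffee"
    }

The tokens come back as tea, and, and coffee, because the char filter rewrote the string before the tokenizer ever saw it.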

Tokenizers are used for breaking strings into multiple tokens. For example, the whitespace tokenizer can be used to make a token out of each word delimited by a space.
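A sketch of the whitespace tokenizer on its own; the sample text is made up:

    POST _analyze
    {
      "tokenizer": "whitespace",
      "text": "big data, big analysis"
    }

This yields four tokens: big, data, (with the comma attached), big, and analysis, because the whitespace tokenizer splits only on whitespace and leaves punctuation in place.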

Token filters are used to process tokens coming from the tokenizer. For example, you can use stemming to reduce a word to its root and make your searches work across both plural and singular versions of that word.
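A sketch that adds a stemming token filter after the tokenizer, using the stemmer filter configured for English; the sample text is made up:

    POST _analyze
    {
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        { "type": "stemmer", "language": "english" }
      ],
      "text": "organizing events"
    }

Here events is reduced to event, so a query for the singular form also matches documents that were indexed with the plural.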

Ngram token filters make tokens out of portions of words. For example, you can make a token out of every two consecutive letters. This is useful when you want your searches to work even if the search string contains typos.
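A sketch of a 2-gram token filter defined inline in the request; the sample text is made up:

    POST _analyze
    {
      "tokenizer": "standard",
      "filter": [
        { "type": "ngram", "min_gram": 2, "max_gram": 2 }
      ],
      "text": "event"
    }

The single token event becomes ev, ve, en, and nt, so a search string with a small typo still shares most of its 2-grams with the indexed term.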

Edge ngrams are like ngrams, but they work only from the beginning or the end of the word. For example, you can take “event” and make e, ev, and eve tokens.
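The same kind of request with an edge_ngram filter reproduces the e, ev, and eve example; min_gram and max_gram control the shortest and longest tokens that are produced:

    POST _analyze
    {
      "tokenizer": "standard",
      "filter": [
        { "type": "edge_ngram", "min_gram": 1, "max_gram": 3 }
      ],
      "text": "event"
    }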

Shingles are like ngrams at the phrase level. For example, you can generate terms out of every two consecutive words from a phrase. This is useful when you want to boost the relevance of multiple-word matches, like in the short description of a product. We’ll talk more about relevancy in the next chapter.
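A sketch of a shingle filter that emits only two-word terms; output_unigrams is set to false so the single words are dropped from the output, and the sample text is made up:

    POST _analyze
    {
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false
        }
      ],
      "text": "introduction to Elasticsearch"
    }

The resulting tokens are introduction to and to elasticsearch; a query that produces the same shingles can then boost documents where those words appear next to each other.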