Chapter 5. Analyzing your data

This chapter covers

Analyzing your document’s text with Elasticsearch

Using the analysis API

Tokenization

Character filters

Token filters

Stemming

Analyzers included with Elasticsearch

So far we’ve covered indexing and searching your data, but what actually happens to the text in a document when you send it to Elasticsearch? How can Elasticsearch find specific words within sentences, even when the case changes? For example, when a user searches for “nosql,” you’d generally want a document containing the sentence “share your experience with NoSql & big data technologies” to match, because it contains the word NoSql. Using what you learned in the previous chapter, you can run a query_string search for “nosql” and find the document. In this chapter you’ll learn why the query_string query returns that document. By the end of the chapter, you’ll have a better idea of how Elasticsearch’s analysis lets you search your documents in a more flexible manner.

5.1. What is analysis?

Analysis is the process Elasticsearch performs on the body of a document before the document is added to the inverted index. For every analyzed field, Elasticsearch goes through the following steps:

Character filtering— Transforms the characters using a character filter

Breaking text into tokens— Breaks the text into a set of one or more tokens

Token filtering— Transforms each token using a token filter

Token indexing— Stores those tokens into the index

We’ll talk about each step in more detail next, but first let’s look at the entire process summed up in a diagram. Figure 5.1 shows the text “share your experience with NoSql & big data technologies” transformed into the analyzed tokens share, your, experience, with, nosql, big, data, tools, and technologies. The output doesn’t have to mirror the input word for word: the ampersand is gone, NoSql is lowercased, and tools, which isn’t in the original sentence, has been added during analysis. The analyzer shown is a custom analyzer built from the provided character filters, tokenizers, and token filters. We discuss custom analyzers in more depth later in this chapter.

Figure 5.1. Overview of the analysis process of a custom analyzer using standard components
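
To make the four steps concrete, here is a minimal sketch in plain Python that imitates the flow of figure 5.1. It is a toy model, not Elasticsearch’s implementation. The function names, the mapping of & to and, the stop word and, and the technologies/tools synonym are all assumptions, chosen only so that the output matches the tokens listed for the figure.

```python
import re
from collections import defaultdict

# Hypothetical settings, picked to reproduce the tokens listed for figure 5.1.
CHAR_MAPPINGS = {"&": "and"}                            # character filter
STOPWORDS = {"and"}                                     # stop-word token filter
SYNONYMS = {"technologies": ["tools", "technologies"]}  # synonym token filter


def char_filter(text):
    """Step 1: transform characters before the text is tokenized."""
    for old, new in CHAR_MAPPINGS.items():
        text = text.replace(old, new)
    return text


def tokenize(text):
    """Step 2: break the text into tokens (here, runs of word characters)."""
    return re.findall(r"\w+", text)


def token_filters(tokens):
    """Step 3: lowercase each token, drop stop words, expand synonyms."""
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOPWORDS:
            continue
        result.extend(SYNONYMS.get(token, [token]))
    return result


def analyze(text):
    """Run steps 1 to 3 and return the analyzed tokens."""
    return token_filters(tokenize(char_filter(text)))


def index_document(inverted_index, doc_id, text):
    """Step 4: store each token in the inverted index (token -> document IDs)."""
    for token in analyze(text):
        inverted_index[token].add(doc_id)


inverted_index = defaultdict(set)
text = "share your experience with NoSql & big data technologies"
index_document(inverted_index, 1, text)

print(analyze(text))
# Analyzing the search term the same way lets "NoSql" find the token "nosql".
search_token = analyze("NoSql")[0]
print(inverted_index.get(search_token, set()))
```

Because the same lowercasing is applied to both the indexed text and the search term in this sketch, a search for NoSql or nosql ends up looking for the same token, which is the behavior the chapter introduction describes.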