5.2. Using analyzers for your documents

Knowing about the different types of analyzers and token filters is fine, but before they can actually be used, Elasticsearch needs to know how you want to use them. For instance, you can specify which individual tokenizer and token filters combine into an analyzer, and, in the mapping, which analyzer to use for which field.

There are two ways to specify analyzers that can be used by your fields:

When the index is created, as settings for that particular index

As global analyzers in the configuration file for Elasticsearch

Generally, to be more flexible, it’s easier to specify analyzers at index-creation time, which is also when you want to specify your mappings. This allows you to create new indices with updated or entirely different analyzers. On the other hand, if you find yourself using the same set of analyzers across your indices without changing them very often, you can also save some bandwidth by putting the analyzers into the configuration file. Examine how you’re using Elasticsearch and pick the option that works best for you. You could even combine the two: put the analyzers that are used by all of your indices into the configuration file and specify additional analyzers for added flexibility when you create indices.

Regardless of the way you specify your custom analyzers, you’ll need to specify which field uses which analyzer in the mapping of your index, either by specifying the mapping when the index is created or by using the put-mapping API to specify it at a later time.
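For example, assigning an analyzer to a field on an existing index through the put-mapping API might look like the following sketch; the get-together index and group type come from this book’s running example, and myCustomAnalyzer is an analyzer you’d have defined yourself:

curl -XPUT 'localhost:9200/get-together/_mapping/group' -d '
{
  "group": {
    "properties": {
      "description": {
        "type": "string",
        "analyzer": "myCustomAnalyzer"
      }
    }
  }
}'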

5.2.1. Adding analyzers when an index is created

In chapter 3 you saw some of the settings specified when an index is created, notably the number of primary and replica shards for the index, which look something like the following listing.

Listing 5.1. Setting the number of primary and replica shards
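A request along these lines sets those values; the index name myindex and the exact shard counts here are illustrative:

curl -XPOST 'localhost:9200/myindex' -d '
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}'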

Adding a custom analyzer is done by specifying another map in the settings configuration under the index key. This key should specify the custom analyzer you want to use, and it can also contain the custom tokenizer, token filters, and char filters that the index can use. The next listing shows a custom analyzer that specifies custom parts for all the analysis steps. It’s a complex example, but don’t worry about all the code details yet because we’ll go through them later in this chapter.

Listing 5.2. Adding a custom analyzer during index creation
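The request looks something like the following sketch; the index name myindex is illustrative, and the analysis settings are the same ones shown in YAML form in section 5.2.2:

curl -XPOST 'localhost:9200/myindex' -d '
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "myCustomAnalyzer": {
            "type": "custom",
            "tokenizer": "myCustomTokenizer",
            "filter": ["myCustomFilter1", "myCustomFilter2"],
            "char_filter": ["myCustomCharFilter"]
          }
        },
        "tokenizer": {
          "myCustomTokenizer": {
            "type": "letter"
          }
        },
        "filter": {
          "myCustomFilter1": {
            "type": "lowercase"
          },
          "myCustomFilter2": {
            "type": "kstem"
          }
        },
        "char_filter": {
          "myCustomCharFilter": {
            "type": "mapping",
            "mappings": ["ph=>f", "u=>you"]
          }
        }
      }
    }
  }
}'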

The mappings have been left out of the code listing here because we’ll cover how to specify the analyzer for a field in section 5.2.3. In this example you create a custom analyzer called myCustomAnalyzer, which uses the custom tokenizer myCustomTokenizer, two custom filters named myCustomFilter1 and myCustomFilter2, and a custom character filter named myCustomCharFilter (notice a trend here?). Each of these separate analysis parts is given in its respective JSON submap. You can specify multiple analyzers with different names and combine them into custom analyzers to give you flexible analysis options when indexing and searching.

Now that you have a sense of what adding custom analyzers looks like when an index is created, let’s look at the same analyzers added to the Elasticsearch configuration itself.

5.2.2. Adding analyzers to the Elasticsearch configuration

In addition to specifying analyzers with settings during index creation, adding analyzers to the Elasticsearch configuration file is another supported way to specify custom analyzers. But there are tradeoffs to this method: if you specify the analyzers during index creation, you can change them without restarting Elasticsearch, whereas if you specify them in the Elasticsearch configuration, you’ll need to restart Elasticsearch to pick up any changes you make. On the flip side, you’ll have less data to send when creating indices. Although it’s generally easier to specify analyzers at index creation for a larger degree of flexibility, if you never plan to change your analyzers, you can go ahead and put them into the configuration file.

Specifying analyzers in the elasticsearch.yml configuration file is similar to specifying them as JSON; here are the same custom analyzers from the previous section but specified in the configuration YAML file:

index:
  analysis:
    analyzer:
      myCustomAnalyzer:
        type: custom
        tokenizer: myCustomTokenizer
        filter: [myCustomFilter1, myCustomFilter2]
        char_filter: myCustomCharFilter
    tokenizer:
      myCustomTokenizer:
        type: letter
    filter:
      myCustomFilter1:
        type: lowercase
      myCustomFilter2:
        type: kstem
    char_filter:
      myCustomCharFilter:
        type: mapping
        mappings: ["ph=>f", "u=>you"]

5.2.3. Specifying the analyzer for a field in the mapping

There’s one piece of the puzzle left to solve before you can analyze fields with custom analyzers: specifying, in the mapping, which analyzer a particular field should use. This is simple: set the analyzer option on the field’s mapping. For instance, you might specify the analyzer for a field called description as follows.
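Here’s a sketch, assuming the group mapping type from this book’s get-together example and the myCustomAnalyzer analyzer defined earlier:

{
  "mappings": {
    "group": {
      "properties": {
        "description": {
          "type": "string",
          "analyzer": "myCustomAnalyzer"
        }
      }
    }
  }
}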

If you want a particular field not to be analyzed at all, you need to specify the index setting with the value not_analyzed. This keeps the text as a single token, without any kind of modification (no lowercasing or anything else).
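A sketch, again assuming the group mapping type:

{
  "mappings": {
    "group": {
      "properties": {
        "name": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}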

A common pattern when you want to search on both the analyzed and the verbatim text of a field is to use multi-fields.

Using multi-field type to store differently analyzed text

Often it’s helpful to be able to search on both the analyzed version of a field as well as the original, nonanalyzed text. This is especially useful for things like aggregations or sorting on a string field.

Elasticsearch makes this simple to do by using multi-fields, which you first saw in chapter 3. Take the name field of groups in the get-together index, for example; you may want to be able to sort on the name field but search through it using analysis. You can specify a field that does both like this:
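{
  "mappings": {
    "group": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "standard",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

In this sketch the verbatim copy lives in a sub-field named raw (the name is an arbitrary choice): queries can search the analyzed name field, while sorting and aggregations can use name.raw.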

We’ve covered how to specify analyzers; now we’ll show you a neat way to check how any arbitrary text can be analyzed: the analyze API.