5.3. Analyzing text with the analyze API

Using the analyze API to test the analysis process can be extremely helpful when tracking down how information is being stored in your Elasticsearch indices. This API allows you to send any text to Elasticsearch, specifying what analyzer, tokenizer, or token filters to use, and get back the analyzed tokens. The following listing shows an example of what the analyze API looks like, using the standard analyzer to analyze the text “share your experience with NoSql & big data technologies.”

Listing 5.3. Example of using the analyze API
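As a rough sketch, assuming a node running on localhost:9200, the request and the beginning of its response look something like this (only the first of the eight tokens is shown; offsets count characters from the start of the text):

% curl -XPOST 'localhost:9200/_analyze?analyzer=standard' \
  -d 'share your experience with NoSql & big data technologies'
{
  "tokens" : [ {
    "token" : "share",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}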

The most important output from the analysis API is the token key. The output is a list of these maps, which gives you a representation of what the processed tokens (the ones that are going to actually be written to the index) look like. For example, with the text “share your experience with NoSql & big data technologies,” you get back eight tokens: share, your, experience, with, nosql, big, data, and technologies. Notice that in this case, with the standard analyzer, each token was lowercased and the punctuation at the end of the sentence was removed. This is a great way to test documents to see how Elasticsearch will analyze them, and it has quite a few ways to customize the analysis that’s performed on the text.

5.3.1. Selecting an analyzer

If you already have an analyzer in mind and want to see how it handles some text, you can set the analyzer parameter to the name of the analyzer. We’ll go over the different built-in analyzers in the next section, so keep this in mind if you want to try out any of them!

If you configured an analyzer in your elasticsearch.yml file, you can also reference it by name in the analyzer parameter. Additionally, if you’ve created an index with a custom analyzer similar to the example in listing 5.2, you can still use this analyzer by name, but instead of using the HTTP endpoint of /_analyze on its own, you’ll need to specify the index first. An example using the index named get-together and an analyzer called myCustomAnalyzer is shown here:

% curl -XPOST 'localhost:9200/get-together/_analyze?analyzer=myCustomAnalyzer' \
  -d 'share your experience with NoSql & big data technologies'

5.3.2. Combining parts to create an impromptu analyzer

Sometimes you may not want to use a built-in analyzer but instead try out a combination of tokenizers and token filters—for instance, to see how a particular tokenizer breaks up a sentence without any other analysis. With the analysis API you can specify a tokenizer and a list of token filters to be used for analyzing the text. For example, if you wanted to use the whitespace tokenizer (to split the text on spaces) and then use the lowercase and reverse token filters, you could do so as follows:

% curl -XPOST 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,reverse' \
  -d 'share your experience with NoSql & big data technologies'

You’d get back the following tokens: erahs, ruoy, ecneirepxe, htiw, lqson, &, gib, atad, seigolonhcet

The whitespace tokenizer first split the sentence “share your experience with NoSql & big data technologies” into the tokens share, your, experience, with, NoSql, &, big, data, and technologies. Then the lowercase token filter lowercased each token, and finally the reverse token filter reversed each one to produce the terms shown.

5.3.3. Analyzing based on a field’s mapping

Once you start creating mappings for an index, another helpful feature of the analyze API is that Elasticsearch lets you analyze text based on a field whose mapping has already been created. If you create a mapping with a field description that looks like this snippet

... other mappings ...
"description": {
  "type": "string",
  "analyzer": "myCustomAnalyzer"
}

you can then use the analyzer associated with the field by specifying the field parameter with the request:

% curl -XPOST 'localhost:9200/get-together/_analyze?field=description' \
  -d 'share your experience with NoSql & big data technologies'

The custom analyzer will automatically be used because it’s the analyzer associated with the description field. Keep in mind that in order to use this, you’ll need to specify an index, because Elasticsearch needs to be able to get the mappings for a particular field from an index.

Now that we’ve covered how to test out different analyzers using cURL, we’ll jump into all the different analyzers that Elasticsearch provides for you out of the box. Keep in mind that you can always create your own analyzer by combining the different parts (tokenizers and token filters).

5.3.4. Learning about indexed terms using the terms vectors API

When you’re deciding on the right analyzer, the _analyze endpoint from the previous section is a fine tool. But if you want to learn more about the terms stored in a certain document, there’s a more effective way than going over all the separate fields one by one: you can use the _termvector endpoint. It returns information about all the terms in a document: how often each term occurs in the document and in the index, and where in the document it occurs.

The basic usage of the _termvector endpoint looks like this:
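As a minimal sketch, assuming the get-together index from the earlier examples contains a group document with ID 1:

% curl 'localhost:9200/get-together/group/1/_termvector?pretty=true'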

There are some things you can configure, one of them being term statistics; be aware that computing these is a heavy operation. The following command shows how to change the request so that you ask for term statistics as well and name the fields you want statistics for:

% curl 'localhost:9200/get-together/group/1/_termvector?pretty=true' -d '{
  "fields" : ["description","tags"],
  "term_statistics" : true
}'

Here’s part of what the response looks like; only one term is shown:
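The field names below follow the term vectors API, but the numbers are purely illustrative:

{
  "_index" : "get-together",
  "_type" : "group",
  "_id" : "1",
  "found" : true,
  "term_vectors" : {
    "description" : {
      "field_statistics" : {
        "sum_doc_freq" : 197,
        "doc_count" : 12,
        "sum_ttf" : 209
      },
      "terms" : {
        "about" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "tokens" : [ {
            "position" : 16,
            "start_offset" : 90,
            "end_offset" : 95
          } ]
        }
      }
    }
  }
}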

By now you’ve learned a lot about what analyzers do and how you can explore the outcome of analyzers. You’ll keep using the _analyze and _termvector APIs when exploring built-in analyzers in the next section.

5.4. Analyzers, tokenizers, and token filters, oh my!

In this section we’ll discuss the built-in analyzers, tokenizers, and token filters that Elasticsearch provides. Elasticsearch provides a large number of them, covering lowercasing, stemming, language-specific analysis, synonyms, and so on, so you have a lot of flexibility to combine them in different ways to get your desired tokens.

5.4.1. Built-in analyzers

This section provides a rundown of the analyzers that Elasticsearch comes with out of the box. Remember that an analyzer consists of an optional character filter, a single tokenizer, and zero or more token filters. Figure 5.2 is a visualization of an analyzer.

We’ll be referencing tokenizers and token filters, which we’ll cover in more detail in the following sections. With each analyzer, we’ll include an example of some text that demonstrates what analysis using that analyzer looks like.

Standard

The standard analyzer is the default analyzer for text when no analyzer is specified. It provides sensible defaults for most European languages by combining the standard tokenizer, the standard token filter, the lowercase token filter, and the stop token filter. There isn’t much to say about the standard analyzer. We’ll talk about what the standard tokenizer and standard token filter do in sections 5.4.2 and 5.4.3; just keep in mind that if you don’t specify an analyzer for a field, the standard analyzer will be used.

Simple

The simple analyzer is just that—simple! It uses the lowercase tokenizer, which means tokens are split at nonletters and automatically lowercased. This analyzer doesn’t work well for Asian languages that don’t separate words with whitespace, though, so use it only for European languages.
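As a quick sketch against a local node, the example sentence from earlier would come out like this:

% curl -XPOST 'localhost:9200/_analyze?analyzer=simple' \
  -d 'share your experience with NoSql & big data technologies'
The tokens are share, your, experience, with, nosql, big, data, and technologies.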

Whitespace

The whitespace analyzer does nothing but split text into tokens around whitespace—very simple!
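A quick sketch against a local node; note that nothing is lowercased and the & is kept:

% curl -XPOST 'localhost:9200/_analyze?analyzer=whitespace' \
  -d 'share your experience with NoSql & big data technologies'
The tokens are share, your, experience, with, NoSql, &, big, data, and technologies.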

Stop

The stop analyzer behaves like the simple analyzer but additionally filters out stopwords from the token stream.
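A quick sketch against a local node; note that the stopword with disappears:

% curl -XPOST 'localhost:9200/_analyze?analyzer=stop' \
  -d 'share your experience with NoSql & big data technologies'
The tokens are share, your, experience, nosql, big, data, and technologies.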

Keyword

The keyword analyzer takes the entire field and generates a single token on it. Keep in mind that rather than using the keyword tokenizer in your mappings, it’s better to set the index setting to not_analyzed.
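A quick sketch against a local node:

% curl -XPOST 'localhost:9200/_analyze?analyzer=keyword' \
  -d 'share your experience with NoSql & big data technologies'
The single token is the entire text: share your experience with NoSql & big data technologies.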

Pattern

The pattern analyzer allows you to specify a pattern on which text is split into tokens. But because the pattern would have to be specified in a custom configuration anyway, it often makes more sense to define a custom analyzer that combines the existing pattern tokenizer with any needed token filters.

Language and multilingual

Elasticsearch supports a wide variety of language-specific analyzers out of the box. There are analyzers for arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, irish, hindi, hungarian, indonesian, italian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, and thai. You can specify the language-specific analyzer by using one of those names, but make sure you use the lowercase name! If you want to analyze a language not included in this list, there may be a plugin for it as well.

Snowball

The snowball analyzer uses the standard tokenizer and token filter (like the standard analyzer), with the lowercase token filter and the stop filter; it also stems the text using the snowball stemmer. Don’t worry if you aren’t sure what stemming is; we’ll discuss it in more detail near the end of this chapter.
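As a rough sketch against a local node (the exact stemmed forms come from the snowball stemmer, so treat them as approximate):

% curl -XPOST 'localhost:9200/_analyze?analyzer=snowball' \
  -d 'share your experience with NoSql & big data technologies'
You’d get back tokens along the lines of share, your, experi, nosql, big, data, and technolog, with the stopword with removed.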

Before you can fully comprehend these analyzers, you need to understand the parts that make up an analyzer, so we’ll now discuss the tokenizers that Elasticsearch supports.

5.4.2. Tokenization

As you may recall from earlier in the chapter, tokenization is taking a string of text and breaking it into smaller chunks called tokens. Just as Elasticsearch includes analyzers out of the box, it also includes a number of built-in tokenizers.

Standard tokenizer

The standard tokenizer is a grammar-based tokenizer that’s good for most European languages; it also handles segmenting Unicode text, with a default maximum token length of 255. It removes punctuation like commas and periods as well:

% curl -XPOST 'localhost:9200/_analyze?tokenizer=standard' -d 'I have, potatoes.'

The tokens are I, have, and potatoes.

Keyword

Keyword is a simple tokenizer that takes the entire text and provides it as a single token to the token filters. This can be useful when you only want to apply token filters without doing any kind of tokenization:

% curl -XPOST 'localhost:9200/_analyze?tokenizer=keyword' -d 'Hi, there.'
The single token is the entire text: “Hi, there.”

Letter

The letter tokenizer takes the text and divides it into tokens at things that are not letters. For example, with the sentence “Hi, there.” the tokens would be Hi and there because the comma, space, and period are all nonletters:

% curl -XPOST 'localhost:9200/_analyze?tokenizer=letter' -d 'Hi, there.'
The tokens are Hi and there.

Lowercase

The lowercase tokenizer combines both the regular letter tokenizer’s action as well as the action of the lowercase token filter (which, as you can imagine, lowercases the entire token). The main reason to do this with a single tokenizer is that you gain better performance by doing both at once:

% curl -XPOST 'localhost:9200/_analyze?tokenizer=lowercase' -d 'Hi, there.'
The tokens are hi and there.

Whitespace

The whitespace tokenizer separates tokens by whitespace: space, tab, line break, and so on. Note that this tokenizer doesn’t remove any kind of punctuation, so tokenizing the text “Hi, there.” results in two tokens with the punctuation still attached:

% curl -XPOST 'localhost:9200/_analyze?tokenizer=whitespace' -d 'Hi, there.'
The tokens are “Hi,” and “there.”

Pattern

The pattern tokenizer allows you to specify an arbitrary pattern on which text should be split into tokens. The pattern you specify should match the characters that separate tokens; for example, if you wanted to break tokens wherever the text .-. occurs, you could create a custom tokenizer like this:

% curl -XPOST 'localhost:9200/pattern' -d '{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "pattern1": {
            "type": "pattern",
            "pattern": "\\.-\\."
          }
        }
      }
    }
  }
}'

% curl -XPOST 'localhost:9200/pattern/_analyze?tokenizer=pattern1' \
  -d 'breaking.-.some.-.text'

The tokens are breaking, some, and text.

UAX URL email

The standard tokenizer is pretty good at figuring out English words, but these days there’s quite a bit of text that ends up containing website addresses and email addresses. The standard analyzer breaks these apart in places where you may not intend; for example, if you take the example email address john.smith@example.com and analyze it with the standard tokenizer, it gets split into multiple tokens:

% curl -XPOST 'localhost:9200/_analyze?tokenizer=standard' \
  -d 'john.smith@example.com'

The tokens are john.smith and example.com.

Here you see it’s been split into the john.smith part and the example.com part. It also splits URLs into separate parts:

% curl -XPOST 'localhost:9200/_analyze?tokenizer=standard' \
  -d 'http://example.com?q=foo'

The tokens are http, example.com, q, and foo.

The UAX URL email tokenizer will preserve both emails and URLs as single tokens:
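A sketch against a local node (the tokenizer is registered under the name uax_url_email); the response is shown so you can see the token types:

% curl -XPOST 'localhost:9200/_analyze?tokenizer=uax_url_email' \
  -d 'john.smith@example.com http://example.com?q=foo'
{
  "tokens" : [ {
    "token" : "john.smith@example.com",
    "start_offset" : 0,
    "end_offset" : 22,
    "type" : "<EMAIL>",
    "position" : 1
  }, {
    "token" : "http://example.com?q=foo",
    "start_offset" : 23,
    "end_offset" : 47,
    "type" : "<URL>",
    "position" : 2
  } ]
}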

This can be extremely helpful when you want to search for exact URLs or email addresses in a text field. In this case the response is included so you can see that the token types are also set to <EMAIL> and <URL>.

Path hierarchy

The path hierarchy tokenizer allows you to index filesystem paths in a way where searching for files sharing the same path will return results. For example, let’s assume you have a filename you want to index that looks like /usr/local/var/log/elasticsearch.log. Here’s what the path hierarchy tokenizer tokenizes this into:

% curl 'localhost:9200/_analyze?tokenizer=path_hierarchy' \
  -d '/usr/local/var/log/elasticsearch.log'

The tokens are /usr, /usr/local, /usr/local/var, /usr/local/var/log, and /usr/local/var/log/elasticsearch.log.

This means a user querying for a file sharing the same path hierarchy (hence the name!) as this file will find a match. The query “/usr/local/var/log/es.log” still shares most of its tokens with “/usr/local/var/log/elasticsearch.log,” so the document can still be returned as a result.

Now that we’ve touched on the different ways of splitting a block of text into different tokens, let’s talk about what you can do with each of those tokens.

5.4.3. Token filters

There are a lot of token filters included in Elasticsearch; we’ll cover only the most popular ones in this section because enumerating all of them would make this section much too verbose. Like figure 5.1, figure 5.3 provides an example of three token filters: the lowercase filter, the stopword filter, and the synonym filter.

Figure 5.3. Token filters accept tokens from the tokenizer and prepare the data for indexing.

Standard

Don’t be fooled into thinking that the standard token filter performs complex calculations; it actually does nothing at all! In older versions of Lucene it removed the “’s” characters from the ends of words, as well as some extraneous period characters, but these are now handled by other token filters and tokenizers.

Lowercase

The lowercase token filter does just that: it lowercases any token that gets passed through it. This should be simple enough to understand:

% curl 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase' -d 'HI THERE!'
The token is hi there!.

Length

The length token filter removes words that fall outside a boundary for the minimum and maximum length of the token. For example, if you set the min setting to 2 and the max setting to 8, any token shorter than two characters will be removed and any token longer than eight characters will be removed:

% curl -XPUT 'localhost:9200/length' -d '{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my-length-filter": {
            "type": "length",
            "max": 8,
            "min": 2
          }
        }
      }
    }
  }
}'

Now you have the index with the configured custom filter called my-length-filter. In the next request you use this filter to filter out all tokens shorter than two characters or longer than eight characters:

% curl 'localhost:9200/length/_analyze?tokenizer=standard&filters=my-length-filter&pretty=true' \
  -d 'a small word and a longerword'
The tokens are small, word, and and.

Stop

The stop token filter removes stopwords from the token stream. For English this means all tokens that appear in the default stopword list are removed entirely. You can also specify your own list of words to be removed for this filter.

What are the stopwords? Here’s the default list of stopwords for the English language:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

To specify the list of stopwords, you can create a custom token filter with a list of words like this:

% curl -XPOST 'localhost:9200/stopwords' -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "stop1": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["my-stop-filter"]
          }
        },
        "filter": {
          "my-stop-filter": {
            "type": "stop",
            "stopwords": ["the", "a", "an"]
          }
        }
      }
    }
  }
}'

You can also read the list of stopwords from a file, using either a path relative to the configuration location or an absolute path. Each word should be on its own line, and the file must be UTF-8 encoded. You’d use the following to configure the stop filter with a file:

% curl -XPOST 'localhost:9200/stopwords' -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "stop1": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["my-stop-filter"]
          }
        },
        "filter": {
          "my-stop-filter": {
            "type": "stop",
            "stopwords_path": "config/stopwords.txt"
          }
        }
      }
    }
  }
}'

A final option is to use a predefined stopword list for a language. In that case the value for stopwords could be “_dutch_”, or any of the other predefined language lists.

Truncate, trim, and limit token count

The next three token filters deal with limiting the token stream in some way:

The truncate token filter allows you to truncate tokens over a certain length by setting the length parameter in a custom configuration; by default it truncates to 10 characters.

The trim token filter removes all of the whitespace around a token; for example, the token " foo" will be transformed into the token foo.

The limit token count token filter limits the maximum number of tokens that a particular field can contain. For example, if you create a customized token count filter with a limit of 8, only the first eight tokens from the stream will be indexed. This is set using the max_token_count parameter, which defaults to 1 (only a single token will be indexed).
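As a quick sketch of the trim filter, for example, you can pair it with the keyword tokenizer on a local node so the surrounding whitespace survives tokenization and reaches the filter:

% curl 'localhost:9200/_analyze?tokenizer=keyword&filters=trim' -d ' foo '
The token is foo.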

Reverse

The reverse token filter allows you to take a stream of tokens and reverse each one. This is particularly useful if you’re using the edge ngram filter or want to do leading wildcard searches. Instead of doing a leading wildcard search for “*bar,” which is very slow for Lucene, you can search using “rab*” on a field that has been reversed, resulting in a much faster query. The following listing shows an example of reversing a stream of tokens.

Listing 5.4. Example of the reverse token filter
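A minimal sketch of such a request, assuming a local node and using the standard tokenizer together with the reverse token filter:

% curl 'localhost:9200/_analyze?tokenizer=standard&filters=reverse' \
  -d 'Reverse token filter'
The tokens are esreveR, nekot, and retlif.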

You can see that each token has been reversed, but the order of the tokens has been preserved.

Unique

The unique token filter keeps only unique tokens; it keeps the metadata of the first token that matches, removing all future occurrences of it:

% curl 'localhost:9200/_analyze?tokenizer=standard&filters=unique' \
  -d 'foo bar foo bar baz'
{
  "tokens" : [ {
    "token" : "foo",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "bar",
    "start_offset" : 4,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "baz",
    "start_offset" : 16,
    "end_offset" : 19,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

Ascii folding

The ascii folding token filter converts Unicode characters that aren’t part of the regular ASCII character set into the ASCII equivalent, if one exists for the character. For example, you can convert the Unicode “ü” into an ASCII “u” as shown here:

% curl 'localhost:9200/_analyze?tokenizer=standard&filters=asciifolding' -d 'ünicode'
{
  "tokens" : [ {
    "token" : "unicode",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}

Synonym

The synonym token filter replaces words in the token stream with their synonyms, at the same offsets as the original tokens. For example, let’s take the text “I own that automobile” and the synonym for “automobile,” “car.” Without the synonym token filter you’d produce the following tokens:

% curl 'localhost:9200/_analyze?analyzer=standard' -d 'I own that automobile'
{
  "tokens" : [ {
    "token" : "i",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "own",
    "start_offset" : 2,
    "end_offset" : 5,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "that",
    "start_offset" : 6,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "automobile",
    "start_offset" : 11,
    "end_offset" : 21,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}

You can define a custom analyzer that specifies a synonym for “automobile” like this:

% curl -XPOST 'localhost:9200/syn-test' -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonyms": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["my-synonym-filter"]
          }
        },
        "filter": {
          "my-synonym-filter": {
            "type": "synonym",
            "expand": true,
            "synonyms": ["automobile=>car"]
          }
        }
      }
    }
  }
}'

When you use it, you can see that the automobile token has been replaced by the car token in the results:
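As a sketch, analyzing the same text with the analyzer defined on the syn-test index would look like this:

% curl 'localhost:9200/syn-test/_analyze?analyzer=synonyms' \
  -d 'I own that automobile'
The tokens are i, own, that, and car, with car keeping the offsets of the original automobile token (start_offset 11, end_offset 21).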

In the example you configured the synonym filter to replace the token, but it’s also possible to add the synonym as an additional token instead of replacing the original. In that case you’d replace automobile=>car with automobile,car.