5.6. Stemming

Stemming is the act of reducing a word to its base or root form. This is extremely handy when searching because it lets you match the plural of a word as well as other words sharing its root, or stem (hence the name stemming). Let’s look at a concrete example. If the word is “administrations,” the root of the word is “administr.” This allows you to match other words that share this root, like “administrator,” “administration,” and “administrate.” Stemming is a powerful way of making your searches more flexible than rigid exact matching.
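To make the idea concrete, here is a toy suffix stripper in shell. It is only a sketch: it is not one of the stemmers Elasticsearch ships, the suffix list is hand-picked for this one word family, and toy_stem is a hypothetical name.

```shell
# Toy illustration only: strip one of a few hand-picked suffixes from the end of a word.
# Real stemmers apply far more rules than this.
toy_stem() {
  printf '%s\n' "$1" | sed -E 's/(ations|ation|ators|ator|ate)$//'
}

# Every word in this family collapses to the same stem, "administr".
for word in administrations administrator administration administrate; do
  printf '%s -> %s\n' "$word" "$(toy_stem "$word")"
done
```

Because all four words reduce to the same string, a search for any one of them can match documents containing the others.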

5.6.1. Algorithmic stemming

Algorithmic stemming is applied by using a formula or set of rules for each token in order to stem it. Elasticsearch currently offers three different algorithmic stemmers: the snowball filter, the porter stem filter, and the kstem filter. They behave in almost the same way but have some slight differences in how aggressive they are with regard to stemming. By aggressive we mean that the more aggressive stemmers chop off more of the word than the less aggressive stemmers. Table 5.1 shows a comparison of the different algorithmic stemmers.

Table 5.1. Comparing stemming of snowball, porter stem, and kstem

stemmer       administrations   administrators   Administrate
snowball      administr         administr        Administer
porter_stem   administr         administr        Administer
kstem         administration    administrator    Administrate

To see how a stemmer stems a word, you can specify it as a token filter with the analyze API:

curl -XPOST 'localhost:9200/_analyze?tokenizer=standard&filters=kstem' -d 'administrators'

Use either snowball, porter_stem, or kstem for the filter to test it out.
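To compare the three stemmers side by side, the same request can be run in a loop. This is a sketch that reuses the query-string form of the analyze API shown above and assumes a node is listening on localhost:9200; analyze_url is a hypothetical helper, not part of Elasticsearch.

```shell
# Hypothetical helper: builds the analyze URL from the text for a given token filter.
analyze_url() {
  printf 'localhost:9200/_analyze?tokenizer=standard&filters=%s' "$1"
}

for stemmer in snowball porter_stem kstem; do
  echo "== $stemmer"
  # Send the same three words through each stemmer.
  # If no node is reachable, print a message instead of failing.
  curl -s --max-time 5 -XPOST "$(analyze_url "$stemmer")" \
    -d 'administrations administrators administrate' \
    || echo "could not reach Elasticsearch on localhost:9200"
  echo
done
```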

As an alternative to algorithmic stemming, you can stem using a dictionary, which is a one-to-one mapping of the original word to its stem.

5.6.2. Stemming with dictionaries

Sometimes algorithmic stemmers can stem words in a strange way because they don’t know anything about the underlying language. A more accurate alternative is to stem words using a dictionary. In Elasticsearch you can use the hunspell token filter, combined with a dictionary, to handle the stemming. As a result, the quality of the stemming is directly related to the quality of the dictionary that you use: the stemmer will only be able to stem words it has in the dictionary.

When creating a hunspell analyzer, the dictionary files should be in a directory called hunspell in the same directory as elasticsearch.yml. Inside the hunspell directory, the dictionary for each language should be in a folder named after its associated locale. Here’s how to create an index with a hunspell analyzer:

% curl -XPOST 'localhost:9200/hspell' -d '{
  "analysis" : {
    "analyzer" : {
      "hunAnalyzer" : {
        "tokenizer" : "standard",
        "filter" : [ "lowercase", "hunFilter" ]
      }
    },
    "filter" : {
      "hunFilter" : {
        "type" : "hunspell",
        "locale" : "en_US",
        "dedup" : true
      }
    }
  }
}'

The hunspell dictionary files should be inside the hunspell/en_US directory under your Elasticsearch configuration directory. The en_US folder is used because this hunspell analyzer is for the English language and corresponds to the locale setting in the previous example. You can also change where Elasticsearch looks for hunspell dictionaries by setting the indices.analysis.hunspell.dictionary.location setting in elasticsearch.yml. To test that your analyzer is working correctly, you can use the analyze API again:

% curl -XPOST 'localhost:9200/hspell/_analyze?analyzer=hunAnalyzer' -d 'administrations'
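If the analyzer fails to load, it can help to confirm that the dictionary files are where the text says they should be. The following is a minimal sketch: check_hunspell_locale is a hypothetical helper, and the ES_CONF_DIR variable and its /etc/elasticsearch default are assumptions to adjust for your installation. It relies only on the fact that a hunspell dictionary consists of an affix (.aff) file and a word list (.dic) file.

```shell
# Hypothetical helper: checks <config dir>/hunspell/<locale> for .aff and .dic files.
# Returns 0 when both kinds of file are present, 1 otherwise.
check_hunspell_locale() {
  dict_dir="$1/hunspell/$2"
  missing=0
  for ext in aff dic; do
    if ls "$dict_dir"/*."$ext" >/dev/null 2>&1; then
      echo "found a .$ext file in $dict_dir"
    else
      echo "no .$ext file in $dict_dir"
      missing=1
    fi
  done
  return "$missing"
}

# ES_CONF_DIR is an assumed variable; the default path is only a guess.
check_hunspell_locale "${ES_CONF_DIR:-/etc/elasticsearch}" en_US \
  || echo "hunspell dictionary for en_US looks incomplete"
```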

5.6.3. Overriding the stemming from a token filter

Sometimes you may not want to have words stemmed because either the stemmer treats them incorrectly or you want to do exact matches on a particular word. You can accomplish this by placing a keyword marker token filter before the stemming filter in the chain of token filters. In this keyword marker token filter, you can specify either a list of words or a file with a list of words that shouldn’t be stemmed.
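As a sketch of what this can look like, the index settings below place a keyword marker filter ahead of the kstem filter. The index name (kwmarker), analyzer name (protectedAnalyzer), filter name (protectWords), and the protected word are all hypothetical, and the example assumes a node listening on localhost:9200.

```shell
# Hypothetical names throughout; "administrations" is just a sample protected word.
KW_BODY='{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "protectedAnalyzer" : {
          "tokenizer" : "standard",
          "filter" : [ "lowercase", "protectWords", "kstem" ]
        }
      },
      "filter" : {
        "protectWords" : {
          "type" : "keyword_marker",
          "keywords" : [ "administrations" ]
        }
      }
    }
  }
}'

# Create the index; print a message if no node is reachable.
curl -s --max-time 5 -XPOST 'localhost:9200/kwmarker' -d "$KW_BODY" \
  || echo "could not reach Elasticsearch on localhost:9200"
```

To read the protected words from a file instead of listing them inline, the keywords setting can be swapped for keywords_path.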

Besides preventing a word from being stemmed, you may want to manually specify a list of rules to be used for stemming words. You can achieve this with the stemmer override token filter, which allows you to specify rules like cats => cat to be applied. If the stemmer override finds a rule and applies it to a word, that word can’t be stemmed by any other stemmer.
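A stemmer override could be configured along these lines. This is a sketch under assumptions: the index name (stemoverride), analyzer name (overrideAnalyzer), and filter name (customStems) are hypothetical, and a node is assumed to be listening on localhost:9200. The rule itself is the cats => cat example from the text.

```shell
# Hypothetical names throughout; the single rule maps "cats" to "cat".
OVERRIDE_BODY='{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "overrideAnalyzer" : {
          "tokenizer" : "standard",
          "filter" : [ "lowercase", "customStems", "porter_stem" ]
        }
      },
      "filter" : {
        "customStems" : {
          "type" : "stemmer_override",
          "rules" : [ "cats => cat" ]
        }
      }
    }
  }
}'

# Create the index; print a message if no node is reachable.
curl -s --max-time 5 -XPOST 'localhost:9200/stemoverride' -d "$OVERRIDE_BODY" \
  || echo "could not reach Elasticsearch on localhost:9200"
```

The override filter sits before porter_stem in the filter list, so a word rewritten by a rule is left alone by the stemmer that follows.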

Keep in mind that both of these token filters must be placed before any other stemming filters because they’ll protect the term from having stemming applied by any other token filters later in the chain.