9.8. Routing

In chapter 8, we talked about how documents end up in a particular shard; this process is called routing the document. To refresh your memory, routing a document occurs when Elasticsearch hashes the ID of the document, either specified by you or generated by Elasticsearch, to determine which shard a document should be indexed into. Elasticsearch also allows you to manually specify the routing of a document when indexing, which is what you do when using parent-child relationships because the child document has to be in the same shard as the parent document.

Routing can also use a custom value for hashing instead of the ID of the document. When you specify the routing query parameter on the URL, that value is hashed and used instead of the ID:

curl -XPOST 'localhost:9200/get-together/group/9?routing=denver' -d'{
  "title": "Denver Knitting"
}'

In this example, denver is the value that’s hashed to determine which shard the document ends up in, instead of 9, the document’s ID. Routing can be useful for scaling strategies, which is why we talk about it in detail in this chapter.

9.8.1. Why use routing?

If you don’t use routing at all, Elasticsearch will ensure that your documents are distributed evenly across all the shards, so why would you want to use routing? Custom routing allows you to collect multiple documents sharing a routing value into a single shard, and once these documents are in the same shard, it allows you to route certain queries so that they are executed on only a subset of the shards for an index. Sound confusing? We’ll go over it in more detail to clarify what we mean.

9.8.2. Routing strategies

Routing is a strategy that takes effort in two areas: you’ll need to pick good routing values while you’re indexing documents, and you’ll need to reuse those values when you perform queries. With our get-together example, you first need to decide on a good way to separate the documents. In this case, use the city in which a get-together group or event takes place as the routing value. This is a good choice because the cities vary widely enough that you have quite a few values to pick from, and each event and group is already associated with a city, so it’s easy to extract that value from a document before indexing. If you were to pick something that had only a few different values, you could easily end up with unbalanced shards for the index. If there are only three possible routing values for all documents, all documents will be spread across at most three shards, no matter how many shards the index has. It’s important to pick a value with enough cardinality to spread data among the shards in an index.
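The effect of cardinality can be sketched with a toy shard calculation. This is only an illustration under stated assumptions: shard_for is a hypothetical helper, cksum stands in for the hash function Elasticsearch uses internally (so the shard numbers will not match a real cluster), and the five-shard index is made up for the example.

```shell
# Toy model of routing: shard = hash(routing value) mod number_of_shards.
# cksum (a CRC) is only a stand-in; Elasticsearch uses a different hash internally.
shard_for() {
  # $1 = routing value, $2 = number of primary shards
  crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $(( crc % $2 ))
}

# Three distinct routing values can occupy at most three of the five shards.
for city in denver boulder amsterdam; do
  echo "$city -> shard $(shard_for "$city" 5)"
done
```

The same routing value always maps to the same shard, which is what lets a query that reuses the value skip the other shards.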

Now that you’ve picked what you want to use for the routing value, you need to specify this routing value when indexing documents, as shown in the listing that follows.

Listing 9.10. Indexing documents with custom routing values
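A minimal sketch of such indexing requests, assuming the get-together index and group type from the earlier curl example. The helper names and the name field values are illustrative assumptions, and newer Elasticsearch releases additionally expect a Content-Type: application/json header.

```shell
# Hypothetical helper: index one group document with a custom routing value.
# $1 = document ID, $2 = routing value, $3 = group name (illustrative value;
# it is pasted into the JSON as-is, so it must not contain double quotes)
index_group() {
  curl -s -XPOST "localhost:9200/get-together/group/$1?routing=$2" \
    -d "{\"name\": \"$3\"}"
}

# Index three documents, each routed by its city instead of by its ID.
index_three_groups() {
  index_group 10 denver "Denver Ruby"
  index_group 11 boulder "Boulder Ruby"
  index_group 12 amsterdam "Amsterdam Ruby"
}
```

Running index_three_groups against a live cluster sends three requests; the shard for each document is chosen from the routing parameter rather than from the ID in the URL.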

In this example, you use three different routing values (denver, boulder, and amsterdam) for three different documents. This means that instead of hashing the IDs 10, 11, and 12 to determine which shard to put each document in, Elasticsearch hashes the routing values. On the index side, this doesn’t help you much; the real benefit comes when you also use routing on the query side, as the next listing shows. On the query side, you can combine multiple routing values by separating them with a comma.

Listing 9.11. Specifying routing when querying
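A sketch of a routed search, assuming the same get-together index and group type. The helper name and the match query on a name field are illustrative assumptions; the part that matters is the comma-separated routing parameter on the search URL.

```shell
# Hypothetical helper: search groups only on the shards that the given
# routing values hash to.
# $1 = comma-separated routing values, $2 = term to match (illustrative query)
search_groups_routed() {
  curl -s -XPOST "localhost:9200/get-together/group/_search?routing=$1" \
    -d "{\"query\": {\"match\": {\"name\": \"$2\"}}}"
}
```

For example, search_groups_routed denver,amsterdam ruby would run the query only on the shards that denver and amsterdam hash to, so a matching document routed with boulder is returned only if boulder happens to share one of those shards.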

Interesting! Instead of returning all three groups, only two were returned. So what actually happened? Internally, when Elasticsearch received the request, it hashed the two provided routing values, denver and amsterdam, and then executed the query only on the shards they hashed to. In this case denver and amsterdam both hash to the same shard, and boulder hashes to a different shard.

Extrapolate this to hundreds of thousands of groups in hundreds of cities: by specifying the routing for each group both while indexing and while querying, you limit the scope of where a search request is executed. This can be a great scaling improvement for an index that might have 100 shards; instead of running on all 100 shards, the query runs on only the few that the routing values hash to, so it finishes faster and has less impact on your Elasticsearch cluster.

In the previous example, denver and amsterdam happen to route to the same shard value, but they could have just as easily hashed to different shard values. How can you tell which shard a request will be executed on? Thankfully, Elasticsearch has an API that can show you the nodes and shards a search request will be performed on.

9.8.3. Using the _search_shards API to determine where a search is performed

Let’s take the prior example and use the search shards API to see which shards the request is going to be executed on, with and without the routing values, as shown in the following listing.

Listing 9.12. Using the _search_shards API with and without routing
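A sketch of the two requests, assuming the get-together index from the earlier examples; search_shards is a hypothetical helper name. The _search_shards endpoint accepts the same routing parameter as a search and reports the shards that search would touch.

```shell
# Hypothetical helper: list the shards a search on get-together would run on.
# $1 = optional comma-separated routing values; omit it to see every shard.
search_shards() {
  if [ -n "${1:-}" ]; then
    curl -s -XGET "localhost:9200/get-together/_search_shards?pretty&routing=$1"
  else
    curl -s -XGET "localhost:9200/get-together/_search_shards?pretty"
  fi
}
```

Calling search_shards with no argument lists all shards of the index, while search_shards denver lists only the shard that denver hashes to.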

You can see that even though there are two shards in the index, when the routing value denver is specified, only shard 1 is going to be searched. You’ve effectively cut the amount of data the search must execute on by half!

Routing can be useful when dealing with indices that have a large number of shards, but it’s definitely not required for regular usage of Elasticsearch. Think of it as a way to scale more efficiently in some cases, and be sure to experiment with it.

9.8.4. Configuring routing

It can also be useful to tell Elasticsearch that you want to use custom routing for all documents and that it should reject any document indexed without a custom routing value. You can configure this through the mapping of a type. For example, to create an index called routed-events and require routing for each event, you can use the code in the following listing.

Listing 9.13. Defining routing as required in a type’s mapping
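A minimal sketch of such a mapping, assuming a type named event in the routed-events index. The variable and helper names are hypothetical; the _routing setting with required set to true is the part that enforces routing.

```shell
# Index settings that mark routing as required for the (assumed) event type.
routed_events_settings='{
  "mappings": {
    "event": {
      "_routing": { "required": true }
    }
  }
}'

# Create the routed-events index with that mapping.
create_routed_events() {
  curl -s -XPOST 'localhost:9200/routed-events' -d "$routed_events_settings"
}
```

Once create_routed_events has run against a cluster, an indexing request to routed-events that omits the routing parameter should be rejected with an error instead of falling back to hashing the document ID.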

There’s one more way to use routing, and that’s by associating a routing value with an alias.

9.8.5. Combining routing with aliases

As you saw in the previous section, aliases are a powerful and flexible abstraction on top of indices. They can also be used with routing to automatically apply routing values when querying or when indexing, assuming the alias points to a single index. If you try to index into an alias that points to more than a single index, Elasticsearch will return an error because it doesn’t know which concrete index the document should be indexed into.

Reusing the previous example, you can create an alias called denver-events that automatically filters for events with “denver” in the name and adds “denver” to the routing when searching and indexing to limit where queries are executed, as shown in the next listing.

Listing 9.14. Combining routing with an alias
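A sketch of an alias that carries both a filter and a routing value. It assumes the alias points at the get-together index and that the filter is a term filter on a name field; the variable and helper names are hypothetical.

```shell
# Alias definition: keep only documents with "denver" in the name field and
# attach the routing value denver to every search and index request.
denver_alias_actions='{
  "actions": [
    {
      "add": {
        "index": "get-together",
        "alias": "denver-events",
        "filter": { "term": { "name": "denver" } },
        "routing": "denver"
      }
    }
  ]
}'

# Register the alias.
create_denver_alias() {
  curl -s -XPOST 'localhost:9200/_aliases' -d "$denver_alias_actions"
}

# Search through the alias; no routing parameter is needed on the URL.
search_denver_events() {
  curl -s -XPOST 'localhost:9200/denver-events/_search?pretty' \
    -d '{ "query": { "match_all": {} } }'
}
```

After create_denver_alias, a call to search_denver_events applies the filter and the denver routing automatically, so the request runs only on the shard that denver hashes to.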

You can also use the alias you just created for indexing. Indexing through the denver-events alias is the same as indexing documents with the routing=denver query string parameter. Because aliases are lightweight, you can create as many as you need when using custom routing in order to scale out better.