E.2. Performance tips

For different percolator use cases, there are different things you can do to improve performance. In this section, we’ll look at the most important techniques and divide them into two categories:

Optimizations to the format of the request or the reply— You can percolate existing documents, percolate multiple documents in one request, and ask for only the number of matching queries, instead of the whole list of IDs.

Optimizations to the way you organize queries— As we mentioned earlier, you can use one or more separate indices to store registered queries. Here, you’ll apply this advice, and we’ll also look at how you can use routing and filtering to reduce the number of queries being run for each percolation.

E.2.1. Options for requests and replies

In some use cases, you can get away with fewer requests or less data going through the network. Here, we’ll look at three ways to achieve this:

Percolating existing documents

Using multi percolate, which is the bulk API of percolation

Counting the number of matching queries instead of getting the full list

Percolating existing documents

This works well if what you percolate is what you index, especially if documents are big. For example, if you index blogs, it might be slow to send every post twice over HTTP: once for indexing and once for alerting subscribers of posts matching their interests. In such cases, indexing a document and then percolating it by ID, instead of submitting it again, makes sense.

Note

Percolating existing documents doesn’t work well for all use cases. For example, if social media posts have a geo point field, you can register geo queries matching each country’s area. This way, you can percolate each post to determine its country of origin and add this information to the post before indexing it. In such use cases, you need to percolate and then index; it doesn’t make sense to do it the other way around. The use case to determine the country of origin is described in the following blog post by Elastic: www.elastic.co/blog/using-percolator-geo-tagging/.

In the next listing, you’ll register a query for groups matching elasticsearch. Then you’ll percolate the group with ID 2 (Elasticsearch Denver), which is already indexed, instead of sending its content all over again.

Listing E.2. Percolating an existing group document
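A minimal sketch of the two steps against an Elasticsearch 1.x cluster; the get-together index, the group type, and its name field are assumptions carried over from the book's running example:

```shell
# Register a percolation query matching groups about elasticsearch
# (assumes the get-together index exists with a "name" field mapped)
curl -XPUT 'localhost:9200/get-together/.percolator/1' -d '{
  "query": {
    "match": {
      "name": "elasticsearch"
    }
  }
}'
# Percolate the already-indexed group with ID 2 by putting its ID in
# the URL, instead of sending the document body over the wire again
curl 'localhost:9200/get-together/group/2/_percolate?pretty'
```

The reply has the same shape as a regular percolation: a matches array with the IDs of the registered queries that match the stored document.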

Multi percolate API

Whether you percolate existing documents or not, you can do multiple percolations at once. This works well if you also index in bulks. For example, you can use the percolator for some automated tagging of blog posts by having one query for each tag. When a batch of posts arrives, you can do as shown in figure E.3:

Figure E.3. Percolator for automated tagging. The multi percolate and bulk APIs reduce the number of requests. Before step 1, the percolation queries have been indexed. In step 1 you use the multi percolate API to find matching percolation queries. The application maps the IDs to the tags and adds them to the documents to index. In step 2 you use the bulk index API to index the documents.

  1. Percolate them all at once through the multi percolate API. Then, in your application, append matching tags. Be aware that the percolate API returns only the IDs of the matching queries, so your application has to map those IDs to tags: 1 to elasticsearch, 2 to release, and 3 to book. Another approach would be to give each percolation query an ID equal to its tag.
  2. Finally, index all posts at once through the bulk API we introduced in chapter 10.

Be aware that sending the document twice, once for percolation and once for indexing, implies more network traffic. The alternative would be to index the documents first, percolate them by ID, and then add the tags to the indexed documents through the bulk API's update action. The advantage of sending the document twice is that you avoid those updates, each of which reindexes the document internally.

In the following listing you’ll apply what’s described in figure E.3.

Listing E.3. Using the multi percolate and bulk APIs for automated tagging
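Along the lines of figure E.3, the two steps might look like the following sketch; the blog index, the posts type, and the tag assignments are assumptions for illustration:

```shell
# Step 1: find the matching tag queries for both posts in one request.
# With index and type in the URL, the per-request header can stay empty.
curl 'localhost:9200/blog/posts/_mpercolate?pretty' --data-binary '
{"percolate": {}}
{"doc": {"title": "New Elasticsearch Release"}}
{"percolate": {}}
{"doc": {"title": "New Elasticsearch Book"}}
'
# Step 2: after the application maps query IDs to tags and adds them
# to the documents, index both posts at once through the bulk API
curl 'localhost:9200/blog/posts/_bulk?pretty' --data-binary '
{"index": {"_id": "1"}}
{"title": "New Elasticsearch Release", "tags": ["elasticsearch", "release"]}
{"index": {"_id": "2"}}
{"title": "New Elasticsearch Book", "tags": ["elasticsearch", "book"]}
'
```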

Note how similar the multi percolate API is to the bulk API:

Every request takes two lines in the request body.

The first line shows the operation (percolate) and identification information (index, type, and for existing documents, the ID). Note that the bulk API uses underscores like _index and _type, but multi percolate doesn’t (index and type).

The second line contains the payload: the document to percolate goes there under the doc field. When you're percolating existing documents, this JSON object stays empty, because the ID on the first line identifies the document.

Finally, the body of the request is sent to the _mpercolate endpoint. As with the bulk API, this endpoint can contain the index and the type name, which can later be omitted from the body.

Getting only the number of matching queries

Besides the percolate action, the multi percolate API supports a count action, which will return the same reply as before with the total number of matching queries for each document, but you won’t get the matches array:

echo '{"count" : {"index" : "blog", "type" : "posts"}}
{"doc": {"title": "New Elasticsearch Release"}}
{"count" : {"index" : "blog", "type" : "posts"}}
{"doc": {"title": "New Elasticsearch Book"}}
' > percolate_requests
curl 'localhost:9200/_mpercolate?pretty' --data-binary @percolate_requests

Using count doesn’t make sense for the tagging use case, because you need to know which queries match, but this might not be the case everywhere. Let’s say you have an online shop and you want to add a new item. If you collect user queries and register them for percolation, you can percolate new items against those queries to predict how many users will find them while searching.

In the get-together site example, you could get an idea of how many attendees to expect for an event before submitting it—assuming you can get each user’s availability and register time ranges as queries.
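One way this could be sketched, assuming the attendance-percolate index also maps a date field and each user registers their availability as a range query:

```shell
# A user registers availability for July as a range query on the event date
# (the index name and date field are assumptions for this example)
curl -XPUT 'localhost:9200/attendance-percolate/.percolator/radu' -d '{
  "query": {
    "range": {
      "date": {
        "gte": "2015-07-01",
        "lte": "2015-07-31"
      }
    }
  }
}'
# Count how many registered users are available for a proposed event,
# without loading and returning the full list of matching query IDs
curl 'localhost:9200/attendance-percolate/event/_percolate/count?pretty' -d '{
  "doc": {
    "title": "Elasticsearch Meetup",
    "date": "2015-07-15"
  }
}'
```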

You can, of course, get counts for individual percolations, not just multi percolations. Add /count to the _percolate endpoint:

% curl 'localhost:9200/get-together/event/_percolate/count?pretty' -d '{
  "doc": {
    "title": "Discussion on Elasticsearch Percolator"
  }
}'

The more queries match, the more counting helps with performance, because Elasticsearch won't have to load all the matching IDs in memory and send them over the network. But if you have many queries to begin with, you might want to keep them in separate indices and make sure you run only the relevant ones. We'll look at how you can do that next.

E.2.2. Separating and filtering percolator queries

If you’re registering lots of queries and/or percolating lots of documents, you’re probably looking for scaling and performance tips. Here we’ll discuss the most important ones:

Keep percolations in a separate index. This lets you scale them separately from the rest of your data, especially if you store these indices in a separate Elasticsearch cluster.

Reduce the number of queries run for each percolation. Strategies include routing and filtering.

Using separate indices for percolator

When you register queries in a separate index, the thing to keep in mind is to define a mapping for all the fields you want to query. In the get-together example, if you want percolator queries to run on the title field, you need to define it in the mapping. You can do this while creating the index, at which time you can also specify other index-specific settings, such as the number of shards:

% curl -XPUT 'localhost:9200/attendance-percolate' -d '{
  "settings": {
    "number_of_shards": 4
  },
  "mappings": {
    "event": {
      "properties": {
        "title": {
          "type": "string"
        }
      }
    }
  }
}'

Your new attendance-percolate index has four shards, compared to the existing get-together index with two. This means you can potentially run a single percolation on up to four nodes. Such an index can also be stored in a separate Elasticsearch cluster so that percolations don’t take CPU away from the queries you’d run on the get-together index.

Once your separate index is set up with the mapping, you’d register queries and run percolations in the same way you did in section E.1.1:

% curl -XPUT 'localhost:9200/attendance-percolate/.percolator/1' -d '{
  "query": {
    "match": {
      "title": "elasticsearch percolator"
    }
  }
}'
% curl 'localhost:9200/attendance-percolate/event/_percolate?pretty' -d '{
  "doc": {
    "title": "Discussion on Elasticsearch Percolator"
  }
}'

Most of the scaling strategies you saw in chapter 9 apply to percolator as well. You can use multiple indices—for example, one per customer—to make sure you run only the queries that are relevant for each percolation. You can also use filtered aliases to limit percolations to one customer's queries; that way you avoid the too-many-indices problem that appears if each customer gets their own index.

Using percolator with routing

Percolator also supports routing, another scaling strategy discussed in chapter 9. Routing works well when you have many nodes as well as many users running many percolations. Routing lets you keep each user’s queries in a single shard, avoiding the excessive chatter between nodes shown in figure E.4.

The main downside of routing is that shards might become imbalanced because queries won’t be distributed randomly as they are by default. If you have some users with more queries than others, their shards might become bigger and thus more difficult to scale. See chapter 9 for more information.

To use routing, you’d register queries with a routing value:

% curl -XPUT 'localhost:9200/attendance-percolate/.percolator/1?routing=radu' -d '{
  "query": {
    "match": {
      "title": "Elasticsearch Aggregations"
    }
  }
}'

Then you’d percolate with routing by specifying the same value:

% curl 'localhost:9200/attendance-percolate/event/_percolate?routing=radu&pretty' -d '{
  "doc": {
    "title": "Introduction to Aggregations"
  }
}'

Or you’d percolate against all registered queries by omitting the routing value. Beware that you’ll lose the advantage of sending the queries to appropriate shards only:

% curl 'localhost:9200/attendance-percolate/event/_percolate?pretty' -d '{
  "doc": {
    "title": "Introduction to Aggregations"
  }
}'

Filtering registered queries

Percolator performance depends directly on the number of queries being run, and filtering can help keep this number down.

Typically, you’d add some metadata next to the query and filter on it. The names for these fields can be chosen freely. Because these fields are metadata fields and not part of the documents to match, these fields aren’t added to the mapping. For example, you can tag queries for events:

% curl -XPUT 'localhost:9200/attendance-percolate/.percolator/1' -d '{
  "query": {
    "match": {
      "title": "introduction to aggregations"
    }
  },
  "tags": ["elasticsearch"]
}'

Then, when percolating documents, add a filter for that tag to make sure only the relevant queries are being run:

% curl 'localhost:9200/attendance-percolate/event/_percolate?pretty' -d '{
  "doc": {
    "title": "nesting aggregations"
  },
  "filter": {
    "term": {
      "tags": "elasticsearch"
    }
  }
}'

Alternatively, you can filter on the content of the query itself. This requires a mapping change, because the query object isn't indexed by default; it's as if the mapping for the .percolator type looked like this:

".percolator": {
  "properties": {
    "query": {
      "type": "object",
      "enabled": false
    }
  }
}

Tip

You can find more information about objects and their options in chapter 7, section 7.1.

In the next listing, you’ll change the mapping to enable the query object and then use the filter on the query string itself.

Listing E.4. Filtering queries by their content
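A sketch of how this might look, assuming the attendance-percolate index from earlier and queries registered as match queries on title. The query.match.title field path is an assumption about how the enabled query object's JSON is flattened into fields:

```shell
# Enable indexing of the query object so its text can be filtered on.
# This should be set up before the queries are registered; changing an
# existing field from enabled:false to true isn't possible in place.
curl -XPUT 'localhost:9200/attendance-percolate/.percolator/_mapping' -d '{
  ".percolator": {
    "properties": {
      "query": {
        "type": "object",
        "enabled": true
      }
    }
  }
}'
# Run only the registered queries whose match text contains "aggregations"
curl 'localhost:9200/attendance-percolate/event/_percolate?pretty' -d '{
  "doc": {
    "title": "nesting aggregations"
  },
  "filter": {
    "term": {
      "query.match.title": "aggregations"
    }
  }
}'
```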

There are advantages and disadvantages to both methods. Metadata filtering works well when you have clear categories to filter on. Filtering on the query text, on the other hand, can work as a heuristic when metadata isn't available or reliable.

You may be wondering why you wrapped a query in a filter in listing E.4. That's because you don't need the score when filtering registered queries for this use case. As you saw in chapter 4, filters are faster because they don't calculate scores and are cacheable. But there are use cases where the score—or other features, such as highlighting or aggregations—turns out to be useful during percolation. We'll discuss such use cases next.