E.1. Percolator basics

There are three steps needed for percolation:

  1. Make sure there’s a mapping in place for all the fields referenced by the registered queries.
  2. Register the queries themselves.
  3. Percolate documents.

Figure E.2 shows these steps.

Figure E.2. You need a mapping and some registered queries in order to percolate documents.

We’ll take a closer look at these three steps next, and then we’ll move on to how the percolator works and what its limitations are.

E.1.1. Define a mapping, register queries, then percolate documents

Assume you want to send alerts for any new events about the Elasticsearch percolator. Before registering queries, you need a mapping for all the fields you run queries on. In the case of our get-together example, you might already have mappings for groups and events if you ran populate.sh from the code samples. If you didn’t do that already, you can download the code samples from https://github.com/dakrone/elasticsearch-in-action so you can run populate.sh.
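If you'd rather not run populate.sh, a minimal mapping covering just the title field is enough for the examples that follow. This is a sketch for Elasticsearch 1.x; the index and type names match the get-together example, but the exact settings are up to you:

```shell
# Create the get-together index with a mapping for the title field
# (only needed if you didn't run populate.sh from the code samples)
curl -XPUT 'localhost:9200/get-together' -d '{
  "mappings": {
    "event": {
      "properties": {
        "title": { "type": "string" }
      }
    }
  }
}'
```

Any field a registered query refers to has to be mapped this way before documents are percolated against it.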

With the data from the code samples in place, you can register a query looking for Elasticsearch Percolator in the title field. You already have the mapping for title in place because you ran populate.sh:

% curl -XPUT 'localhost:9200/get-together/.percolator/1' -d '{
  "query": {
    "match": {
      "title": "elasticsearch percolator"
    }
  }
}'

Note that the body of your request is the match query, but to register it, you have to send it through a PUT request as you would while adding a document. To let Elasticsearch know this isn’t your average document but a percolator query, you have to specify the .percolator type.

Note

As you might expect, you can add as many queries as you want, at any point in time. The percolator is real time, so a new query is taken into account for percolation immediately after it's added.

With your mapping and queries in place, you can start percolating documents. To do that, you’ll hit the _percolate endpoint of the type where the document would go and put the contents of the document under the doc field:

% curl 'localhost:9200/get-together/event/_percolate?pretty' -d '{
  "doc": {
    "title": "Discussion on Elasticsearch Percolator"
  }
}'

You’ll get back a list of matching queries, each identified by its index name and ID:

"total" : 1,

"matches" : [ {

"_index" : "get-together",

"_id" : "1"

} ]

Tip

If you have lots of queries registered in the same index, you might want only the IDs to shorten the reply. To do that, add the percolate_format=ids parameter to the request URI.
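For example, the earlier percolation request with this parameter added would look like the following (the document body is unchanged; only the URI differs):

```shell
# Ask for matching queries as a plain list of IDs
curl 'localhost:9200/get-together/event/_percolate?percolate_format=ids&pretty' -d '{
  "doc": {
    "title": "Discussion on Elasticsearch Percolator"
  }
}'
```

Instead of objects with _index and _id fields, the matches array then contains only the query IDs, which keeps replies small when many queries match.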

Next, let’s look at how the percolator works and what kind of limitations you can expect.

E.1.2. Percolator under the hood

In the percolation you just did, Elasticsearch loaded the registered query and ran it against a tiny, in-memory index containing the document you percolated. If you had registered more queries, all of them would have been run against that tiny index.

Registering queries

It’s convenient that in Elasticsearch, queries are normally expressed in JSON, just as documents are: when you register a query, it’s stored in the .percolator type of the index you point it to. This is good for durability, because those queries are stored like any other documents. In addition to storing the query, Elasticsearch loads it in memory so it can be executed quickly.

Warning

Because registered queries are parsed and kept in memory, you need to make sure you have enough heap on each node to hold them. As we’ll see in section E.2.2 of this appendix, one way to deal with large numbers of queries is to use a separate index (or more indices) for percolation. This way you can scale out percolation independently of the actual data.

Unregistering queries

To unregister a query, you have to delete it from the index using the .percolator type and the ID of the query:

% curl -XDELETE 'localhost:9200/get-together/.percolator/1'

Because queries are also loaded in memory, deleting a query doesn’t always unregister it. A delete-by-ID does remove the percolation query from memory, but as of version 1.4, a delete-by-query request doesn’t unregister matching queries from memory. For that to happen, you’d need to reopen the index; for example:

% curl -XDELETE 'localhost:9200/get-together/.percolator/_query?q=*:*'
# right now, any deleted queries are still in memory
% curl -XPOST 'localhost:9200/get-together/_close'
% curl -XPOST 'localhost:9200/get-together/_open'
# now they're unregistered from memory, too

Percolating documents

When you percolate a document, that document is first indexed in an in-memory index; then all registered queries are run against that index to see which ones match.

Because you can only percolate one Elasticsearch document at a time, as of version 1.4 the parent-child queries you saw in chapter 8 don’t work with the percolator, because they imply multiple documents. Plus, you can always add new children to the same parent, so it’s difficult to keep all relevant data in the in-memory index.

By contrast, nested queries work because nested documents are always indexed together in the same Elasticsearch document. You can see such an example in the following listing, where you’ll percolate events with attendee names as nested documents.

Listing E.1. Using percolator with nested attendee names
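The original listing isn't reproduced here, but a sketch of such a percolation might look like the following. The field names (attendees, attendees.name) and the sample values are assumptions for illustration; adapt them to your own mapping:

```shell
# Map attendees as nested documents inside event
# (field names here are illustrative, not from the original listing)
curl -XPUT 'localhost:9200/get-together/_mapping/event' -d '{
  "event": {
    "properties": {
      "attendees": {
        "type": "nested",
        "properties": {
          "name": { "type": "string" }
        }
      }
    }
  }
}'

# Register a nested query that matches events with a given attendee
curl -XPUT 'localhost:9200/get-together/.percolator/2' -d '{
  "query": {
    "nested": {
      "path": "attendees",
      "query": {
        "match": { "attendees.name": "lee" }
      }
    }
  }
}'

# Percolate an event whose attendee names are nested documents;
# the nested docs travel inside the same Elasticsearch document,
# so the in-memory index sees everything the query needs
curl 'localhost:9200/get-together/event/_percolate?pretty' -d '{
  "doc": {
    "title": "Elasticsearch Meetup",
    "attendees": [
      { "name": "Lee" },
      { "name": "Radu" }
    ]
  }
}'
```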

As the number of queries grows, percolating a single document requires more CPU. That’s why it’s important to register cheap queries wherever possible; for example, by using ngrams instead of wildcards or regular expressions. You can look back at chapter 10 for performance tips, and section 10.4.1 describes the tradeoff between ngrams and wildcards.

Percolation performance may be a concern for you, and in the next section we’ll show you percolator-specific tips depending on your use case.