10.1. Grouping requests

The single best thing you can do for faster indexing is to send multiple documents to be indexed at once via the bulk API. This will save network round-trips and allow for more indexing throughput. A single bulk can accept any indexing operation; for example, you can create documents or overwrite them. You can also add update or delete operations to a bulk; it’s not only for indexing.

If your application needs to send multiple get or search operations at once, there are bulk equivalents for them, too: the multiget and multisearch APIs. We’ll explore them later, but we’ll start with the bulk API because in production it’s “the way” to index for most use cases.

10.1.1. Bulk indexing, updating, and deleting

So far in this book you’ve indexed documents one at a time. This is fine for playing around, but it implies performance penalties from at least two directions:

Your application has to wait for a reply from Elasticsearch before it can move on.

Elasticsearch has to process all data from the request for every indexed document.

If you need more indexing speed, Elasticsearch offers a bulk API, which you can use to index multiple documents at once, as shown in figure 10.1.

Figure 10.1. Bulk indexing allows you to send multiple documents in the same request.

As the figure illustrates, you can do that using HTTP, as you’ve used for indexing documents so far, and you’ll get a reply containing the results of all the indexing requests.

Indexing in bulks

In listing 10.1 you’ll index a bulk of two documents. To do that, you have to do an HTTP POST to the _bulk endpoint, with data in a specific format. The format has the following requirements:

Each indexing request is composed of two JSON documents separated by a newline: one with the operation (index in your case) and metadata (like index, type, and ID) and one with the document contents.

JSON documents should be one per line. This implies that each line needs to end with a newline (\n, or the ASCII 10 character), including the last line of the whole bulk of requests.

Listing 10.1. Indexing two documents in a single bulk
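A minimal sketch of such a request might look like the following; the group IDs and names are made up for illustration. The requests are written to a file and sent with curl’s --data-binary flag, which, unlike -d, preserves the newlines the bulk format requires (each echo also appends the trailing newline for its line):

% REQUESTS_FILE=/tmp/requests
% echo '{"index" : {"_index" : "get-together", "_type" : "group", "_id" : "10"}}' > $REQUESTS_FILE
% echo '{"name" : "Elasticsearch Denver"}' >> $REQUESTS_FILE
% echo '{"index" : {"_index" : "get-together", "_type" : "group", "_id" : "11"}}' >> $REQUESTS_FILE
% echo '{"name" : "Elasticsearch San Francisco"}' >> $REQUESTS_FILE
% curl -XPOST localhost:9200/_bulk --data-binary @$REQUESTS_FILE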

For each of the two indexing requests, the first line holds the operation type and some metadata. The main field name is the operation type: it indicates what Elasticsearch has to do with the data that follows. Here you use index, which overwrites documents with the same ID if they already exist. You can change that to create to make sure documents don’t get overwritten, or use update or delete to act on multiple documents at once, as you’ll see later.

_index and _type indicate where to index each document. You can put the index name, or both the index and the type, in the URL. This makes them the default index and type for every operation in the bulk. For example:

% curl -XPOST localhost:9200/get-together/_bulk --data-binary @$REQUESTS_FILE

or

% curl -XPOST localhost:9200/get-together/group/_bulk --data-binary @$REQUESTS_FILE

You can then omit the _index and _type fields from the request itself. If you do specify them, index and type values from the request override those in the URL.

The _id field indicates the ID of the document you’re indexing. If you omit that, Elasticsearch will automatically generate an ID for you, which is helpful if you don’t already have a unique ID for your documents. Logs, for example, work well with generated IDs because they don’t typically have a natural unique ID and you don’t need to retrieve logs by ID.

If you don’t need to provide IDs and you index all documents in the same index and type, the bulk request from listing 10.1 becomes much simpler, as shown in the following listing.

Listing 10.2. Indexing two documents in the same index and type with automatic IDs
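A sketch of the simplified request, again with made-up group names, might look like this. The metadata objects are empty because the index and type come from the URL and the IDs are generated automatically:

% echo '{"index" : {}}' > $REQUESTS_FILE
% echo '{"name" : "Elasticsearch Denver"}' >> $REQUESTS_FILE
% echo '{"index" : {}}' >> $REQUESTS_FILE
% echo '{"name" : "Elasticsearch San Francisco"}' >> $REQUESTS_FILE
% curl -XPOST localhost:9200/get-together/group/_bulk --data-binary @$REQUESTS_FILE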

The result of your bulk insert should be a JSON reply containing the time it took to index the bulk and a response for each operation. There’s also an errors flag, which indicates whether any of the operations failed. The whole response should look something like this:

{
  "took" : 2,
  "errors" : false,
  "items" : [ {
    "create" : {
      "_index" : "get-together",
      "_type" : "group",
      "_id" : "AUyDuQED0pziDTnH-426",
      "_version" : 1,
      "status" : 201
    }
  }, {
    "create" : {
      "_index" : "get-together",
      "_type" : "group",
      "_id" : "AUyDuQED0pziDTnH-427",
      "_version" : 1,
      "status" : 201
    }
  } ]
}

Note that because you’ve used automatic ID generation, the index operations were changed to create. If one document can’t be indexed for some reason, it doesn’t mean the whole bulk has failed, because items from the same bulk are independent of each other. That’s why you get a reply for each operation instead of a single one for the whole bulk. You can use the response JSON in your application to determine which operations succeeded and which failed.
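For example, if you have the jq command-line JSON processor installed (an assumption; it’s not part of Elasticsearch), you could collect only the operations whose status codes indicate failure:

% curl -s -XPOST localhost:9200/get-together/group/_bulk \
    --data-binary @$REQUESTS_FILE | jq '[.items[][] | select(.status >= 300)]'

An empty array means every operation in the bulk succeeded.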

Tip

When it comes to performance, bulk size matters. If your bulks are too big, they take too much memory. If they’re too small, there’s too much network overhead. The sweet spot depends on document size—you’d put a few big documents or more smaller ones in a bulk—and on the cluster’s firepower. A big cluster with strong machines can process bigger bulks faster and still serve searches with decent performance. In the end, you have to test and find the sweet spot for your use case. You can start with values like 1,000 small documents (such as logs) per bulk and increase until you don’t get a significant gain. Be sure to monitor your cluster in the meantime, as we’ll discuss in chapter 11.

Updating or deleting in bulks

Within a single bulk, you can have any number of index or create operations and also any number of update or delete operations.

update operations look similar to the index/create operations we just discussed, except that you must specify the ID. Also, the document content contains doc or script, depending on how you want to update, just as when you did individual updates in chapter 3.

delete operations are a bit different from the rest because they have no document content. You have only the metadata line, which, as with updates, must contain the document’s ID.

In the next listing you have a bulk that contains all four operations: index, create, update, and delete.

Listing 10.3. Bulk with index, create, update, and delete
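A sketch of such a bulk, with made-up IDs and field values, might look like the following. The index and type come from the URL; the update operation carries a doc with the fields to change, and the delete operation has no content line at all:

% echo '{"index" : {"_id" : "10"}}' > $REQUESTS_FILE
% echo '{"name" : "Elasticsearch Denver"}' >> $REQUESTS_FILE
% echo '{"create" : {"_id" : "11"}}' >> $REQUESTS_FILE
% echo '{"name" : "Elasticsearch San Francisco"}' >> $REQUESTS_FILE
% echo '{"update" : {"_id" : "10"}}' >> $REQUESTS_FILE
% echo '{"doc" : {"organizer" : "Lee"}}' >> $REQUESTS_FILE
% echo '{"delete" : {"_id" : "11"}}' >> $REQUESTS_FILE
% curl -XPOST localhost:9200/get-together/group/_bulk --data-binary @$REQUESTS_FILE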

Just as the bulk API lets you group multiple index, update, and delete operations together, the multisearch and multiget APIs do the same for search and get requests, respectively. We’ll look at these next.

10.1.2. Multisearch and multiget APIs

The benefit of using multisearch and multiget is the same as with bulks: when you have to do multiple search or get requests, grouping them together saves time otherwise spent on network latency.

Multisearch

One use case for sending multiple search requests at once occurs when you’re searching in different types of documents. For example, let’s assume you have a search box in your get-together site. You don’t know whether a search is for groups or for events, so you’re going to search for both and offer different tabs in the UI: one for groups and one for events. Those two searches have completely different scoring criteria, so you’d run them as separate requests, but you can group those requests together in a single multisearch request.

The multisearch API has many similarities with the bulk API:

You hit the _msearch endpoint, and you may or may not specify an index and a type in the URL.

Each request takes two single-line JSON strings. The first may contain parameters such as index, type, routing value, or search type, which you’d normally put in the URI of a single request. The second line contains the query body, which is normally the payload of a single request.

The listing that follows shows an example multisearch request for events and groups about Elasticsearch.

Listing 10.4. Multisearch request for events and groups about Elasticsearch
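A sketch of such a request might look like this, assuming the get-together dataset, where events have a title field and groups have a name field. Each search takes two lines: a header with the index and type, then the query body:

% echo '{"index" : "get-together", "type" : "event"}' > $REQUESTS_FILE
% echo '{"query" : {"match" : {"title" : "elasticsearch"}}}' >> $REQUESTS_FILE
% echo '{"index" : "get-together", "type" : "group"}' >> $REQUESTS_FILE
% echo '{"query" : {"match" : {"name" : "elasticsearch"}}}' >> $REQUESTS_FILE
% curl localhost:9200/_msearch?pretty --data-binary @$REQUESTS_FILE

The reply contains a responses array with one full search response per query, in the same order as the requests.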

Multiget

Multiget makes sense when some processing external to Elasticsearch requires you to fetch a set of documents without doing any search. For example, if you’re storing system metrics and the ID is a timestamp, you might need to retrieve specific metrics from specific times without doing any filtering. To do that, you’d call the _mget endpoint and send a docs array with the index, type, and ID of the documents you want to retrieve, as in the next listing.

Listing 10.5. _mget endpoint and docs array with index, type, and ID of documents
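A sketch of such a request, assuming you want the groups with IDs 1 and 2, might look like this:

% curl localhost:9200/_mget?pretty -d '{
  "docs" : [
    {"_index" : "get-together", "_type" : "group", "_id" : "1"},
    {"_index" : "get-together", "_type" : "group", "_id" : "2"}
  ]
}'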

As with most other APIs, the index and type are optional, because you can also put them in the URL of the request. When the index and type are common for all IDs, it’s recommended to put them in the URL and put the IDs in an ids array, making the request from listing 10.5 much shorter:

% curl localhost:9200/get-together/group/_mget?pretty -d '{
  "ids" : [ "1", "2" ]
}'

Grouping multiple operations into the same request with the multiget API might add a little complexity to your application, but it makes those requests faster at no significant cost. The same applies to the multisearch and bulk APIs; to make the best use of them, experiment with different request sizes and find the size that works best for your documents and your hardware.

Next, we’ll look at how Elasticsearch processes documents in bulks internally, in the form of Lucene segments, and how you can tune these processes to speed up indexing and searching.