2.3. Indexing new data

Although chapter 3 gets into the details of indexing, here the goal is to give you a feel for what indexing is about. In this section we’ll discuss the following processes:

Using cURL, you’ll use the REST API to send a JSON document to be indexed with Elasticsearch.

You’ll also look at the JSON reply that comes back.

You’ll see how Elasticsearch automatically creates the index and type to which your document belongs if they don’t exist already.

You’ll index additional documents from the source code for the book so you have a data set ready to search through.

You’ll index your first document by hand, so let’s start by looking at how to issue an HTTP PUT request to a URI. A sample URI is shown in figure 2.10 with each part labeled.

Figure 2.10. URI of a document in Elasticsearch

Let’s walk through how you issue the request.

2.3.1. Indexing a document with cURL

For most snippets in this book you’ll use the cURL binary. cURL is a command-line tool for transferring data over HTTP, and it has become the convention for Elasticsearch code snippets because a cURL example is easy to translate into any programming language. In fact, if you ask for help on the official Elasticsearch mailing list, it’s recommended that you provide a curl recreation of your problem. A curl recreation is a command or a sequence of curl commands that reproduces the problem you’re experiencing, and anyone who has Elasticsearch installed locally can run it.

Installing cURL

If you’re running a UNIX-like operating system, such as Linux or Mac OS X, you’re likely to have the curl command available. If you don’t have it already or if you’re on Windows, you can download it from http://curl.haxx.se. You can also install Cygwin and then select cURL as part of the Cygwin installation, which is the approach we recommend.

Using Cygwin to run curl commands on Windows is preferred because you can copy-paste the commands that work on UNIX-like systems. If you choose to stick with the Windows shell, take extra care because single quotes behave differently on Windows. In most situations, you must replace single quotes (') with double quotes (") and escape double quotes with a backslash (\"). For example, a UNIX command like this

curl 'http://localhost' -d '{"field": "value"}'

looks like this on Windows:

curl "http://localhost" -d "{\"field\": \"value\"}"

There are many ways to use curl to make HTTP requests; run man curl to see all of them. Throughout this book, we use the following curl usage conventions:

The method, which is typically GET, PUT, or POST, is the argument of the -X parameter. You can add a space between the parameter and its argument, but we don’t add one. For example, we use -XPUT instead of -X PUT. The default method is GET, and when we use it, we skip the -X parameter altogether.

In the URI, we skip specifying the protocol; it’s always http, and curl uses http by default when no protocol is specified.

We put single quotes around the URI because it can contain multiple parameters separated by an ampersand (&), which would normally send the process to the background.

The data that we send through HTTP is typically JSON, and we surround it with single quotes because the JSON itself contains double quotes. If single quotes are needed in the JSON itself, we first close the single quotes and then surround the needed single quote with double quotes, as shown in this example:

'{"name": "Scarlet O'"'"'Hara"}'
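You can verify this quoting trick without involving Elasticsearch at all by echoing the string; echo stands in here for curl’s -d argument and shows exactly what the shell passes along:

```shell
# Close the single quotes, emit ' inside double quotes, then reopen.
# The shell concatenates the three pieces into one argument.
echo '{"name": "Scarlet O'"'"'Hara"}'
# prints: {"name": "Scarlet O'Hara"}
```

If the output shows a literal backslash or stray quote, the quoting is wrong and curl would send broken JSON.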

For consistency, most URLs will be surrounded by single quotes, too (except when single quotes would prevent escaping a character or including a variable, in which case double quotes will be used).

The URLs we use for HTTP requests sometimes contain parameters such as pretty=true or simply pretty. We use the latter, whether the request is done with curl or not. The pretty parameter in particular makes the JSON reply look more readable than the default, which is to return the reply all in one line.
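The difference is easy to see on the root endpoint, assuming Elasticsearch is running locally with the default settings (a minimal sketch; the root endpoint is used only because it works before any data is indexed):

```shell
# Without pretty, the JSON reply comes back on a single line
curl 'localhost:9200/'

# With pretty, the same reply is indented across multiple lines
curl 'localhost:9200/?pretty'
```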

Using Elasticsearch from your browser via Head, kopf, or Marvel

If you prefer graphical interfaces to the command line, several tools are available.

Elasticsearch Head— You can install this tool as an Elasticsearch plugin, a standalone HTTP server, or a web page that you can open from your file system. You can send HTTP requests from there, but Head is most useful as a monitoring tool to show you how shards are distributed in your cluster. You can find Elasticsearch Head at https://github.com/mobz/elasticsearch-head.

Elasticsearch kopf— Similar to Head in that it’s good for both monitoring and sending requests, this tool runs as a web page from your file system or as an Elasticsearch plugin. Both Head and kopf evolve quickly, so any comparison might become obsolete quickly as well. You can find Elasticsearch kopf at https://github.com/lmenezes/elasticsearch-kopf.

Marvel— This tool is a monitoring solution for Elasticsearch. We talk more about monitoring in chapter 11, which is all about administering your cluster, and we describe monitoring tools like Marvel in appendix D. For now, the thing to remember is that Marvel also provides a graphical way to send requests to Elasticsearch called Sense, which offers autocomplete, a useful learning aid. You can download Marvel at www.elastic.co/downloads/marvel. Note that Marvel is a commercial product, though it’s free for development.

Assuming you can use the curl command and you have Elasticsearch installed with the default settings on your local machine, you can index your first get-together group document with the following command:

% curl -XPUT 'localhost:9200/get-together/group/1?pretty' -d '{
  "name": "Elasticsearch Denver",
  "organizer": "Lee"
}'

You should get the following output:

{
  "_index" : "get-together",
  "_type" : "group",
  "_id" : "1",
  "_version" : 1,
  "created" : true
}

The reply contains the index, type, and ID of the indexed document. In this case, you get the ones you specified, but it’s also possible to rely on Elasticsearch to generate IDs, as you’ll learn in chapter 3. You also get the version of the document, which begins at 1 and is incremented with each update. You’ll learn about updates in chapter 3.
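To see versioning at work, you can run the same PUT again; assuming the document you just indexed is still there, the reply typically reports the incremented version and created as false:

```shell
# Index the same document again under the same ID
curl -XPUT 'localhost:9200/get-together/group/1?pretty' -d '{
  "name": "Elasticsearch Denver",
  "organizer": "Lee"
}'
# The reply now typically shows "_version" : 2 and "created" : false
```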

Now that you have your first document indexed, let’s look at what happened with the index and the type containing this document.

2.3.2. Creating an index and mapping type

If you installed Elasticsearch and ran the curl command to index a document, you might be wondering why it worked given the following factors:

The index wasn’t there before. You didn’t issue any command to create an index named get-together.

The mapping wasn’t previously defined. You didn’t define any mapping type called group in which to define the fields from your document.

The curl command works because Elasticsearch automatically adds the get-together index for you and also creates a new mapping for the type group. That mapping contains definitions of your fields as strings. Elasticsearch handles all this for you by default, which enables you to start indexing without any prior configuration. You can change this default behavior if you need to, as you’ll see in chapter 3.

Creating an index manually

You can always create an index with a PUT request similar to the request used to index a document:

% curl -XPUT 'localhost:9200/new-index'

{"acknowledged":true}

Creating the index itself takes more time than creating a document, so you might want to have the index ready beforehand. Another reason to create indices in advance is if you want to specify different settings than the ones Elasticsearch defaults to—for example, you may want a specific number of shards. We’ll show you how to do these things in chapter 9—because you’d typically use many indices as a way of scaling out.
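For example, index settings can be passed in the body of the PUT that creates the index. The following is a minimal sketch; the index name new-index-2 and the values are only illustrative, and chapter 9 discusses these settings in depth:

```shell
# Create an index with explicit settings instead of the defaults
curl -XPUT 'localhost:9200/new-index-2' -d '{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}'
```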

Getting the mapping

As we mentioned, the mapping is automatically created with your new document, and Elasticsearch automatically detects your name and organizer fields as strings. If you add a new document with yet another new field, Elasticsearch guesses its type, too, and appends the new field to the mapping.

To view the current mapping, issue an HTTP GET to the _mapping endpoint of the index. This would show you mappings for all types within that index, but you can get a specific mapping by specifying the type name under the _mapping endpoint:

% curl 'localhost:9200/get-together/_mapping/group?pretty'

{
  "get-together" : {
    "mappings" : {
      "group" : {
        "properties" : {
          "name" : {
            "type" : "string"
          },
          "organizer" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

The response contains the following relevant data:

Index name—get-together

Type name—group

Property list—name and organizer

Property options—The type option is string for both properties
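As mentioned earlier, if you index a document with a field that isn’t in the mapping yet, Elasticsearch guesses its type and appends it to the mapping. A quick way to see this (a sketch; the location field and the ID 2 are made up for illustration):

```shell
# Index a document with a field the mapping hasn't seen before
curl -XPUT 'localhost:9200/get-together/group/2' -d '{
  "name": "Elasticsearch Paris",
  "organizer": "Radu",
  "location": "France"
}'

# Fetch the mapping again; it should now include a location field, too
curl 'localhost:9200/get-together/_mapping/group?pretty'
```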

We talk more about indices, mappings, and mapping types in chapter 3. For now, let’s define a mapping and then index some documents by running a script from the code samples that come with this book.

2.3.3. Indexing documents from the code samples

Before we look at searching through the indexed documents, let’s do some more indexing by running populate.sh from the code samples. This will give you more sample data to search through later on.

Downloading the code samples

To download the source code, visit https://github.com/dakrone/elasticsearch-in-action, and then follow the instructions from there. The easiest way to get the samples is by cloning the repository:

git clone https://github.com/dakrone/elasticsearch-in-action.git

If you’re on Windows, it’s best to install Cygwin first from https://cygwin.com. During the installation, add git and curl to the list of packages to be installed. Then you’ll be able to use git to download the code samples and bash to run them.

The script first deletes the get-together index you created. Then it recreates it and creates the mapping that’s defined in mapping.json. The mapping file specifies options other than those you’ve seen so far, and we explore them in the rest of the book, mostly in chapter 3. Finally, the script indexes documents in two types: group and event. There is a parent-child relationship between those types (events belonging to groups), which we explore in chapter 8. For now, ignore this relationship.

Running the populate.sh script should look similar to the following listing.

Listing 2.1. Indexing documents with populate.sh

% ./populate.sh
WARNING, this script will delete the 'get-together' index and re-index all data!
Press Control-C to cancel this operation.
Press [Enter] to continue.

After running the script, you’ll have a handful of groups that meet and the events planned for those groups. Let’s look at how you can search through those documents.