2.5. Configuring Elasticsearch

One of Elasticsearch’s strong points is that it has developer-friendly defaults, making it easy to get started. As you saw in the previous section, you can do indexing and searching on your own test server without making any configuration changes. Elasticsearch automatically creates an index for you and detects the type of new fields in your documents.

Elasticsearch also scales easily and efficiently, which is another important feature when you’re dealing with large amounts of data or requests. In the final section of this chapter, you’ll start a second Elasticsearch instance, in addition to the one you already started in chapter 1, and let them form a cluster. This way, you’ll see how Elasticsearch scales out and distributes your data throughout the cluster.

Although scaling out can be done without any configuration changes, you’ll tweak a few knobs in this section to avoid surprises later when you add a second node. You’ll make the following changes in three different configuration files:

Specify a cluster name in elasticsearch.yml— This is the main configuration file where Elasticsearch-specific options go.

Edit logging options in logging.yml— The logging configuration file is for logging options of log4j, the library that Elasticsearch uses for logging.

Adjust memory settings in environment variables or elasticsearch.in.sh— This file is for configuring the Java virtual machine (JVM) that powers Elasticsearch.

There are many other options, and we’ll point out a few as they appear, but those listed are the most commonly used. Let’s walk through each of these configuration changes.

2.5.1. Specifying a cluster name in elasticsearch.yml

The main configuration file of Elasticsearch can be found in the config/ directory of the unpacked tar.gz or ZIP archive.

Tip

The file is in /etc/elasticsearch/ if you installed it from the RPM or DEB package.

Like the REST API, the configuration can be in JSON or YAML. Unlike the REST API, the most popular format is YAML. It’s easier to read and use, and all the configuration samples in this book are based on elasticsearch.yml.

By default, new nodes discover existing clusters via multicast—by sending a ping to all hosts listening on a specific multicast address. If a cluster is discovered, the new node joins it only if it has the same cluster name. You’ll customize the cluster name to prevent instances running with the default configuration from joining your cluster. To change the cluster name, uncomment and change the cluster.name line in your elasticsearch.yml:

cluster.name: elasticsearch-in-action

After you update the file, stop Elasticsearch by pressing Control-C and then start it again with the following command:

bin/elasticsearch
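After the restart, one quick way to confirm that the new name took effect is to ask for the cluster’s health and look at the cluster_name field in the JSON reply; the response also contains node and shard counts, which you can ignore for now:

curl 'localhost:9200/_cluster/health?pretty'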

Warning

If you’ve indexed some data, you might notice that after restarting Elasticsearch with a new cluster name, there’s no more data. That’s because the directory in which data is stored contains the cluster name, so you can come back to your indexed data by changing back the cluster name and restarting again. For now, you can rerun populate.sh from the code samples to put the sample data back in.
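For instance, with a tar.gz or ZIP install and the default settings, each cluster keeps its data in its own directory under data/, named after the cluster, so the documents you indexed earlier are still on disk under the old name. The layout below is only indicative; the exact paths can differ between versions:

data/elasticsearch/nodes/0             # indexed under the default cluster name
data/elasticsearch-in-action/nodes/0   # indexed after the rename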

2.5.2. Specifying verbose logging via logging.yml

When something goes wrong, application logs are the first place to look for clues. They’re also useful when you just want to see what’s going on. If you need to look in Elasticsearch’s logs, the default location is the logs/ directory under the path where you unpacked the zip/tar.gz archive.

Tip

If you installed it from the RPM or DEB package, the default path is /var/log/elasticsearch/.

Elasticsearch log entries are organized in three types of files:

Main log (cluster-name.log)— Here you can find general information about what happens when Elasticsearch is running; for example, whether a query failed or a new node joined the cluster.

Slow-search log (cluster-name_index_search_slowlog.log)— This is where Elasticsearch logs when a query runs too slowly. By default, if a query takes more than half a second, it logs an entry here.

Index-slow log (cluster-name_index_indexing_slowlog.log)— This is similar to the slow-search log, but by default, it writes an entry if an indexing operation takes more than half a second.
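Both thresholds can be changed. They’re index settings rather than logging options, so they go in elasticsearch.yml (or through the index-settings API) instead of logging.yml. The snippet below is only an illustration, and the exact setting names can vary between versions:

index.search.slowlog.threshold.query.warn: 1s
index.indexing.slowlog.threshold.index.warn: 2s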

To change logging options, you edit the logging.yml file, which is located in the same place as elasticsearch.yml. Elasticsearch uses log4j (http://logging.apache.org/log4j/), and the configuration options in logging.yml are specific to this logging utility.

As with other settings, the defaults are sensible, but if, for example, you need more verbose logging, a good first step is to change the rootLogger, which influences all the logging. We’ll leave the defaults for now, but if you wanted to make it log everything, you’d change the first line of logging.yml to this:

rootLogger: TRACE, console, file

By default, the logging level is INFO, which writes all events with a severity level of INFO or above.
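Raising the rootLogger to TRACE makes every module verbose, which gets noisy quickly. A gentler option is to raise the level only for the module you’re investigating, through the logger section of logging.yml. The module name below is only an example; substitute the one you care about:

logger:
  # log the discovery module in more detail; everything else stays at the root level
  discovery: DEBUG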

2.5.3. Adjusting JVM settings

As a Java application, Elasticsearch runs in a JVM, which, like a physical machine, has its own memory. The JVM comes with its own configuration, and the most important one is how much memory you allow it to use. Choosing the correct memory setting is important for Elasticsearch’s performance and stability.

Most of the memory used by Elasticsearch is called heap. By default, Elasticsearch initially allocates 256 MB of your RAM for its heap and can expand it up to 1 GB. If your searches or indexing operations need more than 1 GB of RAM, those operations will fail and you’ll see out-of-memory errors in your logs. Conversely, if you run Elasticsearch on an appliance that has only 256 MB of RAM, the default settings might allocate too much memory.
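If you’re not sure how much heap a running node actually has, one way to check (assuming the default host and port) is the nodes-stats API; its jvm section reports the heap currently in use and the configured maximum, though the exact field names can differ between versions:

curl 'localhost:9200/_nodes/stats/jvm?pretty'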

To change the default values, use the ES_HEAP_SIZE environment variable, which you can set on the command line before starting Elasticsearch.

On UNIX-like systems, use the export command:

export ES_HEAP_SIZE=500m; bin/elasticsearch

On Windows, use the SET command:

SET ES_HEAP_SIZE=500m & bin\elasticsearch.bat

A more permanent way to set the heap size is by changing bin/elasticsearch.in.sh (and elasticsearch.bat on Windows). Add ES_HEAP_SIZE=500m at the beginning of the file, after #!/bin/sh.
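With that change in place, the top of bin/elasticsearch.in.sh would look roughly like the following; the rest of the file stays as it was:

#!/bin/sh

ES_HEAP_SIZE=500m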

Tip

If you installed Elasticsearch through the DEB package, change these variables in /etc/default/elasticsearch. If you installed from the RPM package, the same settings can be configured in /etc/sysconfig/elasticsearch.

For the scope of this book, the default values should be adequate. If you run more extensive tests, you may need to allocate more memory. If you’re on a machine with less than 1 GB of RAM, lowering those values to something like 200m should also work.

How much memory to allocate in production

Start with half of your total RAM as ES_HEAP_SIZE if you run Elasticsearch only on that server. Try with less if other applications need significant memory. The other half is used by the operating system for caches, which make for faster access to your stored data. Beyond that rule of thumb, you’ll have to run some tests while monitoring your cluster to see how much memory Elasticsearch needs. We talk more about performance tuning and monitoring in part 2 of the book.
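As a quick worked example of that rule of thumb (the numbers are hypothetical): on a server with 8 GB of RAM running only Elasticsearch, you’d start with a 4 GB heap and leave the other 4 GB to the operating system:

export ES_HEAP_SIZE=4g; bin/elasticsearch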

Now that you’ve gotten your hands dirty with Elasticsearch configuration options and you’ve indexed and searched through some data, you’ll get a taste of the “elastic” part of Elasticsearch: the way it scales. (We cover this topic in depth in chapter 9.) You could work through all chapters with a single node, but to get an overview of how scaling works, you’ll add more nodes to the same cluster.