2.6. Adding nodes to the cluster

In chapter 1, you unpacked the tar.gz or ZIP archive and started up your first Elasticsearch instance. This created your one-node cluster. Before you add a second node, you’ll check the cluster’s status to visualize how data is currently allocated. You can do that with a graphical tool such as Elasticsearch kopf or Elasticsearch Head, which we mentioned previously (see section 2.3.1) when you indexed a document. Figure 2.12 shows the cluster in kopf.

Figure 2.12. One-node cluster shown in Elasticsearch kopf

If you don’t have either of these plugins installed, you can always get most of this information from the Cat Shards API:

% curl 'localhost:9200/_cat/shards?v'

index        shard prirep state      docs  store ip          node
get-together 0     p      STARTED      12 15.1kb 192.168.1.4 Hammond, Jim
get-together 0     r      UNASSIGNED
get-together 1     p      STARTED       8 11.4kb 192.168.1.4 Hammond, Jim
get-together 1     r      UNASSIGNED

Tip

Most Elasticsearch APIs return JSON, but the Cat APIs are an exception: they return tabular text, and the Cat Shards API is just one of them. There are many more, and they're useful for getting information about what the cluster is doing at a point in time in a format that's easy to parse for both humans and shell scripts. We'll talk more about the Cat APIs in chapter 11, which focuses on administration.
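
If you're curious which Cat APIs your version offers, you can ask Elasticsearch directly; requesting the root _cat endpoint returns a list of the available endpoints:

% curl 'localhost:9200/_cat'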

Either way, you should see the following information:

Cluster name, as you defined it previously in elasticsearch.yml.

There’s only one node.

The get-together index has two primary shards, which are active. The unassigned shards represent a set of replicas that were configured for this index. Because there’s only one node, those replicas remain unallocated.

The unallocated replica shards make the cluster's status yellow. This means all the primaries are available, but not all the replicas. If primaries were missing, the cluster would be red, signaling that at least one index is incomplete. If all replicas were allocated, the cluster would be green, signaling that everything works as expected.
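
If you prefer raw numbers over colors, the Cluster Health API reports the same status along with counts of nodes and shards:

% curl 'localhost:9200/_cluster/health?pretty'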

2.6.1. Starting a second node

From a different terminal, run bin/elasticsearch or elasticsearch.bat. This starts another Elasticsearch instance on the same machine. You’d normally start new nodes on different machines to take advantage of additional processing power, but for now you’ll run everything locally.

In the terminal or log file of the new node, you should see a line that begins

[INFO ][cluster.service ] [Raman] detected_master [Hammond, Jim]

where Hammond, Jim is the name of the first node. What happened is that the second node detected the first one via multicast and joined the cluster. The first node is also the master of the cluster, which means it's responsible for keeping information such as which nodes are in the cluster and where shards are located. This information is called the cluster state, and it's replicated to the other nodes. If the master goes down, another node can be elected to take its place.
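
You can check which node is the elected master with the Cat Nodes API; with the ?v parameter it prints a header line, and the master column should show an asterisk next to the current master:

% curl 'localhost:9200/_cat/nodes?v'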

If you look at your cluster’s status in figure 2.13, you can see that the set of replicas was allocated to the new node, making the cluster green.

Figure 2.13. Replica shards are allocated to the second node.

If these two nodes were on separate machines, you’d have a fault-tolerant cluster, which would handle more concurrent searches than before. But what if you need more indexing performance, or need to handle even more concurrent searches? More nodes will certainly help.

Note

You may have already noticed that the first node starting on a machine listens on port 9200 of all interfaces for HTTP requests. As you add more nodes, they use ports 9201, 9202, and so on. For node-to-node communication, Elasticsearch uses ports 9300, 9301, and so on. These are ports you might need to allow in the firewall. You can change the listening addresses in the Network and HTTP section of elasticsearch.yml.
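
For example, if you wanted to pin a node to a specific address and ports instead of the defaults, a minimal sketch of those settings in elasticsearch.yml might look like the following (the address is illustrative; replace it with your own):

network.host: 192.168.1.4     # address to bind to; defaults to all interfaces
http.port: 9200               # port for HTTP requests from clients
transport.tcp.port: 9300      # port for node-to-node communication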

2.6.2. Adding additional nodes

If you run bin/elasticsearch or elasticsearch.bat again to add a third node and then a fourth, you’ll see that they detect the master via multicast and join the cluster in the same way. Additionally, as shown in figure 2.14, the four shards of the get-together index automatically get balanced across the cluster.

Figure 2.14. Elasticsearch automatically distributes shards across the growing cluster.

At this point you might wonder what happens if you add more nodes. By default, nothing happens because you have four total shards that can’t be distributed to more than four nodes. That said, if you need to scale, you have a few options:

Change the number of replicas. Replicas can be updated on the fly (see the commands after this list), but scaling this way increases only the number of concurrent searches your cluster can serve, because searches are sent to replicas of the same shard in round-robin fashion. Indexing performance will remain the same, because new data has to be processed by all shards. Also, an individual search still runs against only one copy of each shard, so adding replicas won't make a single search any faster.

Create an index with more shards. This implies re-indexing your data because the number of primary shards can’t be changed on the fly.

Add more indices. Some data lends itself naturally to being split across multiple indices. For example, if you index logs, you can put each day's logs in its own index.
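
To make the first two options concrete, here's a quick sketch using the same curl syntax as before. The first command changes the number of replicas of the existing index on the fly; the second creates a hypothetical new index (called get-together-v2 here) with more primary shards, into which you'd then reindex your data:

% curl -XPUT 'localhost:9200/get-together/_settings' -d '{
  "number_of_replicas": 2
}'
% curl -XPUT 'localhost:9200/get-together-v2' -d '{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 1
  }
}'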

We discuss these patterns for scaling out in chapter 9. For now, you can shut down the three extra nodes to keep things simple. If you shut them down one at a time, you can watch shards get rebalanced automatically as the cluster returns to its initial state. If you shut them all down at once, the first node will be left with only one shard, because the cluster has no time to relocate the rest of the data back to it. In that case, you can run populate.sh again, which will reindex all the sample data.