Chapter 2. Diving into the functionality

This chapter covers

Defining documents, types, and indices

Understanding Elasticsearch nodes and primary and replica shards

Indexing documents with cURL and a data set

Searching and retrieving data

Setting Elasticsearch configuration options

Working with multiple nodes

Now that you know what kind of search engine Elasticsearch is and have seen some of its main features in chapter 1, let’s switch to the practical side and see how it does what it’s good at. Imagine you’re tasked with creating a way to search through millions of documents, for example on a website that allows people to build common-interest groups and get together. In this case, documents could be the get-together groups and the individual events. You need to implement this in a fault-tolerant way, and you need your setup to accommodate more data and more concurrent searches as your get-together site becomes more successful.

In this chapter, we’ll show you how to deal with such a scenario by explaining how Elasticsearch data is organized. Then you’ll get practical and start indexing and searching some real data for a get-together website using the code samples provided for this chapter. We’ll use this get-together example and the code samples throughout the book to allow you to do some “real” searches and indexing.

All operations will be done using cURL, a nice little command-line tool for HTTP requests. Later you can translate what cURL does into your preferred programming language if you need to. Toward the end of the chapter, you’ll make some configuration changes and start new instances of Elasticsearch, so you can experiment with a cluster of multiple nodes.

We’ll get started with data organization. To understand how data is organized in Elasticsearch, we’ll look at it from two angles:

Logical layout: What your search application needs to be aware of. The unit you’ll use for indexing and searching is a document, and you can think of it as a row in a relational database. Documents are grouped into types, which contain documents in a way similar to how tables contain rows. Finally, one or multiple types live in an index, the biggest container, similar to a database in the SQL world.

Physical layout: How Elasticsearch handles your data in the background. Elasticsearch divides each index into shards, which can migrate between the servers that make up a cluster. Typically, applications don’t care about this because they work with Elasticsearch in much the same way whether the cluster has one server or many. But when you’re administering the cluster, you do care, because the way you configure the physical layout determines the cluster’s performance, scalability, and availability.
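To make the logical layout more concrete, here’s a minimal Python sketch that models it as nested dictionaries: an index contains types, a type contains documents, and each document is found by its index, type, and ID. This is only a toy model, not how Elasticsearch stores data. The index name, type names, IDs, and fields are hypothetical, and the path format assumes a version of Elasticsearch that has types.

```python
# Toy model of the logical layout: index -> type -> document ID -> document.
# All names and field values here are hypothetical, chosen for illustration.
logical_layout = {
    "get-together": {                                # index (like a database)
        "group": {                                   # type (like a table)
            "1": {"name": "Elasticsearch Denver"},   # document (like a row)
        },
        "event": {                                   # a second type in the same index
            "100": {"title": "Introduction to search"},
        },
    },
}


def document_path(index, doc_type, doc_id):
    """Build the index/type/ID path that identifies a single document."""
    return "/{}/{}/{}".format(index, doc_type, doc_id)


def get_document(layout, index, doc_type, doc_id):
    """Look up one document by walking index -> type -> ID."""
    return layout[index][doc_type][doc_id]


path = document_path("get-together", "group", "1")
doc = get_document(logical_layout, "get-together", "group", "1")
print(path)
print(doc)
```

Notice that nothing in this model mentions shards or servers. That’s the point of the logical layout: the application addresses a document by index, type, and ID, regardless of where the data physically lives.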

Figure 2.1 illustrates the two perspectives.

Figure 2.1. An Elasticsearch cluster from the application’s and administrator’s points of view

Let’s start with the logical layout—or what the application sees.