1.2. Exploring typical Elasticsearch use cases

We’ve already established that storing and indexing your data in Elasticsearch is a good way to provide quick and relevant results to your searches. But in the end, Elasticsearch is just a search engine, and you’ll never use it on its own. Like any other data store, you need a way to feed data into it, and you probably need to provide an interface for the users searching that data.

To get an idea of how Elasticsearch might fit into a bigger system, let’s consider three typical scenarios:

Elasticsearch as the primary back end for your website— As we discussed, you may have a website that allows people to write blog posts, but you also want the ability to search through the posts. You can use Elasticsearch to store all the data related to these posts and serve queries as well.

Adding Elasticsearch to an existing system— You may be reading this book because you already have a system that’s crunching data and you want to add search. We’ll look at a couple of overall designs on how that might be done.

Elasticsearch as the back end of a ready-made solution built around it— Because Elasticsearch is open-source and offers a straightforward HTTP interface, a big ecosystem supports it. For example, Elasticsearch is popular for centralizing logs; given the tools already available that can write to and read from Elasticsearch, other than configuring those tools to work the way you want, you don’t need to develop anything.

Let’s take a closer look at each of these scenarios.

1.2.1. Using Elasticsearch as the primary back end

Traditionally, search engines are deployed on top of well-established data stores to provide fast and relevant search capability. That’s because historically search engines haven’t offered durable storage or other features that are often needed, such as statistics.

Elasticsearch is one of those modern search engines that provide durable storage, statistics, and many other features you’ve come to expect from a data store. If you’re starting a new project, we recommend that you consider using Elasticsearch as the only data store to help keep your design as simple as possible. This might not work well for all use cases—for instance, when you have lots of updates—so you can also use Elasticsearch on top of another data store.

Note

Like other NoSQL data stores, Elasticsearch doesn’t support transactions. In chapter 3, you’ll see how you can use versioning to manage concurrency, but if you need transactions, consider using another database as the “source of truth.” Also, regular backups are a good practice when you’re using a single data store. We’ll discuss backups in chapter 11.

Let’s return to the blog example: you can store newly written blog posts in Elasticsearch. Similarly, you can use Elasticsearch to retrieve, search, or do statistics through all that data, as shown in figure 1.2.

Figure 1.2. Elasticsearch as the only back end storing and indexing all your data

What happens if a server goes down? You can get fault tolerance by replicating your data to different servers. Many other features make Elasticsearch a tempting NoSQL data store. It can’t be great for everything, but you should weigh whether including another data store in your overall design is worth the extra complexity.

1.2.2. Adding Elasticsearch to an existing system

By itself, Elasticsearch may not always provide all the functionality you need from a data store. Some situations may require you to use Elasticsearch in addition to another data store.

For example, transaction support and complex relationships are features that Elasticsearch doesn’t currently support, at least not in version 1.x. If you need those features, consider using Elasticsearch along with a different data store.

Or you may already have a complex system that works, but you want to add search. It might be risky to redesign the entire system for the sole purpose of using Elasticsearch alone (though you might want to do that over time). The safer approach is to add Elasticsearch to your system and make it work with your existing components.

Either way, if you have two data stores, you’ll have to find a way to keep them synchronized. Depending on what your primary data store is and how your data is laid out, you can deploy an Elasticsearch plugin to keep the two entities synchronized, as illustrated in figure 1.3.

Figure 1.3. Elasticsearch in the same system with another data store

For example, suppose you have an online retail store with product information stored in an SQL database. You need fast and relevant searching, so you install Elasticsearch. To index the data, you need to deploy a synchronizing mechanism, which can be an Elasticsearch plugin or a custom service that you build. You’ll learn more about plugins in appendix B and about dealing with indexing and updating from your own application in chapter 3. This synchronizing mechanism could pull all the data corresponding to each product and index it in Elasticsearch, where each product is stored as a document.
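If you go the custom route, indexing boils down to simple HTTP requests. As a minimal sketch (the products index, the product type, and the field names here are all assumptions for illustration), indexing one product could look like this:

% curl -XPUT 'localhost:9200/products/product/1' -d '
{
  "name": "Trail bicycle",
  "price": 429.99,
  "description": "26-inch wheels, aluminum frame"
}'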

When a user types in search criteria on the web page, the storefront web application queries Elasticsearch with those criteria. Elasticsearch returns a number of product documents that match, sorted in the way you prefer. Sorting can be based on a relevance score that indicates how many times the words people searched for appear in each product, or on anything stored in the product document, such as how recently the product was added, the average rating, or even a combination of those.

Inserting or updating information can still be done on the “primary” SQL database, so you can use Elasticsearch solely for handling searches. It’s up to the synchronizing mechanism to keep Elasticsearch up to date with the latest changes.

When you need to integrate Elasticsearch with other components, you can check for existing tools that may already do what you need. As we’ll explore in the next section, there’s a strong community building tools for Elasticsearch, and sometimes you don’t have to build any custom component.

1.2.3. Using Elasticsearch with existing tools

In some use cases, you don’t have to write a single line of code to get a job done with Elasticsearch: many tools already work with it, so you don’t have to build your own from scratch.

For example, say you want to deploy a large-scale logging framework to store, search, and analyze a large number of events. As shown in figure 1.4, to process logs and output to Elasticsearch, you can use logging tools such as Rsyslog (www.rsyslog.com), Logstash (www.elastic.co/products/logstash), or Apache Flume (http://flume.apache.org). To search and analyze those logs in a visual interface, you can use Kibana (www.elastic.co/products/kibana).


Figure 1.4. Elasticsearch in a system of logging tools that support Elasticsearch out of the box

The fact that Elasticsearch is open-source—under the Apache 2 license, to be precise—isn’t the only reason that so many tools support it. Even though Elasticsearch is written in Java, there’s more than a Java API that lets you work with it. It also exposes a REST API, which any application can access, no matter the programming language it was written in.

What’s more, the REST requests and replies are typically in JSON (JavaScript Object Notation) format: a request usually carries its payload as JSON, and replies are JSON documents as well.

JSON and YAML

JSON is a format for expressing data structures. A JSON object typically contains keys and values, where values can be strings, numbers, true/false, null, another object, or an array. For more details about the JSON format, visit http://json.org/.

JSON is easy for applications to parse and generate. YAML (YAML Ain’t Markup Language) is supported for the same purpose. To activate YAML, add the format=yaml parameter to the HTTP request. For more details on YAML, visit http://yaml.org. Although JSON is typically used for HTTP communication, configuration files are usually written in YAML. In this book we stick with the popular formats: JSON for HTTP communication and YAML for configuration.
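For example, assuming Elasticsearch runs locally on the default port, a request for a YAML reply would look something like this:

% curl 'localhost:9200/?format=yaml'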

Throughout this book, JSON field names are shown in blue and their values are in red to make the code easier to read.

A search request for log events with the value first in the message field might look like the following sketch (the index name logs and the use of a match query are assumptions here):
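% curl 'localhost:9200/logs/_search' -d '
{
  "query": {
    "match": {
      "message": "first"
    }
  }
}'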

Sending data and running queries by sending JSON objects over HTTP makes it easy for you to extend anything—from a syslog daemon like Rsyslog to a connecting framework like Apache ManifoldCF (http://manifoldcf.apache.org)—to interact with Elasticsearch. If you’re building a new application from scratch or want to add search to an existing application, the REST API is one of the features that makes Elasticsearch appealing. In the next section we’ll look at other such features.

1.2.4. Main Elasticsearch features

Elasticsearch allows you to easily access Lucene’s functionality for indexing and searching your data. On the indexing side, you have lots of options for how to process the text in your documents and how to store that processed text. When searching, you have many queries and filters to choose from. Elasticsearch exposes this functionality through the REST API, allowing you to structure queries in JSON and adjust most of the configuration through the same API.

On top of what Lucene provides, Elasticsearch adds its own, higher-level functionality, from caching to real-time analytics. In chapter 7 you’ll learn how to do these analytics through aggregations, which can give you results like the most popular blog tags, the average popularity of a certain group of posts, and endless combinations such as the average popularity of posts for each tag.
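For instance, a minimal sketch of a request for the most popular tags, assuming a hypothetical blog index where posts carry a tags field, might look like this:

% curl 'localhost:9200/blog/_search' -d '
{
  "size": 0,
  "aggregations": {
    "top_tags": {
      "terms": {
        "field": "tags"
      }
    }
  }
}'

The terms aggregation counts how often each tag appears across the matching documents, which is one way to surface the most popular ones; setting size to 0 skips returning the documents themselves.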

Another level of abstraction is the way you can organize documents: multiple indices can be searched separately or together, and you can put different types of documents within each index.
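For example, assuming two hypothetical indices named blog and news, you can search them separately or together just by changing the URL:

% curl 'localhost:9200/blog/_search?q=elasticsearch'
% curl 'localhost:9200/blog,news/_search?q=elasticsearch'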

Finally, Elasticsearch is, as the name suggests, elastic. It’s clustered by default—you call it a cluster even if you run it on a single server—and you can always add more servers to increase capacity or fault tolerance. Similarly, you can easily remove servers from the cluster to reduce costs if you have lower load.

We’ll discuss all these features in great detail in the rest of the book—scaling, in particular, is addressed in chapter 9—but before that, let’s have a closer look and see how these features are useful.

1.2.5. Extending Lucene functionality

In many use cases, users search based on multiple criteria. For example, you can search for multiple words in multiple fields; some criteria would be mandatory and some would be optional. One of the most appreciated features of Elasticsearch is its well-structured REST API: you can structure your queries in JSON to combine different types of queries in many ways. We’ll show you how in chapter 4, and you’ll also see how you can use filters to include or exclude results in a cheap and cacheable way. Your JSON search can include both queries and filters, as well as aggregations, which generate statistics from matching documents.
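As a minimal sketch, assuming documents with title and tags fields, a version 1.x request combining a query with a filter might look like this:

% curl 'localhost:9200/blog/_search' -d '
{
  "query": {
    "filtered": {
      "query": {
        "match": { "title": "elasticsearch" }
      },
      "filter": {
        "term": { "tags": "search" }
      }
    }
  }
}'

Here the match query contributes to the relevance score, while the term filter only includes or excludes documents, which makes it cheap and cacheable.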

Through the same REST API you can read and change many settings (as you’ll see in chapter 11), as well as the way documents are indexed.

What about Apache Solr?

If you’ve already heard about Lucene, you’ve probably also heard about Solr, which is an open-source, distributed search engine based on Lucene. In fact, Lucene and Solr merged as a single Apache project in 2010, so you might wonder how Elasticsearch compares with Solr.

Both search engines provide similar functionality, and features evolve quickly with each new version. You can search the web for comparisons, but we recommend taking them with a grain of salt. Besides being tied to particular versions, which makes such comparisons obsolete in a matter of months, many of them are biased for various reasons.

That said, a few historical facts help explain the origins of the two products. Solr was created in 2004 and Elasticsearch in 2010. When Elasticsearch came around, its distributed model, which is discussed later in this chapter, made it much easier to scale out than any of its competitors, which suggests the “elastic” part of the name. In the meantime, however, Solr added sharding with version 4.0, which makes the “distributed” argument debatable, like many other aspects.

At the time of this writing, Elasticsearch and Solr each have features that the other one doesn’t, and choosing between them may come down to the specific functionality you need at a given point in time. For many use cases, the functionality you need is covered by both, and, as is often the case with competitors, choosing between them becomes a matter of taste. If you want to read more about Solr, we recommend Solr in Action by Trey Grainger and Timothy Potter (Manning, 2014).

When it comes to the way documents are indexed, one important aspect is analysis. Through analysis, the words from the text you’re indexing become terms in Elasticsearch. For example, if you index the text “bicycle race,” analysis may produce the terms “bicycle,” “race,” “cycling,” and “racing,” and when you search for any of those terms, the corresponding document is included in the results. The same analysis process applies when you search, as illustrated in figure 1.5. If you enter “bicycle race,” you probably don’t want to search for only the exact match. Maybe a document that contains both those words somewhere will do.

Figure 1.5. Analysis breaks text into words, both when you’re indexing and when you’re searching.

The default analyzer first breaks text into words by looking for common word separators, such as a space or a comma. Then it lowercases those words, so that “Bicycle Race” generates “bicycle” and “race.” There are many more analyzers, and you can also build your own. We’ll show you how in chapter 5.
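You can try this yourself through the _analyze endpoint; a quick sketch, assuming Elasticsearch runs locally on the default port:

% curl 'localhost:9200/_analyze?analyzer=standard' -d 'Bicycle Race'

The reply lists the resulting terms, in this case bicycle and race, along with their positions and offsets.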

At this point you might want to know more about what’s in that “indexed data” box shown in figure 1.5 because it sounds quite vague. As we’ll discuss next, data is organized in documents. By default, Elasticsearch stores your documents as they are, and it also puts all the terms resulting from analysis into the inverted index to enable the all-important fast and relevant searches. We go into more detail about indexing and storing data in chapter 3. For now, let’s take a closer look at why Elasticsearch is document-oriented and how it groups documents in types and indices.

1.2.6. Structuring your data in Elasticsearch

Unlike a relational database, which stores data in records or rows, Elasticsearch stores data in documents. Yet, to some extent, the two concepts are similar. With rows in a table, you have columns, and for each column, each row has a value. With a document you have keys and values, in much the same way.

The difference is that a document is more flexible than a row, mainly because—in Elasticsearch, at least—a document can be hierarchical. For example, in the same way you associate a key with a string value, such as "author":"Joe", a document can have an array of strings, such as "tags":["cycling", "bicycles"], or even key-value pairs, such as "author":{"first_name":"Joe", "last_name":"Smith"}. This flexibility is important because it encourages you to keep all the data that belongs to a logical entity in the same document, as opposed to keeping it in different rows in different tables. For example, the easiest (and probably fastest) way of storing blog articles is to keep all the data that belongs to a post in the same document. This way, searches are fast because you don’t need joins or any other relational work.
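For example, a blog post kept as a single document might look like the following sketch (all field names and values here are made up for illustration):

{
  "title": "Riding in the rain",
  "author": {
    "first_name": "Joe",
    "last_name": "Smith"
  },
  "tags": ["cycling", "bicycles"]
}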

If you have an SQL background, you might miss the ability to use joins. Unfortunately, they’re not supported, at least not in version 1.x.

To run Elasticsearch, the only thing you need is a Java Runtime Environment (JRE) installed. Once that’s in place, you’re typically only a download away from getting Elasticsearch ready to start.

1.2.7. Installing Java

If you don’t have a Java Runtime Environment (JRE) already, you’ll have to install it first. Any JRE should work, as long as it’s version 1.7 or later. Typically, you install the one from Oracle (www.java.com/en/download/index.jsp) or the open-source implementation, OpenJDK (http://download.java.net/openjdk/).

Troubleshooting “no Java found” errors

With Elasticsearch, as with other Java applications, it might happen that you’ve downloaded and installed Java, but the application refuses to start, complaining that it can’t find Java.

Elasticsearch’s script looks for Java in two places: the JAVA_HOME environment variable and the system path. To check if it’s in JAVA_HOME, use the env command on UNIX-like systems and the set command on Windows. To check if it’s in the system path, run the following command:

% java -version

If it works, then Java is in your path. If it doesn’t, either configure JAVA_HOME or add the Java binary to your path. The Java binary is typically found wherever you installed Java (which should be JAVA_HOME), in the bin directory.
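For example, on a UNIX-like system you might do the following (the install path here is an assumption; use wherever Java actually lives on your machine):

% export JAVA_HOME=/usr/java/jdk1.7.0
% export PATH=$PATH:$JAVA_HOME/bin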

1.2.8. Downloading and starting Elasticsearch

With Java set up, you need to get Elasticsearch and start it. Download the package that best fits your environment. The following package options are available from www.elastic.co/downloads/elasticsearch: Tar, ZIP, RPM, and DEB.

Any UNIX-like operating system

If you’re running on Linux, Mac, or any other UNIX-like operating system, you can get Elasticsearch from the tar.gz package. Then you can unpack it and start Elasticsearch with the shell script from the archive:

% tar zxf elasticsearch-*.tar.gz

% cd elasticsearch-*

% bin/elasticsearch

Homebrew package manager for OS X

If you need an easier way to install Elasticsearch on your Mac, you can install Homebrew. Instructions for doing that can be found at http://brew.sh. With Homebrew installed, getting Elasticsearch is a matter of running the following command:

% brew install elasticsearch

Then you start it in a similar way to the tar.gz archive:

% elasticsearch

ZIP package

If you’re running on Windows, download the ZIP archive. Unpack it and then run elasticsearch.bat from the bin/ directory, much as you run Elasticsearch on UNIX:

% bin\elasticsearch.bat

RPM or DEB packages

If you’re running on Red Hat Linux, CentOS, SUSE, or anything else that works with RPMs, or on Debian, Ubuntu, or anything else that works with DEBs, you can use the RPM and DEB repositories provided by Elastic. You can see how to use them at www.elastic.co/guide/en/elasticsearch/reference/current/setup-repositories.html.

Once you get Elasticsearch installed, which basically requires adding the repository to your list and running an install command, you can start it by running:

% systemctl start elasticsearch.service

Or, if your operating system doesn't have systemd:

% /etc/init.d/elasticsearch start

If you want to see what Elasticsearch is doing, look up the logs in /var/log/elasticsearch/. If you installed it by unpacking the TAR or ZIP archive, you should find them in the logs/ directory within the unpacked archive.

1.2.9. Verifying that it works

Now that you have Elasticsearch installed and started, let’s take a look at the logs generated during startup and connect to the REST API for the first time.

Examining the startup logs

When you first run Elasticsearch, you see a series of log lines telling you what’s going on. Let’s take a look at some of those lines and what they mean.

The first line typically provides statistics about the node you started:

[node] [Karkas] version[1.4.0], pid[6011], build[bc94bd8/2014-11-05T14:26:12Z]

By default, Elasticsearch gives your node a random name, in this case Karkas, which you can modify from the configuration. You can see details on the particular Elasticsearch version you’re running, along with the PID of the Java process that started.

Plugins are loaded during initialization, and no plugins are included by default:

[plugins] [Karkas] loaded [], sites []

For more information about plugins, see appendix B.

Port 9300 is used by default for inter-node communication, called transport:

[transport] [Karkas] bound_address {inet[/0.0.0.0:9300]}, publish_address {inet[/192.168.1.8:9300]}

If you use the native Java API instead of the REST API, this is the port you connect to.

In the next line, a master node was elected and it’s the node you started named Karkas:

[cluster.service] [Karkas] new_master [Karkas][YPHC_vWiQVuSX-ZIJIlMhg][inet[/192.168.1.8:9300]], reason: zen-disco-join (elected_as_master)

We discuss master election in chapter 9, which covers scaling out. The basic idea is that each cluster has a master node, responsible for knowing which nodes are in the cluster and where all the shards are located. Each time the master is unavailable, a new one is elected. In this case, you started the first node in the cluster, so this is your master.

Port 9200 is used for HTTP communication by default. This is where applications using the REST API connect:

[http] [Karkas] bound_address {inet[/0.0.0.0:9200]}, publish_address {inet[/192.168.1.8:9200]}

The next line indicates that your node is now started:

[node] [Karkas] started

At this point, you can connect to it and start issuing requests.

The gateway is the component of Elasticsearch responsible for persisting your data to disk so you don’t lose it if the node goes down:

[gateway] [Karkas] recovered [0] indices into cluster_state

When you start your node, the gateway looks on the disk to see if any data is saved so it can restore it. In this case, there’s no index to restore.

Much of the information we’ve looked at in these log lines—from the node name to the gateway settings—is configurable. We talk about configuration options, and the concepts around them, as the book progresses. You can expect such configuration options to appear in part 2, which is all about performance and administration. Until then, you won’t need to configure much because the default values are developer-friendly.

Warning

Default values are so developer-friendly that if you start another Elasticsearch instance on another computer within the same multicast-enabled network, it will join the same cluster as the first instance, which might lead to unexpected results, such as shards migrating from one to the other. To prevent this, you can change the cluster name in the elasticsearch.yml configuration file, as shown in chapter 2, section 2.5.1.
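For example, changing a single line in elasticsearch.yml is enough to isolate your cluster (the name shown is just an illustration):

cluster.name: es-in-action-dev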

Using the REST API

The easiest way to connect to the REST API is by pointing your browser to http://localhost:9200. If you didn’t install Elasticsearch on your local machine, replace localhost with the IP address of the remote machine. By default, Elasticsearch listens for incoming HTTP requests on port 9200 of all interfaces. If the request works, you should get a JSON reply, showing that it works, as shown in figure 1.6.

Figure 1.6. Checking out Elasticsearch from your browser
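If you prefer the command line, curl gets you the same reply. It should look roughly like this (abridged; your node name, cluster name, and version details will differ):

% curl 'localhost:9200'
{
  "status" : 200,
  "name" : "Karkas",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.4.0"
  },
  "tagline" : "You Know, for Search"
}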