2.4. Searching for and retrieving data

As you might imagine, there are many options around how to search. After all, searching is what Elasticsearch is for.

Note

We look at the most common ways to search in chapter 4; you learn more about getting relevant results in chapter 6 and about search performance in chapter 10.

To take a closer look at the pieces that make up a typical search, search for groups that contain the word “elasticsearch” but ask only for the name and location fields of the most relevant document. The following listing shows the GET request and response.

Listing 2.2. Search for “elasticsearch” in groups

Normally a query runs on a specific field, such as q=name:elasticsearch, but in this case we didn’t specify any field because we wanted to search in all fields. In fact, Elasticsearch uses, by default, a special field named _all, in which all fields’ contents are indexed. We’ll look more at the _all field in chapter 3, but for now it’s nice to know that such a query without an explicit field name goes there.
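The pieces of such a URI request can also be assembled in code. The following is a minimal Python sketch, not a client library: the helper name uri_search_url is hypothetical, the host and the get-together/group names are the ones used in this chapter, and the fields and size values follow what the text says about listing 2.2. It only builds the URL; no request is sent.

```python
from urllib.parse import urlencode

def uri_search_url(host, index, doc_type, query, fields=None, size=None):
    """Build a URI search URL (hypothetical helper; nothing is sent)."""
    params = {"q": query}
    if fields:
        params["fields"] = ",".join(fields)
    if size is not None:
        params["size"] = size
    params["pretty"] = "true"
    # Keep ':' and ',' unescaped so q=name:elasticsearch reads as in the text.
    query_part = urlencode(params, safe=":,")
    return "http://%s/%s/%s/_search?%s" % (host, index, doc_type, query_part)

# No field in the query string: the search goes to the _all field.
all_fields_url = uri_search_url("localhost:9200", "get-together", "group",
                                "elasticsearch",
                                fields=["name", "location"], size=1)
# Field given explicitly in the query string.
name_only_url = uri_search_url("localhost:9200", "get-together", "group",
                               "name:elasticsearch")
print(all_fields_url)
print(name_only_url)
```

You could paste either printed URL into curl to run the search.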

We’ll look at many more aspects of searches in chapter 4, but here we’ll take a closer look at three important pieces of a search:

Where to search

Contents of the reply

What and how to search

2.4.1. Where to search

You can tell Elasticsearch to look in a specific type of a specific index, as in listing 2.2, but you can also search in multiple types in the same index, in multiple indices, or in all indices.

To search in multiple types, use a comma-separated list. For example, to search in both group and event types, run a command like this:

% curl "localhost:9200/get-together/group,event/_search\
?q=elasticsearch&pretty"

You can also search in all types of an index by sending your request to the _search endpoint of the index’s URL:

% curl 'localhost:9200/get-together/_search?q=sample&pretty'

Similar to types, to search in multiple indices, separate them with a comma:

% curl "localhost:9200/get-together,other-index/_search\
?q=elasticsearch&pretty"

This particular request will fail unless you created other-index in advance. To ignore such problems, you can add the ignore_unavailable flag in the same way you add the pretty flag. To search in all indices, omit the index name altogether:

% curl 'localhost:9200/_search?q=elasticsearch&pretty'

Tip

If you need to search in all indices, you can also use a placeholder called _all as the index name. This comes in handy when you need to search in a single type across all indices as in this example: http://localhost:9200/_all/event/_search.
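The URL patterns in this section follow one rule: an optional comma-separated list of indices, then an optional comma-separated list of types, then _search. Here is a small Python sketch of that rule; the function search_path is a hypothetical helper, and it assumes that a type without an index needs the _all placeholder, as in the tip above.

```python
def search_path(indices=None, types=None):
    """Return the _search path for the given indices and types.

    Hypothetical helper: no indices means all indices, no types means
    all types.
    """
    parts = []
    if indices:
        parts.append(",".join(indices))
    elif types:
        # A type with no index needs the _all placeholder as the index name.
        parts.append("_all")
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)

print(search_path(["get-together"], ["group", "event"]))
print(search_path(["get-together"]))
print(search_path(["get-together", "other-index"]))
print(search_path())
print(search_path(types=["event"]))
```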

This flexibility regarding where to search allows you to organize data in multiple indices and types, depending on what makes sense for your use case. For example, log events are often organized in time-based indices, such as “logs-2013-06-03,” “logs-2013-06-04,” and so on. Such a design implies that today’s index is hot: all new events go here, and most of the searches are in recent data. The hot index contains only a fraction of all your data, making it easier to handle and faster. And you can still search in older data or in all data if you need to. You’ll find out more about such design patterns in part 2, where you’ll learn more about scaling, performance, and administration.
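With time-based indices, searching recent data comes down to listing the right index names. The sketch below assumes the logs-YYYY-MM-DD naming used in the example above; the function daily_indices is a hypothetical helper, not part of Elasticsearch.

```python
from datetime import date, timedelta

def daily_indices(prefix, last_day, days):
    """Name the `days` most recent daily indices, oldest first."""
    return [
        "%s-%s" % (prefix, (last_day - timedelta(days=offset)).isoformat())
        for offset in range(days - 1, -1, -1)
    ]

recent = daily_indices("logs", date(2013, 6, 4), 2)
# A comma-separated list is what the index part of a search URL expects.
print(",".join(recent))
```

The comma-joined names can go straight into the index part of a search URL, so a search touches only the days you care about.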

2.4.2. Contents of the reply

In addition to the documents that match your search criteria, the reply of a search contains information that’s useful for checking the performance of your search or the relevance of the results.

You might have some questions about listing 2.2 regarding what the reply from Elasticsearch contains. What’s the score about? What happens if not all shards are available? Let’s look at each part of the reply shown in the following listing.

Listing 2.3. Search reply returning two fields of a single resulting document

As you can see, the JSON reply from Elasticsearch includes information on time, shards, hits statistics, and the documents you asked for. We’ll look at each of these in turn.

Time

The first items of a reply look something like this:

"took" : 2,
"timed_out" : false,

The took field tells you how long Elasticsearch needed to process your request. The time is in milliseconds. The timed_out field indicates whether your search timed out. By default, searches never time out, but you can specify a limit via the timeout parameter. For example, the following search times out after three seconds:

% curl "localhost:9200/get-together/group/_search\
?q=elasticsearch\
&pretty\
&timeout=3s"

If a search times out, the value of timed_out is true, and you get only results that were gathered until the search timed out.

Shards

The next bit of the response is information about shards involved in the search:

"_shards" : {
  "total" : 2,
  "successful" : 2,
  "failed" : 0
},

This might look natural to you because you searched in one index, which in this case has two shards. All shards replied, so successful is 2, which leaves failed with 0.

You might wonder what happens when a node goes down and a shard can’t reply to a search request. Take a look at figure 2.11, which shows a cluster of three nodes, each with only one shard and no replicas. If one node goes down, some data would be missing. In this case, Elasticsearch gives you the results from shards that are up and reports the number of shards unavailable for search in the failed field.

Figure 2.11. Partial results can be returned from shards that are still available.
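Because a reply can be incomplete for two reasons, a timeout or an unavailable shard, it can help to check both before trusting the results. The following Python sketch works on a reply that has already been parsed into a dictionary. The function is_partial is a hypothetical helper, and the second sample reply is made up to mirror the three-node scenario in figure 2.11.

```python
def is_partial(reply):
    """Return True if the reply may be missing matching documents."""
    shards = reply.get("_shards", {})
    return bool(reply.get("timed_out")) or shards.get("failed", 0) > 0

# Values taken from the reply fragments shown in this section.
complete = {"took": 2, "timed_out": False,
            "_shards": {"total": 2, "successful": 2, "failed": 0}}
# Invented for illustration: one of three shards did not answer.
one_shard_down = {"took": 2, "timed_out": False,
                  "_shards": {"total": 3, "successful": 2, "failed": 1}}

print(is_partial(complete))
print(is_partial(one_shard_down))
```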

Hits statistics

The last element of the reply is called hits and is quite lengthy because it contains an array of the matching documents. But before that array, it contains a couple of statistics:

"total" : 2,
"max_score" : 0.9066504

In total, you see the total number of matching documents, and in max_score, you see the maximum score of those matching documents.

Definition

The score of a document returned by a search is the measure of how relevant that document is for the given search criteria. As mentioned in chapter 1, by default, the score is calculated with the TF-IDF (term frequency-inverse document frequency) algorithm. Term frequency means for each term (word) you search, the document’s score is increased if it has more occurrences of that term. Inverse document frequency means the score is increased more if the term is rare across all documents because it’s considered more relevant. If the term occurs often in other documents, it’s probably a common term, and is thus less relevant. We’ll show you how to make your searches more relevant in chapter 6.
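To make the two ideas concrete, here is a deliberately simplified TF-IDF sketch in Python. It is not the formula Lucene uses, which adds normalization and other factors; it only shows that more occurrences of a term and rarer terms both raise the score. The function name toy_score and the three tiny documents are invented for illustration.

```python
import math

def toy_score(query_terms, doc_terms, corpus):
    """Sum tf * idf over the query terms (simplified, not Lucene's formula)."""
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)                     # occurrences in this document
        df = sum(1 for doc in corpus if term in doc)   # documents containing the term
        if tf == 0 or df == 0:
            continue
        idf = math.log(len(corpus) / df) + 1.0         # rarer term, larger idf
        score += tf * idf
    return score

# Hypothetical documents, already split into lowercase terms.
doc_a = ["elasticsearch", "denver"]
doc_b = ["elasticsearch", "elasticsearch", "san", "francisco"]
doc_c = ["big", "data", "san", "francisco"]
corpus = [doc_a, doc_b, doc_c]

print(toy_score(["elasticsearch"], doc_a, corpus))
print(toy_score(["elasticsearch"], doc_b, corpus))
print(toy_score(["denver"], doc_a, corpus))
```

Here doc_b outscores doc_a for “elasticsearch” because the term occurs twice in it, and “denver” outscores “elasticsearch” on doc_a because it appears in fewer documents.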

The total number of documents may not match the number of documents you see in the reply. By default, Elasticsearch limits the number of results to 10, so if you have more than 10 results, look at the value of total for the precise number of documents that match your search criteria. As you saw previously, to change the number of results returned, use the size parameter.
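The gap between total and the hits you actually received is easy to compute once the reply is parsed. This Python sketch assumes the reply shape shown in this section, where total is a plain number; the function unreturned_hits is a hypothetical helper and the sample reply is trimmed to the fields it reads.

```python
def unreturned_hits(reply):
    """How many matching documents were left out of this reply."""
    hits = reply["hits"]
    return hits["total"] - len(hits["hits"])

# Two documents match, but only one was requested (size=1).
reply = {"hits": {"total": 2, "max_score": 0.9066504,
                  "hits": [{"_id": "3", "_score": 0.9066504}]}}
print(unreturned_hits(reply))
```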

Resulting documents

The array of hits is usually the most interesting information in a reply:

"hits" : [ {
  "_index" : "get-together",
  "_type" : "group",
  "_id" : "3",
  "_score" : 0.9066504,
  "fields" : {
    "location" : [ "San Francisco, California, USA" ],
    "name" : [ "Elasticsearch San Francisco" ]
  }
} ]

Each matching document is shown with the index and type it belongs to, its ID, and its score. The values of the fields you specified in your search query are also shown. In listing 2.2, you used fields=name,location. If you don’t specify which fields you want, the _source field is shown. Like _all, _source is a special field, in which, by default, Elasticsearch stores the original JSON document. You can configure what gets stored in the source, and we explore that in chapter 3.
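Code that reads hits has to cope with both shapes: values under fields, where each value is wrapped in an array as in the reply above, or the original document under _source. A minimal Python sketch follows; hit_values is a hypothetical helper, and the second sample hit is invented to show the _source case.

```python
def hit_values(hit):
    """Return a flat dict of values from one hit, from fields or _source."""
    if "fields" in hit:
        # Requested fields come back wrapped in arrays; unwrap single values.
        return {name: values[0] if len(values) == 1 else values
                for name, values in hit["fields"].items()}
    return dict(hit.get("_source", {}))

# The hit shown above.
with_fields = {
    "_index": "get-together", "_type": "group", "_id": "3",
    "_score": 0.9066504,
    "fields": {"location": ["San Francisco, California, USA"],
               "name": ["Elasticsearch San Francisco"]},
}
# Invented for illustration: the same document when no fields were requested.
with_source = {
    "_index": "get-together", "_type": "group", "_id": "3",
    "_source": {"name": "Elasticsearch San Francisco"},
}
print(hit_values(with_fields))
print(hit_values(with_source))
```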

Tip

You can also limit which fields from the original document (_source) are shown, by using source filtering, as explained here: www.elastic.co/guide/en/elasticsearch/reference/master/search-request-source-filtering.html. You’d put these options in the JSON payload of your search, which is explained in the next section.

2.4.3. How to search

So far, you’ve searched through what’s called a URI request, so named because all your search options go into the URI. This is good for simple searches you run on the command line, but it’s safer to think of URI requests as shortcuts.

Normally, you’d put your query in the data part of your request. Elasticsearch allows you to specify all the search criteria in JSON format. As searches get more complex, JSON is much easier to read and write and offers a lot more functionality.

To send a JSON query for all groups that are about Elasticsearch, you could do this:

% curl 'localhost:9200/get-together/group/_search?pretty' -d '{
  "query": {
    "query_string": {
      "query": "elasticsearch"
    }
  }
}'

In plain English, this translates to “run a query of type query_string, where the string is elasticsearch.” It might seem like too much boilerplate to type in elasticsearch, but this is because JSON provides many more options than a URI request. As you’ll see in chapter 4, using a JSON query makes sense when you start to combine different types of queries: squeezing all those options in a URI would be more difficult to handle. Let’s explore each field.

Setting query string options

At the last level of the JSON request, you have "query": "elasticsearch", and you might think the "query" part is redundant because you already know it’s a query. But a query_string provides more options than the string itself.

For example, if you search for “elasticsearch san francisco”, Elasticsearch looks in the _all field by default. If you wanted to look in the group’s name instead, you’d specify

"default_field": "name"

Also by default, Elasticsearch returns documents matching any of the specified words (the default operator is OR). If you wanted to match all the words, you’d specify

"default_operator": "AND"

The revised query looks like this:

% curl 'localhost:9200/get-together/group/_search?pretty' -d '{
  "query": {
    "query_string": {
      "query": "elasticsearch san francisco",
      "default_field": "name",
      "default_operator": "AND"
    }
  }
}'

Another way to achieve the same results is to specify the field and the operator in the query string itself:

"query": "name:elasticsearch AND name:san AND name:francisco"

The query string is a powerful tool to specify your search criteria. Elasticsearch parses the string to understand the terms you’re looking for and your other options, such as fields and operators, and then runs the query. This functionality is inherited from Lucene.[1]

[1] If you want to find out more about the query string syntax, visit http://lucene.apache.org/core/4_9_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description.
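When the request body is built by a program rather than typed by hand, the query_string options map naturally onto a dictionary. The Python sketch below produces the same JSON as the revised query in this section; query_string_body is a hypothetical helper, and it covers only the three options discussed here.

```python
import json

def query_string_body(text, default_field=None, default_operator=None):
    """Build a query_string request body; unset options are left out."""
    options = {"query": text}
    if default_field:
        options["default_field"] = default_field
    if default_operator:
        options["default_operator"] = default_operator
    return {"query": {"query_string": options}}

body = query_string_body("elasticsearch san francisco",
                         default_field="name", default_operator="AND")
# This string is what you would pass to curl with -d.
print(json.dumps(body, indent=2))
```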

Choosing the right query type

If the query_string query type looks intimidating, the good news is there are many other types of queries, most of which are covered in chapter 4. For example, if you’re looking only for the term “elasticsearch” in the name field, a term query would be faster and more straightforward:

% curl 'localhost:9200/get-together/group/_search?pretty' -d '{
  "query": {
    "term": {
      "name": "elasticsearch"
    }
  }
}'

Using filters

So far, all the searches you’ve seen have been queries. Queries give you back results and each result has a score. If you’re not interested in the score, you can run a filtered query instead. We’ll talk more about the filtered query in chapter 4, but the key information is that filters care only whether a result matches the search or not. As a result, they’re faster and easier to cache than their query counterparts. For example, the following query looks for the term “elasticsearch” in the name field of group documents:

% curl 'localhost:9200/get-together/group/_search?pretty' -d '{
  "query": {
    "filtered": {
      "filter": {
        "term": { "name": "elasticsearch" }
      }
    }
  }
}'

The results are the same as the ones you get with the equivalent term query, but filter results aren’t sorted by score (because the score is 1.0 for all results).
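Seen side by side, the term query and its filtered counterpart differ only in how the term clause is wrapped. The Python sketch below builds both request bodies; the function names are hypothetical. The filtered form belongs to the Elasticsearch version described in this chapter; later releases replaced the filtered query with the filter clause of the bool query.

```python
import json

def term_query_body(field, value):
    """Scored search: each hit gets a relevance score."""
    return {"query": {"term": {field: value}}}

def filtered_term_body(field, value):
    """Unscored search: the filter only decides match or no match."""
    return {"query": {"filtered": {"filter": {"term": {field: value}}}}}

scored = term_query_body("name", "elasticsearch")
unscored = filtered_term_body("name", "elasticsearch")
print(json.dumps(scored))
print(json.dumps(unscored))
```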

Applying aggregations

In addition to queries and filters, you can do all sorts of statistics through aggregations. We look at aggregations in chapter 7, but let’s look at a simple example here.

Suppose a user is visiting your get-together website and wants to explore the kinds of groups that are available. You might want to show who the group organizers are. For example, if “Lee” comes up in the results as the organizer of seven meetings, a user who knows Lee might click his name to filter only those seven meetings.

To return people who are group organizers, you can use a terms aggregation. This shows counters for each term that appears in the field you specify—in this case, organizer. The aggregation might look like this:

% curl 'localhost:9200/get-together/group/_search?pretty' -d '{
  "aggregations" : {
    "organizers" : {
      "terms" : { "field" : "organizer" }
    }
  }
}'

In plain English, this request translates to “give me an aggregation named organizers, which is of type terms and is looking at the organizer field.” The following results display at the bottom of the reply:

"aggregations" : {
  "organizers" : {
    "buckets" : [ {
      "key" : "lee",
      "doc_count" : 2
    }, {
      "key" : "andy",
      "doc_count" : 1
    ....

The results show you that out of the six total terms, “lee” appears two times, “andy” one time, and so on. We have two groups organized by Lee. You could then search for the groups for which Lee is the organizer to narrow down your results.
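To turn such a reply into counters you can show next to each organizer, you can flatten the buckets into a dictionary. This is a minimal Python sketch: bucket_counts is a hypothetical helper, and the sample reply contains only the two buckets visible above, because the rest of the output is elided.

```python
def bucket_counts(reply, agg_name):
    """Map each bucket key of a terms aggregation to its document count."""
    buckets = reply["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}

# Only the buckets shown in the text; the real reply has more.
reply = {"aggregations": {"organizers": {"buckets": [
    {"key": "lee", "doc_count": 2},
    {"key": "andy", "doc_count": 1},
]}}}
counts = bucket_counts(reply, "organizers")
print(counts)
```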

Aggregations are useful when you can’t search for what you need because you don’t know what that is. What kinds of groups are available? Are there any events hosted near where I live? You can use aggregations to drill down in the available data and see real-time statistics.

At other times you have the opposite scenario. You know exactly what you need and you don’t want to run a search at all. That’s when it’s useful to retrieve a document by ID.

2.4.4. Getting documents by ID

To retrieve a specific document, you must know the index and type it belongs to and its ID. You then issue an HTTP GET request to that document’s URI:

% curl 'localhost:9200/get-together/group/1?pretty'
{
  "_index" : "get-together",
  "_type" : "group",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "name": "Denver Clojure",
    "organizer": ["Daniel", "Lee"]
....

The reply contains the index, type, and ID you specified. If the document exists, you’ll see that the found field is true, in addition to its version and its source. If the document doesn’t exist, found is false:

% curl 'localhost:9200/get-together/group/doesnt-exist?pretty'
{
  "_index" : "get-together",
  "_type" : "group",
  "_id" : "doesnt-exist",
  "found" : false
}

As you might expect, getting documents by ID is much faster and less expensive in terms of resources than searching. It’s also done in real time: as soon as an indexing operation is finished, the new document can be fetched through this GET API. By contrast, searches are near-real-time because they need to wait for a refresh, which by default happens every second.
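In code, the found flag is what tells a missing document apart from an existing one. The Python sketch below works on parsed replies like the two shown above; source_or_none is a hypothetical helper, and the first sample includes only the _source fields visible in the text, because the rest of that reply is elided.

```python
def source_or_none(reply):
    """Return the stored document if it was found, otherwise None."""
    return reply["_source"] if reply.get("found") else None

existing = {"_index": "get-together", "_type": "group", "_id": "1",
            "_version": 1, "found": True,
            "_source": {"name": "Denver Clojure",
                        "organizer": ["Daniel", "Lee"]}}
missing = {"_index": "get-together", "_type": "group",
           "_id": "doesnt-exist", "found": False}

print(source_or_none(existing))
print(source_or_none(missing))
```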

Now that you’ve seen how to do all the basic API requests, let’s take a look at how to change some basic configuration options.