4.2. Introducing the query and filter DSL

In the previous section we discussed the basic components of a search request. We talked about the number of items to return and about supporting pagination using from and size. We also discussed sorting and filtering the fields of the source to return. In this section we explain the one basic component we haven’t discussed at length yet: the query component. So far you’ve used a basic query component, the match_all query. Check the following listing to see it in action.

Listing 4.12. Basic search request using request body
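
A minimal sketch of such a request, assuming a node on localhost:9200 and the get-together index used in this chapter. The values chosen for from, size, _source, and sort are illustrative assumptions, not the exact contents of the listing.

```shell
# Request body: match every document, return the first 10 hits,
# keep only two source fields, and sort by creation date.
# All values below are illustrative assumptions.
BODY='{
  "query": {
    "match_all": {}
  },
  "from": 0,
  "size": 10,
  "_source": ["name", "description"],
  "sort": [
    { "created_on": "desc" }
  ]
}'

# Print the body so it can be inspected without a running cluster.
printf '%s\n' "$BODY"

# To send it to a local node, uncomment the next line:
# curl 'localhost:9200/get-together/group/_search' -d "$BODY"
```

The query element is the part the rest of this section replaces; the other elements stay the same whatever query you use.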

In this section you’re replacing the match_all query with a match query, and you’re going to add a term filter from the filter DSL to the search request using the filtered query of the query DSL. After that we dive into what makes filters different from queries. Next, we take a look at some other basic queries and filters. We wrap up the section with compound queries and other more advanced queries and filters. Then, before moving to analyzers, we help you choose the right query for the job.

4.2.1. Match query and term filter

So far almost every search request that you ran returned all documents. In this section we show two ways to limit the number of documents to return. We start with a match query to find events containing the word “Hadoop” in the title. The following code listing shows this search request.

Listing 4.13. match query
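
A minimal sketch of this search, assuming events are stored under an event type in the get-together index and that the field is named title (both names are assumptions here):

```shell
# match query: find documents whose title field contains "hadoop".
# The field name "title" and the event type in the URL are assumptions.
BODY='{
  "query": {
    "match": {
      "title": "hadoop"
    }
  }
}'

# Print the body so it can be inspected without a running cluster.
printf '%s\n' "$BODY"

# To send it to a local node, uncomment the next line:
# curl 'localhost:9200/get-together/event/_search' -d "$BODY"
```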

The query returns three events in total. The structure of the response was explained previously in section 4.1.4. If you’re following along, look at the score for the first match. The first match is the document with the title “Using Hadoop with Elasticsearch.” The score for this document is 1.3958796. You can change the search to look for the word “Hadoop” with a capital H; the result will be the same. Try it if you don’t believe us.

Now imagine you have a website that groups events by host, so you get a list of aggregates with a count of the number of events per host. After clicking on Andy in that list, you want to find all events hosted by Andy. You can create a search request with a match query looking for Andy in the host field. If you create this search request and execute it, you’ll see there are three events hosted by Andy, all having the same score. We hear you ask, “Why?” Chapter 6 explains how scoring works. For now, this is the right moment to introduce filters.

Filters are similar to the queries we discuss in this chapter, but they differ in how they affect the scoring and performance of many search actions. Rather than computing the score for a particular term as queries do, a filter on a search returns a simple binary “does this document match this query” yes-or-no answer. Figure 4.2 shows the main difference between queries and filters.

Figure 4.2. Filters require less processing and are cacheable because they don’t calculate the score.

Because of this difference, filters can be faster than regular queries, and their results can be cached. A search using a filter looks similar to a regular search using a query, but the query is replaced with a "filtered" map that contains the original query and a filter to be applied, as shown in the next listing. This query is called the filtered query in the query DSL. A filtered query contains two components: the query and the filter.

Listing 4.14. Query using a filter
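
A minimal sketch of a filtered query, assuming the same event type and title field as before and a host field holding the term "andy" (field and type names are assumptions):

```shell
# filtered query: the "query" part is scored, the "filter" part only
# answers yes or no for each document.
# Field names and the event type in the URL are assumptions.
BODY='{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "title": "hadoop"
        }
      },
      "filter": {
        "term": {
          "host": "andy"
        }
      }
    }
  }
}'

# Print the body so it can be inspected without a running cluster.
printf '%s\n' "$BODY"

# To send it to a local node, uncomment the next line:
# curl 'localhost:9200/get-together/event/_search' -d "$BODY"
```

The filtered query belongs to the Elasticsearch versions this chapter targets; later releases express the same idea with the filter clause of the bool query.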

Here a regular query for events matching “hadoop” is used as the query, but in addition to the query for the word “Hadoop,” a filter is used to limit the events. Inside this particular filter section, a term filter is applied for all documents that have the host "andy". Behind the scenes, Elasticsearch constructs a bitset, which is a binary set of bits denoting whether the document matches this filter. Figure 4.3 shows what this bitset looks like.

Figure 4.3. Filter results are cached in bitsets, making subsequent runs much faster.

After constructing the bitset, Elasticsearch can now use it to filter (hence the name!) out the documents that it shouldn’t be searching based on the query part of the search. The filter limits the number of documents for which a score needs to be calculated. The score for the limited set of documents is calculated based on the query. Because of this, adding a filter can be much faster than combining the entire query into a single search. Depending on what kind of filter is used, Elasticsearch can cache the results in a bitset. If the filter is used for another search, the bitset doesn’t have to be calculated again!

Other types of filters aren’t automatically cached if Elasticsearch can tell they’ll never be used again or if the bitsets are trivial to recreate. An example of a filter that’s hard to cache is one that limits the results to all documents of the last hour. This filter changes every second when you execute it, and therefore there’s no reason to cache it. Check listing 4.17 to see an example. Additionally, Elasticsearch gives you the ability to manually specify whether a filter should be cached. All of this translates into faster searches with filters. Therefore, you should make parts of your query into filters if you can.

We’ll revisit bitsets to explain the details of how they work and how they affect performance in chapter 10, which discusses ways to speed up searches. Now that you understand what filters are, we’ll cover several different types of filters and queries, and you’ll run some searches against data.

4.2.2. Most used basic queries and filters

Although there are a number of ways to query for things in Elasticsearch, some may be better than others depending on how the data is stored in your index. In this section, you learn the different types of queries Elasticsearch supports and try out an example of how to use each query. We assess the pros and cons of using each query and provide performance notes about each one so you can determine which query best fits your data.

In the previous sections of this chapter, a number of queries and filters were already introduced. You started with the match_all query to return all documents, moved on to the match query to limit results to documents that contain a word in a field, and used the term filter to limit the results using a term in a field. One query that we didn’t discuss but that you did use is the query_string query. This query was used in the URL-based search. More on this later in this section.

In this section we recap these queries but now introduce some more advanced options. We also look at more advanced queries and filters like the range filter, the prefix query, and the simple_query_string query. Let’s start with the easiest queries, beginning with the match_all query.

Match_all query

We’ll give you a guess as to what this query does. That’s correct! It matches all documents. The match_all query is useful when you want to use a filter instead of a query (perhaps if you don’t care about the score of documents at all) or you want to return all documents among the indices and types you’re searching. The query looks like this:

% curl 'localhost:9200/_search' -d '
{
  "query": {
    "match_all": {}
  }
}'

To use a filter for a search instead of any regular query parts, the query looks something like this (with the filters omitted):

% curl 'localhost:9200/get-together/_search' -d '
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        ... filter details ...
      }
    }
  }
}'

Simple, huh? Not too useful, though, for a search engine, because users rarely search for everything. You can even make this search request simpler: match_all is the default query, so the query element can be left out completely in this case. Next, let’s look at a query that’s a bit more useful.

Query_string query

In chapter 2, you used the query_string query to see how easy it is to get an Elasticsearch server up and running, but we’ll cover it again in more detail so you can see how it compares to the other queries.

As shown in the following listing, a query_string search can be performed either from the URL of the request or sent in a request body. In this example, you search for documents that contain “nosql.” The query should return one document.

Listing 4.15. Example query_string search
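
A minimal sketch of both forms, assuming a node on localhost:9200 and a search across the whole get-together index (the exact URL used in the listing is an assumption):

```shell
# URL-based form: the query string goes in the q parameter.
URL='localhost:9200/get-together/_search?q=nosql&pretty'

# Request-body form of the same search.
BODY='{
  "query": {
    "query_string": {
      "query": "nosql"
    }
  }
}'

# Print both so they can be inspected without a running cluster.
printf '%s\n' "$URL"
printf '%s\n' "$BODY"

# To run them against a local node, uncomment the next lines:
# curl "$URL"
# curl 'localhost:9200/get-together/_search?pretty' -d "$BODY"
```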

By default, a query_string query searches the _all field, which, if you recall from chapter 3, is made up of all the fields combined. You can change this by either specifying a field with the query, such as description:nosql, or by specifying a default_field with the request, as shown in the next listing.

Listing 4.16. Specifying a default_field for a query_string search
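
A minimal sketch of such a request, assuming description is the field you want to search by default (the choice of field is an assumption):

```shell
# query_string with a default_field: terms without an explicit field
# prefix are searched in "description" instead of _all.
BODY='{
  "query": {
    "query_string": {
      "default_field": "description",
      "query": "nosql"
    }
  }
}'

# Print the body so it can be inspected without a running cluster.
printf '%s\n' "$BODY"

# To send it to a local node, uncomment the next line:
# curl 'localhost:9200/get-together/_search' -d "$BODY"
```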

As you may have guessed, this syntax offers more than searching for a single word. Under the hood, this is the entire Lucene query syntax, which allows you to combine searches for different terms with Boolean operators like AND and OR, as well as to exclude documents from the results using the minus sign (-) operator. The following query searches for all groups with “nosql” in the name but without “mongodb” in the description:

name:nosql AND -description:mongodb

To search for all search and Lucene groups created between 1999 and 2001, you could use the following:

(tags:search OR tags:lucene) AND created_on:[1999-01-01 TO 2001-01-01]

Note

Refer to www.lucenetutorial.com/lucene-query-syntax.html for a full description of the syntax the query_string query supports.

Query_string cautions

Although the query_string query is one of the most powerful queries available to you in Elasticsearch, it can sometimes be one of the hardest to read and extend. It may be tempting to allow your users the ability to specify their own queries with this syntax, but consider the difficulty in explaining the meaning of complex queries such as this:

name:search^2 AND (tags:lucene OR tags:"big data"~2) AND -description:analytics AND created_on:[2006-05-01 TO 2007-03-29]

One big disadvantage of the query_string query is that very power. Giving your website users this power might put your Elasticsearch cluster at risk. If users enter queries with the wrong format, they’ll get back exceptions; it’s also possible to write combinations that return nearly everything in the index and that way put your cluster at risk. See the previous note for an example.

Suggested replacements for the query_string query include the term, terms, match, or multi_match queries, all of which allow you to search for strings within a field or fields in a document. Another good replacement is the simple_query_string query, which is meant to give easy access to a query syntax using +, -, AND, and OR. More on these queries in the sections that follow.

Term query and term filter

term queries and filters are some of the simplest queries that can be performed, allowing you to specify a field and term to search for within your documents. Note that because the term being searched for isn’t analyzed, it must match a term in the document exactly for the result to be found. We’ll cover how exactly tokens, which are individual pieces of text indexed by Elasticsearch, get analyzed in chapter 5. If you’re familiar with Lucene, it might be helpful to know that the term query maps directly to Lucene’s TermQuery.

The following listing shows a term query that searches for groups with the elasticsearch tag.

Listing 4.17. Example term query
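
A minimal sketch of such a term query, assuming the tags live in a field named tags and limiting the returned source to name and tags (the _source selection is an assumption added to keep the response short):

```shell
# term query: the term is not analyzed, so it must match an indexed
# term in the tags field exactly.
BODY='{
  "query": {
    "term": {
      "tags": "elasticsearch"
    }
  },
  "_source": ["name", "tags"]
}'

# Print the body so it can be inspected without a running cluster.
printf '%s\n' "$BODY"

# To send it to a local node, uncomment the next line:
# curl 'localhost:9200/get-together/group/_search' -d "$BODY"
```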

Like the term query, a term filter can be used when you want to limit the results to documents that contain the term but without affecting the score. Compare the scores of the documents in the previous listing with the scores in the following listing: you’ll notice that the filter doesn’t calculate the score and therefore doesn’t influence it; due to the match_all query, the score for all documents is 1.0.

Listing 4.18. Example term filter
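
A minimal sketch of the same search written as a term filter, wrapped in a filtered query with match_all as the query part (the tags field name and the _source selection are assumptions):

```shell
# term filter inside a filtered query: match_all gives every document
# the same score, and the filter keeps only groups tagged elasticsearch.
BODY='{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "tags": "elasticsearch"
        }
      }
    }
  },
  "_source": ["name", "tags"]
}'

# Print the body so it can be inspected without a running cluster.
printf '%s\n' "$BODY"

# To send it to a local node, uncomment the next line:
# curl 'localhost:9200/get-together/group/_search' -d "$BODY"
```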

Terms query

Similar to the term query, the terms query (note the s!) can search for multiple terms in a document’s field. For example, the following listing searches for groups by a tag matching either “jvm” or “hadoop.”

Listing 4.19. Searching for multiple terms with the terms query
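
A minimal sketch of such a terms query, assuming the same tags field (the _source selection is an assumption added to keep the response short):

```shell
# terms query: a document matches if its tags field contains
# at least one of the listed terms.
BODY='{
  "query": {
    "terms": {
      "tags": ["jvm", "hadoop"]
    }
  },
  "_source": ["name", "tags"]
}'

# Print the body so it can be inspected without a running cluster.
printf '%s\n' "$BODY"

# To send it to a local node, uncomment the next line:
# curl 'localhost:9200/get-together/group/_search' -d "$BODY"
```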

To force a minimum number of matching terms to be in a document before it matches the query, specify the minimum_should_match parameter:

% curl 'localhost:9200/get-together/group/_search' -d '
{
  "query": {
    "terms": {
      "tags": ["jvm", "hadoop", "lucene"],
      "minimum_should_match": 2
    }
  }
}'

If you’re thinking, “Wait! That’s pretty limited!” you’re probably also wondering what happens when you need to combine multiple queries into a single query. Combining multiple term queries is discussed in section 4.3, which covers compound queries.

4.2.3. Match query and term filter

Similar to the term query, the match query is a hash map containing the field you’d like to search, which can be either a specific field or the special _all field to search all fields at once, and the string you want to search for. Here’s an example match query, searching for groups where name contains “elasticsearch”:

% curl 'localhost:9200/get-together/group/_search' -d '
{
  "query": {
    "match": {
      "name": "elasticsearch"
    }
  }
}'

The match query can behave in a number of different ways; the two most important behaviors are boolean and phrase.

Boolean query behavior

By default, the match query uses Boolean behavior and the OR operator. For example, if you search for the text “Elasticsearch Denver,” Elasticsearch searches for “Elasticsearch OR Denver,” which would match get-together groups from both “Elasticsearch Amsterdam” and “Denver Clojure Group.”

To search for results that contain both “Elasticsearch” and “Denver,” change the operator: turn the value of the field in the match query into a map that holds the query string, and set its operator field to and.
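
A minimal sketch of the match query with the and operator, assuming the group name is stored in the name field as in the earlier example:

```shell
# match query with operator "and": every term in the query string
# must be present in the name field.
BODY='{
  "query": {
    "match": {
      "name": {
        "query": "Elasticsearch Denver",
        "operator": "and"
      }
    }
  }
}'

# Print the body so it can be inspected without a running cluster.
printf '%s\n' "$BODY"

# To send it to a local node, uncomment the next line:
# curl 'localhost:9200/get-together/group/_search' -d "$BODY"
```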

The second important way a match query can behave is as a phrase query.

Phrase query behavior

A phrase query is useful when searching for a specific phrase within a document, with some amount of leeway between the positions of each word. This leeway is called slop, which is a number representing the distance between tokens in a phrase. Say you’re trying to remember the name of a get-together group; you remember it had the words “Enterprise” and “London” in it, but you don’t remember the rest of the name. You could search for the phrase “enterprise london” with slop set to 1 or 2 instead of the default of 0 to find results containing that phrase without having to know the exact title of the group.
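
A minimal sketch of such a phrase search, written with the match_phrase query and a slop of 1 (the name field and the _source selection are assumptions):

```shell
# match_phrase with slop: the two words must appear in order, with at
# most one position of leeway between them.
BODY='{
  "query": {
    "match_phrase": {
      "name": {
        "query": "enterprise london",
        "slop": 1
      }
    }
  },
  "_source": ["name", "description"]
}'

# Print the body so it can be inspected without a running cluster.
printf '%s\n' "$BODY"

# To send it to a local node, uncomment the next line:
# curl 'localhost:9200/get-together/group/_search' -d "$BODY"
```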

4.2.4. Phrase_prefix query

Similar to the match_phrase query, the match_phrase_prefix query lets you search for a phrase, but it goes one step further by allowing prefix matching on the last term in the phrase. This behavior is extremely useful for providing a running autocomplete for a search box, where the user gets search suggestions while typing a search term. When using the search for this kind of behavior, it’s a good idea to limit the number of expansions for the prefix with the max_expansions setting so the search returns in a reasonable amount of time.

For example, suppose “elasticsearch den” is used as the phrase_prefix query. Elasticsearch takes the “den” text and looks across all the values of the name field to check for those that start with “den” (“Denver,” for example). Because this could potentially be a large set, the number of expansions should be limited.
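
A minimal sketch of such a query, assuming the name field and an illustrative max_expansions value of 1 (pick a limit that suits your data):

```shell
# match_phrase_prefix: "den" is treated as a prefix of the last term,
# and max_expansions caps how many matching terms are tried.
# The value 1 is an illustrative assumption.
BODY='{
  "query": {
    "match_phrase_prefix": {
      "name": {
        "query": "Elasticsearch den",
        "max_expansions": 1
      }
    }
  },
  "_source": ["name"]
}'

# Print the body so it can be inspected without a running cluster.
printf '%s\n' "$BODY"

# To send it to a local node, uncomment the next line:
# curl 'localhost:9200/get-together/group/_search' -d "$BODY"
```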

The Boolean and phrase queries are a great choice for accepting user input; they allow you to pass in user input in a much less error-prone way, and unlike a query_string query, a match query won’t choke on reserved characters like +, -, ?, and !.

Matching multiple fields with multi_match

Although it might be tempting to think that the multi_match query behaves like the terms query by searching for multiple matches in a field, its behavior is slightly different. Instead, it allows you to search for a value across multiple fields. This can be helpful in the get-together example where you may want to search for a string across both the name of the group and the description:

% curl 'localhost:9200/get-together/_search' -d '
{
  "query": {
    "multi_match": {
      "query": "elasticsearch hadoop",
      "fields": [ "name", "description" ]
    }
  }
}'

Just as the match query can be turned into a phrase query, a prefix query, or a phrase_prefix query, the multi_match query can be turned into a phrase query or phrase_prefix query as well by specifying the type key. Think of the multi_match query as working exactly like the match query, except that you can specify multiple fields for searching instead of a single field only.
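
A minimal sketch of a multi_match query turned into a phrase query through the type key, reusing the fields from the earlier example (the query text is an illustrative assumption):

```shell
# multi_match with type "phrase": the words must appear as a phrase
# in at least one of the listed fields.
BODY='{
  "query": {
    "multi_match": {
      "query": "elasticsearch hadoop",
      "type": "phrase",
      "fields": [ "name", "description" ]
    }
  }
}'

# Print the body so it can be inspected without a running cluster.
printf '%s\n' "$BODY"

# To send it to a local node, uncomment the next line:
# curl 'localhost:9200/get-together/_search' -d "$BODY"
```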

With all the different match queries, it’s possible to find a way to search for almost anything, which is why the match query and its relatives are considered the go-to query type for most uses. We highly recommend that you use them whenever possible. For everything else, however, we’ll cover some of the other types of queries that Elasticsearch supports.