List of Figures

Chapter 1. Introducing Elasticsearch

Figure 1.1. More occurrences of the searched terms usually rank the document higher.

Figure 1.2. Elasticsearch as the only back end storing and indexing all your data

Figure 1.3. Elasticsearch in the same system with another data store

Figure 1.4. Elasticsearch in a system of logging tools that support Elasticsearch out of the box

Figure 1.5. Analysis breaks text into words, both when you’re indexing and when you’re searching.

Figure 1.6. Checking out Elasticsearch from your browser

Chapter 2. Diving into the functionality

Figure 2.1. An Elasticsearch cluster from the application’s and administrator’s points of view

Figure 2.2. Logical layout of data in Elasticsearch: how an application sees data

Figure 2.3. A three-node cluster with an index divided into five shards with one replica per shard

Figure 2.4. Documents are indexed to random primary shards and their replicas. Searches run on complete sets of shards, regardless of their status as primaries or replicas.

Figure 2.5. Term dictionary and frequencies in a Lucene index

Figure 2.6. Multiple primary and replica shards make up the get-together index.

Figure 2.7. To improve performance, scale vertically (upper-right) or scale horizontally (lower-right).

Figure 2.8. Indexing operation is forwarded to the responsible shard and then to its replicas.

Figure 2.9. Search request is forwarded to primary/replica shards containing a complete set of data. Then results are aggregated and sent back to the client.

Figure 2.10. URI of a document in Elasticsearch

Figure 2.11. Partial results can be returned from shards that are still available.

Figure 2.12. One-node cluster shown in Elasticsearch kopf

Figure 2.13. Replica shards are allocated to the second node.

Figure 2.14. Elasticsearch automatically distributes shards across the growing cluster.

Chapter 3. Indexing, updating, and deleting data

Figure 3.1. Using types to divide data in the same index; searches can run in one, multiple, or all types.

Figure 3.2. After the default analyzer breaks strings into terms, subsequent searches match those terms.

Figure 3.3. Updating a document involves retrieving it, processing it, and re-indexing it while overwriting the previous document.

Figure 3.4. Without concurrency control, changes can get lost.

Figure 3.5. Concurrency control through versioning prevents one update from overriding another.

Chapter 4. Searching your data

Figure 4.1. How a search request is routed; the index consists of two shards and one replica per shard. After locating and scoring the documents, only the top 10 documents are fetched.

Figure 4.2. Filters require less processing and are cacheable because they don’t calculate the score.

Figure 4.3. Filter results are cached in bitsets, making subsequent runs much faster.

Chapter 5. Analyzing your data

Figure 5.1. Overview of the analysis process of a custom analyzer using standard components

Figure 5.2. Analyzer overview

Figure 5.3. Token filters accept tokens from the tokenizer and prep the data for indexing.

Chapter 6. Searching with relevancy

Figure 6.1. Term frequency is how many times a term appears in a document.

Figure 6.2. Inverse document frequency checks to see if a term occurs in a document, not how often it occurs.

Figure 6.3. Lucene’s scoring formula for a score given a query and document

Figure 6.4. Linear curve—scores decrease from the origin at the same rate.

Figure 6.5. Gauss curve—scores decrease more slowly until the scale point is reached and then they decrease faster.

Figure 6.6. Exponential curve—scores drastically drop from the origin.

Chapter 7. Exploring your data with aggregations

Figure 7.1. Example use case of aggregations: top tags for get-together groups

Figure 7.2. The terms bucket aggregation allows you to nest other aggregations within it.

Figure 7.3. A filter wrapped in a filtered query runs first and restricts both results and aggregations.

Figure 7.4. Post filter runs after the query and doesn’t affect aggregations.

Figure 7.5. Major types of multi-bucket aggregations

Figure 7.6. A terms aggregation can be used to get term frequencies and generate a word cloud.

Figure 7.7. Sometimes the overall top X is inaccurate because only the top X terms are returned per shard.

Figure 7.8. Reducing inaccuracies by increasing shard_size

Figure 7.9. range aggregations give you counts of documents for each range. This is good for pie charts.

Figure 7.10. Nesting a date histogram aggregation under a terms aggregation

Figure 7.11. Nesting the top_hits aggregation under a terms aggregation to get result grouping

Figure 7.12. Nesting aggregations under the global aggregation makes them run on all documents.

Figure 7.13. The filter aggregation restricts query results for aggregations nested under it.

Chapter 8. Relations among documents

Figure 8.1. Inner object boundaries aren’t accounted for when storing, leading to unexpected results.

Figure 8.2. The nested type makes Elasticsearch index objects as separate Lucene documents.

Figure 8.3. Different types of Elasticsearch documents can have parent-child relationships.

Figure 8.4. Denormalizing is the technique of multiplying data to avoid costly relations.

Figure 8.5. You can keep your data normalized and do the joins in your application.

Figure 8.6. JSON hierarchical structure stored as a flat structure in Lucene

Figure 8.7. You can search in an object’s field by specifying that field’s full path.

Figure 8.8. A block of documents in Lucene storing the Elasticsearch document with nested-type objects

Figure 8.9. With include_in_root, fields of nested documents are indexed in the root document, too.

Figure 8.10. include_in_parent indexes a nested document’s field into the immediate parent, too.

Figure 8.11. Nested aggregation doing necessary joins for other aggregations to work on the indicated path

Figure 8.12. The relationship between events and groups as it’s defined in the mapping

Figure 8.13. The _parent field of each child document is pointing to the _id field of its parent.

Figure 8.14. The has_child filter first runs on children and then aggregates the results into parents, which are returned.

Figure 8.15. Hierarchical relationship (nested or parent-child) between different Lucene documents

Figure 8.16. Hierarchical relationship denormalized by copying group information to each event

Figure 8.17. Joining documents across nodes is difficult because of network latency.

Figure 8.18. Nested/parent-child relations make sure all joins are local.

Figure 8.19. Many-to-many relationships can contain a huge amount of data, making local joins impossible.

Figure 8.20. Many-to-many relation denormalized into multiple one-to-many relations, allowing local joins

Figure 8.21. Application-side joins require you to run two queries.

Chapter 9. Scaling out

Figure 9.1. Shard allocation for the test index for one node transitioning to two nodes

Figure 9.2. Shard allocation for the test index with three Elasticsearch nodes

Figure 9.3. Elasticsearch using multicast discovery to discover other nodes in the cluster

Figure 9.4. Elasticsearch using unicast discovery to discover other nodes in the cluster

Figure 9.5. Cluster fault detection by the master node

Figure 9.6. Turning replica shards into primaries after node loss

Figure 9.7. Re-creating replica shards after losing a node

Figure 9.8. A single node with a single shard and two nodes trying to scale a single shard

Figure 9.9. Additional replicas handling search and aggregations

Chapter 10. Improving performance

Figure 10.1. Bulk indexing allows you to send multiple documents in the same request.

Figure 10.2. A flush moves segments from memory to disk and clears the transaction log.

Figure 10.3. A flush is triggered when the memory buffer or transaction log is full or at an interval.

Figure 10.4. Tiered merge policy performs a merge when it finds too many segments in a tier.

Figure 10.5. Optimizing makes sense for indices that don’t get updates.

Figure 10.6. By default, the terms filter checks which documents match each term and intersects the lists.

Figure 10.7. Field data execution means iterating through documents but no list intersections.

Figure 10.8. The shard query cache is higher-level than the filter cache.

Figure 10.9. Ngrams generate more terms than you need with fuzzy queries, but they match exactly.

Figure 10.10. A prefix query has to match more terms but works with a smaller index than edge ngrams.

Figure 10.11. You can use the reverse and edge ngram token filters to match suffixes.

Figure 10.12. Using shingles to match compound words

Figure 10.13. Counting members in a script or while indexing

Figure 10.14. Comparison between query_and_fetch and query_then_fetch

Figure 10.15. Uneven distribution of DF can lead to incorrect ranking.

Figure 10.16. dfs search types use an extra network hop to compute global DFs, which are used for scoring.

Chapter 11. Administering your cluster

Figure 11.1. Cluster with default allocation settings

Figure 11.2. Cluster with allocation awareness

Figure 11.3. Yellow status solved by making nodes accessible

Figure 11.4. Elasticsearch keeps runtime data and caches in memory, so writes and reads can be expensive.

Appendix A. Working with geospatial data

Figure A.1. You can filter only points that fall in a certain range from a specified location.

Figure A.2. You can filter points based on whether they fall within a rectangle on the map.

Figure A.3. The world divided into 32 letter-coded cells. Each cell is further divided into 32 cells, and so on, making longer hashes.

Figure A.4. Shapes represented in geohashes. Searching for shapes matching shape 1 will return shape 2.

Appendix B. Plugins

Figure B.1. Example of the kopf plugin

Figure B.2. Screenshot of the elasticsearch-head plugin

Appendix C. Highlighting

Figure C.1. Highlighting shows why a document matched a query.

Figure C.2. The lack of fragment encoding can make the browser interpret HTML incorrectly.

Figure C.3. Using the HTML encoder avoids parsing mistakes.

Appendix D. Elasticsearch monitoring plugins

Figure D.1. Website: http://bigdesk.org/ License: Apache License v2.0

Figure D.2. Bigdesk makes visualizing the get-together cluster easy.

Figure D.3. Website: www.elastichq.org/ License: Apache License v2.0

Figure D.4. Node diagnostics screen

Figure D.5. Website: https://github.com/mobz/elasticsearch-head License: Apache License v2.0

Figure D.6. Website: https://github.com/lmenezes/elasticsearch-kopf License: MIT

Figure D.7. Website: www.elastic.co/overview/marvel/ License: Commercial

Figure D.8. Autocompletion of REST calls

Figure D.9. Website: www.sematext.com License: Commercial

Figure D.10. Alerts and notifications configuration

Appendix E. Turning search upside down with the percolator

Figure E.1. Typical use case: percolating a document enables the application to send alerts to users if their stored queries match the document.

Figure E.2. You need a mapping and some registered queries in order to percolate documents.

Figure E.3. Percolator for automated tagging. The multi percolate and bulk APIs reduce the number of requests. Before step 1, the percolation queries have been indexed. In step 1 you use the multi percolate API to find matching percolation queries. The application maps the IDs to the tags and adds them to the documents to index. In step 2 you use the bulk index API to index the documents.

Figure E.4. A percolate request with routing reduces the number of queries and also hits fewer shards.

Appendix F. Using suggesters for autocomplete and did-you-mean functionality

Figure F.1. Spell checking by Google

Figure F.2. Autocomplete on Google

Figure F.3. Candidate suggestions are ranked based on the shingles field.

Figure F.4. Using filters and two direct generators to correct both prefix and suffix typos

Figure F.5. Stupid Backoff discounts the score of lower-order shingles.

Figure F.6. In-memory FSTs help you get fast suggestions based on a prefix.

Figure F.7. Instant search lets you jump to the result without running an actual search.