A.3. Filter and aggregate based on distance

Let’s say you’re looking for events within a certain range from where you are, as in figure A.1.

Figure A.1. You can filter only points that fall in a certain range from a specified location.

To filter such events, you’d use the geo distance filter. The parameters it needs are your reference location and the limiting distance, as shown here:

% curl 'localhost:9200/get-together/event/_search?pretty' -d '{
  "query": {
    "filtered": {
      "filter": {
        "geo_distance": {
          "distance": "50km",
          "location.geolocation": "40.0,-105.0"
        }
      }
    }
  }
}'

In this default mode, Elasticsearch will calculate the distance from 40.0,-105.0 to each event’s geolocation and return only those that are within 50 km. You can set the way the distance is calculated via the distance_type parameter, which goes next to the distance parameter. You have three options:

sloppy_arc (default): It calculates the distance between the two points by doing a faster approximation of an arc of a circle. This is a good option for most situations.

arc: It actually calculates the arc of a circle, making it slower but more precise than sloppy_arc. Note that you don’t get 100% precision here, either, because the Earth isn’t perfectly round. Still, if you need precision, this is the best option.

plane: This is the fastest but least precise implementation because it assumes the surface between the two points is a plane. This option works well when you have many documents and the distance limit is fairly small.
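To get a feel for the trade-off between arc and plane, here is a minimal Python sketch. It is not Elasticsearch's implementation: the haversine formula stands in for arc, an equirectangular projection stands in for plane, and the 6371 km Earth radius and the function names are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not an Elasticsearch constant


def arc_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance on a sphere (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def plane_distance_km(lat1, lon1, lat2, lon2):
    """Flat-surface approximation: treat lat/lon differences as x/y offsets."""
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_KM * math.hypot(x, y)


# Compare both methods for a nearby point and for a far-away point.
for target in [(40.3, -104.7), (51.5, -0.1)]:
    print(target,
          round(arc_distance_km(40.0, -105.0, *target), 2),
          round(plane_distance_km(40.0, -105.0, *target), 2))
```

For the nearby point the two results are nearly identical, which is why plane can be a reasonable choice when the distance limit is small. For the far-away point they diverge noticeably.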

Performance optimization doesn’t end with distance algorithms. There’s another parameter to the geo distance filter called optimize_bbox. bbox stands for bounding box, which is a rectangle that you define on a map that contains all the points and areas of interest.

Using optimize_bbox will first check if events match a square that contains the circle describing the distance range. If they match, Elasticsearch filters further by calculating the distance.

If you’re asking yourself whether the bounding box optimization is actually worth it, you’ll be happy to know that for most cases, it is. Verifying whether a point belongs to a bounding box is much faster than calculating the distance and comparing it to your limit.
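The two-step idea (cheap rectangle check first, exact distance only for the survivors) can be sketched in Python as follows. This is an illustration under stated assumptions, not Elasticsearch's code: the function names are hypothetical, the Earth is treated as a sphere of radius 6371 km, and the sketch ignores the poles and the 180° meridian.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bounding_square(lat, lon, radius_km):
    """Return (min_lat, max_lat, min_lon, max_lon) enclosing the distance circle."""
    angular = radius_km / EARTH_RADIUS_KM
    dlat = math.degrees(angular)
    dlon = math.degrees(math.asin(math.sin(angular) / math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def within_distance(points, origin, radius_km):
    """Keep points within radius_km of origin, using the box as a pre-filter."""
    min_lat, max_lat, min_lon, max_lon = bounding_square(origin[0], origin[1], radius_km)
    hits = []
    for lat, lon in points:
        # Step 1: four comparisons reject most far-away points cheaply.
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            continue
        # Step 2: only points inside the box pay for the trigonometry.
        if haversine_km(origin[0], origin[1], lat, lon) <= radius_km:
            hits.append((lat, lon))
    return hits


# Illustrative points; the third is inside the box but outside the circle.
points = [(40.2, -105.1), (40.0, -104.5), (40.4, -104.5), (45.0, -105.0)]
print(within_distance(points, (40.0, -105.0), 50.0))
```

The third point shows why the second step is still needed: it sits in a corner of the square, farther than 50 km from the origin.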

It’s also configurable. The default value is memory, and you can set it to indexed instead. You can also set optimize_bbox to none and check whether your query times are faster or slower.

Are you curious about what the difference between memory and indexed is? We’ll discuss this difference in the beginning of the next section. If you’re not curious and you don’t want to obsess about performance improvements, sticking with the default should be good enough for most cases.

Distance range filter

The geo distance range filter allows you, for example, to search for events between 50 and 100 kilometers from where you are. Besides its from and to distance options, it accepts the same parameters as the geo distance filter:

"filter": {

"geo_distance_range": {

"from": "50km",

"to": "100km",

"location.geolocation": "40.0,-105.0"

}

Distance range aggregation

Users will probably search for events farther from their point of reference because the ones they found close by weren’t satisfying—for example, if the events’ dates are too far in the future. In such situations, it might be handy for the user to see in advance how many events are, say, within 50 km, between 50 and 100, between 100 and 200, and so on.

For this use case, the geo distance range aggregation will come in handy. It looks similar to the range and date range aggregations you saw in chapter 7. In this case, you’ll specify a reference point (origin) and the distance ranges you need:

"aggs" : {

"events_ranges" : {

"geo_distance" : {

"field" : "location.geolocation",

"origin" : "40.0, -105.0",

"unit": "km",

"ranges" : [

{ "to" : 100 },

{ "from" : 100, "to" : 5000 },

{ "from" : 5000 }

]

}

Elasticsearch will return how many events it finds for each distance range:

"aggregations" : {

"events_ranges" : {

"buckets" : [ {

"key" : "*-100.0",

"from" : 0.0,

"to" : 100.0,

"doc_count" : 8

}, {

"key" : "100.0-5000.0",

"from" : 100.0,

"to" : 5000.0,

"doc_count" : 3

}, {

"key" : "5000.0-*",

"from" : 5000.0,

"doc_count" : 3

} ]

}
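Conceptually, the aggregation assigns each document's distance from the origin to a range and counts the documents per range. The Python sketch below mimics that bucketing on a list of precomputed distances. The function name and the sample distances are made up for illustration; the sketch treats from as inclusive and to as exclusive, as range aggregations do.

```python
def bucket_by_distance(distances_km, ranges):
    """Count distances per range; 'from' is inclusive, 'to' is exclusive."""
    buckets = []
    for r in ranges:
        lower = float(r["from"]) if "from" in r else None
        upper = float(r["to"]) if "to" in r else None
        count = sum(
            1 for d in distances_km
            if (lower is None or d >= lower) and (upper is None or d < upper)
        )
        # Build a key in the same style as the response shown above.
        key = f"{'*' if lower is None else lower}-{'*' if upper is None else upper}"
        buckets.append({"key": key, "doc_count": count})
    return buckets


# Hypothetical distances (in km) from the origin to five events.
distances = [3.2, 41.0, 99.9, 100.0, 7200.5]
ranges = [{"to": 100}, {"from": 100, "to": 5000}, {"from": 5000}]
for bucket in bucket_by_distance(distances, ranges):
    print(bucket)
```

Note that the event at exactly 100.0 km lands in the second bucket, not the first.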

So far we’ve covered how to search and aggregate points based on distances. Next, we’ll look at searching and aggregating them based on shapes.

A.4. Does a point belong to a shape?

Shapes, especially rectangles, are easy to draw interactively on a map, as you can see in figure A.2. It’s also faster to search for points in a shape than to calculate distances because searching in a shape only requires comparing the coordinates of the point with the coordinates of the shape’s corners.

Figure A.2. You can filter points based on whether they fall within a rectangle on the map.

There are three types of shapes on the map that you can match points against (or events, if you’re thinking of the get-together example we used throughout the chapters):

Bounding boxes (rectangles): These are fast and give you the flexibility to draw any rectangle.

Polygons: These allow you to draw a more precise shape, but it’s difficult to ask a user to draw a polygon, and the more complex the polygon is, the slower the search.

Geohashes (squares defined by a hash): These are the least flexible because hashes are fixed. But, as you’ll see later, they’re typically the fastest implementation of the three.

A.4.1. Bounding boxes

To search whether a point falls within a rectangle, you’d use the bounding box filter. This is useful if your application allows users to click a point on the map to define a corner of the rectangle and then click again to define the opposite corner. The result could be a rectangle like the one from figure A.2.

To run the bounding box filter, specify the coordinates for the top-left and bottom-right points that describe the rectangle:

% curl 'localhost:9200/get-together/event/_search?pretty' -d '{
  "query": {
    "filtered": {
      "filter": {
        "geo_bounding_box": {
          "location.geolocation": {
            "top_left": "40, -106",
            "bottom_right": "38, -103"
          }
        }
      }
    }
  }
}'

The default implementation of the bounding box filter is to load the points’ coordinates in memory and compare them with those provided for the bounding box. This is the equivalent of setting the type option under geo_bounding_box to memory.
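The in-memory comparison boils down to four numeric checks per point. Here is a minimal Python sketch of that idea, assuming the same "lat, lon" string format used in the query; the function names are hypothetical, and the sketch does not handle boxes that cross the 180° meridian.

```python
def parse_point(text):
    """Parse a 'lat, lon' string into a (lat, lon) tuple of floats."""
    lat, lon = (float(part) for part in text.split(","))
    return lat, lon


def in_bounding_box(point, top_left, bottom_right):
    """Return True if point (lat, lon) falls inside the rectangle."""
    lat, lon = point
    top, left = parse_point(top_left)
    bottom, right = parse_point(bottom_right)
    return bottom <= lat <= top and left <= lon <= right


print(in_bounding_box((39.0, -105.0), "40, -106", "38, -103"))  # True
print(in_bounding_box((41.0, -105.0), "40, -106", "38, -103"))  # False
```

The same four comparisons are what the indexed variant expresses as range filters on separate latitude and longitude fields.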

Alternatively, you can set type to indexed and Elasticsearch will do the same comparison using range filters, like the ones you learned about in chapter 4. For this implementation to work, you need to index the point’s latitude and longitude in their own fields, which aren’t enabled by default.

To enable indexing latitude and longitude separately, you have to set lat_lon to true in your mapping, making your geolocation field definition look like this:

"geolocation" : { "type" : "geo_point", "lat_lon": true }

Note

If you make this change to mapping.json from the code samples, you’ll need to run populate.sh again to reindex the sample dataset and have your changes take effect.

The indexed implementation is faster, but indexing latitude and longitude will make your index bigger. Also, if you have multiple geo points per document (such as an array of points for a restaurant franchise), the indexed implementation won’t work.

Polygon filter

If you want to search for points matching a more complex shape than a rectangle, you can use the geo polygon filter. It allows you to enter the array of points that describe the polygon. More details about the geo polygon filter can be found here: www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-geo-polygon-filter.html.
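To see why a polygon check costs more than a rectangle check, consider the classic ray-casting test, sketched here in Python. This is a generic textbook technique, not necessarily what Elasticsearch does internally. It treats latitude and longitude as flat coordinates, and the function name and sample triangle are assumptions for illustration.

```python
def point_in_polygon(lat, lon, polygon):
    """Ray casting: count how many polygon edges a ray from the point crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]  # wrap around to close the polygon
        # Does this edge straddle the point's latitude?
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that latitude.
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross_lon:
                inside = not inside
    return inside


# Hypothetical triangle given as (lat, lon) vertices.
triangle = [(40.0, -106.0), (38.0, -106.0), (38.0, -103.0)]
print(point_in_polygon(38.5, -105.5, triangle))  # True
print(point_in_polygon(39.5, -104.0, triangle))  # False
```

Every edge of the polygon is visited for every point, so the work grows with the number of vertices, whereas a rectangle always needs four comparisons.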

If you use the geo bounding box filter to search for documents that fall in an area, you can use the geo bounds aggregation to do the opposite—get the bounding box that includes all points resulting from your search:

"aggs" : {

"events_box": {

"geo_bounds": {

"field": "location.geolocation"

}

returns

"aggregations" : {

"events_box" : {

"bounds" : {

"top_left" : {

"lat" : 51.524806,

"lon" : -122.399801

"bottom_right" : {

"lat" : 37.787742,

"lon" : -0.099095

}
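In essence, the bounds are the extreme latitudes and longitudes among the matching points. A minimal Python sketch follows, assuming no point set that wraps around the 180° meridian; the function name and the sample points are illustrative only.

```python
def geo_bounds(points):
    """Return the smallest lat/lon rectangle containing all (lat, lon) points."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return {
        "top_left": {"lat": max(lats), "lon": min(lons)},
        "bottom_right": {"lat": min(lats), "lon": max(lons)},
    }


# Hypothetical event locations.
events = [(39.7, -104.9), (37.8, -122.4), (51.5, -0.1)]
print(geo_bounds(events))
```

The top-left corner combines the largest latitude with the smallest longitude, and the bottom-right corner does the opposite.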

A.4.2. Geohashes

The last point-matches-shape method you can use is matching geohash cells. Geohash, which is a system invented by Gustavo Niemeyer when building geohash.org [1], works as suggested in figure A.3, which is a screenshot from http://geohash.gofreerange.com. The Earth is divided into 32 rectangles/cells. Each cell is identified by an alphanumeric character, its hash. Then each rectangle (for example, d) can be further divided into 32 rectangles of its own, generating d0, d1, and so on. You can repeat the process virtually forever, generating smaller and smaller rectangles with longer and longer hash values.

[1] https://en.wikipedia.org/wiki/Geohash

Figure A.3. The world divided in 32 letter-coded cells. Each cell is divided into 32 cells and so on, making longer hashes.
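The subdivision process can be sketched as a small Python encoder. It follows the publicly documented geohash scheme (alternately halving the longitude and latitude ranges and emitting one base-32 character per five bits). It is meant as an illustration of the idea, not as the code Elasticsearch runs, and the function name is an assumption.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=12):
    """Encode a point as a geohash string of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits select one of 32 cells
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(40.0, -105.0, precision=3))  # 9xj
```

Each extra character narrows the cell by another factor of 32, which is why longer hashes describe smaller rectangles.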

Geohash cell filter

Because of the way geohash cells are defined, each point on the map belongs to an infinite number of such geohash cells, like d, d0, d0b, and so on. Given such a cell, Elasticsearch can tell you which points match with the geohash cell filter:

% curl 'localhost:9200/get-together/event/_search?pretty' -d '{
  "query": {
    "filtered": {
      "filter": {
        "geohash_cell": {
          "location.geolocation": "9xj"
        }
      }
    }
  }
}'

Even though a geohash cell is a rectangle, this filter works differently than the bounding box filter. First, geo points have to get indexed with a geohash that describes them—for example, 9xj6. Then, you also have to index all the ngrams of that hash, like 9, 9x, 9xj, and 9xj6, which describe all the parent cells. When you run the filter, the hash from the query is matched against the hashes indexed for that point, making a geohash cell filter similar in implementation to the term filter you saw in chapter 4, which is very fast.

To enable indexing the geohash in your geo point, you have to set geohash to true in the mapping. To index that hash’s parents (edge ngrams), set geohash_prefix to true, as well. Indexing prefixes will help make filters faster because they’ll do an exact match on the prefixes already indexed instead of a more expensive wildcard search.
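The effect of indexing prefixes can be illustrated with a toy inverted index in Python. This is a simplified model under stated assumptions (a plain dictionary instead of a Lucene index, hypothetical document IDs and hashes), meant only to show why a prefix lookup becomes an exact term match.

```python
from collections import defaultdict


def geohash_prefixes(geohash):
    """Return all parent cells of a hash, e.g. 9xj6 -> 9, 9x, 9xj, 9xj6."""
    return [geohash[:i] for i in range(1, len(geohash) + 1)]


def build_index(docs):
    """Map every prefix of every document's geohash to the matching doc IDs."""
    index = defaultdict(set)
    for doc_id, geohash in docs.items():
        for prefix in geohash_prefixes(geohash):
            index[prefix].add(doc_id)
    return index


# Hypothetical documents with short geohashes.
docs = {"event1": "9xj6", "event2": "9xj3", "event3": "gcpv"}
index = build_index(docs)

print(geohash_prefixes("9xj6"))          # ['9', '9x', '9xj', '9xj6']
print(sorted(index.get("9xj", set())))   # ['event1', 'event2']
```

Looking up the cell 9xj is a single dictionary access here, with no scanning of longer hashes for a matching prefix. The price is that each point contributes one term per prefix to the index.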

Tip

Because a cell will never be able to perfectly describe a point, you have to choose how precise (or big) that rectangle needs to be. The default setting for precision is 12, which creates hashes like 9xj64sswpkdq with an accuracy of a few centimeters. Because you’ll also index all the parents, you may want to trade some precision for index size and search performance. You can also specify the precision as length (like 10m), and Elasticsearch will set the corresponding numeric value.

Geohash grid aggregation

Just as you can do aggregations with distances, you can cluster documents that match your search by the geohash cells they belong to. The size of these geohash cells is configured through the precision option:

"aggs" : {

"events_clusters": {

"geohash_grid": {

"field": "location.geolocation",

"precision": 5

}

This would return buckets like these:

"events_clusters" : {

"buckets" : [ {

"key" : "9xj64",

"doc_count" : 6

}, {

"key" : "gcpvj",

"doc_count" : 3 ...

Understanding geohash cells is important even if you’re not going to use the geohash filters and aggregations because in Elasticsearch, geohashes are the default way of representing shapes. We’ll explain how shapes use geohashes in the next section.