8.1. Overview of options for defining relationships among documents

First, let’s quickly define each of these approaches:

Object type— This allows you to have an object (with its own fields and values) as the value of a field in your document. For example, your address field for an event could be an object with its own fields: city, postal code, street name, and so on. You could even have an array of addresses if the same event happens in multiple cities.

Nested documents— The problem you may have with the object type is that all the data is stored in the same document, so matches for a search can go across subdocuments. For example, city=Paris AND street_name=Broadway could return an event that’s hosted in New York and Paris at the same time, even though there’s no Broadway street in Paris. Nested documents allow you to index the same JSON document but keep your addresses in separate Lucene documents, so that only searches like city=New York AND street_name=Broadway return that event.

Parent-child relationships between documents— This method allows you to use completely separate Elasticsearch documents for different types of data, like events and groups, but still define a relationship between them. For example, you can have groups as parents of events to indicate which group hosts which event. This will allow you to search for events hosted by groups in your area or for groups that host events about Elasticsearch.

Denormalizing— This is a general technique for duplicating data in order to represent relationships. In Elasticsearch, you’re likely to employ it to represent many-to-many relationships because other options work only on one-to-many. For example, all groups have members, and members could belong to multiple groups. You can duplicate one side of the relationship by including all the members of a group in that group’s document.

Application-side joins— This is another general technique where you deal with relationships from your application. It works well when you have less data and can afford to keep it normalized. For example, instead of duplicating members for all groups they’re part of, you could store them separately and include only their IDs in the groups. Then you’d run two queries: first, on members to filter those matching member criteria. Then you’d take their IDs and include them in the search criteria for groups.

Before we dive into all the details of working with each possibility, we’ll provide an overview of them and their typical use cases.

8.1.1. Object type

The easiest way to represent a common interest group and the corresponding events is to use the object type. This allows you to put a JSON object or an array of JSON objects as the value of your field, like the following example:

    {
      "name": "Denver technology group",
      "events": [
        {
          "date": "2014-12-22",
          "title": "Introduction to Elasticsearch"
        },
        {
          "date": "2014-06-20",
          "title": "Introduction to Hadoop"
        }
      ]
    }

If you want to search for a group with events that are about Elasticsearch, you can search in the events.title field.

Under the hood, Elasticsearch (or rather, Lucene) isn’t aware of the structure of each object; it only knows about fields and values. The document ends up being indexed as if it looked like this:

    {
      "name": "Denver technology group",
      "events.date": ["2014-12-22", "2014-06-20"],
      "events.title": ["Introduction to Elasticsearch", "Introduction to Hadoop"]
    }
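To make the flattening concrete, here’s a minimal Python sketch that mimics it. The flatten function is a hypothetical helper written for this illustration, not part of Elasticsearch or Lucene, and it simplifies by storing every field’s values in a list:

```python
def flatten(doc, prefix=""):
    """Collapse inner objects into dotted field names, collecting values in lists."""
    flat = {}
    for key, value in doc.items():
        path = prefix + key
        # Treat a single value like a one-element array
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Inner object: merge its fields under the dotted path
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat


group = {
    "name": "Denver technology group",
    "events": [
        {"date": "2014-12-22", "title": "Introduction to Elasticsearch"},
        {"date": "2014-06-20", "title": "Introduction to Hadoop"},
    ],
}

flat_group = flatten(group)
for field, values in flat_group.items():
    print(field, values)
```

Notice that nothing in the flattened result records which date belongs to which title.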

Because of how they’re indexed, objects work brilliantly when you need to query only one field of the object at a time (generally one-to-one relationships), but when querying multiple fields (as is generally the case with one-to-many relationships), you might get unexpected results. For example, let’s say you want to filter groups hosting Hadoop meetings in December 2014. Your query might look like this:

    "bool": {
      "must": [
        {
          "term": {
            "events.title": "hadoop"
          }
        },
        {
          "range": {
            "events.date": {
              "from": "2014-12-01",
              "to": "2014-12-31"
            }
          }
        }
      ]
    }

This will match the sample document because it has a title that matches hadoop and a date that’s in the specified range. But this isn’t what you want: it’s the Elasticsearch event that’s in December; the Hadoop event is in June. Sticking with the default object type is the fastest and easiest approach to relations, but Elasticsearch is unaware of the boundaries between inner objects, as illustrated in figure 8.1.

Figure 8.1. Inner object boundaries aren’t accounted for when storing, leading to unexpected results.
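The false match can be reproduced with a small Python model of the flattened document. This is only a sketch: the two helper functions are hypothetical stand-ins for the term and range clauses, and the lowercase-and-split step is a rough approximation of how an analyzed field is matched.

```python
# The group document as it ends up after flattening: two independent value lists
flat_group = {
    "name": ["Denver technology group"],
    "events.date": ["2014-12-22", "2014-06-20"],
    "events.title": ["Introduction to Elasticsearch", "Introduction to Hadoop"],
}


def title_has_term(doc, term):
    # Stand-in for the term clause: any title containing the term matches
    return any(term in title.lower().split() for title in doc["events.title"])


def date_in_range(doc, start, end):
    # Stand-in for the range clause: ISO dates compare correctly as strings
    return any(start <= date <= end for date in doc["events.date"])


# Both clauses are satisfied, but by values that come from different events
matches = title_has_term(flat_group, "hadoop") and date_in_range(
    flat_group, "2014-12-01", "2014-12-31"
)
print(matches)
```

Each clause checks its own list of values, so the Hadoop title and the December date satisfy the query together even though they belong to different events.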

8.1.2. Nested type

If you need to make sure such cross-object matches don’t happen, you can use the nested type, which will index your events in separate Lucene documents. With either the object type or the nested type, the group’s JSON document will look exactly the same, and applications will index it in the same way. The difference is in the mapping, which triggers Elasticsearch to index nested inner objects in adjacent but separate Lucene documents, as illustrated in figure 8.2. When searching, you’ll need to use nested filters and queries, which will be explored in section 8.2; these search within each of those separate Lucene documents.

Figure 8.2. The nested type makes Elasticsearch index objects as separate Lucene documents.
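As a rough model of the difference, the following Python sketch evaluates both conditions against one event at a time, which is the effect you get from keeping each event in its own Lucene document. The function names are hypothetical, and the mapping fragment in the comment is an assumed shape shown only for orientation; section 8.2 covers the actual mapping and queries.

```python
# Assumed shape of the mapping difference (illustrative only):
#   "events": {"type": "nested"}
group = {
    "name": "Denver technology group",
    "events": [
        {"date": "2014-12-22", "title": "Introduction to Elasticsearch"},
        {"date": "2014-06-20", "title": "Introduction to Hadoop"},
    ],
}


def event_matches(event, term, start, end):
    # Both conditions must hold for the SAME event
    in_title = term in event["title"].lower().split()
    in_range = start <= event["date"] <= end
    return in_title and in_range


def nested_match(doc, term, start, end):
    # The group matches only if at least one single event satisfies everything
    return any(event_matches(event, term, start, end) for event in doc["events"])


hadoop_in_december = nested_match(group, "hadoop", "2014-12-01", "2014-12-31")
elasticsearch_in_december = nested_match(
    group, "elasticsearch", "2014-12-01", "2014-12-31"
)
print(hadoop_in_december, elasticsearch_in_december)
```

Here the Hadoop query no longer matches, because no single event has both a Hadoop title and a December date.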

In some use cases, it’s not a good idea to mash all the data in the same document as objects and nested types do. Take the case of groups and events: if a new event is organized by a group and all of that group’s data is in the same document, you’ll have to re-index the whole document for that event. This can hurt performance and concurrency, depending on how big those documents get and how often those operations are done.

8.1.3. Parent-child relationships

With parent-child relationships, you can use completely different Elasticsearch documents by putting them in different types and defining their relationship in the mapping of each type. For example, you can have events in one mapping type and groups in another and you can specify in the mapping that groups are parents of events. Also, when you index an event, you can point it to the group that it belongs to, as in figure 8.3. At search time, you can use has_parent or has_child queries and filters to take the other part of the relationship into account. We’ll discuss them later in this chapter as well.

Figure 8.3. Different types of Elasticsearch documents can have parent-child relationships.
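To picture how the two sides relate, here’s a minimal in-memory Python sketch. It doesn’t use Elasticsearch at all: the has_child and has_parent functions below are hypothetical helpers that only imitate the idea behind the queries of the same name, and the second group and its event are made-up sample data.

```python
# Groups and events live in separate documents; each event points to its parent group
groups = {
    "g1": {"name": "Denver technology group"},
    "g2": {"name": "Boulder hiking group"},  # hypothetical sample data
}
events = [
    {"title": "Introduction to Elasticsearch", "parent": "g1"},
    {"title": "Introduction to Hadoop", "parent": "g1"},
    {"title": "Sunrise hike", "parent": "g2"},  # hypothetical sample data
]


def has_child(groups, events, child_predicate):
    """Return names of groups that have at least one matching event."""
    parent_ids = {e["parent"] for e in events if child_predicate(e)}
    return sorted(groups[gid]["name"] for gid in parent_ids if gid in groups)


def has_parent(groups, events, parent_predicate):
    """Return titles of events whose parent group matches."""
    return [
        e["title"]
        for e in events
        if e["parent"] in groups and parent_predicate(groups[e["parent"]])
    ]


es_groups = has_child(groups, events, lambda e: "elasticsearch" in e["title"].lower())
denver_events = has_parent(groups, events, lambda g: "Denver" in g["name"])
print(es_groups)
print(denver_events)
```

Because each event is its own document, adding a new event in this model means appending one entry; the group document isn’t touched.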

8.1.4. Denormalizing

For any relational work, you have objects, nested documents, and parent-child relations. These work for one-to-one and one-to-many relationships—the kinds that have one parent with one or more children. There are also techniques that are not specific to Elasticsearch but are methods often employed by NoSQL data stores to overcome the lack of joins: one is denormalizing, which means a document will include data that’s related to it, even if the same data will have to be duplicated in another document. Another is doing joins in your application.

For example, let’s take groups and their members. A group can have more than one member, and a user can be a member of more than one group. Both have their own set of properties. To represent this relationship, you can have groups as parents of the members. For users who are members of multiple groups, you can denormalize their data by storing a copy once for each group they belong to, as in figure 8.4.

Figure 8.4. Denormalizing is the technique of multiplying data to avoid costly relations.
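A short Python sketch shows what this duplication looks like in practice. All names and fields here are hypothetical sample data, and denormalize is an illustrative helper, not an Elasticsearch feature:

```python
import copy

# Hypothetical normalized data: each member is stored once, groups list member IDs
members = {
    "m1": {"name": "Lee", "city": "Denver"},
    "m2": {"name": "Radu", "city": "Boulder"},
}
group_memberships = {
    "Denver technology group": ["m1", "m2"],
    "Denver Hadoop group": ["m1"],
}


def denormalize(group_memberships, members):
    """Build one document per group, embedding a full copy of each member."""
    return [
        {
            "name": group_name,
            "members": [copy.deepcopy(members[mid]) for mid in member_ids],
        }
        for group_name, member_ids in group_memberships.items()
    ]


group_docs = denormalize(group_memberships, members)

# Count how many group documents carry a copy of each member
copies = {}
for doc in group_docs:
    for member in doc["members"]:
        copies[member["name"]] = copies.get(member["name"], 0) + 1
print(copies)
```

The member who belongs to two groups is stored twice, so a change to that member’s data would have to be applied to every copy.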

Alternatively, you can keep groups and members separated and include only member IDs in group documents. You’d join groups and their members by using member IDs in your application, which works well if you have a small number of member IDs to query by, as shown in figure 8.5.

Figure 8.5. You can keep your data normalized and do the joins in your application.
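The two-step lookup can be sketched in Python as follows. The in-memory lists stand in for two separate searches, and every name, field, and function here is a hypothetical placeholder rather than a real client call:

```python
# Hypothetical normalized data: members stored once, groups hold only member IDs
members = [
    {"id": "m1", "name": "Lee", "city": "Denver"},
    {"id": "m2", "name": "Radu", "city": "Boulder"},
]
groups = [
    {"name": "Denver technology group", "member_ids": ["m1", "m2"]},
    {"name": "Denver Hadoop group", "member_ids": ["m1"]},
    {"name": "Boulder search group", "member_ids": ["m2"]},
]


def find_member_ids(members, predicate):
    # First query: filter members by the member criteria and keep their IDs
    return [m["id"] for m in members if predicate(m)]


def find_groups_with_members(groups, member_ids):
    # Second query: use the collected IDs as search criteria for groups
    wanted = set(member_ids)
    return [g["name"] for g in groups if wanted.intersection(g["member_ids"])]


boulder_ids = find_member_ids(members, lambda m: m["city"] == "Boulder")
boulder_groups = find_groups_with_members(groups, boulder_ids)
print(boulder_groups)
```

The list of IDs from the first step becomes part of the second query, which is why this approach suits cases where that list stays small.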

In the rest of this chapter, we’ll take a deeper look at each of these techniques: objects and arrays, nested documents, parent-child relationships, denormalizing, and application-side joins. You’ll learn how they work internally, how to define them in the mapping, how to index them, and how to search those documents.