8.4. Parent-child relationships: connecting separate documents

Another option for defining relationships among data in Elasticsearch is to define a type within an index as a child of another type of the same index. This is useful when documents or relations need to be updated often. You’d define the relationship in the mapping through the _parent field. For example, you can see in the mapping.json file from the book’s code samples that events are children of groups, as illustrated in figure 8.12.

Figure 8.12. The relationship between events and groups as it’s defined in the mapping

Once you have this relationship defined in the mapping, you can start indexing documents. The parents (group documents in this case) are indexed normally. For children (events in this example), you need to specify the parent’s ID in the _parent field. This points each event to its group and lets you search for groups using criteria from their events, or the other way around, as shown in figure 8.13.

Figure 8.13. The _parent field of each child document is pointing to the _id field of its parent.

Compared to the nested approach, searches are slower. With nested documents, the fact that all inner objects are Lucene documents in the same block pays dividends because they can be joined easily into the root document. Parent and child documents are completely different Elasticsearch documents, so they have to be searched for separately.

The parent-child approach shines when it comes to indexing, updating, and deleting documents. Because parent and child documents are different Elasticsearch documents, they can be managed separately. For example, if a group has many events and you need to add a new one, you add that new event document. Using the nested-type approach, Elasticsearch will have to re-index the group documents with the new event and all existing events, which is much slower.
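The difference in indexing cost can be illustrated with a rough count of the Lucene documents written when you add one event to a group. This is a simplified model, not a measurement: it assumes each nested event is one Lucene document in the group's block and ignores segment merges and other overhead. The function names are made up for this sketch.

```python
def lucene_docs_written_nested(existing_events: int) -> int:
    """Adding one event to a nested group re-indexes the whole block:
    the root group document, every existing event, and the new event."""
    return 1 + existing_events + 1


def lucene_docs_written_parent_child(existing_events: int) -> int:
    """Adding one child event writes only that event; the group and
    its other events are left untouched."""
    return 1


for n in (10, 1000):
    print(n, lucene_docs_written_nested(n), lucene_docs_written_parent_child(n))
```

Under this model the nested cost grows with the number of events already in the group, while the parent-child cost stays constant.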

A parent document doesn’t have to be indexed yet when you index its child. This is useful when you have lots of new documents and you want to index them asynchronously. For example, you can index events on your website generated by users and also index the users. Events may come from your logging system, and users may be synchronized from a database. You don’t need to worry about making sure a user exists before you can index an event that will have that user as a parent. If the user doesn’t exist, the event is indexed anyway.

But how would you index parent and child documents in the first place? This is what we’ll explore next.

8.4.1. Indexing, updating, and deleting child documents

We’ll only worry about child documents here because parents are indexed like any other document you’ve indexed so far. It’s the child documents that must point to their parents via the _parent field.

Note

Parents of a document type can be children of another type. You can have multiple levels of such relationships, just as you can with nested types. You can even combine the two approaches. For example, a group can have its members stored as a nested type and its events stored separately as its children.

When it comes to child documents, you have to define the _parent field in the mapping, and when indexing, you must specify the parent’s ID in the _parent field. The parent’s ID will also serve as the child’s routing value.

Routing and routing values

You may recall from chapter 2 how indexing operations get distributed to shards by default: each document you index has an ID, and that ID gets hashed. At the same time, each shard of the index has an equal slice of the total range of hashes. The document you index goes to the shard that has that document’s hashed ID in its range.

The value that gets hashed (by default, the document’s ID) is called the routing value, and the process of assigning a document to a shard is called routing. Because each ID is different and you hash them all, the default routing mechanism will evenly balance documents between shards.

You can also specify a custom routing value. We’ll go into the details of using custom routing in chapter 9, but the basic idea is that Elasticsearch hashes that routing value and not the document’s ID to determine the shard. You’d use custom routing when you wanted to make sure multiple documents are in the same shard because hashing the same routing value will always give you the same hash.

Custom routing becomes useful when you start searching because you can provide a routing value to your query. When you do, Elasticsearch goes only to the shard that corresponds to that routing value, instead of querying all the shards. This reduces the load in your cluster a lot and is typically used for keeping each user’s documents together.

The _parent field provides Elasticsearch with the ID and type of the parent document, which lets it route the child documents to the same shard as the parent document. _parent is essentially a routing value, and you benefit from it when searching. Elasticsearch will automatically use this routing value to query only the parent’s shard to get its children or the child’s shard to get its parent.

The common routing value makes all the children of the same parent land in the same shard as the parent itself. When searching, all the correlations that Elasticsearch has to do between a parent and its children happen on the same node. This is much faster than broadcasting all the child documents over the network in search of a parent. Another implication of routing is that when you update or delete a child document, you need to specify the _parent field.
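To make the routing idea concrete, here is a minimal sketch of how a shared routing value keeps a parent and its children on one shard. It's an illustration only: zlib.crc32 stands in for Elasticsearch's internal hash function, the shard count of 5 is an assumption, and the event IDs other than 1103 are made up.

```python
import zlib
from typing import Optional

NUM_SHARDS = 5  # assumed number of primary shards


def shard_for(routing_value: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash the routing value and map it to a shard number.
    crc32 is a stand-in here, not the hash Elasticsearch uses."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def route(doc_id: str, parent_id: Optional[str] = None) -> int:
    """Children are routed by their parent's ID; everything else by its own ID."""
    routing_value = parent_id if parent_id is not None else doc_id
    return shard_for(routing_value)


group_shard = route("2")  # the group is routed by its own ID
event_shards = {route(event_id, parent_id="2") for event_id in ("1103", "1104", "1105")}

print(group_shard, event_shards)
```

Whatever shard the hash picks, every event with parent 2 ends up on the same shard as group 2, because they all hash the same routing value.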

Next we’ll look at how you’d practically do all those things:

Define the _parent field in the mapping.

Index, update, and delete child documents by specifying the _parent field.

Mapping

The next listing shows the relevant part of the events mapping from the code samples. The _parent field has to point to the parent type—in this case, group.

Listing 8.7. _parent mapping from the code samples
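As a rough idea of what such a mapping fragment contains, here is a minimal sketch of the request body built as a Python dictionary. It assumes the Elasticsearch 1.x mapping syntax used in this chapter, leaves out the event's regular properties, and isn't a copy of the mapping.json file from the code samples.

```python
import json

# The child type (event) declares which type acts as its parent (group).
event_mapping = {
    "event": {
        "_parent": {
            "type": "group"
        }
        # the event's regular "properties" would go here; omitted in this sketch
    }
}

print(json.dumps(event_mapping, indent=2))
```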

Indexing and retrieving

With the mapping in place, you can start indexing documents. Those documents have to contain the parent value in the URI as a parameter. For your events, that value is the ID of the group they belong to, such as 2 for the Elasticsearch Denver group:

% curl -XPOST 'localhost:9200/get-together/event/1103?parent=2' -d '{
  "host": "Radu",
  "title": "Yet another Elasticsearch intro in Denver"
}'

The _parent field is stored so you can retrieve it later, and it’s also indexed so you can search on its value. If you look at the contents of _parent for an event, you’ll see the type you defined in the mapping as well as the group ID you specified when indexing.

To retrieve an event document, you run a normal GET request, and you also have to specify the _parent value:

% curl 'localhost:9200/get-together/event/1103?parent=2&pretty'
{
  "_index" : "get-together",
  "_type" : "event",
  "_id" : "1103",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "host": "Radu",
    "title": "Yet another Elasticsearch intro in Denver"
  }
}

The _parent value is required because you can have multiple events with the same ID pointing to different groups. But the _parent and _id combination is unique. If you try to get the child document without specifying its parent, you’ll get an error saying that a routing value is required. The _parent value is that routing value Elasticsearch is waiting for:

% curl 'localhost:9200/get-together/event/1103?pretty'
{
  "error" : "RoutingMissingException[routing is required for [get-together]/[event]/[1103]]",
  "status" : 400
}

Updating

You’d update a child document through the update API, in a similar way to what you did in chapter 3, section 3.5. The only difference here is that you have to provide the parent again. As in the case of retrieving an event document, the parent is needed to get the routing value of the event document you’re trying to change. Otherwise, you’d get the same RoutingMissingException you had earlier when trying to retrieve the document without specifying a parent.

The following snippet adds a description to the document you just indexed:

% curl -XPOST 'localhost:9200/get-together/event/1103/_update?parent=2' -d '{
  "doc": {
    "description": "Gives an overview of Elasticsearch"
  }
}'
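Because every single-document request for a child carries the same parent parameter, you may want to build those URLs in one place. The following is a small sketch of such a helper; child_url is a hypothetical name, not part of any Elasticsearch client, and the base address simply mirrors the localhost:9200 used in the curl examples.

```python
from urllib.parse import urlencode

BASE = "http://localhost:9200"  # same host and port as the curl examples


def child_url(index, doc_type, doc_id, parent, endpoint=None, **params):
    """Build the URL for a child document request, always including the
    parent parameter that Elasticsearch needs as the routing value."""
    path = f"{BASE}/{index}/{doc_type}/{doc_id}"
    if endpoint:
        path += f"/{endpoint}"
    query = urlencode({"parent": parent, **params})
    return f"{path}?{query}"


index_or_get_url = child_url("get-together", "event", 1103, parent=2)
update_url = child_url("get-together", "event", 1103, parent=2, endpoint="_update")

print(index_or_get_url)
print(update_url)
```

Making parent a required argument means a request without a routing value can't be built by accident.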

Deleting

To delete a single event document, run a delete request like in chapter 3, section 3.6.1, and add the parent parameter:

% curl -XDELETE 'localhost:9200/get-together/event/1103?parent=2'

Deleting by query works as before: documents that match get deleted. This API doesn’t need parent values and it doesn’t take them into account, either:

% curl -XDELETE 'http://localhost:9200/get-together/event/_query?q=host:radu'

Speaking of queries, let’s look at how you can search across parent-child relations.

8.4.2. Searching in parent and child documents

With parent-child relations, like those you have with groups and their events, you can search for groups and add event criteria or the other way around. Let’s see what the actual queries and filters are that you’ll use:

has_child queries and filters are useful in searching for parents with criteria from their children (for example, if you need groups hosting events about Elasticsearch).

has_parent queries and filters are useful when searching for children with criteria from their parents—for example, events that happen in Denver because location is a group property.

has_child query and filter

If you want to search in groups hosting events about Elasticsearch, you can use the has_child query or filter. The classic difference here is that filters don’t care about scoring.

A has_child filter can wrap another filter or a query. It runs that filter or query against the specified child type and collects the matches. The matching children contain the IDs of their parents in the _parent field. Elasticsearch collects those parent IDs and removes the duplicates (the same parent ID can appear multiple times, once for each child) and returns the list of parent documents. The whole process is illustrated in figure 8.14.

Figure 8.14. The has_child filter first runs on children and then aggregates the results into parents, which are returned.

In Phase 1 of the figure, the following actions take place:

The application runs a has_child filter, requesting group documents with children of type event that have “Elasticsearch” in their title.

The filter runs on the event type for documents matching “Elasticsearch.”

The resulting event documents point to their respective parents. Multiple events can point to the same group.

In Phase 2, Elasticsearch gathers all the unique group documents and returns them to the application.
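The two phases can be mimicked in a few lines of plain Python over in-memory data. This is only a model of the logic, not how Elasticsearch implements it. Apart from the Elasticsearch Denver group (ID 2) and the two event titles taken from this chapter, the sample documents are made up.

```python
groups = {
    "2": {"name": "Elasticsearch Denver"},
    "3": {"name": "Elasticsearch San Francisco"},  # hypothetical
    "4": {"name": "Big Data Bucharest"},           # hypothetical
}

events = [
    {"_id": "103", "_parent": "2", "title": "Introduction to Elasticsearch"},
    {"_id": "1103", "_parent": "2", "title": "Yet another Elasticsearch intro in Denver"},
    {"_id": "105", "_parent": "3", "title": "Logging and Elasticsearch"},  # hypothetical
    {"_id": "110", "_parent": "4", "title": "Big Data and the cloud"},     # hypothetical
]


def has_child(groups, events, term):
    """Return parent groups that have at least one event whose title contains term."""
    # Phase 1: run the term filter on the child documents.
    matching_events = [e for e in events if term in e["title"].lower().split()]

    # Phase 2: follow _parent to the groups, dropping duplicate parent IDs.
    parent_ids = []
    seen = set()
    for event in matching_events:
        if event["_parent"] not in seen:
            seen.add(event["_parent"])
            parent_ids.append(event["_parent"])
    return [groups[pid] for pid in parent_ids if pid in groups]


matching_groups = has_child(groups, events, "elasticsearch")
print([g["name"] for g in matching_groups])
```

Group 2 has two matching events but appears only once in the result, which is the deduplication step from Phase 2.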

The filter from figure 8.14 would look like this:

% curl 'localhost:9200/get-together/group/_search?pretty' -d '{
  "query": {
    "filtered": {
      "filter": {
        "has_child": {
          "type": "event",
          "filter": {
            "term": {
              "title": "elasticsearch"
            }
          }
        }
      }
    }
  }
}'

The has_child query runs in a similar way to the filter, except it can give a score to each parent by aggregating child document scores. You’d do that by setting score_mode to max, sum, avg, or none, as you can do with nested queries.

Note

Whereas the has_child filter can wrap a filter or a query, the has_child query can only wrap another query.

For example, you can set score_mode to max and get the following query to return groups ordered by which one hosts the most relevant event about Elasticsearch:

% curl 'localhost:9200/get-together/group/_search?pretty' -d '{
  "query": {
    "has_child": {
      "type": "event",
      "score_mode": "max",
      "query": {
        "term": {
          "title": "elasticsearch"
        }
      }
    }
  }
}'
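To see what the different score_mode values do, you can model the aggregation of child scores into a parent score in plain Python. This is a simplified sketch: the child scores are invented, and the constant 1.0 returned for none is an assumption standing in for "every matching parent gets the same score."

```python
def parent_score(child_scores, score_mode="none"):
    """Combine the scores of a parent's matching children into one parent score."""
    if not child_scores:
        raise ValueError("a parent only matches if at least one child matches")
    if score_mode == "max":
        return max(child_scores)
    if score_mode == "sum":
        return sum(child_scores)
    if score_mode == "avg":
        return sum(child_scores) / len(child_scores)
    if score_mode == "none":
        return 1.0  # assumed constant: child scores are ignored
    raise ValueError(f"unknown score_mode: {score_mode}")


# Invented scores of matching events, keyed by the ID of their parent group.
child_scores_by_group = {"2": [0.5, 1.5], "3": [1.0]}

ranked = sorted(
    child_scores_by_group,
    key=lambda group_id: parent_score(child_scores_by_group[group_id], "max"),
    reverse=True,
)
print(ranked)
```

With max, a group is ranked by its single best event, so one highly relevant event outweighs many mediocre ones. With sum, groups with many matching events rise to the top instead.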

Warning

In order for has_child queries and filters to remove parent duplicates quickly, Elasticsearch caches parent IDs in the field cache we introduced in chapter 6. This may take a lot of JVM heap if you have lots of parent matches for your queries. This will be less of a problem once you can have doc values for the _parent field, as described in this issue: https://github.com/elastic/elasticsearch/issues/6107.

Getting the child documents in the results

By default, only the parent documents are returned by the has_child query, not the children that match. You can get the children as well by adding the inner_hits option you saw earlier for nested documents:

"query": {
  "has_child": {
    "type": "event",
    "query": {
      "term": {
        "title": "elasticsearch"
      }
    },
    "inner_hits": {}
  }
}

As with nested documents, the reply for each matching group will also contain matching events, except that now events are separate documents and have their own ID instead of an offset:

"name": "Elasticsearch Denver",
[...]
"inner_hits" : {
  "event" : {
    "hits" : {
      "total" : 2,
      "max_score" : 0.9581454,
      "hits" : [ {
        "_index" : "get-together",
        "_type" : "event",
        "_id" : "103",
        "_score" : 0.9581454,
        "_source":{
          "host": "Lee",
          "title": "Introduction to Elasticsearch",
[...]

has_parent query and filter

has_parent is, as you might expect, the opposite of has_child. You use it when you want to search for events but include criteria from the groups they belong to.

The has_parent filter can wrap a query or a filter. It runs on the parent type that you provide as "type", takes the matching parents, and returns the children whose _parent field points to those parents’ IDs.

The following listing shows how to search for events about Elasticsearch, but only if they happen in Denver.

Listing 8.8. has_parent query to find Elasticsearch events in Denver
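A request body for such a search might look like the following sketch, built as a Python dictionary. It assumes the Elasticsearch 1.x query syntax used in this chapter and combines a term query on the event title with a has_parent query in a bool query. The field name location for the group's city is an assumption, and this is not a copy of the book's listing.

```python
import json

# Events must mention Elasticsearch in the title AND belong to a Denver group.
search_body = {
    "query": {
        "bool": {
            "must": [
                {"term": {"title": "elasticsearch"}},
                {
                    "has_parent": {
                        "type": "group",  # the parent type to run the inner query on
                        "query": {
                            "term": {"location": "denver"}  # assumed field name
                        },
                    }
                },
            ]
        }
    }
}

# You'd send this to the event type, e.g. get-together/event/_search
print(json.dumps(search_body, indent=2))
```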

Because a child has only one parent, there are no scores to aggregate, as there would be with has_child. By default, has_parent has no influence on the child’s score ("score_mode": "none"). You can change "score_mode" to "score" to make events inherit the score of their parent groups.

Like the has_child queries and filters, has_parent queries and filters have to load parent IDs in field data to support fast lookups. That being said, you can expect all those parent/child queries to be slower than the equivalent nested queries. It’s the price you pay for being able to index and search all the documents independently.

Another similarity with has_child queries and filters is the fact that has_parent returns, by default, only one side of the relationship—in this case, the child documents. From Elasticsearch 1.5, you can fetch the parents as well by adding the inner_hits object to the query.

children aggregation

With version 1.4, a children aggregation was introduced, which allows you to nest aggregations on child documents under those you make on parent documents. Let’s say that you already get the most popular tags for your get-together groups through the terms aggregation. For each of those tags, you also need the most frequent attendees to events belonging to each tag’s groups. In other words, you want to see the people with strong preferences toward specific categories of events.

You’ll get these people in the following listing by nesting a children aggregation under your top-tags terms aggregation. Under the children aggregation, you’ll nest another terms aggregation that will count the number of attendees for each tag.

Listing 8.9. Combining parent and child aggregations
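The nesting described above could be expressed with a request body like the following sketch, built as a Python dictionary. It assumes the Elasticsearch 1.4 aggregation syntax. The field names tags and attendees and the aggregation names to-events and frequent-attendees are assumptions made for illustration, and this is not a copy of the book's listing.

```python
import json

aggregation_body = {
    "aggs": {
        "top-tags": {
            # parent side: most popular tags across groups
            "terms": {"field": "tags"},  # assumed field name
            "aggs": {
                "to-events": {
                    # switch from group documents to their child events
                    "children": {"type": "event"},
                    "aggs": {
                        "frequent-attendees": {
                            # child side: who attends events of groups with this tag
                            "terms": {"field": "attendees"}  # assumed field name
                        }
                    },
                }
            },
        }
    }
}

# You'd send this to the parent type, e.g. get-together/group/_search
print(json.dumps(aggregation_body, indent=2))
```

Each bucket of top-tags then carries its own frequent-attendees buckets, computed only over events whose parent groups have that tag.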

Note

You may have noticed that the children aggregation is similar to the nested aggregation—it passes child documents to the aggregations within it. Unfortunately, at least up to version 1.4, Elasticsearch doesn’t provide a parent-child equivalent of the reverse nested aggregation to allow you to do the opposite: pass parent documents to the aggregations within it.

You can think of nested documents as index-time joins and parent-child relations as query-time joins. With nested, a parent and all its children are joined in a single Lucene block when indexing. By contrast, the _parent field allows different types of documents to be correlated at query time.

Nested and parent-child structures are good for one-to-many relationships. For many-to-many relationships, you’ll have to employ a technique common in the NoSQL space: denormalizing.

Using parent-child designation to define document relationships: pros and cons

Before moving on, here’s a quick recap of why you should or shouldn’t use parent-child relationships. The plus points:

Children and parents can be updated separately.

Query-time join performance is better than if you did joins in your application because all related documents are routed to the same shard and joins are done at the shard level without adding network hops.

The downsides:

Queries are more expensive than the nested equivalent and need more memory, because parent IDs have to be loaded into field data.

Aggregations can only join child documents to their parents and not the other way around, at least up to version 1.4.