11.4. Backing up your data

Elasticsearch provides a full-featured and incremental data backup solution. The snapshot and restore APIs enable you to back up individual index data, all of your indices, and even cluster settings to either a remote repository or other pluggable backend systems and then easily restore these items to the existing cluster or a new one.

The typical use case for creating snapshots is, of course, to perform backups for disaster recovery, but you may also find it useful in replicating production data in development or testing environments and even as insurance before executing a large set of changes.

11.4.1. Snapshot API

The first time you use the snapshot API to back up your data, Elasticsearch takes a copy of the state and data of your cluster. All subsequent snapshots contain the changes from the previous one. The snapshot process is nonblocking, so executing it on a running system should have no visible effect on performance. Furthermore, because every subsequent snapshot is the delta from the previous one, snapshots become smaller and faster over time.

It’s important to note that snapshots are stored in repositories. A repository can be defined as either a file system or a URL.

A file-system repository requires a shared file system, and that shared file system must be mounted on every node in the cluster.

URL repositories are read-only and can be used as an alternative way to access snapshots.

In this section, we’ll cover the more common and flexible file-system repository: how to store snapshots in it, how to restore from them, and how to leverage common plugins for cloud-vendor storage repositories.

11.4.2. Backing up data to a shared file system

Performing a cluster backup entails executing three steps that we’ll cover in detail:

Define a repository— Instruct Elasticsearch on how you want the repository structured.

Confirm the existence of the repository— You want to trust but verify that the repository was created using your definition.

Execute the backup— Your first snapshot is executed via a simple REST API command.

The first step in enabling snapshots requires you to define a shared file-system repository. The curl command in the following listing defines your new repository on a network mounted drive.

Listing 11.4. Defining a new repository
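The code for Listing 11.4 is missing from this excerpt. A plausible reconstruction, inferred from the settings the GET response below echoes back (treat the exact values, particularly the smb:// location, as illustrative):

```shell
# Hypothetical reconstruction of Listing 11.4: register a shared
# file-system repository named my_repository. The location must be
# a shared file system mounted on every node in the cluster.
curl -XPUT 'localhost:9200/_snapshot/my_repository' -d '{
  "type": "fs",
  "settings": {
    "location": "smb://share/backups",
    "compress": true,
    "max_snapshot_bytes_per_sec": "20mb",
    "max_restore_bytes_per_sec": "20mb"
  }
}' || echo 'note: requires a running Elasticsearch node on localhost:9200'
```

The max_snapshot_bytes_per_sec and max_restore_bytes_per_sec settings throttle snapshot and restore throughput per node, so backups don't saturate disk or network I/O on a busy cluster.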

Once the repository has been defined across your cluster, you can confirm its existence with a simple GET command:

curl -XGET 'localhost:9200/_snapshot/my_repository?pretty=1';
{
  "my_repository" : {
    "type" : "fs",
    "settings" : {
      "compress" : "true",
      "max_restore_bytes_per_sec" : "20mb",
      "location" : "smb://share/backups",
      "max_snapshot_bytes_per_sec" : "20mb"
    }
  }
}

Note that if you omit the repository name, Elasticsearch will respond with all repositories registered for the cluster:

curl -XGET 'localhost:9200/_snapshot?pretty=1';

Once you’ve established a repository for your cluster, you can go ahead and create your initial snapshot/backup:

curl -XPUT 'localhost:9200/_snapshot/my_repository/first_snapshot';

This command will trigger a snapshot operation and return immediately. If you want to wait until the snapshot is complete before the request responds, you can append the optional wait_for_completion flag:

curl -XPUT 'localhost:9200/_snapshot/my_repository/first_snapshot?wait_for_completion=true';

Now take a look at your repository location and see what the snapshot command stored away:

./backups/index
./backups/indices/bitbucket/0/__0
./backups/indices/bitbucket/0/__1
./backups/indices/bitbucket/0/__10
./backups/indices/bitbucket/1/__c
./backups/indices/bitbucket/1/__d
./backups/indices/bitbucket/1/snapshot-first_snapshot
...
./backups/indices/bitbucket/snapshot-first_snapshot
./backups/metadata-first_snapshot
./backups/snapshot-first_snapshot

From this list, you can see a pattern emerging in what Elasticsearch backed up. The snapshot contains information for every index, shard, segment, and accompanying metadata for your cluster, with the file path structure {repository}/indices/{index_name}/{shard_id}/{file}. A sample snapshot file may look similar to the following, which contains information about size, the Lucene segment, and the files that each snapshot points to within the directory structure:

smb://share/backups/indices/bitbucket/0/snapshot-first_snapshot
{
  "name" : "first_snapshot",
  "index_version" : 18,
  "start_time" : 1416687343604,
  "time" : 11,
  "number_of_files" : 20,
  "total_size" : 161589,
  "files" : [ {
    "name" : "__0",
    "physical_name" : "_l.fnm",
    "length" : 2703,
    "checksum" : "1ot813j",
    "written_by" : "LUCENE_4_9"
  }, {
    "name" : "__1",
    "physical_name" : "_l_Lucene49_0.dvm",
    "length" : 90,
    "checksum" : "1h6yhga",
    "written_by" : "LUCENE_4_9"
  }, {
    "name" : "__2",
    "physical_name" : "_l.si",
    "length" : 444,
    "checksum" : "afusmz",
    "written_by" : "LUCENE_4_9"
  }, ... ]
}

Second snapshot

Because snapshots are incremental, storing only the delta between them, a second snapshot command will create a few more data files but won’t recreate the entire snapshot from scratch:

curl -XPUT 'localhost:9200/_snapshot/my_repository/second_snapshot';

Analyzing the new directory structure, you can see that only one file was modified: the existing /index file in the root directory. Its contents now hold a list of the snapshots taken:

{"snapshots":["first_snapshot","second_snapshot"]}

Snapshots on a per-index basis

In the previous example, you saw how you can take snapshots of the entire cluster and all indices. It’s important to note here that snapshots can be taken on a per-index basis, by specifying the index in the PUT command:
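The per-index listing is missing from this excerpt; a minimal sketch of what such a request might look like (the snapshot name and index names here are illustrative, not from the original listing):

```shell
# Snapshot only the named indices rather than the whole cluster;
# "indices" accepts a comma-separated list. The snapshot name
# third_snapshot and the index names are hypothetical.
curl -XPUT 'localhost:9200/_snapshot/my_repository/third_snapshot' -d '{
  "indices": "logs_2014,bitbucket"
}' || echo 'note: requires a running Elasticsearch node on localhost:9200'
```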

Retrieving basic information on the state of a given snapshot (or all snapshots) is achieved by using the same endpoint with a GET request:

curl -XGET 'localhost:9200/_snapshot/my_repository/first_snapshot?pretty';

The response contains which indices were part of this snapshot and the total duration of the entire snapshot operation:

{
  "snapshots": [
    {
      "snapshot": "first_snapshot",
      "indices": [
        "bitbucket"
      ],
      "state": "SUCCESS",
      "start_time": "2014-11-02T22:38:14.078Z",
      "start_time_in_millis": 1414967894078,
      "end_time": "2014-11-02T22:38:14.129Z",
      "end_time_in_millis": 1414967894129,
      "duration_in_millis": 51,
      "failures": [],
      "shards": {
        "total": 10,
        "failed": 0,
        "successful": 10
      }
    }
  ]
}

Substituting _all for the snapshot name will supply you with information regarding all snapshots in the repository:

curl -XGET 'localhost:9200/_snapshot/my_repository/_all';

Because snapshots are incremental, you must take special care when removing old snapshots that you no longer need. It’s always advised that you use the snapshot API to remove old snapshots, because the API will delete only segments of data that are no longer in use:

curl -XDELETE 'localhost:9200/_snapshot/my_repository/first_snapshot';

Now that you have a solid understanding of the options available when backing up your cluster, let’s have a look at restoring your cluster data and state from these snapshots, which you’ll need to understand in the event of a disaster.

11.4.3. Restoring from backups

Snapshots are easily restored to any running cluster, even a cluster the snapshot didn’t originate from.

Using the snapshot API with an added _restore command, you can restore the entire cluster state:

curl -XPOST 'localhost:9200/_snapshot/my_repository/first_snapshot/_restore';

This command will restore the data and state of the cluster captured in the given snapshot, first_snapshot. With this operation, you can easily restore the cluster to any point in time you choose.

Similar to what you saw before with the snapshot operation, the restore operation allows for a wait_for_completion flag, which will block the HTTP call you make until the restore operation is fully complete. By default, the restore HTTP request returns immediately, and the operation executes in the background:

curl -XPOST 'localhost:9200/_snapshot/my_repository/first_snapshot/_restore?wait_for_completion=true';

Restore operations also have additional options available that allow you to restore an index to a newly named index space. This is useful if you want to duplicate an index or verify the contents of a restored index from backup:
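The listing for this restore is missing from the excerpt; reconstructing from the description that follows, it likely resembled the sketch below (the exact rename_pattern string is an assumption):

```shell
# Restore only logs_2014 from the snapshot, renaming it on the way in.
# rename_pattern is a regex matched against restored index names;
# matches are rewritten using rename_replacement. The pattern values
# shown are assumptions based on the surrounding text.
curl -XPOST 'localhost:9200/_snapshot/my_repository/first_snapshot/_restore' -d '{
  "indices": "logs_2014",
  "rename_pattern": "logs_2014",
  "rename_replacement": "a_copy_of_logs_2014"
}' || echo 'note: requires a running Elasticsearch node on localhost:9200'
```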

Given this command, you’ll restore only the index named logs_2014 from the snapshot and ignore restoring any other indices found in the snapshot. Because the index name matches the pattern you defined as the rename_pattern, the snapshot data will reside in a new index named a_copy_of_logs_2014.

Note

When restoring an existing index, the running instance of the index must be closed. Upon completion, the restore operation will open the closed indices.

Now that you understand how the snapshot API works to enable backups in a network-attached-storage environment, let’s explore some of the many plugins available for performing backups in a cloud-based vendor environment.

11.4.4. Using repository plugins

Although snapshotting and restoring from a shared file system is a common use case, Elasticsearch and the community also provide repository plugins for several of the major cloud vendors. These plugins allow you to define repositories that use a specific vendor’s infrastructure requirements and internal APIs.

Amazon S3

For those deploying on an Amazon Web Services infrastructure, there’s a freely available S3 repository plugin available on GitHub and maintained by the Elasticsearch team: https://github.com/elasticsearch/elasticsearch-cloud-aws#s3-repository.

The Amazon S3 repository plugin has a few configuration variables that differ from the norm, so it’s important to understand what functionality each of them controls. An S3 repository can be created as such:
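The S3 listing is missing from this excerpt; a hedged sketch, using setting names from the cloud-aws plugin's documentation (the repository name, bucket, region, and base path are all placeholders):

```shell
# Register an S3-backed repository; bucket, region, and base_path are
# placeholders. If credentials aren't given here, the plugin uses the
# AWS credentials configured in elasticsearch.yml or the environment.
curl -XPUT 'localhost:9200/_snapshot/my_s3_repository' -d '{
  "type": "s3",
  "settings": {
    "bucket": "my-backup-bucket",
    "region": "us-east-1",
    "base_path": "elasticsearch/snapshots"
  }
}' || echo 'note: requires a running Elasticsearch node with the cloud-aws plugin'
```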

Once enabled, the S3 plugin will store your snapshots in the defined bucket path. Because HDFS is compatible with Amazon S3, you may be interested in reading the next section, which covers the Hadoop HDFS repository plugin, as well.

Hadoop HDFS

The HDFS file system can be used as a snapshot/restore repository with this simple plugin, built and maintained by the Elasticsearch team that’s part of the more general Hadoop plugin project: https://github.com/elasticsearch/elasticsearch-hadoop/tree/master/repository-hdfs.

You must install the latest stable release of this plugin on your Elasticsearch cluster. From the plugin directory, use the following command to install the desired version of the plugin directly from GitHub:

bin/plugin -i elasticsearch/elasticsearch-repository-hdfs/2.x.y

Once it’s installed, it’s time to configure the plugin. The HDFS repository plugin configuration values should be placed within your elasticsearch.yml configuration file. Here are some of the important values:
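The configuration listing is missing from this excerpt; the fragment below is a sketch using setting names from the repository-hdfs plugin's README (all values are placeholders):

```yaml
# Hypothetical elasticsearch.yml fragment for the HDFS repository plugin.
repositories:
  hdfs:
    uri: "hdfs://namenode:8020"       # optional: HDFS address; defaults to the local file system
    path: "elasticsearch/snapshots"   # required: path within the file system where snapshots are stored
    load_defaults: "true"             # optional: whether to load the default Hadoop configuration
    conf_location: "extra-cfg.xml"    # optional: extra Hadoop configuration file to load
    concurrent_streams: 5             # optional: number of concurrent read/write streams per repository
    compress: "false"                 # optional: whether to compress the snapshot metadata
    chunk_size: "10mb"                # optional: chunk size for splitting large files
```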

Now, with your HDFS repository plugin configured, your snapshot and restore operations will execute using the same snapshot API as covered earlier. The only difference is that the method of snapshotting and restoring will be from your Hadoop file system.

In this section we explored various ways to back up and restore cluster data and state using the snapshot API. Repository plugins provide a convenience for those deploying Elasticsearch with public cloud vendors. The snapshot API provides a simple and automated way to store backups in a networked environment for disaster recovery.