Incremental backup

From PostgreSQL wiki

This is a proposal for adding incremental backup support to pg_basebackup via the streaming replication protocol.

Initial analysis by Simon Riggs and Marco Nenciarini, reviewed by Gabriele Bartolini.

Goals

The main goal of incremental pg_basebackup is to reduce the size of the backup. A secondary goal is to reduce the time a backup takes.

Rationale

When we take a full backup with pg_basebackup, it generates a backup_profile file.

The backup profile is a file containing:

  • the backup label;
  • one line per file, detailing: tablespace oid, max LSN, whether it is part of the backup or not, modification time, size and filepath.

The filepath is shown relative to the tablespace root, not as a full path. The max LSN attribute has a value only if the file is not part of the backup. If the file is part of the backup, its max LSN can safely be assumed to be that of the profile.

The backup profile is stored in the root of the backup data directory, even if the backup format is tar.

The backup profile is necessary for incremental backups.

The algorithm should be similar to rsync, but it compares block LSNs instead of modification times. Since data files are never larger than 1 GB each, whole-file granularity is probably fine enough: there is no need to worry about copying parts of files, only whole files.

Copying whole files is also the more robust choice for a first implementation.
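As an illustration of what "comparing block LSNs" could mean in practice, the following Python sketch computes the maximum LSN found in a block-organised file. It relies on the fact that each PostgreSQL page starts with its LSN (pd_lsn) stored as two 32-bit words, high word first. The 8 kB block size, the little-endian byte order and the function names are assumptions made for this example; they are not part of the proposal.

```python
import struct
from typing import BinaryIO

BLOCK_SIZE = 8192  # assumption: default PostgreSQL block size (BLCKSZ)


def page_lsn(page: bytes, byteorder: str = "<") -> int:
    """Return the LSN stored in the first 8 bytes of a page header."""
    # pd_lsn is two unsigned 32-bit words: high half first, then low half.
    # byteorder "<" assumes the data files come from a little-endian host.
    xlogid, xrecoff = struct.unpack_from(byteorder + "II", page, 0)
    return (xlogid << 32) | xrecoff


def stream_max_lsn(f: BinaryIO, block_size: int = BLOCK_SIZE) -> int:
    """Scan a block-organised file object and return its highest page LSN."""
    max_lsn = 0
    while True:
        page = f.read(block_size)
        if len(page) < 8:  # end of file (or a truncated trailing fragment)
            break
        max_lsn = max(max_lsn, page_lsn(page))
    return max_lsn


def file_max_lsn(path: str) -> int:
    """Convenience wrapper (hypothetical name) that opens a file by path."""
    with open(path, "rb") as f:
        return stream_max_lsn(f)
```

The scan reads every block once, so it saves network transfer and backup size rather than read I/O on the server.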

Incremental mode

An incremental pg_basebackup can be run by passing the startLSN of a previous backup as an argument.

We read through every file on the master. Every data file is checked against the provided LSN before being sent.

If a file's maxLSN is lower than the provided startLSN, the file is not sent. The backup profile contains information on every file, even on those that have not been sent. This is necessary in order to detect file removals.
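The decision rule can be sketched in a few lines of Python. The sketch assumes LSNs are written in the usual `hi/lo` hexadecimal text form and that the per-file maxLSN values have already been computed; `parse_lsn` and `plan_incremental` are hypothetical names used only for illustration.

```python
from typing import Dict, List, Tuple


def parse_lsn(text: str) -> int:
    """Convert an LSN in 'hi/lo' hexadecimal form into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def plan_incremental(
    file_max_lsns: Dict[str, int], start_lsn: int
) -> List[Tuple[str, bool]]:
    """Return (relpath, included) for every file, sent or not.

    A file is sent only when its maxLSN is not lower than start_lsn.
    Skipped files stay in the result so the profile lists them too,
    which is what allows removed files to be detected later.
    """
    return [
        (relpath, max_lsn >= start_lsn)
        for relpath, max_lsn in sorted(file_max_lsns.items())
    ]
```

Comparing LSNs as integers avoids the pitfalls of comparing the text form, where `0/A000000` would sort after `0/10000000`.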

Refresh mode

A “refreshing” incremental pg_basebackup is run by executing it with --refresh on a directory that contains a previous backup. Internally it works like incremental mode, with the additional step of deleting all files that are not listed in the backup_profile. The resulting directory will contain a full backup.
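The extra cleanup step of refresh mode could look roughly like the Python sketch below. It assumes a single-tablespace backup whose files sit directly under the backup directory, and that the set of relative paths has already been read from the new backup_profile; `remove_unlisted_files` is a hypothetical helper, not something the proposal defines.

```python
import os
from typing import Iterable, List, Set


def remove_unlisted_files(
    backup_dir: str,
    listed_relpaths: Set[str],
    keep: Iterable[str] = ("backup_profile",),
) -> List[str]:
    """Delete every file under backup_dir that the profile does not list.

    Returns the sorted relative paths of the files that were removed.
    """
    keep = set(keep)
    removed = []
    for root, _dirs, files in os.walk(backup_dir):
        for name in files:
            full = os.path.join(root, name)
            # Normalise to '/' so paths compare equal to profile entries.
            rel = os.path.relpath(full, backup_dir).replace(os.sep, "/")
            if rel in keep or rel in listed_relpaths:
                continue
            os.remove(full)
            removed.append(rel)
    return sorted(removed)
```

The profile itself is kept explicitly, because it lives in the backup directory but is not expected to list itself.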

Restore

Through a new tool, called pg_restorebackup, users will be able to restore from an incrementally built backup data directory.
The tool could also perform some basic integrity checks and give an estimate of the restore progress.
Its specification will become clearer as development of the incremental backup feature progresses.

Changes to Streaming Replication Protocol

Our proposal makes one major change to the BASE_BACKUP command of the streaming replication protocol:

  • Add the INCREMENTAL option to BASE_BACKUP

The command synopsis changes to:

BASE_BACKUP [LABEL 'label'] [INCREMENTAL START_LSN] [PROGRESS] [FAST] [WAL] [NOWAIT] [MAX_RATE rate]

Add backup_profile to ‘BASE_BACKUP’

As the last action of a backup, PostgreSQL sends an additional CopyResponse containing a backup_profile file in the following format:

POSTGRESQL BACKUP PROFILE 1
<backup label content>
FILE LIST
<file list>

where <backup label content> is the content of the backup_label file.

The <file list> section is made up of one or more lines having the following format (standard COPY TEXT file, tab separated):

tablespace maxlsn included mtime size relpath

Where:

  • tablespace is the OID of the tablespace (or \N for PGDATA files)
  • maxlsn is the file max LSN if the file has been skipped, \N otherwise
  • included is 't' if the file is included in the backup, 'f' otherwise
  • mtime is the timestamp of the last modification
  • size is the number of bytes of the file
  • relpath is the path of the file relative to the tablespace root (PGDATA or the tablespace)
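To make the line format concrete, here is a minimal Python sketch of a parser for one file-list line. It assumes six tab-separated fields with `\N` as the null marker, as described above. The proposal does not fix the text format of mtime, so the sketch keeps it as a string, and it does not undo COPY TEXT backslash escapes inside values. `ProfileEntry` and `parse_profile_line` are hypothetical names.

```python
from typing import NamedTuple, Optional

NULL = r"\N"  # COPY TEXT null marker


class ProfileEntry(NamedTuple):
    tablespace: Optional[int]  # tablespace OID, None for PGDATA files
    maxlsn: Optional[str]      # set only when the file was skipped
    included: bool             # True if the file is part of the backup
    mtime: str                 # kept as text; format is an open question
    size: int                  # file size in bytes
    relpath: str               # path relative to the tablespace root


def parse_profile_line(line: str) -> ProfileEntry:
    """Parse one tab-separated line of the <file list> section."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    tablespace, maxlsn, included, mtime, size, relpath = fields
    if included not in ("t", "f"):
        raise ValueError("included must be 't' or 'f'")
    return ProfileEntry(
        tablespace=None if tablespace == NULL else int(tablespace),
        maxlsn=None if maxlsn == NULL else maxlsn,
        included=(included == "t"),
        mtime=mtime,
        size=int(size),
        relpath=relpath,
    )
```

A line for a file that was sent carries `\N` in the maxlsn column, while a skipped file carries its maxLSN and 'f' in the included column.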

Add ‘INCREMENTAL’ option to ‘BASE_BACKUP’

The INCREMENTAL option requires an additional START_LSN argument.

The main idea is that, when operating with INCREMENTAL, PostgreSQL will analyse the content of every block-organised file and stream only files that have a maxLSN higher than or equal to the provided START_LSN.

The backup profile will contain information on every file, even those that are not sent.

Changes to pg_basebackup

Add the -I DIRECTORY option (including variant --incremental=DIRECTORY) to activate an incremental backup.

The DIRECTORY value points to a directory containing the backup to use as a start point for a file-level incremental backup. pg_basebackup will read the backup_profile file and then create an incremental backup containing only the files which have been modified after the start point.
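The proposal does not spell out how pg_basebackup derives the start point from the backup_profile. One plausible reading, shown in the Python sketch below, is to take the `START WAL LOCATION` line from the backup label section that precedes `FILE LIST`. Both this choice and the name `read_start_lsn` are assumptions made for illustration.

```python
import re

# Matches the line that backup_label uses to record the backup start LSN.
START_RE = re.compile(r"START WAL LOCATION: ([0-9A-Fa-f]+/[0-9A-Fa-f]+)")


def read_start_lsn(profile_text: str) -> str:
    """Extract the start LSN from the label section of a backup_profile."""
    lines = profile_text.splitlines()
    if not lines or not lines[0].startswith("POSTGRESQL BACKUP PROFILE"):
        raise ValueError("not a backup profile")
    for line in lines[1:]:
        if line == "FILE LIST":  # the label section ends here
            break
        match = START_RE.match(line)
        if match:
            return match.group(1)
    raise ValueError("START WAL LOCATION not found in backup label")
```

The value returned here would then be passed as START_LSN in the BASE_BACKUP command.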

pg_restorebackup

A series of incrementally built backups from a full backup can be restored through a tool called pg_restorebackup.

pg_restorebackup restores the content of a PostgreSQL cluster into a given directory. The user provides the list of backups in chronological order, starting from the first full backup and continuing with the subsequent incremental ones:

pg_restorebackup [options] dest_dir backup_1 backup_2 [backup_3 ...]

Options:

  • -T, --tablespace-mapping (similar to pg_basebackup)
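One way such a tool could decide where each restored file comes from is sketched below in Python. The idea, which is an assumption and not part of the proposal, is that the profile of the last backup defines which files exist, and each of those files is taken from the most recent backup that actually includes it. Profiles are simplified here to a mapping from relpath to the included flag, and `resolve_restore_sources` is a hypothetical name.

```python
from typing import Dict, List, Tuple

Profile = Dict[str, bool]  # relpath -> included flag from backup_profile


def resolve_restore_sources(
    backups: List[Tuple[str, Profile]]
) -> Dict[str, str]:
    """Map every file of the final backup to the backup that holds it.

    `backups` must be in chronological order, full backup first.
    Files missing from the last profile were removed and are not restored.
    """
    if not backups:
        raise ValueError("at least one backup is required")
    _last_name, last_profile = backups[-1]
    sources = {}
    for relpath in last_profile:
        # Walk backwards to find the newest backup that includes the file.
        for name, profile in reversed(backups):
            if profile.get(relpath):
                sources[relpath] = name
                break
        else:
            raise ValueError("no backup contains %s" % relpath)
    return sources
```

With a full backup followed by one incremental backup, a file that was skipped by the incremental run resolves to the full backup, while a file dropped between the two runs does not appear in the result at all.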

Proposed phases

  • Phase 1: Add a backup_profile to ‘BASE_BACKUP’
  • Phase 2: Add ‘INCREMENTAL’ option to ‘BASE_BACKUP’
  • Phase 3: Support of INCREMENTAL for pg_basebackup
  • Phase 4: pg_restorebackup