Incremental backup
This is a proposal for adding incremental backup support to pg_basebackup via the streaming replication protocol.
Initial analysis by Simon Riggs and Marco Nenciarini, reviewed by Gabriele Bartolini.
Goals
The main goal of incremental pg_basebackup is to reduce the size of the backup. A secondary goal is to reduce the backup time as well.
Rationale
When we take a full backup with pg_basebackup, it generates a backup_profile file.
The backup profile is a file containing:
- the backup label;
- one line per file, detailing: tablespace OID, max LSN, whether it is part of the backup or not, modification time, size and file path.
The file path is recorded relative to the tablespace root, not as a full path. The max LSN attribute has a value only if the file is not part of the backup. If the file is part of the backup, its max LSN is safely assumed to be the one of the profile.
The backup profile is stored in the root of the backup data directory, even if the backup format is tar.
The backup profile is necessary for incremental backups.
The algorithm should be similar to rsync, but it compares block LSNs instead of modification times. Since our files are never bigger than 1 GB each, whole-file granularity is probably fine enough: we do not need to worry about copying parts of files, just whole files.
For a first implementation, it is also more robust.
Incremental mode
An incremental pg_basebackup can be run by passing the startLSN of a previous backup as an argument.
We read through every file on the master. Every data file is checked against the provided LSN before being sent.
If the file maxLSN is older than the provided startLSN, the file is not sent. The backup profile contains information on every file, even on those that have not been sent. This is necessary in order to detect file removals.
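The decision rule above can be sketched as follows. This is only an illustration, not the proposed server code: the helper names `parse_lsn`, `should_send` and `removed_files` are hypothetical, and the sketch assumes LSNs are written in PostgreSQL's usual textual form (two hexadecimal numbers separated by a slash).

```python
from typing import Iterable, List


def parse_lsn(text: str) -> int:
    """Convert an LSN in 'hi/lo' hexadecimal text form to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def should_send(file_max_lsn: str, start_lsn: str) -> bool:
    """A file is sent unless its max LSN is older than the start LSN."""
    return parse_lsn(file_max_lsn) >= parse_lsn(start_lsn)


def removed_files(previous: Iterable[str], current: Iterable[str]) -> List[str]:
    """Paths listed in the previous profile but absent from the current one.

    This works only because the profile lists every file, sent or not.
    """
    return sorted(set(previous) - set(current))
```

Comparing LSNs as integers rather than as strings matters, because the textual form is not zero-padded.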
Refresh mode
A “refreshing” incremental pg_basebackup is run by executing it with --refresh on a directory that contains a previous backup. Internally it works like incremental mode, with the additional step of deleting all files that are not listed in the backup_profile.
The resulting directory will contain a full backup.
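A minimal sketch of the extra deletion step, under simplifying assumptions: the function name `prune_unlisted` is hypothetical, only a single directory tree is handled (no separate tablespace locations), and the backup_profile file itself is always kept.

```python
import os
from typing import Iterable, List


def prune_unlisted(backup_dir: str, listed: Iterable[str],
                   keep: Iterable[str] = ("backup_profile",)) -> List[str]:
    """Delete every file under backup_dir whose relative path is not listed.

    `listed` holds the relative paths found in the new backup_profile.
    Returns the sorted relative paths of the files that were removed.
    """
    wanted = set(listed) | set(keep)
    removed = []
    for root, _dirs, files in os.walk(backup_dir):
        for name in files:
            full = os.path.join(root, name)
            # Normalise to forward slashes so paths compare with the profile.
            rel = os.path.relpath(full, backup_dir).replace(os.sep, "/")
            if rel not in wanted:
                os.remove(full)
                removed.append(rel)
    return sorted(removed)
```

Empty directories left behind are not removed in this sketch.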
Restore
Through a new tool, called pg_restorebackup, users will be able to restore from an incrementally built backup data directory.
The tool could also perform some basic integrity checks and give an estimate of the restore progress.
The specification will become clearer as development of the incremental backup feature progresses.
Changes to Streaming Replication Protocol
Our proposal for the BASE_BACKUP command of the streaming replication protocol is made up of one major change:
- Add the INCREMENTAL option to BASE_BACKUP
The command synopsis changes to:
BASE_BACKUP [LABEL 'label'] [INCREMENTAL START_LSN] [PROGRESS] [FAST] [WAL] [NOWAIT] [MAX_RATE rate]
Add backup_profile to ‘BASE_BACKUP’
As the last action of a backup, PostgreSQL sends an additional CopyResponse containing a backup_profile file with the following format:
POSTGRESQL BACKUP PROFILE 1
<backup label content>
FILE LIST
<file list>
where <backup label content> is the content of the backup_label file.
The <file list> section is made up of one or more lines having the following format (standard COPY TEXT file, tab separated):
tablespace maxlsn included mtime size relpath
Where:
- tablespace is the OID of the tablespace (or \N for PGDATA files)
- maxlsn is the file max LSN if the file has been skipped, \N otherwise
- included is a 't' if the file is included in the backup, 'f' otherwise
- mtime is the timestamp of the last modification
- size is the number of bytes of the file
- relpath is the path of the file relative to the tablespace root (PGDATA or the tablespace)
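To illustrate the file list format, here is a hedged sketch of a parser for one line. The names `ProfileEntry` and `parse_profile_line` are hypothetical. The proposal does not fix the textual format of mtime, so it is kept as a raw string, and COPY TEXT backslash escapes other than \N are not decoded.

```python
from dataclasses import dataclass
from typing import Optional

NULL = "\\N"  # COPY TEXT representation of NULL


@dataclass
class ProfileEntry:
    tablespace: Optional[int]  # tablespace OID, None for PGDATA files
    maxlsn: Optional[str]      # set only when the file was skipped
    included: bool             # True if the file is part of the backup
    mtime: str                 # raw timestamp text (format not specified)
    size: int                  # file size in bytes
    relpath: str               # path relative to the tablespace root


def parse_profile_line(line: str) -> ProfileEntry:
    """Parse one tab-separated line of the FILE LIST section."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    tablespace, maxlsn, included, mtime, size, relpath = fields
    if included not in ("t", "f"):
        raise ValueError(f"invalid 'included' flag: {included!r}")
    return ProfileEntry(
        tablespace=None if tablespace == NULL else int(tablespace),
        maxlsn=None if maxlsn == NULL else maxlsn,
        included=(included == "t"),
        mtime=mtime,
        size=int(size),
        relpath=relpath,
    )
```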
Add ‘INCREMENTAL’ option to ‘BASE_BACKUP’
The INCREMENTAL option would require the START_LSN additional argument.
The main idea is that, when operating with INCREMENTAL, PostgreSQL will analyse the content of every block-organised file and stream only files that have a maxLSN higher than or equal to the provided START_LSN.
The backup profile will contain information on every file, even those that are not sent.
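The per-file analysis could look roughly like the sketch below. It relies on the fact that every PostgreSQL page starts with its LSN (pd_lsn, stored as two 32-bit unsigned integers). The rest is assumption: the function name `file_max_lsn` is hypothetical, the block size is taken to be the default 8192 bytes, and the file is read on a machine with the same byte order as the server that wrote it. The real implementation would live in the server and be written in C.

```python
import struct

BLOCK_SIZE = 8192  # assumed: default PostgreSQL block size (BLCKSZ)


def file_max_lsn(path: str) -> int:
    """Return the highest page LSN found in a block-organised file.

    Each page header begins with pd_lsn: two uint32 values (high part,
    low part) in the byte order of the server that wrote the file.
    """
    max_lsn = 0
    with open(path, "rb") as f:
        while True:
            page = f.read(BLOCK_SIZE)
            if len(page) < 8:  # end of file, or too short to hold pd_lsn
                break
            high, low = struct.unpack_from("=II", page, 0)
            max_lsn = max(max_lsn, (high << 32) | low)
    return max_lsn
```

The file would then be streamed only if this value is greater than or equal to START_LSN.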
Changes to pg_basebackup
Add the -I DIRECTORY option (including variant --incremental=DIRECTORY) to activate an incremental backup.
The DIRECTORY value points to a directory containing the backup to use as a start point for a file-level incremental backup. pg_basebackup will read the backup_profile file and then create an incremental backup containing only the files which have been modified after the start point.
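One way the client could obtain the start point from the previous backup is sketched below. The function name `start_lsn_from_profile` is hypothetical. The sketch assumes that the profile header, the backup label content and the FILE LIST marker each start on their own line, and that the embedded backup label carries its usual "START WAL LOCATION:" line.

```python
import re

HEADER = "POSTGRESQL BACKUP PROFILE 1"
START_RE = re.compile(r"START WAL LOCATION: ([0-9A-Fa-f]+/[0-9A-Fa-f]+)")


def start_lsn_from_profile(profile_text: str) -> str:
    """Extract the start LSN from the backup label part of a backup_profile."""
    lines = profile_text.splitlines()
    if not lines or lines[0] != HEADER:
        raise ValueError("not a backup profile (unexpected header)")
    for line in lines[1:]:
        if line == "FILE LIST":  # end of the backup label section
            break
        match = START_RE.match(line)
        if match:
            return match.group(1)
    raise ValueError("START WAL LOCATION not found in backup label")
```

The returned value is what would be passed as START_LSN to BASE_BACKUP.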
pg_restorebackup
A series of incrementally built backups from a full backup can be restored through a tool called pg_restorebackup.
pg_restorebackup restores the content of a PostgreSQL cluster into a given directory. The user provides the list of backups in chronological order, starting from the first full backup and continuing with the subsequent incremental ones:
pg_restorebackup [options] dest_dir backup_1 backup_2 [backup_3 ...]
Options:
-T, --tablespace-mapping (similar to pg_basebackup)
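The proposal does not spell out how the restore picks each file, so the following is only one plausible sketch. It assumes that the last backup's profile defines the final set of files and that each file is copied from the most recent backup in which it is marked as included. The function name `resolve_sources` and the dictionary representation of a profile are hypothetical.

```python
from typing import Dict, List


def resolve_sources(profiles: List[Dict[str, bool]]) -> Dict[str, int]:
    """Decide which backup each restored file should be copied from.

    `profiles` is ordered oldest (full backup) first; each maps a relative
    path to its 'included' flag. Returns relpath -> index of source backup.
    """
    if not profiles:
        raise ValueError("at least one backup is required")
    plan = {}
    for relpath in profiles[-1]:  # files absent here were removed
        for index in range(len(profiles) - 1, -1, -1):
            if profiles[index].get(relpath, False):
                plan[relpath] = index
                break
        else:
            raise ValueError(f"no backup contains {relpath!r}")
    return plan
```

For example, a file skipped by both incremental backups would be taken from the initial full backup, while a file missing from the last profile would not be restored at all.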
Proposed phases
- Phase 1: Add a backup_profile to ‘BASE_BACKUP’
- Phase 2: Add ‘INCREMENTAL’ option to ‘BASE_BACKUP’
- Phase 3: Support of INCREMENTAL for pg_basebackup
- Phase 4: pg_restorebackup