Transparent Data Encryption
This page describes the transparent data encryption feature proposed in pgsql-hackers.
Contents
Overview
There has been continual discussion about whether and how to implement Transparent Data Encryption (TDE) in Postgres. Many other relational databases support TDE, and some security standards require it. However, it is also debatable how much security value TDE provides.
Fundamentally, TDE must meet three criteria — it must be secure, obviously, but it also must be done in a way that has minimal impact on the rest of the Postgres code. This has value for two reasons — first, only a small number of users will use TDE, so the less code that is added, the less testing is required. Second, the less code that is added, the less likely TDE will break because of future Postgres changes. Finally, TDE should meet regulatory requirements.
History
The first patch was proposed in 2016 [1] and implemented cluster-wide encryption with a single key. In 2018 table-level transparent data encryption was proposed [2], together with a method to integrate with key management systems; that first patch was submitted in 2019 [3]. The patch implemented both tablespace-level encryption using a 2-tier key architecture and generic key management API to communicate with external key management systems.
Threat model
TDE protects data from theft if file system access controls are compromised:
- Malicious user steals storage device and reads database files directly
- Malicious backup operator takes backup
- Protecting data at rest (persistent data)
This does not protect from users who can read system memory, e.g., shared buffers, which root users can do.
Scope of TDE
The scope of TDE is:
- Internal key management system (KMS), storing keys in the database
- Cluster-wide encryption
- encrypt everything that is persistent
- not encrypting shared buffers or data in memory
The benefit of cluster wide encryption is:
- Simple architecture
- Suitable for the requirement of encrypting everything
Cluster-wide encryption meets the compliance requirements and checks the box as far as TDE is concerned. It also meets the criteria of encrypting the data at rest i.e., persistent data.
When to encrypt/decrypt
Buffer
It encrypts buffer data during disk I/O
- Processes encrypt data when writing it to disk
- Decrypt when reading from disk
- Data in the shared buffer is not encrypted
WAL
In cluster encryption
- Processes insert WAL data to WAL buffers in non-encrypted state
- WAL buffers are encrypted when writing to the file system
- WAL would use a dedicated encryption key
Temporary Files
Temporary files would use a temporary key that is randomly generated at postmaster start and lives only for the postmaster lifetime. For parallel queries, especially parallel hash joins, since it's possible that multiple parallel workers use the same temporary files, the the temporary key should be shared with parallel workers.
Backups
In cluster-wide encryption, there will be an option in pg_basebackup to change the heap/index key for key rotation purposes. After failover to a standby, the WAL key can be changed too.
How to encrypt
We will use Advanced Encryption Standard (AES) [4]. We will use AES-GCM [5] for heap/index and WAL encryption and AES-GCM with KWP [6] for key wrapping.
Key length
We will offer three key length options (128, 192, and 256-bits) selected at initdb time with ```--file-encryption-keylen```.
Initialization Vector(IV)
Nonce means "number used once". An IV is a specific type of nonce. That is, unique but not necessarily random or secret, as specified by the NIST [7]. To generate unique IVs, the NIST recommends two methods:
The first method is to apply the forward cipher function, under the same key that is used for the encryption of the plaintext, to a nonce. The nonce must be a data block that is unique to each execution of the encryption operation. For example, the nonce may be a counter, as described in Appendix B, or a message number. The second method is to generate a random data block using a FIPS-approved random number generator.
We will use the first method to generate IVs. That is, select nonce carefully and use a cipher with the key to make it unique enough to use as an IV. The nonce selection for buffer encryption and WAL encryption are described below.
IV for heap/index encryption
We will use the page LSN (8 bytes) and page number (4 bytes) to create an IV (16 bytes) for each page.
Using the page LSN and page number as part of the nonce has three benefits:
- We don't need to decrypt/re-encrypt during CREATE DATABASE since the page contents are the same in both places, and once one database changes its pages, it gets a new LSN, and hence a new nonce/IV.
- For each change of an 8k page, you get a new nonce/IV, so you are not encrypting different data with the same nonce/IV.
- This avoids requiring pg_upgrade to preserve database oids, tablespace oids, and relfilenodes.
- We get a unique nonce even when two different pages in the same relation have the same LSN because the page numbers are different (we don't use the same LSN in different relations)
- This can happen when a heap update expires an old tuple and adds a new tuple to another page.
However, the LSN must then be visible on encrypted pages, so we will not encrypt the LSN on the page. We will also not encrypt the CRC so pg_checksums can still check pages offline without access to the keys. Probably, we will need to encrypt pd_lower and pd_upper to not give a hint to an attacker where the hole is within the page. Therefore, we will need to encrypt from pd_flags or pg_lower.
typedef struct PageHeaderData { /* XXX LSN is member of *any* block, not only page-organized ones */ PageXLogRecPtr pd_lsn; /* LSN: next byte after last byte of xlog * record for last change to this page */ uint16 pd_checksum; /* checksum */ uint16 pd_flags; /* flag bits, see below */ LocationIndex pd_lower; /* offset to start of free space */ LocationIndex pd_upper; /* offset to end of free space */ LocationIndex pd_special; /* offset to start of special space */ uint16 pd_pagesize_version; TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */ ItemIdData pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */ } PageHeaderData;
There are some challenges in making sure the nonce is never reused. Unlogged tables don't update the page LSN when updating the page, so we would need things like the magic GIST "GistBuildLSN" fake-LSN too, and we will need a bit used in the IV to distinguish if it's a real LSN or an unlogged LSN; perhaps we need a separate unlogged table encryption key.
Another challenge is hint bit updates, which change the page contents but don't always update the page LSN. wal_log_hints can be enabled so the LSN changes on the first hint bit change during a checkpoint. However, later hint bit changes during the same checkpoint would not cause an LSN page change. This would result in using the same nonce to encrypt different page contents.
One fix for this would be to create dummy WAL records and update the page LSN for every hint bit change (not just the first one since the last checkpoint), but that would write too many dummy records. A more reasonable option would be to update a per-page counter on every hint bit change for the same page LSN, and use that counter in the four unused IV bytes. However, the counter would have to be stored unencrypted on the page. If the counter is about to wraparound, we force wal_log_hint_bits to generate a new LSN, even though it is not the first hint bit change during the checkpoint. (Checksums use a similar method to allow page checksums to be recomputed multiple times during the same checkpoint, but only create a page image WAL record on first page hint bit change during a checkpoint.)
IV for WAL encryption
We will use the WAL segment number to create an IV for each WAL segments. The maximum bits that can be processed with a single key/IV(nonce) pair is 68GB [8][9].
We will use a different IV(nonce) 16MB WAL file, so we will be OK there too. What about timelines?
IV for temporary files
It is unclear how to set the nonce for temporary files. We will probably use a data encryption key generated at postmaster start, and mix that with the time of day, process id, and maybe file path.
2-tier key architecture
We use two types of encryption keys: key encryption keys (KEK) and data encryption key (DEK). KEK is used to encrypt DEK and must be stored in a separate location, e.g., key management server. DEK is used to encrypt actual user data and can be stored in the same location as the database after encrypted using the KEK. If we use 1-tier key architecture, we must decrypt all data using the current key and re-encrypt them using the new key, which could take a long time.
Key Rotation
Rotation of the KEK (Key Encryption Key) involves rewrapping the data encryption keys with a new KEK. Rotating the data encryption keys might require failover to a standby that uses a different heap/index data encryption key, and then changing the WAL key.
Checksum and Encryption
Encrypt and then CRC, and store the CRC decrypted
Prepared Transaction Encryption
During the discussion, the point about prepared transaction encryption also came up since they are also persisted. Sawada-san mentioned that we aren’t storing any important data for prepared transactions so we might not need to encrypt it. However we need to have it as part of the todo list.
Other requirements
TDE requires the OpenSSL library.
TODO for Full-Cluster Encryption
Here is list of ongoing tasks with there assignment and status for cluster-wide encryption and internal key management system.
- Buffer encryption - Assigned to Sawada Status : Needs to update based on the latest KMS patch.
- Which Files do we need to encrypt?
- Need to go through the data directory and get a list of all the files that contain user data - Assigned to Moon-san (Moon-san will check if NTT has done some work in this area and share the result)
- Need to get buy-in from the community for not encrypting non-user data files i.e. visibility map, transactional data etc (This needs to be a team effort)
- don't encrypt LSN(pd_lsn) and CRC(pd_checksum) of the encrypted page contents
- Encrypt then checksum
- don't encrypt the first 12 bytes of a page so pd_flags is visible in encrypted and non-encrypted mode? (Might be useful for online checksum and encryption.)
- make pd_pagesize_version visible on the encrypted page?
- shared buffers nonce is LSN/page-number (nonce is run through encryption to create the IV?)
- require wal_log_hints and full_page_writes to prevent force bit changes to be WAL logged (generates new page LSN)
- Which Files do we need to encrypt?
- WAL encryption - Assigned to Moon/Swada Status : Needs to update based on the latest KMS patch.
- use WAL key
- use CTR mode
- for WAL, don't use OpenSSL's EVP interface so the offset can be specified?
- WAL nonce is segment number (no timeline?)
- Encrypt whole WAL records
- Need to make sure that we don't encrypt the different data with the same key and nonce, and write it to the disk, especially when encrypting data which is not multiple of 16 bytes.
- add Asssert() code to check that there are no WAL record types that modify more than one relation (already written)
- Temporary file encryption - Assigned to Moon Status : Work in progress, will be posted to hackers soon.
- Encryption key: A hash value is randomly generated for each temporary file, and a temporary key is generated by a combination of the hash value and the master key. (will use HMAC256)
- IV value: The encryption key will be used separately for each file. So that the IV is simply generated(64bit = hash value, 32bit = counter) as it should not exceed 1GB per file(pgsql_tmp).
- Need to check if some new temporary files that could have user data are introduced by recent changes for PG13.
- Front end tools encryption - Assigned to Cary Status : Pending due to the shift of focus on KMS for PG13. Some front-end patches have been shared with community that illustrates the interactions with
KMS. Development can resume once the focus is shifted back to TDE.
- Allow pg_rewind and pg_waldump to work, add --cluster-passphrase-command option
- does pg_rewind need to work across WAL key changes?
- offline tool to allow changing the data encryption key of current WAL and PITR WAL files, must be crash-safe
- changing the pass phrase will require --old-passphrase-command and --new-passphrase-command options
- modify pg_basebackup to store relations with a different relation key so standby servers can be used for relation key rotation
Here are some tasks or areas that we need to research, some of these are being worked as part of main to do's listed above. This is a exhaustive list to ensure that we don't skip any todo required for the first phase of TDE.
- TDE for replication
- wal sender (especially xlogreader) needs to take TDE-wal key to decrypt wal data.
- Physical replication
- WAL records are sent in a form of plaintext?
- Allow primary and the standby servers to use different heap/index keys
- Logical replication (and decoding)
- We decrypt and decode WAL data and send these changes in a form of plaintext.
- The subscriber will use different encryption keys or even can disable TDE.
- TDE for backup
- During physical backup (pg_basebackup or copying OS files), table/index data are transferred in a form of encrypted text and all three internal keys are replicated
- Maybe we need to have the ability to change some internal keys during basebackup? This can be done by generating new internal keys, re-encrypting database files with the new key during the
transfer (we already do checksum verification for every page), generate new control file having new internal keys and sending it.
- For logical backup (pg_dump), all data are dumped in a form of plaintext because pg_dump simply fetches data via SQL.
- How does TDE work with backup manifests?
- Regression test cases for TDE
- Documentation
- if a standby is promoted to a primary and the old primary continues writing, one must be re-keyed to avoid using the same IV
- Comprehensive testing
List of the contains of user data for PostgreSQL files
If there are any added files, incorrect note or user data, please correct it.
Num | Database cluster | Contains of user data | Single Sequential Write | Single Process Write-then-Read | note |
---|---|---|---|---|---|
1 | PG_VERSION | not contain | Only PostgreSQL version information is written | ||
2 | base/NNNNN/NNNNN | contain | Table data | ||
3 | base/NNNNN/NNNNN_vm | not contain | VM file | ||
4 | base/NNNNN/NNNNN_fsm | not contain | FSM file | ||
5 | base/NNNNN/NNNNN_init | not contain | |||
6 | base/NNNNN/PG_VERSION | not contain | Only PostgreSQL version information is written | ||
7 | base/NNNNN/pg_filenode.map | not contain | |||
8 | base/NNNNN/pg_internal.init | not contain | |||
9 | base/pgsql_tmp/pgsql_tmpPID.tempFileCounter | contain | Temporary file that creates user data temporarily when work_mem size is insufficient | ||
10 | current_logfiles | not contain | |||
11 | global/NNNN | contain | Database name and user name | ||
12 | global/NNNN_vm | not contain | vm file | ||
13 | global/NNNN_fsm | not contain | fsm file | ||
14 | global/pg_control | not contain | |||
15 | global/pg_filenode.map | not contain | |||
16 | global/pg_internal.init | not contain | |||
17 | pg_commit_ts/0000 | not contain | |||
18 | pg_dynshmem/mmap.NNNNNNN | not contain | |||
19 | pg_logical/mappings/ | not contain | |||
20 | pg_logical/replorigin_checkpoint | not contain | |||
21 | pg_logical/snapshots/0-XXXXXXXX.snap | not contain | |||
22 | pg_multixact/members/0000 | not contain | |||
23 | pg_multixact/offsets/0000 | not contain | |||
24 | pg_notify/0000 | not contain | |||
25 | pg_replslot/Slotname/state | not contain | |||
26 | pg_replslot/Slotname/xid-NNN-lsn-0-NNNNNNNN.snap | contain | Includes user data decoded from WAL files | ||
27 | pg_serial/ | not contain | |||
28 | pg_snapshots/NNNNNNNN-N | not contain | |||
29 | pg_stat/db_NNNNN.stat | contain | Statistics collector includes user data | ||
30 | pg_stat/global.stat | contain | Statistics collector includes user data | ||
31 | pg_stat_tmp/db_NNNNN.stat | contain | Statistics collector includes user data | ||
32 | pg_stat_tmp/global.stat | contain | Statistics collector includes user data | ||
33 | pg_subtrans/0000 | not contain | |||
34 | pg_tblspc/PG_NN_NNNNNNNN/NNNNN | contain | Symlink files | ||
35 | pg_twophase/NNNNNNNN | not contain | Exclude user data | ||
36 | pg_wal/NNNNNNNNNNNNNNNNNN | contain | WAL data | ||
37 | pg_wal/*.backup | not contain | Exclude user data | ||
38 | pg_wal/*.history | not contain | Exclude user data | ||
39 | pg_wal/*.partial | contain | WAL data | ||
40 | pg_wal/archive_status/NNNNNNNNNNNNNNNNNN.done | not contain | Exclude user data | ||
41 | pg_wal/archive_status/NNNNNNNNNNNNNNNNNN.ready | not contain | Exclude user data | ||
42 | pg_xact/0000 | not contain | |||
43 | postgresql.auto.conf | not contain | setting file | ||
44 | postmaster.opts | not contain | |||
45 | postmaster.pid | not contain | |||
46 | postgresql.conf | not contain | Setting file | ||
47 | pg_hba.conf | not contain | |||
48 | |||||
49 |
TDE in other systems
MySQL (InnoDB)
MySQL supports per tablespace, data at rest encryption [10]. Please note that in MySQL the tablespace refers to a data file that can hold data for one or more InnoDB tables and associated indexes, while tablespace refers to a directory in PostgreSQL. innodb_file_per_table option allows tables to be created in their own tablespace. As of MySQL 8.0.16 it supports redo log and undo log encryption [11] and system tables encryption [12]. It supports 2 tier key architecture; it has tablespace keys for each tablespace which are located on the header of tablespace file. The master key can be obtained from external systems via a key ring plugin [13]
MySQL encrypts each page of both redo log and undo log with dedicated keys, not with the keys used for table encryption. The encryption key is stored in the header of the first redo/undo log file in encrypted state.
Oracle DB
Oracle DB supports column-level and tablespace-level TDE, both approaches use a two-tiered key-based architecture [14]. The Master Encryption Key (MEK) is stored in an external key store with both hardware and software key stores supported [15]. The MEK is used to secure the column- and tablespace-level keys. Column-level TDE uses one key per table, tablespace-level TDE uses one key per tablespace. Oracle TDE supports Triple-DES (3DES168) and AES (128, 192, 256 bit). Column-level TDE defaults to AES-192, tablespace-level TDE defaults to AES-128. Both methods add a salt to the plaintext before encryption by default [16]. Column-level TDE supports a NOMAC parameter to improve performance.
MS SQL Server
MS SQL Server supports database-level TDE with a three-tiered architecture using both symmetric and asymmetric key encryption [17]. The Service Master Key (SMK) is generated automatically during installation (e.g. `initdb` in PostgreSQL). The Database Master Key (DMK) is created in the `master` database (e.g. postgres default database) and is encrypted by the SMK. The DMK is then used to generate the certificates actually used to secure the Database Encryption Key (DEK). The DEK is the per-database symmetricly used to encrypt data and log files.
Filesystem-level encryption (fscrypt)
https://www.kernel.org/doc/html/latest/filesystems/fscrypt.html
Links
- Disk encryption
- [https://crypto.stackexchange.com/questions/44113/what-is-a-safe-maximum-meage-size-limit-when-encrypting-files-to-disk-with-aes What is a safe maximum message size limit when encrypting files to disk
with AES-GCM before the need to re-generate the key or NONCE]