Transparent Data Encryption
This page describes transparent data encryption feature proposed in pgsql-hackers.
- 1 Overview
- 2 Threat models
- 3 Granularity of encryption objects
- 4 What to encrypt/decrypt
- 5 When to encrypt/decrypt
- 6 How to encrypt
- 7 Checksum and Encryption
- 8 Keys
- 9 Other requirements
- 10 TODO for Full-Cluster Encryption
- 11 List of the contains of user data for PostgreSQL files
- 12 TDE in other systems
- 13 Links
This summaries transparent data at rest encryption. The first patch is proposed in 2016 which implements the cluster-wide encryption with single key. In 2018 table-level transparent data encryption was proposed  (together with the method of integration with key management systems), that first patch is submitted in 2019, although it implements tablespace-level encryption as per the discussion. The patch implements both tablespace-level encryption using 2-tier key architecture and generic key management API to communicate with external key management systems.
As of July 2019, there are two proposals implementing different encryption feature: cluster-level encryption and tablespace-level encryption. But these provide the following functionality in common.
- Data at rest encryption
- Encrypt/Decrypt data when reading/writing shared buffer from/to the file system
- Temporary file encryption
- tupleslots, reorder buffer files
Here is latest development status of TDE (copying from Bruce's latest blog on TDE)
For the past 16 months, there has been discussion about whether and how to implement Transparent Data Encryption (tde) in Postgres. Many other relational databases support tde, and some security standards require it. However, it is also debatable how much security value tde provides.
The tde 400-email thread became difficult for people to follow, partly because full understanding required knowledge of Postgres internals and security details. A group of people who wanted to move forward began attending a Zoom call, hosted by Ahsan Hadi. The voice format allowed for more rapid exchange of ideas, and the ability to quickly fill knowledge gaps. It was eventually decided that all-cluster encryption was the easiest to implement in the first version. Later releases will build on this.
Fundamentally, tde must meet three criteria — it must be secure, obviously, but it also must be done in a way that has minimal impact on the rest of the Podaylight saving time usastgres code. This has value for two reasons — first, only a small number of users will use tde, so the less code that is added, the less testing is required. Second, the less code that is added, the less likely tde will break because of future Postgres changes. Finally, tde should meet regulatory requirements.
Here are some points discussed in the voice call on 8/10 :
Bruce shared an update on the recent community discussion. The main point being discussed in community is which files we need to encrypt, the design shared with cluster wide encryption is only encrypting the user data and not encrypting files that don't encrypt user data. Couple of to do's have been added to the list below to figure out which files contain user data and getting a buy-in from the community on which files to encrypt. We already know that we need to encypt pg_temp files and statistical files have some user data.
Bruce mentioned that we need to create targetted threads for each file type that we need to encrypt. This will help in having a targetted discussion for each file type encryption.
One of the main critism about this feature was a very long thread of over 400 emails, it is very difficult to understand the feature direction and key decisions made about the feature. The purpose of having the voice calls for the last several weeks to narrow down the scope of the feature for the first release. Bruce has created a fresh email thread describing the current situation and provding a link to the wiki page. The feature development will be carried forward using the new email thread and other information will be shared in the wiki page.
The goal is to get key management system (KMS) into PG 13, and TDE in PG 14. However the goal of getting KMS committed in PG-13 wasn't accomplioshed because the patch had some user interface changes and not many people from the community weighed in on the patches. The patches are very closed to completion and we hope that they will get committed in the early commit fest of PG-14. Bruce and Joe will be working on the documentation to make it easier for users to understand the interface changes.
In parallel we need to start working on the TDE for PG-14 and try to progress the feature in the early part of PG-14 development. Majarity of development for clsuter wide TDE is already done but we need to revise the patches in light the latest changes in PG-13 and also work on the remaining tasks. The todo list in the wiki page is updated to reflect the latest status.
TDE protects data from theft when file system access controls are compromised:
- Malicious user steals storage devices and reads database files directly.
- Malicious backup operator takes backup.
- Protecting data at rest (persistent data)
This does not protect from users who can read system memory, e.g., shared buffers, which root users can do.
Granularity of encryption objects
This topic is under discussion which granularity of encryption object we choose as the first version of TDE
There are choices of granularity of encryption objects.
- Database cluster
- Proposed on pgsql-hackers
- Close to the SQL Server choice.
- Close to the Oracle database and MySQL choices.
- Oracle database supports
- Better to do during execution something like using pgcrypto and triggers and views.
- Define new groups
- Create a group of tables being encrypted.
Scope for the first release of TDE
The scope for the first release of TDE/KMS is the following, it is targetted for PG-14. The KMS part is pretty much done and will try to commit that in the early commitfest of PG-14.
- Cluster wide encryption
- Internal key management system (KMS), storing keys in the database
- Encrypt everything that is persistent, not encrypting shared buffers or data in memor
Cluster wide encryption
The benefit of cluster wide encryption is following:
- Simple architecture
- Suitable for the requirement of encryption everything
The cluster wide encryption meets the compliance requirement and it is also a check in the box as far as TDE is concrend. It also meets the criteria of encrypting the data at rest i.e. persistent data.
The benefit of more-granular-than-cluster encryption is following:
- Suppress performance overhead
- Reduce the amount of data encrypted with a single key
- Makes cryptoanalysis more difficult
- Puts less data at risk if someone gets "lucky" in doing brute force
- Reencryption without rebuilding whole database cluster
- Useful for multi-user systems where each user's data-at-rest needs to be cryptographically isolated from the others.
However there also are concerns:
- Performance overhead by looking up keys for encrypted table.
What to encrypt/decrypt
It depends on the granularity of encryption objects. In cluster encryption, it encrypts whole database cluster including all tables, indexes as well as SLRU data.
Cluster wide encryption is the target for first release, this level of granuality will encrypt all the persistent data as-well as WAL, temporary files etc. It doesn't encrypt data in memory i.e shared buffers.
On the other hand, in more-granular-than-cluster encryption we can encrypt the part of database cluster. It encrypts the following database objects.
- Tables and indexes
- System catalogs
- Temporary files
it doesn't encrypt the followings but these might change in the future.
- SLRU data (clog, commit timestamp, 2pc file etc)
- Server logs
When to encrypt/decrypt
Regardless of the granularity of encryption objects, it seems to get the consensus about when to encrypt buffer data on the shared buffer. It encrypt buffer data when disk I/O; processes encrypt data when writing it to the disk and decrypt when reading from the disk. That is, the data on the shared buffer is not encrypted state.
In cluster encryption, processes insert WAL data to WAL buffers in non-encrypted state and encrypts WAL buffers when writing to the file system.
On the other hand, in more-granular-than-cluster encryption there were discussions whether we need to use the same encryption key for WAL encryption as used for the table. The conclusion (at least for first version) is to have the encryption key dedicated for WAL and encrypt each WAL blocks with it. The main reason is threat model; if we see TDE as a data-at-rest solution, then I think it's fine to have a separate keyring with such keys, and only allow access to system processes.
To separate privilege of accessing to WAL key we may need to change how WAL writing works, so that individual backends don't really write into the WAL buffers directly, and instead hand-over the data to a separate process (with access to the key). We already have the walwriter, but it may not be running and we consider that to be OK. Or maybe we could have "encrypter" process that does just that. That's surely non-trivial work, but it seems like much less work compared to reworking the WAL format to allow WAL to be encrypted with different keys.
In cluster wide encryption, there will be an option in pg_basebackup to change the key if needed. They keys aren’t changed by default but we provide an option to change the key when setting up a new stand-by server
How to encrypt
This topic is under discussion that which block cipher modes of operation we use for buffer encryption and WAL encryption, respectively
We will use Advanced Encryption Standard (AES), which is a block cipher algorithm. Key size can be 128 or 256 bits.
This topic is under discussion which key length we support
We will provide two types of key lengths: 128 and 256, selected at initdb time, e.g. --encrypt-aes128.
An IV is a specific type of nonce. Nonce means "number used once". That is, unique but not necessarily kept secret. The NIST requirements intended to ensure the random but unique property with high probability. To generate random IVs, the NIST recommends two methods.
The first method is to apply the forward cipher function, under the same key that is used for the encryption of the plaintext, to a nonce. The nonce must be a data block that is unique to each execution of the encryption operation. For example, the nonce may be a counter, as described in Appendix B, or a message number. The second method is to generate a random data block using a FIPSapproved random number generator.
A question that came up in the last meeting is how to set IV value of temporary files since we spend hugh time in determining initialization vector. How to create IV for temporary files that we write to the file-system? Bruce mentioned that we can get get input from the community on temporary file encryption and how to set IV value for temporary files encryption.
Therefore, we concluded to use the first method to generate IVs. That is, select nonce carefully and through the cipher with the key makes it random enough and use it as an IV. The elements of nonce for both buffer encryption and WAL encryption are described below.
IV for heap/index encryption
We will use (page LSN, page number) to create an IV for each page. Page LSN is 8byte and page number is 4byte. Since the length of IV is 16 byte it's fine.
Using the LSN and page number as part of the nonce has three benefits:
- We don't need to decrypt/re-encrypt during CREATE DATABASE since the page contents are the same in both places, and once one database changes its pages, it gets a new LSN, and hence a new nonce/IV.
- For each change of an 8k page, you get a new nonce/IV, so you are not encrypting different data with the same nonce/IV.
- This avoids requiring pg_upgrade to preserve database oids, tablespace oids, relfilenodes.
- We get a unique nonce even when different two pages in the same relation have the same LSN (we don't use the same LSN in different relations)
- This can happen for example when heap update expire the old tuple and add the new tuple to the other page.
However, the LSN must then be visible on the encrypted pages. So we will not encrypt the LSN on the page. We will also not encrypt the CRC so pg_checksums can still check pages offline without access to the keys.
IV for WAL encryption
We will use the WAL segment number to create an IV for each WAL segments.
IV for key wrapping (key encapsulation)
Key wrapping is not like WAL or heap/index encryption where there is naturally a unique page number or segment ID associated with each data block that can be derived into a different unique IV(s) for encryption using AES-CTR block cipher mode. For key wrapping, AES-CBC is used because this mode only requires one IV to be supplied to encrypt the first block. The encrypted first block is then derived to a new IV for encrypting the next. The IV to be supplied to the key wrapping process will be generated using pg_strong_random() and it will be used to wrap keys during initdb. The generated IV will be appended to the wrapped key before storing to disk so during key un-wrapping, the process is able to find the same IV that was used to wrap the key.
Key rotation is an operation of changing the MK (Master Key in 2-tier key architecture) or KEK (Key Encryption Key in 3-tier key architecture) whose primarily usage is to wrap data encryption keys before storing to disk or unwrap them before encryption or decryption. With this approach, there will not be a need to decrypt all the heap/index or WALs with old key and re-encrypt all data with the new key because the actual data encryption keys stay the same. The MK and KEK are derived from cluster_passphrase_command supplied by the user and therefore key rotation ultimately refers to changing the cluster passphrase, which derives into new MK and KEK.
Checksum and Encryption
Encrypt and then CRC, and store the CRC decrypted
Since we're going to use AES, which is symmetric block cipher this section is all about symmetric keys.
2-tier key architecture
2-tier key architecture is a popular way to make key rotation faster that other databases(MySQL, Oracle database) also employees. We use two types of encryption keys: master key (MK) and data encryption key (DEK). MK is used to encrypt DEK and must be stored at a separate location (e.g. key management server) from database. DEK is used to encrypt actual user data and can be stored in the same location of database after encrypted using MK. If we use 1-tier key architecture, we must decrypt data using the current key and re-encrypt them using the new key, which could take a long time with large data. On the other hand in 2-tier key architecture, what we need to encrypt is only DEKs, decrypt DEKs using the current MK and re-encrypt them using new MK. Since we don't touch user data at all and DEKs are very small (32 bytes for AES-256) the key rotation can complete within a very short time.
The proposed tablespace-level encryption have one MK for database cluster and multiple DEKs for each tablespaces.
Prepared transaction encryption
During the discussion, the point about prepared transaction encryption also came up since they are also persisted. Sawada-san mentioned that we aren’t storing any important data for prepared transactions so we might not need to encrypt it. However we need to have it as part of the todo list.
3-tier key architecture
Proposed by Joe Conway,
1. A master key encryption key (KEK): this is the key supplied by the database admin using something akin to ssl_passphrase_command 2. A master data encryption key (MDEK): this is a generated key using a cryptographically secure pseudo-random number generator. It is encrypted using the KEK, probably with Key Wrap (KW): or maybe better Key Wrap with Padding (KWP): 3a. Per table data encryption keys (TDEK): use MDEK and HKDF to generate table specific keys. 3b. WAL data encryption keys (WDEK): Similarly use MDEK and a HKDF to generate new keys when needed for WAL (based on the other info we need to change WAL keys every 68 GB unless I read that wrong).
Temporary key is a randomly generated keys that lives only process-lifetime. This is used for temporary file encryption.
For parallel queries especially for parallel hash joins, since it's possible that multiple parallel workers use the same temporary file the the temporary key should be shared with parallel workers.
Passing (master) key to PostgreSQL
This topic is under discussion how we support the master key rotation
Encryption keys must be stored outside database. If we employee either 2-tier or 3-tier key architecture the master key is stored outside database. To pass it to PostgreSQL server we got consensus to have a GUC parameter that user can specify arbitrary command like ssl_passphrase_command, at least for the first version.
On the other hand, the tablespace-level encryption patch add a generic key management APIs to communicate various external key management services.
Passing key to front-end programs
This topic is under discussion how front-end programs get keys
Front-end programs that read database files or WAL segment files directly need to interact with KMS to obtain a encryption or decryption context in order to encrypt or decrypt the data. First, they need to supply the correct cluster passphrase and have it verified by KMS in order to obtain the data encryption key (DEK). Then, the DEK will need to be used to create a encryption or decryption context using the KMS utility functions. Finally, the front-end programs will be able to perform encryption and decryption using the corresponding context. Without this interaction with KMS, the front-end programs will look at the pages and WALs and think they are corrupted.
wal_log_hints will be enabled automatically in encryption mode.
TDE requires openssl library.
TODO for Full-Cluster Encryption
Here is list of ongoing tasks with there assignment and status for cluster wide encryption and internal key management system.
- Key management system (Internal key management system) - Assigned to Sawada-san Status : Implementation is almost done. Updated patch is sent to hackers.
- Implement GUC cluster_passphrase_command, something like ssl_passphrase_command, to retrieve the pass phrase on boot
- Add read-only GUC to report the cluster's encryption status
- add three encryption keys into $PGDATA/pg_cryptokeys: TDE-Block, TDE-Wal and SQL keys.
- have wrap and unwrap SQL function which use SQL-key
- Wrapping key consists of HMAC of the encrypted data, IV and encrypted data
- verify the passphrase against its hash on boot
- use the passphrase to decrypt the relation (heap/index) and WAL data encryption keys stored in PGDATA and load them into memory
- KMS documentation and commit to PG master (Bruce/Joe - Carry has already shared a documentation patch for KMS in the internal group)
- Buffer encryption - Assigned to Sawada Status : Needs to update based on the latest KMS patch.
- Which Files do we need to encrypt?
- Need to go through the data directory and get a list of all the files that contain user data - Assigned to Moon-san (Moon-san will check if NTT has done some work in this area and share the result)
- Need to get buy-in from the community for not encrypting non-user data files i.e. visibility map, transactional data etc (This needs to be a team effort)
- use TDE-Block key?
- or derive keys per database objects from the TDE-block key?
- use CTR mode
- Nonce consists of page lsn (8 bytes) + block number (4 bytes) + the space for CTR counter (4 bytes).
- This nonce can ensure the uniqueness of nonce for every writes, and we can get this information from the part of page header.
- don't encrypt LSN(pd_lsn) and CRC(pd_checksum) of the encrypted page contents
- Encrypt then checksum
- don't encrypt the first 12 bytes of a page so pd_flags is visible in encrypted and non-encrypted mode? (Might be useful for online checksum and encryption.)
- make pd_pagesize_version visible on the encrypted page?
- shared buffers nonce is LSN/page-number (nonce is run through encryption to create the IV?)
- require wal_log_hints and full_page_writes to prevent force bit changes to be WAL logged (generates new page LSN)
- Which Files do we need to encrypt?
- WAL encryption - Assigned to Moon/Swada Status : Needs to update based on the latest KMS patch.
- use TDE-Wal key
- use CTR mode
- for WAL, don't use OpenSSL's EVP interface so the offset can be specified?
- WAL nonce is segment number (no timeline)
- Encrypt whole each WAL records
- Need to make sure that we don't encrypt the different data with the same key and nonce, and write it to the disk, especially when encrypting data which is not multiple of 16 bytes.
- add Asssert() code to check that there are no WAL record types that modify more than one relation (already written)
- Temporary file encryption - Assigned to Moon Status : Work in progress, will be posted to hackers soon.
- Encryption key: A hash value is randomly generated for each temporary file, and a temporary key is generated by a combination of the hash value and the master key. (will use HMAC256)
- IV value: The encryption key will be used separately for each file. So that the IV is simply generated(64bit = hash value, 32bit = counter) as it should not exceed 1GB per file(pgsql_tmp).
- Need to check if some new temporary files that could have user data are introduced by recent changes for PG13.
- Front end tools encryption - Assigned to Cary Status : Pending due to the shift of focus on KMS for PG13. Some front-end patches have been shared with community that illustrates the interactions with KMS. Development can resume once the focus is shifted back to TDE.
- Add initdb options --aes128, --aes256, and --cluster-passphrase-command
- update pg_control encryption indicator
- store SHA256 hash of the pass phrase (KEK) in PGDATA (add salt?)
- store relation (heap/index and TOAST) and WAL data encryption keys (DEK) encrypted with pass phrase in PGDATA
- Add initdb flag to reuse the keys from the old cluster in the new cluster
- check old/new cluster encryption indicators match
- check encryption pass phrase and keys match
- Add command-line tool to allow modification of the pass phrase and individual encryption keys
- Allow pg_rewind and pg_waldump to work, add --cluster-passphrase-command option
- does pg_rewind need to work across WAL key changes?
- offlinle tool to allow changing the encryption key of current WAL and PITR WAL files, must be crash-safe
- changing the pass phrase will require --old-passphrase-command and --new-passphrase-command options
- modify pg_basebackup to store relations with a different relation key so standby servers can be used for relation key rotation
- Add initdb options --aes128, --aes256, and --cluster-passphrase-command
Here are some tasks or areas that we need to research, some of these are being worked as part of main to do's listed above. This is a exhaustive list to ensure that we don't skip any todo required for the first phase of TDE.
- TDE for replication
- wal sender (especially xlogreader) needs to take TDE-wal key to decrypt wal data.
- Physical replication
- WAL records are sent in a form of plaintext?
- Both the primary and the standby servers have to use the same internal keys? Maybe we can have the ability to change internal keys during pg_basebackup?
- Logical replication (and decoding)
- We decrypt and decode WAL data and send these changes in a form of plaintext.
- The subscriber will use different encryption keys or even can disable TDE.
- TDE for backup
- During physical backup (pg_basebackup or copying OS files), table/index data are transferred in a form of encrypted text and all three internal keys are replicated
- Maybe we need to have the ability to change some internal keys during basebackup? This can be done by generating new internal keys, re-encrypting database files with the new key during the transfer (we already do checksum verification for every page), generate new controlfile having new internal keys and sending it.
- For logical backup (pg_dump), all data are dumped in a form of plaintext because pg_dump simply fetches data via SQL.
- How does TDE work with backup manifests?
- Regression test cases for TDE
- if a standby is promoted to a primary and the old primary continues writing, one must be rekeyed to avoid using the same IV
- Comprehensive testing
- Requires OpenSSL
List of the contains of user data for PostgreSQL files
If there are any added files, incorrect note or user data, please correct it.
|Num||Database cluster||Contains of user data||Single Sequential Write||Single Process Write-then-Read||note|
|1||PG_VERSION||not contain||Only PostgreSQL version information is written|
|3||base/NNNNN/NNNNN_vm||not contain||VM file|
|4||base/NNNNN/NNNNN_fsm||not contain||FSM file|
|6||base/NNNNN/PG_VERSION||not contain||Only PostgreSQL version information is written|
|9||base/pgsql_tmp/pgsql_tmpPID.tempFileCounter||contain||Temporary file that creates user data temporarily when work_mem size is insufficient|
|11||global/NNNN||contain||Database name and user name|
|12||global/NNNN_vm||not contain||vm file|
|13||global/NNNN_fsm||not contain||fsm file|
|26||pg_replslot/Slotname/xid-NNN-lsn-0-NNNNNNNN.snap||contain||Includes user data decoded from WAL files|
|29||pg_stat/db_NNNNN.stat||contain||Statistics collector includes user data|
|30||pg_stat/global.stat||contain||Statistics collector includes user data|
|31||pg_stat_tmp/db_NNNNN.stat||contain||Statistics collector includes user data|
|32||pg_stat_tmp/global.stat||contain||Statistics collector includes user data|
|35||pg_twophase/NNNNNNNN||not contain||Exclude user data|
|37||pg_wal/*.backup||not contain||Exclude user data|
|38||pg_wal/*.history||not contain||Exclude user data|
|40||pg_wal/archive_status/NNNNNNNNNNNNNNNNNN.done||not contain||Exclude user data|
|41||pg_wal/archive_status/NNNNNNNNNNNNNNNNNN.ready||not contain||Exclude user data|
|43||postgresql.auto.conf||not contain||setting file|
|46||postgresql.conf||not contain||Setting file|
TDE in other systems
MySQL supports per tablespace, data at rest encryption. Please note that in MySQL the tablespace refers to that a data file that can hold data for one or more InnoDB tables and associated indexes, while tablespace refers to a directory in PostgreSQL. innodb_file_per_table option allows tables to be created in their own tablespace. As of MySQL 8.0.16 it supports redo log and undo log encryption and system tables encryption. It supports 2 tier key architecture; it has tablespace keys for each tablespace which are located on the header of tablespace file. The master key can be obtained from external systems via a keyring plugin
MySQL encrypts each page of both redo log and undo log with dedicated keys, not with the keys used for table encryption. The encryption key is stored in the header of the first redo/undo log file in encrypted state.
Oracle DB supports column-level and tablespace-level TDE, both approaches use a two-tiered key-based architecture . The Master Encryption Key (MEK) is stored in an external keystore with both hardware and software keystores supported . The MEK is used to secure the column- and tablespace-level keys. Column-level TDE uses one key per table, tablespace-level TDE uses one key per tablespace. Oracle TDE supports Triple-DES (3DES168) and AES (128, 192, 256 bit). Column-level TDE defaults to AES-192, tablespace-level TDE defaults to AES-128. Both methods add a salt to the plaintext before encryption by default . Column-level TDE supports a NOMAC parameter to improve performance.
MS SQL Server
MS SQL Server supports database-level TDE with a three-tiered architecture using both symmetric and asymmetric key encryption . The Service Master Key (SMK) is generated automatically during installation (e.g. `initdb` in PostgreSQL). The Database Master Key (DMK) is created in the `master` database (e.g. postgres default database) and is encrypted by the SMK. The DMK is then used to generate the certificates actually used to secure the Database Encryption Key (DEK). The DEK is the per-database symmetric key used to encrypt data and log files.