2PC improvement state files in shared memory


This feature was proposed for PostgreSQL 9.0 during its Commitfest at the beginning of August 2009. The patch was not applied because the performance improvements were not considered large enough to risk making a behavior change to the code.

General aspects

Description

  • In PostgreSQL 8.4, the state files of all 2PC transactions are written to disk and then deleted when the commit begins. Keeping all of these state files on disk is unnecessary, as they are usually not needed for recovery.
  • They matter at checkpoint time: recovering a database from a checkpoint relies on the 2PC state files together with the X-logs.
  • With this feature, a state file is placed in shared memory at the prepare phase and deleted from it at the commit phase.
  • The time between the prepare and commit phases is usually short, so a state file stays in shared memory only briefly.

This also speeds up the 2PC process by reducing the amount of data flushed to disk.

  • The amount of shared memory used is controlled by a parameter called state_file_max_space.

At initialization, a block of shared memory of size max_prepared_transactions*state_file_max_space is allocated and then subdivided into blocks of equal size state_file_max_space.

Each block is then used for one state file.

At checkpoint, only the state files of transactions that are prepared but not committed are flushed to disk, as only those are needed to recover from a checkpoint in combination with the X-logs.
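
The memory layout described above can be sketched in a few lines. This is a minimal Python model, not the patch's C code: the class `StateFileArea`, the method `slot_offset`, and the example sizes are hypothetical. Only the sizing rule (max_prepared_transactions*state_file_max_space, one fixed-size block per state file) comes from the description.

```python
class StateFileArea:
    """Toy model of the shared-memory region reserved for 2PC state files."""

    def __init__(self, max_prepared_transactions, state_file_max_space):
        self.num_slots = max_prepared_transactions
        self.slot_size = state_file_max_space
        # One contiguous region, sized once at initialization.
        self.region = bytearray(self.num_slots * self.slot_size)

    def slot_offset(self, slot):
        """Byte offset of a block; all blocks have the same size."""
        if not 0 <= slot < self.num_slots:
            raise IndexError("no such slot")
        return slot * self.slot_size


# Example sizes are assumptions chosen for illustration only.
area = StateFileArea(max_prepared_transactions=10, state_file_max_space=1024)
print(len(area.region))     # 10240
print(area.slot_offset(3))  # 3072
```

In this model, a state_file_max_space of 0 yields an empty region, so no state file can be held in memory.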

Files modified

  • src/backend/access/transam/twophase.c, main part of the implementation:
    • New parameter state_file_max_space, with a default value of 0
    • Creation of space in shared memory for the state files
    • State file management, after gathering the records in EndPrepare
      • A state file whose data is smaller than the limit set by state_file_max_space has its data sent to shared memory
      • Larger data is sent directly to disk
    • At commit, the state file is simply deleted from shared memory; the block is cleaned up and kept for the next state file.
    • Management of the checkpoint
      • For each prepared but not committed transaction, check whether its state file is in shared memory. If so, it is flushed to disk.
      • This check is strict and performed under lock, to be sure that no empty files are sent to disk. For instance, shared memory is reinitialized when state files are flushed.
  • src/backend/utils/misc/guc.c, management of the new parameter state_file_max_space
  • src/include/access/twophase.h, a minor change: addition of the new GUC parameter and of the function FlushStateFile
  • src/backend/storage/ipc/ipci.c, additional initialization of shared memory for the state files
  • src/backend/utils/misc/postgresql.conf.sample, addition of the parameter state_file_max_space, so that the user can control the shared memory used for the improved 2PC.
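
The life cycle of a state file listed for twophase.c (prepare, commit, checkpoint) can be illustrated with a small model. This is a sketch under stated assumptions, not the patch itself: the class `TwoPhaseStateStore` and its method names are hypothetical, disk files are represented by a dictionary, and the fallback to disk when no block is free is an assumption of this sketch.

```python
class TwoPhaseStateStore:
    """Toy model of where a state file lives between PREPARE and COMMIT PREPARED."""

    def __init__(self, max_prepared_transactions, state_file_max_space):
        self.max_space = state_file_max_space
        self.free_slots = list(range(max_prepared_transactions))
        self.in_shmem = {}  # gid -> (slot, state file bytes)
        self.on_disk = {}   # gid -> state file bytes (stand-in for real files)

    def end_prepare(self, gid, data):
        # Smaller than the limit: keep the state file in a shared-memory block.
        # (Falling back to disk when no block is free is an assumption here.)
        if len(data) < self.max_space and self.free_slots:
            self.in_shmem[gid] = (self.free_slots.pop(), bytes(data))
        else:
            # Larger data goes directly to disk.
            self.on_disk[gid] = bytes(data)

    def commit_prepared(self, gid):
        if gid in self.in_shmem:
            slot, _ = self.in_shmem.pop(gid)
            self.free_slots.append(slot)  # block is released for the next state file
        else:
            del self.on_disk[gid]

    def checkpoint(self):
        # Prepared but not committed state files still in memory are flushed,
        # and their blocks are released.
        for gid in list(self.in_shmem):
            slot, data = self.in_shmem.pop(gid)
            self.on_disk[gid] = data
            self.free_slots.append(slot)


store = TwoPhaseStateStore(max_prepared_transactions=2, state_file_max_space=1024)
store.end_prepare("T1", b"x" * 600)   # fits: kept in shared memory
store.end_prepare("T2", b"x" * 2000)  # too large: written to disk
store.checkpoint()                    # T1 is flushed to disk
store.commit_prepared("T1")           # removes the flushed file
store.end_prepare("T3", b"x" * 712)   # fits: kept in shared memory
store.commit_prepared("T3")           # never touches disk
```

In this model, T3 is prepared and committed between checkpoints and therefore never reaches the disk, which is the case the feature targets.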

About the evaluation

Simulation Method

All tests were run on a battery-backed disk array equipped with 8 disks in a RAID 0 configuration, so as to see the effect of the improved 2PC. The tests used pgbench scripts with two transactions whose state file sizes are 600B and 712B. Varying the number of transactions and connections produced a wide range of results. Each simulation was repeated 5 times to make sure the results were stable. Simulations were also run for two extreme values of the scale factor, 1 and 100.

Transactions used

  • State file of 600B
\set nbranches :scale
\set ntellers 10 * :scale
\set naccounts 100000 * :scale
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom tid 1 :ntellers
\setrandom delta -5000 5000
\setrandom txidrnd 0 100000
BEGIN;
UPDATE accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM accounts WHERE aid = :aid;
UPDATE tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE branches SET bbalance = bbalance + :delta WHERE bid = :bid;
PREPARE TRANSACTION 'T:txidrnd';
COMMIT PREPARED 'T:txidrnd';
  • State file of 712B
\set nbranches :scale
\set ntellers 10 * :scale
\set naccounts 100000 * :scale
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom tid 1 :ntellers
\setrandom delta -5000 5000
\setrandom txidrnd 0 100000
BEGIN;
UPDATE accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM accounts WHERE aid = :aid;
SELECT bbalance FROM branches WHERE bid = :bid;
SELECT tbalance FROM tellers WHERE tid = :tid;
UPDATE tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE branches SET bbalance = bbalance + :delta WHERE bid = :bid;
PREPARE TRANSACTION 'T:txidrnd';
COMMIT PREPARED 'T:txidrnd';


Main Results

There are two types of results: raw and normalized. Raw results are expressed in TX/s. Normalized results have no unit and make it easy to see the difference between three cases: 2PC with state files in shared memory, 2PC with state files on disk, and no 2PC. For each pair of values (connections, transactions), the effect of the improved 2PC is evaluated with the simple formula below: <math>\frac{x_{\text{Case}}-x_{\text{2PC Disk}}}{x_{\text{No 2PC}}-x_{\text{2PC Disk}}}=y_{\text{Normalized Rate}}</math>

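With this formula, the TPS rate of the "without 2PC" case is equal to 1 and the TPS rate of the "2PC with state files on disk" case is equal to 0. The normalized rate of the "2PC with state files in shared memory" case therefore shows how much of the gap between the two it recovers.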
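
The normalization can be checked with a short calculation. The function name `normalized_rate` and the TPS values below are hypothetical and chosen only to illustrate the arithmetic; they are not measurements from these tests.

```python
def normalized_rate(tps_case, tps_2pc_disk, tps_no_2pc):
    """Position of a measured TPS between '2PC on disk' (0) and 'no 2PC' (1)."""
    return (tps_case - tps_2pc_disk) / (tps_no_2pc - tps_2pc_disk)


# Hypothetical TPS values, for illustration only.
tps_disk, tps_shmem, tps_none = 1000.0, 1150.0, 2000.0

print(normalized_rate(tps_disk, tps_disk, tps_none))   # 0.0
print(normalized_rate(tps_none, tps_disk, tps_none))   # 1.0
print(normalized_rate(tps_shmem, tps_disk, tps_none))  # 0.15
```

A normalized value of 0.15 means that keeping state files in shared memory recovers 15% of the throughput gap between 2PC on disk and no 2PC.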

Two tables are presented here, one for each value of the scale factor, and only normalized results are shown.

  • Scale factor at 100, normalized results
Connections | Transactions | 600B, 2PC Shmem | 600B, 2PC Disk | 600B, No 2PC | 712B, 2PC Shmem | 712B, 2PC Disk | 712B, No 2PC
2           | 10000        | 0.078663793     | 0              | 1            | 0.079652997     | 0              | 1
5           | 10000        | 0.105263158     | 0              | 1            | 0.08438061      | 0              | 1
10          | 10000        | 0.096105528     | 0              | 1            | 0.071661238     | 0              | 1
25          | 10000        | 0.106321839     | 0              | 1            | 0.128461538     | 0              | 1
35          | 10000        | 0.138996139     | 0              | 1            | 0.12106136      | 0              | 1
50          | 10000        | 0.130278527     | 0              | 1            | 0.140726934     | 0              | 1
60          | 10000        | 0.133937563     | 0              | 1            | 0.151709402     | 0              | 1
70          | 10000        | 0.17218543      | 0              | 1            | 0.149132948     | 0              | 1
80          | 10000        | 0.1775          | 0              | 1            | 0.177865613     | 0              | 1
90          | 10000        | 0.179806362     | 0              | 1            | 0.152327221     | 0              | 1
100         | 10000        | 0.182242991     | 0              | 1            | 0.152647975     | 0              | 1


  • Scale factor at 100, normalized results
Connections | Transactions | 600B, 2PC Shmem | 600B, 2PC Disk | 600B, No 2PC | 712B, 2PC Shmem | 712B, 2PC Disk | 712B, No 2PC
2           | 10000        | 0.031791908     | 0              | 1            | 0.004266212     | 0              | 1
5           | 10000        | 0.018481848     | 0              | 1            | 0.038587312     | 0              | 1
10          | 10000        | 0.049115914     | 0              | 1            | 0.076610169     | 0              | 1
25          | 10000        | 0.06954612      | 0              | 1            | 0.061172472     | 0              | 1
35          | 10000        | 0.077677841     | 0              | 1            | 0.058464223     | 0              | 1
50          | 10000        | 0.059885932     | 0              | 1            | 0.089613035     | 0              | 1
60          | 10000        | 0.071888412     | 0              | 1            | 0.069977427     | 0              | 1
70          | 10000        | 0.094007051     | 0              | 1            | 0.035714286     | 0              | 1
80          | 10000        | 0.078838174     | 0              | 1            | 0.056358382     | 0              | 1

In short, this feature increases the transaction throughput by up to 15%-18% in a highly concurrent environment. In an environment with limited concurrency, the gain is smaller, at up to 10%.