[[File:MyPicture_github.jpg]]

[[Koichi]]

== Distributed deadlock detection ==

=== Purpose of this project ===

This article describes how to extend PostgreSQL's lock system so that the wait-for-graph can include transactions running at remote database servers, invoked by various means such as FDW, libpq and dblink.

This makes it possible to detect deadlocks caused by remote transactions, that is, wait-for-graph cycles formed by chains of remote transactions.

----

=== Producing a global deadlock with FDW ===

We can produce a global deadlock very easily with the existing FDW.

Suppose we have two servers, server1 and server2, and both have a table <code>t (c int)</code>.

From server1, server2's table <code>t</code> is visible as the foreign table <code>svr2_t</code>, and from server2, server1's table <code>t</code> is visible as <code>svr1_t</code>.

Then we can produce a global deadlock as follows:

server1:<br>
<code>BEGIN;</code><br>
<code>LOCK TABLE t IN ACCESS EXCLUSIVE MODE;</code>

server2:<br>
<code>BEGIN;</code><br>
<code>LOCK TABLE t IN ACCESS EXCLUSIVE MODE;</code>

server1:<br>
<code>SELECT * FROM svr2_t;</code>

server2:<br>
<code>SELECT * FROM svr1_t;</code>

Now server1's <code>SELECT</code> waits for the lock on <code>t</code> held by server2's transaction, while server2's <code>SELECT</code> waits for the lock on <code>t</code> held by server1's transaction: a global deadlock cycle that neither server can detect locally.

----

=== How to represent a wait-for-graph including remote transactions ===

PostgreSQL represents object-level locks with various internal functions, mainly in <code>lock.c</code>. This information is used to trace the wait-for-graph when a transaction fails to acquire a lock within the time defined by the <code>deadlock_timeout</code> GUC parameter; the tracing itself is implemented in <code>deadlock.c</code>.

We can extend this to represent the status of transactions waiting for the completion of remote transactions.

==== External Lock ====

Here we define a new locktag type, <code>LOCKTAG_EXTERNAL</code>, in <code>lock.h</code> to represent that the transaction acquiring this lock is waiting for a remote transaction. Because it is held only by the transaction (the upstream transaction) that is waiting for a remote transaction (the downstream transaction) it invoked, no other transaction in the database where the upstream transaction runs cares about this lock. The lock is acquired in exclusive mode only and is held only by the upstream transaction (and, in a way, the upstream transaction is waiting for this lock to complete, too).

The locktag information is similar to that of other lock types, with a reference to additional properties as described below. Applications that invoke a remote transaction should acquire an External Lock by calling <code>ExternalLockAcquire()</code>, a wrapper around <code>LockAcquireExtended()</code> that sets up the locktag for the External Lock.
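
A minimal sketch of what this wrapper could look like, using the standard <code>LOCKTAG</code> layout; the field assignments and the <code>external_lock_id</code> argument are assumptions for illustration, not the patch's actual definitions:

<pre>
#include "postgres.h"
#include "storage/lock.h"
#include "storage/proc.h"

/* Sketch only: how an External Lock might be set up and acquired. */
LockAcquireResult
ExternalLockAcquire(Oid dbid, uint32 external_lock_id)
{
    LOCKTAG     tag;

    tag.locktag_field1 = dbid;               /* database of the upstream txn */
    tag.locktag_field2 = MyProc->pgprocno;   /* the waiting (upstream) backend */
    tag.locktag_field3 = external_lock_id;   /* distinguishes remote invocations */
    tag.locktag_field4 = 0;
    tag.locktag_type = LOCKTAG_EXTERNAL;
    tag.locktag_lockmethodid = DEFAULT_LOCKMETHOD;

    /* Exclusive mode only; no other transaction ever takes this lock. */
    return LockAcquireExtended(&tag, ExclusiveLock, false, false, true, NULL);
}
</pre>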

==== External Lock property ====

Because we use this lock to track the wait-for-graph, we need additional property information in order to connect to the remote database where the downstream transaction is running and continue tracing the wait-for-graph there.

This information consists of the connection string for the remote database and the identity of the downstream transaction's backend.

A new function, <code>ExternalLockSetProperties()</code>, is added to <code>lock.c</code>. To help track the wait-for-graph precisely and to cope with changes in the remote transaction's status, this function takes the connection string and the remote transaction's pgprocno, pid and xid.
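
Based only on the parameter list above, the declaration presumably looks something like the following sketch (not a confirmed signature):

<pre>
/* Record how the deadlock checker can reach the downstream transaction. */
extern void ExternalLockSetProperties(const char *conninfo,      /* remote connection string */
                                      int remote_pgprocno,       /* pgprocno of remote backend */
                                      pid_t remote_pid,          /* pid of remote backend */
                                      TransactionId remote_xid); /* xid of remote transaction */
</pre>

Together, the pgprocno, pid and xid let the checker verify that the backend it reaches on the remote side is still running the same transaction the upstream transaction is waiting for.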

----

=== Deadlock Detection ===

The deadlock detection mechanism reuses the existing deadlock detection code in <code>deadlock.c</code>, with an extension.

When deadlock detection (<code>DeadLockCheck()</code>) is invoked, it begins to trace the wait-for-graph.
During this check, when <code>DeadLockCheck()</code> finds an External Lock among the locks a <code>PGPROC</code> is waiting for, it follows the remote transaction represented by that External Lock in order to build the global wait-for-graph.
This is repeated until the global wait-for-graph terminates, or until the trace comes back to the original upstream transaction and thus forms a cycle.

Dedicated functions are added to perform this check, along the lines of the pseudocode below.
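
In pseudocode, the extended traversal could look like the following; apart from <code>DeadLockCheck()</code> and <code>LOCKTAG_EXTERNAL</code>, every name here is hypothetical:

<pre>
/* Pseudocode sketch of the global wait-for-graph traversal. */
static bool
TraceWaitForGraph(WaiterInfo *waiter, GraphPath *path)
{
    if (IsOriginOfTrace(waiter, path))
        return true;                    /* came back to the start: cycle */

    foreach_blocking_lock(lock, waiter)
    {
        if (lock->tag.locktag_type == LOCKTAG_EXTERNAL)
        {
            /* Edge into a remote database: connect using the stored
             * External Lock properties and continue the trace there. */
            if (TraceRemoteWaitForGraph(ExternalLockProperties(lock), path))
                return true;
        }
        else if (TraceWaitForGraph(HolderOf(lock), path))
            return true;                /* local edge: follow the holder */
    }
    return false;                       /* graph terminates: no deadlock */
}
</pre>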

----

=== LWLocks during external lock trace ===

In local wait-for-graph tracing, the deadlock checking functions hold all the lock-manager LWLocks to simplify the tracing code.
In global wait-for-graph tracing, we likewise hold all these LWLocks while tracing the local portion of the graph.
When the trace goes out to examine the wait-for-graph in a remote database, the LWLocks are all released, so that other transactions can continue to run during the time-consuming remote trace.
When a cycle is found, all the databases involved in the global wait-for-graph cycle check that their local portion of the graph is stable.
If it is not, at least one transaction involved in the graph is still running, and this is not a deadlock.
If it is stable, we conclude that this is a deadlock.
While checking the stability of the local wait-for-graph, we again acquire all the LWLocks locally.
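
The release-and-reacquire step might look like the following sketch, using the lock-manager partition locks that the existing deadlock checker already takes; the remote-probe call is hypothetical:

<pre>
/* Release all lock-manager partition locks before the slow remote probe. */
for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
    LWLockRelease(LockHashPartitionLockByIndex(i));

cycle_found = TraceRemoteWaitForGraph(props, path);  /* network round trips */

/* Reacquire them, in partition order, before touching local lock state. */
for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
    LWLockAcquire(LockHashPartitionLockByIndex(i), LW_EXCLUSIVE);

/* The local graph may have changed while the locks were released, which is
 * why the stability re-check described above is needed before deciding. */
</pre>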

----

=== What applications need to do ===

Applications (or extensions) should call two functions before they invoke a remote transaction:

* <code>ExternalLockAcquire()</code>
* <code>ExternalLockSetProperties()</code>

Applications do not need to release External Locks themselves in order to follow the two-phase locking protocol:
External Locks are released at the end of the transaction as part of the cleanup process.
Because the global deadlock check is done in the background, applications do not need to care about it at all.
Applications only acquire the lock and provide the information needed to trace the wait-for-graph, as in the sketch below.
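
A hypothetical calling pattern for an FDW-like extension; the argument lists follow the assumed signatures sketched earlier, and the <code>remote_*()</code> helpers merely stand in for however the caller learns the remote backend's identity:

<pre>
/* Sketch: mark this backend as waiting on a downstream transaction,
 * then run the remote work. Cleanup happens at transaction end. */
static void
remote_exec(PGconn *conn, const char *conninfo, const char *sql)
{
    ExternalLockAcquire(MyDatabaseId, 0);

    ExternalLockSetProperties(conninfo,
                              remote_pgprocno(conn),  /* hypothetical helper */
                              remote_pid(conn),       /* hypothetical helper */
                              remote_xid(conn));      /* hypothetical helper */

    PQexec(conn, sql);      /* the downstream transaction does the real work */
}
</pre>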

----

=== External Lock Properties ===

Because the current lock structure is too small to hold the lock properties described above, we need extra space for them.
In the current implementation, we keep them in files under <code>$PGDATA/pg_external_locks</code>, with file names derived from the values in the locktag.
For simplicity of implementation, External Lock properties are written in plain text; this may need improvement for security.
The properties could also be stored elsewhere, such as in dynamic shared memory.

----

=== Current Status ===

The code currently runs with PG 14.
You can clone the repo from <code>https://github.com/koichi-szk/postgres.git</code>; please check out the branch <code>koichi/global_deadlock_detection_14_0</code>.

Because we have no actual workload to test this feature with, I keep a separate git repo containing the test environment and several useful functions for testing: <code>https://github.com/koichi-szk/gdd_test.git</code>, branch <code>PG14_GDD</code>.
Please note that this repo depends on my local environment configuration, so you will need to adapt it to your own environment.

=== Future work ===

Because there is currently no actual workload around PostgreSQL that causes global deadlocks, I will continue to port this code to newer releases so that it is ready to be added to PostgreSQL itself when the feature is really needed.

----

[[File:MyPicture_github.jpg]]

My Wiki home: [[Koichi]]

email: koichi.dbms_at_gmail.com

Other topics: [[Distributed deadlock detection]]

== '''Parallel Recovery''' ==

=== PostgreSQL redo log (WAL) outline ===

In PostgreSQL, WAL (write-ahead log) is used for recovery and for log-shipping replication. An outline is given in [https://www.postgresql.org/docs/current/wal-intro.html this documentation page]. In short:

* Every updating backend process must write its redo log to WAL before it actually updates the data files.
* Each WAL record (a piece of update information) has to reach stable storage before the corresponding data is synced to storage.

In recovery, WAL is used as follows:

* Each WAL record is read from the WAL segments in the order it was written.
* Each record read is applied to the corresponding data block. This is called "redo" or "replay".
* To maintain redo consistency, redo has so far been done in a single thread.

In log-shipping replication, each WAL record is:

* Transferred to each replica in the written order, and
* Replayed at the replica in a single thread, just as in recovery.
* When replay reaches a consistent state, the postmaster is allowed to begin accepting read-only transactions while subsequent WAL records continue to arrive and be replayed.

----

=== Issues around serialized replay ===

At present, WAL records are replayed strictly in written order in a single thread, even though they were generated by multiple backend processes in parallel. The issues are:

* Recovery/replay performance is much worse than WAL write performance, so recovery can take longer than the period during which the WAL records were originally written.
* Replication lag can grow long when the master has a heavy updating workload.

----

=== Can replay be done in parallel? ===

Yes, if we enforce several rules about replay order:

* For a given data block, WAL records must be applied in the order they were written.
* For a given transaction, before applying a WAL record that terminates the transaction (commit/abort/prepare for commit), all the WAL records associated with that transaction must have been replayed.
* A multi-block WAL record must be applied by only one worker. Such a record is therefore assigned to every block worker responsible for one of the blocks involved. When a block worker picks up a multi-block record and is not the last worker to do so, it waits until the last worker to pick up the record has actually replayed it; the last worker replays the record and then signals the waiting workers (see the synchronization sketch below).
* Certain special WAL records, such as timeline updates, are replayed only after all preceding WAL records have been replayed. This may not be strictly necessary, but at present it looks safer to impose this condition on such records.
* The replayed LSN reported in pg_controldata is advanced only when all preceding WAL records have been replayed.
* For some storage-level WAL records (create, etc.), we must wait for the record to be replayed before assigning subsequent WAL records.
* For some other storage-level WAL records (remove, etc.), we must wait until all previously assigned WAL records have been replayed.
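
A sketch of the "last worker replays" rule for multi-block records, assuming a per-record counter in shared memory protected by a spinlock; all structure and helper names here are illustrative, not the patch's actual symbols:

<pre>
#include "storage/spin.h"

typedef struct MultiBlockRecord
{
    slock_t     mutex;          /* protects n_remaining */
    int         n_remaining;    /* block workers that have not picked this up */
    /* ... decoded WAL record data ... */
} MultiBlockRecord;

/* Called by each block worker to which the record was dispatched. */
static void
handle_multiblock(MultiBlockRecord *rec)
{
    bool        i_am_last;

    SpinLockAcquire(&rec->mutex);
    i_am_last = (--rec->n_remaining == 0);
    SpinLockRelease(&rec->mutex);

    if (i_am_last)
    {
        replay_record(rec);     /* hypothetical: apply all blocks at once */
        notify_waiters(rec);    /* hypothetical: wake the waiting workers */
    }
    else
        wait_for_replay(rec);   /* hypothetical: block until notified */
}
</pre>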
<br><br />
<br />
----<br />
<br />
=== Implementation Note ===<br />
<br><br />
==== Worker configuration ====<br />
<br />
* Because current recovery is done in the startup process, the parallel replay workers are implemented as its child processes.<br />
* We now have the following processes for this purpose:<br />
** Startup process: reads WAL records from the WAL segments/walsender and performs overall control,<br />
** Dispatching worker: analyzes and dispatches WAL records,<br />
** Error page registration worker: collects error pages from the other workers. The current implementation uses hash functions based on a memory context, which is not simple to port to shared memory. We can replace this with a simpler scheme if we can manage this information in shared memory.<br />
** Transaction worker: replays WAL records that have no block information to update,<br />
** Block worker: replays block-level data (mostly HEAP/HEAP2).<br />
<br />
We can have more than one block worker; each of the other roles consists of a single process, as the illustrative enum below summarizes.<br />
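<br />
As a rough illustration only (the names below are not from the actual code), the worker roles could be summarized as:<br />
<pre>
/* Illustrative only: the five process roles described above. */
typedef enum PRWorkerKind
{
    PR_STARTUP,        /* reads WAL, performs overall control */
    PR_DISPATCHER,     /* analyzes and dispatches WAL records */
    PR_INVALID_PAGE,   /* registers error (invalid) pages from other workers */
    PR_TXN_WORKER,     /* replays records with no block data */
    PR_BLOCK_WORKER    /* replays block data; there may be several of these */
} PRWorkerKind;
</pre>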
<br><br />
==== Shared information and locks ====<br />
<br />
To pass and share information such as WAL records and worker status, dynamic shared memory defined in <code>storage/dsm.h</code> is used.<br />
<br />
To protect these data, spinlocks defined in <code>storage/spin.h</code> are used, as in the sketch below.<br />
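<br />
As a minimal sketch of this arrangement (struct and function names assumed, not from the patch), a queue header can be placed at the start of a DSM segment and protected by a spinlock:<br />
<pre>
/* Illustrative only: a record queue header in a DSM segment,
 * protected by a spinlock. */
#include "postgres.h"
#include "storage/dsm.h"
#include "storage/spin.h"

typedef struct PRQueueHeader
{
    slock_t     mutex;      /* protects head and tail below */
    int         head;       /* next slot to consume */
    int         tail;       /* next slot to fill */
    /* fixed-size WAL record slots follow in the same segment */
} PRQueueHeader;

static PRQueueHeader *
pr_queue_create(Size size)
{
    dsm_segment *seg = dsm_create(size, 0);
    PRQueueHeader *q = (PRQueueHeader *) dsm_segment_address(seg);

    SpinLockInit(&q->mutex);
    q->head = q->tail = 0;
    return q;
}
</pre>
Accesses to <code>head</code>/<code>tail</code> would then be bracketed by <code>SpinLockAcquire(&q->mutex)</code> and <code>SpinLockRelease(&q->mutex)</code> around each short critical section.<br />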
<br />
<br><br />
==== Synchronization among workers ====<br />
<br />
For light-weight synchronization among workers, unix-domain datagram (UDP-style) sockets are used. These could be replaced with latches from <code>storage/latch.h</code>.<br />
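<br />
A minimal sketch of such a wakeup (the function name and socket-path convention are illustrative): each worker binds a datagram socket to its own file, and a peer wakes it by sending a one-byte datagram:<br />
<pre>
/* Illustrative only: wake a peer worker with a one-byte datagram. */
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

static void
pr_send_sync(int my_sock, const char *peer_sock_path)
{
    struct sockaddr_un addr;
    char        c = 's';

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, peer_sock_path, sizeof(addr.sun_path) - 1);

    /* SOCK_DGRAM: one datagram == one sync message, no framing needed */
    sendto(my_sock, &c, 1, 0, (struct sockaddr *) &addr, sizeof(addr));
}
</pre>
A latch-based variant would instead call <code>SetLatch()</code> on the target worker's latch, with the waiting worker blocked in <code>WaitLatch()</code>.<br />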
<br><br />
==== New GUC ====<br />
<br />
The following GUC parameters are added (an illustrative <code>postgresql.conf</code> excerpt follows the list):<br />
<br />
* <code>parallel_replay</code> (bool) enables parallel replay.<br />
* <code>parallel_replay_test</code> (bool, internal use only) enables additional code so that workers can be attached with gdb for testing.<br />
* <code>num_preplay_workers</code> (int) total number of replay workers.<br />
* <code>num_preplay_queue</code> (int) number of queues holding outstanding WAL records.<br />
* <code>num_preplay_max_txn</code> (int) maximum number of outstanding transactions allowed during recovery; must be <code>max_connections</code> or larger. On a replication standby, it must be the master's <code>max_connections</code> or larger.<br />
* <code>preplay_buffers</code> (int)<br />
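<br />
For illustration, an excerpt might look as follows; the parameter names are those listed above, but all values are arbitrary examples, not recommendations:<br />
<pre>
# Illustrative values only
parallel_replay = on
num_preplay_workers = 8        # total replay workers
num_preplay_queue = 1024       # queues for outstanding WAL records
num_preplay_max_txn = 128      # >= max_connections (the master's, on a standby)
preplay_buffers = 4096         # replay buffer count (int)
</pre>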
<br><br />
<br />
----<br />
<br />
=== Connection to GDB ===<br />
<br />
Unlike a usual backend, the recovery code runs in the startup process (and in workers forked from it). Because the startup process is terminated when the postmaster begins to accept connections (unless hot standby is enabled), we need some extra machinery to connect GDB to the parallel replay workers. Here's how this is done:<br />
<br />
* Implement a dummy function to serve as a gdb break point (<code>PRDebug_sync()</code>).<br />
* When the startup process begins to fork workers, it writes messages to a debug file (located in <code>pr_debug</code> under <code>PGDATA</code>).<br />
* The message describes what to do in a separate shell:<br />
** Start gdb,<br />
** Attach the startup process or a worker process, respectively,<br />
** Set up a temporary break function,<br />
** Create a signal file.<br />
* The worker (and the startup process) then waits for the signal file to be created and calls the break function (see the sketch below).<br />
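<br />
Below is a minimal sketch of the signal-file handshake; only <code>PRDebug_sync()</code> is named above, and the polling helper and file name are assumptions:<br />
<pre>
#include "postgres.h"
#include <sys/stat.h>
#include "miscadmin.h"          /* for pg_usleep() */

/* Dummy function: in gdb, "break PRDebug_sync" stops here. */
void
PRDebug_sync(void)
{
}

/* Illustrative helper: wait until the operator creates the signal file. */
static void
pr_wait_for_debugger(const char *signal_file)
{
    struct stat st;

    while (stat(signal_file, &st) != 0)
        pg_usleep(100000L);     /* poll every 100 ms */
    PRDebug_sync();             /* the temporary break point fires here */
}
</pre>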
<br />
In this way, we can connect the startup process and each worker process to gdb without pain.<br />
<br><br />
<br><br />
----<br />
<br />
=== Current code ===<br />
<br />
The current code is available from [https://github.com/koichi-szk/postgres.git Koichi's GitHub repository]. The branch is <code>parallel_replay_14_6</code>.<br />
<br />
For more information about testing, you can visit [https://github.com/koichi-szk/parallel_recovery_test.git another of Koichi's GitHub repositories]. Use the master branch. Please understand that its documentation is not yet complete and needs more updates for general use.<br />
<br />
<br />
----<br />
<br />
<br />
=== Current code status ===<br />
<br />
Not tested yet. The code just passes the build and is under review before testing. A couple of functions still need improvement/development, and some others are affected by this change.<br />
<br />
Complete features are:<br />
<br />
* Forking additional worker processes from the startup process,<br />
* Assigning <code>XLogRecord</code>s to workers,<br />
* Various synchronization mechanisms to maintain the semantics of the global WAL replay order.<br />
<br />
Remaining issues are:<br />
<br />
* More testing of shared buffer allocation/release, to verify the implementation assumption (if no space is available, simply wait until other workers finish and free memory).<br />
* Fixing breakage in the recovery code (mostly due to wrongly decoded information passed from the startup process).<br />
* A good benchmark to show the potential performance gain.<br />
<br />
'''Any cooperation on discussion/testing/development is very welcome.''' You can contact koichi.dbms_at_gmail.com.<br />
<br />
Please understand that the repo can be incomplete, as it is worked on only occasionally.<br />
<br />
A further planned addition is:<br />
<br />
* Code to render a WAL record in text format (as pg_waldump does) and keep it in a region of memory to help testing. <code>xlog_outrec()</code> looks very useful for this purpose. At present, it is enabled with the <code>WAL_DEBUG</code> flag.<br />
<br><br />
<br><br />
----<br />
<br><br />
<br><br />
===Discussion in PGCon 2023===<br />
<br />
The slide deck is available at [[File:Parallel Recovery in PostgreSQL.pdf]].<br />
<br />
Here are some of the inputs from the PGCon 2023 presentation.<br />
<br />
* Parallel recovery on a standby may break if index page redo is done before the corresponding data page redo.<br />
** In this case, an index record will point to a vacant (not-in-use) page. In the current index implementation, this causes an error.<br />
** The idea is to use parallel recovery only when hot standby is disabled. In parallel with that, we can change the index code simply to ignore such dangling references.<br />
* Can workers be child processes of the postmaster?<br />
** A comment suggested this is possible. Making them children of the startup process looks simpler, since they inherit the startup process's resources.<br />
* The unix-domain socket could be replaced with a latch.<br />
** This should be considered in an upcoming improvement.<br />
<br />
I didn't receive feedback on the use of POSIX mqueues. However, POSIX mqueues are not very flexible to configure: every change needs kernel configuration, so it looks better to implement this with shared memory and a socket/latch.</div>Koichihttps://wiki.postgresql.org/index.php?title=File:Parallel_Recovery_in_PostgreSQL.pdf&diff=37957File:Parallel Recovery in PostgreSQL.pdf2023-06-08T01:03:24Z<p>Koichi: PGCon presentation slides, May 31st, 2023, Ottawa, ON, Canada</p>
<hr />
<div>== Summary ==<br />
PGCon presentation slides, May 31st, 2023, Ottawa, ON, Canada<br />
== Licensing ==<br />
{{The PostgreSQL Licence}}</div>Koichihttps://wiki.postgresql.org/index.php?title=Distributed_transaction_visibility&diff=37771Distributed transaction visibility2023-04-20T09:11:13Z<p>Koichi: /* Problem to solve */</p>
<hr />
<div><br />
----<br />
[[File:MyPicture_github.jpg]]<br />
<br />
[[Koichi]]<br />
<br />
== Global Visibility for distributed R/W workloads among a PostgreSQL database cluster ==<br />
<br />
=== Goal of the project ===<br />
<br />
* Provide consistent visibility to distributed R/W transactions spanning more than one postgres database instance.<br />
* No impact on local transactions.<br />
<br />
=== Problem to solve ===<br />
<br />
We have the two-phase commit protocol to maintain write consistency of a group of transactions forming a distributed transaction.<br />
<br />
In this case, the final "commit" timestamp differs among the database instances.<br />
<br />
Because of this, the result of a write can be visible in some of the involved instances but invisible in others.<br />
This is known as a distributed read anomaly.<br />
<br />
We need to solve this read anomaly problem.<br />
<br />
In addition to the technical topics addressed in Postgres-XC and XL, this project also aims to support:<br />
<br />
* Joining a local PostgreSQL database to a database cluster group,<br />
* Parting a database belonging to a database cluster off into a separate database.<br />
<br />
=== Outline of the solution ===<br />
<br />
* We define a global transaction identifier that is unique to each distributed transaction and shared by all transactions belonging to the same global transaction.<br />
* We use a global snapshot for global transactions in the row visibility check.<br />
<br />
=== Definition and structure of a distributed transaction ===<br />
<br />
* Any distributed transaction has one "root" transaction, which starts the whole transaction.<br />
* Other transactions in a distributed transaction must be invoked by the "root" transaction or by a "child" transaction, which is itself invoked directly or indirectly by the root transaction.<br />
* The "'''root'''" transaction is given a global transaction ID ('''GXID''') as <code>(dbid, gxid)</code>, where <code>dbid</code> identifies the database where the root transaction is running and <code>gxid</code> is a unique number given in ascending order by the global transaction proxy described below.<br />
Typically, the database system identifier given by initdb can be used for this purpose. This is implementation-dependent and we can choose any other means as long as dbid is unique. Note that the database system identifier is copied when the database is cloned, for example by `pg_basebackup` or streaming replication, so we may need a means to assign a different database system identifier if such a cloned database is to be used as a separate database instance.<br />
* When the "root" transaction or a "'''child'''" transaction invokes child transactions, such "parent" transactions have to propagate their global transaction ID to their children. Locally, this global transaction ID has to be associated with the local transaction.<br />
<br />
=== Assignment of global transaction ID ===<br />
<br />
* We need a separate facility to manage global transaction information in a given PostgreSQL cluster, called the "'''global transaction proxy'''" ('''GTP''').<br />
* A transaction that wants to be a root transaction and to provide and receive consistent visibility needs to connect to the GTP and request a new GXID. The GTP assigns a gxid and records it with the root transaction's dbid as the <code>(dbid, gxid)</code> pair (sketched below). Please note that this is needed not only for write transactions but also for read transactions that need consistent visibility among different database instances.<br />
* A root transaction has to propagate this GXID to its children. When a child invokes its own child, it likewise has to propagate this GXID, which should be associated with the local transaction of the child.<br />
* When a root transaction (and all its children) finishes with the final COMMIT, it has to report to the GTP that the transaction has finished, and the GTP records it as finished. Maintenance of GXID information at the GTP is described below.<br />
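<br />
As a sketch, the GXID could be represented by a pair like the following (the type name is assumed; the actual encoding is an implementation choice):<br />
<pre>
/* Illustrative representation of a global transaction ID. */
typedef struct GlobalTransactionId
{
    uint64  dbid;   /* e.g. the database system identifier of the root's node */
    uint64  gxid;   /* ascending number assigned by the GTP */
} GlobalTransactionId;
</pre>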
<br />
=== Global snapshot and its handling in local transactions ===<br />
<br />
* The root has to ask the GTP for a snapshot of running global transactions. The GTP collects the GXIDs of running global transactions and provides this set of GXIDs as a global snapshot.<br />
* The root has to propagate this global snapshot to its children and so forth. It is associated with the local transaction of each child.<br />
* In the visibility check, if a local XID (in xmin/xmax and others) is associated with a GXID (we may need a flag in each row to indicate this), the transaction's visibility should be determined by whether its GXID appears in the snapshot (note that both terminated and running GXIDs can appear in a global snapshot, depending on its syntax). Details are given below.<br />
* Visibility of local transactions not associated with a GXID can be handled in the same manner as ordinary local transactions today.<br />
* For a local transaction without a GXID, there is no global snapshot, and no GXID is used for the visibility check even when the target row's xid is associated with a GXID. Thus, for local transactions, there is no additional overhead.<br />
<br />
=== Global snapshot data format ===<br />
<br />
Global snapshot data can be represented in the following format:<br />
<br />
<code><br />
(GXID_0_0, ..., GXID_0_M) GXMIN (GXID_1_0, ..., GXID_1_N) GXMAX<br />
</code><br />
<br />
<code>GXMIN</code> means that global transactions whose gxid is smaller than this value have all terminated (committed/aborted), unless they appear in the list <code>(GXID_0_0, ..., GXID_0_M)</code>.<br />
This is used to reduce the size of the snapshot, especially when only a very few global transactions stay alive for error recovery (typically a local commit failure).<br />
<br />
<code>GXMAX</code> is the maximum currently assigned gxid; global transactions with a gxid larger than <code>GXMAX</code> have to be regarded as running.<br />
<br />
<code>(GXID_1_0, ..., GXID_1_N)</code> represents the set of GXIDs whose gxid is between <code>GXMIN</code> and <code>GXMAX</code> and that have already terminated. The reason for listing these is that we cannot guarantee a transaction is running when the associated database instance goes down after the transaction is reported to have begun and before it is reported to have terminated.<br />
<br />
From the above, multiple snapshot representations are possible for a given situation. The GTP should provide the minimum-size snapshot by calculating optimal values for <code>GXMIN</code> and <code>GXMAX</code>.<br />
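<br />
To make the semantics concrete, here is a sketch of the "is this global transaction still running?" test implied by the format above (all type and function names are illustrative):<br />
<pre>
/* Illustrative only: snapshot struct and running-check for the format above. */
#include <stdbool.h>
#include <stdint.h>

typedef struct GlobalSnapshot
{
    uint64_t    gxmin;           /* GXMIN: all below terminated, except list 0 */
    uint64_t    gxmax;           /* GXMAX: all above regarded as running */
    uint64_t   *running_below;   /* (GXID_0_0, ..., GXID_0_M) */
    int         n_running_below;
    uint64_t   *done_between;    /* (GXID_1_0, ..., GXID_1_N) */
    int         n_done_between;
} GlobalSnapshot;

static bool
gxid_in(uint64_t gxid, const uint64_t *list, int n)
{
    for (int i = 0; i < n; i++)
        if (list[i] == gxid)
            return true;
    return false;
}

static bool
GlobalXidIsRunning(uint64_t gxid, const GlobalSnapshot *snap)
{
    if (gxid > snap->gxmax)      /* assigned after the snapshot was taken */
        return true;
    if (gxid < snap->gxmin)      /* terminated unless in the exception list */
        return gxid_in(gxid, snap->running_below, snap->n_running_below);
    /* between GXMIN and GXMAX: running unless listed as terminated */
    return !gxid_in(gxid, snap->done_between, snap->n_done_between);
}
</pre>
For example, with <code>GXMIN</code> = 100, <code>GXMAX</code> = 110, running-below list (97) and terminated list (103, 105): gxid 95 is terminated, 97 is running, 103 is terminated, 104 is running, and 111 is running.<br />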
<br />
=== Global snapshot maintenance in GTP ===<br />
<br />
A GXID cannot be removed from the GTP immediately, because there is a chance that another global transaction depends on it.<br />
<br />
When a transaction terminates, its GXID is marked as finished and is associated with the set of then-running GXIDs as referring GXIDs. Each such referring GXID can be removed from the list when it terminates. When the number of referring GXIDs reaches zero, the GXID information can be removed from the GTP.<br />
<br />
When a database instance goes down and restarts, all the associated transactions are automatically aborted. In this case, the database instance should report this to the GTP, and the GTP can clean up the global transaction information whose dbid matches that database instance.<br />
<br />
Please note that such a GXID will not appear in snapshots requested after it terminates. This can be used to calculate <code>GXMIN</code> so as to minimize the size of the global snapshot.<br />
<br />
=== Global snapshot maintenance in local database instances ===<br />
<br />
A local database instance cannot remove a GXID associated with a local XID immediately when the GXID terminates, because other global transactions can still refer to it afterwards. To prevent bloat, the local database instance should periodically ask the GTP which GXIDs are still valid. If a GXID is no longer present at the GTP, the local database instance can remove it. Please note that such a remaining GXID-XID mapping does not affect the correctness of the visibility check, but it can impact performance.<br />
<br />
=== Database instance belonging to more than one database cluster ===<br />
<br />
We can allow a database instance to belong to more than one cluster. In this case, we need to extend the above with a cluster ID. Any global transaction needs to determine which cluster it belongs to and send its GXID and global snapshot requests to the appropriate GTP. In this case, a GXID can be represented as (cluster-id, dbid, gxid).</div>Koichihttps://wiki.postgresql.org/index.php?title=Distributed_deadlock_detection&diff=37350Distributed deadlock detection2022-11-21T00:51:18Z<p>Koichi: </p>
<hr />
<div>[[File:MyPicture_github.jpg]]<br />
<br />
[[Koichi]]<br />
<br />
== Distributed deadlock detection ==<br />
<br />
=== Purpose of this project ===<br />
<br />
This article describes how to extend PostgreSQL's lock system so that the wait-for-graph can include transactions running at remote database servers through various means, for example FDW, libpq and dblink.<br />
<br />
This makes it possible to detect deadlocks caused by remote transactions, namely wait-for-graph cycles formed by chains of remote transactions.<br />
<br />
----<br />
<br />
=== How to represent wait-for-graph including remote transactions ===<br />
<br />
PostgreSQL itself represents object-level locks with various internal functions, mainly in <code>lock.c</code>. These are used to chase the wait-for-graph when a transaction fails to acquire a lock within the timeframe defined by the <code>deadlock_timeout</code> GUC parameter, as implemented in <code>deadlock.c</code>.<br />
<br />
We can extend this to represent the status of transactions waiting for the completion of remote transactions.<br />
<br />
==== External Lock ====<br />
<br />
Here we define a new locktag type, <code>LOCKTAG_EXTERNAL</code> in <code>lock.h</code>, to represent that the transaction acquiring this lock is waiting for a remote transaction. Because it is held only by the transaction (the upstream transaction) waiting for a remote transaction (the downstream transaction) that it invoked, no other transaction in the database where the upstream transaction is running cares about this lock. The lock is acquired in exclusive mode only and is held only by the upstream transaction (and, in a way, the upstream transaction is waiting for this lock to complete, too).<br />
<br />
The locktag information is similar to that of other lock types, with a reference to additional properties as described below. For this purpose, applications that invoke remote transactions should acquire an External Lock by calling <code>ExternalLockAcquire()</code>. This is a wrapper around <code>LockAcquireExtended()</code> and sets up the locktag for the External Lock.<br />
<br />
==== External Lock property ====<br />
<br />
Because we are using this lock to track the wait-for-graph, we need additional property information in order to connect to the remote database where the downstream transaction is running and continue tracing the wait-for-graph there.<br />
<br />
This consists of the connection string for that database and the backend ID of the downstream transaction.<br />
<br />
A new function, <code>ExternalLockSetProperties()</code>, is added in <code>lock.c</code>. To help track the wait-for-graph precisely and deal with changes in the remote transaction's status, this function requires the connection string and the remote transaction's pgprocno, pid and xid.<br />
<br />
----<br />
<br />
=== Deadlock Detection ===<br />
<br />
The deadlock detection mechanism uses the existing deadlock detection code in <code>deadlock.c</code>, with an extension.<br />
<br />
When deadlock detection (<code>DeadLockCheck()</code>) is called, it begins to trace the wait-for-graph.<br />
During this check, when <code>DeadLockCheck()</code> finds an External Lock among the waiting locks of a <code>PGPROC</code>, it begins to trace the remote transaction represented by this External Lock, building a global wait-for-graph.<br />
This is repeated until the global wait-for-graph terminates or it comes back to the original upstream transaction, forming a cycle.<br />
<br />
Dedicated functions are added to perform this check.<br />
<br />
----<br />
<br />
=== LWLocks during external lock trace ===<br />
<br />
In local wait-for-graph tracing, all the relevant LWLocks are acquired by the deadlock-checking functions to simplify the tracing code.<br />
In global wait-for-graph tracing, we likewise acquire all these LWLocks while tracing the local portion of the wait-for-graph.<br />
When the trace goes out to the remote database to check the rest of the wait-for-graph, these LWLocks are all released so that other transactions can continue to run during the time-consuming remote tracing.<br />
When a cycle is found, all the databases involved in the global wait-for-graph cycle check that their local portion of the wait-for-graph is stable.<br />
If it is not, at least one transaction involved in the wait-for-graph is still running, and this is not a deadlock.<br />
If it is stable, we determine that this is a deadlock.<br />
During the stability check of the local wait-for-graph, we again acquire all the LWLocks locally.<br />
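<br />
A minimal sketch of this release-and-reacquire pattern over the existing lock-manager partition locks (the remote tracing step is elided):<br />
<br />
<code>int i;</code><br><br />
<code>/* release all lock partition LWLocks before visiting the remote database */</code><br><br />
<code>for (i = NUM_LOCK_PARTITIONS; --i >= 0;)</code><br><br />
<code>    LWLockRelease(LockHashPartitionLockByIndex(i));</code><br><br />
<code>/* ... trace the wait-for-graph on the remote database ... */</code><br><br />
<code>/* re-acquire them to check that the local wait-for-graph is stable */</code><br><br />
<code>for (i = 0; i < NUM_LOCK_PARTITIONS; i++)</code><br><br />
<code>    LWLockAcquire(LockHashPartitionLockByIndex(i), LW_EXCLUSIVE);</code><br />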
<br />
----<br />
<br />
=== What applications to do ===<br />
<br />
Applications (or extensions) should call two functions before they invoke a remote transaction:<br />
<br />
* <code>ExternalLockAcquire()</code><br />
* <code>ExternalLockSetProperties()</code><br />
<br />
Applications do not need to release External Locks themselves in order to follow the two-phase locking protocol.<br />
External Locks are released at the end of the transaction as part of the cleanup process.<br />
Because the global deadlock check is done in the background, applications do not need to care about it at all.<br />
Applications should only acquire the lock and provide the information needed to trace the wait-for-graph.<br />
<br />
----<br />
<br />
=== External Lock Properties ===<br />
<br />
Because the current lock structure is too small to hold the lock properties described above, we need extra space for them.<br />
In the current implementation, they are held in files under <code>$PGDATA/pg_external_locks</code>.<br />
The file name is based on the values in the locktag.<br />
For simplicity of implementation, External Lock properties are written in plain text; this may need improvement for security.<br />
External Lock properties could also be stored elsewhere, such as in dynamic shared memory.<br />
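<br />
For instance (the exact naming scheme is an assumption; the real name is derived from the locktag values), the property file path could be built as:<br />
<br />
<code>char path[MAXPGPATH];</code><br><br />
<code>snprintf(path, MAXPGPATH, "%s/pg_external_locks/%u.%u.%u",</code><br><br />
<code>         DataDir, tag.locktag_field1, tag.locktag_field2, tag.locktag_field3);</code><br />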
<br />
----<br />
<br />
=== Current Status ===<br />
<br />
The code now runs with PG 14.<br />
You can freely clone the repo from <code>https://github.com/koichi-szk/postgres.git</code>.<br />
Please check out the branch <code>koichi/global_deadlock_detection_14_0</code>.<br />
<br />
Because we don't have an actual workload to test this feature, I keep a separate git repo containing the test environment and several useful functions for testing.<br />
You can visit the repo at <code>https://github.com/koichi-szk/gdd_test.git</code>.<br />
Please check out the branch <code>PG14_GDD</code>.<br />
Please also note that this repo depends on my local environment configuration; you will need to adapt the environment to your own.<br />
<br />
=== Future work ===<br />
<br />
Because there is no actual workload around PG that causes global deadlocks, I will continue to port this code to future releases so that it is ready to be added to PG itself when this feature is really needed.</div>Koichihttps://wiki.postgresql.org/index.php?title=Parallel_Recovery&diff=37348Parallel Recovery2022-11-18T08:39:50Z<p>Koichi: </p>
<hr />
<div>[[File:MyPicture_github.jpg]]<br />
<br />
My Wiki home: [[Koichi]]<br />
<br />
email: koichi.dbms_at_gmail.com<br />
<br />
Other topics: [[Distributed deadlock detection]]<br />
== '''Parallel Recovery''' ==<br />
<br><br />
<br />
=== PostgreSQL redo log (WAL) outline ===<br />
<br />
In PostgreSQL, WAL (write-ahead log) is used for recovery and log-shipping replication. An outline is given in [https://www.postgresql.org/docs/current/wal-intro.html this documentation page]. The key points are:<br />
<br />
* Every updating backend process must write its redo log to WAL before it really updates the data files.<br />
* Each WAL record (a piece of update information) has to reach stable storage before the corresponding data is synced to storage.<br />
<br />
In recovery, the WAL is used as follows:<br />
<br />
* Each WAL record is read from the WAL segments in the written order.<br />
* Each WAL record read is applied to the corresponding data block. This is called "redo" or "replay".<br />
* To maintain redo consistency, redo is done in a single thread.<br />
<br />
In the case of log-shipping replication, each WAL record is:<br />
<br />
* Transferred to each replica in the written order, and<br />
* Replayed at the replica in a single thread, just as in recovery.<br />
* When replay reaches a consistent state, the postmaster is allowed to begin read transactions while subsequent WAL records arrive and are replayed.<br />
<br><br />
<br />
----<br />
<br />
=== Issues around serialized replay ===<br />
<br />
At present, WAL records are replayed strictly in written order in a single thread, while they are generated by multiple backend processes in parallel. The issues are:<br />
<br />
* Recovery/replay performance is much worse than WAL write performance. Recovery can take longer than the period during which the WAL records were written.<br />
* Replication lag can be long when the master has heavy updating workloads.<br />
<br><br />
<br />
----<br />
<br />
=== Can replay be done in parallel ? ===<br />
<br />
If we apply several rules about recovery order, yes. The rules are:<br />
<br />
* For a given data block, WAL records must be applied in the written order.<br />
* For a given transaction, before we apply a WAL record terminating the transaction (commit/abort/prepare for commit), all the transaction's WAL records must have been replayed.<br />
* A multi-block WAL record should be applied by only one worker. For this purpose, such a record is assigned to all the block workers responsible for the blocks involved. A block worker that picks up a multi-block record and is not the last to do so waits until the last worker actually replays it; the last worker to pick up the record replays it and sends a sync to the other waiting workers (see the sketch after this list).<br />
* Specific WAL records, such as timeline updates, should wait until all preceding WAL records have been replayed. This may not be strictly necessary, but at present it looks safer to impose such a condition on certain sets of WAL records.<br />
* The replayed LSN reported by pg_controldata is updated only when all the preceding WAL records have been replayed.<br />
* For some storage-level WAL records (create, etc.), we need to wait until they are replayed before assigning subsequent WAL records.<br />
* For some storage-level WAL records (remove, etc.), we need to wait until all previously assigned WAL records have been replayed.<br />
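<br />
The multi-block rule could be implemented roughly as follows; <code>remaining_workers</code> and the helper calls are illustrative assumptions, while <code>pg_atomic_sub_fetch_u32()</code> is the existing atomics API:<br />
<br />
<code>if (pg_atomic_sub_fetch_u32(&rec->remaining_workers, 1) == 0)</code><br><br />
<code>{</code><br><br />
<code>    /* last worker to pick up the record: replay it and wake the others */</code><br><br />
<code>    ReplayWalRecord(rec);</code><br><br />
<code>    SyncWaitingWorkers(rec);</code><br><br />
<code>}</code><br><br />
<code>else</code><br><br />
<code>    WaitUntilReplayed(rec);   /* not the last: wait for the replay */</code><br />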
<br><br />
<br />
----<br />
<br />
=== Implementation Note ===<br />
<br><br />
==== Worker configuration ====<br />
<br />
* Because current recovery is done in the startup process, the parallel replay workers are implemented as its child processes.<br />
* We now have the following processes for this purpose:<br />
** Startup process: reads WAL records from the WAL segments/walsender and performs overall control.<br />
** Dispatching worker: analyzes and dispatches WAL records.<br />
** Error page registration worker: collects error pages from the other workers. The current implementation uses hash functions based on memory contexts, which are not simple to port to shared memory. We could replace this with a simpler configuration if such information can be managed in shared memory.<br />
** Transaction worker: replays WAL records that do not carry any block information to update.<br />
** Block worker: replays block information (mostly HEAP/HEAP2).<br />
<br />
We can have more than one block worker. Each of the other workers consists of a single process.<br />
<br><br />
==== Shared information and locks ====<br />
<br />
To pass and share information such as WAL records and status, dynamic shared memory as defined in <code>storage/dsm.h</code> is used.<br />
<br />
To protect this data, spinlocks as defined in <code>storage/spin.h</code> are used.<br />
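<br />
A minimal sketch of this arrangement (the struct layout and names are illustrative assumptions; <code>dsm_create()</code>, <code>dsm_segment_address()</code> and the spinlock calls are the existing APIs):<br />
<br />
<code>typedef struct PRShared      /* illustrative layout */</code><br><br />
<code>{</code><br><br />
<code>    slock_t    slock;        /* protects the fields below */</code><br><br />
<code>    XLogRecPtr replayed_lsn; /* updated under the spinlock */</code><br><br />
<code>} PRShared;</code><br><br />
<br />
<code>dsm_segment *seg = dsm_create(sizeof(PRShared), 0);</code><br><br />
<code>PRShared *shared = (PRShared *) dsm_segment_address(seg);</code><br><br />
<code>SpinLockInit(&shared->slock);</code><br />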
<br />
<br><br />
==== Synchronization among workers ====<br />
<br />
For lightweight synchronization among workers, unix-domain UDP sockets are used. This could be replaced with latches (<code>storage/latch.h</code>).<br />
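<br />
For example, each worker could create its datagram endpoint as below (the socket path is an illustrative assumption; this needs the <code>sys/socket.h</code> and <code>sys/un.h</code> headers):<br />
<br />
<code>int sock = socket(AF_UNIX, SOCK_DGRAM, 0);</code><br><br />
<code>struct sockaddr_un addr = { 0 };</code><br><br />
<code>addr.sun_family = AF_UNIX;</code><br><br />
<code>snprintf(addr.sun_path, sizeof(addr.sun_path), "%s/pr_socket.%d", DataDir, MyProcPid);</code><br><br />
<code>bind(sock, (struct sockaddr *) &addr, sizeof(addr));</code><br />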
<br><br />
==== New GUC ====<br />
<br />
The following GUC parameters are added:<br />
<br />
* <code>parallel_replay</code> (bool): enables parallel replay.<br />
* <code>parallel_replay_test</code> (bool, internal only): enables additional code so that workers can be attached to gdb for testing.<br />
* <code>num_preplay_workers</code> (int): total number of replay workers.<br />
* <code>num_preplay_queue</code> (int): number of queues holding outstanding WAL records.<br />
* <code>num_preplay_max_txn</code> (int): maximum number of outstanding transactions allowed during recovery. Must be <code>max_connections</code> or larger; for a replication standby, it must be the master's <code>max_connections</code> or larger.<br />
* <code>preplay_buffers</code> (int)<br />
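<br />
For example, a standby doing parallel replay might be configured as follows in <code>postgresql.conf</code> (the values are illustrative only):<br />
<br />
<code>parallel_replay = on</code><br><br />
<code>num_preplay_workers = 8</code><br><br />
<code>num_preplay_queue = 1024</code><br><br />
<code>num_preplay_max_txn = 128   # at least (the master's) max_connections</code><br><br />
<code>preplay_buffers = 4096</code><br />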
<br><br />
<br />
----<br />
<br />
=== Connection to GDB ===<br />
<br />
Unlike a usual backend, the recovery code runs as the startup process (and as workers forked from the startup process). Because the startup process is terminated when the postmaster begins to accept connections (unless hot standby is enabled), we need some extra machinery to connect GDB to the parallel replay workers. Here's how this is done:<br />
<br />
* Implement a dummy function to act as a gdb breakpoint (<code>PRDebug_sync()</code>).<br />
* When the startup process begins to fork workers, it writes messages to a debug file (located in <code>pr_debug</code> under <code>PGDATA</code>).<br />
* The message describes what to do in a separate shell:<br />
** Start gdb,<br />
** Attach the startup process and each worker process respectively,<br />
** Set up the temporary break function,<br />
** Create a signal file.<br />
* The worker (and the startup process) then waits for the signal file to be created and calls the break function.<br />
<br />
In this way, we can connect the startup process and each worker process to gdb without pain.<br />
<br><br />
<br><br />
----<br />
<br />
=== Current code ===<br />
<br />
The current code is available from [https://github.com/koichi-szk/postgres.git Koichi's GITHub repository]. The branch is <code>parallel_replay_14_6</code>.<br />
<br />
<br><br />
<br />
----<br />
<br />
<br />
=== Current code status ===<br />
<br />
Not tested yet. The code just passes the build and is under review before testing. A couple of functions still need improvement/development, and some others are influenced by this change.<br />
<br />
Completed features are:<br />
<br />
* Fork additional worker processes from the startup process,<br />
* Assign each <code>XLogRecord</code> to workers,<br />
* Various synchronization mechanisms to maintain the semantics of the global WAL replay order.<br />
<br />
Remaining issues are:<br />
<br />
* More testing of shared buffer allocation/release, to check whether the implementation assumption (if there is no space, just wait until the other workers finish and free memory) is correct.<br />
* Recovery code breaks (most of these are due to wrongly decoded information passed from the startup process).<br />
* A good benchmark to show the potential performance gain.<br />
<br />
'''Any cooperation on discussion, testing or development is very welcome.''' You can contact koichi.dbms_at_gmail.com.<br />
<br />
Please understand that the repo may be incomplete, as it is worked on occasionally.<br />
<br />
A further planned addition is:<br />
<br />
* Code to render WAL records in text format (as pg_waldump does) and keep them in a portion of memory to help testing. <code>xlog_outrec()</code> looks very useful for this purpose. At present, this is enabled with the <code>WAL_DEBUG</code> flag.</div>Koichihttps://wiki.postgresql.org/index.php?title=Distributed_transaction_visibility&diff=37342Distributed transaction visibility2022-11-18T03:06:18Z<p>Koichi: </p>
<hr />
<div>[[File:MyPicture_github.jpg]]<br />
<br />
My Wiki home: [[Koichi]]<br />
<br />
email: koichi.dbms_at_gmail.com<br />
<br />
Other topics: [[Distributed deadlock detection]]<br />
== '''Parallel Recovery''' ==<br />
<br><br />
<br />
=== PostgreSQL redo log (WAL) outline ===<br />
<br />
In PostgreSQL, WAL (write-ahead log) is used for recovery and log-shipping replication. Outline is described in [https://www.postgresql.org/docs/current/wal-intro.html this documentation page]. The outline of WAL is:<br />
<br />
* Every updating backend process must write its redo log to WAL before it really updates data files.<br />
* Each WAL record (piece of updating data record) has to reach stable storage before the data is synch'ed to the storage.<br />
<br />
WAL is used in recovery and log-shipping replication. In recovery, the WAL is used as follows:<br />
<br />
* Each WAL record is read from WAL segments in the written order,<br />
* Read WAL record is applied to corresponding data block. This is called "redo" or "replay".<br />
* To maintain redo consistency, redo was done in single thread.<br />
<br />
In the case of log-shipping replication, each WAL record is:<br />
<br />
* Transferred to each replica in the written order and<br />
* Replayed at replica in a single thread just as in recovery<br />
* When replay reaches a consistent state, it allows postmaster to begin read transactions, while following WAL records reach and are replayed.<br />
<br><br />
<br />
----<br />
<br />
=== Issues around serialized replay ===<br />
<br />
At present, WAL records are replayed strictly in written order in a single thread, while they are generated by multiple backend processes in parallel. The issues are:<br />
<br />
* Recovery/replay performance is much worse than performance to write WALs. Recovery can take longer than the period WAL records were written.<br />
* Replication lag can be long when the master has heavy updating workloads.<br />
<br><br />
<br />
----<br />
<br />
=== Can replay be done in parallel ? ===<br />
<br />
If we apply several rules about recovery order, yes. The rules are:<br />
<br />
* For a given data block, WAL records must be applied in the written order.<br />
* For a given transaction, when we apply WAL record terminating the transaction (commit/abort/prepare for commit), all the associated WAL records must have been replayed,<br />
* In the case of multi-block WAL, it should be applied only by one worker. For this purpose, such WAL record is assigned to all the block workers responsible for each block involved. When block worker picks up multi-block WAL and if it is not the last thread, it should wait until the last thread picking up this record actually replay it. If a thread is the last to pick up multi-block WAL record, it replays the record and send sync to other waiging workers.<br />
* To replay specific WAL record such as updating timeline, it should wait until all the preceding WAL records are replayed This may not be necessary but it looks safer at present to have such condition for certain set of WAL records.<br />
* Replayed LSN at pg_controldata will be updated when all the preceding WAL records are replayed.<br />
* Before applying transaction end (commit/abort/prepare for commit), all the WAL record belongs to the transaction must have been replayed.<br />
* In the case of several storage level WAL (create, etc), we need to wait for these WAL record replayed before assigning subsequent WAL records,<br />
* In the case of several storage level WAL (remove, etc), we need to wait until all the assigned WAL records have been replayed.<br />
* We can update applied WAL LSN when all the preceding WAL records has been replayed.<br />
<br><br />
<br />
----<br />
<br />
=== Implementation Note ===<br />
<br><br />
==== Worker configuration ====<br />
<br />
* Because current recovery is done in startup process, parallel replay threads are implemented as its child process.<br />
* Now we have following processes for this purpose:<br />
** Startup process: read WAL record from WAL segment/walsender and perform overall control,<br />
** Dispatching worker: analyze and dispatch WAL records,<br />
** Error page registration: collects error pages from other worker. Current implementation is using hash functions based on memory context, which is not simple to port it to shared memory. We can replace this with simpler configuration if we can manage such information using shared memory.<br />
** Transaction worker: replays WAL record which does not have any block information to update.<br />
** Block worker: replays block information (mostly HEAP/HEAP2).<br />
<br />
We can have more than one block workers. Other workers consists of single process each.<br />
<br><br />
==== Shared information and locks ====<br />
<br />
To pass and share information such as WAL record and status, shared memory defined in <code>storage/dsm.h</code> is used.<br />
<br />
To protect these data, spin lock defined in <code>storage/spin.h</code> is used.<br />
<br />
<br><br />
==== Synchronization among workers ====<br />
<br />
For light-weight synchronization among workers, unix-domain udp socket is used. This can be replaced with <code>stoage/latch.h</code>.<br />
<br><br />
==== New GUC ====<br />
<br />
Following GUC parameters are added:<br />
<br />
* <code>parallel_replay</code> (bool)<br />
* <code>parallel_replay_test</code> (bool, just internal) enable additional code to work to attach workers with gdb for test.<br />
* <code>num_preplay_workers</code> (int) number of total replay workers.<br />
* <code>num_preplay_queue</code> (int) number of queues holding outstanding WAL records,<br />
* <code>num_preplay_max_txn</code> (int) number of max outstanding transaction allowed in recovery. Max_connections or larger. In the case of replication slave, must be master's max_connections or larger.<br />
* <code>preplay_buffers</code> (int)<br />
<br><br />
<br />
----<br />
<br />
=== Connection to GDB ===<br />
<br />
Different from usual backend, recovery code runs as startup process (and workers forked from startup process). Because this is to be terminated when postmaster begins to accept connection unless hot standby is enabled, we need some more to connect GDB to parallel replay workers. Here's how to do this:<br />
<br />
* Implement dummy function to work as gdb break point (<code>PRDebug_sync()</code>).<br />
* When startup process begins to fork workers, it writes messages to the debug file (located in <code>pr_debug</code> in <code>PGDATA</code>).<br />
* The mssage contains what to do in separate shell:<br />
* Start gdb<br />
* Attach startup process and worker process respectively,<br />
* Setup temprary break function,<br />
* Create signal file,<br />
* Then the worker (and startup process) waits for the signal file to be created and call the break function.<br />
<br />
In this way, we can connect startup process and each worker process to gdb without pain.<br />
<br><br />
<br><br />
----<br />
<br />
=== Current code ===<br />
<br />
Current code is available from [https://github.com/koichi-szk/postgres.git Koichi's GITHub repository]. Branch is <code>parallel_replay_14_6</code>.<br />
<br />
<br><br />
<br />
----<br />
<br />
<br />
=== Current code status ===<br />
<br />
No test yet. The code just pass the build and under the review before testing. Now a couple of functions need improvement/development and some others are influenced by this change.<br />
<br />
Complete features are:<br />
<br />
* Folk additional worker process from the startup process,<br />
* Assign <code>XLogRecord</code> to workers,<br />
* Various synchronization mechanism to maintain semantics of global order of WAL replay,<br />
<br />
Remaining issues are:<br />
<br />
* Need more test of shared buffer allocation/release. If implementation assumption (if no space, just wait until all the other worker finishes and frees available memory) is correct.<br />
* Recovery code breaks (most of them are due to wrong decoded information from passed from startup process).<br />
* Good benchmark to show potential performance gain.<br />
<br />
'''Any cooperation for discussion/test/development is very very welcome.''' You can contact to koichi.dbms_at_gmail.com.<br />
<br />
Please understand that the repo can be incomplete due to occasional work.<br />
<br />
Further addition is:<br />
<br />
* To add a code to have WAL record in text format (like pg_waldump) and hold it to some portion of the memory to help the test. <code>xlog_outrec()</code> looks very useful for this purpose. At present, this is enabled with <code>WAL_DEBUG</code> flag.</div>Koichihttps://wiki.postgresql.org/index.php?title=Parallel_Recovery&diff=37346Parallel Recovery2022-11-18T08:36:40Z<p>Koichi: </p>
<hr />
<div>[[File:MyPicture_github.jpg]]<br />
<br />
My Wiki home: [[Koichi]]<br />
<br />
email: koichi.dbms_at_gmail.com<br />
<br />
Other topics: [[Distributed deadlock detection]]<br />
== '''Parallel Recovery''' ==<br />
<br><br />
<br />
=== PostgreSQL redo log (WAL) outline ===<br />
<br />
In PostgreSQL, WAL (write-ahead log) is used for recovery and log-shipping replication. Outline is described in [https://www.postgresql.org/docs/current/wal-intro.html this documentation page]. The outline of WAL is:<br />
<br />
* Every updating backend process must write its redo log to WAL before it really updates data files.<br />
* Each WAL record (piece of updating data record) has to reach stable storage before the data is synch'ed to the storage.<br />
<br />
WAL is used in recovery and log-shipping replication. In recovery, the WAL is used as follows:<br />
<br />
* Each WAL record is read from WAL segments in the written order,<br />
* Read WAL record is applied to corresponding data block. This is called "redo" or "replay".<br />
* To maintain redo consistency, redo was done in single thread.<br />
<br />
In the case of log-shipping replication, each WAL record is:<br />
<br />
* Transferred to each replica in the written order and<br />
* Replayed at replica in a single thread just as in recovery<br />
* When replay reaches a consistent state, it allows postmaster to begin read transactions, while following WAL records reach and are replayed.<br />
<br><br />
<br />
----<br />
<br />
=== Issues around serialized replay ===<br />
<br />
At present, WAL records are replayed strictly in written order in a single thread, while they are generated by multiple backend processes in parallel. The issues are:<br />
<br />
* Recovery/replay performance is much worse than performance to write WALs. Recovery can take longer than the period WAL records were written.<br />
* Replication lag can be long when the master has heavy updating workloads.<br />
<br><br />
<br />
----<br />
<br />
=== Can replay be done in parallel ? ===<br />
<br />
If we apply several rules about recovery order, yes. The rules are:<br />
<br />
* For a given data block, WAL records must be applied in the written order.<br />
* For a given transaction, when we apply WAL record terminating the transaction (commit/abort/prepare for commit), all the associated WAL records must have been replayed,<br />
* In the case of multi-block WAL, it should be applied only by one worker. For this purpose, such WAL record is assigned to all the block workers responsible for each block involved. When block worker picks up multi-block WAL and if it is not the last thread, it should wait until the last thread picking up this record actually replay it. If a thread is the last to pick up multi-block WAL record, it replays the record and send sync to other waiging workers.<br />
* To replay specific WAL record such as updating timeline, it should wait until all the preceding WAL records are replayed This may not be necessary but it looks safer at present to have such condition for certain set of WAL records.<br />
* Replayed LSN at pg_controldata will be updated when all the preceding WAL records are replayed.<br />
* Before applying transaction end (commit/abort/prepare for commit), all the WAL record belongs to the transaction must have been replayed.<br />
* In the case of several storage level WAL (create, etc), we need to wait for these WAL record replayed before assigning subsequent WAL records,<br />
* In the case of several storage level WAL (remove, etc), we need to wait until all the assigned WAL records have been replayed.<br />
* We can update applied WAL LSN when all the preceding WAL records has been replayed.<br />
<br><br />
<br />
----<br />
<br />
=== Implementation Note ===<br />
<br><br />
==== Worker configuration ====<br />
<br />
* Because current recovery is done in startup process, parallel replay threads are implemented as its child process.<br />
* Now we have following processes for this purpose:<br />
** Startup process: read WAL record from WAL segment/walsender and perform overall control,<br />
** Dispatching worker: analyze and dispatch WAL records,<br />
** Error page registration: collects error pages from other worker. Current implementation is using hash functions based on memory context, which is not simple to port it to shared memory. We can replace this with simpler configuration if we can manage such information using shared memory.<br />
** Transaction worker: replays WAL record which does not have any block information to update.<br />
** Block worker: replays block information (mostly HEAP/HEAP2).<br />
<br />
We can have more than one block workers. Other workers consists of single process each.<br />
<br><br />
==== Shared information and locks ====<br />
<br />
To pass and share information such as WAL record and status, shared memory defined in <code>storage/dsm.h</code> is used.<br />
<br />
To protect these data, spin lock defined in <code>storage/spin.h</code> is used.<br />
<br />
<br><br />
==== Synchronization among workers ====<br />
<br />
For light-weight synchronization among workers, unix-domain udp socket is used. This can be replaced with <code>stoage/latch.h</code>.<br />
<br><br />
==== New GUC ====<br />
<br />
Following GUC parameters are added:<br />
<br />
* <code>parallel_replay</code> (bool)<br />
* <code>parallel_replay_test</code> (bool, just internal) enable additional code to work to attach workers with gdb for test.<br />
* <code>num_preplay_workers</code> (int) number of total replay workers.<br />
* <code>num_preplay_queue</code> (int) number of queues holding outstanding WAL records,<br />
* <code>num_preplay_max_txn</code> (int) number of max outstanding transaction allowed in recovery. Max_connections or larger. In the case of replication slave, must be master's max_connections or larger.<br />
* <code>preplay_buffers</code> (int)<br />
<br><br />
<br />
----<br />
<br />
=== Connection to GDB ===<br />
<br />
Different from usual backend, recovery code runs as startup process (and workers forked from startup process). Because this is to be terminated when postmaster begins to accept connection unless hot standby is enabled, we need some more to connect GDB to parallel replay workers. Here's how to do this:<br />
<br />
* Implement dummy function to work as gdb break point (<code>PRDebug_sync()</code>).<br />
* When startup process begins to fork workers, it writes messages to the debug file (located in <code>pr_debug</code> in <code>PGDATA</code>).<br />
* The mssage contains what to do in separate shell:<br />
* Start gdb<br />
* Attach startup process and worker process respectively,<br />
* Setup temprary break function,<br />
* Create signal file,<br />
* Then the worker (and startup process) waits for the signal file to be created and call the break function.<br />
<br />
In this way, we can connect startup process and each worker process to gdb without pain.<br />
<br><br />
----<br />
<br />
=== Current code ===<br />
<br />
Current code is available from [https://github.com/koichi-szk/postgres.git Koichi's GITHub repository]. Branch is <code>parallel_replay_14_6</code>.<br />
<br />
<br><br />
<br />
----<br />
<br />
<br />
=== Current code status ===<br />
<br />
No test yet. The code just pass the build and under the review before testing. Now a couple of functions need improvement/development and some others are influenced by this change.<br />
<br />
Complete features are:<br />
<br />
* Folk additional worker process from the startup process,<br />
* Assign <code>XLogRecord</code> to workers,<br />
* Various synchronization mechanism to maintain semantics of global order of WAL replay,<br />
<br />
Remaining issues are:<br />
<br />
* Need more test of shared buffer allocation/release. If implementation assumption (if no space, just wait until all the other worker finishes and frees available memory) is correct.<br />
* Recovery code breaks (most of them are due to wrong decoded information from passed from startup process).<br />
* Good benchmark to show potential performance gain.<br />
<br />
'''Any cooperation for discussion/test/development is very very welcome.''' You can contact to koichi.dbms_at_gmail.com.<br />
<br />
Please understand that the repo can be incomplete due to occasional work.<br />
<br />
Further addition is:<br />
<br />
* To add a code to have WAL record in text format (like pg_waldump) and hold it to some portion of the memory to help the test. <code>xlog_outrec()</code> looks very useful for this purpose. At present, this is enabled with <code>WAL_DEBUG</code> flag.</div>Koichihttps://wiki.postgresql.org/index.php?title=Parallel_Recovery&diff=37345Parallel Recovery2022-11-18T05:24:08Z<p>Koichi: /* Can replay be done in parallel ? */</p>
<hr />
<div>[[File:MyPicture_github.jpg]]<br />
<br />
My Wiki home: [[Koichi]]<br />
<br />
email: koichi.dbms_at_gmail.com<br />
<br />
Other topics: [[Distributed deadlock detection]]<br />
== '''Parallel Recovery''' ==<br />
<br><br />
<br />
=== PostgreSQL redo log (WAL) outline ===<br />
<br />
In PostgreSQL, WAL (write-ahead log) is used for recovery and log-shipping replication. Outline is described in [https://www.postgresql.org/docs/current/wal-intro.html this documentation page]. The outline of WAL is:<br />
<br />
* Every updating backend process must write its redo log to WAL before it really updates data files.<br />
* Each WAL record (piece of updating data record) has to reach stable storage before the data is synch'ed to the storage.<br />
<br />
WAL is used in recovery and log-shipping replication. In recovery, the WAL is used as follows:<br />
<br />
* Each WAL record is read from WAL segments in the written order,<br />
* Read WAL record is applied to corresponding data block. This is called "redo" or "replay".<br />
* To maintain redo consistency, redo was done in single thread.<br />
<br />
In the case of log-shipping replication, each WAL record is:<br />
<br />
* Transferred to each replica in the written order and<br />
* Replayed at replica in a single thread just as in recovery<br />
* When replay reaches a consistent state, it allows postmaster to begin read transactions, while following WAL records reach and are replayed.<br />
<br><br />
<br />
----<br />
<br />
=== Issues around serialized replay ===<br />
<br />
At present, WAL records are replayed strictly in written order in a single thread, while they are generated by multiple backend processes in parallel. The issues are:<br />
<br />
* Recovery/replay performance is much worse than performance to write WALs. Recovery can take longer than the period WAL records were written.<br />
* Replication lag can be long when the master has heavy updating workloads.<br />
<br><br />
<br />
----<br />
<br />
=== Can replay be done in parallel ? ===<br />
<br />
If we apply several rules about recovery order, yes. The rules are:<br />
<br />
* For a given data block, WAL records must be applied in the written order.<br />
* For a given transaction, when we apply WAL record terminating the transaction (commit/abort/prepare for commit), all the associated WAL records must have been replayed,<br />
* In the case of multi-block WAL, it should be applied only by one worker. For this purpose, such WAL record is assigned to all the block workers responsible for each block involved. When block worker picks up multi-block WAL and if it is not the last thread, it should wait until the last thread picking up this record actually replay it. If a thread is the last to pick up multi-block WAL record, it replays the record and send sync to other waiging workers.<br />
* To replay specific WAL record such as updating timeline, it should wait until all the preceding WAL records are replayed This may not be necessary but it looks safer at present to have such condition for certain set of WAL records.<br />
* Replayed LSN at pg_controldata will be updated when all the preceding WAL records are replayed.<br />
* Before applying transaction end (commit/abort/prepare for commit), all the WAL record belongs to the transaction must have been replayed.<br />
* In the case of several storage level WAL (create, etc), we need to wait for these WAL record replayed before assigning subsequent WAL records,<br />
* In the case of several storage level WAL (remove, etc), we need to wait until all the assigned WAL records have been replayed.<br />
* We can update applied WAL LSN when all the preceding WAL records has been replayed.<br />
<br><br />
<br />
----<br />
<br />
=== Implementation Note ===<br />
<br><br />
==== Worker configuration ====<br />
<br />
* Because current recovery is done in startup process, parallel replay threads are implemented as its child process.<br />
* Now we have following processes for this purpose:<br />
** Startup process: read WAL record from WAL segment/walsender and perform overall control,<br />
** Dispatching worker: analyze and dispatch WAL records,<br />
** Error page registration: collects error pages from other worker. Current implementation is using hash functions based on memory context, which is not simple to port it to shared memory. We can replace this with simpler configuration if we can manage such information using shared memory.<br />
** Transaction worker: replays WAL record which does not have any block information to update.<br />
** Block worker: replays block information (mostly HEAP/HEAP2).<br />
<br />
We can have more than one block workers. Other workers consists of single process each.<br />
<br><br />
==== Shared information and locks ====<br />
<br />
To pass and share information such as WAL record and status, shared memory defined in <code>storage/dsm.h</code> is used.<br />
<br />
To protect these data, spin lock defined in <code>storage/spin.h</code> is used.<br />
<br />
<br><br />
==== Synchronization among workers ====<br />
<br />
For light-weight synchronization among workers, unix-domain udp socket is used. This can be replaced with <code>stoage/latch.h</code>.<br />
<br><br />
==== New GUC ====<br />
<br />
Following GUC parameters are added:<br />
<br />
* <code>parallel_replay</code> (bool)<br />
* <code>parallel_replay_test</code> (bool, just internal) enable additional code to work to attach workers with gdb for test.<br />
* <code>num_preplay_workers</code> (int) number of total replay workers.<br />
* <code>num_preplay_queue</code> (int) number of queues holding outstanding WAL records,<br />
* <code>num_preplay_max_txn</code> (int) number of max outstanding transaction allowed in recovery. Max_connections or larger. In the case of replication slave, must be master's max_connections or larger.<br />
* <code>preplay_buffers</code> (int)<br />
<br><br />
<br />
----<br />
<br />
=== Current code ===<br />
<br />
Current code is available from [https://github.com/koichi-szk/postgres.git Koichi's GITHub repository]. Branch is <code>parallel_replay_14_6</code>.<br />
<br />
<br><br />
<br />
----<br />
<br />
=== Current code status ===<br />
<br />
No test yet. The code just pass the build and under the review before testing. Now a couple of functions need improvement/development and some others are influenced by this change.<br />
<br />
Complete features are:<br />
<br />
* Folk additional worker process from the startup process,<br />
* Assign <code>XLogRecord</code> to workers,<br />
* Various synchronization mechanism to maintain semantics of global order of WAL replay,<br />
<br />
Remaining issues are:<br />
<br />
* Need more test of shared buffer allocation/release. If implementation assumption (if no space, just wait until all the other worker finishes and frees available memory) is correct.<br />
* Recovery code breaks (most of them are due to wrong decoded information from passed from startup process).<br />
* Good benchmark to show potential performance gain.<br />
<br />
'''Any cooperation for discussion/test/development is very very welcome.''' You can contact to koichi.dbms_at_gmail.com.<br />
<br />
Please understand that the repo can be incomplete due to occasional work.<br />
<br />
Further addition is:<br />
<br />
* To add a code to have WAL record in text format (like pg_waldump) and hold it to some portion of the memory to help the test. <code>xlog_outrec()</code> looks very useful for this purpose. At present, this is enabled with <code>WAL_DEBUG</code> flag.</div>Koichihttps://wiki.postgresql.org/index.php?title=Parallel_Recovery&diff=37344Parallel Recovery2022-11-18T05:20:27Z<p>Koichi: /* Current code status */</p>
<hr />
<div>[[File:MyPicture_github.jpg]]<br />
<br />
My Wiki home: [[Koichi]]<br />
<br />
email: koichi.dbms_at_gmail.com<br />
<br />
Other topics: [[Distributed deadlock detection]]<br />
== '''Parallel Recovery''' ==<br />
<br><br />
<br />
=== PostgreSQL redo log (WAL) outline ===<br />
<br />
In PostgreSQL, WAL (write-ahead log) is used for recovery and log-shipping replication. Outline is described in [https://www.postgresql.org/docs/current/wal-intro.html this documentation page]. The outline of WAL is:<br />
<br />
* Every updating backend process must write its redo log to WAL before it really updates data files.<br />
* Each WAL record (piece of updating data record) has to reach stable storage before the data is synch'ed to the storage.<br />
<br />
WAL is used in recovery and log-shipping replication. In recovery, the WAL is used as follows:<br />
<br />
* Each WAL record is read from WAL segments in the written order,<br />
* Read WAL record is applied to corresponding data block. This is called "redo" or "replay".<br />
* To maintain redo consistency, redo was done in single thread.<br />
<br />
In the case of log-shipping replication, each WAL record is:<br />
<br />
* Transferred to each replica in the written order and<br />
* Replayed at replica in a single thread just as in recovery<br />
* When replay reaches a consistent state, it allows postmaster to begin read transactions, while following WAL records reach and are replayed.<br />
<br><br />
<br />
----<br />
<br />
=== Issues around serialized replay ===<br />
<br />
At present, WAL records are replayed strictly in written order in a single thread, while they are generated by multiple backend processes in parallel. The issues are:<br />
<br />
* Recovery/replay performance is much worse than performance to write WALs. Recovery can take longer than the period WAL records were written.<br />
* Replication lag can be long when the master has heavy updating workloads.<br />
<br><br />
<br />
----<br />
<br />
=== Can replay be done in parallel ? ===<br />
<br />
If we apply several rules about recovery order, yes. The rules are:<br />
<br />
* For a given data block, WAL records must be applied in the written order.<br />
* For a given transaction, when we apply WAL record terminating the transaction (commit/abort/prepare for commit), all the associated WAL records must have been replayed,<br />
* In the case of multi-block WAL, it should be applied only by one worker. For this purpose, such WAL record is assigned to all the block workers responsible for each block involved. When block worker picks up multi-block WAL and if it is not the last thread, it should wait until the last thread picking up this record actually replay it. If a thread is the last to pick up multi-block WAL record, it replays the record and send sync to other waiging workers.<br />
* To replay specific WAL record such as updating timeline, it should wait until all the preceding WAL records are replayed This may not be necessary but it looks safer at present to have such condition for certain set of WAL records.<br />
* Replayed LSN at pg_controldata will be updated when all the preceding WAL records are replayed.<br />
<br><br />
<br />
----<br />
<br />
=== Implementation Note ===<br />
<br><br />
==== Worker configuration ====<br />
<br />
* Because current recovery is done in startup process, parallel replay threads are implemented as its child process.<br />
* Now we have following processes for this purpose:<br />
** Startup process: read WAL record from WAL segment/walsender and perform overall control,<br />
** Dispatching worker: analyze and dispatch WAL records,<br />
** Error page registration: collects error pages from other worker. Current implementation is using hash functions based on memory context, which is not simple to port it to shared memory. We can replace this with simpler configuration if we can manage such information using shared memory.<br />
** Transaction worker: replays WAL record which does not have any block information to update.<br />
** Block worker: replays block information (mostly HEAP/HEAP2).<br />
<br />
We can have more than one block workers. Other workers consists of single process each.<br />
<br><br />
==== Shared information and locks ====<br />
<br />
To pass and share information such as WAL record and status, shared memory defined in <code>storage/dsm.h</code> is used.<br />
<br />
To protect these data, spin lock defined in <code>storage/spin.h</code> is used.<br />
<br />
<br><br />
==== Synchronization among workers ====<br />
<br />
For light-weight synchronization among workers, unix-domain udp socket is used. This can be replaced with <code>stoage/latch.h</code>.<br />
<br><br />
==== New GUC ====<br />
<br />
Following GUC parameters are added:<br />
<br />
* <code>parallel_replay</code> (bool)<br />
* <code>parallel_replay_test</code> (bool, just internal) enable additional code to work to attach workers with gdb for test.<br />
* <code>num_preplay_workers</code> (int) number of total replay workers.<br />
* <code>num_preplay_queue</code> (int) number of queues holding outstanding WAL records,<br />
* <code>num_preplay_max_txn</code> (int) number of max outstanding transaction allowed in recovery. Max_connections or larger. In the case of replication slave, must be master's max_connections or larger.<br />
* <code>preplay_buffers</code> (int)<br />
<br><br />
<br />
----<br />
<br />
=== Current code ===<br />
<br />
Current code is available from [https://github.com/koichi-szk/postgres.git Koichi's GITHub repository]. Branch is <code>parallel_replay_14_6</code>.<br />
<br />
<br><br />
<br />
----<br />
<br />
=== Current code status ===<br />
<br />
No test yet. The code just pass the build and under the review before testing. Now a couple of functions need improvement/development and some others are influenced by this change.<br />
<br />
Complete features are:<br />
<br />
* Folk additional worker process from the startup process,<br />
* Assign <code>XLogRecord</code> to workers,<br />
* Various synchronization mechanism to maintain semantics of global order of WAL replay,<br />
<br />
Remaining issues are:<br />
<br />
* Need more test of shared buffer allocation/release. If implementation assumption (if no space, just wait until all the other worker finishes and frees available memory) is correct.<br />
* Recovery code breaks (most of them are due to wrong decoded information from passed from startup process).<br />
* Good benchmark to show potential performance gain.<br />
<br />
'''Any cooperation for discussion/test/development is very very welcome.''' You can contact to koichi.dbms_at_gmail.com.<br />
<br />
Please understand that the repo can be incomplete due to occasional work.<br />
<br />
Further addition is:<br />
<br />
* To add a code to have WAL record in text format (like pg_waldump) and hold it to some portion of the memory to help the test. <code>xlog_outrec()</code> looks very useful for this purpose. At present, this is enabled with <code>WAL_DEBUG</code> flag.</div>Koichihttps://wiki.postgresql.org/index.php?title=Parallel_Recovery&diff=37343Parallel Recovery2022-11-18T05:15:02Z<p>Koichi: /* Current code */</p>
<hr />
<div>[[File:MyPicture_github.jpg]]<br />
<br />
My Wiki home: [[Koichi]]<br />
<br />
email: koichi.dbms_at_gmail.com<br />
<br />
Other topics: [[Distributed deadlock detection]]<br />
== '''Parallel Recovery''' ==<br />
<br><br />
<br />
=== PostgreSQL redo log (WAL) outline ===<br />
<br />
In PostgreSQL, WAL (write-ahead log) is used for recovery and log-shipping replication. Outline is described in [https://www.postgresql.org/docs/current/wal-intro.html this documentation page]. The outline of WAL is:<br />
<br />
* Every updating backend process must write its redo log to WAL before it really updates data files.<br />
* Each WAL record (piece of updating data record) has to reach stable storage before the data is synch'ed to the storage.<br />
<br />
WAL is used in recovery and log-shipping replication. In recovery, the WAL is used as follows:<br />
<br />
* Each WAL record is read from WAL segments in the written order,<br />
* Read WAL record is applied to corresponding data block. This is called "redo" or "replay".<br />
* To maintain redo consistency, redo was done in single thread.<br />
<br />
In the case of log-shipping replication, each WAL record is:<br />
<br />
* Transferred to each replica in the written order and<br />
* Replayed at replica in a single thread just as in recovery<br />
* When replay reaches a consistent state, it allows postmaster to begin read transactions, while following WAL records reach and are replayed.<br />
<br><br />
<br />
----<br />
<br />
=== Issues around serialized replay ===<br />
<br />
At present, WAL records are replayed strictly in written order in a single thread, while they are generated by multiple backend processes in parallel. The issues are:<br />
<br />
* Recovery/replay performance is much worse than performance to write WALs. Recovery can take longer than the period WAL records were written.<br />
* Replication lag can be long when the master has heavy updating workloads.<br />
<br><br />
<br />
----<br />
<br />
=== Can replay be done in parallel ? ===<br />
<br />
If we apply several rules about recovery order, yes. The rules are:<br />
<br />
* For a given data block, WAL records must be applied in the written order.<br />
* For a given transaction, when we apply WAL record terminating the transaction (commit/abort/prepare for commit), all the associated WAL records must have been replayed,<br />
* In the case of multi-block WAL, it should be applied only by one worker. For this purpose, such WAL record is assigned to all the block workers responsible for each block involved. When block worker picks up multi-block WAL and if it is not the last thread, it should wait until the last thread picking up this record actually replay it. If a thread is the last to pick up multi-block WAL record, it replays the record and send sync to other waiging workers.<br />
* To replay specific WAL record such as updating timeline, it should wait until all the preceding WAL records are replayed This may not be necessary but it looks safer at present to have such condition for certain set of WAL records.<br />
* Replayed LSN at pg_controldata will be updated when all the preceding WAL records are replayed.<br />
<br><br />
<br />
----<br />
<br />
=== Implementation Note ===<br />
<br><br />
==== Worker configuration ====<br />
<br />
* Because current recovery is done in startup process, parallel replay threads are implemented as its child process.<br />
* Now we have following processes for this purpose:<br />
** Startup process: read WAL record from WAL segment/walsender and perform overall control,<br />
** Dispatching worker: analyze and dispatch WAL records,<br />
** Error page registration: collects error pages from other worker. Current implementation is using hash functions based on memory context, which is not simple to port it to shared memory. We can replace this with simpler configuration if we can manage such information using shared memory.<br />
** Transaction worker: replays WAL record which does not have any block information to update.<br />
** Block worker: replays block information (mostly HEAP/HEAP2).<br />
<br />
We can have more than one block workers. Other workers consists of single process each.<br />
<br><br />
==== Shared information and locks ====<br />
<br />
To pass and share information such as WAL record and status, shared memory defined in <code>storage/dsm.h</code> is used.<br />
<br />
To protect these data, spin lock defined in <code>storage/spin.h</code> is used.<br />
<br />
<br><br />
==== Synchronization among workers ====<br />
<br />
For lightweight synchronization among workers, a Unix-domain UDP socket is used. This could be replaced with the latch facility in <code>storage/latch.h</code>; a sketch of the latter follows.<br />
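<br />
A sketch of what the latch-based replacement could look like, using the standard latch API (the function names and wait-event choice are illustrative only):<br />
<pre>
#include "postgres.h"
#include "miscadmin.h"
#include "storage/latch.h"
#include "utils/wait_event.h"

/* Waiting side: e.g. a block worker waiting for a multi-block sync. */
static void
wait_for_sync(void)
{
    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
                     -1L,       /* no timeout */
                     PG_WAIT_EXTENSION);
    ResetLatch(MyLatch);
}

/* Waking side: the worker that replayed the record wakes a waiter. */
static void
send_sync(Latch *waiter_latch)
{
    SetLatch(waiter_latch);
}
</pre>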
<br><br />
==== New GUC ====<br />
<br />
The following GUC parameters are added; an illustrative configuration sketch follows the list:<br />
<br />
* <code>parallel_replay</code> (bool)<br />
* <code>parallel_replay_test</code> (bool, internal only): enables additional code that makes it possible to attach workers with gdb for testing.<br />
* <code>num_preplay_workers</code> (int): total number of replay workers.<br />
* <code>num_preplay_queue</code> (int): number of queues holding outstanding WAL records.<br />
* <code>num_preplay_max_txn</code> (int): maximum number of outstanding transactions allowed during recovery; must be <code>max_connections</code> or larger. On a replication standby, it must be the primary's <code>max_connections</code> or larger.<br />
* <code>preplay_buffers</code> (int)<br />
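<br />
An illustrative <code>postgresql.conf</code> sketch using these GUCs (the values are arbitrary examples, not recommendations; <code>preplay_buffers</code> is omitted because its meaning is not described here):<br />
<pre>
# illustrative values only
parallel_replay = on
num_preplay_workers = 8       # total replay workers
num_preplay_queue = 1024      # queues for outstanding WAL records
num_preplay_max_txn = 100     # >= max_connections (the primary's, on a standby)
</pre>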
<br><br />
<br />
----<br />
<br />
=== Current code ===<br />
<br />
Current code is available from [https://github.com/koichi-szk/postgres.git Koichi's GITHub repository]. Branch is <code>parallel_replay_14_6</code>.<br />
<br />
<br><br />
<br />
----<br />
<br />
=== Current code status ===<br />
<br />
Not tested yet. The code just passes the build and is under review before testing. A couple of functions still need improvement/development, and some others are affected by this change.<br />
<br />
'''Any cooperation on discussion/testing/development is very welcome.''' You can contact koichi.dbms_at_gmail.com.<br />
<br />
Please understand that the repository can be incomplete, as it is worked on occasionally.<br />
<br />
A further planned addition is:<br />
<br />
* Code to render WAL records in text format (as pg_waldump does) and keep them in a portion of memory to help testing. <code>xlog_outrec()</code> looks very useful for this purpose. At present, this is enabled with the <code>WAL_DEBUG</code> flag.</div>Koichihttps://wiki.postgresql.org/index.php?title=Distributed_transaction_visibility&diff=37342Distributed transaction visibility2022-11-18T03:06:18Z<p>Koichi: </p>
<hr />
<div><br />
----<br />
[[File:MyPicture_github.jpg]]<br />
<br />
[[Koichi]]<br />
<br />
== Global Visibility for distributed R/W workloads across a PostgreSQL database cluster ==<br />
<br />
=== Goal of the project ===<br />
<br />
* Provide consistent visibility to distributed R/W transactions spanning more than one postgres database instance.<br />
* No impact on local transactions.<br />
<br />
=== Problem to solve ===<br />
<br />
We have the two-phase commit protocol to maintain write consistency for a group of transactions acting as one distributed transaction.<br />
<br />
In this case, the final "commit" timestamp differs among the database instances.<br />
<br />
Because of this, the result of a write can be visible in some of the involved instances but invisible in others.<br />
This is known as a distributed read anomaly.<br />
<br />
We need to solve this read anomaly problem.<br />
<br />
=== Outline of the solution ===<br />
<br />
* We define a global transaction identifier that is unique to each distributed transaction and shared by all transactions belonging to the same global transaction.<br />
* We use a global snapshot for global transactions in row visibility checks.<br />
<br />
=== Definition and structure of a distributed transaction ===<br />
<br />
* Any distributed transaction has one "root" transaction which starts the transaction.<br />
* Other transactions in a distributed transaction must be invoked by the "root" transaction or a "child transaction", which is invoked directly or indirectly by the root transaction.<br />
* "'''root'''" transaction is given a global transaction id ('''GXID''') as <code>(dbid, gxid)</code>, where <code>dbid</code> is identifier of the database root transaction is running and gxid is unique number given in ascending order by global transaction proxy as described below.<br />
Typically, database system identifier given by initdb can be used for this purpose. This depends upon implementation and we can choose any other means as long as dbid is unique. In the case of database system identifier, this is copied by cloning the database by `pg_basebackup` and streaming replication, for example. We may need a means to assign different database system identifier if such cloned database should be used as separate database instance.<br />
* When "root" transaction and its "'''child'''" transaction invokes child transactions, such "parent" transactions has to propagate its global transaction Id to their children. Locally, this global transaction ID has to be associated with the local transction.<br />
<br />
=== Assignment of global transaction ID ===<br />
<br />
* We need a separate facility to manage global transaction information in a given PostgreSQL cluster, called the "'''global transaction proxy'''" ('''GTP''').<br />
* A transaction that wants to be a root transaction and to provide and receive consistent visibility needs to connect to the GTP and request a new GXID. The GTP assigns a gxid and records it together with the root transaction's dbid as a <code>(dbid, gxid)</code> pair. Please note that this is needed not only for write transactions but also for read transactions that need consistent visibility among different database instances.<br />
* A root transaction has to propagate this GXID to its children. When a child invokes its own child, it likewise propagates the GXID, which should be associated with the child's local transaction.<br />
* When a root transaction (and all its children) finishes with the final COMMIT, it has to report to the GTP that the transaction has finished, and the GTP records this. Maintenance of GXID information at the GTP is described below. A sketch of the proxy interface follows.<br />
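<br />
A sketch of the client-side GTP interface implied by the steps above; all names are hypothetical illustrations of the protocol, not an actual API:<br />
<pre>
#include <stdint.h>

typedef struct GlobalXid { uint64_t dbid; uint64_t gxid; } GlobalXid;
typedef struct GlobalSnapshot GlobalSnapshot;   /* format described below */

/* Root transaction: obtain a new GXID before doing distributed work. */
extern GlobalXid GtpBeginGlobalXact(uint64_t my_dbid);

/* Root transaction: obtain the global snapshot to propagate to children. */
extern GlobalSnapshot *GtpGetGlobalSnapshot(GlobalXid gxid);

/* Root transaction: report completion after the final COMMIT. */
extern void GtpReportGlobalXactEnd(GlobalXid gxid);
</pre>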
<br />
=== Global snapshot and its handling in local transactions ===<br />
<br />
* The root has to ask the GTP for a snapshot of running global transactions. The GTP collects the GXIDs of running global transactions and provides this set of GXIDs as a global snapshot.<br />
* The root has to propagate this global snapshot to its children, and so on down the tree. The snapshot is associated with each child's local transaction.<br />
* In a visibility check, if a local XID (in xmin/xmax and elsewhere) is associated with a GXID (we may need a flag in each row to indicate this), the transaction's visibility should be determined by whether its GXID appears in the snapshot (note that, depending on its syntax, a global snapshot lists both terminated and running GXIDs). Details are given below.<br />
* Visibility of local transactions not associated with a GXID can be handled in the same manner as ordinary local transactions today.<br />
* For a local transaction without a GXID, there is no global snapshot, and no GXID is used in visibility checks even if a target row's xid is associated with a GXID. Thus local transactions incur no additional overhead. A sketch of this decision follows.<br />
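<br />
A sketch of the visibility decision above. <code>XidInMVCCSnapshot()</code> models PostgreSQL's existing local rule; <code>LookupGxidForXid()</code> and <code>GxidIsRunning()</code> are hypothetical helpers (the latter is sketched in the next section):<br />
<pre>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef struct Snapshot Snapshot;               /* local MVCC snapshot */
typedef struct GlobalSnapshot GlobalSnapshot;   /* format: next section */
typedef struct GlobalXid { uint64_t dbid; uint64_t gxid; } GlobalXid;

/* Hypothetical helpers; XidInMVCCSnapshot models the existing local rule. */
extern bool LookupGxidForXid(TransactionId xid, GlobalXid *gxid);
extern bool GxidIsRunning(const GlobalSnapshot *snap, uint64_t gxid);
extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot *snapshot);

static bool
XidIsVisible(TransactionId xid, Snapshot *local_snap, GlobalSnapshot *gsnap)
{
    GlobalXid   gxid;

    /* Global transaction: decide by the global snapshot. */
    if (gsnap != NULL && LookupGxidForXid(xid, &gxid))
        return !GxidIsRunning(gsnap, gxid.gxid);

    /* Local transaction (or no global snapshot): the usual local rule. */
    return !XidInMVCCSnapshot(xid, local_snap);
}
</pre>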
<br />
=== Global snapshot data format ===<br />
<br />
Global snapshot data can be represented in the following format:<br />
<br />
<code><br />
(GXID_0_0, ..., GXID_0_M) GXMIN (GXID_1_0, ..., GXID_1_N) GXMAX<br />
</code><br />
<br />
<code>GXMIN</code> means that global transactions whose gxid is smaller than this value have all terminated (committed/aborted), unless they appear in the list <code>(GXID_0_0, ..., GXID_0_M)</code>.<br />
This keeps the snapshot small, especially when only a very few global transactions remain alive because of error recovery (typically a local commit failure).<br />
<br />
<code>GXMAX</code> is the maximum value of the currently assigned gxids; global transactions with a gxid larger than <code>GXMAX</code> have to be regarded as running.<br />
<br />
<code>(GXID_1_0, ..., GXID_1_N)</code> represents the set of GXIDs between <code>GXMIN</code> and <code>GXMAX</code> that have already terminated. This set is needed because we cannot guarantee that a transaction is still running when its associated database instance goes down after the transaction is reported to have begun but before it is reported as terminated.<br />
<br />
From the above, multiple snapshot representations are possible for a given situation. The GTP should provide the minimum-size snapshot by calculating optimal values for <code>GXMIN</code> and <code>GXMAX</code>. A sketch of interpreting this format follows.<br />
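<br />
A sketch of interpreting the snapshot format, with a hypothetical in-memory layout. For example, the snapshot <code>(3, 7) 10 (12, 15) 18</code> says: gxids 3 and 7 are running, every other gxid below 10 has terminated, gxids 12 and 15 have terminated, every other gxid from 10 through 18 is running, and anything above 18 is regarded as running:<br />
<pre>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory form of the snapshot format above. */
typedef struct GlobalSnapshot
{
    uint64_t    gxmin;
    uint64_t    gxmax;
    uint64_t   *running_below;  /* (GXID_0_0, ..., GXID_0_M): running, < gxmin */
    int         n_running;
    uint64_t   *finished;       /* (GXID_1_0, ..., GXID_1_N): finished, in window */
    int         n_finished;
} GlobalSnapshot;

static bool
gxid_in_list(const uint64_t *list, int n, uint64_t gxid)
{
    for (int i = 0; i < n; i++)
        if (list[i] == gxid)
            return true;
    return false;
}

/* Below GXMIN: finished unless listed running.  Above GXMAX: running.
 * In the window: running unless listed finished. */
static bool
GxidIsRunning(const GlobalSnapshot *snap, uint64_t gxid)
{
    if (gxid < snap->gxmin)
        return gxid_in_list(snap->running_below, snap->n_running, gxid);
    if (gxid > snap->gxmax)
        return true;
    return !gxid_in_list(snap->finished, snap->n_finished, gxid);
}
</pre>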
<br />
=== Global snapshot maintenance in GTP ===<br />
<br />
A GXID cannot be removed from the GTP immediately, because another global transaction may still depend on it.<br />
<br />
When a transaction terminates, its GXID is marked as finished and is associated with the set of then-running GXIDs as its referring GXIDs. Each such referring GXID is removed from the list when it terminates; when the number of referring GXIDs drops to zero, the GXID information can be removed from the GTP.<br />
<br />
When a database instance goes down and restarts, all of its associated transactions are automatically aborted. In this case, the database instance should report this to the GTP, which can then clean up all global transaction information whose dbid matches that instance.<br />
<br />
Please note that such a GXID will not appear in snapshots requested after it terminates. This can be used to calculate GXMIN so as to minimize the size of the global snapshot.<br />
<br />
=== Global snapshot maintenance in local database instances ===<br />
<br />
A local database instance cannot remove a locally associated GXID immediately when that GXID terminates, because other global transactions may refer to it afterwards. To prevent bloat, the local database instance should periodically ask the GTP which GXIDs are still valid; if a GXID is no longer present at the GTP, the instance can remove it. Please note that a lingering GXID-XID mapping does not affect the correctness of visibility checks, but it can impact performance.<br />
<br />
=== Database instance belonging to more than one database cluster ===<br />
<br />
We can allow a database instance to belong to more than one cluster. In this case, we need to extend the above with a cluster ID: each global transaction needs to determine which cluster it belongs to and direct its GXID and global snapshot requests to the appropriate GTP. The GXID can then be represented as <code>(cluster-id, dbid, gxid)</code>.</div>Koichi
<hr />
<div><br />
----<br />
[[File:MyPicture_github.jpg]]<br />
<br />
[[Koichi]]<br />
<br />
== Global Visibility for distributed R/W workload amount PostgreSQL database cluster ==<br />
<br />
=== Goal of the project ===<br />
<br />
* Provide consistent visibility to distributed R/W transactions spanning into more than one postgres database instance.<br />
* No impact to local transactions.<br />
<br />
=== Problem to solve ===<br />
<br />
We have two-phase commit protocol to maitain write consistency of groups of transactions as a distributed transaction.<br />
<br />
In this case, final "commit" timestamp is different in database instances.<br />
<br />
Because of this, result of write can be visible in some of involved instances but invisible in others.<br />
This is known as distributed read anomaly.<br />
<br />
We need to solve this read anomaly problem.<br />
<br />
=== Outline of the solution ===<br />
<br />
* We define global transaction identifier unique to each distributed transactions and same among transactions begonging to a same global transaction.<br />
* Use global snapshot for global transactions in row visibility check.<br />
<br />
=== Definition and structure of a distrubuted transaction ===<br />
<br />
* Any distributed transaction has one "root" transaction which starts the transaction.<br />
* Other transactions in a distributed transaction must be invoked by the "root" transaction or a "child transaction", which is invoked directly or indirectly by the root transaction.<br />
* "'''root'''" transaction is given a global transaction id ('''GXID''') as <code>(dbid, gxid)</code>, where <code>dbid</code> is identifier of the database root transaction is running and gxid is unique number given in ascending order by global transaction proxy as described below.<br />
Typically, database system identifier given by initdb can be used for this purpose. This depends upon implementation and we can choose any other means as long as dbid is unique. In the case of database system identifier, this is copied by cloning the database by `pg_basebackup` and streaming replication, for example. We may need a means to assign different database system identifier if such cloned database should be used as separate database instance.<br />
* When "root" transaction and its "'''child'''" transaction invokes child transactions, such "parent" transactions has to propagate its global transaction Id to their children. Locally, this global transaction ID has to be associated with the local transction.<br />
<br />
=== Assignment of global transaction ID ===<br />
<br />
* We need separate facility to manage global transaction information in a given postgreSQL cluster, called "'''global transaction proxy'''" ('''GTP''').<br />
* When a transaction wants to be the root transaction and would like to provide and receive consistent visibility need to connect to GPT and request for new GXID. GTP assignes gxid and record this with root transaction's dbid as <code>(dbid, gxid)</code> pair. Please note that this is needed not only for write transaction but for read transaction which need consistent visiblity among different database instances.<br />
* A root transact has to propagate this GXID to its children. When a child is invoking its child, then it also has to propagate this GXID, which should be associated to the local transaction of the child.<br />
* When a root transaction (and all its children) finishes with final COMMIT, it has to report GTP that the transaction finished. GTP records this is finished. Maintenance of GXID information at GTP will be given below.<br />
<br />
=== Global snapshot and its handling in local transactions. ===<br />
<br />
* The root has to ask for snapshot of running global transactions to GTP. GTP collects GXID of running global transactions and provides this set of GXIDs as a global snapshot.<br />
* Root has to propagate this global snapshot to its children and so forth. This is associated with the local transaction of children.<br />
* In visibility check, if a local XID (in xmin/xmax and others) is associated with GXID (we may need a flag in each row to indicate this), such transaction visibility should be determined based on its GXID appearing in the snapshot (note that in global snapshot, both terminated GXID and runing GXID appears, depending upon its syntax). Details will be given below.<br />
* Visibility of local transactions not associated with GXID can be handled in the same manner as current usual local transactions.<br />
* For a local transaction without GXID, there are no global snapshot and no GXID is used for visibility check even target row's xid is associated with GXID. Thus, for the local transactions, there will be no additional overhead.<br />
<br />
=== Global snapshot data format ===<br />
<br />
Global snapshot data can be represented in the following format:<br />
<br />
<code><br />
(GXID_0_0, ..., GXID_0_M) GXMIN (GXID_1_0, ..., GXID_1_N) GXMAX<br />
</code><br />
<br />
<code>GXMIN</code> means global tansaction whose gxid is smaller than this value are all terminated (commited/aborted), unless appearing in the list <code>(GXID_0_0, ... GXID_0_M)</code>.<br />
This is used to save the amount of snapshot especially when only very few number of global transactions are alive for error ercovery (typically local commit failure).<br />
<br />
<code>GXMAX</code> means this is the maximum value of currently assigned gxid and global transactions with gxid larger than GXMAX has to be regarded as running.<br />
<br />
<code>(GXID_1_0, .... , GXOD_1_N)</code> represents a set of GXIDs whose gxid is between <code>GXMIN</code> and <code>GXMAX</code> and has already terminated. The reason for this is we cannot guarantee the transaction is running when assocaited database instance is down after the transaction is reported to begin and before it reports to be terminated.<br />
<br />
From the above, there are multiple snapshot representation possible for a given situation. GPT should provide the minumum size snapshot by calculating optimal value for <code>GXMIN</code> and <code>GXMAX</code>.<br />
<br />
=== Global snapshot maintenance in GTP ===<br />
<br />
GXID in GTP cannot be removed in GTP immediately because we have a chance that other global transaction depends upon this.<br />
<br />
When a transaction terminates, its GXID is marked as finished and is associated with set of running GXIDs as referring GXIDs. Each of such GXID can be removed from the list when it terminates. When referring GXID becomes zero, then such GXID information can be removed from GTP.<br />
<br />
When a database instance gets down and restarts, all the associated transaction are automatically aborted. In this case, each database instance should report this to GTP and GTP can clean up its global transaction information if dbid matches to such database instance.<br />
<br />
Please note that such GXID will not appear in the snapshot requested after it ternates. This can be used to calculate GXMIN in the snapshot to minimize the amount of global snapshot.<br />
<br />
=== Database instance beloging to more than one database cluster ===<br />
<br />
We can allow a database instance can belong to more than one cluster. In this case, we need to extend the above with cluster-id. Any global transaction need to determine which cluster it belongs to and GXID and global snapshot request to appropriate GTP. In this case, GXID can be represented as (cluster-id, dbid, gxid).</div>Koichihttps://wiki.postgresql.org/index.php?title=Distributed_transaction_visibility&diff=37340Distributed transaction visibility2022-11-18T02:21:10Z<p>Koichi: </p>
<hr />
<div><br />
----<br />
[[File:MyPicture_github.jpg]]<br />
<br />
[[Koichi]]<br />
<br />
== Global Visibility for distributed R/W workload amount PostgreSQL database cluster ==<br />
<br />
=== Goal of the project ===<br />
<br />
* Provide consistent visibility to distributed R/W transactions spanning into more than one postgres database instance.<br />
* No impact to local transactions.<br />
<br />
=== Problem to solve ===<br />
<br />
We have two-phase commit protocol to maitain write consistency of groups of transactions as a distributed transaction.<br />
<br />
In this case, final "commit" timestamp is different in database instances.<br />
<br />
Because of this, result of write can be visible in some of involved instances but invisible in others.<br />
This is known as distributed read anomaly.<br />
<br />
We need to solve this read anomaly problem.<br />
<br />
=== Outline of the solution ===<br />
<br />
* We define global transaction identifier unique to each distributed transactions and same among transactions begonging to a same global transaction.<br />
* Use global snapshot for global transactions in row visibility check.<br />
<br />
=== Definition and structure of a distrubuted transaction ===<br />
<br />
* Any distributed transaction has one "root" transaction which starts the transaction.<br />
* Other transactions in a distributed transaction must be invoked by the "root" transaction or a "child transaction", which is invoked directly or indirectly by the root transaction.<br />
* "'''root'''" transaction is given a global transaction id ('''GXID''') as <code>(dbid, gxid)</code>, where <code>dbid</code> is identifier of the database root transaction is running and gxid is unique number given in ascending order by global transaction proxy as described below.<br />
Typically, database system identifier given by initdb can be used for this purpose. This depends upon implementation and we can choose any other means as long as dbid is unique. In the case of database system identifier, this is copied by cloning the database by `pg_basebackup` and streaming replication, for example. We may need a means to assign different database system identifier if such cloned database should be used as separate database instance.<br />
* When "root" transaction and its "'''child'''" transaction invokes child transactions, such "parent" transactions has to propagate its global transaction Id to their children. Locally, this global transaction ID has to be associated with the local transction.<br />
<br />
=== Assignment of global transaction ID ===<br />
<br />
* We need separate facility to manage global transaction information in a given postgreSQL cluster, called "'''global transaction proxy'''" ('''GTP''').<br />
* When a transaction wants to be the root transaction and would like to provide and receive consistent visibility need to connect to GPT and request for new GXID. GTP assignes gxid and record this with root transaction's dbid as <code>(dbid, gxid)</code> pair. Please note that this is needed not only for write transaction but for read transaction which need consistent visiblity among different database instances.<br />
* A root transact has to propagate this GXID to its children. When a child is invoking its child, then it also has to propagate this GXID, which should be associated to the local transaction of the child.<br />
* When a root transaction (and all its children) finishes with final COMMIT, it has to report GTP that the transaction finished. GTP records this is finished. Maintenance of GXID information at GTP will be given below.<br />
<br />
=== Global snapshot and its handling in local transactions. ===<br />
<br />
* The root has to ask for snapshot of running global transactions to GTP. GTP collects GXID of running global transactions and provides this set of GXIDs as a global snapshot.<br />
* Root has to propagate this global snapshot to its children and so forth. This is associated with the local transaction of children.<br />
* In visibility check, if a local XID (in xmin/xmax and others) is associated with GXID (we may need a flag in each row to indicate this), such transaction visibility should be determined based on its GXID appearing in the snapshot (note that in global snapshot, both terminated GXID and runing GXID appears, depending upon its syntax). Details will be given below.<br />
* Visibility of local transactions not associated with GXID can be handled in the same manner as current usual local transactions.<br />
* For a local transaction without GXID, there are no global snapshot and no GXID is used for visibility check even target row's xid is associated with GXID. Thus, for the local transactions, there will be no additional overhead.<br />
<br />
=== Global snapshot data format ===<br />
<br />
Global snapshot data can be represented in the following format:<br />
<br />
<code><br />
(GXID_0_0, ..., GXID_0_M) GXMIN (GXID_1_0, ..., GXID_1_N) GXMAX<br />
</code><br />
<br />
<code>GXMIN</code> means global tansaction whose gxid is smaller than this value are all terminated (commited/aborted), unless appearing in the list <code>(GXID_0_0, ... GXID_0_M)</code>.<br />
This is used to save the amount of snapshot especially when only very few number of global transactions are alive for error ercovery (typically local commit failure).<br />
<br />
<code>GXMAX</code> means this is the maximum value of currently assigned gxid and global transactions with gxid larger than GXMAX has to be regarded as running.<br />
<br />
<code>(GXID_1_0, .... , GXOD_1_N)</code> represents set of GXIDs whose gxid is between <code>GXMIN</code> and <code>GXMAX</codee and has already terminated. The reason for this is we cannot guarantee the transaction is running when assocaited database instance is down after the transaction is reported to begin and before it reports to be terminated.<br />
<br />
From the above, there are multiple snapshot representation possible for a given situation. GPT should provide the minumum size snapshot by calculating optimal value for <code>GXMIN</code> and <code>GXMAX</code>.<br />
<br />
=== Global snapshot maintenance in GTP ===<br />
<br />
GXID in GTP cannot be removed in GTP immediately because we have a chance that other global transaction depends upon this.<br />
<br />
When a transaction terminates, its GXID is marked as finished and is associated with set of running GXIDs as referring GXIDs. Each of such GXID can be removed from the list when it terminates. When referring GXID becomes zero, then such GXID information can be removed from GTP.<br />
<br />
When a database instance gets down and restarts, all the associated transaction are automatically aborted. In this case, each database instance should report this to GTP and GTP can clean up its global transaction information if dbid matches to such database instance.<br />
<br />
Please note that such GXID will not appear in the snapshot requested after it ternates. This can be used to calculate GXMIN in the snapshot to minimize the amount of global snapshot.<br />
<br />
=== Database instance beloging to more than one database cluster ===<br />
<br />
We can allow a database instance can belong to more than one cluster. In this case, we need to extend the above with cluster-id. Any global transaction need to determine which cluster it belongs to and GXID and global snapshot request to appropriate GTP. In this case, GXID can be represented as (cluster-id, dbid, gxid).</div>Koichihttps://wiki.postgresql.org/index.php?title=Distributed_transaction_visibility&diff=37339Distributed transaction visibility2022-11-18T02:20:33Z<p>Koichi: </p>
<hr />
<div><br />
----<br />
[[File:MyPicture_github.jpg]]<br />
<br />
[[Koichi]]<br />
<br />
== Global Visibility for distributed R/W workload amount PostgreSQL database cluster ==<br />
<br />
=== Goal of the project ===<br />
<br />
* Provide consistent visibility to distributed R/W transactions spanning into more than one postgres database instance.<br />
* No impact to local transactions.<br />
<br />
=== Problem to solve ===<br />
<br />
We have two-phase commit protocol to maitain write consistency of groups of transactions as a distributed transaction.<br />
<br />
In this case, final "commit" timestamp is different in database instances.<br />
<br />
Because of this, result of write can be visible in some of involved instances but invisible in others.<br />
This is known as distributed read anomaly.<br />
<br />
We need to solve this read anomaly problem.<br />
<br />
=== Outline of the solution ===<br />
<br />
* We define global transaction identifier unique to each distributed transactions and same among transactions begonging to a same global transaction.<br />
* Use global snapshot for global transactions in row visibility check.<br />
<br />
=== Definition and structure of a distrubuted transaction ===<br />
<br />
* Any distributed transaction has one "root" transaction which starts the transaction.<br />
* Other transactions in a distributed transaction must be invoked by the "root" transaction or a "child transaction", which is invoked directly or indirectly by the root transaction.<br />
* "'''root'''" transaction is given a global transaction id ('''GXID''') as <code>(dbid, gxid)</code>, where <code>dbid</code> is identifier of the database root transaction is running and gxid is unique number given in ascending order by global transaction proxy as described below.<br />
Typically, database system identifier given by initdb can be used for this purpose. This depends upon implementation and we can choose any other means as long as dbid is unique. In the case of database system identifier, this is copied by cloning the database by `pg_basebackup` and streaming replication, for example. We may need a means to assign different database system identifier if such cloned database should be used as separate database instance.<br />
* When "root" transaction and its "'''child'''" transaction invokes child transactions, such "parent" transactions has to propagate its global transaction Id to their children. Locally, this global transaction ID has to be associated with the local transction.<br />
<br />
=== Assignment of global transaction ID ===<br />
<br />
* We need separate facility to manage global transaction information in a given postgreSQL cluster, called "'''global transaction proxy'''" ('''GTP''').<br />
* When a transaction wants to be the root transaction and would like to provide and receive consistent visibility need to connect to GPT and request for new GXID. GTP assignes gxid and record this with root transaction's dbid as <code>(dbid, gxid)</code> pair. Please note that this is needed not only for write transaction but for read transaction which need consistent visiblity among different database instances.<br />
* A root transact has to propagate this GXID to its children. When a child is invoking its child, then it also has to propagate this GXID, which should be associated to the local transaction of the child.<br />
* When a root transaction (and all its children) finishes with final COMMIT, it has to report GTP that the transaction finished. GTP records this is finished. Maintenance of GXID information at GTP will be given below.<br />
<br />
=== Global snapshot and its handling in local transactions. ===<br />
<br />
* The root has to ask for snapshot of running global transactions to GTP. GTP collects GXID of running global transactions and provides this set of GXIDs as a global snapshot.<br />
* Root has to propagate this global snapshot to its children and so forth. This is associated with the local transaction of children.<br />
* In visibility check, if a local XID (in xmin/xmax and others) is associated with GXID (we may need a flag in each row to indicate this), such transaction visibility should be determined based on its GXID appearing in the snapshot (note that in global snapshot, both terminated GXID and runing GXID appears, depending upon its syntax). Details will be given below.<br />
* Visibility of local transactions not associated with GXID can be handled in the same manner as current usual local transactions.<br />
* For a local transaction without GXID, there are no global snapshot and no GXID is used for visibility check even target row's xid is associated with GXID. Thus, for the local transactions, there will be no additional overhead.<br />
<br />
=== Global snapshot data format ===<br />
<br />
Global snapshot data can be represented in the following format:<br />
<br />
<code><br />
(GXID_0_0, ..., GXID_0_M) GXMIN (GXID_1_0, ..., GXID_1_N) GXMAX<br />
</code><br />
<br />
<code>GXMIN</code> means global tansaction whose gxid is smaller than this value are all terminated (commited/aborted), unless appearing in the list <code>(GXID_0_0, ... GXID_0_M)</code>.<br />
This is used to save the amount of snapshot especially when only very few number of global transactions are alive for error ercovery (typically local commit failure).<br />
<br />
<code>GXMAX</code> means this is the maximum value of currently assigned gxid and global transactions with gxid larger than GXMAX has to be regarded as running.<br />
<br />
<code>(GXID_1_0, .... , GXOD_1_N)</code> represents set of GXIDs whose gxid is between <code>GXMIN</code> and <code>GXMAX</codee and has already terminated. The reason for this is we cannot guarantee the transaction is running when assocaited database instance is down after the transaction is reported to begin and before it reports to be terminated.<br />
<br />
From the above, there are multiple snapshot representation possible for a given situation. GPT should provide the minumum size snapshot by calculating optimal value for <code>GXMIN</code> and <code>GXMAX</code>.<br />
<br />
== Global snapshot maintenance in GTP ===<br />
<br />
GXID in GTP cannot be removed in GTP immediately because we have a chance that other global transaction depends upon this.<br />
<br />
When a transaction terminates, its GXID is marked as finished and is associated with set of running GXIDs as referring GXIDs. Each of such GXID can be removed from the list when it terminates. When referring GXID becomes zero, then such GXID information can be removed from GTP.<br />
<br />
When a database instance gets down and restarts, all the associated transaction are automatically aborted. In this case, each database instance should report this to GTP and GTP can clean up its global transaction information if dbid matches to such database instance.<br />
<br />
Please note that such GXID will not appear in the snapshot requested after it ternates. This can be used to calculate GXMIN in the snapshot to minimize the amount of global snapshot.<br />
<br />
=== Database instance beloging to more than one database cluster ===<br />
<br />
We can allow a database instance can belong to more than one cluster. In this case, we need to extend the above with cluster-id. Any global transaction need to determine which cluster it belongs to and GXID and global snapshot request to appropriate GTP. In this case, GXID can be represented as (cluster-id, dbid, gxid).</div>Koichihttps://wiki.postgresql.org/index.php?title=Distributed_transaction_visibility&diff=37338Distributed transaction visibility2022-11-18T02:05:36Z<p>Koichi: </p>
<hr />
<div>[[File:MyPicture_github.jpg]]<br />
<br />
[[Koichi]]<br />
<br />
== Consistent visibility for distributed transactions ==<br />
<br />
# Global Visibility for distributed R/W workload amount PostgreSQL database cluster.<br />
<br />
## Goal of the project<br />
<br />
* Provide consistent visibility to distributed R/W transactions spanning into more than one postgres database instance.<br />
* No impact to local transactions.<br />
<br />
## Problem to solve<br />
<br />
We have two-phase commit protocol to maitain write consistency of groups of transactions as a distributed transaction.<br />
<br />
In this case, final "commit" timestamp is different in database instances.<br />
<br />
Because of this, result of write can be visible in some of involved instances but invisible in others.<br />
This is known as distributed read anomaly.<br />
<br />
We need to solve this read anomaly problem.<br />
<br />
## Outline of the solution<br />
<br />
* We define global transaction identifier unique to each distributed transactions and same among transactions begonging to a same global transaction.<br />
* Use global snapshot for global transactions in row visibility check.<br />
<br />
## Definition and structure of a distrubuted transaction<br />
<br />
* Any distributed transaction has one "root" transaction which starts the transaction.<br />
* Other transactions in a distributed transaction must be invoked by the "root" transaction or a "child transaction", which is invoked directly or indirectly by the root transaction.<br />
* "root" transaction is given a global transaction id (GXID) as (dbid, gxid), where dbid is identifier of the database root transaction is running and gxid is unique number given in ascending order by global transaction proxy as described below.<br />
Typically, database system identifier given by initdb can be used for this purpose. This depends upon implementation and we can choose any other means as long as dbid is unique. In the case of database system identifier, this is copied by cloning the database by `pg_basebackup` and streaming replication, for example. We may need a means to assign different database system identifier if such cloned database should be used as separate database instance.<br />
* When "root" transaction and its "child" transaction invokes child transactions, such "parent" transactions has to propagate its global transaction Id to their children. Locally, this global transaction ID has to be associated with the local transction.<br />
<br />
## Assignment of global transaction ID<br />
<br />
* We need separate facility to manage global transaction information in a given postgreSQL cluster, called "global transaction proxy" (GTP).<br />
* When a transaction wants to be the root transaction and would like to provide and receive consistent visibility need to connect to GPT and request for new GXID. GTP assignes gxid and record this with root transaction's dbid as (dbid, gxid) pair. Please note that this is needed not only for write transaction but for read transaction which need consistent visiblity among different database instances.<br />
* A root transact has to propagate this GXID to its children. When a child is invoking its child, then it also has to propagate this GXID, which should be associated to the local transaction of the child.<br />
* When a root transaction (and all its children) finishes with final COMMIT, it has to report GTP that the transaction finished. GTP records this is finished. Maintenance of GXID information at GTP will be given below.<br />
<br />
## Global snapshot and its handling in local transactions.<br />
<br />
* The root has to ask for snapshot of running global transactions to GTP. GTP collects GXID of running global transactions and provides this set of GXIDs as a global snapshot.<br />
* Root has to propagate this global snapshot to its children and so forth. This is associated with the local transaction of children.<br />
* In visibility check, if a local XID (in xmin/xmax and others) is associated with GXID (we may need a flag in each row to indicate this), such transaction visibility should be determined based on its GXID appearing in the snapshot (note that in global snapshot, both terminated GXID and runing GXID appears, depending upon its syntax). Details will be given below.<br />
* Visibility of local transactions not associated with GXID can be handled in the same manner as current usual local transactions.<br />
* For a local transaction without GXID, there are no global snapshot and no GXID is used for visibility check even target row's xid is associated with GXID. Thus, for the local transactions, there will be no additional overhead.<br />
<br />
## Global snapshot data format<br />
<br />
Global snapshot data can be represented in the following format:<br />
<br />
```<br />
(GXID_0_0, ..., GXID_0_M) GXMIN (GXID_1_0, ..., GXID_1_N) GXMAX<br />
```<br />
<br />
`GXMIN` means global tansaction whose gxid is smaller than this value are all terminated (commited/aborted), unless appearing in the list `(GXID_0_0, ... GXID_0_M)`.<br />
This is used to save the amount of snapshot especially when only very few number of global transactions are alive for error ercovery (typically local commit failure).<br />
<br />
`GXMAX` means this is the maximum value of currently assigned gxid and global transactions with gxid larger than GXMAX has to be regarded as running.<br />
<br />
`(GXID_1_0, .... , GXOD_1_N)` represents set of GXIDs whose gxid is between `GXMIN` and `GXMAX` and has already terminated. The reason for this is we cannot guarantee the transaction is running when assocaited database instance is down after the transaction is reported to begin and before it reports to be terminated.<br />
<br />
From the above, there are multiple snapshot representation possible for a given situation. GPT should provide the minumum size snapshot by calculating optimal value for `GXMIN` and `GXMAX`.<br />
<br />
## Global snapshot maintenance in GTP<br />
<br />
GXID in GTP cannot be removed in GTP immediately because we have a chance that other global transaction depends upon this.<br />
<br />
When a transaction terminates, its GXID is marked as finished and is associated with set of running GXIDs as referring GXIDs. Each of such GXID can be removed from the list when it terminates. When referring GXID becomes zero, then such GXID information can be removed from GTP.<br />
<br />
When a database instance gets down and restarts, all the associated transaction are automatically aborted. In this case, each database instance should report this to GTP and GTP can clean up its global transaction information if dbid matches to such database instance.<br />
<br />
Please note that such GXID will not appear in the snapshot requested after it ternates. This can be used to calculate GXMIN in the snapshot to minimize the amount of global snapshot.<br />
<br />
## Database instance beloging to more than one database cluster<br />
<br />
We can allow a database instance can belong to more than one cluster. In this case, we need to extend the above with cluster-id. Any global transaction need to determine which cluster it belongs to and GXID and global snapshot request to appropriate GTP. In this case, GXID can be represented as (cluster-id, dbid, gxid).</div>Koichihttps://wiki.postgresql.org/index.php?title=Parallel_Recovery&diff=36930Parallel Recovery2022-05-07T04:20:15Z<p>Koichi: /* Can replay be done in paralle ? */</p>
<hr />
<div>[[File:MyPicture_github.jpg]]<br />
<br />
My Wiki home: [[Koichi]]<br />
<br />
email: koichi.dbms_at_gmail.com<br />
<br />
Other topics: [[Distributed deadlock detection]]<br />
== '''Parallel Recovery''' ==<br />
<br><br />
<br />
=== PostgreSQL redo log (WAL) outline ===<br />
<br />
In PostgreSQL, WAL (write-ahead log) is used for recovery and log-shipping replication. Outline is described in [https://www.postgresql.org/docs/current/wal-intro.html this documentation page]. The outline of WAL is:<br />
<br />
* Every updating backend process must write its redo log to WAL before it really updates data files.<br />
* Each WAL record (piece of updating data record) has to reach stable storage before the data is synch'ed to the storage.<br />
<br />
WAL is used in recovery and log-shipping replication. In recovery, the WAL is used as follows:<br />
<br />
* Each WAL record is read from WAL segments in the written order,<br />
* Read WAL record is applied to corresponding data block. This is called "redo" or "replay".<br />
* To maintain redo consistency, redo was done in single thread.<br />
<br />
In the case of log-shipping replication, each WAL record is:<br />
<br />
* Transfered to each replica in the written order and<br />
* Replayed at replica in sngle thread just in the same way of recovery<br />
* When replay reaches some consistent state, it allows postmaster to begin read transaction, while following WAL records reach and are replayed.<br />
<br><br />
<br />
----<br />
<br />
=== Issues around serialized replay ===<br />
<br />
At present, WAL records are replayed strictly in written order in a single thread, while they are generated by multiple backend processes in parallel. The issues are:<br />
<br />
* Recovery/replay performance is much worse than performance to write WALs. Recovery can take longer than the period WAL records were written.<br />
* Replication lag can be long when the master has heavy updating workloads.<br />
<br><br />
<br />
----<br />
<br />
=== Can replay be done in paralle ? ===<br />
<br />
If we apply several rules about recovery order, yes. The rules are:<br />
<br />
* For a given data block, WAL records must be applied in the written order.<br />
* For a given transaction, when we apply WAL record terminating the transaction (commit/abort/prepare for commit), all the associated WAL records must have been replayed,<br />
* In the case of multi-block WAL, it should be applied only by one worker. For this purpose, such WAL record is assigned to all the block workers responsible for each block involved. When block worker picks up multi-block WAL and if it is not the last thread, it should wait until the last thread picking up this record actually replay it. If a thread is the last to pick up multi-block WAL record, it replays the record and send sync to other waiging workers.<br />
* To replay specific WAL record such as updating timeline, it should wait until all the preceding WAL records are replayed This may not be necessary but it looks safer at present to have such condition for certain set of WAL records.<br />
* Replayed LSN at pg_controldata will be updated when all the preceding WAL records are replayed.<br />
<br><br />
<br />
----<br />
<br />
=== Implementation Note ===<br />
<br><br />
==== Worker configuration ====<br />
<br />
* Because current recovery is done in startup process, parallel replay threads are implemented as its child process.<br />
* Now we have following processes for this purpose:<br />
** Starup process: read WAL record from WAL segment/walsender and perform overall control,<br />
** Dispatching worker: analyze and dispatch WAL records,<br />
** Error page registration: collects error pages from other worker. Current implementation is using hash functions based on memory context, which is not simple to port it to shared memory. We can replace this with simpler configuration if we can manage such information using shared memory.<br />
** Transaction worker: replays WAL record which does not have any block information to update.<br />
** Block worker: replays block information (mostly HEAP/HEAP2).<br />
<br />
We can have more than one block workers. Other workers consists of single process each.<br />
<br><br />
==== Shared information and locks ====<br />
<br />
To pass and share information such as WAL record and status, shared memory defined in <code>storage/dsm.h</code> is used.<br />
<br />
To protest these data, spin lock defined in <code>storage/spin.h</code> is used.<br />
<br />
<br><br />
==== Synchronization among workers ====<br />
<br />
For light-weight synchronization among workers, unix-domain udp socket is used. This can be replaced with <code>stoage/latch.h</code>.<br />
<br><br />
==== New GUC ====<br />
<br />
Following GUC parameters are added:<br />
<br />
* <code>parallel_replay</code> (bool)<br />
* <code>parallel_replay_test</code> (bool, just internal) enable additional code to work to attach workers with gdb for test.<br />
* <code>num_preplay_workers</code> (int) number of total replay workers.<br />
* <code>num_preplay_queue</code> (int) number of queues holding outstanding WAL records,<br />
* <code>num_preplay_max_txn</code> (int) number of max outstanding transaction allowed in recovery. Max_connections or larger. In the case of replication slave, must be master's max_connections or larger.<br />
* <code>preplay_buffers</code> (int)<br />
<br><br />
<br />
----<br />
<br />
=== Current code ===<br />
<br />
Current code is available from [https://github.com/koichi-szk/postgres.git Koichi's GITHub repository]. Branch is <code>parallel_replay_14_2</code>.<br />
<br />
<br><br />
<br />
----<br />
<br />
=== Current code status ===<br />
<br />
No test yet. The code just pass the build and under the review before testing. Now a couple of functions need improvement/development and some others are influenced by this change.<br />
<br />
'''Any cooperation for discussion/test/development is very very welcome.''' You can contact to koichi.dbms_at_gmail.com.<br />
<br />
Please understand that the repo can be incomplete due to occasional work.<br />
<br />
Further addition is:<br />
<br />
* To add a code to have WAL record in text format (like pg_waldump) and hold it to some portion of the memory to help the test. <code>xlog_outrec()</code> looks very useful for this purpose. At present, this is enabled with <code>WAL_DEBUG</code> flag.</div>Koichihttps://wiki.postgresql.org/index.php?title=Parallel_Recovery&diff=36929Parallel Recovery2022-05-07T04:18:58Z<p>Koichi: /* Can replay be done in paralle ? */</p>
<hr />
<div>[[File:MyPicture_github.jpg]]<br />
<br />
My Wiki home: [[Koichi]]<br />
<br />
email: koichi.dbms_at_gmail.com<br />
<br />
Other topics: [[Distributed deadlock detection]]<br />
== '''Parallel Recovery''' ==<br />
<br><br />
<br />
=== PostgreSQL redo log (WAL) outline ===<br />
<br />
In PostgreSQL, WAL (write-ahead log) is used for recovery and log-shipping replication. Outline is described in [https://www.postgresql.org/docs/current/wal-intro.html this documentation page]. The outline of WAL is:<br />
<br />
* Every updating backend process must write its redo log to WAL before it really updates data files.<br />
* Each WAL record (piece of updating data record) has to reach stable storage before the data is synch'ed to the storage.<br />
<br />
WAL is used in recovery and log-shipping replication. In recovery, the WAL is used as follows:<br />
<br />
* Each WAL record is read from WAL segments in the written order,<br />
* Read WAL record is applied to corresponding data block. This is called "redo" or "replay".<br />
* To maintain redo consistency, redo was done in single thread.<br />
<br />
In the case of log-shipping replication, each WAL record is:<br />
<br />
* Transfered to each replica in the written order and<br />
* Replayed at replica in sngle thread just in the same way of recovery<br />
* When replay reaches some consistent state, it allows postmaster to begin read transaction, while following WAL records reach and are replayed.<br />
<br><br />
<br />
----<br />
<br />
=== Issues around serialized replay ===<br />
<br />
At present, WAL records are replayed strictly in written order in a single thread, while they are generated by multiple backend processes in parallel. The issues are:<br />
<br />
* Recovery/replay performance is much worse than performance to write WALs. Recovery can take longer than the period WAL records were written.<br />
* Replication lag can be long when the master has heavy updating workloads.<br />
<br><br />
<br />
----<br />
<br />
=== Can replay be done in paralle ? ===<br />
<br />
If we apply several rules about recovery order, yes. The rules are:<br />
<br />
* For a given data block, WAL records must be applied in the written order.<br />
* For a given transaction, when we apply WAL record terminating the transaction (commit/abort/prepare for commit), all the associated WAL records must have been replayed,<br />
* In the case of multi-block WAL, it should be applyed by only one thread. For this purpose, such WAL record is assigned to all the block workers responsible for each block involved. When block worker picks up multi-block WAL and if it is not the last thread, it should wait until the last thread picking up this record actually replay it. If a thread is the last to pick up multi-block WAL record, it replays the record and send sync to other waiging workers.<br />
* To replay specific WAL record such as updating timeline, it should wait until all the preceding WAL records are replayed This may not be necessary but it looks safer at present to have such condition for certain set of WAL records.<br />
* Replayed LSN at pg_controldata will be updated when all the preceding WAL records are replayed.<br />
<br><br />
<br />
----<br />
<br />
=== Implementation Note ===<br />
<br><br />
==== Worker configuration ====<br />
<br />
* Because current recovery is done in startup process, parallel replay threads are implemented as its child process.<br />
* Now we have following processes for this purpose:<br />
** Starup process: read WAL record from WAL segment/walsender and perform overall control,<br />
** Dispatching worker: analyze and dispatch WAL records,<br />
** Error page registration: collects error pages from other worker. Current implementation is using hash functions based on memory context, which is not simple to port it to shared memory. We can replace this with simpler configuration if we can manage such information using shared memory.<br />
** Transaction worker: replays WAL record which does not have any block information to update.<br />
** Block worker: replays block information (mostly HEAP/HEAP2).<br />
<br />
We can have more than one block workers. Other workers consists of single process each.<br />
<br><br />
==== Shared information and locks ====<br />
<br />
To pass and share information such as WAL record and status, shared memory defined in <code>storage/dsm.h</code> is used.<br />
<br />
To protest these data, spin lock defined in <code>storage/spin.h</code> is used.<br />
<br />
<br><br />
==== Synchronization among workers ====<br />
<br />
For light-weight synchronization among workers, unix-domain udp socket is used. This can be replaced with <code>stoage/latch.h</code>.<br />
<br><br />
==== New GUC ====<br />
<br />
Following GUC parameters are added:<br />
<br />
* <code>parallel_replay</code> (bool)<br />
* <code>parallel_replay_test</code> (bool, just internal) enable additional code to work to attach workers with gdb for test.<br />
* <code>num_preplay_workers</code> (int) number of total replay workers.<br />
* <code>num_preplay_queue</code> (int) number of queues holding outstanding WAL records,<br />
* <code>num_preplay_max_txn</code> (int) number of max outstanding transaction allowed in recovery. Max_connections or larger. In the case of replication slave, must be master's max_connections or larger.<br />
* <code>preplay_buffers</code> (int)<br />
<br><br />
<br />
----<br />
<br />
=== Current code ===<br />
<br />
Current code is available from [https://github.com/koichi-szk/postgres.git Koichi's GITHub repository]. Branch is <code>parallel_replay_14_2</code>.<br />
<br />
<br><br />
<br />
----<br />
<br />
=== Current code status ===<br />
<br />
No test yet. The code just pass the build and under the review before testing. Now a couple of functions need improvement/development and some others are influenced by this change.<br />
<br />
'''Any cooperation for discussion/test/development is very very welcome.''' You can contact to koichi.dbms_at_gmail.com.<br />
<br />
Please understand that the repo can be incomplete due to occasional work.<br />
<br />
Further addition is:<br />
<br />
* To add a code to have WAL record in text format (like pg_waldump) and hold it to some portion of the memory to help the test. <code>xlog_outrec()</code> looks very useful for this purpose. At present, this is enabled with <code>WAL_DEBUG</code> flag.</div>Koichihttps://wiki.postgresql.org/index.php?title=Parallel_Recovery&diff=36927Parallel Recovery2022-05-06T09:16:29Z<p>Koichi: /* Current code status */</p>
<hr />
<div>[[File:MyPicture_github.jpg]]<br />
<br />
My Wiki home: [[Koichi]]<br />
<br />
email: koichi.dbms_at_gmail.com<br />
<br />
Other topics: [[Distributed deadlock detection]]<br />
== '''Parallel Recovery''' ==<br />
<br><br />
<br />
=== PostgreSQL redo log (WAL) outline ===<br />
<br />
In PostgreSQL, WAL (write-ahead log) is used for recovery and log-shipping replication. Outline is described in [https://www.postgresql.org/docs/current/wal-intro.html this documentation page]. The outline of WAL is:<br />
<br />
* Every updating backend process must write its redo log to WAL before it really updates data files.<br />
* Each WAL record (piece of updating data record) has to reach stable storage before the data is synch'ed to the storage.<br />
<br />
WAL is used in recovery and log-shipping replication. In recovery, the WAL is used as follows:<br />
<br />
* Each WAL record is read from WAL segments in the written order,<br />
* Read WAL record is applied to corresponding data block. This is called "redo" or "replay".<br />
* To maintain redo consistency, redo was done in single thread.<br />
<br />
In the case of log-shipping replication, each WAL record is:<br />
<br />
* Transfered to each replica in the written order and<br />
* Replayed at replica in sngle thread just in the same way of recovery<br />
* When replay reaches some consistent state, it allows postmaster to begin read transaction, while following WAL records reach and are replayed.<br />
<br><br />
<br />
----<br />
<br />
=== Issues around serialized replay ===<br />
<br />
At present, WAL records are replayed strictly in written order in a single thread, while they are generated by multiple backend processes in parallel. The issues are:<br />
<br />
* Recovery/replay performance is much worse than performance to write WALs. Recovery can take longer than the period WAL records were written.<br />
* Replication lag can be long when the master has heavy updating workloads.<br />
<br><br />
<br />
----<br />
<br />
=== Can replay be done in paralle ? ===<br />
<br />
If we apply several rules about recovery order, yes. The rules are:<br />
<br />
* For a given data block, WAL records must be applied in the order they were written.<br />
* For a given transaction, before the WAL record terminating the transaction (commit/abort/prepare) is applied, all WAL records belonging to that transaction must already have been replayed.<br />
* A multi-block WAL record should be replayed by only one worker. For this purpose, such a record is assigned to all the block workers responsible for the blocks involved. When a block worker picks up a multi-block record and is not the last worker to do so, it waits; the last worker to pick up the record replays it and then sends a sync message to the other waiting workers (see the sketch after this list).<br />
* Certain WAL records, such as a timeline update, are replayed only after all preceding WAL records have been replayed. This may not be strictly necessary, but at present it looks safer to impose this condition on such records.<br />
* The replayed LSN reported by pg_controldata is advanced only after all preceding WAL records have been replayed.<br />
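<br />
The multi-block rule above can be implemented with a per-record counter in shared memory. The following is a minimal sketch under that assumption; apart from the spinlock API, <code>RmgrTable</code>, and <code>rm_redo</code>, all names are hypothetical and the actual repository code may differ:<br />
<pre>
#include "postgres.h"
#include "access/xlog_internal.h"   /* RmgrTable */
#include "access/xlogreader.h"      /* XLogReaderState, XLogRecGetRmid() */
#include "storage/spin.h"

/* Hypothetical shared-memory state attached to one multi-block record. */
typedef struct MultiBlockSync
{
    slock_t     mutex;      /* protects "remaining" */
    int         remaining;  /* block workers that have not picked it up yet */
} MultiBlockSync;

/* Hypothetical wait/notify primitives (e.g., over the sockets below). */
static void wake_waiting_workers(MultiBlockSync *sync);
static void wait_until_replayed(MultiBlockSync *sync);

/* Called by each block worker when it dequeues a multi-block record. */
static void
replay_multi_block(MultiBlockSync *sync, XLogReaderState *record)
{
    bool        i_am_last;

    SpinLockAcquire(&sync->mutex);
    i_am_last = (--sync->remaining == 0);
    SpinLockRelease(&sync->mutex);

    if (i_am_last)
    {
        /* The last worker to pick the record up replays it, once. */
        RmgrTable[XLogRecGetRmid(record)].rm_redo(record);
        wake_waiting_workers(sync);     /* sync message to waiters */
    }
    else
        wait_until_replayed(sync);      /* block until the last worker is done */
}
</pre>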
<br><br />
<br />
----<br />
<br />
=== Implementation Note ===<br />
<br><br />
==== Worker configuration ====<br />
<br />
* Because current recovery is done in the startup process, the parallel replay workers are implemented as its child processes.<br />
* We now have the following processes for this purpose:<br />
** Startup process: reads WAL records from the WAL segments or the walsender and performs overall control,<br />
** Dispatching worker: analyzes and dispatches WAL records,<br />
** Error page registration worker: collects error pages from the other workers. The current implementation uses hash tables based on memory contexts, which are not simple to port to shared memory. We can replace this with a simpler configuration once such information is managed in shared memory.<br />
** Transaction worker: replays WAL records that carry no block information to update.<br />
** Block worker: replays block updates (mostly HEAP/HEAP2).<br />
<br />
We can have more than one block worker; each of the other workers is a single process. A routing rule that preserves per-block ordering is sketched below.<br />
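<br />
To make the per-block ordering rule hold, the dispatching worker must route every record touching a given block to the same block worker. A minimal sketch of such a routing rule, with hypothetical names (the actual repository code may differ):<br />
<pre>
#include "postgres.h"
#include "common/relpath.h"         /* ForkNumber */
#include "storage/block.h"          /* BlockNumber */
#include "storage/relfilenode.h"    /* RelFileNode */

/* Route a block to a fixed block worker so per-block order is preserved. */
static int
block_worker_for(RelFileNode rnode, ForkNumber forknum,
                 BlockNumber blkno, int num_block_workers)
{
    uint32      h;

    /*
     * Any stable hash works here; what matters is that the same block
     * always lands on the same worker's queue.
     */
    h = rnode.relNode ^ (rnode.dbNode << 7) ^
        ((uint32) forknum << 13) ^ blkno;
    return (int) (h % num_block_workers);
}
</pre>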
<br><br />
==== Shared information and locks ====<br />
<br />
To pass and share information such as WAL records and worker status, dynamic shared memory defined in <code>storage/dsm.h</code> is used.<br />
<br />
To protect these data, spinlocks defined in <code>storage/spin.h</code> are used, as in the sketch below.<br />
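<br />
For example, a shared status area could be created and protected as follows. This is a minimal sketch using the real <code>dsm.h</code> and <code>spin.h</code> APIs; the struct layout itself is an assumption, not the repository's actual one:<br />
<pre>
#include "postgres.h"
#include "access/xlogdefs.h"    /* XLogRecPtr, InvalidXLogRecPtr */
#include "storage/dsm.h"
#include "storage/spin.h"

/* Hypothetical shared state for the replay workers. */
typedef struct ReplayShared
{
    slock_t     mutex;          /* protects the fields below */
    XLogRecPtr  replayed_lsn;   /* everything up to here is replayed */
    int         active_workers;
} ReplayShared;

static ReplayShared *
create_replay_shared(dsm_segment **segp)
{
    dsm_segment  *seg = dsm_create(sizeof(ReplayShared), 0);
    ReplayShared *shared = (ReplayShared *) dsm_segment_address(seg);

    SpinLockInit(&shared->mutex);
    shared->replayed_lsn = InvalidXLogRecPtr;
    shared->active_workers = 0;

    *segp = seg;
    return shared;
}
</pre>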
<br />
<br><br />
==== Synchronization among workers ====<br />
<br />
For lightweight synchronization among workers, unix-domain datagram (UDP-style) sockets are used, as sketched below. This could later be replaced with latches from <code>storage/latch.h</code>.<br />
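<br />
A unix-domain datagram wakeup can look like the following plain POSIX sketch; the socket path and one-byte message format are assumptions, and error handling is omitted for brevity:<br />
<pre>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Each worker binds its own datagram socket under the data directory. */
static int
open_worker_socket(const char *path)
{
    struct sockaddr_un addr;
    int         fd = socket(AF_UNIX, SOCK_DGRAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);               /* remove a stale socket file, if any */
    bind(fd, (struct sockaddr *) &addr, sizeof(addr));
    return fd;
}

/*
 * Waking a peer is then a one-byte datagram:
 *     sendto(fd, "x", 1, 0, (struct sockaddr *) &peer, sizeof(peer));
 * and the peer blocks in recvfrom() until a message arrives.
 */
</pre>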
<br><br />
==== New GUC ====<br />
<br />
The following GUC parameters are added (an example configuration is sketched after the list):<br />
<br />
* <code>parallel_replay</code> (bool): enables parallel replay.<br />
* <code>parallel_replay_test</code> (bool, internal only): enables additional test code so that workers can be attached with gdb.<br />
* <code>num_preplay_workers</code> (int): total number of replay workers.<br />
* <code>num_preplay_queue</code> (int): number of queues holding outstanding WAL records.<br />
* <code>num_preplay_max_txn</code> (int): maximum number of outstanding transactions allowed during recovery; must be <code>max_connections</code> or larger, and on a replication standby must be the master's <code>max_connections</code> or larger.<br />
* <code>preplay_buffers</code> (int)<br />
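<br />
For example, a server using parallel replay might set the new parameters as follows; the values are purely illustrative assumptions, not recommendations:<br />
<pre>
# postgresql.conf -- illustrative values only
parallel_replay     = on
num_preplay_workers = 8       # total replay workers, including non-block workers
num_preplay_queue   = 1024    # queues for outstanding WAL records
num_preplay_max_txn = 128     # >= max_connections (master's value on a standby)
</pre>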
<br><br />
<br />
----<br />
<br />
=== Current code ===<br />
<br />
The current code is available from [https://github.com/koichi-szk/postgres.git Koichi's GitHub repository], on branch <code>parallel_replay_14_2</code>.<br />
<br />
<br><br />
<br />
----<br />
<br />
=== Current code status ===<br />
<br />
Not tested yet. The code just passes the build and is under review before testing. A couple of functions still need improvement/development, and some others are affected by this change.<br />
<br />
'''Any cooperation on discussion, testing, or development is very welcome.''' You can contact koichi.dbms_at_gmail.com.<br />
<br />
Please understand that the repository can be in an incomplete state, since work on it is occasional.<br />
<br />
A further planned addition:<br />
<br />
* Add code that renders each WAL record in text format (as <code>pg_waldump</code> does) and keeps it in a memory area to help testing. <code>xlog_outrec()</code> looks very useful for this purpose; at present it is enabled with the <code>WAL_DEBUG</code> flag, as in the sketch below.<br />
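<br />
For reference, under <code>WAL_DEBUG</code> the text form of a record is produced roughly as follows (simplified from the <code>WAL_DEBUG</code> code in <code>xlog.c</code>, where <code>xlog_outrec()</code> is a static helper):<br />
<pre>
#ifdef WAL_DEBUG
    StringInfoData buf;

    initStringInfo(&buf);
    appendStringInfo(&buf, "REDO @ %X/%X; LSN %X/%X: ",
                     LSN_FORMAT_ARGS(xlogreader->ReadRecPtr),
                     LSN_FORMAT_ARGS(xlogreader->EndRecPtr));
    xlog_outrec(&buf, xlogreader);  /* append the rmgr description */
    elog(LOG, "%s", buf.data);
    pfree(buf.data);
#endif
</pre>
</div>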
<hr />
<div>[https://wiki.postgresql.org/wiki/Koichi My main Wiki entry]<br />
<br />
email: koichi.dbms_at_gmail.com<br />
<br />
[[File:MyPicture_github.jpg]]<br />
<br />
== Koichi's Idea/Architecture/Implementation page ==<br />
<br />
* Deadlocks among distributed transactions: scenario, detection method, and implementation. [https://wiki.postgresql.org/wiki/distributed_deadlock_detection Visit here]<br />
* Parallel recovery for both a standalone PostgreSQL database and log-shipping replication (physical WAL send/receive): [[Parallel Recovery]]<br />
* Consistent visibility in distributed transactions: scenario, methodology, and implementation. [[Distributed transaction visibility]]</div>
<hr />
<div>[https://wiki.postgresql.org/wiki/Koichi My main Wiki entry]<br />
email: koichi.dbms_at_gmail.com<br />
<br />
[[File:MyPicture_github.jpg]]<br />
<br />
== Koichi's Idea/Architecture/Implementation page ==<br />
<br />
* Deadlocks among distributed transactions: scenario, detection method and implementation. [https://wiki.postgresql.org/wiki/distributed_deadlock_detection Visit Here]<br />
* Parallel recovery for both standalone PostgreSQL databases and log-shipping replication (physical WAL send/receive): [[Parallel Recovery]]<br />
* Consistent visibility in distributed transactions: scenario, methodology and implementation. [[Distributed transaction visibility]]</div>Koichihttps://wiki.postgresql.org/index.php?title=Koichi&diff=36914Koichi2022-05-06T07:22:58Z<p>Koichi: </p>
<hr />
<div>email: koichi.dbms_at_gmail.com<br />
<br />
[[File:MyPicture_github.jpg]]<br />
<br />
== Koichi's Idea/Architecture/Implementation page ==<br />
<br />
* Deadlocks among distributed transactions: scenario, detection method and implementation. [https://wiki.postgresql.org/wiki/distributed_deadlock_detection Visit Here]<br />
* Parallel recovery for both standalone PostgreSQL databases and log-shipping replication (physical WAL send/receive): [[Parallel Recovery]]<br />
* Consistent visibility in distributed transactions: scenario, methodology and implementation. [[Distributed transaction visibility]]</div>Koichihttps://wiki.postgresql.org/index.php?title=Distributed_transaction_visibility&diff=36910Distributed transaction visibility2022-05-06T05:56:30Z<p>Koichi: </p>
<hr />
<div>[[File:MyPicture_github.jpg]]<br />
<br />
[[Koichi]]<br />
<br />
== Consistent visibility for distributed transactions ==<br />
<br />
'''TBD'''</div>Koichi