The purpose of this page is to track the pending code items and open issues in zheap. We have also mentioned some of the points that need to be considered for integrating zheap with the pluggable storage API.

Pending Items

  • Currently, we have a fixed number of transaction slots on a page, so multiple transactions operating on the same pages can deadlock. In the simplest case, transaction T-1 acquires a slot on page P-1 and waits to acquire a slot on page P-2, while transaction T-2 has acquired a slot on P-2 and waits to acquire one on P-1; the two transactions then wait for each other indefinitely. We are planning to add a mechanism that allows the array of transaction slots to be continued on a separate overflow page (see the slot-array sketch after this list). We also need such a mechanism to support cases where a large number of transactions acquire SHARE or KEY SHARE locks on a single page.
  • Delete marking in indexes: This will allow in-place updates even when index columns are updated, and it will additionally let us avoid a dedicated vacuum process by performing retail deletes.
  • Free Space Map: For the current heap, vacuum takes care of updating the free space map. In zheap, each individual operation will optimistically update the free space map when it removes tuples from a page, in the hope that eventually most of the transactions will commit and the space will become available.
  • Alignment padding: We would like to eliminate most of the alignment padding in the tuple. Currently, we have a very crude implementation in the code that allows 1-byte, 4-byte, or 8-byte padding depending on a GUC variable. However, we want this to work without any GUC, such that padding is done only for columns that are varlenas with 4-byte headers or fixed-length pass-by-reference types (e.g. interval, box); see the layout example after this list.
  • Page-wise undo: To save locking and repeated writing of the same page, we want to collect all the undo records that belong to the same page and apply them together. Currently, we collect consecutive records that apply to the same page and apply them in one shot (see the grouping sketch after this list). This is fine when most of the changes to a heap page are performed together, but it doesn't help when the changes are randomly distributed across the undo of a transaction. So, we want an efficient mechanism to collect all the undo records that belong to a page.
  • Currently, to track which undo logs can be discarded, we use a very big array, DiscardXact, whose size is proportional to MaxBackends and which therefore consumes a lot of memory. We want to allocate memory for it in a better way, probably by doing some logical mapping as we have done in undolog.c.
  • It is unclear at this stage how visibility maps will work with zheap, but we have kept some visibility-map code in the zheap APIs in the hope that we will need it for index-only scans of indexes that don't support delete marking. We should get more clarity once we implement two-pass vacuum.
  • We have to perform additional buffer locking to allocate the tuple before zheap_lock_tuple, which could be avoided if we changed that API; however, we want to stay compatible with the existing heap_lock_tuple. During integration with the storage API, some work is needed to define a standard API so that we can avoid the additional allocation and locking.
  • RLS: We have not yet investigated whether any changes are required to make row-level security work with zheap. We expect that any changes required to support it will be some sort of tuple conversion work.
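
To make the transaction-slot item above concrete, here is a minimal layout sketch of a fixed per-page slot array extended with an overflow-page pointer. The names (TransSlotSketch, ZHeapPageOpaqueSketch, UndoRecPtrSketch) and the slot count are illustrative only, not the actual zheap definitions.

    #include <stdint.h>

    typedef uint32_t TransactionId;
    typedef uint64_t UndoRecPtrSketch;     /* stand-in for an undo record pointer */

    typedef struct TransSlotSketch
    {
        TransactionId    xid;              /* transaction occupying this slot */
        UndoRecPtrSketch urec_ptr;         /* latest undo record of that transaction */
    } TransSlotSketch;

    #define ZHEAP_PAGE_TRANS_SLOTS 4       /* fixed per-page slot count (illustrative) */

    typedef struct ZHeapPageOpaqueSketch
    {
        TransSlotSketch slots[ZHEAP_PAGE_TRANS_SLOTS];

        /*
         * Planned extension: when every slot is occupied, continue the slot
         * array on a separate overflow page instead of making the transaction
         * wait.  That removes the T-1/T-2 wait cycle described above and also
         * accommodates many SHARE / KEY SHARE lockers on one page.
         */
        uint32_t overflow_blkno;           /* invalid block number if no overflow page */
    } ZHeapPageOpaqueSketch;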
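
For the alignment-padding item, the following toy calculation shows what alignment padding costs for a (smallint, bigint) column pair on a typical 64-bit build; the numbers cover the data area only, ignore tuple headers, and are not taken from zheap code.

    #include <stdio.h>

    int
    main(void)
    {
        /* heap-style layout: int2, then pad to the int8's 8-byte alignment */
        int aligned = 2 + 6 + 8;        /* 16 bytes of data area */

        /* packed layout: the int8 is copied byte by byte, so no padding */
        int packed = 2 + 8;             /* 10 bytes of data area */

        printf("aligned: %d bytes, packed: %d bytes\n", aligned, packed);
        return 0;
    }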
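
The page-wise-undo item refers to the current strategy of applying runs of consecutive undo records that touch the same block. The sketch below is conceptual and uses hypothetical names (UndoRecordSketch, apply_undo_page_at_a_time); it is not the real undo-apply code.

    #include <stdint.h>

    typedef struct UndoRecordSketch
    {
        uint32_t blkno;                    /* heap block this record applies to */
        /* ... undo payload ... */
    } UndoRecordSketch;

    static void
    apply_undo_page_at_a_time(UndoRecordSketch *recs, int nrecs)
    {
        int i = 0;

        while (i < nrecs)
        {
            int start = i;

            /* Collect the run of consecutive records touching the same block. */
            while (i < nrecs && recs[i].blkno == recs[start].blkno)
                i++;

            /*
             * Lock the block once and apply recs[start .. i-1] together, so the
             * page is locked and written only once per run.  If a transaction's
             * undo is scattered across many blocks, each run is short and this
             * buys little -- hence the wish for a mechanism that gathers all of
             * a page's records regardless of their position in the undo log.
             */
        }
    }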

Open Issues

  • We are locking the undo buffers inside the critical section during InsertPreparedUndo. Per the locking protocol, this should be done outside the critical section: locking a buffer can raise an error, so we shouldn't do it inside a critical section (see the sketch after this list).
  • While replaying the WAL for transaction slots that got reused, we need to ensure that in hot-standby mode there are no running queries that can still see that transaction. The mechanism works in general, but we might need some handling for transaction-wraparound cases. See zheap_xlog_freeze_xact_slot.
  • We can generally rely on the contents of the page to regenerate the undo tuple during WAL replay if full_page_writes is on, but in some boundary cases, where a full-page image gets included along with the tuple being modified, that assumption doesn't hold. We need to take care of such boundary cases.
  • Recovery of undo logs is not completely reliable. Specifically, the xid -> log number mapping used to calculate the undo insertion point during recovery is not completely reliable: if the WAL for an operation is written before the redo point, but the insertion location in the log's meta page is updated after it, then after a crash the undo insertion point won't be correct.
  • Rollbacks for crashed transactions are performed after recovery, and currently we can only roll back transactions that ran in the postgres database. The undo worker is always connected to the postgres database, so that database can't be dropped.
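
The first open issue above is an ordering problem: an ERROR raised inside a critical section is escalated to PANIC, so a buffer lock that can fail must be taken beforehand. The sketch below shows the intended order using the standard buffer and critical-section primitives; the function name is hypothetical, and the undo-copying and WAL steps are only indicated in comments.

    #include "postgres.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"

    static void
    insert_prepared_undo_sketch(Buffer undo_buffer)
    {
        /*
         * Take the content lock before entering the critical section:
         * acquiring it can raise an error, and an ERROR inside a critical
         * section becomes a PANIC.
         */
        LockBuffer(undo_buffer, BUFFER_LOCK_EXCLUSIVE);

        START_CRIT_SECTION();

        /* ... copy the prepared undo record(s) into the locked undo page ... */
        MarkBufferDirty(undo_buffer);
        /* ... XLogInsert() the WAL record for the operation ... */

        END_CRIT_SECTION();

        UnlockReleaseBuffer(undo_buffer);
    }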

Integration with Pluggable Storage API

  • Currently, we don't have a nice way to handle different types of heaps in the backend, so we have used a storage-engine option to define a separate code path for zheap. To integrate it with the storage API, we need to check all the places where RelationStorageIsZHeap is used.
  • HeapTuple is used widely in the backend, so instead of changing all such places, we have written converter functions, zheap_to_heap and heap_to_zheap, which convert tuples from one format to the other. To integrate with the storage API, we need to check all the places where these functions are used.
  • Currently, we store the ZHeapTuple in the TupleTableSlot as a separate variable, which doesn't appear to be the best way. We would like to integrate this with the storage API; Andres has proposed an idea for doing so.
  • Snapshot-satisfies APIs: The snapshot mechanism works differently in zheap, as we need to traverse the undo chain to check whether a prior version of the tuple is visible to a snapshot (see the sketch below).
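
The following sketch shows the shape of the zheap visibility check described in the last item: start from the latest version of the tuple and walk back through its undo chain until a version visible to the snapshot is found. The type and helper functions (ZVersionSketch, version_visible, fetch_prior_version) are placeholders, not the actual zheap snapshot-satisfies API.

    #include "postgres.h"
    #include "utils/snapshot.h"

    /*
     * Placeholder for one version of a zheap tuple: either the latest version
     * on the page or a prior version reconstructed from undo.
     */
    typedef struct ZVersionSketch ZVersionSketch;

    extern bool            version_visible(ZVersionSketch *ver, Snapshot snap);
    extern ZVersionSketch *fetch_prior_version(ZVersionSketch *ver);  /* follows the undo pointer */

    static ZVersionSketch *
    zheap_satisfies_sketch(ZVersionSketch *latest, Snapshot snap)
    {
        ZVersionSketch *ver = latest;

        /* Walk back through the undo chain until a visible version is found. */
        while (ver != NULL && !version_visible(ver, snap))
            ver = fetch_prior_version(ver);

        return ver;                 /* NULL: no version is visible to this snapshot */
    }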