DirectTOAST

From PostgreSQL wiki
Jump to navigationJump to search

WiP

Extracted from my messages in the thread Support for 8-byte TOAST values (aka the TOAST infinite loop problem)


I still think we should go with direct TOAST tid pointers in varlena and not some kind of oid.

It will remove the need for any oid management and also will be a few orders of magnitude faster for large tables (just 2x faster for in-memory small tables)

I plan to go over Michael's patch set here and see how much change is needed to add the "direct toast"

My goals are:

  1. fast lookup from skipping index lookup
  2. making the toast pointer in main heap as small as possible - hopefully just the 6 bytes of tid pointer - so that scans that do not need toasted values get more tuples from each page
  3. moving all the extra data into toast chunk record as there we are free to add whatever is needed

Currently I plan to introduces something like this for toast chunk record

TOAST table columns
Column Type Storage comment
chunk_id oid plain
  • 0 when not using toast index
  • 0xfffe non-deletable, for example when used as dictionary for multiple toasted values.
chunk_seq integer plain

if not 0 when referenced from toast pointer then the toasted data starts at toast_pages[0] (or below it in that tree), which *must* have chunk_id = 0

chunk_data bytea plain
added fields
chunk_descriptor bytea plain
  • empty for non-compressed single-chunk toast value as original size === size(chunk_data)
  • (original_size) for compressed single-chunk toast value
  • (toast_ctid) for non-compressed multi-chunk toast
  • (toast_tid, compression_span_id) for compressed multi-chunk toast
toast_chunks bytea plain encodes and array of (chunk_id, chunk_end_offset)
toast_encoding bytea plain make this bytea as well for future-proofing. Initially it will empty for non-compressed toast value and 2 bytes (length + method) for compressed
dict_chunks bytea plain encodes an array of (chunk_id, chunk_end_offset)
ver 0.1 added fields (REJECTED, arrays have huge storage overhead)
toast_pages tid[] plain can be chained or make up a tree
offsets integer[] plain

Starting offsets of the toast_pages (octets or type-specific units), upper bit is used to indicate that a new compressed span starts at that offset, 2nd highest bit indicates that the page is another tree page, that is it has also a non-null toast_pages and offsets fields.

comp_method int plain -- compression methos used maybe should be enum ?
dict_pages tid[] plain -- pages to use as the compression dictionary, up to N pages, one level

The added bytea fields are explained below

chunk_descriptor (WiP, may specialize for single-chunk, uncompressed, and compression span)
<in progress>
An user function pg_get_toast_descriptor(...) is provided for introspection
toast_chunks
array of (chunk_id, chunk_end_offset)
The array length is length(bytea) / 10. We may prefer to store as two concatenated arrays not array of pairs for easier processing.
provide introspection functions
* pg_toast_get_chunk_end_offsets(bytea) --> int[]
* pg_toast_get_toast_chunks(bytea) --> tid[]
for easy debugging
toast_encoding
We make this bytea as well for simplicity. Initially it will be 2 bytes - length + compression_method and can be thus directly compared
future type-specific extensions can store here much more complex info
dict_chunks
* this uses same encoding as toast_chunks and can be used for referring to compression dictionaries
* maybe it is better to have here just the tid + length of the dictionary and have the rest of info in the ditionary page itself ? Length is needed as we may want to use the firts few KB of the data itself in case we do parallelized copy and saving of large toasted values

Why not explicit tid[] and int[] types?

Testing showed that explicit array types have extra 20 bytes overhead per array. Also packing related fields in same bytea lets us not extend the null bitmap (currently there are 7 fields)

dtoast=# select *, pg_column_size(tidarr) tidarrsize, pg_column_size(tidbytea) as byteasize from ta order by 1;
              tidarr               |                      tidbytea                      | tidarrsize | byteasize 
-----------------------------------+----------------------------------------------------+------------+-----------
 {}                                | \x                                                 |         13 |         1
 {"(1,1)"}                         | \x000000010001                                     |         27 |         7
 {"(1,1)","(1,2)"}                 | \x000000010001000000010002                         |         33 |        13
 {"(1,1)","(1,2)","(1,3)"}         | \x000000010001000000010002000000010003             |         39 |        19
 {"(1,1)","(1,2)","(1,3)","(1,4)"} | \x000000010001000000010002000000010003000000010004 |         45 |        25
                                   |                                                    |            |          
(6 rows)

dtoast=# \d ta
                 Table "public.ta"
  Column  | Type  | Collation | Nullable | Default 
----------+-------+-----------+----------+---------
 tidarr   | tid[] |           |          | 
 tidbytea | bytea |           |          | 

Issues with current TOAST

What is your thinking of which of the attributes currently stored in varatt can be moved to the toast record itself (I have not yet had a chance to look at the patches, hope to get to it soon)

TOAST using OID is SLOW

Going through index lookup for each toasted field can become really slow once the index is large. It can be as bad as sequential scan being able to scan only a few lines per second.

Running out of OIDs

OID is 32-bit uint , so once you have used up 4 billion of them you are done.

This can easily happen if your record has four toasted columns and you have 1 billion rows in a table.

The toast pointer stored in the record (varatt_external) is larger than needed

Out of four 4-byte values stored in the record (va_rawsize, va_extinfo,va_valueid and va_toastrelid) only the va_valueid really needs to be there.

I very much hope that we can move the two size attributes other than va_valueid from varatt_* to the toast record and maybe just get rid of va_toastrelid, or at least compress it to a few bits to select one of a very small number of possible toastrelids stored elsewhere.

The correct place for compression info definitely is in the toast record so we should split it out from the two bits in va_extinfo to a separate field.

Here is the current structure for reference

typedef struct varatt_external
{
  int32 va_rawsize;  /* Original data size (includes header) */
  uint32 va_extinfo; /* External saved size (without header) and 
                      * compression method */
  Oid va_valueid;    /* Unique ID of value within TOAST table */
  Oid va_toastrelid; /* RelID of TOAST table containing it */
} varatt_external;

When stored in a tuple it is wrapped into varattrib_1b_e for extra 2 bytes (varatt_external goes into the place of va_data[FLEXIBLE_ARRAY_MEMBER];)

/* TOAST pointers are a subset of varattrib_1b with an identifying tag byte */
typedef struct
{
   uint8       va_header;      /* Always 0x80 or 0x01 */
   uint8       va_tag;         /* Type of datum */
   char        va_data[FLEXIBLE_ARRAY_MEMBER]; /* Type-specific data */
} varattrib_1b_e;


For Direct Toast the shortest varatt_external version would have only the TID pointer and maybe a small "toastrelid selector" field if we can not figure out how to make VACUUM FULL and CLUSTER work without having it. Maybe we can re-use the bits now left free from movinge compression info away to encode the toastrelid selector?.

/*
 * struct varatt_ext_direct is a new "Direct TOAST pointer", that is, the
 * information needed to fetch a Datum stored out-of-line in a TOAST table.
 * All the data about sizes and compression are stored in the toast record 
 * pointed by ItemPointerData constructed from .ip_blkid and .ip_posid
 *
 * order of ip_posid and ip_blkid is swapped compared to 
 * ItemPointerData so that it can be packed tightly without padding
 * 
 * typedef struct ItemPointerData
 * {
 * 		BlockIdData ip_blkid;
 * 		OffsetNumber ip_posid;
 * } ItemPointer;
 * 
 * This struct must not contain any padding, because we sometimes compare
 * these pointers using memcmp.
 *
 * Note that this information is stored unaligned within actual tuples, so
 * you need to memcpy from the tuple into a local struct variable before
 * you can look at these fields!  (The reason we use memcmp is to avoid
 * having to do that just to detect equality of two TOAST pointers...)
 */
typedef struct varatt_ext_direct
{
    OffsetNumber  ip_posid;
    BlockIdData   ip_blkid;
} varatt_ext_direct;

By leaving just the direct tid pointer in the varatt_external_direct structure we can decrease the storage needed in the heap tuple more than 2X - from 18 bytes to just 8, probably more in cases 18 causes extra alignment padding.

Storing the TOASTed value

Small values

If the toasted value is small enough to fit in one record, we just go and store it, returning the ctid to be stored in the toast pointer.

If the value is compressed, we also fill in the corresponding compression fields.

The original, uncompressed size is stored in offsets[0] , and toast_pages is set to NULL

Multi-Page values

If value does not fit in one page we proceed as follows

<wip>