DirectTOAST
WiP
Extracted from my messages in the thread Support for 8-byte TOAST values (aka the TOAST infinite loop problem)
I still think we should go with direct TOAST tid pointers in varlena and not some kind of oid.
It removes the need for any oid management and is also a few orders of magnitude faster for large tables (though just 2x faster for small in-memory tables).
I plan to go over Michael's patch set here and see how much change is needed to add the "direct toast"
My goals are:
- fast lookup from skipping index lookup
- making the toast pointer in main heap as small as possible - hopefully just the 6 bytes of tid pointer - so that scans that do not need toasted values get more tuples from each page
- moving all the extra data into toast chunk record as there we are free to add whatever is needed
Currently I plan to introduce something like this for the toast chunk record:
| Column | Type | Storage | Comment |
|---|---|---|---|
| chunk_id | oid | plain | |
| chunk_seq | integer | plain | if not 0 when referenced from toast pointer then the toasted data starts at |
| chunk_data | bytea | plain | |
| added fields | | | |
| chunk_descriptor | bytea | plain | |
| toast_chunks | bytea | plain | encodes an array of (chunk_id, chunk_end_offset) |
| toast_encoding | bytea | plain | make this bytea as well for future-proofing. Initially it will be empty for a non-compressed toast value and 2 bytes (length + method) for a compressed one |
| dict_chunks | bytea | plain | encodes an array of (chunk_id, chunk_end_offset) |
| ver 0.1 added fields (REJECTED, arrays have huge storage overhead) | | | |
| | tid[] | plain | can be chained or make up a tree |
| | integer[] | plain | Starting offsets of the |
| | int | plain | compression method used; maybe should be an enum? |
| | tid[] | plain | pages to use as the compression dictionary, up to N pages, one level |
The added bytea fields are explained below.
- chunk_descriptor (WiP, may specialize for single-chunk, uncompressed, and compression span)
  - <in progress>
  - A user function pg_get_toast_descriptor(...) is provided for introspection
- toast_chunks
  - an array of (chunk_id, chunk_end_offset)
  - The array length is length(bytea) / 10. We may prefer to store it as two concatenated arrays rather than an array of pairs, for easier processing.
  - introspection functions are provided for easy debugging:
    - pg_toast_get_chunk_end_offsets(bytea) --> int[]
    - pg_toast_get_toast_chunks(bytea) --> tid[]
- toast_encoding
  - We make this a bytea as well for simplicity. Initially it will be 2 bytes (length + compression_method) and can thus be directly compared
  - future type-specific extensions can store much more complex info here
- dict_chunks
  - uses the same encoding as toast_chunks and can be used for referring to compression dictionaries
  - maybe it is better to store here just the tid + length of the dictionary and keep the rest of the info in the dictionary page itself? The length is needed because we may want to use the first few KB of the data itself as the dictionary when doing parallelized copy and saving of large toasted values
Why not explicit tid[] and int[] types?
Testing showed that explicit array types carry an extra 20 bytes of overhead per array. Also, packing related fields into the same bytea lets us avoid extending the null bitmap (currently there are 7 fields).
```
dtoast=# select *, pg_column_size(tidarr) tidarrsize, pg_column_size(tidbytea) as byteasize from ta order by 1;
              tidarr               |                      tidbytea                      | tidarrsize | byteasize
-----------------------------------+----------------------------------------------------+------------+-----------
 {}                                | \x                                                 |         13 |         1
 {"(1,1)"}                         | \x000000010001                                     |         27 |         7
 {"(1,1)","(1,2)"}                 | \x000000010001000000010002                         |         33 |        13
 {"(1,1)","(1,2)","(1,3)"}         | \x000000010001000000010002000000010003             |         39 |        19
 {"(1,1)","(1,2)","(1,3)","(1,4)"} | \x000000010001000000010002000000010003000000010004 |         45 |        25
                                   |                                                    |            |
(6 rows)

dtoast=# \d ta
               Table "public.ta"
  Column  | Type  | Collation | Nullable | Default
----------+-------+-----------+----------+---------
 tidarr   | tid[] |           |          |
 tidbytea | bytea |           |          |
```
Issues with current TOAST
What is your thinking on which of the attributes currently stored in varatt can be moved to the toast record itself? (I have not yet had a chance to look at the patches; I hope to get to it soon.)
TOAST using OID is SLOW
Going through an index lookup for each toasted field can become really slow once the index is large. It can get as bad as a sequential scan only managing a few rows per second.
Running out of OIDs
OID is a 32-bit uint, so once you have used up 4 billion of them you are done.
This can easily happen if your record has four toasted columns and you have 1 billion rows in a table.
The toast pointer stored in the record (varatt_external) is larger than needed
Out of the four 4-byte values stored in the record (va_rawsize, va_extinfo, va_valueid and va_toastrelid), only va_valueid really needs to be there.
I very much hope that we can move the two size attributes other than va_valueid from varatt_* to the toast record, and maybe just get rid of va_toastrelid, or at least compress it to a few bits that select one of a very small number of possible toastrelids stored elsewhere.
The correct place for compression info is definitely the toast record, so we should split it out from the two bits in va_extinfo into a separate field.
Here is the current structure for reference:
```c
typedef struct varatt_external
{
	int32		va_rawsize;		/* Original data size (includes header) */
	uint32		va_extinfo;		/* External saved size (without header) and
								 * compression method */
	Oid			va_valueid;		/* Unique ID of value within TOAST table */
	Oid			va_toastrelid;	/* RelID of TOAST table containing it */
} varatt_external;
```
When stored in a tuple it is wrapped into varattrib_1b_e for an extra 2 bytes (varatt_external goes into the place of va_data[FLEXIBLE_ARRAY_MEMBER]):
```c
/* TOAST pointers are a subset of varattrib_1b with an identifying tag byte */
typedef struct
{
	uint8		va_header;		/* Always 0x80 or 0x01 */
	uint8		va_tag;			/* Type of datum */
	char		va_data[FLEXIBLE_ARRAY_MEMBER]; /* Type-specific data */
} varattrib_1b_e;
```
For Direct TOAST the shortest varatt_external version would have only the TID pointer, and maybe a small "toastrelid selector" field if we cannot figure out how to make VACUUM FULL and CLUSTER work without it. Maybe we can re-use the bits left free by moving the compression info away to encode the toastrelid selector?
```c
/*
 * struct varatt_ext_direct is a new "Direct TOAST pointer", that is, the
 * information needed to fetch a Datum stored out-of-line in a TOAST table.
 * All the data about sizes and compression are stored in the toast record
 * pointed by ItemPointerData constructed from .ip_blkid and .ip_posid
 *
 * order of ip_posid and ip_blkid is swapped compared to
 * ItemPointerData so that it can be packed tightly without padding
 *
 * typedef struct ItemPointerData
 * {
 *     BlockIdData ip_blkid;
 *     OffsetNumber ip_posid;
 * } ItemPointer;
 *
 * This struct must not contain any padding, because we sometimes compare
 * these pointers using memcmp.
 *
 * Note that this information is stored unaligned within actual tuples, so
 * you need to memcpy from the tuple into a local struct variable before
 * you can look at these fields!  (The reason we use memcmp is to avoid
 * having to do that just to detect equality of two TOAST pointers...)
 */
typedef struct varatt_ext_direct
{
	OffsetNumber	ip_posid;
	BlockIdData		ip_blkid;
} varatt_ext_direct;
```
> ⓘ By leaving just the direct tid pointer in the
Storing the TOASTed value
Small values
If the toasted value is small enough to fit in one record, we just store it and return the ctid to be stored in the toast pointer.
If the value is compressed, we also fill in the corresponding compression fields. The original, uncompressed size is stored in offsets[0], and toast_pages is set to NULL.
Multi-Page values
If the value does not fit in one page, we proceed as follows:
<wip>