DirectTOAST
WiP
Extracted from my messages in the thread Support for 8-byte TOAST values (aka the TOAST infinite loop problem)
I still think we should go with direct TOAST tid pointers in varlena and not some kind of oid.
It removes the need for any oid management and is also a few orders of magnitude faster for large tables (though just 2x faster for small in-memory tables).
I plan to go over Michael's patch set here and see how much change is needed to add the "direct toast"
My goals are:
- fast lookup from skipping index lookup
- making the toast pointer in main heap as small as possible - hopefully just the 6 bytes of tid pointer - so that scans that do not need toasted values get more tuples from each page
- moving all the extra data into toast chunk record as there we are free to add whatever is needed
Currently I plan to introduce something like this for the toast chunk record:
| Column | Type | Storage | Comment |
|---|---|---|---|
| chunk_id | oid | plain | |
| chunk_seq | integer | plain | if not 0 when referenced from toast pointer then the toasted data starts at |
| chunk_data | bytea | plain | |
| added fields | | | |
| chunk_descriptor | bytea | plain | |
| toast_chunks | bytea | plain | encodes an array of (chunk_id, chunk_end_offset) |
| toast_encoding | bytea | plain | make this bytea as well for future-proofing. Initially it will be empty for a non-compressed toast value and 2 bytes (length + method) for a compressed one |
| dict_chunks | bytea | plain | encodes an array of (chunk_id, chunk_end_offset) |
| ver 0.1 added fields (REJECTED, arrays have huge storage overhead) | | | |
| | tid[] | plain | can be chained or make up a tree |
| | integer[] | plain | Starting offsets of the |
| | int | plain | compression method used; maybe should be an enum? |
| | tid[] | plain | pages to use as the compression dictionary, up to N pages, one level |
The added bytea fields are explained below.
- chunk_descriptor (WiP, may specialize for single-chunk, uncompressed, and compression span)
  - <in progress>
  - A user function pg_get_toast_descriptor(...) is provided for introspection
- toast_chunks
  - an array of (chunk_id, chunk_end_offset)
  - The array length is length(bytea) / 10. We may prefer to store it as two concatenated arrays rather than an array of pairs, for easier processing.
  - introspection functions are provided for easy debugging:
    - pg_toast_get_chunk_end_offsets(bytea) --> int[]
    - pg_toast_get_toast_chunks(bytea) --> tid[]
- toast_encoding
  - We make this a bytea as well for simplicity. Initially it will be 2 bytes (length + compression_method) and can thus be directly compared
  - future type-specific extensions can store much more complex info here
- dict_chunks
  - uses the same encoding as toast_chunks and can be used for referring to compression dictionaries
  - maybe it is better to store here just the tid + length of the dictionary and keep the rest of the info in the dictionary page itself? The length is needed because we may want to use the first few KB of the data itself as the dictionary when doing parallelized copy and saving of large toasted values
Why not explicit tid[] and int[] types?
Testing showed that explicit array types carry an extra 20 bytes of overhead per array. Also, packing related fields into the same bytea lets us avoid extending the null bitmap (currently there are 7 fields).
```
dtoast=# select *, pg_column_size(tidarr) tidarrsize, pg_column_size(tidbytea) as byteasize from ta order by 1;
              tidarr               |                      tidbytea                      | tidarrsize | byteasize
-----------------------------------+----------------------------------------------------+------------+-----------
 {}                                | \x                                                 |         13 |         1
 {"(1,1)"}                         | \x000000010001                                     |         27 |         7
 {"(1,1)","(1,2)"}                 | \x000000010001000000010002                         |         33 |        13
 {"(1,1)","(1,2)","(1,3)"}         | \x000000010001000000010002000000010003             |         39 |        19
 {"(1,1)","(1,2)","(1,3)","(1,4)"} | \x000000010001000000010002000000010003000000010004 |         45 |        25
                                   |                                                    |            |
(6 rows)

dtoast=# \d ta
               Table "public.ta"
  Column  | Type  | Collation | Nullable | Default
----------+-------+-----------+----------+---------
 tidarr   | tid[] |           |          |
 tidbytea | bytea |           |          |
```
Issues with current TOAST
What is your thinking on which of the attributes currently stored in varatt can be moved to the toast record itself? (I have not yet had a chance to look at the patches; I hope to get to it soon.)
TOAST using OID is SLOW
Going through an index lookup for each toasted field can become really slow once the index is large. It can get as bad as a sequential scan only managing a few rows per second.
Running out of OIDs
OID is a 32-bit uint, so once you have used up 4 billion of them you are done.
This can easily happen if your record has four toasted columns and you have 1 billion rows in a table.
The toast pointer stored in the record (varatt_external) is larger than needed
Out of the four 4-byte values stored in the record (va_rawsize, va_extinfo, va_valueid and va_toastrelid), only va_valueid really needs to be there.
I very much hope that we can move the two size attributes other than va_valueid from varatt_* to the toast record, and maybe just get rid of va_toastrelid, or at least compress it to a few bits that select one of a very small number of possible toastrelids stored elsewhere.
The correct place for compression info is definitely the toast record, so we should split it out from the two bits in va_extinfo into a separate field.
Here is the current structure for reference:
```c
typedef struct varatt_external
{
	int32		va_rawsize;		/* Original data size (includes header) */
	uint32		va_extinfo;		/* External saved size (without header) and
								 * compression method */
	Oid			va_valueid;		/* Unique ID of value within TOAST table */
	Oid			va_toastrelid;	/* RelID of TOAST table containing it */
} varatt_external;
```
When stored in a tuple it is wrapped into varattrib_1b_e for an extra 2 bytes (varatt_external goes into the place of va_data[FLEXIBLE_ARRAY_MEMBER]):
```c
/* TOAST pointers are a subset of varattrib_1b with an identifying tag byte */
typedef struct
{
	uint8		va_header;		/* Always 0x80 or 0x01 */
	uint8		va_tag;			/* Type of datum */
	char		va_data[FLEXIBLE_ARRAY_MEMBER]; /* Type-specific data */
} varattrib_1b_e;
```
For Direct TOAST the shortest varatt_external version would have only the TID pointer, and maybe a small "toastrelid selector" field if we cannot figure out how to make VACUUM FULL and CLUSTER work without it. Maybe we can re-use the bits left free by moving the compression info away to encode the toastrelid selector?
```c
/*
 * struct varatt_ext_direct is a new "Direct TOAST pointer", that is, the
 * information needed to fetch a Datum stored out-of-line in a TOAST table.
 * All the data about sizes and compression are stored in the toast record
 * pointed by ItemPointerData constructed from .ip_blkid and .ip_posid
 *
 * order of ip_posid and ip_blkid is swapped compared to
 * ItemPointerData so that it can be packed tightly without padding
 *
 * typedef struct ItemPointerData
 * {
 *     BlockIdData ip_blkid;
 *     OffsetNumber ip_posid;
 * } ItemPointer;
 *
 * This struct must not contain any padding, because we sometimes compare
 * these pointers using memcmp.
 *
 * Note that this information is stored unaligned within actual tuples, so
 * you need to memcpy from the tuple into a local struct variable before
 * you can look at these fields!  (The reason we use memcmp is to avoid
 * having to do that just to detect equality of two TOAST pointers...)
 */
typedef struct varatt_ext_direct
{
	OffsetNumber	ip_posid;
	BlockIdData		ip_blkid;
} varatt_ext_direct;
```
> ⓘ By leaving just the direct tid pointer in the
Storing the TOASTed value
Small values
If the toasted value is small enough to fit in one record, we just store it and return the ctid to be stored in the toast pointer.
If the value is compressed, we also fill in the corresponding compression fields. The original, uncompressed size is stored in offsets[0], and toast_pages is set to NULL.
Multi-Page values
If the value does not fit in one page, we proceed as follows:
<wip>