Segment Visibility Map
We would keep a dynamic visibility map at *segment* level, showing which segments have all of their rows 100% visible. This is known as the Segment Visibility Map (SVM).
We would keep at least two bits for each segment in the table, rather than for each block. The two bits allow us to represent four states:
- read_write
- read_only
- read_only_pending (explained later)
- read_only_frozen
Some additional states that have been discussed are:
- explicitly marked read only
- clustered
- compressed
- off-line (indicates high cost to access)
We should allow for 1 byte (8 bits) per segment.
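As a minimal sketch of how that byte might be laid out, assuming hypothetical names (this is not PostgreSQL code), the low two bits carry the four core states and the remaining bits are reserved for the additional flags listed above:

```c
/* Illustrative sketch only; names are hypothetical, not PostgreSQL APIs. */
#include <stdint.h>

/* Core visibility states, held in the low two bits of each segment's byte. */
typedef enum SvmState
{
    SVM_READ_WRITE        = 0,
    SVM_READ_ONLY         = 1,
    SVM_READ_ONLY_PENDING = 2,
    SVM_READ_ONLY_FROZEN  = 3
} SvmState;

/* Additional flags discussed above, using the remaining bits of the byte. */
#define SVM_FLAG_EXPLICIT_READ_ONLY  0x04
#define SVM_FLAG_CLUSTERED           0x08
#define SVM_FLAG_COMPRESSED          0x10
#define SVM_FLAG_OFFLINE             0x20   /* high cost to access */

static inline SvmState
svm_get_state(uint8_t entry)
{
    return (SvmState) (entry & 0x03);
}

static inline uint8_t
svm_set_state(uint8_t entry, SvmState state)
{
    return (uint8_t) ((entry & ~0x03) | state);
}
```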
That's possibly small enough to store directly in pg_class, even for fairly large tables. However, we want to avoid bloating the catalog when the SVM changes, and we may also need to handle some very big tables. So we would store the SVM as a single column in a new table, pg_svm, with storage attribute MAIN, so that it is kept inline and is compressible.
A segment is a range of blocks whose size can be altered as required. The initial suggestion is that a segment would be 1GB in size, the same as a physical data segment. We might allow this to be user settable, or we might adjust it automatically to minimise the overhead and maximise the usefulness.
It would also be possible to change the SVM segment size dynamically, recalculating the bit positions accordingly, if that proved desirable to optimise the overhead and benefit of this approach.
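For illustration, with PostgreSQL's default 8KB block size a 1GB segment covers 131072 blocks, so mapping a block to its SVM segment is a single division. A sketch, with the segment size held in a variable so it could be recalculated if the segment size were ever changed (names other than BLCKSZ are hypothetical):

```c
/* Illustrative sketch; BLCKSZ is redefined here only to keep it self-contained. */
#include <stdint.h>

#define BLCKSZ 8192                          /* PostgreSQL's default block size */

/* Blocks per SVM segment: 1GB / 8KB = 131072 initially, but recalculable. */
static uint32_t svm_blocks_per_segment = (1024 * 1024 * 1024) / BLCKSZ;

static inline uint32_t
svm_segment_for_block(uint32_t blkno)
{
    return blkno / svm_blocks_per_segment;
}
```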
Access to the SVM
Since the map is very small and hardly ever changes for a table, we can cache it easily in each backend.
If the map does change, we can perform a relcache invalidation to make everybody re-read the map. (A few improvements are needed there as well, but they are not relevant here.)
No dynamic shared memory cache is required, because any concurrent changes to the table would be ignored by a scan anyway: it doesn't matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any new scan that starts will attempt to lock the table and then perform a relcache check before continuing, so the visibility will be set correctly for *that* scan at least.
In most cases the visibility map can be summarised as a single boolean to show whether *any* 100% visible segments exist. That makes accessing the map very cheap in the common, unset case. This would be an additional boolean entry on the Relation struct.
The check to see if the relcache has changed is essentially free, since we already do it when we grab a lock. Since INSERT, UPDATE and DELETE only spoil the cache, they don't need a completely up-to-date picture. (That aspect may change.)
If we want backends running non-utility commands to see the SVM, we would need to check the relcache each time we access the table, not just each time we lock it. We need not do that in all cases, only in those that would benefit from knowing the SVM information.
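A sketch of how a scan might consult the cached map, assuming a hypothetical per-backend struct standing in for the Relation entry; in the common unset case the test costs only a boolean check:

```c
/* Illustrative sketch; struct and function names are hypothetical. */
#include <stdbool.h>
#include <stdint.h>

typedef struct SvmRelCache
{
    bool     has_read_only_segments;   /* summary boolean on the Relation */
    uint32_t nsegments;
    uint8_t *svm;                      /* one state byte per segment */
} SvmRelCache;

/* Returns true if the segment can be treated as 100% visible by this scan. */
static bool
svm_segment_all_visible(const SvmRelCache *rel, uint32_t segno)
{
    uint8_t state;

    /* Common case: no read-only segments anywhere, so nothing to look up. */
    if (!rel->has_read_only_segments)
        return false;

    state = (uint8_t) (rel->svm[segno] & 0x03);
    return state == 1 || state == 3;    /* read_only or read_only_frozen */
}
```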
Unsetting the Segment Visibility Map
The SVM will be "unset", i.e. set to read_write, if an INSERT, UPDATE or DELETE occurs within a segment marked read_only or read_only_pending. This catalog change can be handled with a non-transactional overwrite: the bit always exists already, so the overwritten data is always the same size. That's pessimistic, since it resets the state even if the UPDATE or DELETE aborts, but it's better than holding the row locked until the transaction completes, which might prevent other things from happening. Alternatively, we could make the change at transaction commit.
DML commands would first check the boolean rel.has_read_only_segments to see if any non-read_write segments exist. If so, we would calculate which segment the DML operation is writing to and then set its state accordingly.
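A sketch of that check, under the same hypothetical names as above; the pg_svm catalog is modelled here as a plain in-memory byte array, and the non-transactional, same-size overwrite described earlier is a simple in-place store:

```c
/* Illustrative sketch; names are hypothetical, not PostgreSQL APIs. */
#include <stdbool.h>
#include <stdint.h>

#define SVM_STATE_MASK      0x03
#define SVM_READ_WRITE      0
#define BLOCKS_PER_SEGMENT  131072           /* 1GB of 8KB blocks */

/* Called by INSERT/UPDATE/DELETE before it modifies a block. */
static void
svm_unset_for_write(uint8_t *svm, bool has_read_only_segments, uint32_t blkno)
{
    uint32_t segno;

    if (!has_read_only_segments)
        return;                          /* fast path: whole table is read_write */

    segno = blkno / BLOCKS_PER_SEGMENT;

    if ((svm[segno] & SVM_STATE_MASK) != SVM_READ_WRITE)
    {
        /* Pessimistic, non-transactional overwrite: the entry already exists
         * and stays the same size, and is reset even if the writing
         * transaction later aborts. */
        svm[segno] = (uint8_t) (svm[segno] & ~SVM_STATE_MASK);
    }
}
```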
The relcache invalidation caused by making a segment read_only_pending can flush the rd_targBlock for the relation in each backend. Other INSERTs are not possible, since all current code paths either use relation extension or the FSM when the table is non-empty, and neither of those will cause a problem: extension only touches the last segment, which will never be read_only_pending, and the FSM contains no entries for a read_only_pending segment.
SHARE locks currently write to data blocks, but since they represent transient state we do not need to unset either the partitioning or the visibility map.
Setting the Segment Visibility Map
Setting the visibility map is tricky since this must work whilst concurrent changes occur to the tables.
We do this by having an intermediate state, similar to the way that CREATE INDEX CONCURRENTLY works.
First, VACUUM will scan a segment and if it finds all tuples are visible then it will set the state of that segment to read_only_pending. We also record the xid of the VACUUM, so we know when this was set.
If the next VACUUM still sees the read_only_pending state, then we VACUUM the segment a second time. If all rows are still visible, and read_only_pending is still set at the end of this second VACUUM, then we set the read_only state. (If changes to the SVM are immediate we don't have to wait; if we only make SVM changes at end of transaction then we have to wait for all concurrent lockers of the table to complete before we can set the read_only state.)
Why do we do this? It's possible the first VACUUM missed some concurrent changes. However, once read_only_pending is set any further changes will clear it. So at the end of the second VACUUM we are guaranteed to have seen any changes between the start of the first VACUUM and the end of the second VACUUM.
If we discover that a segment is completely frozen, then we set read_only_frozen rather than just read_only.
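A sketch of the per-segment transition that the two VACUUM passes drive, using the hypothetical encoding from earlier; the xid of the pass that set read_only_pending is recorded so we know when it was set:

```c
/* Illustrative sketch; names are hypothetical. */
#include <stdbool.h>
#include <stdint.h>

typedef struct SvmSegment
{
    uint8_t  state;             /* 0=read_write, 1=read_only,
                                 * 2=read_only_pending, 3=read_only_frozen */
    uint32_t pending_set_xid;   /* xid of the VACUUM that set _pending */
} SvmSegment;

/* Called by VACUUM after it has scanned one segment. */
static void
svm_vacuum_segment(SvmSegment *seg, bool all_visible, bool all_frozen,
                   uint32_t vacuum_xid)
{
    if (!all_visible)
        return;                          /* segment stays read_write */

    switch (seg->state)
    {
        case 0:                          /* read_write: first pass */
            seg->state = 2;              /* read_only_pending */
            seg->pending_set_xid = vacuum_xid;
            break;

        case 2:                          /* read_only_pending: second pass */
            /* Any intervening write would have cleared the pending state,
             * so if it has survived we have seen every change between the
             * start of the first pass and the end of this one. */
            seg->state = all_frozen ? 3 : 1;
            break;

        default:                         /* already read_only / frozen */
            break;
    }
}
```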
If we perform a VACUUM FREEZE then we re-scan read_only segments, expecting to set them to read_only_frozen.
VACUUM will skip read_only and read_only_frozen segments.
VACUUM FREEZE will skip read_only_frozen segments.
VACUUM FULL will always scan all segments, so that we always have a safe option to fall back to in case of suspected problems.
We never mark the highest numbered segment read_only_pending, since it is likely to expand when INSERTs occur. This is also a useful way of excluding smaller tables from the overheads involved for this feature.
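The skipping rules above can be summarised in a small sketch, again with hypothetical names; VACUUM FULL remains the safe fall-back that scans everything:

```c
/* Illustrative sketch; names are hypothetical.
 * States as before: 1 = read_only, 3 = read_only_frozen. */
#include <stdbool.h>
#include <stdint.h>

typedef enum { VAC_PLAIN, VAC_FREEZE, VAC_FULL } VacMode;

static bool
svm_vacuum_should_scan(uint8_t state, VacMode mode)
{
    if (mode == VAC_FULL)
        return true;                     /* safe fall-back: scan everything */

    if (mode == VAC_FREEZE)
        return state != 3;               /* skip only read_only_frozen */

    /* Plain VACUUM skips both read_only and read_only_frozen segments. */
    return state != 1 && state != 3;
}
```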
VACUUM Benefits
In an insert-only table this would mean that only the last one or two segments would be read_write, so VACUUM would always run in a bounded time, no matter how large the table.
Tables frequently have whole areas deleted en masse, so this would also help direct VACUUM only to the areas of change.
We recognise that read-only segments will change the autovacuum calculation; we simply exclude them entirely. This will mean that VACUUMs run more regularly than they do now, but when they do run they will be quicker. (Note that currently, VACUUMs on a growing table get further and further apart as the table grows, assuming a constant write rate.)
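One way to picture the exclusion, as a hedged sketch: the shape mirrors autovacuum's existing threshold test, but the scale factor is applied only to tuples in segments VACUUM would actually visit. The function and parameter names are hypothetical.

```c
/* Illustrative sketch; not the actual autovacuum code. */
#include <stdbool.h>

static bool
svm_autovacuum_needed(double dead_tuples,
                      double tuples_in_read_write_segments,
                      double vac_base_threshold,
                      double vac_scale_factor)
{
    /* Read-only segments are excluded entirely, so the scale factor is
     * applied only to the tuples that VACUUM would actually visit. */
    double threshold = vac_base_threshold +
                       vac_scale_factor * tuples_in_read_write_segments;

    return dead_tuples > threshold;
}
```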
A later VACUUM will still be required to freeze the rows, though that can happen when we hit the freeze limit, or earlier if we run an explicit VACUUM FREEZE. We would still store only one oldest xid for each table; there's no need to store a specific xid for each segment.
Interaction with FSM
This departs completely from the idea that the visibility map and the freespace map are related somehow. However, there is a change required in the FSM to ensure the proposed technique is stable and useful: if we scan a segment and find it 100% visible, but there is freespace in even one of its blocks, that freespace will soon be reused and the segment's visibility state will quickly be unset again.
If there is more than 5% total freespace in the segment, we will neither set read_only_pending nor report the segment's blocks to the FSM. That way, small amounts of freespace won't destroy the benefits of making segments read-only.
VACUUM currently overwrites FSM information; if that were not the case, we would have to actively remove the entries for the newly read-only segments. Perhaps the limit should be (100 - fillfactor)% rather than a fixed 5%, since fillfactor is normally set according to the user's expectation of the number of updates likely on a table.
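A sketch of that decision, with hypothetical names; the fixed 5% limit is used here, with the (100 - fillfactor)% alternative noted in the comment:

```c
/* Illustrative sketch; names are hypothetical. Decides, after a segment
 * scan, whether the segment is full enough to be worth marking
 * read_only_pending and withholding from the FSM. */
#include <stdbool.h>

static bool
svm_segment_full_enough(double free_bytes, double segment_bytes)
{
    /* Fixed 5% limit as proposed; an alternative would be
     * (100 - fillfactor)%, tracking the user's update expectations. */
    return free_bytes <= 0.05 * segment_bytes;
}
```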
Utilities
It would be possible to have a segment-aware CLUSTER. That would require us to use another flag in the visibility map to indicate clustered/unclustered. That might assist us in improving REINDEX time, since the rows for read_only_clustered segments would not require sorting.