Simon Riggs, 2ndQuadrant
Segment Exclusion is also known as Storage Indexes in Oracle and MinMax Indexes in Vectorwise, though both of those implementations come later than the first mention on Postgres hackers.
- 1 Recap: Very Large Table Use Case
- 2 In-table Partitioning
- 3 Segment Exclusion
- 4 Noting which segments are read-only
- 5 Visibility Map
- 6 Setting the Visibility Map
- 7 Storing the visibility map
- 8 Unsetting the visibility map
- 9 Utilities
- 10 Overlap with Visibility Map Proposal
- 11 Index Scans
- 12 Other Changes
- 13 Comparison with other Partitioning approaches
- 14 Testing whether this would work for your app
- 15 Conclusions
Recap: Very Large Table Use Case
Many large tables have a regular access pattern:
- new rows inserted frequently
- some updates and deletes on the "recent" data
- older data becomes effectively read-only
- at some point data may be removed in bulk from the table
Most tables vary on what "recent" means; some are insert only, many not. A typical example might be a large history table where the last 6 months data is updated, but after that the data has almost zero change.
Partitioning relies on knowledge of the physical location of data to exclude sections of data from a scan.
Currently the partitioning knowledge is provided by user declarations in the form of constraints on tables linked together by views or inheritance. We also currently do this in "constraint-first" fashion, i.e. we put the data where the constraints say we should, rather than deriving the constraints from the data.
In the common use-case there is a clear affinity between certain data columns and the physical placement of data in a table. This is true for columns such as LOG_DATE or TRANSACTION_DATE, but is also true for any columns where the id column is assigned by a Sequence. HOT helps this to remain true over time.
Given the above use case, when there is an affinity between data values and data placement *and* we know that older data tends to become read only, we can then realise that certain physical sections of a table will become effectively read only. (My experience is that the larger the table, the less random the write pattern - whatever the guys that wrote TPC-C say).
If we were able to keep track of which sections of a table are now read-only then we would be able to record information about the data in that section and use it to help solve queries. This is turning the current thinking on its head: we could derive the constraints from the data, rather than placing the data according to the constraints. That would be much more natural: load data into the table and have the system work out the rest.
We can imagine that we do this within a single table. Imagine: split a table up into 100 sections, then record the min/max values of all tuples in those sections. When we perform a SeqScan we could skip over sections that could be proved would never satisfy the query predicate. Same thing as partitioning, but within the table.
Currently tables are already broken up into 1GB file segments, so it seems fairly natural to pick that as our section size. That fits reasonably well with recommendations for ideal partition size. It is also easy with this technique to provide variable sized sections, by using a storage parameter at table-level, like fillfactor.
After we note that a segment is read-only we can scan the segment and record min/max values for all columns. These are then "implicit constraints", which can then be used for segment exclusion in a similar manner as we do with the current constraint exclusion mechanism.
The implicit constraints can then allow SeqScans to assess segment exclusion for each segment at execution time, i.e. a scan can skip a whole segment if constraints don't match. If a segment does not (yet) have implicit constraints it will always be scanned in full. This allows partitioning, synch scans and buffer recycling to work much better together than they currently do.
Other scans types might also use segment exclusion, though this would only be useful for scans retrieving many rows, otherwise the overhead of segment exclusion might not be worthwhile.
This would also allow a Merge Join to utilise exclusion for the inner plan at execution time. Instead of scanning the whole inner table plan from the start, we would be able to start the scan from the appropriate segment. (This would be for the first tuple from the inner table, not for every inner tuple). This would require us to pass the current value of the outer plan down into the inner plan. The typical executor nodes on the inner plan would be a SortNode and below that a SeqScan. In that case the executor would need to pass the outer value from the MergeJoinNode down thru the SortNode to the SeqScan node. The SeqScan node could then perform partition exclusion, reducing the time for that scan and also reducing the time for the resulting sort. This sounds like it might be worth doing in normal cases also. It might turn out that the potentially applicable cases are already excluded during planning, I haven't thought about that aspect in enough detail yet.
If we collect data for all columns then many of our implicit constraints would be useless. e.g. if a column only has a few values and these are present in all segments. Matching our predicate against all constraints would be expensive, so we must weed out poor constraints. We would do this by removing any constraint that overlapped more than 10% of other segments. Various heuristics would likely need to be discovered during development to make this work without resorting to manual commands.
Note that all of this exclusion is happening in the executor, not the planner. That allows this form of exclusion to work with stable functions and parameters without problem.
It would also be possible to exclude some types of value in the planner only, but this only works when the planner and executor are run at the same time.
One potential objection to this approach is that immutable functions and constants cannot be removed by the planner in a prepared query because the partition constraints are dynamic.
Noting which segments are read-only
Everything so far has relied upon our ability to note which segments of a table are read-only. We could do this in two ways
- 1) have the system automatically keep track of non-changing data
- 2) have the DBA issue a command to "mark" a segment as read-only now
Probably a combination of the two is better, so we have three states for segments
- EFFECTIVE_READ_ONLY (set by 1 only)
- MARKED_READ_ONLY (set by 2 only)
Having said that I want to concentrate on (1), though consider the other one also in case requested by reviewers.
So *how* would we mark segments as read only? It turns out that attempting to do this is extremely similar to Heikki's Visibility Map idea, so I'll describe it in those terms, even though there are some differences. It also brings to mind Andrew Sullivan's read-only tuples concept.
We would keep a dynamic visibility map at *segment* level, showing which segments have all rows as 100% visible. No freespace map data would be held at this level.
Most everything about Heikki's Visibility Map idea holds, just that we have a bit for each segment in the table, rather than each block. The map is very small and hardly ever changes for a table, so we can cache it easily on each backend. If the map does change we can perform a relcache invalidation to make everybody re-read the map.
No dynamic shared memory cache is required because any concurrent changes to the table would be ignored by a Scan anyway, so it doesn't matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any new scans that start will attempt to lock the table and then perform a rel cache check before continuing. So the visibility will be set correctly for *that* scan at least.
In most cases the visibility map can be summarised as a single boolean to show whether any 100% visible segments exist. That makes accessing the map very cheap in the common, unset case.
The check to see if the rel cache has changed in essentially free, since we do it already when we grab a lock. The only case this isn't true for is when we attempt to lock a table that we have already locked in this transaction, since that is currently optimised away. In that case we would check the rel cache only for tables that had read only segments when we *first* accessed the map during the transaction. We will always have a map to allow us to make the check at this time, and we only need to re-check if we had values set. If segments go from read write to effective read only during our access we don't need to worry. So the overhead in the non-partitioned case is a single boolean test, whilst for partitioned tables we would need to re-access the cache for each statement against the table. That seems acceptable because we may use the visibility information to increase performance considerably in many cases.
Setting the Visibility Map
We need another intermediate state in the VM to cope with concurrent changes while we are setting the map, called this READ_ONLY_PENDING.
VACUUM would scan all READ_WRITE_ALLOWED segments and mark some of them as READ_ONLY_PENDING if 100% of the remaining rows that it has seen are visible to all current backends. Marking the map will cause a rel cache invalidation.
We would then re-scan newly marked segments to derive min/max values for implicit constraints, performed by autovacuum worker processes. If we see a change made by a transaction after the xmin of the VACUUM, then we will abort this scan and reset the map to READ_WRITE_ALLOWED. This allows us to capture concurrent changes behind the first VACUUM scan.
Changes to the table after the first VACUUM but behind the scan point of the second scan will see the READ_ONLY_PENDING state and clear it back to READ_WRITE_ALLOWED.
If the second scan of a READ_ONLY_PENDING segment does not find any new changes then it can successfully update the map to EFFECTIVE_READ_ONLY.
We would never mark the highest numbered segment EFFECTIVE_READ_ONLY, since it is likely to expand when INSERTs occur. This is also a useful way of excluding smaller tables from the overheads involved for this feature.
In an insert-only table this would mean that only the last 1 or 2 segments would be read write, so VACUUM would always run in a bounded time, no matter how large the table.
When would we attempt to set the visibility map? If we do this aggressively, then it will be quickly unset. If we wait too long then we might lose some of the benefits of this approach.
First, we recognise that read only segments will change the autovacuum calculation - we simply exclude them entirely. This will mean that VACUUMs are run more regularly than they are now, but when they do run they will be quicker. (Note that currently, VACUUMs on a growing table get further and further apart as the table grows, assuming constant write rate).
My feeling is that VACUUMs are sufficiently infrequent that we should set the visibility map aggressively after each VACUUM. If the map is quickly unset, then we wait some time before trying again. Unsetting seems to be relatively low contention (see below), so that seems acceptable.
So setting and unsetting of the visibility map makes this partitioning approach "volatile" and would not be able to be relied upon to provide performance if the update pattern were inconsistent.
We would then re-scan newly marked segments to derive min/max values for implicit constraints, performed by autovacuum worker processes. The best way seems to be that an ANALYZE command would automatically detect that a segment has been recently set EFFECTIVE_READ_ONLY and perform an all-rows scan rather than just a sample. (We would keep min/max values only for datatypes with valid btree sort operators, just as ANALYZE already does).
A later VACUUM will still be required to freeze the rows, though that can happen later when we hit the frozen limit, or earlier if we run an explicit VACUUM FREEZE. We would need to store the oldest xid for each segment, so we know when to kick off a FREEZE on that part of the table.
This departs completely from the idea that the visibility map and the freespace map are related somehow. However, there is a change required in the FSM to ensure the proposed technique is stable and useful: If we scan a segment and it is 100% visible, if there is freespace in just one of the blocks in the segment then we will soon find that our visibility-bit will be set to off again very quickly.
If there is more than 5% freespace total in the segment then we will not set EFFECTIVE_READ_ONLY, nor report blocks in a segment to the FSM. That way, small amounts of freespace won't destroy the benefits of segment exclusion. VACUUM currently overwrites FSM information, though if this was not the case then we would have to actively remove it for the newly read only segments. Perhaps that limit would be (100 - fillfactor)%, which is normally set according to a user's expectation of the number of updates likely on a table.
Storing the visibility map
The visibility map is small enough to be cached on each backend, but it could still be challenging to store it on pg_class if we have more than about 20 segments.
There are three items that need to be stored
- the visibility map - segment status flags
- relfrozenxids for each segment
- segment boundary values
- segment aggregates (e.g. number of tuples in that segment)
None of these items will be stored in pg_class, to avoid bloat.
The visibility map needs to store these states for each segment
- 0x01 READ_WRITE_ALLOWED or EFFECTIVE_READ_ONLY
- 0x02 PENDING_READ_ONLY
- 0x04 MARKED_READ_ONLY
- 0x08 spare status flag
So we need to store 4 bits per segment
The relfrozenxids need 4 bytes per segment. We will also store relfrozenxid for each table on pg_class, which is the minimum value of any segment. This is mainly examined by autovacuum to see when to initiate vac freezes.
Segment boundary values would be stored within the stats tables, as part of the histogram values. This would allow them to be used as histograms for use in planning. (XXX needs more thought as to storage - we want the planner to use the values as stats, but we also want to make it easy to access them, plus we don't want ANALYZE to overwrite them)
Segment aggregates could be stored also. reltuples would require an array of integers.
Unsetting the visibility map
For EFFECTIVE_READ_ONLY, INSERTs, UPDATEs and DELETEs are still possible. For MARKED_READ_ONLY it would be explicitly refused.
The relcache invalidation caused by making a segment EFFECTIVE_READ_ONLY can flush the rd_targBlock for the relation in each backend. Other INSERTs are not possible since all current code-paths either use extension or the FSM when the table is non-empty, and neither of those will cause a problem. Extension only touches last segment, which will never by EFFECTIVE_READ_ONLY. The FSM contains no entries for an EFFECTIVE_READ_ONLY segment.
UPDATEs or DELETEs are possible on EFFECTIVE_READ_ONLY segments, but we don't expect it often. If this occurs, we will simply reset the visibility map for the table. This can be handled with a non-transactional overwrite - the bit always exists already, so the overwritten data is always same size. That's pessimistic, since it resets the state even if the UPDATE/DELETE aborts, but its better than holding the row locked until the transaction completes, which might prevent other things from happening. We probably also want VACUUM to clear up any aborted transactions anyway.
The user may also wish to clear down very old data, so allowing DELETEs can ensure the user can still remove old data from the table. By carefully choosing the values to be deleted, a whole segment can be deleted and then returned to the FSM.
Running a huge DELETE will likely perform a SeqScan that avoids all but the deleting data. The following VACUUM will also ignore the other segments, so removing older data seems fairly well optimised already, in comparison with now. It would be possible to optimise this to perform a segment-level truncate, but since the partitioning will already improve the DELETE and VACUUM, I'm not proposing that yet, if ever.
SHARE locks currently write to data blocks, but since they represent transient state we do not need to unset either the partitioning or the visibility map.
It would be possible and probably desirable to have a segment-aware CLUSTER. That would require us to use another flag in the visibility map to indicate clustered, unclustered. It also requires us to rebuild sections of indexes that refer to the clusters, which will require full table locks. An on-line version of CLUSTER would be required to make this work sensibly.
Overlap with Visibility Map Proposal
This proposal offers many of the advantages of the earlier Visibility Map proposal, yet without major changes to heap structure. Since the segment-level visibility map is more granular it would only be 80% as effective as the more detailed block-level map. Having said that, the bookkeeping overheads will also be considerably reduced, so it does seem a good joint approach. It also allows freezing to be handled fully, which was a problem with the original visibility map proposal. WAL logging visibility map changes is also now much easier.
My thoughts have been targeted directly at partitioning, yet I have to admit that this idea overlaps, and in my world view, replaces the Visibility Map proposal. I very much like the name Visibility Map though. I've re-read the earlier thread and agree with Jeff Davis that we *could* have multiple kinds of visibility map, but also agree with Heikki that it seems like too much work. It was never my intent to address both challenges in one proposal and this is just pure coincidence, or possibly just luck, depending upon how you like the proposal.
If we do need to differentiate between the two proposals, we can refer to this one as the Segment Visibility Map (SVM).
Index-only scans would look at the Visibility Map, just as they would have done in the earlier proposal.
It would be costly for an index scan to repeatedly check the visibility map, but probably worth it to do so.
We can handle select count(*) by scanning the non-100% visible segments of a table, then adding the stored counts for each segment to get a final total. Not sure if its really worth doing, but it does sound like an added bonus.
There would be additional complexity in selectivity estimation and plan costing. The above proposal allows dynamic segment exclusion, which cannot be assessed at planning time anyway, so suggestions welcome...
I think it would be prudent to have an override option on VACUUM to allow us to scan a table in full, checking rather than using the visibility map. This mode would be called VACUUM CHECK.
None of the above blocks earlier proposals for read only tables, and the two features could work together very straightforwardly.
Comparison with other Partitioning approaches
Once I realised this was possible in fairly automatic way, I've tried hard to keep away from manual overrides, commands and new DDL.
Declarative partitioning is a big overhead, though worth it for large databases. No overhead is *much* better though.
This approach to partitioning solves the following challenges
- allows automated table extension, so works automatically with Slony
- responds dynamically to changing data
- allows stable functions, nested loop joins and parametrised queries
- allows RI via SHARE locks
- avoids the need for operator push-down through Append nodes
- allows unique indexes
- allows both global indexes (because we only have one table)
- allows advanced planning/execution using read-only/visible data
- works naturally with synchronous scans and buffer recycling
All of the above are going to take considerably longer to do in any of the other ways I've come up with so far...
And yes, this is a different conclusion to earlier discussions, particularly those in 2005 where I was still thinking in terms of declarative partitioning.
Testing whether this would work for your app
It's probably worth examining existing applications to see how well they would migrate to segmented tables approach. The following query will analyse one column of a table to produce a list of boundary values, given a segment size of 131072 blocks (1 GB).
select substr(ctid::text,2,strpos(ctid::text,',')-2)::integer/131072 as seg, min(PrimaryKey), max(PrimaryKey) from bigtable group by seg;
We should be able to see whether this works for existing use cases, or not fairly easily.
You can change the segment size, but I would expect the practical limit of the number of partitions to be 10-20,000 with a maximum realistic size of 2-5,000 partitions. So change segment size, but not too small on large tables.
Note that the segment size can be changed, but it will take a complete VACUUM of the table to reset the implicit constraints. (We could optimize that also, but don't hold your breath).
General objective is to allow us to manage data more easily above 10TB
This technique would be useful for any table with historical data keyed by date or timestamp. It would also be useful for data where a time-of-insert component is implicit, such as many major entity tables where the object ids are assigned by a sequence. e.g. an Orders table with an OrderId as PK. Once all orders placed in a period have been shipped/resolved/closed then the segments will be marked read-only.
Its not really going to change the code path much for small tables, yet seems likely to work reasonably well for large tables of all shapes and sizes. If a segment is being updated, we leave it alone, and maybe never actually set the visibility map at all. So overall, this idea seems to cover the main use case well, yet with only minimal impact on the existing use cases we support.