v0.1 11 Dec 2007
Frequently with large data archives there is a requirement to reduce the footprint of the data to allow longer term storage costs to be reduced.
For Insert-only data we might imagine we can reduce the size of tables by removing unused tuple header information. Although that is possible, repeated headers compress fairly well, so it seems easier to tackle the problem directly by having compressed tables.
Using a streaming library like zlib, it will be easy to read/write data files into a still-usable form but with much reduced size. Access to a compressed table only makes sense as a SeqScan. That would be handled by introducing tablespace-specific access costs, discussed below. Indexes on compressed tables would still be allowed, but would hardly ever be used.
Access would probably be via tablespace-specific storage managers. So implement mdcompress.c alongside md.c in src/backend/storage/smgr. If that implementation route was chosen, it would then allow the compression option to be made at tablespace level, so commands would be:
CREATE TABLESPACE tablespacename LOCATION 'directory' [COMPRESS];
(ALTER TABLESPACE support is going to be much more complex, so leave that alone for now)
So when we copy a read-only table to another tablespace the compression would take place without additional syntax at ALTER TABLE level. i.e. nothing new goes in tblcmds.c. Cool.
mdcompress.c seems fairly straightforward, though we would need to think about how to implement smgr_nblocks() since lseek-ing to get it won't work because the file size is smaller than the actual decompressed table. Perhaps with an info file that contains something like an index metapage where we can read the number of blocks. Ideas?
In the first pass, I would only allow compressed read-only tables. In time we might allow Inserts, Updates and Deletes though the latter two will never be very effective. So my initial approach does *not* allow writing directly to a compressed table. A couple of reasons why that isn't desirable/easy: If we write data straight to the table then any aborted loads will be written to the table, so VACUUMing the table would need to re-compress the table which sounds horrible. Plus hint status bits would need to be set *after* the data was written. (Perhaps in conjunction with a visibility map we can just forget that aspect of it). The other problem is that blocks would need to be written out of shared buffers sequentially, which unfortunately we do not guarantee. We can force that in the smgr layer by keeping track of last written blockid and then writing all cached blocks up the currently requested one, but that seems problematic. I'd say if we want more, we do that in the next release, or at least the next phase of development. So we would prevent the direct use of CREATE TABLE on a COMPRESSED tablespace, for now.
I'm open to arguments that we don't need this at all because filesystem utilities exist that do everything we need. You're experience will be good to hear about in regard to this feature.