The Cloud and PostgreSQL
This page is a place to collect information on how running Postgres at a large scale -- both in organizationally (with a diverse set of applications) and in terms of infrastructure (with many servers) -- on "cloud" infrastructure. First, to better define "cloud":
- The combination of CPU/Memory/Block Device are not reliable, probably one to two orders of magnitude less so than physical hardware in a rack.
- In the case of the block device: corruption is unlikely, but availability problems are much more likely
- Provisioning a machine is sort of reliable, and probably three or four orders of magnitude easier than provisioning physical hardware in a rack.
- Immutable storage is considerably more reliable and available than block devices
- Reasoning about devices at the traditional "physical" level become much more complicated
- The Provider-Consumer relationship: a relatively small number of people will have to to run a lot of infrastructure well to be economically compelling, and certain kinds of interactions between infrastructure consumer and infrastructure provider that were more blurred before must be brought into focus.
Distilled: to run Postgres in such an environment means taking advantage of the ease of provisioning to provide availability and performance without a lot of coordination between provider and consumer.
Block Devices vs Immutable Extents
So far, elastically available block devices have been victim to availability and performance issues. This is likely because the problem of offering elastic, attachable/detachable block devices with (and maybe even without) snapshotting is quite difficult. Immutable storage is considerably more reliable and a good go-to choice for disaster recovery and long term archiving. Mature WAL-shipping abilities in PostgreSQL are a good fit here, although projects like pg_lesslog might prove enormously helpful. Another option is to strip index XLog-records and to rebuild the indices from the heap directly on the restored database.
This particular issue has a lot of implications: in particular, it draws a very clear relationship between database size and availability. A small database can be downloaded and replayed from immutable storage very quickly, a large one cannot be reconstituted nearly so quickly. This, in turn, informs decisions about capacity planning and when to engineer partitioning features into an application: the larger a partition, the worse its availability.
Rules of thumb in computing availability of a database given its size, using EC2 and S3 as examples:
- To rebuild a database from scratch in five minutes: needs to be around ~8GB or so.
- One gigabyte costs somewhere between half a minute and a minute to restore on S3.
- One terabyte of restoration will cost on the order of a half-day (12 hours) to restore.
- Compression often helps a great deal: in many cases, expect file sizes as seen on disk to be halved or more.
- Compressed data, when decompressed, can make even the ~0.2Gbit speed to S3 flood saturate sequential write speed of RAIDed EBS block devices, which have been seen achieving 0.8Gbit/s. Expectation: many devices will be made available over Ethernet, and 1Gbit/s is the theoretical maximum of what one can reasonably expect to get.
Optimizations are likely possible in a few categories:
- Enabling prefetching to shift some time-expenditure to before a loss of availability occurs.
- Decreasing the amount of actual data transferred over network or onto disk
- Not having to have the entire database materialized onto block device all at once to have availability.
- Parallel and partitioned recovery, to be able to use more network and I/O resources to perform recovery.
Provider vs. Consumer Separation of Concerns
An infrastructure provider that offers Postgres is probably going to commit to at least soft guarantees (or perhaps just guidance) about uptime and data retention/backup/loss. In general, consumers of the infrastructure provider do not want wish to correspond with the provider unless there was an extraordinary problem that required direct action and intervention. Furthermore, it can be assumed that the user wishes to do any action at any time, making some things difficult.
Here are some example pain points for a provider:
- DDL that can be issued at any time can make some tasks considerably harder. To ask a consumer not to issue DDL during a time window that occurs on a regular basis is far too much noise, and as a result, it is not clear how a logical replication solution can be offered for Postgres as a service. In general, the lack of robust logical replication (or the ability to write such a feature) is a serious problem for a number of reasons, including Postgres version upgrades.
- VACUUM FULL: Providers are often charged on the basis of low-level physical resources, but the heavy locks taken by VACUUM FULL make it impossible to ever reclaim resources. Furthermore, unnecessarily large relation heap files lengthen the time it takes to restore from immutable storage. Finally, VACUUM management is something of a touchy and blurry line between consumer and provider responsibility. Ideally, the provider would take care of most VACUUM issues in non-exceptional situations: most consumers will simply get frustrated with the service if performance degrades because of bloating under casual use conditions. It would be useful to be able to run VACUUM FULL without impacting availability.
- REINDEX: the lack of a "concurrently" option is problematic for reducing index bloat on-line. It can be worked around via CREATE INDEX CONCURRENTLY and renames, but such a feature feels best well-integrated into the DBMS.
- Enabling the use of handy PostgreSQL extensions that can succeed for pg_dump/pg_restore with non-superuser accounts.
- Any changes that need to be effected through postgresql.conf. When desirable, self-service versions that can be effected well through non-privileged user SQL statements (or perhaps SECURITY DEFINER shims installed by the provider) are preferred. Certain logging settings fall in this category.
- Providing planned switchover without loss of data: a provider is unlikely to fare well if it is ever the cause of silent loss of data when performing any internal maintenance task. Occasional, very short-lived availability blips where application processes are told a connection is dropped are acceptable. By contrast, ACKing a COMMIT that subsequently gets lost when performing, say, a switchover to perform a capacity upgrade or downgrade is unacceptable. The consumer (rightly) perceives it as a capacity change feature with a bug that has silently lost data, and left no recourse to do anything about it.
Clear examples of things that are definitely not in the provider purview:
- Creating indexes on the consumer's behalf to make things run better (note: providing visibility as to what is running slowly *is*, in practice, a provider responsibility)
- Designing a schema to run quickly enough
There are many small consumers and databases, or consumers with dynamic workload demands. Postgres is not, by itself, a very good multi-tenant system: noisy neighbors are hard to contain, and many features --- like moving individual consumer databases around via WAL replication --- do not work when there are multiple databases in one cluster. The engineering challenge in fixing those would be rather vast, so instead, lightweight containerization technologies (like LXC (Linux), FreeBSD Jails and Solaris Zones) are becoming increasingly important to provide sharing and leveraging of resources at the physical level. The most scarce system resource is memory, and an important property of the system is how quickly it can start (completely idle Postgreses can be put to sleep and wait for port activity: see systemd).
Postgres is pretty lean and quick-starting already (the 6MB code image can be shared, even with LXC), but the following could be useful:
- Periodic forking rather that continuous alive-ness of maintenance background processes by the postmaster in low-activity databases. A database that is online but has no connections and not much work to do would ideally only have the postmaster running.
- Dynamic, on-line resizing of some resources like shared memory, to more elegantly cope with rapid changes in demand on a machine with many tenants.
- The size of a fresh initdb on-disk. Some other tricks like COW that do not have to take place in Postgres may help with this, but a reasonable-practice approach should be published.
HA, switchover, and process pooling are made complicated by backend statefulness that is hard for external systems to evaluate, especially state that persists outside a transaction. Most common SQL features used by web applications, for instance, do not rely on any inter-transaction stateful functionality, but very occasionally a client will use a WITH HOLD cursor, set a global configuration setting, utilize pg_advisory_lock, use dblink, or some other feature that requires session state to be retained to get sensible behavior. Some of this statefulness might be easy to save into a table or other shared data structure so that processes can "context-switch" between states, for others it may not be possible. Regardless, these features that are typically seldom used for the bulk of OLTP queries are exacting a deep inflexibility upon what is often the most resource intensive, latency sensitive (but generally not very semantically complicated) workload on interactive queries.