This page summaries Dim ideas about how to distribute PostgreSQL.
To have something usable what we need is to reuse existing concepts and ideas and mix and match them in order to solve our problem.
The need we're trying to find a solution for is to be able to use more than one physical machine for a single PostgreSQL cluster.
Two problems are to get solved: we want our data either to get spread on more than one machine disks (think low cost storage facility), or we want local copies in order to be able to take down any node.
The proposal mainline is to work with the concept of tablespace: it's a known concept tied to physical storage location. Now to steer this to remote places.
Building atop 8.4 Foreign Data Wrappers, we could setup remote tablespace. Maybe we'll need a new API for this, the main idea is to have data accesses (read and writes) happen in a remote PostgreSQL server where we're granted the necessary permissions.
The remote place would have to be "remote tablespace" aware and accept our requests, by having the necessary foreign data wrapper module?
The specific FDW module will be responsible of having a PostgreSQL server able to use the same API wrt tablespace either locally or remotely.
A#= CREATE TABLESPACE local_a; B#= CREATE FOREIGN TABLESPACE bar SERVER A OPTIONS(tablespace 'local_a');
Now every access (read, write, table creation or expansion, etc) that happen at server B has to get done by server A, which could also use the same tablespace locally. Security would be achieved as usual, B users could e.g. only obtain reading permissions, or full access whereas local A users would only get reading perms...
Physical accesses are to get done only by tablespace-local backends, via some network level API or protocol to get implemented in the specific Foreign Data Wrapper.
Now that we're able to store relations on a remote PostgreSQL server, and to transparently get back data from there, we want two important features :
- a local mirror, for performances,
- a single relation in more than one tablespace, for storing huge data amounts with commodity hardware.
An idea to solve both issues is to create a tablespace volume, which is a tablespace shell using one or more "real" tablespace beneath (demo syntax):
CREATE TABLESPACE VOLUME foo_mirror MIRRORING bar;
Now a single relation could reside on more than one location and the planner would have to know about this and choose the "nearest" one. Stats are to get collected about data access performance metrics so that the data path is now a part of the cost estimations.
- Should the cost evaluation model be dynamic enough to account for load effects? IT'd be great if a degraded raid array have an influence here.
- There should be a way to tell that the mirror is down and continue processing without it: it's only an optimisation path.
- More than one mirror of the same tablespace is possible, thinking about how each PostgreSQL cluster is aware of its local tablespaces copies (network topology) might become necessary.
- The topology is necessary for High Availability: rather than an optimisation path, the mirrors allow to shut down one cluster and still answer queries.
The other use case is (demo syntax):
CREATE TABLESPACE VOLUME foo_spreaded SPREADED ON bar, baz, buz;
And now a single relation could have parts of its data not at the same location than others, so we need to maintain an hash system in order to know where to find the data. The hash function could be user provided (see plproxy).
Very useful to have say a 30TB table stored on ten 1TB systems, and having the user know where are his data.
When you have several 1TB (or more) remote partitions where to store your data, you'll want to be able to have mirror of partitions to be able to take any node down. And you'll want any single cluster to be the backup (mirror) of another one, either for HA or LB (local access performances).
So we'll have to be able to make up spreaded tablespace of mirror tablespaces, e.g.
Dump & Restore support
Is it necessary to be able to dump & restore the network tablespaces topology?
Seems not, as we will restore on a single cluster at any time, and will either need the other nodes in place at restore time but then you could start by mirroring existing tablespaces, or need to build a new cluster from scratch, with a new topology setup.
Processing power Sharing
Now that the data happens to be on more than one cluster at any time, when you have local and remote copies of the same data, and/or a single relation spreaded on more than one machine, why have a single CPU dedicated to a given query? You want remote async IO, so you need more than one CPU per request.
Two implementations of this already exists:
TODO: explain what Markus has done in  wrt helper backends, iMessage queues etc.
TODO: find again this CVS 4.3 era link about them.