Replication, Clustering, and Connection Pooling

From PostgreSQL wiki

(Difference between revisions)
Jump to: navigation, search
(Features in the Core of PostgreSQL: update Hot Standby comments)
(Features in the Core of PostgreSQL: updated other text in paragraph to reflect core teams changed stance on integrated replication solutions)
Line 11: Line 11:
  
 
==Features in the Core of PostgreSQL==
 
==Features in the Core of PostgreSQL==
The PostgreSQL core team considers replication and clustering technology outside the scope of the main project's focus. Jan Wieck works almost exclusively on Slony, so there is much high level attention in this area. While some features to help support this area are available, actually implementing a full solution is considered best implemented by a layer of software built on top of the core, rather than including it in there directly.
+
The PostgreSQL core team considered replication and clustering technology outside the scope of the main project's focus but this changed in Spring 2008, see the [http://archives.postgresql.org/pgsql-hackers/2008-05/msg00913.php Core Teams statement].
  
 
*[[Warm Standby]]/Log Shipping is a HA solution which 'replicates' a database cluster to an archive or a warm (can be brought up quickly, but not available for querying) standby server.  Overhead is very low and it's easy to set up.  This is the simplest and best solution if all you care about is continuous backup and short failover times.
 
*[[Warm Standby]]/Log Shipping is a HA solution which 'replicates' a database cluster to an archive or a warm (can be brought up quickly, but not available for querying) standby server.  Overhead is very low and it's easy to set up.  This is the simplest and best solution if all you care about is continuous backup and short failover times.

Revision as of 18:03, 31 January 2009

Contents

Introduction

There are many approaches available to scale PostgreSQL beyond running on a single server. An outline of the terminology and basic technologies involved is at High Availability and Load Balancing. There is a presentation covering some of these solutions.

There is no one-size fits all replication software. You have to understand your requirements and how various approaches fit into that. For example, here are two extremes in the replication problem space:

  • You have a few servers connected to a local network you want to always keep current for failover and load-balancing purposes. Here you would be considering solutions that are synchronous, eager, and therefore conflict-free.
  • Your users take a local copy of the database with them on laptops when they leave the office, make changes while they are away, and need to merge those with the main database when they return. Here you'd want an asynchronous, lazy replication approach, and will be forced to consider how to handle conflicts in cases where the same record has been modified both on the master server and on a local copy.

These are both database replication problems, but the best way to solve them is very different. And as you can see from these examples, replication has a lot of specific terminology that you'll have to understand to figure out what class of solution makes sense for your requirements. A great source for this background is in the Postgres-R Terms and Definitions for Database Replication. The main theoretical topic it doesn't mention is how to resolve conflict resolution in lazy replication cases like the laptop situation, which involves voting and similar schemes.

Features in the Core of PostgreSQL

The PostgreSQL core team considered replication and clustering technology outside the scope of the main project's focus but this changed in Spring 2008, see the Core Teams statement.

  • Warm Standby/Log Shipping is a HA solution which 'replicates' a database cluster to an archive or a warm (can be brought up quickly, but not available for querying) standby server. Overhead is very low and it's easy to set up. This is the simplest and best solution if all you care about is continuous backup and short failover times.
  • There's an ongoing project to integrate hot standby capabilites (read only queries on slave) into PostgreSQL...if/when complete, this would provide a replication mechanism similar to, but significantly better than mysql binary log replication, and would provide an excellent complement to slony. See Hot Standby.

Comparison matrix

Program License Maturity Replication Method Sync Connection Pooling Load Balancing Query Partitioning
PGCluster BSD See version details on site Master-Master Synchronous No Yes No
pgpool-I BSD Stable Statement-Based Middleware Synchronous Yes Yes No
pgpool-II BSD Recent release Statement-Based Middleware Synchronous Yes Yes Yes
slony-I BSD Stable Master-Slave Asynchronous No No No
Bucardo BSD Recent release Master-Master, Master-Slave Asynchronous No No No
Londiste BSD Recent release Master-Slave Asynchronous No No No
Mammoth BSD Stable Master-Slave Asynchronous No No No

Replication

Aside from Warm Standby, mentioned above...

  • Slony-I: Seems good, single master only, master is a single point of failure, no good failover system for electing a new master or having a failed master rejoin the cluster. Slave databases are mostly for safety or for parallelizing queries for performance. Suffers from O(N^2) communications (N = cluster size). with reasonable sysadmin you can implement failover system yourself. regarding communications, you can cascade the replication to reduce load on the master. If you were implementing a large replication cluster, this would probably be a good idea. Slony is powerful, trigger based, and highly configurable.
  • PGCluster: PGCluster (which, incidentally, is not the same as PGCluster-II, a shared-disk solution), which does synchronous multimaster replication. Two single-points failure spots, load balancer and the data replicator. The project has historically looked a bit dead, but they just released a new version and moved to a Trac-based web site at http://www.pgcluster.org/ and http://pgfoundry.org/projects/pgcluster is up to date (at least downloads page) One major downside to PGCluster is that it uses a modified version of PostgreSQL, and it usually lags a few releases behind.
  • http://pgpool.projects.postgresql.org/ pgpool 1/2 is a reasonable solution. it's statement level replication, which has some downsides, but is good for certain things. pgpool 2 has a neat distributed table mechanism which is interesting. You might want to be looking here if you have extremely high ratios of read to write but need to service a huge transaction volume. Supports load-balancing and replication by implementing a proxy that duplicates all updates to all slaves. It can partition data by doing this, and it can semi-intelligently route queries to the appropriate servers.
  • Bucardo: Trigger-based, asynchronous, multi-master or master-slave.
  • Cybertec, an Austrian company, offers a commercial packaging of PGCluster. They simply call it PostgreSQL Multimaster-Replication, see http://www.cybertec.at.
  • Postgres-R is still in development. It features eager and thus conflict-free, but async multi-master replication.
  • DRBD (http://www.drbd.org/), a device driver that replicates disk blocks to other nodes. This works for failover only, not for scaling reads. Easy migration of devices if combined with an NFS export.

Inactive projects

Clustering

  • Greenplum Database (formerly Bizgres MPP), commercial. Not so much a replication solution as a way to parallelize queries, and targeted at the data warehousing crowd. Similar to ExtenDB, but tightly integrated with PostgreSQL.
  • sequoia (jdbc, formerly known as c-jdbc)

Connection Pooling and Acceleration

Connection pooling programs let you reduce database-related overhead when it's the sheer number of physical connections dragging performance down. This is particularly important on Windows, where system limitations prevent large number of connections; see "I cannot run with more than about 125 connections at once" in the Windows FAQ. It's also vital for web applications where the number of connections can get very large.

Some programs that implement connection pooling are:

Some people also or alternately use memcached in various ways to reduce the work the database handles directly by caching popular data.

Credits

Sources for the initial information on this page include:

A existing page covering this topic in German is at http://burger-ag.de/postgresql_replikation.whtml It translates pretty well through Babelfish.

Sources for more information located but not yet integrated into here:

Personal tools