From PostgreSQL wiki
Clustering and Replication Developers Summit pgCon 2011
Tuesday, May 17th
9:30AM to 5:15pm
University of Ottawa
Desmerais (DMS) 3105 (break out room: DMS 10161) (this is the same building the sessions are in)
Maps for campus, building.
Sponsored by NTT Open Source
9AM to 9:30AM
Seating and coffee. Please bring any last-minute agenda items to Josh Berkus at this time.
9:30AM to 10:15AM
Introductions, and status reports from Replication/Clustering Projects:
- Slony-I: Jan Wieck
- pgPoolII: Tatsuo Ishii
- Postgres-XC: Koichi Suzuki and Michael Paquier
- GridSQL/Stado: Mason Sharp
- Bucardo: Selena Deckelmann
- SkyTools: Marko Kreen
- Binary Replication: Simon Riggs
- Mammoth Replication: Alexey Klyukin
If you are at the summit representing a specific replication or clustering tool, you should prepare a 1-3 minute summary of current progress and issues. If you want to use slides, please provide slides in PDF form to Josh Berkus by Friday, May 13.
10:15AM to 10:30 AM
10:30AM to 11:00AM
Summary of Clustering API projects.
Summit attendees who have been working on core clustering features should give an update as to progress and current issues. Please present a 2-3 minute summary. Attendees who wish to use slides should provide PDF slides to Josh Berkus by Friday May 13th.
11:00AM to 12:30PM
Discussion of priorities, progress and ideas for core clustering projects and APIs.
Goal of this discussion is to modify the list of core clustering features and get commitments for hackers to work on specific features. Also, to supply discussion items for the following day's Developer Meeting
12:30PM to 1:30PM
Lunch. Box lunches will be supplied.
1:30 to 2:00
Breakout sessions scheduling session.
2:00 to 3:15
Breakout sessions part I
Summit attendees should break into affinity groups and discuss specific core features and APIs. Attendees should come back from breakout session with rough specifications and/or plans.
3:15 to 3:30
All attendees gather and deliver summary of breakout sessions.
3:30 to 3:45
3:45 to 4:45
Breakout sessions II
4:45 to 5:15
Attendees gather and deliver summaries of 2nd breakout sessions.
Conclusion of Summit.
- Josh Berkus
- Andres Freund
- Selena Deckelmann
- Kevin Grittner
- Dimitri Fontaine
- Christopher Browne
- Steve Singer
- Marko Kreen
- Jan Wieck
- Tatsuo Ishii
- Koichi Suzuki
- Kevin Grittner
- Andrew Dunstan
- Jan Wieck
- Fujii Masao
- Ahsan Hadi
- Michael Paquier
- Jehan-Guillaume (ioguix) de Rorthais
- Mason Sharp
- Pavan Deolasee
- Ghulam Abbas Butt
- Sakata Tetsuo
- Greg Smith [leaving before lunch]
- Simon Riggs
- Toshiro Kitagawa
- Nozomi Anazai
- Alexey Klyukin
List of Features
- Export snapshots to other sessions
- Postgres-XC team working on, planning to contribute next year
- needs design discussion
- Global conflict resolution and detection
- Kevin Grittner to specify
- API into the Parser / Parser as an independent module
- Alvaro Herrera & Tatsuo Ishii & pgPool team
- Start/stop archiving at runtime
- Selena to find original patch
- Function scan push-down
- Postgres-XC team & SQL/MED
- DDL trigger
- Jan Wieck (design) & Dimitri (code)
- Modification trigger into core
- Jan Wieck & Marko Kreen to discuss during breakout session
- Commit order of transactions
- Andres & Kevin to decide feasibility
- Dynamic shared memory allocation
- take off the list, not clustering-specific and concerns also Postgres
- XID feed
- really part of exportable snapshots
- SSI support for Hot Standby
- Kevin & Simon
- Commit Hooks
- Simon Riggs to Spec
- needed for updatable FDW
- Lock Database - stops new connections to DB or new transactions
- Postgres-XC only maybe, other projects may need it for dropping DB securely
- Michael to create definition => Lock database definition
Status Updates by Project
Slony: Jan Wieck
- Status of slony is improving. It is in beta for 2.1 and it has improvements all over the board.
- Steve and Jan developed a clustertest framework that allows them to test the features of slony (moving sets, changing the shape of subscriptions, etc.) while the application is under load, checking consistency etc., Performance wise slony has always performed well when used as intended but an issue with backlog has been fixed for 2.1 and will be backpatched to 2.0.
- On usability most people don't know how to use the wait for event command, Steve has worked on teaching slonik to automatically know when to wait.
PGPoolII: Toshihiro Kitagawa
- PGPool has query based replication, load balancing, automatic failover and connection pooling. It is an OSS project managed by the pgpool global development group. Website, mailing list on pgfoundry. Pgpool II, pgpool admin, pgpool HA (heartbeat scripts). Version 3 is stable and 3.1 is under development. It has had streaming replication since verison 3.0. It dispatches write queries to the primary node and read queries to either node according to the load balanced wait. If a node goes down pgpool II an send all queries to the remaining node. 3.1 beta is hoping to be released soon.
- New features in 3.1 (SQL parser was imported from postgresql 9.0), it can understand new syntax introduced in 9.0 such as 'create index with a name'. , follow master command directive can better support streaming replication. Call management facility, ability for pgadmin and PHPpgadmin to control pgpool. Can use regular expressions in white and black function lists (controls which functions pgpool thinks performs writes to the database). Ability to use syslog, ability to specify the lifetime of the system catalog cache.
- Future plans: on in memory query cache (pgpool II already has a query cache) but it is not fast and does not have automatic invalidation (GSOC project to solve this). Ability to manage multiple pgpool II instances sharing the same database cluster this will require exchanging information.
GridSQL: Mason Sharp
- GridSQL enables grid capability for dw workloads. This was an EDB project, Mason and Jim have both left EDB. Jim has forked GridSQL to a new project called Stado. He stripped out some EDB specific code.
- Working on adding PostGIS support.
- It will benefit from unlogged tables in 91.
He next wants to work on high availability and integrate with streaming replication.
Postgres-XC : Michael Paquier
- XC is a multi-master cluster based on Postgresql 9.0. Last release is 0.9.4 on March 2011. Major changes are that it is now BSD licensed so they could gather more people. Merged in code from 9.0.3 in. SQL support most of DDL, DML and SELECT single step queries (order by , limit, offset).
- Regression tests looking for possible bugs and comparing with PostgreSQL. DBT-1 benchmark test 3.6 scale factor for 5 servers.
- Next release is for June 2011 with support for session parameters, grant, revoke, group by data distribution extensions and single prepared sql/plpgsql queries. They will also add some HA features such as when a node crashes with a partially committed txn they will be able to clean up the txn.
- A configurator and a cluster manager will also be provided to make setup easier. These are written in ruby.
- Node reconnection features that allows you to reconnect to new GTM when it fails over, without losing live transactions.
Also a standby transaction manager.
- Plan is to release 1.0 in April 2012. Should include PITR ,temp tables, cross node operation optimizations and multi-step prepared statements.
- Reviews, contributions and ideas are welcome.
- License will be changed to PostgreSQL license
Bucardo (Selena Decklman)
- Bucardo is a multimaster replication system written in perl, trigger based.
- Last week Greg released version 5 (alpha code) that takes away the two master restriction (you can have any numer of masters) it uses the listen/notify payload system, and has other changes.
More work needs to be done to re-implement the sequence support so it works with many masters.
Skytools (Marko Kreen)
- They have not yet released a final version of 3.0 but Skype is using it live and they will/should release this to the public. Currently available on github.
- Skytools 2.1 has a generic queue that replication is built on top of. In 3 they have cascaded replication and to do this they implemented a generic cascaded queue.
- In version 3 they have handlers/plugins that can be done to change things (ie table partitioning). They have implemented multi-master that will merge row updates from different databases. They were not trying to solve generic multi-master but more for having fast failover between masters.
Binary Replication (Simon Riggs)
- in 9.1 synchronous replication made it into core thanks to a lot of teamwork from a large number of people.
- Something amazing happened in that after 400 posts of arguing on the configuration we ended up with a simple configuration that can be explained in less than 10 minutes (Josh).
- A lot of the efforts over the last 7 years was to be able to get synchronous replication and by having transaction controlled replication we have something beyond what the big commercial players offer.
- Repmgr is about to release the next version that has fully automated multi-node failover. They are using this in production for a few customers already.
- There are already 2 systems in production using it (one in Finland)
- Simon is thinking of turning his attention to more logical replication
- They have suspended development efforts
- It is a master slave binary replication system that is a fork of postsgresql.
- It has master slave replication, failover. It has a master that accepts writes with a custom queue (a set of files, one file per txn) when the transactions come they get put into the queue and data is then extracted and sent to a middleman process that is responsible for communicating with slaves. *This process (MCP) forwards the data to the slaves. The slaves then restore the data from this queue.
- They added system catalogs for storing replication data, these catalogs got replicated to the slaves as well. They added custom hooks into postgresql to collect this data. They do not require the nodes to be constantly online but can go offline and perform updates in batches.
- Reliability vs performance tradeoffs made early in development: The mcp process became a single point of failure. Before suspending the project they were working on getting round-robin mcp processes.
- They felt that 9.0 internal replication was pretty good and since they were maintaining a fork keeping up with the changes was a lot of work so they decided to suspend the project.
The code is BSD licensed and is available in github
- BSD license has been declared as different from PostgreSQL license. Even if content of each license is the same, the words used for each license are different, making it different (Josh, judgement decision on PostgreSQL license).
- A a consequence, cluster products should have a PostgreSQL license and not a BSD one.
Status Updates by Project
Last year a set of clustering related features was discussed at 
We have a design, there was a patch floating around by Joachim Wieland but the patch was not general enough and was not committed to 9.1. The postgresql-xc team might be able to look at this in the future but so far they have not gotten to it. The consensus in the room was that this feature is needed for parallel queries and useful for binary replication systems. Joachim's patch uses a file for storing the snapshots and this might not work for exporting the snapshot list to a standby server.
Global Deadlock Information
The conclusion of the discussion in Japan was that it was unclear if they this was needed or if a timeout based approach would be sufficient. No one in the room has worked on this since then. A few people stated that there are other ways of resolving conflicts (depending on the system/architecture)
API into the parser
This was raised in the past and there was some opposition to making the parser code more complicated than it is now. A few people felt that exposing the parser API would require doing something other than the current grammar generated parser. This could be marketed as a parser simplification. Non core projects (phppgadmin, pgpool would like this). Other non postgresql projects are grabbing/reusing our existing parser. For most applications the raw parse isn't useful anyway it will have needed to pass through the catalog access (the rewritten parse tree).
Start/stop archiving at runtime
There are some optimizations that only work when archiving is off, so implementing this needs to make sure that any transactions using the optimizations are done before archiving is on. This just needs someone to do it
Function scan-push down
Parts of these are being worked on by postgresql-xc The EDB version of postgresql can do some of these things but that code is proprietary though some parts of this did get included in postgres-xc that is now BSD licensed. A GSOC student is working on some components of this
- Nothing beyond discussion/theory has happened on this.
- The question remains where to hook this in, tcop would be logical but sometimes utility statement code does not go through here. There are also issues with cached query plans
What would this receive as input arguments for it to be useful for triggers to do things? If the code were cleaned up so everything called tcop and then outFuncs were written for everything in the parse tree.
- Jan has worked on the theory but has not made progress on implementation.
Modification trigger into core
- Skytools had tried implementing something that was function based on it was ugly
PgQueue might be suitable as a generic queue
Exporting commit order of transactions
- Work on this has been given up on
- With serializable transactions we might want to look at apparent order of execution not just order of commit
- Would be valuable for replication as it could eliminate the need for "Syncs" (Slony) or "ticks" (Londiste)
- No one in the room had looked at this. This seemed more appropriate to discuss at the postgresql developers meeting
- This is related to snapshot exporting.
- If you want to share the same transaction context in a different node you need an XID feed to accomplish this.
- Postgres-XC needs this but it might not be relevant to other projects
Parse tree of utility statement.
SELECT INTO does not create completion code.
DDL Trigger needs to offer parse tree of utility statements. But parse tree is binary. ASCII string format might be better, but would need to implement it.
Maybe just look add after data structure. Maybe have parameters on what would be output.
Might need both before and after triggers.
Simplest case is create table. Second is create partition, if its empty.
Put triggers on schema and database level.
Need to funnel everything through processutility.
Some utility statements execute recursive processutility calls. Also multi-event statements.
- todo list of statements & other things which need clean access to processutility
- backparsing of utility command
- specification for triggers, trigger parameters.
Triggers: want parameters like oid, statement, etc. to return certain things. Also want WHEN clauses.
Need need attributes for triggers in order to do DDL triggers, and OLD and NEW etc. doesnt work. Maybe new system catalog.
- 3 or 4 use cases pgpool, Syntax colouring, people trying to determine if queries are read only,
- output from the parser after rewrite (for pgpool case). Did not determine what the best output of the API is. Some set of rows specifying the type and position.
- It is not at all clear what output pgpool would want, maybe the parse tree?
- Do we need some sort of compatibility assurance for future version, they do not think that they need this and no promise of compatibility would be made