- 1 pgCon 2015 Cluster Hacker Summit
- 1.1 Attendee RSVPs
- 1.2 Suggested Sessions
- 1.2.1 Pgpool-II: toward next major version 3.5
- 1.2.2 Horizontal Scalability and Sharding
- 1.2.3 Slony
- 1.2.4 Bi-Directional Replication & Logical Decoding
- 1.2.5 pg_shard v2.0 and Lessons Learned from NoSQL Databases
- 1.2.6 FDW Enhancements
pgCon 2015 Cluster Hacker Summit
This year's Cluster Hacker Summit will be part of the PgCon_2015_Developer_Unconference. As such, this page will be used to coordinate sessions to propose for the Unconference, and eventually to list an agenda for the Clustering Track.
The Cluster Summit covers both general PostgreSQL clustering, as well as PostgresXC and PostgresXL development.
As it is part of the Developer Unconference, Clustering sessions will take place starting in the afternoon of Tuesday, June 16 through 5pm on Wednesday, June 17. If you are participating, and will not be able to make it on Tuesday, please note that in your attendance comments.
Attendee RSVPs
- Josh Berkus (both days)
- Koichi Suzuki
- Tatsuo Ishii
- Yugo Nagata
- Steve Singer (arrive Tuesday mid-afternoon)
- Jan Wieck (arrive Tuesday evening)
- Shigeru Hanada
- Ahsan Hadi
- Ashutosh Bapat
- Bruce Momjian
- Etsuro Fujita
- Tetsuo Sakata
- Amit Langote
- Kyotaro Horiguchi
- Ozgun Erdogan (Wednesday)
- Marco Slot (Wednesday)
- Simon Riggs
Suggested Sessions
For each session below, please provide a title and a moderator/leader/speaker for the session.
Pgpool-II: toward next major version 3.5
First, we will report on project progress since last year's Cluster Summit, including the release of pgpool-II 3.4. Then we will explain the current status of pgpool-II 3.5, which is under development.
Session Leader: Tatsuo Ishii
- "pgpool-II Development Status Updates" (Yugo Nagata)
- "pgpool-II 3.5 How it will look like?" (Tatsuo Ishii)
Horizontal Scalability and Sharding
Session Leaders: Ahsan Hadi, Ashutosh Bapat
Slony
What is the future for Slony development? Are users interested in a new Slony based on Logical Decoding? Who's going to work on this?
Session Leaders: Steve Singer, Chris Browne, Jan Wieck
Note-taker: Josh Berkus
Over the past year, Slony development has been fairly stagnant. Steve has worked on a prototype based on Logical Decoding. It works as a demo, but not all features work, and performance was not all that impressive; part of that is due to how much the Slony log has already been optimized. Logical Slony has somewhat lower write overhead, but the latency of reassembling transactions on the other side is not insignificant.
Slony's stability and feature set are OK, but will it survive alongside modern Postgres features? With some of those features things will be different. Slony does some things which other tools don't.
Chris: only takes a few hours of work per year to keep up with Postgres releases. But we've been forgetting how to do releases and management.
Jan: mentioned limitations of UDR/BDR. But most users are on older versions. Even if 9.6 has everything built in, people will still be using older versions for 5 years. Slony was built to allow upgrading. If it doesn't make sense to maintain them then we won't.
Slony is still useful for upgrading across architectures and character encodings. Josh mentioned the early stage of development of UDR. Steve said that it would take more than a year to get Slony working with Logical Decoding and to make it as stable as trigger-based replication.
Chris asked what the issues were with LD and Slony development. Jan pointed out that for partial replication, where you're replicating only a few low-volume tables, saving all the WAL for the replication slot is a big loss. So LD and the classic Slony log table need to coexist.
Josh said there are three reasons why he wants LD: (1) easier installation, (2) bloat in the queue tables caused by long-running transactions, and (3) removal overhead.
Regarding bloat, Jan mentioned Kevin Grittner's snapshot-too-old work. Josh said that work is partly because of Slony. You can't even truncate segments if there's a long-running transaction. Chris mentioned the idea of per-table logs.
Moving the reporting queries onto the replica can fix this issue, but it just moves the bloat around. More discussion around this. Doing this against a non-forwarding replica would prevent bloat.
Feature wishlist:
- Parallel initial data copy
- Call Slony as a library (mostly Python)
- Add tables to multiple versions of databases
- No table locks
- More visibility commands via Slonik, once it's a library
- Cancel subscribe set in progress
Libraries: one way to get around the multiple-language issue is to expose the function API. But according to Jan, the function API is not complete. Steve would rather have Slonik as a shared library; calling the Slony stored procedures directly isn't safe. Dave Cramer asked, if we do a Python library, how are we going to call it from Java? We'll create a C shared library.
Valentin suggested exposing the shared libraries via stored procedures. Chris pointed out that the lack of autonomous transactions prevents this. But if the library is wrapped in functions, it can be exposed. Or create some kind of REST API. Jan said that they're planning on supporting pgAdmin4, so an abstracted library would support this. This is a good idea. Some people use GUIs, but some don't.
Steve asked if people are using it for DR. Josh mentioned one multi-region case. Valentin said more availability than DR.
Jan said that more visibility functions for Slonik is a must once it's a library. And it would make the API more stable because you wouldn't have to use the Slony catalog.
Steve asked whether it would help if Slony could take one exclusive lock at a time. One user said no, they don't get enough downtime; but others said yes. Rod really wants an add-trigger command which takes an access exclusive lock. That's a major Postgres issue in terms of locking and tables. Jan speculated about how that would be possible. Valentin tries to resolve this by setting statement_timeout and having the lock attempt fail. The Slony team has some idea of how to do this; it's similar to concurrent index creation. Jan will look at doing CREATE TRIGGER CONCURRENTLY. It was suggested that there should be a general lock state where you ask for a lock concurrently, and that this should be a function call. Kevin suggested that we should call this "deferrable" rather than "concurrent", with timeouts.
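The statement_timeout workaround described above amounts to a bounded lock attempt in a retry loop, rather than queueing indefinitely behind long-running transactions. A minimal sketch of that pattern, using a Python threading lock as a stand-in for a table lock (illustrative only; the function name and parameters are made up):

```python
import threading
import time

def acquire_deferrable(lock, attempt_timeout=2.0, max_attempts=5, pause=1.0):
    """Try to take the lock with a short timeout, backing off and retrying
    instead of queueing indefinitely behind other lock holders."""
    for _ in range(max_attempts):
        if lock.acquire(timeout=attempt_timeout):
            return True      # lock obtained; run the DDL, then release
        time.sleep(pause)    # lock busy: back off and try again
    return False             # give up without ever blocking other sessions

table_lock = threading.Lock()
print(acquire_deferrable(table_lock, attempt_timeout=0.1, max_attempts=1))  # True
```

The key property, and what "deferrable" would mean in a LOCK TABLE DEFERRABLE command, is that a failed attempt leaves other sessions completely unblocked.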
Steve appealed for people to help implement some of these things. Jan will sign up for LOCK TABLE DEFERRABLE. Josh said that he'd help test and spec the API. The biggest issue with the syntax is variadic arguments.
Chris asked if anyone was using logshipping, and someone said yes, and they're using the daemon to apply the logs.
Chris asked about DDL. A change in 2.2 allows passing in the DDL string instead of needing a file. PostgreSQL 9.5 will have DDL event replication; do we care about building on this? Chris said the easiest change would be to allow dropping a table to automatically drop it from replication. Rejecting DDL not run through EXECUTE SCRIPT would be even better. It should be an option which you can override. Josh voted for blocking DDL.
Steve's concern with blocking DDL is that people do it for legitimate reasons. Josh mentioned stupid dev tricks, such as bringing Slony down by running rake on the production cluster. Also, a library would be compatible with rake.
Do people see Slony as still being relevant in 2-3 years if BDR/UDR succeeds? Hard to be sure, since it's not mature yet.
They then set some targets:
- Parallel initial data copy: Slony 2.3+, requires Postgres 9.3+
- Call Slony as a library (mostly Python): Slony 2.3
- Add tables to multiple versions of databases: works with library
- No table locks: Postgres 9.6+
- More visibility commands via Slonik, once it's a library: Slony 2.4?
- Cancel subscribe set in progress
- Prevent DDL: Postgres 9.4, Slony 2.3
They are not likely to work on Slony + Logical Decoding, because it's bigger than all of the features above. Making it stable would take years, and it doesn't perform better. Valentin is working on LD systems and building libraries on top of LD, such as a Python library and replication to Kafka. Steve's code is on GitHub. In particular, being able to switch between LD and triggers would be really complex.
Josh mentioned the issue around WAL volume from the Slony buffers. One user requested the ability to execute a command on a group of slaves, i.e. a kind of EXECUTE SCRIPT on a group of nodes.
Bi-Directional Replication & Logical Decoding
Including DDL replication, Online Upgrade, Logical Replication, Bi Directional Replication etc
Session Leaders: Simon Riggs, Andres Freund
BDR and UDR etc.
Note-taker: Josh Berkus
This will be about development of BDR, not using BDR. Will be only 10 minutes about BDR.
BDR stands for Bi-Directional Replication; it has been put together over a couple of years and was released a couple of months ago. It is being used in production at some sites. 0.9.1 has been released, fixing bugs in 0.9, but it's not 1.0 yet. We want to get it into Postgres.
BDR allows replication to flow in two directions. It's logical replication, which permits making changes to the data as it's replicated. It translates the WAL (transaction log) stream, using Logical Decoding to extract and stream changes. It works on a commit-by-commit basis.
Each server has a sender process which talks to an apply process on another server. These are implemented as background workers, a 9.3 feature. 9.4 with the BDR plugin supports one-way replication.
Two-way replication requires handling conflicts. This requires patches to Postgres, so they have a "spoon" of Postgres (not quite a fork). Most of this is in 9.5, but a couple of things are still waiting for 9.6: the sequence AM was not included in 9.5, nor was WAL messaging (sending non-transactional WAL messages). They implemented logical replication, then built upon that to make bi-directional replication possible. They are now building a system to handle DDL so that it can be replicated. DDL is hard because it's needed in "absolute form"; the DDL deparse code is still in a module.
The other part of the system is zero-downtime upgrade. This uses UDR. It works with version 9.4.
Discussion of logical vs. binary replication. Grant asked about parallelism on the apply side of replication. Andres tested this, and it wasn't really a problem; the applier is much faster than the writes on the origin nodes.
Kevin took issue with the assertion that applying changes in commit order prevents anomalies. He brought up an example case involving batch processing where that's not true.
Discussion of features required in core started.
Open items for including features in core are:
- SEQAM - ready for commit
- WAL messages - need to discuss and have a flamewar but otherwise there.
- Metadata - where do we store metadata for the replication system? Connection information?
- Control - still use functions, or implement special-case DDL?
- DDL Replication Code
Metadata is currently stored in a mix of security labels and metadata tables. Is this a conflict with RLS? Shouldn't be, but it's a bit of a hack; it's done because it's extra data which is created and deleted with a table.
Current control operations:
- create group
- join group
Robert asked a question about synchronizing timestamps. This motivated some of the patches to 9.5.
Should we have full DDL for this, or should we have functions? Simon thinks functions are fine. Haas likes DDL because it's more self-documenting. Simon argued that there have been a lot of changes to the functions; the API has been iterating. But once it's in core, Haas said, people expect very stable APIs, so you won't be able to change them anyway. Some discussion about dump and restore followed. There are some things which can't be restored; replication slots are a good example, and Haas feels that area is kind of unfinished. There's a pretty good argument that you want to be able to restore your replication sets.
The problem of deprecating APIs (Smith) already exists. We can add more arguments. Which model do you want? As soon as you put it into syntax, it's a lot harder to change parameters etc. There's also the question of to what extent we want to keep backwards compatibility of replication stuff.
Do we need a generic concept of a supervisor worker, because people keep reinventing this concept?
What about the metadata? We want it to work even if people rename tables, etc. Does this work reasonably well for 100,000 table cases? Should work with relcache, should be fine.
Currently subscribe/group uses pg_dump, which requires passing in a connection string so that we can dump out the database. That's not the only problem with the dependency on pg_dump; dependencies on external binaries are kind of an issue. Abstracting out pg_dump has been a TODO forever. Slony needs self-connection information too. You could have BDR GUCs, but that didn't work really well, since this is different for each database.
pg_dump is used to create the initial snapshot of the data and structure. You can use pg_basebackup instead, but that copies everything; other database tools do it table-by-table. Getting sufficient administration tools into core is critical. We don't want to have five separate sets of tools like we do now, most of which are buggy. Grant says: make the APIs really good, and it's easy to build the tools on top of the APIs. He doesn't want tools which work on the base stuff. We have different sets of tools because they have different use cases.
pg_shard v2.0 and Lessons Learned from NoSQL Databases
Session Leaders: Ozgun Erdogan, Marco Slot
Ozgun explained how pg_shard is put together. There's a metadata node, which connects to a bunch of backend nodes. Each backend node contains multiple shards in one database. The shards are tables.
In pg_shard 2.0, the metadata will be fully distributed.
The metadata node tracks where shards are located. And shards can be redistributed.
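As a rough illustration of what the metadata node tracks, here is a toy hash-based shard lookup. This is a sketch only: the shard count, worker names, and table naming are invented, and pg_shard's actual metadata schema and hash function differ.

```python
# Toy model of shard metadata: each shard owns a range of hash values
# and has one or more placements (worker node + shard table name).
SHARD_COUNT = 4
HASH_SPACE = 2**32

shards = [
    {"shard_id": i,
     "min_hash": i * (HASH_SPACE // SHARD_COUNT),
     "max_hash": (i + 1) * (HASH_SPACE // SHARD_COUNT) - 1,
     "placements": [f"worker{i % 2 + 1}:events_{i}"]}
    for i in range(SHARD_COUNT)
]

def route(key):
    """Return the placements responsible for a given distribution key."""
    h = hash(key) % HASH_SPACE   # stand-in for the real hash function
    for s in shards:
        if s["min_hash"] <= h <= s["max_hash"]:
            return s["placements"]
    raise LookupError("no shard covers this hash value")
```

Redistributing a shard then means updating its placements list in the metadata, which is exactly the state the proposals below are trying to distribute safely.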
So there are a few proposals for how to distribute metadata. There are several use cases they are trying to address:
- NoSQL use-case on the eventual consistency model
- real-time analytics over log data
- SAP HANA-like use case: an ACID-compliant, scalable RDBMS.
Unlike Oracle RAC, which is shared-disk.
The proposals for distributing metadata:
- Replicate metadata to all nodes, assuming commutative writes, i.e. write order doesn't matter. So replicate change statements between all nodes. Use BDR.
- Decouple shard health from metadata. Delegate health tracking to replication groups. Could be enhanced by streaming replication; basically failover between pairs of nodes.
They explained the first proposal. When inserts go to a table, if an insert to a node fails, that node is marked invalid. Josh questioned whether or not this would ever become consistent; you would need to buffer the writes and replay them, or resync from the one healthy replica. It also requires that events can never conflict. It pretty much only supports the insert-only use case, because all writes have to be incremental. This is the AP proposal out of CAP.
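The "commutative writes" assumption can be made concrete: with insert-only events keyed by unique IDs, nodes converge to the same state regardless of apply order. A toy sketch (not pg_shard code; the event format is invented):

```python
def apply_stream(state, events):
    """Apply insert-only replication events; each event carries a unique id."""
    for event in events:
        state[event["id"]] = event["value"]   # inserts never conflict by id
    return state

a = [{"id": 1, "value": "x"}, {"id": 2, "value": "y"}]
b = [{"id": 3, "value": "z"}]

# Two nodes applying the same event streams in different orders converge.
node1 = apply_stream(apply_stream({}, a), b)
node2 = apply_stream(apply_stream({}, b), a)
assert node1 == node2
```

An UPDATE to an existing row would break this order-independence, which is why this proposal effectively supports only the insert-only use case.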
The second proposal relies on having small replication groups, which would fail over in the event of a node failure; RDS could handle that for us. Streaming replication could be used between replicas. You'd need small groups with at least 3 nodes.
Josh made a third proposal: using Raft-like semantics to share metadata and make it mostly consistent. Various issues were pointed out with this.
Alvaro suggested requiring quorum every time you do a read. Some discussion of Paxos etc. ensued.
FDW Enhancements
Session Leaders: Shigeru Hanada and Etsuro Fujita
Note-taker: Josh Berkus
Enhancements proposed for 9.5:
- Inheritance Support: Committed: foreign table can be parent or child of other tables
- Update push-down: Returned with feedback: updates against foreign tables without fetching data from the remote node.
- Join Push-down: API committed: allows joining on the remote server.
Update pushdown requires certain conditions in the Update statement. Also it requires a new FDW API, called from nodeModifyTable.
Currently joins are performed on the local server, which can be very slow. If both tables are on the external server and joins are supported, we should be able to join over there. The FDW API is committed, but the postgres_fdw changes were not committed. The major issue was "how do we construct the remote query?"
Should we use a parse tree? They would like to support this in the Oracle and MySQL FDWs as well. A general SQL deparser would be ideal for this, but we don't have one.
We also want sort push-down, but there is a problem selecting the key for the sort. Josh asked why. Shigeru explained that the FDW sees only the plan tree, and the planner generates path information for each sort key, which includes multiple candidates.
Other possible enhancements:
- sort push-down
- aggregate push-down
- more aggressive join push-down
For sort push-down, we also need to mark a Foreign Scan as sorted. But there are problems, such as limiting the sort key candidates. Do we need to introduce a FOREIGN INDEX concept? Should we have FDW catalogs? Also, how can we be sure that sorting on the remote and the local server is identical (collations, etc.)? And what about pre-sorted join results? He asked for ideas on how to implement this.
Tom suggested that if you accepted the overhead of running an EXPLAIN, you could check whether it's doing a merge join on the remote node. It might be expensive to fetch the explain plan, though.
He doesn't have a really concrete idea of how to implement aggregate push-down. Maybe they should implement a new FDW API and replace the Aggregate node with a ForeignScan. Issues include how to determine the semantics of the GROUP BY clause. Also, how do we map local functions to remote ones? There's stuff in the SQL standard for this, but it's not very well defined.
More aggressive join push-down would support doing a foreign nestloop scan. One way to do that: with small local tables, we can push materialized data across the FDW and join against it. Or we could use a temporary table or a VALUES statement. If we know that the table is replicated on the remote side, we could join against it.
Paul asked about extended types, like those from PostGIS. Geometry operators aren't allowed to pass down through postgres_fdw; he had to hack postgres_fdw in order to pass them through. Maybe when you declare a server, you could declare which extensions are installed on it, which would be checked by the FDW. Shigeru thinks this is a good idea. Right now we don't push them down because the operator might be different on the target.
Is an extension a useful unit for this? Tom and Paul think yes. Also we don't actually need to check versions. We also want to create mappings for individual functions though.
Marco asked about cstore_fdw. The FDW API requires us to read row-by-row, which kills some of the advantages of the column store. Josh asked about the COPY protocol; it would be good to be able to COPY into remote tables.
They also talked about pushing down partial aggregates instead of finished aggregates; that is, count and sum instead of AVG. That way it will work with partitioned foreign tables. Basically, we would export the transition function somehow, like a MapReduce system. There's no idea yet how to do this. Also, how would it work with non-Postgres systems?
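The count/sum-instead-of-AVG idea can be sketched as a combine-then-finalize step over per-shard partial results. This is illustrative only; as noted above, the actual FDW mechanism for exporting transition states is undecided, and all function names here are invented.

```python
def partial_avg(rows):
    """Per-shard transition state for AVG: (count, sum)."""
    return (len(rows), sum(rows))

def combine(states):
    """Merge the partial states collected from all shards."""
    total_count = sum(c for c, _ in states)
    total_sum = sum(s for _, s in states)
    return (total_count, total_sum)

def finalize_avg(state):
    """Turn the combined transition state into the final AVG value."""
    count, total = state
    return total / count

# Pushing AVG itself to each shard would give 2.0, 4.5, 6.0 -- which
# cannot be merged correctly; pushing (count, sum) can.
shard_rows = [[1, 2, 3], [4, 5], [6]]
states = [partial_avg(rows) for rows in shard_rows]
print(finalize_avg(combine(states)))   # 3.5, same as AVG over all six rows
```

The same split applies to any aggregate with a mergeable transition state, which is why exporting transition functions would make aggregates work over partitioned foreign tables.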