PgCon 2013 Developer Meeting
From PostgreSQL wiki
A meeting of the most active PostgreSQL developers is being planned for Wednesday 22nd May, 2013 near the University of Ottawa, prior to pgCon 2013. In order to keep the numbers manageable, this meeting is by invitation only. Unfortunately it is quite possible that we've overlooked important code developers during the planning of the event - if you feel you fall into this category and would like to attend, please contact Dave Page (firstname.lastname@example.org).
Please note that this year the attendee numbers have been kept low in order to keep the meeting more productive. Invitations have been sent only to developers that have been highly active on the database server over the 9.3 release cycle. We have not invited any contributors based on their contributions to related projects, or seniority in regional user groups or sponsoring companies, unlike in previous years.
This is a PostgreSQL Community event. Room and refreshments/food sponsored by EnterpriseDB. Other companies sponsored attendance for their developers.
Time & Location
The meeting will be from 8:30AM to 5PM, and will be in the "Red Experience" room at:
Novotel Ottawa 33 Nicholas Street Ottawa Ontario K1N 9M7
Food and drink will be provided throughout the day, including breakfast from 8AM.
The following people have RSVPed to the meeting (in alphabetical order, by surname):
- Josh Berkus (secretary)
- Jeff Davis
- Andrew Dunstan
- Peter Eisentraut
- Dimitri Fontaine
- Andres Freund
- Stephen Frost
- Peter Geoghegan
- Kevin Grittner
- Robert Haas
- Magnus Hagander
- KaiGai Kohei
- Alexander Korotkov
- Tom Lane
- Fujii Masao
- Noah Misch
- Bruce Momjian
- Dave Page (chair)
- Simon Riggs
Proposed Agenda Items
Please list proposed agenda items here:
- 9.4 Commitfest schedule and Commitfest tools.
- Parallel Query Execution (Bruce, Noah)
- logical changeset generation review & integration (Andres)
- utilization of upcoming non-volatile RAM device (Kaigai)
- pluggable plan/exec nodes (Kaigai)
- to offload targetlist calculation, sorting, aggregates, ...
- GIN generalization (Alexander)
- An Extensibility Roadmap (dim) (http://pgsql.tapoueh.org/temp/extensibility.pdf) (15 min)
- Representing severity - derive severity from SQLSTATE (Peter Geoghegan - see http://www.postgresql.org/message-id/CA+TgmoZEjq7va+SfDZQwk6E4emEWThENNyxfqEGhB3iuoT1OJw@mail.gmail.com) (10 min)
- Error logging infrastructure - store normalized statistics about errors in a circular buffer (Peter Geoghegan). Arguably this could be discussed alongside SQLSTATE item. (10 min)
- Failback with backup (Fujii Masao - related discussion is: http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com)
- Volume Management (Stephen Frost - wiki page will be forthcoming before the meeting)
- AXLE Project - Big data analytics for Postgres (Simon Riggs) - an overview of the feature plan, how project works and what community can expect (15 min)
- Incremental maintenance of materialized views (Kevin) - differential REFRESH and infrastructure for counting algorithm (30 min)
|08:30 - 08:45||Welcome and introductions||Dave Page|
|08:45 - 09:45||Parallel Query Execution||Bruce/Noah|
|09:45 - 10:15||Pluggable plan/exec nodes||KaiGai|
|10:15 - 10:30||Volume Management||Stephen Frost|
|10:30 - 10:45||Coffee break|
|10:45 - 11:00||Utilization of upcoming non-volatile RAM devices||KaiGai|
|11:00 - 11:30||Logical changeset generation review & integration||Andres|
|11:30 - 11:40||Representing severity||Peter G.|
|11:40 - 11:50||Error logging infrastructure||Peter G.|
|11:50 - 12:30||Incremental maintenance of materialized views||Kevin|
|12:30 - 13:30||Lunch|
|13:30 - 14:15||GIN generalization||Alexander|
|14:15 - 14:30||An Extensibility Roadmap||Dimitri|
|14:30 - 15:00||Failback with backup||Fujii|
|15:00 - 15:15||Tea break|
|15:15 - 15:45||9.4 Commitfest schedule and tools||Josh|
|15:45 - 16:45||Goals, priorities, and resources for 9.4||All|
|16:45 - 17:00||Any other business/group photo||Dave Page|
- Dave Page, EnterpriseDB
- Andres Freund, 2ndQuadrant
- Kevin Grittner, EnterpriseDB
- Dimitri Fontaine, 2ndQuadrant
- Andrew Dunstan, PostgreSQL Experts
- Noah Misch, EnterpriseDB
- Bruce Momjian, EnterpriseDB
- Fujii Masao, NTT Data
- Tom Lane, Salesforce
- Magnus Hagander, Redpill Linpro
- Robert Haas, EnterpriseDB
- Josh Berkus, PostgreSQL Experts
- Kaigai Kohei, NEC
- Jeff Davis, Teradata
- Alexander Korotkov
- Peter Geoghegan, Heroku
- Peter Eisentraut, Meetme
- Stephen Frost
Parallel Query Execution
Bruce Momjian is looking at where Postgres is and at hardware changes, and thinks it's time to look at parallelism. Unlike the Windows port and pg_upgrade, there's no clear "done" with parallelism. We're going to have to do a lot of small things rather than one big feature. There is concern about code cleanliness and stability. What is going to have to happen is that we'll attack one small thing, and build the infrastructure for parallelism.
Robert Haas is talking about EnterpriseDB's commitment to parallelism. The two things EDB wants are materialized views and parallel query. The way we're approaching this is the same way 2ndQuadrant approached logical replication for the last release cycle. We're doing this as a company, and we have buy-in from our management. So far there's a wiki page on parallel sort and Noah's posted some stuff to pgsql-hackers. The first part is to get a credible worker system in place, and then we can tackle parallelising particular things.
Stephen Frost pointed out that users are currently implementing ad-hoc parallelism in their middleware code. Bruce said that there was a basic set of steps for all parallel tasks. There's a false sense that threads automatically give you infrastructure for parallelism. Bruce doesn't think that's true. Having the worker/marshalling stuff sprinkled all over the code would be really bad, so we want central infrastructure.
Jeff Davis pointed out that there were different approaches to parallelism. One is "cluster parallelism". Do we know what approaches we're taking? Cluster parallelism involves dividing the parallel tasks according to data partitions. It's popular in data warehousing. Robert Haas doesn't expect to get that far in one release cycle.
Haas: People come up with great ideas for PostgreSQL, and they do two things: either they figure out how to do it without modifying the query planner, or they fail. So we looked at index building, which wouldn't require dealing with the query planner. But for the general problem of parallel query planning, we have to solve harder problems. I don't want to get bogged down in those sorts of questions at the outset, because there's a bunch of stuff to get done to execute parallel jobs in general.
Josh Berkus suggested implementing a framework for parallel function execution, because then users could implement parallel code for themselves. It would help the Geo folks. Noah thinks this is possible today, but wasn't specific about how. Tom argued against exposing it to users in early iterations because the API will change.
There are a few things you need:
- an efficient way for passing data to the parallel backends, probably using a shared memory facility, because sockets are too slow.
- some logic for starting and stopping worker processes. Custom background workers aren't quite what we need for this. Also different from Autovacuum, which is a bit kludgy.
- you need to be able to do stuff in the worker processes as if they were the parent process. They need to share the parent worker's state, and there are a lot of state things which are not shared. If the master takes new snapshots or acquires extra XIDs, not sure how to share that. Some things will need to be prohibited in parallel mode. Threads don't solve this. Syscache lookups are also a problem, but we need them.
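The first requirement above, handing data to parallel backends through shared memory rather than a socket, can be sketched as follows. This is only an illustration using Python's standard `multiprocessing.shared_memory` module as a stand-in; it is not the facility discussed at the meeting, and `publish_rows`, `worker_sum` and the one-integer row layout are assumptions made up for the example.

```python
import struct
from multiprocessing import shared_memory

ROW_FORMAT = "q"  # assumption: each "row" is one signed 64-bit integer
ROW_SIZE = struct.calcsize(ROW_FORMAT)


def publish_rows(rows):
    """Parent side: copy the rows into a newly created shared memory segment."""
    size = max(1, len(rows) * ROW_SIZE)  # a segment must be at least 1 byte
    segment = shared_memory.SharedMemory(create=True, size=size)
    for i, value in enumerate(rows):
        struct.pack_into(ROW_FORMAT, segment.buf, i * ROW_SIZE, value)
    return segment


def worker_sum(segment_name, row_count):
    """Worker side: attach to the segment by name and read the rows in place."""
    segment = shared_memory.SharedMemory(name=segment_name)
    try:
        total = 0
        for i in range(row_count):
            (value,) = struct.unpack_from(ROW_FORMAT, segment.buf, i * ROW_SIZE)
            total += value
        return total
    finally:
        segment.close()  # detach only; the parent owns the segment


def demo():
    rows = [10, 20, 30, 40]
    segment = publish_rows(rows)
    try:
        # Called in-process to keep the sketch short. A real worker would be
        # a separate process that receives only the name and the row count.
        return worker_sum(segment.name, len(rows))
    finally:
        segment.close()
        segment.unlink()  # the creator frees the segment
```

Only the segment name and the row count cross the process boundary; the rows themselves are never copied through a pipe or socket. Sharing backend state such as snapshots and XIDs, the third requirement, is a separate problem that this sketch does not address.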
Noah wants to target parallel sort, specifically parallel in-memory sort. This hits a lot of the areas we need to tackle to make parallelism work in general. We need a cost model as well. How are we going to mark the functions which are safe to run in a parallel worker? We don't want to just name functions *_parallel because that will change. Maybe there will be an internal column in pg_proc, as a short-term solution.
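The general shape of a parallel in-memory sort (split the input, sort each piece in a worker, then merge the sorted runs) can be sketched as below. This is a minimal illustration, not the design from the meeting: threads from `concurrent.futures` stand in for the worker processes under discussion, and `split_into_chunks`, `parallel_sort` and `worker_count` are hypothetical names.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_count):
    """Split items into at most chunk_count contiguous, roughly equal chunks."""
    chunk_count = max(1, min(chunk_count, len(items)))
    size, extra = divmod(len(items), chunk_count)
    chunks, start = [], 0
    for i in range(chunk_count):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parallel_sort(items, worker_count=4):
    """Sort each chunk in its own worker, then k-way merge the sorted runs.

    worker_count must be at least 1.
    """
    chunks = split_into_chunks(list(items), worker_count)
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        sorted_runs = list(pool.map(sorted, chunks))
    return list(heapq.merge(*sorted_runs))
```

In standard CPython, threads do not run Python bytecode in parallel, so this shows the structure rather than a speedup. With real worker processes, the chunks and the sorted runs would also have to be passed between processes, which is where the shared memory and worker management requirements listed above come in.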
Peter E. asked about the timeline. For 9.4, we want to at least have an index build which runs with a user-specified degree of parallelism. It needs to be reasonably fast.
Peter G. asked about having a cost model for parallelism. Right now we don't have costing for how long it takes to sort things based on the number of rows. Sorting a text column in a bad collation can be 1000X as expensive as sorting integers, for example. We might pick a single operator and make that the cost reference operator. Perfect costing isn't possible, but we can do some approximations. The initial operations we choose for parallelism will be very long operations, because startup costs are too high otherwise. We're not going to parallelize something that takes 200ms, but rather something that takes 10s, a minute, or a couple of minutes.
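The trade-off between worker startup cost and work saved can be made concrete with a back-of-the-envelope model. Everything here is an assumption for illustration: the n log n comparison count, the single shared startup delay, the merge term, and the per-comparison costs are invented numbers, not measurements or an agreed costing scheme.

```python
import math


def estimated_sort_seconds(row_count, seconds_per_comparison):
    """Serial estimate: roughly n * log2(n) comparisons."""
    if row_count < 2:
        return 0.0
    return row_count * math.log2(row_count) * seconds_per_comparison


def estimated_parallel_seconds(row_count, seconds_per_comparison,
                               workers, startup_seconds):
    """Parallel estimate: start workers, sort n/workers rows each, then merge.

    Assumes the workers start concurrently (one startup delay in total) and
    that merging costs about n * log2(workers) comparisons.
    """
    if workers < 2:
        return estimated_sort_seconds(row_count, seconds_per_comparison)
    per_worker = estimated_sort_seconds(math.ceil(row_count / workers),
                                        seconds_per_comparison)
    merge = row_count * math.log2(workers) * seconds_per_comparison
    return startup_seconds + per_worker + merge


def should_parallelize(row_count, seconds_per_comparison,
                       workers, startup_seconds):
    """True when the parallel estimate beats the serial estimate."""
    serial = estimated_sort_seconds(row_count, seconds_per_comparison)
    parallel = estimated_parallel_seconds(row_count, seconds_per_comparison,
                                          workers, startup_seconds)
    return parallel < serial
```

With a hypothetical 10 ns integer comparison, sorting one million rows is estimated at roughly 0.2 s, so a one-second worker startup makes parallelism a loss. Raising the comparison cost 1000X for the same row count, as with an expensive collation, flips the decision, which is why the per-operator cost matters as much as the row count.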
Haas thinks that a lot of people will be appalled at the cost of starting up a parallel worker. That can be optimized later. It's OK for the initial version to be unoptimized. Even if it takes a full second to start up a new backend, there are sorting tasks which take large numbers of seconds. Those are existing issues which we'll hammer on as we get into this space; we may improve the speed of starting up a new connection in the process.
Josh pointed out that if it takes an hour to build an index, it's probably an external sort. Noah posted a patch to allow larger internal sorts, over 1GB. Andrew pointed out that a process model would tie us to certain large operations. Threads would add a lot of overhead to everything, though. We'd have to rewrite palloc. Haas thinks we can get the minimum unit down to something fairly small. Andrew pointed out that on Windows process creation is very expensive. Haas doesn't want to rewrite the entire internal infrastructure.
With threads, everything is shared by default; with processes, everything is unshared by default. The process model with explicit sharing is a shorter path from where we are currently. Kevin said that parallelism helps with CPU-bound processes, but not I/O. Josh argued with Kevin that there are some types of storage where this isn't true. Kevin just pointed out that if the resource you're using the most of isn't bottlenecked, then it's not helpful to parallelize. Haas pointed out that parallelizing a seq scan on a single rotating disk won't help, as opposed to parallelizing a scan from memory, which would be much faster. Our cost model isn't up to this; we might even have to have a recheck model where the executor notices things are slow and switches approaches.
Bruce pointed out how Informix switched to threads between 5 and 7 and it killed the database. Parallelism will take Postgres into new markets.
Andrew pointed out that prefork backends will help us form new connections if we can get it to work. Haas pointed out that we're going to have to cut nonessential issues to avoid taking forever.
Pluggable plan/exec nodes
Kaigai is working on GPU execution. When he worked on writable FDWs, he proposed a pseudo-column approach in which a foreign scan node returns an already-computed value, but that was rejected because the scan plan needs to return the data structure as defined. So Kaigai wants to add an API to add a plan node to the execution node, allowing the executor to run extension code during query execution. When the plan tree scans a large table with a sequential scan and the target list has a complex calculation, we can have a pseudo-column which does this calculation on a GPU.
Kaigai is talking about both the planner and the executor. Haas doesn't understand how we would have pluggable planner nodes, as opposed to executor nodes. How would you allow it to generate completely new types of plan nodes? We can replace existing plan nodes, but new types of nodes would require a new extensibility infrastructure. To do this, we need two new infrastructures, one to inject plan nodes and one to inject executor nodes. But what Kaigai is mainly focused on is replacing existing scan and sort nodes. He hasn't yet investigated the difficulty of planner extension.
Peter E. pointed out that 65% of the work will be the ability to add new nodes at all. Replacement will be MUCH easier. However, the ability to add new nodes would be very useful to PostgreSQL in general. Tom thinks that it could be done. Haas pointed out that we have a lot of weird optimizations about what plan node connects to which other plan node. Tom doesn't think that we have that many. Noah says we'll probably use a hook.
For a new executor node we have a dispatch table, so it's easy. Plan nodes could use a jump table too. Right now we have function volatility markers; for nodes we'll need the same thing. But that's a problem only for expression nodes.
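The dispatch-table idea can be sketched as a registry keyed by node tag, which an extension extends by registering its own entry. This is an illustrative toy, not PostgreSQL's executor code: the dict-based plan nodes, `register_exec_node`, `exec_node` and the "GpuScan" tag are all hypothetical, and the "GPU" node simply evaluates its expression on the CPU.

```python
from typing import Callable, Dict, Iterator

PlanNode = dict  # assumption: a plan node is a dict with a "tag" key
ExecFunc = Callable[[PlanNode], Iterator[tuple]]

# Dispatch table: node tag -> function that executes that kind of node.
EXEC_DISPATCH: Dict[str, ExecFunc] = {}


def register_exec_node(tag: str, func: ExecFunc) -> None:
    """Add an executor routine for a new node tag (built-in or extension)."""
    if tag in EXEC_DISPATCH:
        raise ValueError(f"executor node {tag!r} is already registered")
    EXEC_DISPATCH[tag] = func


def exec_node(node: PlanNode) -> Iterator[tuple]:
    """Look up the node's tag in the dispatch table and run its routine."""
    tag = node["tag"]
    func = EXEC_DISPATCH.get(tag)
    if func is None:
        raise ValueError(f"no executor routine for node {tag!r}")
    return func(node)


def exec_seq_scan(node: PlanNode) -> Iterator[tuple]:
    yield from node["rows"]


def exec_sort(node: PlanNode) -> Iterator[tuple]:
    yield from sorted(exec_node(node["child"]))


register_exec_node("SeqScan", exec_seq_scan)
register_exec_node("Sort", exec_sort)


# What an extension might add: a scan that appends a computed column.
def exec_gpu_scan(node: PlanNode) -> Iterator[tuple]:
    for row in node["rows"]:
        yield row + (node["expr"](row),)  # stand-in for offloaded work


register_exec_node("GpuScan", exec_gpu_scan)
```

A plan such as `{"tag": "Sort", "child": {"tag": "GpuScan", ...}}` then runs without the built-in `Sort` routine knowing anything about the extension node. The harder part raised in the discussion, getting the planner to generate such nodes in the first place, is not covered by a table like this.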
This was discussed in the cluster meeting. Postgres-XC wanted pluggable nodes for cluster scan, as do some other people. So a general pluggability infrastructure would be good. If we have pluggable scan nodes, we can plug in cluster scan as well as GPU scan.
Jeff Davis pointed out that range join could be our first pluggable node. Haas pointed out that opclass support requirements might make it difficult; there are easier cases. Range join might need to be hardcoded. Pluggable planner stuff is hard.
This would also maybe get people who fork Postgres to stay closer to the core project and implement extensions instead of having an incompatible fork which then doesn't work with others.
Volume Management
Right now we have tablespaces. Having some more automation around using them would help. Like we want the indexes on a separate tablespace from the heap; there ought to be automation for this. Somebody hacked up something like this ... maybe Depesz, in 2007.
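As a rough illustration of the kind of automation meant here, the sketch below generates `ALTER INDEX ... SET TABLESPACE` statements for indexes that are not yet on a designated index tablespace. The helper names, the input format and the tablespace names are assumptions for the example; it ignores schema qualification and only produces SQL text rather than running it.

```python
def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def index_placement_statements(indexes, index_tablespace):
    """Build statements that move misplaced indexes to index_tablespace.

    indexes is an iterable of (index_name, current_tablespace) pairs, for
    example as read from the system catalogs by some other code.
    """
    statements = []
    for index_name, current_tablespace in indexes:
        if current_tablespace != index_tablespace:
            statements.append(
                f"ALTER INDEX {quote_ident(index_name)} "
                f"SET TABLESPACE {quote_ident(index_tablespace)};"
            )
    return statements
```

For example, given `[("orders_pkey", "pg_default"), ("users_idx", "fast_ssd")]` and a target of `"fast_ssd"`, only `orders_pkey` gets a statement. A single fixed rule like this is the simple case; it does not cover placement policies that vary by workload.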
Haas asked if having indexes on a separate volume was actually faster. Frost asserted that it was. Josh brought up that with new small fast storage there are reasons to want stuff to move around again. Also, index-only scans: if I only need one column, then I can do index-only scans, so I want to put the index on faster storage. Josh pointed out that putting indexes on separate storage worked back at OSDL.
Stephen Frost pointed out that they have pairs of drives, with a whole lot of pairs. Stephen asked whether or not we'll ever have things like Oracle Volumes. Kevin said that that configuration works on raw devices, but not so much on complex filesystems. Frost says that for specific workloads, it really works to parallelize everything for massive joins.
Several people asserted that modern RAID is fairly efficient. Josh asked if any default automated rules would work for a general class.
Frost explained Oracle Volumes. They can be used to go to raw drives. Volumes are disks or drives or files. You can have multiple volumes under a single tablespace, and Oracle will manage them together. Do we want to do that? Maybe we should just use LVM.
There are also some other things we could do with volumes, like compressed volumes. Noah has seen tablespaces abused 5 times as often as used properly. We should be careful that what we're adding is really useful. People want things like encrypted & compressed tablespaces. Every time something like this comes up, Tom or someone says "use a filesystem which provides that." There are some advantages to having the database do that, but it takes a lot of development effort.
Noah suggested that event triggers would do this. Frost says that they already have code; they want to simplify it. Josh points out that there aren't simple rules for this; most DWs don't have rules which are as simple as "indexes go here, tables go there". A lot of this is misplaced Oracle knowledge. Josh brought up the backup issue again.