PgCon 2013 Developer Meeting
A meeting of the most active PostgreSQL developers is being planned for Wednesday 22nd May, 2013 near the University of Ottawa, prior to pgCon 2013. In order to keep the numbers manageable, this meeting is by invitation only. Unfortunately it is quite possible that we've overlooked important code developers during the planning of the event - if you feel you fall into this category and would like to attend, please contact Dave Page (dpage@pgadmin.org).
Please note that this year the attendee numbers have been kept low in order to keep the meeting more productive. Invitations have been sent only to developers that have been highly active on the database server over the 9.3 release cycle. We have not invited any contributors based on their contributions to related projects, or seniority in regional user groups or sponsoring companies, unlike in previous years.
This is a PostgreSQL Community event. Room and refreshments/food sponsored by EnterpriseDB. Other companies sponsored attendance for their developers.
Time & Location
The meeting will be from 8:30AM to 5PM, and will be in the "Red Experience" room at:
Novotel Ottawa, 33 Nicholas Street, Ottawa, Ontario K1N 9M7
Food and drink will be provided throughout the day, including breakfast from 8AM.
Attendees
The following people have RSVPed to the meeting (in alphabetical order, by surname):
- Josh Berkus (secretary)
- Jeff Davis
- Andrew Dunstan
- Peter Eisentraut
- Dimitri Fontaine
- Andres Freund
- Stephen Frost
- Peter Geoghegan
- Kevin Grittner
- Robert Haas
- Magnus Hagander
- KaiGai Kohei
- Alexander Korotkov
- Tom Lane
- Fujii Masao
- Noah Misch
- Bruce Momjian
- Dave Page (chair)
- Simon Riggs
Proposed Agenda Items
Please list proposed agenda items here:
- 9.4 Commitfest schedule and Commitfest tools.
- Parallel Query Execution (Bruce, Noah)
- logical changeset generation review & integration (Andres)
- utilization of upcoming non-volatile RAM device (Kaigai)
- pluggable plan/exec nodes (Kaigai)
- to offload targetlist calculation, sorting, aggregates, ...
- GIN generalization (Alexander)
- An Extensibility Roadmap (dim) (http://pgsql.tapoueh.org/temp/extensibility.pdf) (15 min)
- Representing severity - derive severity from SQLSTATE (Peter Geoghegan - see http://www.postgresql.org/message-id/CA+TgmoZEjq7va+SfDZQwk6E4emEWThENNyxfqEGhB3iuoT1OJw@mail.gmail.com) (10 min)
- Error logging infrastructure - store normalized statistics about errors in a circular buffer (Peter Geoghegan). Arguably this could be discussed alongside SQLSTATE item. (10 min)
- Failback with backup (Fujii Masao - related discussion is: http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com)
- Volume Management (Stephen Frost - wiki page will be forthcoming before the meeting)
- AXLE Project - Big data analytics for Postgres (Simon Riggs) - an overview of the feature plan, how project works and what community can expect (15 min)
- Incremental maintenance of materialized views (Kevin) - differential REFRESH and infrastructure for counting algorithm (30 min)
Agenda
| Time | Item | Presenter |
|---|---|---|
| 08:00 | Breakfast | |
| 08:30 - 08:45 | Welcome and introductions | Dave Page |
| 08:45 - 09:45 | Parallel Query Execution | Bruce/Noah |
| 09:45 - 10:15 | Pluggable plan/exec nodes | KaiGai |
| 10:15 - 10:30 | Volume Management | Stephen Frost |
| 10:30 - 10:45 | Coffee break | |
| 10:45 - 11:00 | Utilization of upcoming non-volatile RAM devices | KaiGai |
| 11:00 - 11:30 | Logical changeset generation review & integration | Andres |
| 11:30 - 11:40 | Representing severity | Peter G. |
| 11:40 - 11:50 | Error logging infrastructure | Peter G. |
| 11:50 - 12:30 | Incremental maintenance of materialized views | Kevin |
| 12:30 - 13:30 | Lunch | |
| 13:30 - 14:15 | GIN generalization | Alexander |
| 14:15 - 14:30 | An Extensibility Roadmap | Dimitri |
| 14:30 - 15:00 | Failback with backup | Fujii |
| 15:00 - 15:15 | Tea break | |
| 15:15 - 15:45 | 9.4 Commitfest schedule and tools | Josh |
| 15:45 - 16:45 | Goals, priorities, and resources for 9.4 | All |
| 16:45 - 17:00 | Any other business/group photo | Dave Page |
| 17:00 | Finish | |
Notes
Attending:
- Dave Page, EnterpriseDB
- Andres Freund, 2ndQuadrant
- Kevin Grittner, EnterpriseDB
- Dimitri Fontaine, 2ndQuadrant
- Andrew Dunstan, PostgreSQL Experts
- Noah Misch, EnterpriseDB
- Bruce Momjian, EnterpriseDB
- Fujii Masao, NTT Data
- Tom Lane, Salesforce
- Magnus Hagander, Redpill Linpro
- Robert Haas, EnterpriseDB
- Josh Berkus, PostgreSQL Experts
- Kaigai Kohei, NEC
- Jeff Davis, Teradata
- Alexander Korotkov, MEPhI
- Peter Geoghegan, Heroku
- Peter Eisentraut, MeetMe
- Stephen Frost
Parallelism
Bruce Momjian has been looking at where Postgres is and at how hardware is changing, and thinks it's time to look at parallelism. Unlike the Windows port and pg_upgrade, there's no clear "done" with parallelism. We're going to have to do a lot of small things rather than one big feature. There is concern about code cleanliness and stability. What is going to have to happen is that we'll attack one small thing at a time and build the infrastructure for parallelism.
Robert Haas is talking about EnterpriseDB's commitment to parallelism. The two things EDB wants are materialized views and parallel query. The way we're approaching this is the same way 2ndQuadrant approached logical replication over the last release cycle: we're doing this as a company, and we have buy-in from our management. So far there's a wiki page on parallel sort, and Noah has posted some material to pgsql-hackers. The first part is to get a credible worker system in place, and then we can tackle parallelizing particular things.
Stephen Frost pointed out that users are currently implementing ad-hoc parallelism in their middleware code. Bruce said that there is a basic set of steps common to all parallel tasks. There's a false sense that threads automatically give you infrastructure for parallelism; Bruce doesn't think that's true. Having the worker/marshalling stuff sprinkled all over the code would be really bad, so we want central infrastructure.
Jeff Davis pointed out that there are different approaches to parallelism. One is "cluster parallelism". Do we know which approaches we're taking? Cluster parallelism involves splitting the parallel tasks according to data partitions; it's popular in data warehousing. Robert Haas doesn't expect to get that far in one release cycle.
Haas: People come up with great ideas for PostgreSQL, and they do one of two things: either they figure out how to do it without modifying the query planner, or they fail. So we looked at index building, which wouldn't require dealing with the query planner. For the general problem of parallel query planning, we have to solve harder problems. I don't want to get bogged down in those sorts of questions at the outset, because there's a bunch of work to get done just to execute parallel jobs in general.
Josh Berkus suggested implementing a framework for parallel function execution, because then users could implement parallel code for themselves; it would help the geo folks. Noah thinks this is possible today, but wasn't specific about how. Tom argued against exposing it to users in early iterations because the API will change.
There are a few things you need (a toy sketch follows this list):
- an efficient way of passing data to the parallel backends, probably using a shared memory facility, because sockets are too slow.
- some logic for starting and stopping worker processes. Custom background workers aren't quite what we need for this, and it's also different from autovacuum, which is a bit kludgy.
- the ability to do stuff in the worker processes as if they were the parent process. They need to share the parent's state, and there is a lot of state which is not shared. If the master takes new snapshots or acquires extra XIDs, it's not clear how to share that. Some things will need to be prohibited in parallel mode. Threads don't solve this. Syscache lookups are also a problem, but we need them.
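For illustration of the first two points only, here is a toy standalone C sketch (not PostgreSQL code): the worker count, item count, and the summing task are all invented. It shows the basic shape of forking workers, handing them work, and collecting results through shared memory rather than sockets.

```c
/* Toy sketch, not PostgreSQL code: forked workers publish partial results
 * through anonymous shared memory; the parent waits and combines them.
 * NWORKERS, NITEMS and the summing task are invented for illustration. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 4
#define NITEMS   1000000

int
main(void)
{
    /* Shared buffer visible to parent and children without copying. */
    long *partial = mmap(NULL, NWORKERS * sizeof(long),
                         PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (partial == MAP_FAILED)
        return 1;

    for (int w = 0; w < NWORKERS; w++)
    {
        pid_t pid = fork();
        if (pid == 0)
        {
            /* Worker: sum its slice of the "table" and publish the result. */
            long sum = 0;
            for (int i = w; i < NITEMS; i += NWORKERS)
                sum += i;
            partial[w] = sum;
            _exit(0);
        }
    }

    /* Parent: wait for all workers, then combine the partial results. */
    while (wait(NULL) > 0)
        ;
    long total = 0;
    for (int w = 0; w < NWORKERS; w++)
        total += partial[w];
    printf("total = %ld\n", total);
    return 0;
}
```

The hard part discussed above (sharing snapshots, XIDs, and syscache state) is exactly what this toy does not have to deal with.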
Noah wants to target parallel sort, specifically parallel in-memory sort. This hits a lot of the areas we need to tackle to make parallelism work in general. We need a cost model as well. How are we going to mark the functions which are safe to run in a parallel worker? We don't want to just name functions *_parallel, because that will change. Maybe there will be an internal column in pg_proc as a short-term solution.
Peter E. asked about the timeline. For 9.4, we want to at least have an index build which uses a user-specified degree of parallelism. It needs to be reasonably fast.
Peter G. asked about having a cost model for parallelism. Right now we don't have costing for how long it takes to sort things based on the number of rows. Sorting a text column under an expensive collation can be 1000X as costly as sorting integers, for example. We might pick a single operator and make that the cost reference operator. Perfect costing isn't possible, but we can do some approximations. The initial operations we choose for parallelism will be very long ones; startup costs are too high otherwise. We're not going to parallelize something that takes 200ms, but rather something that takes ten seconds, a minute, or a couple of minutes.
Haas thinks that a lot of people will be appalled at the cost of starting up a parallel worker. That can be optimized later; it's OK for the initial version to be unoptimized. Even if it takes a full second to start up a new backend, there are sorting tasks which take many seconds. Those are existing issues which we'll hammer on as we get into this space; we may improve connection startup speed in the process.
Josh pointed out that if an index takes an hour to build, it's probably an external sort. Noah posted a patch to allow larger internal sorts, over 1GB. Andrew pointed out that a process model would tie us to certain large operations; threads would add a lot of overhead to everything, though, and we'd have to rewrite palloc. Haas thinks we can get the minimum unit down to something fairly small. Andrew pointed out that on Windows process creation is very expensive. Haas doesn't want to rewrite the entire internal infrastructure.
With threads, everything is shared by default; with processes, everything is unshared by default. The process model with explicit sharing is a shorter path from where we are currently. Parallelism helps with CPU-bound processes, but not I/O. Josh argued with Kevin that there are some types of storage where this isn't true. Kevin's point was that if the resource you're using the most of isn't bottlenecked, then it's not helpful to parallelize. Haas pointed out that parallelizing a seq scan on a single rotating disk won't help, as opposed to parallelizing a scan from memory, which would be much faster. Our cost model isn't up to this; we might even need a recheck model where the executor notices things are slow and switches approaches.
Bruce pointed out how Informix switched to threads between versions 5 and 7 and it killed the database. Parallelism will take Postgres into new markets.
Andrew pointed out that pre-forked backends would help us form new connections faster if we can get them to work. Haas pointed out that we're going to have to cut nonessential issues to avoid taking forever.
Pluggable plan/exec nodes
Kaigai is working on GPU execution. When he worked on writable FDWs, he proposed a pseudo-column approach in which a foreign scan node returns an already-computed value, but that was rejected because a scan plan needs to return the data structure matching its definition. So Kaigai wants to add an API to attach a plan node to an execution node, allowing the executor to run extension code during query execution. When a plan tree scans a large table sequentially and the target list has a complex calculation, we could have a pseudo-column which does that calculation on a GPU.
Kaigai is talking about both the planner and the executor. Haas doesn't understand how we would have pluggable planner nodes, as opposed to executor nodes. How would you allow it to generate completely new types of plan nodes? We can replace existing plan nodes, but new types of nodes would require new extensibility infrastructure. To do this, we need two new pieces of infrastructure, to inject plan nodes and executor nodes. What Kaigai is mainly focused on, though, is replacing existing scan and sort nodes; he hasn't investigated the difficulty of planner extension yet.
Peter E. pointed out that 65% of the work will be the ability to add new nodes at all; replacement will be MUCH easier. However, the ability to add new nodes would be very useful to PostgreSQL in general. Tom thinks that it could be done. Haas pointed out that we have a lot of weird optimizations about which plan node connects to which other plan node. Tom doesn't think that we have that many. Noah says we'll probably use a hook.
For a new executor node we have a dispatch table, so it's easy. Plan nodes could use a jump table too. Right now we have function volatility markers; for nodes we'll need the same thing, but that's a problem only for expression nodes.
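To make the dispatch-table idea concrete, here is a purely hypothetical C sketch. None of these type or function names exist in PostgreSQL; they are invented only to illustrate what "plugging in" an executor scan node could look like.

```c
/* Hypothetical illustration only: these names do not exist in PostgreSQL.
 * The idea is a table of callbacks the executor would invoke instead of a
 * hard-coded scan node implementation. */
typedef struct PluggableScanMethods
{
    const char *name;                                    /* e.g. "gpu_seqscan" */
    void       *(*begin_scan) (void *plan, void *estate);
    void       *(*get_next_tuple) (void *scan_state);    /* a tuple, or NULL at end */
    void        (*end_scan) (void *scan_state);
} PluggableScanMethods;

/* An extension would register its methods at load time; some (equally
 * hypothetical) planner hook would decide when to substitute the custom
 * node for a built-in sequential scan. */
extern void register_pluggable_scan(const PluggableScanMethods *methods);
```

The harder, unresolved part of the discussion is the planner side: deciding when such a node may be substituted and how to cost it.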
This was discussed in the cluster meeting. Postgres-XC wants pluggable nodes for a cluster scan, as do some other people, so a general pluggability infrastructure would be good. If we have pluggable scan nodes, we can plug in a cluster scan as well as a GPU scan.
Jeff Davis pointed out that range join could be our first pluggable node. Haas pointed out that opclass support requirements might make it difficult; there are easier cases. Range join might need to be hardcoded. Pluggable planner stuff is hard.
This might also encourage people who fork Postgres to stay closer to the core project and implement extensions, instead of maintaining an incompatible fork which then doesn't work with anything else.
Volume Management
Right now we have tablespaces. Having some more automation around using them would help. For example, if we want the indexes in a separate tablespace from the heap, there ought to be automation for that. Somebody hacked up something like this ... maybe Depesz, in 2007.
Haas asked if having indexes on a separate volume is actually faster. Frost asserted that it is. Josh brought up that with new small, fast storage there are reasons to want to move things around again. Also, index-only scans: if I only have one column, then I can do index-only scans, so I want to put the index on faster storage. Josh pointed out that separating indexes worked well back at OSDL.
Stephen Frost pointed out that they run a whole lot of pairs of drives. Stephen asked whether we'll ever have things like Oracle volumes. Kevin said that that configuration works on raw devices, but not so much on complex filesystems. Frost says that for specific workloads, it really works to parallelize everything for massive joins.
Several people asserted that modern RAID is fairly efficient. Josh asked if any default automated rules would work for a general class.
Frost explained Oracle Volumes. They can be used to go to raw drives. Volumes are disks or drives or files. You can have multiple volumes under a single tablespace, and Oracle will manage them together. Do we want to do that? Maybe we should just use LVM.
There's also some other things we could do with volumes, like compressed volumes. Noah has seen tablespaces abused 5 times as much as used properly. We should be careful that what we're adding is really useful. People want things like encrypted & compressed tablespaces. Every time something like this comes up, Tom or someone says "use a filesystem which provides that." There are some advantages to having the database do that, but there's a lot of development effort.
Noah suggested that event triggers would do this. Frost says that they already have code, they want to simplify it. Josh points out that there aren't simple rules for this; most DWs don't have rules which are as simple as "indexes go here, tables go there". A lot of this is misplaced Oracle knowledge. Josh brought up the backup issue again.
Utilization of upcoming non-volatile RAM devices
Kaigai is just presenting an idea and asking for opinions. He wants to consider the impact of these new devices and boost the performance of writing the transaction log. Once the WAL is mmapped onto the NVRAM device, a memory write can be considered completion of log writing. Non-volatile RAM will be exposed as main memory, and there are Linux patches which allow mmapping these devices. Once we have them, they dramatically reduce the overhead of writing the transaction log.
Kaigai said that this topic comes from work on RAM storage devices; they wanted to improve row-writing performance. It is also applicable to general fast SSD storage. He wants to add a fork around row writing and create extensions where we might have NVRAM and an mmapped region for the xlog.
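A minimal sketch of the mmap idea follows, assuming an NVRAM device exposed as a file at a hypothetical path such as /dev/pmem0; the segment size, function names, and record layout are invented, and msync() stands in for whatever persistence barrier the real hardware would require.

```c
/* Sketch only: map a (hypothetical) NVRAM-backed file and treat a memory
 * copy plus a persistence barrier as "the WAL record is durable". */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define WAL_SEG_SIZE (16 * 1024 * 1024)

static char *wal_base;

int
wal_nvram_open(const char *path)       /* e.g. "/dev/pmem0" (assumption) */
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    wal_base = mmap(NULL, WAL_SEG_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    close(fd);                          /* the mapping stays valid after close */
    return (wal_base == MAP_FAILED) ? -1 : 0;
}

int
wal_nvram_write(size_t offset, const void *record, size_t len)
{
    memcpy(wal_base + offset, record, len);
    /* On real NVRAM this would be a cache-line flush plus fence; msync()
     * on a page-aligned range is used here only as a portable stand-in. */
    return msync(wal_base + (offset - offset % 4096),
                 len + offset % 4096, MS_SYNC);
}
```

The attraction described above is that the "write" completes at memory speed instead of waiting on fsync of a block device.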
Josh pointed out that this dovetails with transactional memory, which allows exchanging locks for atomic updates covering multiple KB of data and multiple instructions; we might not need as much from the WAL. Others disputed that.
Kaigai wants to implement this for the write-ahead log. Then later we can look at the heap. NVRam has limited write cycles, so we don't want to use it for main storage.
Logical changeset generation review & integration
Andres gave a progress report on the current status of the changeset generation patch. It's mostly done, but it needs to be integrated and it needs substantial review. He needs someone to help who is not from 2ndQuadrant, someone independent. How can he proceed to get the pieces into core?
Haas suggested that they could swap for review on parallel query. The patch is currently about 12,000 lines. Peter G says he can take a shot at it, but he wants someone else.
Frost asked how much is additional code vs. changes to existing behavior. There are new kinds of TOAST tuples. There's mapping relfilenodes back to OIDs, which is controversial; Andres says that it's safe the way they do it, and if it's broken in logical replication then it's already broken in Hot Standby, so we should fix it. There are some changes to heapam.c, like preventing full-page writes from removing the tuple contents.
Haas points out that Heikki objects to having most of the functionality in core. Andres says that he can reduce complexity in some areas. Noah says that if it must have zero implications for non-users, it can't possibly work; if it's a good thing, we're going to have to accept some code-complexity cost to do it.
The write-ahead log currently is designed to do crash recovery. We've pushed that for hot standby and PITR, but this is pushing it even further, we're adding another level of pain. TOAST tuples are a good example of this; how we crash recover them doesn't work for logical replication. To what extent do we use the WAL format we have now, and how much do we modify it? What's the performance cost?
Some things don't need much intervention, but we have to look up stuff in the catalogs and do catalog time travel to figure out what the object references mean. Peter E asked about simply creating a completely separate log for logical replication. Andres says that they tried this, but it had a huge extra cost, because you have to add a whole bunch of extra data to it; that would at least double synchronous writes. So are we not going to investigate this? The logical replication stream would need to include catalog updates.
Haas points out that the kludges for this aren't as bad as hot standby. Magnus points out that MySQL has to do 2PC between the binlog and the innodb log. But a lot of people are going to want this feature.
Steve Singer has written 80% of the code for basing Slony on LCR. 2Q is also working on their own replicated solution for this.
The other modification is that we need to handle xmin differently for pg_catalog and user tables. But this had some benefits for general Postgres efficiency. Can we split this up at all? The earlier patches can be reviewed separately.
One other thing we devised for this is logical replication slots. You need to know where readers last left off. We could maybe use this for binary replication as well. You can decode on a hot standby, not just on a master. Primary key updates are handled. We only have the TOAST tuples if they change, though. We have before images, but they have some limitations. Kevin asked about rolling backwards to an earlier state.
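As a hypothetical sketch of the bookkeeping a "logical replication slot" implies (field names invented here, not taken from the patch): the server must remember, per consumer, how far decoding has progressed and which old catalog rows must be retained so that catalog time travel still works.

```c
/* Illustration only: invented struct and field names, not the patch's API. */
#include <stdint.h>

typedef struct LogicalSlotSketch
{
    char     name[64];          /* identifies the downstream consumer */
    uint64_t restart_lsn;       /* WAL position decoding must restart from */
    uint32_t catalog_xmin;      /* oldest xid whose catalog rows must be kept
                                 * so older tuples can still be interpreted
                                 * ("catalog time travel") */
    uint64_t confirmed_flush;   /* last position the consumer acknowledged */
} LogicalSlotSketch;
```

The discussion above about using the same mechanism for binary replication amounts to tracking the same kind of per-consumer positions for physical standbys.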
Haas will review some of the patch, but not the whole thing. We had the same issue with streaming replication, not enough reviewers. Haas would like to get it done early in the first commitfest. Andres will submit but is not sure we can commit it in the first round.
Representing Severity
Peter G. argued that there's no principled way to discover whether a "severe" error has occurred, such as corruption reported from errors in xlog.c. There's no way for DBAs to confirm that an error of a particular severity has occurred, except grepping for particular strings. There are multiple distinctions here; for example, there are various "can't happen" errors which our developers want to see as soon as possible. Heroku also wants to bring errors to the attention of the lists automatically. The tail-n-mail approach is not scalable.
Haas's objection is whether it's possible to agree on a categorization of whether or not something is severe. We have hundreds of ereports, so we need a very simple categorization scheme which everyone can agree on. Tom points out that "can't happen" and "severe" aren't necessarily the same thing. Maybe we could use SQLSTATE and tweak some things. Everyone thinks this is the way to go.
It would be very useful to be able to filter the volume of logging too. We need to log the SQLSTATE, though. We might in addition put the severity concept into the log. We'd have to clean up how we use SQLSTATEs too. We could supply some default filters, but users would need to tweak them. Filtering the log is a separate feature from fixing the SQLSTATEs; that can be done later.
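A sketch of the kind of derivation being discussed, working from the five-character SQLSTATE text (the class is its first two characters). Which classes count as "severe" is an assumption made up for the example, not something the meeting settled.

```c
/* Sketch: derive a coarse severity bucket from a 5-character SQLSTATE.
 * The mapping below is an example policy, not an agreed-upon one. */
#include <string.h>

typedef enum { SEV_ROUTINE, SEV_OPERATIONAL, SEV_SEVERE } ErrSeverity;

static ErrSeverity
severity_from_sqlstate(const char *sqlstate)
{
    char cls[3] = { sqlstate[0], sqlstate[1], '\0' };

    if (strcmp(cls, "XX") == 0)     /* internal_error, data_corrupted, ... */
        return SEV_SEVERE;
    if (strcmp(cls, "58") == 0 ||   /* system errors, e.g. I/O failures */
        strcmp(cls, "53") == 0)     /* insufficient resources */
        return SEV_OPERATIONAL;
    return SEV_ROUTINE;             /* constraint violations, syntax errors, ... */
}
```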
Error Logging Infrastructure
Peter G: As a complementary feature, it would be great to be able to store errors in a circular buffer, available like pg_stat_statements. Peter E is doing a related presentation tomorrow using logging hooks; you can already do this.
Peter E argues against doing more aggregated views like pg_stat_statements, because they don't provide historical or granular data. With logging hooks, you can build this more granularly, and they already exist in 9.2. The average DBA can't write code in C, but logging hooks allow you to work around this; you can get log data in JSON. Peter E argued against having a hardcoded system for this.
Josh pointed out the efficiency argument. Robert pointed out the usefulness of "last message repeated 351 times". Peter G envisaged doing pg_stat_errors as a contrib module; this could be built on logging hooks. Dimitri pointed out that you can also do this with FDWs and CSV logs. There was some discussion about how Oracle Enterprise Manager does this; PEM uses CSV logs.
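A toy sketch of the circular-buffer idea follows (standalone C, not tied to any existing hook; the ring size, entry layout, and "normalized message" notion are invented for illustration).

```c
/* Toy sketch: fixed-size ring of recent errors with a repeat counter,
 * the sort of normalized store a pg_stat_errors module might expose. */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define ERR_RING_SIZE 128

typedef struct
{
    char   sqlstate[6];
    char   message[256];    /* normalized message, constants stripped */
    long   count;           /* "last message repeated N times" */
    time_t last_seen;
} ErrRingEntry;

static ErrRingEntry ring[ERR_RING_SIZE];
static int          ring_next = 0;

static void
record_error(const char *sqlstate, const char *normalized_msg)
{
    /* Bump the counter if the most recent entry matches ... */
    int prev = (ring_next + ERR_RING_SIZE - 1) % ERR_RING_SIZE;
    if (ring[prev].count > 0 &&
        strcmp(ring[prev].message, normalized_msg) == 0)
    {
        ring[prev].count++;
        ring[prev].last_seen = time(NULL);
        return;
    }

    /* ... otherwise overwrite the oldest slot. */
    ErrRingEntry *e = &ring[ring_next];
    ring_next = (ring_next + 1) % ERR_RING_SIZE;
    snprintf(e->sqlstate, sizeof(e->sqlstate), "%s", sqlstate);
    snprintf(e->message, sizeof(e->message), "%s", normalized_msg);
    e->count = 1;
    e->last_seen = time(NULL);
}
```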
Discussion about alternative approaches ensued.
Heroku can feed back some information about errors using log aggregation. Frost would love to have performance information from Heroku.
Incremental Maintenance of Matviews
Kevin: Matching other DBs that have matviews will require a 5-year plan. We have a foundation in 9.3: you can declare a view, and it's materialized. It's just a foundation at this point. For 9.4, we want the first level of incremental maintenance. We can't get all the optimizations in; we just want to get the infrastructure for the simplest cases in there.
For the first commitfest, I want a transactional REFRESH using the approach discussed on the lists. There will always be cases where incremental maintenance won't work, so we want transactional refresh first, so that we can analyze whether it pays to update using the delta. If every row is touched, that's a loser, so we want to fall back to the full refresh if the delta is too large.
He won't speculate on what we can get into each subsequent release. Planning to use the counting algorithm, it's very well established. Will only handle the simple cases at first, and will work through complex cases and see how far they get.
We need a new system column, count_t, which counts the number of ways a given tuple can be derived from the source data; it's a kind of reference count. There's a completely different algorithm for recursive views, and handling NOT EXISTS is a separate implementation too.
What he wants to get into the first commitfest is how to implement the count_t column and the other basic pieces; the later CFs can then be refinements. Peter E points out that the count could overflow. The new column will be added only for new matviews which are specified as "incremental"; we'll probably have other strategies for incremental update, and the column will only be used by the counting strategy. You'd have to do an ALTER to change that.
He is thinking of doing the incremental change synchronously in the transaction; initially it will not work for asynchronous maintenance. You need a delta with a count which can be positive or negative, and you need snapshots for before and after the change. You need to generate the deltas with each change to the base table, but you don't need to update the matview immediately.
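A toy sketch of the heart of the counting approach: each view row carries a derivation count, deltas carry signed counts, and a row disappears when its count reaches zero. The data structures here (string keys, a flat array) are invented for illustration; the real implementation would work in terms of heap tuples and the count_t column.

```c
/* Toy sketch of counting-algorithm maintenance for a materialized row set. */
#include <stdio.h>
#include <string.h>

#define MAX_ROWS 100

typedef struct { char key[64]; long count; } ViewRow;

static ViewRow view[MAX_ROWS];
static int     nrows = 0;

static void
apply_delta(const char *key, long delta_count)     /* delta_count may be < 0 */
{
    for (int i = 0; i < nrows; i++)
    {
        if (strcmp(view[i].key, key) == 0)
        {
            view[i].count += delta_count;
            if (view[i].count <= 0)             /* no derivations left */
                view[i] = view[--nrows];        /* remove the row */
            return;
        }
    }
    if (delta_count > 0 && nrows < MAX_ROWS)    /* newly derivable row */
    {
        snprintf(view[nrows].key, sizeof(view[nrows].key), "%s", key);
        view[nrows].count = delta_count;
        nrows++;
    }
}
```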
There was discussion about how this applies to expensive, infrequently updated materialized views.
The extra column would only be present in incremental materialized views and the deltas. There would also need to be conditional code in execution nodes which knows how to handle the column. Haas questioned whether there was some way we could add a non-system column. How do the executor nodes know not to display the system column? It's not like SELinux, because it wouldn't be present in every table.
Maybe we need a general notion of invisible columns. That would be generally useful. Then we could have extra state information. Bucardo could use it. A lot of things want to hide columns from "select *". We should add this concept.
There was discussion about having stable system views. This change will break things like the ODBC driver, because it gets the list of columns from pg_attribute. We can start with just matviews. But right now drivers assume that anything with a positive attnum is visible. There's no current patch for logical column numbers; Alvaro has one which is complete but bitrotted. This is just like attisdropped; it broke a lot of things. Alvaro might update it, but he's not sure he'll have time. We would have to get this done in CF1, though.
If the next version is 10.0, though, we could break things. So maybe it's a good time.
Lunch Notes
Robert suggested getting rid of the v2 protocol in the next version. He thinks there's hacks in COPY which slow it down and are required for support of v2. But people want to see numbers on that. v2 is also used for JDBC, but that's to fix the plan cache issue which is fixed in 9.2. We need to see concrete improvements, though, and then there'd be an easy switch. We also should know what drivers are using v2, like pygresql.
GIN generalization
Alexander talked about general advances in GIN. Many people use external full-text search because it's faster. One of the reasons GIN is slower is that it's used only for filtering, forcing the executor to fetch all of the results and do the ranking, and it doesn't do LIMIT.
Here is how the extended indexing solution works: there's positional information, so that we can calculate relevance. There's now a structure to add extra information to the vector for any additional data, so we can extract positions of words. To keep the index small we're using run-length compression. The index is then about the same size as the uncompressed index, with a lot of extra information.
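A sketch of the delta-and-varbyte style of compression being described, applied to a sorted posting list with item pointers simplified to plain 32-bit integers. The encoding details of the actual patch are not shown; this only illustrates why sorted lists compress well.

```c
/* Sketch: delta + variable-byte encoding of a sorted posting list. */
#include <stddef.h>
#include <stdint.h>

/* Returns the number of bytes written to 'out' (caller provides enough space). */
static size_t
encode_posting_list(const uint32_t *items, int nitems, uint8_t *out)
{
    size_t   len = 0;
    uint32_t prev = 0;

    for (int i = 0; i < nitems; i++)
    {
        uint32_t delta = items[i] - prev;   /* small because the list is sorted */
        prev = items[i];

        /* varbyte: 7 data bits per byte, high bit set on continuation bytes */
        while (delta >= 0x80)
        {
            out[len++] = (uint8_t) (delta | 0x80);
            delta >>= 7;
        }
        out[len++] = (uint8_t) delta;
    }
    return len;
}
```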
The problem is that we use the infrastructure from KNN. GIN declares that it can do ORDER BY with an operator, so you can get ORDER BY and filtering from the same index. Noah asked what ordering operator he's talking about. GIN calculates distance on item vectors, sorts by an array, and then executes gin_get_tuple(), which is new. It can then return items one by one for sorting. It's kind of like KNN but works differently, and it's much cheaper than getting the tuple from the heap. We need to tell the planner that we can return the original tuple from the index.
This will be a new operator class. And we modify the operator classes for tsvector.
Peter E asked for applications outside tsearch. Alexander said that it would also work for similarity searches. Can also be used for regex index. You can build a regex, index it, and then search for text strings which match that regex.
Andres pointed out an alternate method for doing this, but it requires improvements to the planner. Haas encouraged maybe taking this approach. If we assume that the planner can get an expression from the index, then we need sorting before we go into the heap. That's going to be hard. That's a real issue for any steps which happen between checking the index and the heap. You could see tuples which aren't visible. This might not be a problem for the GIN case, but it's liable to be a problem for the general case.
Maybe this is related to bitmap indexes. Btree operators need to be able to deal with invisible tuples, so this isn't new.
Discussion of implementation methods ensued. There was question about what doesn't work in KNN-Gist infrastructure. Like it doesn't work without a WHERE clause. But this could be generalized. This will break amproc, but it's not like anyone makes access methods outside of core. Maybe we should get rid of pg_am and just have a jump table, pg_am just isn't really useful. We can't drop it because the oids there are used as keys elsewhere, and there's really not much cost.
Another infrastructure question is that currently GIN only has to know one data type. Now that we introduce additional information, we need to know more datatypes; SP-GiST needs three datatypes, which is why we have spgist_config. For GIN we'll need a new configuration method to return this datatype. So there's a question of breaking older stuff. Can you support both opclasses? A storage parameter would define it, so they would appear to be in conflict.
This should probably be GIN2, with new access methods, etc. We'll be changing the storage format of GIN indexes, but that would require supporting the original GIN indefinitely. We could bump the version number on the GIN node page. Two choices: write the code so that it supports the old format, or add a whole new access method. Are we changing the operators or the access methods? The changes are mostly in GIN, a lot less in the operator class. Peter E pointed out that the compression could be a separate feature. Alexander says that the change is not easily separable because of some of the performance optimizations.
There was more implementation discussion about the methods for doing online upgrade if we change GIN. Rewriting on selects is out. In general we want to deprecate the old GIN, but we'll need a couple versions to do it.
Frost raised the idea of general infrastructure to support upgrade-in-place. Like we could have a flag in pg_class. But several people dismissed this idea as not really helpful.
An Extensibility Roadmap
See slides linked from this page.
We've already done some extensibility with CREATE EXTENSION and CREATE EVENT TRIGGER. Now I need the extension templates. I want reviewers and committers to understand where I want to go.
DDL Execution Plans is after that ... like EXPLAIN ALTER TABLE. Dimitri wants CREATE EXTENSION to fetch code from the internet. But core doesn't want that, so Dimitri wants to do that using extension templates and event triggers. Then he wants a PL/C language which compiles on-the-fly. This would allow creating extensions from source dynamically. But the source is in a place only root has access to.
Dimitri wants extensions to follow replication. There are some security issues with that, and issues with having things like downloading in core and having it not suck; there are a lot of reliability issues. Haas isn't keen on the whole idea. But Dimitri doesn't want to code all of the extension features in C. But this is already a problem with external PLs.
Dimitri wants to be able to build a full set of extensions using only a PostgreSQL connection and superuser access. Josh talked about the developer-push issue. But there are security issues with arbitrary C code, so a lot of people have issues with it. Peter G suggested we could somehow sandbox the C code. Haas says that this is possible, but you need kernel-level help.
Josh says there are three problems: developer push, replication, and backup. But what if we solved it for the non-C cases, but not the C cases? Some people think it would be useful. Extension templates for packaged user code would be very useful for in-house code, so it can be packaged as an extension.
Failback with Backup
Fujii hears complaints from users: why do we need to take a full snapshot for failback? If the database is very large, this can take a long time. Three problems: timeline ID mismatch, WAL record mismatch, database inconsistency.
Timeline ID: we can't start the old master as a replica because of the timeline ID. This is resolved now in 9.3.
WAL record inconsistency: if the master crashes after writing a WAL record but before sending it to the standby, then it has WAL records the standby doesn't have. We can resolve this by removing all WAL records in the old master before starting it up in standby mode. The old master will then try to start recovery, and then start replication to retrieve the WAL records. One problem: the last checkpoint record may not be replicated to the standby. Could we make the checkpointer wait for replication? If the standby goes offline, the checkpoint would hang.
Frost asked how large of a user problem this is? People cited examples.
Haas suggested a holdback for datafile sync. Magnus pointed out that we should distinguish between failover and switchover; we could have failback only for switchover circumstances. Jeff pointed out that you have to break replication and resync standbys after an upgrade, and he thinks we can use the WAL to build a diff between datafiles. Controlled switchover might fix the upgrade case too.
We could create a list of dirty blocks from the WAL. But there are a lot of special cases which would prevent rollback.
Database inconsistency can happen if the master crashes; there can be some missing database changes. One solution is implementing undo logic, but that's very, very difficult. Fujii's approach would be to add some write points to the master: before the master writes the database to disk, it writes the WAL to disk. It could wait for the WAL up to that LSN to be replicated before writing the pages, except for hint bits. And what if the replica goes away?
Fujii wants to solve both the controlled-switchover case and the crash-and-failover case. The controlled switchover case is a lot easier to solve: you could freeze out existing connections, finish writes, and then switch over. There are a lot of similar ideas in Oracle QS.
Josh suggested that the crash case isn't any different if we're already in synch rep. Others disagreed. Haas suggested that we could use Andres' infrastructure of slots for replication. Noah suggested that we could delay hint bit application until we're after walflushlocation. Maybe we should only solve this for controlled switchover.
Jeff: for checksums, we already store the first modified record after the checkpoint. We could store very abbreviated information in the WAL for checkpoints. There would be some significant cost for hint bits on a seq scan. If you want to be able to turn the WAL into an undo log, you only have to copy certain pages; you could log only the first change since the last checkpoint.
9.4 Commitfest schedule and tools
The CF schedule will be the same as last year. This time we'll do what we said we'd do, not what we actually did, so include the triage periods etc. Josh will run the first CF.
There was discussion about patch development during commitfests. Ending a commitfest takes a lot of work; there's a lot of work in moving things along. We could make a commitment that -hackers will not argue with the CF manager. Josh also wants to have a commitfest assistant; he called for volunteers. The author ought to be the one to take the patch out; it shouldn't wait on the CFM. Each patch deserves one solid review per CF, but not more than one.
Kevin points out that the biggest issue is finding reviewers and getting the reviewers to do the review. Josh asked if having reviewers put their name down is actually helpful. Maybe we need to be really aggressive about taking names off. There's a big difference between Noah doing a review and some new reviewer doing a review. Josh suggested clearing reviewer after 5 days. Frost suggested removing them earlier. Maybe more tightly tying them to the archives would be good.
Lots of discussion ensued.
We should tell people to not put their names down until they are ready to start reviewing. The 10K line patches are not the problem, the 10 line patches are.
Josh discussed replacing the CF tool, either with Gerrit or Reviewboard or something new. Frost suggested GitHub. Josh discussed some issues for reviewers. Discussion ensued, but was not recorded because the secretary was speaking. Josh also wants automated patch builds; other people suggested automated apply.
Josh will develop a spec for a new tool, run it past the CF managers for comment, and then post it to -hackers for comment. Haas is skeptical about 3rd-party tools; he hates the git stuff. Frost discussed using forks on GitHub. We should be able to track a git branch in the tool.
Goals, priorities, and resources for 9.4
- Dave hopes to start writing pgAdmin4 this year.
- Kevin will work on materialized views.
- Dimitri: Extension templates and event triggers.
- Andrew: most JSON work is complete, the rest can be extensions. Wants to work on grouping sets.
- Noah working on EDB features (see above), and bug fixes.
- Bruce is working on pgPool to make it better and more usable, maintainable and debugged.
- Fujii is working on PostgreSQL replication. Would also like to work on pg_trgm for Japanese.
- Simon (via Skype) Tablesample patch, COPY tuning, and security reviews (SEPostgres).
- Tom will take an interest in pluggable storage. Maybe also Autonomous transactions.
- Magnus will work on a new CF tool. Also more improvements to pg_basebackup.
- Haas will be working on side issues on matviews and parallel query as they come up.
- Josh will be working on new CF tool. With time/funding will work on automated testing.
- Kaigai will work on row-level security. Also GPU stuff.
- Jeff will be working on convincing Robert to remove PD_ALL_VISIBLE. Also corruption detection. And Range JOIN, or inclusion constraints.
- Alexander will work on new GIN.
- Peter G will be working on error stuff (per above). Also wants to work on fixing archiving falling behind and panic shutdown. Also doing a little bit of work on upsert.
- Andres will be doing logical replication.
- Peter E has the transforms feature outstanding. And wants to move the documentation to XML. Wants to improve test coverage.
- Frost will be doing pg infrastructure. Wants to help with CFs. Also fix some stuff with optimizer.
- Greg Smith is working on autovacuum style I/O limits for other statements.
New Committers
The core team has selected the following new committers:
- Jeff Davis
- Fujii Masao
- Stephen Frost
- Noah Misch