PgCon 2008 Developer Meeting

A meeting of the most active PostgreSQL developers and senior figures from PostgreSQL-developer-sponsoring companies is being planned for Wednesday 21st May, 2008 near the University of Ottawa, prior to pgCon 2008. In order to keep the numbers manageable, this meeting is by invitation only. Unfortunately it is quite possible that we've overlooked important code developers during the planning of the event - if you feel you fall into this category and would like to attend, please contact Dave Page (dpage@pgadmin.org).

This is a PostgreSQL Community event, sponsored by EnterpriseDB.

Time & Location

The meeting will start at 10AM, and will finish at 5PM, or earlier if we run out of things to discuss! The location is:

Arc the Hotel
140 Slater Street
Ottawa
ON K1P 5H6

(613) 238-2888

Arc the Hotel

Google Maps

Attendees

The following people are currently expected to attend the meeting. If you have been invited and intend to be there, please add your name to the list:

Oleg Bartunov
Josh Berkus
Neil Conway
Jeff Davis
Pavan Deolasee
Andrew Dunstan
Peter Eisentraut
David Fetter
Magnus Hagander
Alvaro Herrera
Tatsuo Ishii
Marko Kreen
Tom Lane
Heikki Linnakangas
Denis Lussier
Michael Meskes
Bruce Momjian
Dave Page
Simon Riggs
Jignesh K. Shah
Teodor Sigaev
Greg Stark
Andrew Sullivan
Koichi Suzuki
Itagaki Takahiro

Agenda items

The following agenda items are being proposed. Please add items and comments below so we can develop a useful agenda.

Review of commit-fests
Patch management
Buildfarm
Performance regression monitoring
Partitioning roadmap
Using multiple CPUs per query
Vacuum roadmap, including but not limited to
- use cases we've improved recently versus ones which remain problems
- autovacuum
- improving HeapTupleSatisfiesVacuum
General strategy for dealing with platform-specific performance tweaks
Build system - gmake vs. VC++ vs. cmake & friends
Steps toward SQL/MED
Configurable TOAST compression
How to deal with sponsored features (e.g. the materialized views proposal)

pgCon 2008 Developer Meeting Minutes

10:15 to 17:00 May 21, 2008

Introductions

Andrew Sullivan introduces himself, thanks Dave Page & EnterpriseDB. In oder to get through the agenda, if we get into a contentious discussion, then we'll identify that, take it to a list, and move on. Before we start, we'll introduce ourselves.

Jan Wieck, Core Team, Afilias, focussing on COPY, commit enhancements.
Takahiro Itagaki, NTT, developed spread checkpoint, working on stability & usability.
Koichi Suzuki, NTT Open Source, synchronous replication.
Alvaro Herrera, Command Prompt, improving VACUUM.
Tom Lane, Core Team, Red Hat, trying to to clean up the planner for 8.4
Heikki Linnagas, Dead Space Map
Magnus Hagander, Windows work, pgAdmin
Oleg Bartunov, U Moscow, Indexing, GIN & GiST
Teodor Sigaev, same as Oleg.
Robert Lor, Sun Microsystems, DTrace monitoring.
Pavan Deolasse, EnterpriseDB, HOT, working on performance stuff.
Marko Kreen, Skype, Skytools & replication.
Jeff Davis, working on external sorting.
Bruce Momjian, Core Team & EnterpriseDB
Peter Eisentraut, Core Team, Credativ, Community process
Jignesh Shah, Sun Microsystems, TPCE
David Fetter, SQL/MED
Greg Stark, EnterpriseDB, working on unfinished patches
Dave Page, Core Team & EnterpriseDB, pgAdmin & Windows
Josh Berkus, Sun Microsystems, minutes, postgresql.conf
Tatsuo Ishii, SRA OSS, working on recursive join
Andrew Dunstan, consultant, buildfarm author, working on LISTEN/NOTIFY
Simon Riggs, 2nd Quadrant, working on MERGE, HOT Standby
Michael Meskes, Credativ, ECPG, promoting PG, recursive join

Josh is taking notes, but we could use a backup person.

Review of Commit-Fests

Bruce is very happy with the way the commit-fest is going. The first one didn't work very well. However, the 2nd one went fairly smoothly, easy to see what was going on, the wiki worked well. The commit fest is finished now -- Magnus just needs to finish one patch.

Is overwriting other people's wiki changes a problem? People aren't sure. However, in general, the wiki is good enough. One problem was that people were continuing development work and distracting from commit-fest. We could put a footer in the e-mail.

Open items for May commit fest. We need a forcing function for closing a fest ... should we have a two week limit. Marko: no we need to look at all patches. Currently blocking on not looking at the patches. Need to go after reviewers. Greg and Simon said that they need to be assigned stuff because they don't necessarily know what to take. Josh suggests a daily status e-mail, and that Tom / Bruce need to nag people. Andrew S. suggests that we need to make it clear that we can have reviewers who are not committers. Simon suggests that there's a difference between people who will read code and people who won't; the commit-fest should be the second level of review.

Perhaps we should have a list of reviewers or obscure patches. In the IETF, they've started taking people's names -- we should have a call for volunteers. Bruce did quite a bit of nagging in the 1st commit fest, but it didn't help. The 2nd one Bruce didn't do much and it went better. Bruce didn't see a major problem ... thinks it will get better on its own. What prevents people from reviewing? We don't know. We used to have a lot of patch review with people jumping on it on their own. Patch volume has also gone way up.

Daily status e-mail not popular -- shouldn't do it. What about the volunteer idea? Need a manager for each commit-fest. Josh will happily manage the July commit-fest.

Bruce: one more issue for the commit fest. What do we do with the items which were rejected? Need to have a list of patches which are almost done. We also might decided not to say "rejected". Someone needs to make the process ok with the submitters. The rejection e-mail comments really matter -- don't say "this patch is crap." We want to divide into "returned for more work" and "rejected". What do we do with patches that "need more work", though? Do NOT put them automatically on the next list.

Maybe we should have a "limbo page". Tracker? No tracker discussion. Josh suggests that we just contact the author, then appeal to patches. Bruce says the TODO list supplies better organization. Josh will try his strategy for July. Also, replace "Claimed By" with "Reviewers". But what if you want to "Claim" something? Put a comment.

Should we have more than four commit fests? That's tentative. We won't know how it comes out until 8.4.

Community Mailing List Management

What about merging hackers and patches? Seems to be generally OK. Some concerns about attachments. Putting patches into the wiki seems problematic. General discussion about merging various mailing lists. SQL & Interfaces, Genearal and Admin, Performance & other stuff. We should come up with a set of lists to be retired or consolidate. Should discuss this online later -- Peter E. will follow up.

How to Deal with Sponsored Features

A week ago, someone offered to pay a hacker to work on materialized views. PeterE talked to him, and apparently others contacted him in private. People think this is a non-problem. There's an understanding problem on the part of companies; we need a technical spec sponsorship first. 95% of people who go away with a workaround.

Maybe we should document the process of adding a new feature to PostgreSQL. David F volunteered to document this.

Do we have any way to collectivize funds? We can use SPI. Simon doesn't want all sponsored development to be public, there's a lot of low-level stuff that he doesn't want to argue about. There's also some bureaucracy with non-profits. Tom thinks we can do it privately, Josh cites GiST, OpenOffice as counter-examples. Greg says there are two questions, how hackers should get involved, and second the mechanics.

Andrew points out that the management of software development always gets underestimated. Maybe we should have a registry of PostgreSQL coders available for freelance hacking.

Koichi points out the issue around developers guarenteeing that patches would go in. Also, some projects just turn out to be a bad idea. And we need to pay for technical specs. Also, sometimes individual developers need a company to be an intermediary. So you just need to pay a commission. Also, funneling it through SPI could be a political problem and there could be community fights.

BuildFarm & Performance Regression Testing

Nobody knows who put this on the agenda (it was dave). Sun will contribute to the buildfarm.

Tom would like an easier view into the history. It's possible to get notifications on the buildfarm? Yes, there are mailing lists.

Is the buildfarm the right platform for performance regression testings? Josh suggested that we do small tests on the BF machines, and we will have a community-owned benchmark rig in Portland. EnterpriseDB has DBT2 rigs. Sun has some internal stuff.

Simon thinks the BF would test basic operation peformance to see that we don't mess up low-level operations. Josh wants a list of operations we want to test. Simon wants dedicated equipment. We need a testing framework for performance regression. Some people want complex tests that run all day. But there are simpler tests we can run which are fast. Josh says that most BF machines are not dedicated, and Tom cites need for a simple test which developers can run.

Some kinds of testing ... caching algorithms, database maintenance, etc. require running tests around the clock and won't work for this. We need one lightweight test which tests fast low-level things. Then we need some heavier tests that run really long on a few dedicated machines. We also need people who will test for performance on particular patches. Jignesh wants to run a timed series of tests to show different versions and how those are doing.

Can we build stuff into the current regression test suite? Sure, but would need to be optional. Maybe pgBench would be a better basis. Size of test would depend on which machine you're running this on.

We have three performance items here:

Request for big 36-hour tests for major work.
Request for medium-sized test of up to 20 hours a day for dedicated testing machines.
And then a small 1-hour low-level test for buildfarm members etc.

Heikki will lead small test thing. Big testing is already happening. We need to cover both whether we're moving forwards, and whether we're moving backwards. The buildfarm has made regression issues much more managable, we need to the same for performance. EDB does not have such a think for EDBAS. We will need software for running tests and need people to analyze the results on a regular basis. Building software will slow down this.

Sun has open sourced a test harness (Faban). We also have pgUnitTest. Do we have historical versions of pgBench? Probably doesn't matter, pgBench isn't very good. Maybe we need pgBench2. Maybe we need something else. We also need to have detailed output. The rest should go to a mailing list.

Partitioning Roadmap

We have a patch which is waiting on coming up with a roadmap. No clear consensus on what we need. Simon created the patch based on fixing some severe problems in the field. Stopped because of the amount of planner changes required, syntax changes won't fix it.

What problems are we trying to solve?

Can't rewrite queries with a partitioned tables, get really bad plans.
Query plan caching
Managing paritions -- "building out of spare parts"
Triggers for updates and inserts

As a consolation, other databases (MySQL, Oracle) have had to rewrite paritioning several times. Jignesh talked about paritioning in the storage layer. Josh pointed out that people like being able to do DDL on individual partitions. Alvaro: should we still have parititions in pg_class? Folks think yes. Also, Tom thinks we need to improve this incrementally.

Simon says that there are two different cases: completely automated and very detailed and complex. That we probably can't keep very complex and make it easier to manage and more efficient. Maybe we should not use inherited tables. Andrew thinks that people like the current flexibility because they're working around, not because they really like it.

We should start new partitioning as a new feature. Some features of inheritance are not useful for partitioning (multiple inheritance, etc.). But we'll want to have partitions accessable individually for some purposes. Or do we? And maybe right now we're not getting the simple case users. More discussion about feature.

Will we will start implementing a new, separate partitioning feature not based on inheritance. Or subclassing inheritance? Are we willing to have only range partitioning, or do we want hash partitioning? Range partitioning seems to be enough.

Will we allow parititioning on a expression index? Lists of columns? Multi-dimensional data types? Maybe we should use opclass. The method should be programmable so it support spatial in the future. But date range partitioning is the biggest use case.

How big of a problem is the DDL scripts etc. Simon thinks not as big as the planner issues, because you can work around them. Two different groups of users with two different primary problems: DDL for web/simple apps, for real DW, the plans.

What would the syntax look like? Bruce describes some complex syntax. Jignesh describes the DB2 syntax. Autocreation of partitions is problematic. Josh says people want "PARTITION ON (<expr>)". Tom points out that people want "partition this existing 6TB table". Also, the problem with <expr> is how do we match that to a query range.

More detailed discussion. Take to mailing list.

Multi-CPU Queries

Lots of people are interested in parallel query. Also parallel COPY and pg_dump.

We really haven't excelled at having a single query max out resources. The next couple of years we'll need to focus on this because the proprietary databases do better than we do in this. Informix could do this. Jan outlined an idea from a few years ago, in which the executor breaks out nodes.

For parallizing pg_dump, Simon has been working on it. Their idea is snapshot cloning, where multiple sessions can share the same snapshot XID. We need to write to multiple files, and then have pg_restore do restore from multiple files. There are some issues with locking.

Even multi-threading the executor will bring up a lot of the same issues as multi-threading the whole db. Libraries, dependencies, etc. Simon says we can use processes. But there's a lot of overhead for processes. Some queries are not parallelizable.

Jignesh brought up asych I/O. Maybe that's easier than parallel query. Greg says that Oracle parallel query is easier to use than parallel server. Is that the same topic? Jignesh thinks so. Points out that splitting up the executor has a penalty for OLTP. But are servers actually CPU-bound? Several people say yes. Simon says that 2-4 times scalability with parallel query. Bigger than that is very hard.

We'd also have to make it configurable. We really want both asych and parallel query. Asynch is more useful for OLTP. A lot of discussion on different approaches ensued.

We can do parallel pg_dump and pg_restore for 8.4. And maybe some kind of index/scan readahead. Parallel query is longer term. Who's working on it.

Platform-Specific Optimization

People keep coming in with platform-specific optimizations for PostgreSQL. Generally these show huge improvements, but are specific to an OS-HW combinations, and aren't very tunable. This means that we have to re-write these things.

But doing some things completely independant can actually be more code. So are we at a stage where we need to do platform-specific optimizations. Sometimes we've already had to, like semaphores. Asynch I/O may require this. Posix_fadvise works differently from Linux AIO. Jan mentioned "write barriers".

Other examples: in explain analyze, would be good to have DTrace output to find out how many I/Os each operation did. Why can't we just do that using the current generic approach.

We maybe can't come up with a general strategy. Or policy.

(some discussion missed)

Direct I/O is another OS-specific issue. But not sure how to resolve. We need it for really large shared buffers, but it's completely different for each OS. This is a good example of such an issue ... we'd need a two completely different checkpointing code paths.

Vacuum Roadmap

From years ago, vacuum is the most hated part of PostgreSQL. All of the little fixes (FSM, autovacuum, etc.) may be "putting lipstick on a pig". Maybe we need to handle stuff fundamentally differently. The alternative is to go to an undo system. The idea is to make maintenance happen offline. Free Space Map, Dead Space Map, Dirty Space Map. Currently vacuum has to visit every page.

So can we do one massive rewrite, or should we continue fixing it piecemeal? Greg thinks piecemeal. Pavan suggested some other fixes. Jignesh mentions the lack of predictibility in seq scans. In general, everyone thinks that fixing individual problems is the approach.

More issues we need to address:

Scanning whole index regardless of how many changes
Hint bits in tuple headers
Dead Space Map should address very large tables with few dirty pages. But does not fix very large indexes. So we need to do a lookup. Discussion about how to lookup index pages ensued. A serious issue for GIN
Index rescanning also prevents Synch Scan from helping vacuum. As well as vacuum_delay.
Update to a large portion or the entire table. Smaller segmentation? Maybe not.
Constant load databases ... vacuum in memory? Problem is that rows may still be visible.
Hint Bit setting. Maybe the bgwriter can set the hint bits.
Long-running transactions: Alvaro thinks we can fix that by having those get a snapshot rather than just an XID. People really liked this.

Greg says there are use cases we've covered and ones we haven't covered. He thinks chipping away at the use cases is the right approach. It's like chipping away at a block of stone.

More ideas: reference counting. Having to rescan the whole index seems to be one of the biggest problems. When we UPDATE a whole table, maybe re-write that? Maybe only if the lock the table? Marko did concurrent CLUSTER which wasn't completed. Could we have segments which are different sizes? That would break CTIDs. But Tom thinks that's fixable.

Simon says we need more detail on real-world use cases. Greg Stark will follow-up on the use cases.

Build System

Do we want to use something different? cmake vs. VC++ etc.?

The current VC++ port for Windows builds breaks all the time.

The current makefiles contain a lot of copy and paste, which Peter wants to refactor. He expects us to be able to use cmake in the future if we want to. That would allow us to more easily produce OS-native make files. Jan thinks this sounds like the old imake, which is dead for good reasons. Peter disagrees. KDE has changed to use cmake and it's working for them. cmake use is growing.

In general people were positive to the idea. There was a lot of wondering about specifics. Like, what does it depend on? It's in C++. Seems to be OK.

Peter will get started on this soon, with Magnus' help.

SQL/MED

David Fetter: looking for a way to let Postgres talk directly to other DBMSes and data stores. Using middleware is very insatisfactory. Neil Conway wrote a patch which surfaced qualifiers for RULES. It would be nice to have qualifiers in userland as a good-enough for doing this. Already fixed DBlink for this.

The patch surfaces the WHERE clause as a string currently. Heikki doesn't think that this is a long-term solution. EnterpriseDB implemented this for Oracle. Doesn't do joins either.

Another issue is estimating rows from the remote query.

People would rather have a more robust implementation. But is anyone working on it? Marko says this would also help PL/proxy. Jan proposed something more sophisticated using cursors where the remote qualifiers would open a cursor. But we also want the more brute-force interface.

This is a more general issue about stored-procedure based views.

Are there any other use cases for this? Marko thinks there is one for local access, such as ones for dynamic query generation. Simon and Heikki talked about pushing down aggregates. Simon doesn't see any way around parsing a plain text qualifier to pass down conditions. Heikki says this needs to also be handled by a plugin which does interpretation.

So, do we accept the Conway patch the way it is now? Seems ok to do the stop-gap.

Jan thinks that showing the node tree will work better. But others don't agree with him -- it wouldn't work for PL/perlU. But Jan thinks it would work to give it a pointer to the parse tree and the range, we'd need to add an access function for the PL.

TOAST

Alvaro: Postgres uses an lc algorithm, but some people want to use other compression algorithms. How does it compare? If it compares poorly, we may want to switch to lz0. However, lz0 has GPL code and doesn't really fix the problem.

The real use case is storing big text documents. RIght now they're working around it with BYTEA, but that's not satisfactory. Maybe we just offer a bzip contrib module and they can use updatable views. But pluggable compression stuff would open up other new uses.

Jan thinks that we need to have some handles to fine-tune TOAST per column. Or maybe even according to estimated compression gain. Some discussion about TOAST headers and different ways to do this ensued.

Should just deal with this through the hackers list. We need to see some performance numbers. Jan told a story about originally implementing TOAST.

Jan also suggest having a separate data type for each compression algorithm.

Next step is to see some numbers.

End Remarks

Bruce: Wow, there's 26 of us in this room. Most everybody in PostgreSQL development is here. I'm excited. A lot of people travelled a huge distance. Some people are volunteers and made the time to be at the conference. It's really great that we've been able to talk on a technical level for 7 hours and make a lot of progress. Many of you have been working on Postgres for 7+ years, mostly as a volunteer. As long as we can keep that kind of dedication up, our potential is unlimited. A lot of stuff about the project hasn't changed, the culture, what drives us. We're really going to change the world, or at least our portion of it. Being associated with the PostgreSQL project has really been a highlight of my life. We'll be associating with a lot of other users over the next couple of days, but this is the core. This is what drives us forward.