FOSDEM/PGDay 2016 Developer Meeting

A meeting of the interested PostgreSQL developers is being planned for Thursday 28th January, 2016 at the Brussels Marriott Hotel, prior to FOSDEM/PGDay 2016. In order to keep the numbers manageable, this meeting is by invitation only. Unfortunately it is quite possible that we've overlooked important individuals during the planning of the event - if you feel you fall into this category and would like to attend, please contact Dave Page (dpage@pgadmin.org).

Please note that the attendee numbers have been kept low in order to keep the meeting more productive. Invitations have been sent only to developers that have been highly active on the database server over the 9.5 release cycle. We have not invited any contributors based on their contributions to related projects, or seniority in regional user groups or sponsoring companies.

This is a PostgreSQL Community event.

Meeting Goals

Review the progress of the 9.6 schedule, and formulate plans to address any issues
Address any proposed timing, policy, or procedure issues
Address any proposed Wicked problems

Time & Location

The meeting will be:

9:00AM to 5:00PM
Brussels Marriott Hotel

Coffee, tea and snacks will be served starting at 8:45am. Lunch will be provided.

RSVPs

The following people have RSVPed to the meeting (in alphabetical order, by surname) and will be attending:

Joe Conway
Dimitri Fontaine
Andres Freund
Magnus Hagander
Petr Jelinek
Peter Geoghegan
Kevin Grittner
Álvaro Herrera
Heikki Linnakangas
Tom Lane
Bruce Momjian
Dave Page
Dean Rasheed
David Rowley
Craig Ringer
Simon Riggs
Teodor Sigaev
Tomas Vondra

The following people have sent their apologies:

Josh Berkus
Jeff Davis
Andrew Dunstan
Peter Eisentraut
Stephen Frost
Etsuro Fujita
Amit Kapila
Kohei Kaigai
Robert Haas
Fujii Masao
Noah Misch
Michael Paquier
Masahiko Sawada
Pavel Stehule

Agenda

Time	Item	Presenter
09:00 - 09:10	Welcome and introductions	Dave
09:10 - 10:10	9.6 and beyond - improving the process: Do the commitfests still work? Should we adjust timing? When should the code be branched? Should we be more strict with commitfest dates, and if so, how do we remain fair to patch submitters?	All
10:10 - 10:30	9.6 Release Schedule	All
10:30 - 11:00	Coffee break	All
11:00 - 11:45	Future of storage	Álvaro et al
11:45 - 12:30	pglogical, BDR and logical decoding	Petr, Craig, Simon
12:30 - 13:30	Lunch	All
13:30 - 15:00	Future shared infrastructure for tightly and loosely coupled clustering and multimaster replication: multiple sync replicas; sequence replication, sequence access methods; exposing transaction management and lock management for distributed xact managers/lock managers; ...	Petr, Craig, Simon
15:00 - 15:30	Tea break	All
15:30 - 16:30	Testing: getting in-core and buildfarm test coverage of replication/failover/promotion (re multixact 9.3 issues etc); better testing infrastructure in general (crash safety, performance, etc)	Tom, Craig (?)
16:30 - 17:00	Any other business	Dave
17:00	Finish

Minutes

Present:

Joe Conway
Dimitri Fontaine
Andres Freund
Magnus Hagander
Petr Jelinek
Peter Geoghegan
Álvaro Herrera
Tom Lane
Bruce Momjian
Dave Page
Dean Rasheed
David Rowley
Craig Ringer
Simon Riggs
Tomas Vondra

Expected to arrive late due to travel:

Kevin Grittner
Heikki Linnakangas

Missing:

Teodor Sigaev

9.6 and beyond - improving the process

Tom: People moved straight to 9.6 development when 9.5 was branched

Andres: 9.4 was the same

Bruce: Too many commitfests?

Craig: Lots of pressure to get more stuff done.

Magnus: More fun to build new features

Dave: PG was a hobby in the early days. Now we're all making a living from it.

Peter: Why was release in January, not December?

Andres: We didn't do anything until deadlines were set?

Dave: So do we need pre-defined deadlines?

Magnus: Backing out large patches causes serious delays

Craig: Open items on RLS caused delays

Simon: Coordination is the bigger problem. Individual issues are small but mount up.

Joe: Finding time is hard and sporadic

Dave: Is it harder for everyone now this is a paid gig, not a hobby we pick up whenever we like?

Others seem to agree.

Andres wonders if Magnus has graphed commitfest submissions, expecting activity has gone way up.

Tom thinks the level has remained around 100 for a while. Simon suggests 80 patches in the beginning, up to 120 at the end of a cycle.

Andres thinks more patches, less review.

Dave: Is it that employers prioritise new features, but not patch review?

Andres: Should we force reviews in return for submissions?

Dean: We should encourage more patch review/reviewers to help find bugs early.

Alvaro: We can't force people to review

Andres: We can force people to write tests, but they'll end up being stupid ones in some cases.

Peter: People are only interested in reviewing what they consider to be important.

Simon: Should we have official "reviewers"?

Dave: That might have the opposite effect - not my job!

Andres: Should we credit reviewers at the end of the release notes?

Dave: It's certainly worthy of credit

Bruce: I've been one of the big poo-pooers of that. It's a slippery slope of saying those names are there for credit.

Dave/Peter: That's not a bad thing.

Simon: Regardless of patch credits, reviewer credit should be given.

Simon: We need a bar for inclusion

Where do we get the list from? Commit logs? Not everyone includes reviewers. Commitfest app? It often only contains the principal reviewer, who may remove themselves when they're done.

Magnus: We should properly credit in the commit logs

Dave: We can have review of release notes to make sure we didn't miss anyone

Dave: This doesn't fix everything. Let's resolve to include reviewers on commit messages, and list them on release notes

Joe: Let's define a commit message structure

Action point: Include reviewers in commit messages and list them on release notes. AND TELL EVERYONE!! (all)

Action point: Bruce to propose commit message template (Bruce)
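As a rough illustration of why a fixed commit message structure helps, the sketch below collects reviewer names from commit messages. It assumes a hypothetical "Reviewed-by:" trailer line; the meeting did not settle on a format, and the names and sample messages are invented for the example.

```python
import re
from collections import Counter

# Hypothetical trailer name: the template itself was still to be proposed.
TRAILER = re.compile(r"^Reviewed-by:[ \t]*(.+)$", re.IGNORECASE | re.MULTILINE)


def reviewers_in(message):
    """Return the reviewer names found in one commit message."""
    names = []
    for line in TRAILER.findall(message):
        # Allow several comma-separated names on one trailer line.
        names.extend(n.strip() for n in line.split(",") if n.strip())
    return names


def reviewer_credits(messages):
    """Count how many commits each reviewer is credited on."""
    counts = Counter()
    for msg in messages:
        # A reviewer listed twice in one commit still counts once.
        counts.update(set(reviewers_in(msg)))
    return counts


# Invented sample log, for illustration only.
sample_log = [
    "Fix thing A\n\nReviewed-by: Alice Example, Bob Example\n",
    "Add thing B\n\nAuthor: Carol Example\nReviewed-by: Alice Example\n",
    "Tidy comments\n",
]
credits = reviewer_credits(sample_log)
```

With a consistent trailer, a list for the release notes could be drawn from the commit log mechanically and then checked by hand for anyone missed.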

Tom: Can't just let commitfest drag on. Need to close out firmly on the correct date.

Andres: The author should be responsible for pushing things to the next commitfest.

Simon: Need to handle patches for which we get no response from the author - it's not rejected, and may well be wanted, but is currently inactive

Peter: Need a culture of really tracking status of patches on the cf app so people can see a good overview from one page. Need to handle patch dependencies

Dim: Readme file for patch series?

David: Should the CF app remind reviewers if they haven't submitted review notes?

Andres: That has a negative effect on me

Magnus: Personal contact from the CF manager is better

Alvaro: CF app has a mail feature, and that has worked for me to nag people

Craig: We need to avoid CF manager burnout, and keep the process simple.

Petr: Would be good to have activity log updates for my patches

Andres: RSS feed doesn't work well. Update notifications would be good.

Tomas: Would like to see enforced review for patch. Problem with balance though - big patch for 2 small reviews?

Action point: Magnus to add a subscribe button to the CF app, to allow users to receive email updates when metadata is changed. (Magnus)

Simon: We have CF managers, but we have no one managing releases. I propose a team of three people for an RM team - 3 for redundancy and voting on challenges. Team would be able to veto patches.

Magnus: Would they take over after the last CF?

Simon: Not sure

Peter: This could have meant that UPSERT wouldn't have gone into 9.5 which would have been bad.

Dave: But for every UPSERT, there are 10 patches that wouldn't have been whipped into shape in time.

Joe: The release team should decide when we branch, based on current status

Tom: Should we stop branching early to keep focus on the current release?

Petr: Commitfest schedule should be based on branch date

Simon: Fixed dates are helpful for planning holidays etc.

Tom: We already avoid Christmas etc. Maybe we should keep the existing schedule, but just drop the July/September fest if we're delayed. We should probably drop July regardless.

Simon: Make July a "Review fest"?

Action point: Propose release management team to release mailing list (not a public list), and form for 9.6. (Simon)

Action point: Propose dropping July commitfest and making September conditional on progress. (Simon)

9.6 Release Schedule

Magnus: Do we want to try to get back to a September release?

Bruce: If we end up on 3 CFs, is that enough? If the last is in March, we still won't hit the date

Dave: That's what the release team is for - to progress things through March - Sept.

Magnus: Need to get out of cycle of slippage.

Dave: We should release end of September

Tom: Beta in June, release in September

Magnus: What about the betas? We have too much time between them.

Simon: Do betas need to be full releases?

Dave: Yes, as we're also testing installers, build systems, test systems etc.

Joe: How can we encourage beta testers?

Dave: Give out t-shirts for good bug reports

Peter: I like that idea

Simon: Present shirts at a conference

We could have a custom T-shirt

Action point: Investigate SPI funding and t-shirt design and logistics

Action point: Propose September release, May beta (before PG Con?) to the release list (Dave).

Future of storage

Alvaro: et al is Tomas, David and Simon! We're working on columnar storage. Posted a patch on hackers, but not happy with the result. Want to improve massive query performance - looking for 10x - 100x increases.

The current patch shows maybe 10 - 25% improvements. Current patch is essentially vertical partitioning, by moving data off the heap into another relation. Not really columnar storage - just moving a column to its own relation. Looking at a new approach based on experiences.

One of the first ideas is to split the concept of a tuple descriptor into 2 pieces - one is coming from the main table, the other a smaller descriptor for each column store on the table. Proposed here as this would require splitting up pg_attribute, and wants buy-in before doing so.

One option is to split up pg_attribute

One option is to have a new storage abstract layer which can handle columns which are not part of the vestigial heap in HeapLockTuple.

Andres: Wonders how much this would help as you'd still need cmin, cmax et al.

Alvaro: That data could be centralised on the heap to some extent (paraphrased, not entirely convinced I understood correctly)!

Simon: We don't want to radically change Postgres. Look at Monet - they proved columnar worked, but you have to effectively turn the database off to load.

Tomas: Initial patch was written to avoid breaking as many things as possible. This is the first step to abstract the locking. In the next step we need to do more radical restructuring.

Andres: There's a reason why practically no column store supports features like HeapLockTuple, but making the API more general has other advantages.

Tomas: In the next step we could do locking of blocks of tuples

Simon: If we accept restrictions on DB functionality, we'll end up with something so far from Postgres that you'll end up choosing one thing or another - it'll be a separate product.

Tomas: Greenplum only supports append-only columnar stores

Alvaro: Would want to make incremental changes to storage/catalogs as DDL support is added

Alvaro: Another proposal is to allow multiple tuple datums to be stored consecutively

David: [Splitting pg_attribute]. The idea is to have a physical and logical descriptor so things can be easily rearranged into an efficient storage order.
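The split between a logical and a physical descriptor can be sketched as follows. This is only an illustration of the idea, not the proposed catalog design: the Attribute class, the ordering rule (widest fixed-width columns first, variable-length columns last) and the sample columns are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str
    width: int  # bytes; -1 marks a variable-length column (assumption)


def physical_order(logical):
    """Return indexes into `logical` in an assumed storage-friendly order:
    fixed-width columns widest first, variable-length columns last."""
    def key(i):
        attr = logical[i]
        return (attr.width < 0, -attr.width, i)
    return sorted(range(len(logical)), key=key)


def store(row, phys):
    """Rearrange a row from logical order into physical order."""
    return [row[i] for i in phys]


def fetch(stored, phys):
    """Rebuild the logical row from its physical representation."""
    out = [None] * len(phys)
    for pos, i in enumerate(phys):
        out[i] = stored[pos]
    return out


# Hypothetical table definition in the order the user declared it.
logical = [
    Attribute("flag", 1),
    Attribute("note", -1),
    Attribute("id", 8),
    Attribute("qty", 4),
]
phys = physical_order(logical)
row = [True, "hello", 42, 7]
stored = store(row, phys)
```

Queries keep seeing columns in the declared order through fetch(), while the storage layer is free to pick whatever physical order it likes.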

Alvaro: The new design allows us to have attributes stored in different places, which really doesn't work well with just a couple of other columns on pg_attribute.

Bruce: You should be getting much higher performance. We don't do two things - columnar and graph. Is it about compression, header, row format?

Simon: Vectorising the executor has a massive performance benefit

Dim: This is what Greenplum does for seq scans etc.

Simon: Have someone experienced to review our work, but can't talk to him yet as he's on a review committee for funding a project of 2ndQ's. Restriction should be lifted soon.

Simon: Vectorised column storage can be up in the 100s of % performance increase.

Andres: Just vectorising has given 300%

Simon: If we go down the road of allowing restrictions, we'll end up like MySQL with MyISAM and InnoDB.

Petr: Sucky updates are fine - just don't use columnar for regularly updated stuff.

Tomas: Columnar updates can be fast if done in a batch manner. No good for OLTP of course.

Tomas: I'd be happy with 25x performance if I get updates

Alvaro: Was looking for any objection to restructuring. Seems like there is none - will move forwards with multiple patches.

Tom: Looked at this at Salesforce. Putting in anything that looks like a storage manager API is a *much* bigger task than you might think. There is much more chance of making this work if you can avoid changing catalog access and thus having to touch DDL code.

Dave: Maybe we need a wiki page to write up the current evolution to the patch, so people know how the current state was reached.

Simon: I disagree with Stonebraker - having restricted features in lots of systems may be lucrative, but we want everything in Postgres.

Joe: Columnar storage is not a feature - it's a solution to a particular problem. We should know the use cases, as maybe there are other solutions.

Alvaro: This is why we want a generic infrastructure for this, to allow future alternative storage options.

Craig: We don't want restrictions, like you can't use BRIN indexes or FTS on a table with columnar storage.

Action point: Set up a wiki page to describe the project and work to date (Alvaro & Tomas)

pglogical, BDR and logical decoding

Simon: We've submitted pglogical for 9.6. Wanted to discuss whether people feel it should be committed to this release, and discuss roadmap of future items. When we originally wrote BDR, we were driven by our funding model. Now we need to get that into core so we started with pglogical. Once we have data transport into code, then we add multimaster.

Craig: pglogical came from the guts of BDR. We took the parts that were usable with PG 9.4, and turned it into a data transport mechanism that allows replication in a flexible arrangement of nodes. This can allow others to build multimaster, sharding, DW etc. Hooks are present to allow filtering. Initial code allows selective replication, online upgrade (bar sequences), data merge into DW. Looking at adding audit feature where changes are fed into an audit table or text file.

Petr: Working on data transformation

Craig: Looking at adding selective replication within a table, e.g. only replicate data for a particular customer, sharding.

Peter: So you're putting logical on a level field with Slony?

Craig: Yes. And to make ETL easy.

Kevin: Can you support disjointed multi-master, where different nodes contain different data sets?

Craig: Yes

Kevin: What about where a change on one node can cause a delete on another?

Craig: You mean like a re-shard? [should be possible]

Dim: What about skipping columns in replication to hide them?

Craig: If we can't do it yet, it should be easy.

Petr: We tried to make the plugin so you can use it on its own to send data to things other than Postgres

Craig: We've tried to avoid having too many plugins on plugins, but we could make the wire protocol pluggable as well. We're really trying to make it so people can use this for everything.

Andres: One of the dangers here is making the output plugin too complex.

Craig: That's what's so cool - it's actually really simple! I'm really surprised how the output plugin naturally formed boundaries and allowed for hooks.

Craig: pglogical is another part of us getting bits of BDR into core. We'll eventually wrap BDR around it.

Petr: No action items - this is really a status update. We'll talk more after lunch about the infrastructure we need.

Andres: We have 10 minutes now, let's talk about sequences etc.

Craig: We need the ability to decode a sequence advance out of WAL.

Petr: Craig is working on sequence advance. We also need to work on sequence access methods for multimaster, clustered setups etc. We need the generator to not be locally owned. Have taken multiple approaches to storing sequence AM data in the catalogs.

Craig: We need this for sequences across multimaster and sharding, as well as an idiot proof gapless sequence.

Heikki: What are the AMs you need? Cluster-wide, gapless

Petr: Gapless locks the sequence until commit to ensure numbers don't end up unused.
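The locking behaviour Petr describes can be modelled in a few lines. This is a toy in-memory sketch, not the sequence access method under discussion: the class and method names are invented, and a real implementation would tie the lock to the database transaction rather than to explicit commit() and rollback() calls.

```python
import threading


class GaplessSequence:
    """Toy gapless sequence: the lock taken in nextval() is held until the
    caller commits or rolls back, so no number is ever skipped."""

    def __init__(self, start=1):
        self._next = start
        self._pending = None
        self._lock = threading.Lock()

    def nextval(self):
        # Blocks other callers until commit() or rollback() releases the lock.
        self._lock.acquire()
        self._pending = self._next
        return self._pending

    def commit(self):
        if self._pending is None:
            raise RuntimeError("no value outstanding")
        # Only a committed value advances the sequence.
        self._next = self._pending + 1
        self._pending = None
        self._lock.release()

    def rollback(self):
        if self._pending is None:
            raise RuntimeError("no value outstanding")
        # The value was never used, so the next caller gets it again.
        self._pending = None
        self._lock.release()


seq = GaplessSequence()
a = seq.nextval()
seq.rollback()      # transaction aborted: 1 is handed out again
b = seq.nextval()
seq.commit()
c = seq.nextval()
seq.commit()
```

The cost is visible in the sketch: every caller serialises on the lock for the length of a transaction, which is why a gapless generator would be an opt-in access method rather than the default.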

Andres: Every German will thank you for that!

Lunch

Future shared infrastructure for clustering etc.

Craig: Wanted to talk to the Russians about their work exposing the transaction and lock manager. First though, failover slots.

Kevin: I was invited to Russia to talk about SSI with an eye toward how that could be spread across multiple nodes.

Peter: There's a restriction on SSI with parallelism - you can't use them together. Probably something to do with predicate locks.

Kevin: On a related note, after SSI went in Berkus told me a customer had implemented their own sharded solution, but performance wasn't great (10x latency).

Craig: If you have data distributed across slow WANs on a 4 node MM cluster, if one node fails you can't switch over because you can't create new slots at the right point in time. Failover slots provide a minimal way to allow logical replication to play nicely with HA. Patch is mostly done, needs ability to follow a timeline change and review.

Simon: Interested in transaction manager work, but it should be in core, not an external extension.

Heikki: Simon, how far do we want to go to allow people to write custom extensions? WAL logging in extensions seems popular. Do we want it though?

Simon: Opens the door for people to write patent-encumbered extensions

Dave: I don't really care - we're a BSD project. The question for me, is would it be useful for other OSS projects like PostGIS?

Andres: cstore_fdw could use such a feature for replication and crash safety

Simon: I want a way to skip a broken index during recovery for example

Heikki & Andres question if this would even work. How would you know what is broken during recovery?

Craig: What scares me is lack of disk space checks and crash testing.

Discussion moves onto database consistency checking - Heikki says people don't want it until they need it.

Dave: EDB customers often have it as a check-list due diligence question (that's why EDB wrote pg_catcheck). Peter wrote a tool for checking btree consistency that users have used.

Peter: It's loosely based on pageinspect. It takes shared locks on buffers one at a time.

Various: We need a secret option hidden in the docs without which people cannot run pg_resetxlog!!

Craig: We need something like "rm -rf / --including-root"

Tom: That works fine until Google archives a post with the magic parameter in it.

Action point: Post new version of btree consistency checker patch (Peter)

Action point: Add warning notice and confirmation requirement to pg_resetxlog (Craig)

Action point: Reword delete backup label hint (Kevin)

We need a safer mechanism for start/stop backup...

Magnus: We could disallow disconnection during backup - i.e. if you disconnect, pg_stop_backup() is run automatically.

Kevin: Need some reliable way of telling the difference between a tarball and a crashed datadir.

Tom: By definition you can't, or tar is broken.

Magnus: We need a new robust API for non-exclusive backups

Simon: Keep but deprecate the existing API.

Need to find a better way to ensure users have the required xlog in backups

Craig: Our docs are in the wrong order. pg_basebackup should be first, ahead of manual methods.

Action point: Re-arrange backup docs page (Bruce)

Andres: We could rename pg_xlog to pg_wal

Simon: pg_clog to pg_commit

Magnus: Renaming pg_xlog will break all backup scripts

Bruce: If we're telling users to check their scripts for renamed directories, we should tell them what else to check as well.

[Much discussion about trying to figure out the difference between a crashed data directory and incorrectly created backup]

Magnus: We should include links in the docs to trusted backup management tools and encourage users to use them rather than roll their own low level processes.

Magnus: We need to make sure people are aware that their backups are broken if we rename directories, e.g. by changing pg_start_backup() so it barfs unless they update their scripts.

Action point: Finish sanitizing the backup API (Magnus)

By a show of hands, most people favour renaming pg_xlog/pg_clog and risking breaking user scripts.

Heikki: I have no issue renaming pg_clog as that shouldn't break anything

Kevin: One user of ours had a filesystem configured for a huge default allocation size, thus pg_clog took a huge amount of space. Not really something that would break though.

Action point: Submit patch to rename pg_clog/pg_xlog (Bruce)

Action point: Allow tablespaces to use relative paths to avoid issues during testing with multiple instances on one box (Andres)

Testing

Tom: We need to start thinking harder about testing infrastructure. We have buildfarm and isolation tester, but have no performance farm or crash safety testing. Would be good to get Heikki's test tool into common use.

Dave: Wasn't Stephen (Frost) working on the performance farm?

Joe: I think it's still on his mind but don't know much more.

Heikki: I found my tool useful when hacking on xlog stuff, and found some existing bugs. I ran it just before 9.5

Alvaro: We could set up a buildfarm animal to run the test.

Heikki: We need a data generator to do this testing for different index types etc.

Alvaro: In BRIN page evacuation is not tested, but other coverage is complete. We should have a machine running tests constantly.

Heikki: Didn't Peter E have something running?

Kevin: It only ran make check I think, not make check-world.

Heikki: We need the workload, and the regression suite which we should keep adding to.

Kevin: We don't want to overload make check to do this, but what about make check-world?

Kevin: We could have a numeric level for make check, to add more and more tests.

Tom: This may require infrastructure that machines don't have, so it doesn't make sense to use the same targets.

Alvaro: Michael Paquier had a patch to run TAP tests with a master and standby

Andres: I had a test for multixact testing but it wrote ~500G of data. Is this sort of thing worth keeping?

Kevin: Yes, so we don't lose it. But in a different target.

Alvaro: Need to ensure modules don't write data to the same path. May need a BF fix.

Heikki: We need to ensure these tests don't bitrot.

Heikki: I also ran a test for SSL stuff, but that's probably broken now.

Andres: We can ask Andrew to fix that.

Heikki: We didn't do that because it uses TCP connectivity, which is a potential security issue. We could have animal owners enable if they're happy.

Joe: Would be nice to have an easy way to identify "special" animals on the buildfarm.

Simon: Maybe a different view of the BF database to show animals doing certain tests, e.g. all SELinux tests.

Magnus: Access to the BF database should not be an issue for known community members.

Peter: Jeff Janes had a useful test for UPSERT. Originally simulated torn pages, then checked everything was consistent.

Heikki: Would be nice to polish that up.

Andres: Over three releases that test has found bugs

Craig: I'd like to look at using Docker or KVM for simulating power loss

Heikki: Many of these tools are only interesting whilst writing code, and not in the long term

Kevin: But someone will likely modify that code again in the future. I'd like to see them run at least yearly as a check.

Joe: make annual check? :-)

Various: Stress testing often needs to be run for periods of time before bugs are seen

Action item: Heikki to look at polishing his test tool (Heikki)

Action item: Alvaro to push Mr. Paquier's patch for recovery testing (Alvaro)

Action item: Alvaro to set up a machine for public 'make coverage' html reports (Alvaro)

Andres: Concurrency primitives are woefully untested. This is difficult because it can take a long time to see, and we often don't run such tests on non-Intel.

Dave: And presumably this can be difficult on virtualised environments anyway, e.g. PowerKVM where NUMA node affinity may be configured in different ways

Kevin: Performance testing is hard. Machine config, kernel versions etc. can make a huge difference

Dave: We should do simple baseline testing on a per machine basis and do comparative benchmarks. Having both stable and daily updated machines could show things we break and things the OS vendor breaks (or fixes)

Tomas: I'm willing to spend some time on this work if we can get machines.

Action item: Dave & Tomas to look into getting some basic hardware and writing a framework to get started (Dave & Tomas)

Alvaro: Can we use the buildfarm?

Tomas: I don't know Perl

Dave: Neither do I. Something new in Python would probably be quicker and easier to write.

Magnus: The BF schema probably isn't good for this anyway.

Kevin: Flame graphs are very helpful

Tomas: I don't think this should be for diagnostics, just regression testing

Craig: Does anyone see a problem with asking some BF owners to run Docker or similar for crash testing?

Dave: Might be difficult to fit into BF framework

Heikki: Wouldn't hurt to ask people though

Joe: Has someone been doing fuzz testing?

Dave: Yes, Greg, with libfuzzer

Joe: Can that be scripted?

Andres: Probably not worth it - Greg may have exhausted the usefulness

AOB

Dave: Simon had a couple of topics, plus we may want a quick meeting of the security team

Simon: One item was the optimiser roadmap. Should we consider specific optimisation cases from TPC-H for example? We haven't even started on TPC-DS yet (which has 100 queries)

Heikki: Is the question should we bother because some of this work may be long and complex and not pay off in the real world?

Simon: There are various branches to this - sharding, materialised views etc.

David: Parallel query - we're adding more brawn, but not brain. Planner improvements here may apply only in a small number of cases, but have a massive effect on some OLAP queries. I feel like we're in an OLTP world in the optimiser, but moving to an OLAP world in the executor.

Simon: This is not to talk about specific decisions, but what we see happening and where we want to go, and making sure we go in avenues that make sense.

Tomas: We rely on costs being accurate, which reflect in some way on query runtime. We could enable some optimisations only when we expect overall cost to be expensive. David proposed 2 phase optimisation - do it as we do now, and if the numbers remain high, try again.
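The two-phase idea can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not how the PostgreSQL planner is structured: the planner functions, the cost numbers and the threshold are all hypothetical.

```python
def plan_query(query, cheap_planner, thorough_planner, cost_threshold):
    """Plan cheaply first; only if the estimated cost stays above the
    threshold, spend extra planning effort and keep the better plan.
    Returns (plan, estimated_cost, number_of_planning_passes)."""
    plan, cost = cheap_planner(query)
    if cost <= cost_threshold:
        return plan, cost, 1

    better_plan, better_cost = thorough_planner(query)
    if better_cost < cost:
        return better_plan, better_cost, 2
    return plan, cost, 2


# Hypothetical planners: a query is just a list of table names here and
# the cost formulas are invented for the example.
def cheap(query):
    return ("nested loop", 50.0 * len(query))


def thorough(query):
    return ("hash join", 5.0 * len(query))


small = plan_query(["t1"], cheap, thorough, cost_threshold=100.0)
large = plan_query(["t1", "t2", "t3"], cheap, thorough, cost_threshold=100.0)
```

The scheme leans on the point Tomas raises: it only pays off if the first-pass cost estimate is a reasonable proxy for runtime, since that estimate decides whether the second pass happens at all.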

Heikki: That sounds good, but let's look at specific optimisations first.

David: We have some of that now - e.g. left join removals.

Simon: It would be useful to begin documenting what we do already and why

[Discussion on specific optimiser cases]

Simon: We're not allowed to keep adding optimisations that keep adding a microsecond each, but where does that take us?

Heikki: We need test cases to know what we need to look at

Joe: Maybe for us we can just say "this will be long running - optimise the hell out of it" or treat as normal.

Tom: It's like self tuning - if you're searching one table you're not going to spend time doing join optimisation. If you have lots, you spend more.

Bruce: I want to talk about 9.6. Big three things - seq scan and join parallelism, and FDW sort push down. We have open items: parallel computation of sorting and aggregates, Peter's faster sorting, Tomas' multivariate statistics, pglogical, auditing, high concurrency performance, relation extension lock, snapshot caching, partitioning syntax and join pushdown in the FDWs themselves.

Action point: Bruce to add a link to his slides to the meeting wiki page (Bruce)

Agenda Items

Please list any agenda items below for inclusion on the schedule.

9.6ff Release Schedule
Future of storage (Álvaro Herrera et al)
pglogical, BDR and logical decoding (Petr, Craig, Simon)
Future shared infrastructure for tightly and loosely coupled clustering and multimaster replication (Petr, Craig, Simon)
- Multiple sync replicas
- Sequence replication, sequence access methods
- Exposing transaction management and lock management for distributed xact managers/lock managers
- ...

Possibly also to consider:

Getting in-core and buildfarm test coverage of replication/failover/promotion (re multixact 9.3 issues etc)
Better testing infrastructure in general (crash safety, performance, etc)
