PGConf.dev 2024 Developer Unconference

From PostgreSQL wiki
Jump to navigationJump to search

Notes for PGConf.dev 2024 unconference sessions.

Protocol Enhancements

Notes by Jacob C.

Starting statement: This is the Postgres Protocol, used by things that are not Postgres

  • We may have things that are interesting from the protocol perspective that Postgres cannot make use of
  • Would the community _only_ be interested in changes that are usable by Postgres?
  • Do we have to have something in Postgres if the protocol supports it?
  • Or can we add things that are useful for other engines?

Immediate controversy! (Animated crosstalk.) "it is not our job to make things easier for others" "but can't we?"

What if there's something that the client side does that the server doesn't? Something that's mediated by the pooler or other intermediary, and the server shouldn’t have to care?

Topology-aware protocol messages

It would be cool if the protocol carried information about the topology

  • Server could send the client somewhere else during switchover
  • Amazon added the ability to figure out the topology during failover, and the client could then switch on the fly
  • (Amazon also added federated authentication. Sounds like they added it all into the wrapper (JDBC))

Rolling restarts of pgbouncer would be nice, to shove the client to another bouncer

  • It would be awesome if the server could tell the client (or the pooler!) that it is going away

Hannu was talking last year about baking topology into the WAL stream

  • There is value in baking the WAL stream, LSN, timeline into the protocol
  • Clients could figure out read/write splitting, figure out what's going on through failover, and make sense of situations where the standby is behind a primary

How to Negotiate

Negotiation is definitely already possible via parameters and minor versions but we have to decide how to use them

  • Knowing the difference between a setting for the middleware and a setting for the server would be nice
  • Clients should be able to mark extensions as critical or not
  • If it's required for _everyone_, then we bump the protocol version

Impacts on the Ecosystem; Software Lifecycles

Are we talking about something specific or something general?

  • If we create protocol negotiations, we need to standardize, based on actual use cases. It's hard to design without killer apps
  • Capability negotiation needs a reference implementation

How to frame negotiation on the wire

  • Sybase uses a bitmask for feature negotiation
  • Text-based is flexible but verbose; makes it easier to have concurrent innovation across the community, without stepping on each other

Can we implement client-only features, and mock the server functionality without implementing it in Postgres?

  • Other drivers can pick it up and implement it; the middleware can handle it until the core server gets support

Amazon wants to be able to use the actual Postgres driver and be done

Grab Bag

Multithreading? QUIC?

DataRow is too big, too verbose

Could we have an asynchronous notification interface?

Jelte wants to be able to change session-level settings during the pooler multiplexing

  • Is there still use for all these things if the pooler were to move into the server?
  • Yes. We will still want to bounce connections. We will still want to control topologies on the client side.

It would be nice if The Protocol lived in one place in the server

  • and it's confusing that the server side has something called "libpq" that isn’t really libpq


New testing frameworks

Jacob Champion: Working on a new authentication type and want to test it, but testing it was very painful. Kept hearing people say "I wrote a thing in pytest" or major features where the mailing list included "I don't know how to test this thing; we just won't test that part." - Don't want to make this a rant about Perl. - Don't stop writing tests in Perl; it's better to have tests than not to have them.

Tristan: Pytest is pretty great; could another test harness be introduced into the tree?

Michael P/Peter E.: IPC::Run is unmaintained and runs poorly on Windows.

Melanie: IPC::Run ... I was writing a TAP test, and I wondered if there are other frameworks that would have make bg psql easier/more reliable. More magic stuff would be great!

Jelte: In general Pytest works a lot better/more usable than the Perl test. Hard to know what failed, what log file to look in, etc.

Michael P: Async tests even harder. Want to be able to maintain a connection, do stuff, then use that connection some more to pull data back.

James: Does everyone else find regress tests painful?

Melanie: What is inherent to TAP? Jacob: TAP is just the protocol. Our stuff is based on Perl Test::More...

Jelte: The output of our tests is terrible. Peter E.: I wrote it. Conversation leading up to is was basically: We only have regress, and we want to be able to test X. Debian had written a higher level testing thing, and that worked. And so we took it. Back then Perl was already a build-level dependency, so the argument was that we shouldn't add a new dependency. Some of the pain is that we needed buildfarm support -- changed to output format that pg_regress had, so that the buildfarm client could read. Afraid that we'll have to do the same problem with Pytest. The weird files are to tie into that.

Jelte: I don't mind having the files; I just want to see the output also.

Peter G.: I have a bash script, point it at a .diffs, and it brings it up in my editor...

Jelte: TAP tests have a million files...

James: I brought up regress test output because having to match stuff between files on arbitrary lines is so painful.

Melanie: That's not good, but TAP is so much worse that we have to focus on the worst thing.

Heikki: Isolation tester...was based on a Python test we had, but we didn't want Python, so I rewrote it...

Melanie: Two categories: - We hate the output. - Things you can't test. - Dependencies.

Tristan: Problem with Python dependency stuff is that you have to install Pytest to get it; Python has unit test framework built-in.

Peter G.: What's the scope: I have frustrations testing things when implementation details (like alignment) might affect my tests. Culturally we don't seem to see much a place for that; tests are assumed to work the same on every platform. For coverage purposes there's no place for writing tests that exercise 32-bit...would we be willing to shift to having a place for that?

Jelte: We have tests that don't work on Windows.

Heikki: It's hard to write these kinds of tests and make them conditional on those things.

Jacob: I want you to be able to: - Write tests before the features. Perl doesn't prohibit that, but it's so people don't, and our tests are so focused on E2E acceptance tests, that you tend to write the feature and then backfill the test. - Get code coverage organically. - Isolate failing tests -- be able to run a single test. James: Also the tests themselves aren't separate/isolated from previous tests. - See what failed. Know what line, what the line contained, what the comparisons were, have diffs... James: Introspection by default (RSpec does the same thing). - Compose tests. Pytests has fixtures -- things a test depends on. If the current worker can't satisfy that, the test suite can skip those tests. The test suite knows how to spin things up and tear them down.

   - James: Convince it will be fast enough.
   - They can be cached for multiple tests; ability to run a single test reduces amount of time spent during development even if the whole suite takes long.
   - Can also compose tests with tables (multiple parameterizations).

- Control over peer behavior: mock out network layer. Perennial problem where we can't test a client that doesn't support GSS against a server that does, because you can only build it once. Want to be able to test "A server shouldn't generate this, but what if it does."

Generating certs would be nice. Peter E.: There were discussions about pregenerated and committed certs or generated certs would be better. Maybe we need both.


What do others want?

James: Mocking, unit tests, spy on methods, pass in a Path tree and see what happens at a certain point in the planner, ...

Jelte: Couldn't review pgbouncer PRs because the tests were bash, and it's not possible to read it and know if it's testing the things. Peter E.: And those tests are more readable, faster, easier to run. Need facilities for low-level testing. Need to keep in mind the overall runtime, but I'm not worried about in this conversation.

Tristan: Developer tooling is a huge market now, because it means people can get things out the door faster, and Perl has no developer momentum, but Python has a lot of that: language servers, a great debugger, editing code I can go to the definition.

Peter G.: I would love a usable Perl language server, but there isn't.

Tristan: Also Perl is inherently untyped; I'd love it if we had a static type checker on a new test suite.

Heikki: We should go with something with a huge userbase already, and Pytest seems to be the most popular one. We'll have to build a bunch of utils on top of it to do things like construct buffer pages, but that's like anything else.

Michael P.: I like the ability to spin up and spin down rather than initdb all the time -- it improves the runtime.

James: Can also do test randomization to expose isolation issues.

David W.: Seems like there's consensus on what we want; maybe there should be discussion now on how to get there.

Jelte: What's the minimum we need?

Michael: Spin an instance, set configuration, take backups.

David W.: Need Python first, and Pytest, etc. Then you need the base fixtures.

Peter E.: There's a bit of a mess in how you install it. But "pip install pytest" is similar to having to install IPC::Run, so I'm not too worried about it.

Tristan: Something that came up on the list before is that safe_psql calls a psql client, and what if we had a Perl postgres client that we had bindings for, so we don't have to spin up a client each time.


Build Farm

build farm -- Andrew Dunstan


in the beginning no other oss projects had CI

we had a problem, something would break on some platform and we didn't find out until much later

command prompt provided first servers

we had a client and a way to send requests and that's essentially the architecture we still have

now the qestion is do we still need it?

  • now there's CI frameworks that test many of the same sorts of things
  • Robert and others have said they are frustrated they might break one or the other
  • in a way we have 2 build farms, that can be confusing
  • there's things our build farm does that a CI frame work doesn't do. like cross version upgrade testing. this is hard to do in CI frameworks
  • when something breaks, finding out what went wrong in CI frameworks it's difficult
  • on buildfarm animal we can repeat the experiment and attach a debugger. need owner to get access to machine, but it's harder in a CI framework

my answer would be we still need build farm and we can customize it to our needs


are we doing it the right way?

well it's old.

lots of perl, couple thousand lines of code in the buildfarm client

couple hundred LOC in the web app, it's showing its age

i don't have an answer to that but here's where i want to get into discussion

what else do we need?


Robert H: the thing about the build farm that i think is frustrating, other committers i talked to feel same way, it feels like it turns into a 5 alarm fire when you break the build farm

whether we change that

  • culturally - we don't care
  • or timeframe for fixing is longer
  • or we fix through testing - we can test before pushing to real branch

i don't know the fix but i think there are multiple committers who won't push anything on a friday

Peter E: or in an afternoon

Robert: if we have committers who rule out half of 4 days and all of the 5th day as times to commit patch, then we need to find a way to not turn it red or have people not get upset. otherwise we're throwing away lots of human bandwidth


James Coleman: as someone spends most time on app development and not on PG dev, i want to describe scenario from our perspective

someone opens up a PR that targets main branch, you are not allowed to merge until all CI and tests are green on that build after patch is merged


the way build farm works, you don't find out you broke freebsd until after you merge and tests run

running tests before merge doesn't guarantee, for example someone has a test assuming year starts with 1970 and year rolls over. can't fix all of those but can get the 90% of cases where you know it's green before you merge

you might have 5 bell alarm occasionally but a lot less



michael p: i worked a lot in the last 5 years with CI

you can enable on your own copy of PG, you can make sure not to commit until it's all green

this saved my ass multiple times

i catch a lot of things


James: merge is better because it has to run on latest code not just your latest pull



Andreas: there are some tests, the worst of that is pg_indent breaking things. we should fix that by testing in CI

there are a few tests that are bad for CI - cross version pg_upgrade doesn't make sense in CI

there are also the free cycles we have from CI. i set it up so you can give repos more bandwidth

we have to think about how we want to allocate that, i don't know who would manage that

if it's a problem it's not that hard to do that for limited people, for large group we might have higher costs


topic that came up recently on MLs - robert brought up how valuable the commitfest app is

bc there's a lot of stale patches that don't get attention from original authors

previous topic - the only way to run against build farm is to throw my patch on the commitfest app in order to test on those machines

which leads into this - is the commitfest app doing what we want it to do


andreas:

clarification: commitfest is not run on build farm

there's a readme but it's not discoverable. maybe i should add it somewhere more discoverable

it's been an issue for awhile. there's a lot of info we have around and nobody that's new to postgres has a way to learn


david w

the build farm is ok but with things being opaque were losing out on ability to have people work on stuff

i translate a lot of this as i want github PR where i automatically get CI run

whether we want to use github idk but it's a pattern that already exists and is tied to testing


how many people would like us to work with PRs

maybe 3-5 hands (not many)


it doesn't mean patches don't go to mailing list

what's next?

james: you don't need to go that far

even if we just had if only committers had access to git repo and committers can open PRs and use that to run things and hit the merge queue so it's guaranteed to run in order

whether or not that's desirable you can have PRs that trigger that behavior without moving everything in that direction

Rob: i think there's a potential pitfall here where we create processes where i feels like we're asking commmittter to do one more thing on top of everything else they already need to do

but point we're swimming against the grain is also valid

i had to do a PR one time and thought it was hard and i didn't like it but if i did it more often it would get easier and probably better ways to do it

bt i's good we're talking and thinking about this because original point about practive vs reactive is exactly right

if you found out you WOUL D have broke build farm and it didn't happen nobody is upset

if you find out you DOD break build farm higher probably someone didn't ike it

Dave w: i think there's opportunity to reduce committer overhead with PRs

i made a patch to do docs i didn't know anything about the SGML and i broke it and someone had to explain to me how to test it

then i can go for help to the lists i'm not waiting for committer to test and it turns red, i know sooner it fails


basic question - where do CI cycles come from that get used to test in CFF?

free tier cirrus CI at the moment we use credits from google

CI is common thing in multiple projects i've used it myself and it's way easier than what we're doin gnow

second i get visibility into what the CI is

instead of you telling me it broke something then i find it and fix it

CI makes it much easier for everyone to work through


is build farm testing more platforms and environments than cirrus, and weirder ones. more and stranger

if there's a way to get those testing through the CFF app

we can get platforms added

there's stuff we can't test in CI because we don't have availability

you can' run aix on a vM to my knowledge

i don't think there's any power CPU you can get at the moment

there's stuff like that

i don't think number of breakages like that are a big problem


i think i might be worth considering github actions in the repos you don't need to sign up for cirrus CI there aren't limits on compute as long as repo is public

there are limits on platforms

this discussion is which is the best CI providers

github is the defacto standard for most users and projects these days you don't need to sign up or pay anybody


to original question - of build farms we can find a way to make sure most/all patches go through CI before they get committed

it will also mean breaking it is no longer this massive fire bc it happens less freq and weird platforms

it will get rid of some of that build farm emergency pressure


build farm stops testing at first failure, that's not a problem anymore

when we test with meson we run all the tests

if it breaks 4 things you're likely to see 4 failures so its not as bad as that


you would need probably more bandwidth than you have on those machines

the way CI uses VMs i think you randomly run code from somewhere it's a bit more complicated you can still make it work but its a question of bandwidth it uses a lot of cores


a thought i had as peter was mentioning what CI infra is best github actions runner code is available so you can host it on your own hardware if the build farm is able to host github actions maybe you could test more of those platforms using a consistent API consistent CI

we have at prev talks allowing uncommitted patches on build farm through some alternate branch or whatever but the problem with that is as a buildfarm owner myself we aren't really volunteering to run stuff that hasn't been vetted by a committer

CI is run by people who have practice running untrusted code

buildfarm on an owner's machine is exposed so i don't think there's support for running besides master branch so there will always be cases that look fine in CI but we throw it on buildfarm and not so fine

there is provision in the client now for supporting other branches so you can set the branches to build a regex and point it at particular repo maybe we can point it at a repo where committers will commit experimental patch the original intention was to support security patches before they got committed on the client side the facility is already there you just have to trust the repo you're pulling from arbitrary patches is not the way to go given our distributed nature


i have another topic in prev session we talked about new testing frameworks that has always been a point of struggle with the buildfarm client at least at one point has hardcoded list of tests that exist you had to way until update list

if were going to add any more stuff more of the struggle in prev session was funding actual cause and outputs of test cases build farm client is not the best way

i don't have an answer i'm just relaying this as a point point we prob need to do some work there in the client or in the interface to make that easier

one thing that's also with CI at the moment is collecting all the things i was thinking to make it easily visible i was thinking a script that collected an overview of what happens and put that in the repo instead of build farm separately rather than different collection things might make somethings easier


what else do we need?

i was doing analysis of build farm meson versions from what i understand the farm has a pg instance where you can query logs committers have access but nobody else

rob: i don't have access, very imp people have access


i was going to write web scraper but clicked on hooks it would be great if there as a read only instance select * from build farm


how big is it - can we publish a daily dump?

many TB history going way way back

reason it's so big, error collection it stores about everything logs are 10s of MB in some cases

if you only collected metadata maybe thats more reasonable to publish


i think the idea of a web interface is good brings me to point 5 - i'm a single point o failure in all this

i'm looking for volunteers to help me

a good place to start is a web app that we enable on the read only backup to let people extract stats and oher info from it

and i don't personally feel like writing that app if someone wants to become part of the effort they'd be welcome


one idea i had and i know asking build farm owners to do anything is a huge ask but for instance on the build farm website many animals list compiler version an dlinux versions its freeform metadata

it would be useful to me if it was more structured such that i could write a good web scraper to get this info instead of seeing how many different ways people write debian 10

client interface to generate json description of environment so i don't need to check logs to see what version of perl and gcc version

id be interested to help contribute that if owners would be happy to export

you shouldn't put a. lot of faith in those manual tags if i want to know compiler version i check configure log to make sure it prints out the real deal i don't think we have os version thats not true the uname output is accurate for platform

but yeah it'd be nice of that was more structures

we could parse that output server side its really expensive

i don't think parsing freeform config log is the best way to do it we have all the artifacts on the build farm website all we need to do is add another file meta.json generated on check world or whatever they run such that its' appended to the upload

separate script we can add a little piece to setup we can do it for every run we'd probably throw away or update personality automatically

we want to keep history of when it changes then you can say bug showed up when it changed

going back to we interface presumably we're going something you can run arbitrary queries on

its going to be hard to let people run queries wo letting them break everything but you can build interface that lets you do useful searches witout running arbitrary queries otherwise you ahve to ask andrew for sql access

common things people are doing is looking for a specific failure limited ot certain artifact or time period or looking for error message if you can do top5 things people want to do

if we can think of things there's already some query interfaces adding a few more columns there would be depending on your perl

seems like a few people here are interesting in contrib improvements to buildfarm client can you explain the process deployment mythology and help these people gets tarted

theres two ways they get started they put a PR on github or they email me a patch

is there a way to test? setup own server? there is a separate repo which has got a vagrant setup so you can setup sample server

basically the way to test stuff is setup own copy of buildfarm client you dont have to be official

thats to test client

but if you want to test server thats totally different we should probably do it with docker i thin i had a docker file but last time i went with vagrant setup

how does build farm push to animals? it does not get pushed we publish and then owners have to fetch from github or server

they have to remember? we send out an email every time theres a release

would there be any appetite in the modernization of the web server? to rewrite in python? bing in perl puts a lot of people off should be possible nothing terribly magical i'm absolutely open to it


one thing would be good work towards everything build farm does be in common scripts or code its painful if you have failure bc script runs in a way you cant locally repro bc you cant run buildfarm script failures can be build farm specific

i'm not sure how were going to do that

one example is the type extraction use to be only in buildfarm ode but now

yeah i want to get rid of that

another example we used to have a lot fo stuff to adjust things we were doing upgrade for we migrated that it doesn't belong to build farm anymore more stuff like that


any more discussion?

one thing a lot of people other than tom are not aware of as he mentioned you can run build farm client yourself it doesn't have to run you can upload a result against your own uncommitted patch using argument from0source on command line on your local dev environment i use it regularly on eat own dog food principle it can be quite useful it does not upload results


reason we don't want to use github actions logs are not readily accessible we cannot link to it because you need to be logged in

would it be helpful if we made the default github workflow here ot enable the full CI workflows so you would get a new fork get told what to do rather than having to find that readme

probably yes

it would require a .github directory nate is about to add this anyway

you put a file into this directory if you then tell some information theres discoverability you have to know you can enable it

do we have many people who fork from github repo? it's not material for that purpose

we do get quite a few patches on the github repo which all get automatically closed


well the last point i want to make is that i'm not going to be around forever so if someone would like to get involved take on some of the maint

i should be grateful and welcoming

Increase Community Participation

Session leads:

  • Stacey Haysler
  • Joe Conway

Core premise: We need more people. We ALWAYS need more people. How to get more people involved?

Challenges

  • People aren't aware there even ARE opportunities to help (there is no central list, and we often forget to ask)
  • Even if someone is aware of an opportunity to help, it's often not clear HOW to help (nor what the expectations and/or time commitment are)
  • People assume contributing code is the ONLY way to help (which couldn't be further from the truth)
  • As a result, the contributors we attract tend to be developers, not folks with complementary skills (Business Analyst, Marketing, Event Planning...)
  • As a result, there's a lack of adequate support for non-code contributors, and also a lack of recognition.

Attracting and retaining new contributors

The point about calling for volunteers is more important than someone might realize. We often don't ask. As a result, we don't create opportunities for new folks with desire and energy to get involved, and the core leadership suffers from burnout and feeling they need to stay contributing unsustainably, indefinitely.

The volunteer organizations I’ve seen that have good turnout are SPECIFIC about their asks. Trying to inspire an “I can do that!” feeling.

Ideas:

  • Spell out the various roles/opportunities to volunteer in a central place (as discrete, "well-scoped" as possible)
  • Call out—in particular—forms of non-code contributions (conference organizing, fundraising... [Joe had a whole list on a slide :-)])
  • Call out specific skill requirements
  • Call out specific time requirements. How long do I need to do it? What’s the time commitment? ("average time is X per week," aim for a small value of X. :-))
  • Clearly spell out how to get started (e.g. go to X web page and fill out X form)
  • “Urgency” might also be good to highlight - we need help in X area right now: We’re trying to get a release out, etc.
  • Consider anointing one or more "volunteer coordinator" folks to do personal 1:1 outreach and help get new contributors started
  • Spell out what folks get OUT of contributing — networking opportunities, consulting opportunities, career development... describe the impact
  • Maybe equivalent to news on the main website, “Help needed”
  • For code contributions, we used to have a "TODO" list (apparently still do in the wiki, but it's unmaintained). This was a place that listed things that people could contribute to, and then they were moved to "done" with the person's name on it. Needs to be maintained. BUT maybe this could be a volunteer opportunity.
  • We have a variety of newsletters, social feeds, etc. that struggle for content. Summarizing what help is needed _now_ could be a recurring segment.
  • Explicitly call out "snack sized" contributions for someone with a smaller bit of time to contribute. (e.g. folding name badges at the conference)

Bottom line: Make the process easy, for both the volunteer AND the person who's managing volunteers.

Also, there should be a concerted focus on attracting non-code contributors. Vision: What if our conference organizing team staffed with admin staff, professional event organizers?


Recognition

How do we reward people for their worK? Recognition. As humans we need it, it’s part of our wiring. We don’t do this a lot, outside of conferences, a new feature announcement. But LOTS isn’t recognized. We don’t announce / make noise about the helpers — make a big deal of them.

This is important for non-code contributions in particular. We need to RECOGNIZE these types of contributors the same way as code contributors (names on page, coins…). Folks involved in events typically get on stage and get applause, but outside of that not much recognition here. We need to find a way to better show those things are important, too.

Ideas:

  • Extend community directory to also cover folks doing things "around" the community, not just those in technical roles. Helps an incoming person with those skills see a) those skills are valued, and b) who they might talk to
  • Trophies (developer coins)
  • Do that for non-code contributors as well (we do, a small amount)
  • Gifts, swag — you like tea? here’s a box of tea!
  • A contributor recognition feed on the blog that’s like “every week we highlight some contribution that someone did” something different every week: someone who contributed a patch, someone who organized an event, someone who did X.
  • Once a month, create some sort of summary of “this thing was done by this person, that thing by that person” — could be syndicated in newsletters / social.
  • Syndicate content from people already highlighting these things on the PostgreSQL website. (5mins 5mins of Postgres by Lukas Fittl — "Give that guy a medal", PostgreSQL Person of the Week ) Combine forces.


Burnout

How do we support the people who ARE involved? Burnout, neglect…

Ideas:

  • Check-ins to see how people are doing (individually or as a group)
  • More intentional "off-list" time to connect on a personal level — zoom meetings, conferences
  • Retrospectives — here's what's working, here's what's not, and make changes
  • Make sure people have what they need to do their jobs
  • Consider making requirements less stringent for less security-critical roles (e.g. sysadmin vs. wiki editor)

Stepping Away/Retirement

We need to allow people to step OUT of roles: Retirement / change fields / win lottery — what do we do? Folks should be allowed to develop and change.

So used to seeing certain people in roles… Ok that’s taken care of! That person always does that… Not fair to that person in the role.

Succession planning… not everyone’s going to do everything forever — who’s going to take care of that?

Engaging end users

How do you get someone who’s kinda interested in PostgreSQL, but not thinking about it like us, engaged in how to become part of the community instead of just a “user” of it?

PostgreSQL is used by ALL these people… but only 100 people from NY in the NY conference, for example.

Would love for many people to feel like they want to be involved in the COMMUNITY. Not just the stuff around code.


Training/Mentorship of Developers

The stuff above about non-code contributors is true, but also realizing more and more that we’re not doing a good job to help people to become successful developers, either.

People who are doing the work today are kind of the people who’ve succeeded in spite of the obstacles, vs. following a good process.

Part of that is because we don’t *have* a process. We have a mailing list. Show up, make a go of it, if you succeed.. congrats! You’re a Postgres developer! :D

Imagine if your job worked that way. You’re shoved into a room with a bunch of people already having conversations… You do a good enough with conversations you get paid, otherwise you're out.

Not going to have developers if the experience is you get thrown into a swimming pool. :(

Even people succeeding at it think it sucks. Everyone says: "I’m glad I did it because of the outcome, but didn’t like what it took to get there."

Mentorship can be very rewarding. They’re motivated to learn something from you… improve/get better. Those interactions tend to be higher-quality on average. You know who you’re dealing with, agreed to deal with them. Versus son the mailing list, anyone can show up. You need to guess what level they’re up so you don’t patronize / go over their head. One of my favourite things is interacting with people where you feel there’s something really constructive happening… you impart knowledge, and they’re grateful, excited.


Ideas:

  • Some sort of formal mentorship program (this can actually be extremely rewarding for experienced people, too!) Just be mindful of "opt-in" so people who want mentors can get them.. but don’t feel insulted … and that mentors AGREE to mentor.
    • Perhaps have an "intermediary" such as a Community Manager playing match-maker here.
  • Build a closer relationship with extension developers - these aren’t just the next generation of PostgreSQL hackers, but they’re the people causing _other people to use pgsql+ and are a big part of why pgsql is successful — we describe pgsql as the “extensible database”
  • Put together some sort of formal training for new people (junior -> senior), cross-training.
  • Another mentorship idea: “Mentoring ring” — 5 people who got together every 2 weeks, pick a topic, talk about it… in a small group, can have vulnerable conversions about things you’re struggling with… I’ve talked to engineers “I. Know I should mentor, but don’t know how, don’t know if I’d be good at it…” — this could be a way for people who haven’d one it before to try it in a low-pressure way.

Joe's walkthrough of proposed website changes

(Hopefully there is a link or slides or something we can put here, it was cool :))

PostgreSQL has a Contributors Committee. This is a group of people who decides who gets on the PostgreSQL contributors page. They have some “draft” / “proposed” changes to make.

Thought process:

At FOSDEM, it was raised that there's no page that tells you how to get documented on the contributor page.

Started documenting that, and started calling out specific kinds of contributions. Talk about the many things people do in order to become recognized.

This is just documenting what happens today, but got me thinking… what should we be doing to make it better?

  1. Create a central directory for each committee / team involved in project governance. List who is on the committee, provide URLs + short descriptions of what they do, and clear instructions for how to volunteer. (Even though I was involved for 25 years, I have a hard time finding it (Some don't have a page anywhere, some are on the wiki, some are on the website... no central way to find it)
  2. Move governance team wiki pages to the main website for better coherence.
  3. Make use of "badges" on the contributors page. For example, Tom Lane could have badges for “core team” and “committer” and “security team” and … We could also gave badge for “stewardship” (non-code), “code contribution” (non-committers) — badges could be for roles, and also activities ("Release notes for PgSQL 16")
  4. This also gives someone the ability to retire. Your name/badges are still there, just greyed out.
  5. Use automation to help surface "long tail" contribution activity. Debian community has software that's very activity-based. Scrapes various feeds of data on the web and feeds a board to see who’s actively contributing in past 12 months. (Right now, in order to get on the contributor page, you have to slog through contribution for many years until other people feel like you should be there.) This would be a way to recognize people who are just showing up, and submitted patch that got committed, or volunteered at a conference… Show lots of measures in an automated dashboard.. check 7 different things.. not just mailing list, lines of code.. capture a wide range of peoples' contributions
  6. Call out "Contributors" vs. "Major contributors" who are tenured, been around a long time… we can recognize them in a different way.

Feedback?

Observability

(Sorry, this one is rougher, as I didn't know all the terms. :\)

A "Self-Driving Database" is the holy grail of the database field — a database that can identify problems before they happen and respond in real time to avoid failure, without the need of human intervention.

But in order to have this, you need ``perception`` — current + historical values of the types of data that might point to a problem — in order to make decisions.

What we want to do: PostgreSQL has the ability to be monitored itself.

Challenges

  • Right now, there is no single way to track operational metrics in PostgreSQL. Various bits of info are scattered all over the place… global variables, shared memory, hash table, procarray(?), logs, etc. Some of them functional things used for some purpose (tuples), others kept as metrics. The only thing that makes them into metrics is the reporting/stats table / views.
  • Because of this, people need to come up with their own solutions if they want the ability to monitor the health of their PostgreSQL system. This adds complexity and administrative overhead. Don’t want to say to people: Just to get good observability, you need to load 20 different things.
  • Kicking the problem to third parties / extensions causes additional challenges — if you are monitoring RDS, and don’t have access on the node itself, you don’t have exporter metrics, so how can you collect this important information, in addition to pgmetrics? How do you get CPU core if you have middleware living somewhere else? System catalog?
  • However, built-in has challenges, too... If it’s on a port, you’re going to be able to iterate so much faster… You won’t have this much churn between existing minor version of PostgreSQL. Has to be stable. You want to build and build and build on it. Makes more sense as an extension, so you can guarantee compatibility with minor versions vs. forcing an upgrade.
  • Another challenge with "side-car" approach (there’s a talk about this at PGSQL NY) is if it’s an extension / external project you can’t write tools that rely on these, because they don’t know which of these metrics, how they’re going to be defined on your system. Whereas if you had a basic set of metrics that core is exporting under stable names, running PGSQL 17, you have EXACTLY this set of metric and what they’re named, you can rely on them, people can build tooling around this.
  • We have lots of tooling that can get one or another information… but don’t have a framework we can expose to end users. Most end users never use this info because it’s in a format more relevant to developers, not end users. ALL consultancies are always bolding a set of scripts to wrap around. (e.g. a script that shows this replica has this lag from this primary)
  • PGStat Activity is too noisy... SELECT * FROM pgstat_activity, it’s practically irrelevant.
Need to “weight” statistics, in some kind of build system.
  • Database crash: Interesting thing about that state: almost nothing works. Can’t access tables, catalogs… system is in inconsistent state, and needs to get back. An external exporter can’t connect to the database yet, and cannot scrape metrics. But a built-in exporter CAN recover these metrics and tell the exterior world about problems. (If the thing we come up with doesn’t work in a crash recovery situation, maybe shouldn’t be pursuing it?)
    • However, if you want THAT level of tracking, you need something very, very low level. NOT a regular connection. And beware of how much info is captured/exposed there... you want "What's going on, when's the site coming back online" not for someone to be able to read raw SQL statements, which might have security implications.
  • Less severe case: System is under load. Existing tool wants to connect, no connection, system is still up, some connections being answered, but can’t get connection to actually monitor.
  • For large data, we have a large database which was never adjusted to be ___ properly. It’s a big deal to be able to guesstimate on the size of a BLOB. Right now we have to use approximate methods, scan the whole table, and get estimated results…. If we want proper results, it’s a no-go, because it could be multiple TB and we have to visit each table to get this.
  • A lot of really good information is only showing up in logs, and not captured for end user consumption. No one thought to put in a view, or they did but the rest of the code didn’t get the memo, so just treating it as a functional thing, and didn’t realize it was of interest to developers.
  • Vacuum gathers stats about what Tuples it cleaned up.. that would be super useful to get into Prometheus, but not, because not part of stats framework.

Possible solutions

  • Take observability metrics and put into observability stack. OpenTelemetry is the default standard: Prometheus, visualized in Grafana
  • Use PG Exporter / PG Bouncer / some middleware to extract stats from cache log and put entries into a separate time series database for analysis
  • Have an accompanying Go application with a built-in metrics endpoint at /metrics which exposes internal status directly to observability stack
  • PGLoad(?) — quite similar to CPU usage, but is session active time % CPU core.
  • Extension in pgDb wrapper ILPHp statuses(?) Used to report health statures to the master, replicas… in this approach it starts a background worker, and you can get internal status
  • Make existing PG Exporter as a background worker. Listen under another port (9999) and CURL from it
  • All the metrics, configure them through internal config table, grab the metrics, define dynamic labels, find metrics lines. Do this directly from Prometheus, don’t need to maintain extra stuff, and even RDS could use it.
  • Split metrics into two types: stuff that's moving fast (e.g. modeling, other stuff that changes a lot over time), and stuff that is "key" (you want available in any database — meta info on tables, views, etc.). Put the latter in core, have a mechanism for "registering" the former that can build on top of it.
  • Create user-friendly aggregation around the stats that are already available. Users find the current metrics tracking unfriendly, and ignore its existence. (Basically, distinguish developer-friendly metrics for benchmarkng and a more usable format for end users)
  • Debug logging so when system crashes you can simply start logging. Not another port. Not exposed security access.
  • Have some API/Structure… register all the variables/things that are metrics, update them going through some piece of infrastructure… everything that’s a metric then becomes something in the log, also reported as a metric, also shows up in traces AUTOMATICALY so we don’t need to think about it. What would such a thing look like? Some simple, some more complex (Per-rel ID, per-query, stats). Some simple abstractions would make things much more reliable through different channels.
  • See how others are going about this, e.g. Oracle AWR(?)
  • Add distributed tracing support. For example, DataDog gives ability to modify query to give trace ID as a comment on the end.
This is an ugly hack, but it halfway works. (Then you have to query pgstat activity all the time, and hope you find your query and can join it on something that captures a trace 75^% of the time…) PGSQL could add the same mechanism, have separate output (separate log stream, separate interface)

Sidebar: Vacuum

  • Possible to leverage Vacuum for this?

Vaccuum collects information operatively about table indexes, usage statistics… how much process pages, scan pages, durative(?) pages, which records were not processed, and/or deleted ... It needs to track a variety of concerns to send information to the replica. It is enough to have a picture of how the system works, for example for a month.

You might have a stable stats with the table over 4 weeks… but then one day when we collect a report this table has a lot of updating / deleting processes… vaccuum needs to process data more carefully and longer… Administrator needs to see these changes and needs to make a decision to just use data for operation for this table, and change the vaccum setting. All the vacuum statistics that we have are around functionality and consuming resources, and they depend on workload of a database in my opinion. So we need to have access to this information to make decisions.

Vaccum gathers cumulative statistics for indexes, tables, and there is a view that helps us to resolve this information for databases.

(There is a patch on the list from Alena Rybakina)

(Sounds super helpful! Lots of things in log files that are not easy to get to.)

Sidebar: Another way to track metrics

I upload an export (pgexporter) it is insured by original pgsql exporter, but has capabilities to identify all the existing catalogs as metrics (pgstat tables). become dynamic labels: index scan, XXX///

If your system catalog is determined, your metrics are determined

Maintain file from 6.3 => 17 of major versions Whenever major version dumps you have to make a little change to metrics designation. But it is usually backward /forward compatible .

If something change and two columns… Change definition. Min version = 17

By this approach, you can define a stable set of metrics for your data catalog.

This approach works well… effective API because names, format already defined as system catalogs.

Open Questions

  • PostgreSQL core or extension? (Or a mix of both?)
  • How "deep" do we want failure detection to go? Down to a crash recovery situation?
  • What should we use as the "base"? (Logging Collector? Replication Protocol? Statistics stream you can subscribe to?)
  • (Probably also many other questions :D)

Next steps

  • Form an Observability Special Interest Group (SIG) (volunteers onsite wrote their emails on a whiteboard :))
  • Objectives:
    • Get these thoughts organized into some kind of organized list of priorities.
    • Bring proposal to the list. “This stuff would be very useful to all these people for all these reasons.”
    • Remember: We're driving toward a PATCH, not just more discussion. :-)
  • How will we work?
    • (Async) New mailing list(?)
    • (Monthly) SIG core members meet over video (summarize meeting in writing for those who couldn't make it)
    • (Quarterly) A progress update posted to higher-visibility channels (mailing list, website...)
    • (Each Release) Put out "beta" release notes to inform end users of planned changes ahead of time

JIT compilation

Notes by Matthias van de Meent. Topics are discussed subjects, not necessarily in order of appearance.

The session started with a short introductory talk by Pierre showing a copy-and-patch JIT implementation; with no notes for that segment. The slides are available here.

JIT caching

  • We need an efficient cache key computation for expressions (rather than that we have to use expensive nodeEqual -related code).
    • Any hash that is unlikely to create hash conflicts should be sufficient, worst case we pick a cryptographic hash.
  • We need to drop pointers (?) from compiled expressions, so that LLVM can generate optimized code, rather than failing to optimize due to its use of pointers.
    • Daniel has a patch using offset-based pointers at the expression state, which would fix above issue.
    • Andres isn't very happy about the specific implementation; "it doesn't feel quite right". Andres would appreciate someone else looking at this for a fresh eye to find a potentially better way. (Pierre volunteers)
    Splitting out the modifiable state from expression nodes would be a good start regardless of JIT progress.
    Example: Pointers in expressions are bad.

Tiered compilation, tiered JIT

  • We need a better Tier 1 than LLVM, it wasn't a great choice for T1 when implemented, but has gotten way worse of a choice since. LLVM JIT compile time has doubled many times since PG11's first JIT implementation.
  • Costing:
    • Different costing? Need policies and changes to core PG to do costing well.
    • Execution count -based costing for triggering JIT proved expensive with overhead during execution (data: Andres Freund).
      Maybe implement timing-based trigger that checks if we're still executing after some time, and trigger background compilation if so.

How about: JITing of execution tree?

  • Does not make much sense right now: we'd need an opcode-based executor (rather than recursive tree walker).
    This would also make batching much more easy to implement.

Related forward-looking sidenote: Switching stacks when you're deep into a stack is expensive, but with opcodes switching to a different connection's state should be much cheaper as we wouldn't be in a stack. Thus, once we have threading and connection pooling it would be beneficial to have an opcode-based executor.

Cranelift

  • Rust is reportedly planning on using it for unoptimized builds, but stay on LLVM for optimized builds

New JIT areas

  • Sort code: "this shouldn't be too difficult".
  • Aggregation/hashing; combined expression vs. one JIT per attribute?
    -> Look at ExecExpr[ession] in loops
  • COPY ... FROM
    "Has significant issues with allocations, per-attribute copying, etc. Implementing a JIT would make sense only after a rewrite that fixes the architectural issues."
    • FORMAT CSV
      "This should probably be rewritten before JIT starts to make sense"
    • FORMAT BINARY
      "Has issues with type receive function API that is inefficient; we should optimize that API first"
    • Issues with receive/send functions, which seem to be optimized for max allocation count, not optimized for read/write performance.
      Maybe register send/receive functions with special flags in pg_proc for backwards compat?
  • EXPLAIN option for jitted/jit-able epressions/evaluations.
    Right now, this isn't abundantly clear from explain output; as JIT isn't just for projections: missing items include (but are not limited to) hash operators in e.g. Hash join nodes.

Multithreading

The discussion around the Multithreading implementation has it's own Multithreading wiki page.

Improving extensions in core

  • API/ABI stability
    • Core is focused on core ABI stability
    • David: No "statement of stability" in Core
      • David/Jeremy/Tom: coding guidelines, style guidelines
      • useful to have docs in core about what's stable and what's not, what you should compile against or not, and ABI guarantees
    • Abigale: there are hooks, but no overall concept for extensions
      • Tom: Peter E. is working on tests for extensions stability
      • Jeremy: nothing is preventing people from installing incompatible versions
  • Abigale: extensions manager needed, in Core
    • More smoke tests needed
    • Grant: extension update during major PG version upgrades is a major problem
      • Chris: does this require a compiler on the host when upgrading extensions
      • Jim: how do we solve problems with problematic extensions, like when they be need to be loaded in specific orders
      • Abigaile: there are tools to solve this problems with tests
      • Andreas (to Jim): you only need a dependency graph for problematic extensions
    • Abigaile: I want to categorize the errors
    • Kaiting: Are we also talking about a package repository
      • Abigaile: yes, ideally
      • Andreas: a central repository (for packages) will also improve security, by providing a safe central place
    • Kaiting: how do you maintain the dependency graph
      • David: manually
      • Chris: is it source based, or binary based
        • David: yes
    • Paul: Instead of having hooks, having rails to call hooks from multiple extensions
    • Chapman: What's left to be solved which PGXN does not solve
      • Abigaile: couldn't work with PGXN, doesn't work, using multiple extensions in PGXN doesn't work
      • Abigaile: currently it's not possible to test multiple extensions
      • Abigaile: have a description of what the extensions is modifying
      • PGXN v2 can solve these problem
    • Hannu: we can only do extensions matrix if we also cover extension testing
    • Jeremy: like that Abigaile started to categorize the extensions, and what exactly they are extending
      • That's right now it manual effort
      • Jim: instead of extension manager picking the end user, build it for the DBA, use PG.org
        • Works better, scales better, allows Devrim to build better packages
        • Chris: it solves the problem of solving other operating systems
        • Yurii: we can look at the `.so` files and use this for categorizations
        • Abigaile: wrote a script, listed all the hooks, kept track of the hooks and SQL functions to categorize
          • Can be improved, is regex-ing on SQL files right now
          • Being in Core has a lot of power which is right now not there
    • Kaiting: central repository, use a fork and use an extension to categorize
      • Yurii: this can be done with an extension
    • Yurii/Abigaile: Coalition
      • Group to be created to manage extension efforts
        • Write documentation how to write extensions
        • fork of PostgreSQL, if there is something not working
      • Yurii: the fork is there to test changes in extensions
      • David: there is a GitHub organization, maybe a mailing list
        • Andreas: how about the Slack channel
          • David: it has grown, but it feels like I am talking and others are listening
    • David: inline extensions
      • There was a patch in 2014 which provides inline extensions, without files
    • Hannu: Aliases
      • If parsing fails, instead of failing out, pass it to another parser
    • Hannu: other languages
      • Jim: pl/Rust is complicated, because it needs crates and compiler
      • Jim: TLE allows WASM without file access (whole package, including control file)
    • David Wagoner: immutability support
      • Christoph B. had a patch
      • Second dir allows immutable images, and a second dir with data and extensions
      • Hannu: having a virtual file system (Fuse) allows reading from the database
        • David: sounds complicated

Table AM API

  • Alexander: currently the row must be identified by ctid (32 bits block number, 16 bits offset number)
    • Some code path assumes that row versions are identified by ctid
    • Introduce `RowRefType` and `TableAmRoutine`
    • Allows ByteA row identifier
      • Some indexes can support that
      • Bitmap scans should be disabled, or replaced
    • Use cases:
      • Index/organized tables
      • LSM indexes
  • Hannu: why can't you just use the existing methods, keep ctid as it is, and add your identifier on top
    • Matthias: you need this for indexes
  • Jeff: it the proposal also to support update chains in read-commited level
    • yes, planned
  • Matthias: does it imply that you also work on a table AM that does not use block storage
    • Alexander: currently we have pluggable storage manager
  • Hannu: it's already a lot of changes, why are you trying to change the table am, and not try to change the foreign table wrapper
    • Alexander: don't have an answer
    • Jeff: can't have an index on it
  • Andres: concern is that this is a massive change, without a lot of benefits for most users, unlikely to succeed, stability issues, needs more benefits
    • Chris: what is the story if not pluggable types are supported
    • Andres: you also need to support PostgreSQL
    • Heikki: can index-organized be a first step
    • Matthias: don't start here, start with the APIs, this can support index-organized tables, and also removes Vacuum for index
    • Alexander: avoid a large patch, which changes everything
      • We will update this
    • Jelte: making a path for index-ordered tables is more useful for many people
  • Hannu: why are you using table am, not index am
  • Chris: we need an ability to tie certain planner nodes to certain table am, in order to get column based tables
    • Jeff: should we have more than one relfilenode per table
    • Chris: you almost want something which is vectorized, once you get into different table am types for different access patterns when it's no longer a heap table