PGConf.ASIA2018 Developer Unconference

From PostgreSQL wiki
Jump to navigationJump to search

An Unconference-style event for active PostgreSQL developers will be held on 10 Dec, 2018 at Akihabara Convention Hall, as part of PGConf.ASIA 2018.

Unconference Tokyo Round Time Table

Time Okinawa Hokkaido
10:30-10:50 Voting
11:00-11:50 PostgreSQL and Data Analytics Referential Integrity Using Statement Level Triggers
11:50 - 13:00 Lunch
13:00-13:50 Shared Buffer Manager Hypotehtical Partitioning
14:00-14:50 Partition-wise JOIN and Visibility Map Query Monitoring
15:00-15:50 Shared Catalog and Relation Cache Built In Job Scheduler
16:00-16:50 Per-tablespace transparent encryption Idle Connection Scaling
17:00 - Beer Elsewhere

Unconference minutes

  • TBD

Social Event

  • All attendees can join the social event.
  • Venue
 * TBD


Statement level referential integrity

Oracle-style import No way to invalidate then revalidate constraints Right thing also the last thing Checks fire once per row Kevin Grittner + Thomas Munro: statement-level triggers, transition tables benchmark: 98.x% faster no code, just a PL/PgSQL script REFERENCES: INSERT/UPDATE trigger on child, DELETE trigger on parent initial pass: LEFT OUTER JOIN code is already there! transition tables don't necessarily have names enforce constraints with a single table scan! important when _appending_ to a table

  • CreateFKCheckTrigger
  • SPI_register_triger_data

very seldom more statements than rows issues with the error message: just show the first row? doesn't make things worse ignore failed rows server-side; report them so app can fix them (different thing)

 different rejection tables for every failure?

Kevin Grittner: delta relations in ALTER triggers [pgsql-hackers]

  • commands/tablecmds.c
  • RI_Initial_Check

are statement-level triggers also `Trigger *`? what does the benchmark look like for one-row-at-a-time? make the row count tunable? how to upgrade? find all row-level triggers and replace them

 not a concern with dump/restore, but pg_upgrade needs to skip the initial check
 triggers have PG-reserved names; use that to detect and upgrade triggers
 pg_upgrade creates empty cluster; maybe no need to skip check?
 all that's missing is the magic to convert the trigger
 no catalogs are binary-linked
 pg_dump/pg_restore dump standard SQL; no syntax changes

switch from row-level to statement-level after X rows

 parallel query uses a queue already
 suggestion: have custom non-trigger mechanism

Query Monitoring (enhancing pg_stat_statements)

Common complaints:

  • cannot see actual values used in specific queries
  • cannot see the query plan that was used

pg_stat_statements today

  • dbid
  • userid
  • queryid
  • query

Brief explanation of pg_qualstats and how it works

  • high overhead to deparsing
  • lots of quals can be slow
  • sampling rate of queries deparsed is < 5%

Mention of pg_stat_plans and pg_hint_plans and


  • also uses query id
  • unclear LICENSE (issue opened)

Issue with auto_explain, no way to link it back to pg_stat_statements

Issue with how search_path can alter queries and cause pg_stat_statements to conflate unrelated but textually similar queries

Discussion of query performance variance and storing historgrams, and the limitations of those histograms.

Discussion of whether it would benefit us to have indicators of which queries had the most sequential scans, index scans, etc.

pg_stat_kcache, discussion of its utility, ability to gather OS-level CPU usage, I/O usage for the life of the session, which with the right sampling can be interpolated to the per-query level.

Discussion of tracking wait events.

Should queryid be in pg_stat_activity? PostgresPro already has an extension that does this: pg_wait_sampling.

Issue: canceled queries don't show up in pg_stat_statements.

PostgreSQL and Data Analytics

Karsten: Perspective from the user

1) Level of parallelism the query planner is creating is not optimal - Postgres does not create enough parallelism
2) Customers find database crashing when number of connections goes up (out of memory errors)

=> You can tune the Linux to avoid crashes and instead return out of memory errors (tell Linux to not overcommit memory)

Joe Conway: "What is many connections?"
Karsten: 1000 connections

Joe Conway: Thats a lot of connections, in particular for large analytics workloads
Joe Conway: You have to shrink work_mem when you use that many connections

Lukas Fittl: Is it useful to look at work_mem as a database-wide setting instead of per-connection? (there is an existing -hackers thread on this)
Joe Conway: People have talked about resource management over the years, its one of the features that Postgres lacks that other databases have


Karsten: To the other topic, not enough parallelism - how could parallelism work better?
Joe Conway: What are you bottlenecked on, I/O?
Karsten: The data is cached, in memory, you have a database of 100GB of memory, and still its not working in parallel
Karsten: Seeing this same problem when running TPC-H benchmarks
(?): Would need to look at the specific query execution plan to understand the parallelism problems better
(?): Memory usage is shared_buffers, per connection relation cache, etc. - not just work_mem


Joe Conway: Do we need per-user memory restrictions?
Karsten: In our case, not so much, its typically one user running the queries
Karsten: In our environments, queries can easily run for 30, 60 minutes, or longer

Michael Paquier: There is a memory debug function in Postgres (the same that prints in OOM situations), that could be utilized
Joe Conway: Someone needs to sit down and come up with a list of requirements on resource management

Michael Paquier: One idea could be to have a per-session memory limit that is managed by each backend
Joe Conway: Whatever you do in this area, it would have to not be overrideable per-user (from a security perspective)

[ discussion around measuring whether parallel query works ]

(?): There are two counters in shared memory around parallel workers launched/etc that could be exported

Connection Pooling

Pain Points:

  • low active connections / high total connections
  • Snapshot building / proc array lock / CSN (Commit Snapshot Number) at 10k connections
  • idle_in_transaction pain (out of scope)
  • relcache stays after query run on the idle connection
  • catalog cache bloat
  • is there a way to coax glibc to give back memory? nope.

  • Connection
    • Session
      • Transaction
        • Set Local

Things that taint a connection

  • SET

Discussion about what about a connection can be salvaged after a user disconnects

  • how many active backends to retain and for how long
  • "no way around" the connection pooler being per-user

Discussion of how to decide how long to keep which user's connections around

  • Master vs replica settings.
  • idle_backend_timeout would terminate unused connections, and commonly used users would simply be recycled enough

Discussion about max reuse of connections (to deal with memory leaks)

  • static analyzers already rigorously check for memory leaks, so little need for capping re-use

Final Thoughts / Good Next Steps:

  • Could this be done as an extension, and how could you avoid connection tainting.
  • Does socket handoff need to be done in core
  • Could this just be a part of pgbouncer

Hypothetical Partitions for testing optimizations

  • like hypopg does for indexes, but with partitions


  • how to get a fake table name, and how this is already accomplished for index relations in hypopg
  • relcache does not have the ability to remove negative relcache entries
  • how to impose catalog entries to represent the partitions in just one session
  • velocity of changes to partitioning in recent versions (10,11, master) means that few people understand what current behavior is
  • some insert/delete hooks coming in v12, which may cut down the amount of code that needs to be duplicated
  • CreateFakeRelcacheEntry / FreeFakeRelcacheEntry could be used

- How to select Oids for fake objects on a read replica, risk collision with new real objects from master. Some question of whether fake Oids are even necessary.

It was suggested that refactoring for v12, even if incomplete, is better than no progress at all.

Several user questions about hypopg and how stable it is, and how it would be used, with an eye toward including in cloud DBAAS products.

Some concern was expressed that it may be too late in the development cycle to get a chance of this size into v12.

Partition-wise Join & Visibility Map

KaiGai Kohei (HeteroDB,Inc)

First, KaiGai introduced the background why he focuses on the features.

He has developed PG-Strom which is an extension of PostgreSQL to accelerate analytic queries using GPU. It also challenges to I/O intensive workloads by its SSD-to-GPU Direct SQL feature; that loads PostgreSQL's data blocks on the storage (NVME-SSD) to GPU directly using P2P DMA, then GPU runs a part of SQL workloads (WHERE/JOIN/GROUP BY). It reduces the data size to be loaded to CPU/RAM.

Some HPC hardware supports PCIe switch between CPU and peripheral devices on PCIe bus. It also enables to bypass PCIe controller built-in CPU for SSD-to-GPU P2P DMA, if a pair of SSD and GPU are installed under the same PCIe-switch. From the standpoint of PostgreSQL, it makes sense that partition leaf accommodates bunch of SSDs for each PCIe-switch. Current version of partition-wise join allows JOIN between partitioned-tables with same distributed key, however, it makes sense to distribute and push down normal tables to individual partition-leafs, because we can run SQL on the peripheral device side.

He suggested, 1) a lightweight hook on SetHintBit() to manage the heap blocks which are safe for P2P direct read, because GPU cannot access commit log, and zHeap may eventually remove the visibility-map infrastructure. 2) an infrastructure to make a join path between normal-table and partitioned-table, without redundant inner read.

From the audience, they comment on materialization is valuable for merge join also, not only hash-join. Even if inner table is distributed to 100 partition-leaf, shared materialized image prevents repeat of same scan. One other also comments the hashed/materialized table can be reduced if join-key contains partition-key and we can skip tuples obviously unmatch. One asked it is safe for non-equality join. KaiGai thinks it is equivalent to the current behavior for INNER JOIN, even if reorder GATHER and JOIN. One other question was about the hook on SetHintBit() - which information shall be delivered. KaiGai said that relation-id, block-number, iterm-pointer and informask, but it is detailed stuff.

session slides are here:

Built-In Job Scheduler

Lack of scheduler in Postgres

Extensions end up making their own scheduler, or require people to use cron

Use cases

  • Partition rotation (pg_partman, Timescale)
  • Run-away queries and killing them
  • Time-based partial indexes (queries almost always look for the last one of something)
  • Manual vacuums

In-core scheduler can also be useful for extensions to add recurring tasks, no need to make their own background worker

Key features - RLS (Row Level Security) on user_name - Do NOT look like cron - Only execute stored procedures with a specific call signature

We have stored procedures - no need for another program!

- job_name text UNIQUE
- user_name text
- procedure_name text/oid
- initial_run tstz
- run_frequency interval
- last_run tstz
- last_duration interval
- last_successful_run tstz
- last_successful_duration interval
- max_run_duration interval
- num_overlap int

Stored procedures are currently restricted in terms of languages, but from there you could call anything

"How many backends do you use?"
=> One scheduler job (background worker) that would then spawn separate workers for each invocation

Dimitri: Should have a cap on the number of concurrent tasks that can be started

Magnus: If you want long-running jobs, you might want to put priority on them

Flag to run things on a primary/standby only (could also be solved with pg_is_in_recovery())

What wouldn't work with this?
=> VACUUM doesn't work, because it checks whether its a top-level statement

"Whats the value to only restricting it to execute stored procedures?"
=> Simplicity was the main motivation
=> One command text field is good too

"Which databases is it going to run on?"
=> Would need a way to specify the database

"Are timezones a problem - how do you handle a day-light savings time change?"
=> This should be taken care of by the schema as proposed

"Isn't this similar to the ability to write a background worker without using C?"

Magnus: "I want the ability to run a job now, without any other effects on the schedule"

"Are last_duration / last_successful_run / etc necessary at all / should they be in a separate table?"
=> Maybe should have "pg_stat_jobs_schedule" to store these and other things like rusage

The core feature that could be using this is continuous materialized views