PgCon 2016 Developer Unconference/Wait events monitoring

From PostgreSQL wiki

Wait events monitoring discussion at the PGCon 2016 Developer Unconference

Preamble

Everyone agrees that we need some kind of wait events monitoring, since it is extremely useful for DBAs, and tools like systemtap or perf are not an option.

State of the art in 9.6

  1. Infrastructure to expose wait event information for a particular process, replacing pg_stat_activity.waiting with something more descriptive (the wait_event_type and wait_event columns)
  2. pg_wait_sampling extension by Postgres Professional
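As a rough illustration of item 1, the sketch below (in Python, chosen only for brevity) holds a query against the 9.6 columns wait_event_type and wait_event and tallies the rows such a query would return. The helper name summarize_waits and the sample rows are illustrative assumptions; actually running the query needs a database driver, which is left out here.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# In 9.6 the boolean pg_stat_activity.waiting is replaced by two text columns.
WAIT_QUERY = """
SELECT pid, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
"""

Row = Tuple[int, Optional[str], Optional[str]]


def summarize_waits(rows: Iterable[Row]) -> Counter:
    """Count backends per "type:event" pair; rows with no wait event are skipped."""
    counts: Counter = Counter()
    for _pid, event_type, event in rows:
        if event is None:
            continue
        counts[f"{event_type}:{event}"] += 1
    return counts


# Hypothetical rows, shaped like the result of WAIT_QUERY.
sample_rows = [
    (101, "Lock", "transactionid"),
    (102, "Lock", "transactionid"),
    (103, "LWLockNamed", "ProcArrayLock"),
    (104, None, None),  # not waiting
]
print(summarize_waits(sample_rows).most_common())
```

A single snapshot like this only shows what backends are waiting on at one instant, which is why the sampling and history requirements come up later in the discussion.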

9.7 Roadmap for Wait events monitoring from Postgres Professional

  1. Adding more events (IO and network at least)
  2. Exposing additional wait event parameters.
  3. Measuring accurate waiting time for individual events (as a switchable option)
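One way to picture item 3 is a wait wrapper that always counts events but reads the clock only when timing is switched on. This is a minimal Python sketch under that assumption, not the Postgres Professional design; the class name WaitTimer, its fields, and the event name are all hypothetical.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class WaitTimer:
    """Counts every wait; reads the clock only when timing is enabled."""

    def __init__(self, timing_enabled: bool = False,
                 clock: Callable[[], float] = time.perf_counter) -> None:
        self.timing_enabled = timing_enabled
        self.clock = clock
        self.counts: Dict[str, int] = {}
        self.total_seconds: Dict[str, float] = {}

    @contextmanager
    def wait(self, event: str) -> Iterator[None]:
        self.counts[event] = self.counts.get(event, 0) + 1
        if not self.timing_enabled:
            yield  # cheap path: no clock reads at all
            return
        start = self.clock()
        try:
            yield
        finally:
            elapsed = self.clock() - start
            self.total_seconds[event] = self.total_seconds.get(event, 0.0) + elapsed


timer = WaitTimer(timing_enabled=True)
with timer.wait("IO:DataFileRead"):  # hypothetical event name
    time.sleep(0.01)                 # stands in for the real wait
print(timer.counts, timer.total_seconds)
```

With timing disabled the wrapper costs one dictionary update per wait; with timing enabled it adds two clock reads per wait, which is the overhead the option is meant to make avoidable.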

Requirements

The DBAs who attended the session insisted on working out some requirements first, to make the tool more useful.

  1. Wait events need to be easy to extend
    When new functionality is added to PostgreSQL, adding new wait events for it should not require a complete overhaul of the wait event monitoring subsystem. Moreover, it is a good idea to let users add extension-space or user-space wait events and parameters.
  2. Sampling history
    A lightweight circular buffer of recent samples for the whole database, which gives a rough picture of what is going on and allows further investigation if required.
  3. Full session history (trace)
    Can be switched on for a particular session when required for troubleshooting, and switched off again afterwards.
  4. Aggregated statistics
    Wait statistics aggregated over a prolonged period of time (a day or so) to reduce the performance penalty. They are supposed to be reset manually, as is done for pg_stat_bgwriter, for example.
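Requirements 2 and 4 can be pictured together: a fixed-size buffer that keeps only the most recent samples, next to a long-lived profile that is cleared only on request. The Python sketch below is a toy model under those assumptions, not how pg_wait_sampling is implemented; Sample, WaitHistory, and the event names are hypothetical.

```python
from collections import Counter, deque
from typing import Deque, List, NamedTuple, Tuple


class Sample(NamedTuple):
    ts: float   # sample timestamp, in seconds
    pid: int    # backend process id
    event: str  # wait event name


class WaitHistory:
    """Ring buffer of recent samples plus an aggregated, manually reset profile."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) silently drops the oldest sample once it is full.
        self.recent: Deque[Sample] = deque(maxlen=capacity)
        self.profile: Counter = Counter()  # event name -> number of samples

    def record(self, sample: Sample) -> None:
        self.recent.append(sample)
        self.profile[sample.event] += 1

    def top_events(self, n: int = 3) -> List[Tuple[str, int]]:
        return self.profile.most_common(n)

    def reset_profile(self) -> None:
        """Manual reset of the aggregate; the recent history is left untouched."""
        self.profile.clear()


history = WaitHistory(capacity=2)
history.record(Sample(0.00, 101, "Lock:transactionid"))
history.record(Sample(0.01, 101, "Lock:transactionid"))
history.record(Sample(0.02, 102, "IO:DataFileRead"))
print(list(history.recent))
print(history.top_events())
```

The buffer bounds memory use no matter how long the server runs, while the profile grows only with the number of distinct events, so it can cover a day or more before being reset.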

Criticisms

  1. Performance penalty
    • This is the main obstacle today
    • pg_wait_sampling: more performance review is needed
    • Anecdotally, Oracle accepts a performance trade-off (for example, Oracle Database measures wait times with the resource-consuming gettimeofday syscall), because such an advanced monitoring tool makes it possible to improve performance dramatically in many ways.
    • Among PostgreSQL developers, there is a persistent opinion that no individual patch should be allowed to reduce performance significantly.
  2. Oracle-style
    It is hard not to follow Oracle's ideas in this area, because wait events were pioneered by Oracle. But although compatibility with Oracle looks attractive to many companies, it is a bad idea to bring all of Oracle's legacy and non-intuitive interfaces into the Postgres implementation (such as exact event classes, p1/p2/p3 values or enqueue waits).

Suggestions

  1. Make wait event monitoring switchable and turned off by default to avoid the performance penalty
    • Good: people who want to use it can do so in spite of the performance overhead
    • Bad: if it is turned off by default, people cannot investigate an issue that has already happened; they can only turn it on and wait for the issue to occur again
    • Bad: if the performance overhead is significant, nobody will use it in a production environment (just as logging everything by default is not possible with an intensive workload)
  2. Divide wait events into at least two groups, the first of which (group "System" or so) is turned on by default and has a tolerable performance penalty
    Having some events turned on by default gives us an overview of what is going on in the database and a clue as to whether we need further, extended sampling, for a smaller performance trade-off.
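The grouping idea, combined with the earlier requirement that wait events be easy to extend, can be sketched as a registry in which every event belongs to a group and only some groups are enabled by default. This is a hypothetical Python model for discussion only; WaitEventRegistry, the group names other than "System", and the event names are assumptions, not an existing PostgreSQL interface.

```python
from typing import Dict, Iterable, Set


class WaitEventRegistry:
    """Maps wait events to groups and tracks which groups are switched on."""

    def __init__(self, default_groups: Iterable[str]) -> None:
        self._group_of: Dict[str, str] = {}
        self.enabled_groups: Set[str] = set(default_groups)

    def register(self, event: str, group: str) -> None:
        """Add a new event (core or extension) without touching existing ones."""
        if event in self._group_of:
            raise ValueError(f"wait event already registered: {event}")
        self._group_of[event] = group

    def set_group_enabled(self, group: str, enabled: bool) -> None:
        if enabled:
            self.enabled_groups.add(group)
        else:
            self.enabled_groups.discard(group)

    def is_tracked(self, event: str) -> bool:
        return self._group_of[event] in self.enabled_groups


registry = WaitEventRegistry(default_groups=["System"])
registry.register("Lock:transactionid", "System")    # cheap, on by default
registry.register("IO:DataFileRead", "Extended")     # costlier, off by default
registry.register("my_extension:queue", "Extended")  # added by an extension

print(registry.is_tracked("Lock:transactionid"))
print(registry.is_tracked("IO:DataFileRead"))
registry.set_group_enabled("Extended", True)  # turn on for deeper investigation
print(registry.is_tracked("IO:DataFileRead"))
```

Keeping the on/off switch at group level means the default configuration stays cheap, while a DBA who sees something suspicious in the default group can enable the costlier groups without a restart-style overhaul.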