Scalability of pg stat statements

From PostgreSQL wiki
Jump to navigationJump to search

PGConf.dev 2026 Unconference

See mailinglist thread: https://www.postgresql.org/message-id/flat/CAA5RZ0vZwR_dSK6fo0P2-EnskUVN0NjLHnGnJMFDPC8-kEW3sQ%40mail.gmail.com

Performance limits in pg_stat_statements

  • Spinlock contention
  • Exclusive lock contention on pgss LWLock
    • When deallocating entries from the hash
    • When cleaning the query text file from disk
    • Which one of these is actually the issue is hard to know (its the same LWLock/pg_stat_statements)
    • But ultimately these problems are related

Last two releases have added infrastructure that is useful and is important for this

  • Allowing extensions to register their own stats kind (PG18)
  • Attaching DSA pointer to stats, so you can serialize things to disk and read it back on start up (PG19)

The idea is:

  • Create a new stats kind for pg_stat_statements using pluggable cumulative stats infrastructure
  • Store query text (or a portion of it) in a DSA (shared memory area) and spill to disk somehow

Once you move to the stats collector infrastructure the spinlock problem goes away

Stats collector accumulates locally (in each connection) and then flushed at the end of the transaction to shared memory

Deallocation can be more flexible, the pgss.max value could go over a bit, and we could potentially offload the eviction process to a background worker

  • Benchmark numbers in the thread
  • What happens in the high churn case (100k unique queries)
    • Performance with doing a prototype of the proposed changes improved 33%
  • Benchmark hot queries (common in the workload) vs those that are high churn
  • We don't just want to keep going over on the local storage in each connection, need some kind of limiting early
  • Current patch has eviction mechanism inline when adding entries
  • In the test case more hot entries survive compared to what exists today
  • For spinlock stress testing need to use a very large machine (96 cores, etc), regular pgbench with a single statement
  • On smaller machines you can use pipelining to show it (context switching bottlenecks it)

Andres: Are we sure that we want to continue the query text mess? (going to disk will trash you)

  • When you load the file (which also happens during GC cycles!) you are already have the full file in memory anyway
  • Production problems where query text file gets to 100GB

Michael: What kind of parameterization do we want?

  • Sizing based on memory?

What is changing on the eviction stuff?

  • As soon as you're running out, does a quicksort / LFU algorithm and evicts the bottom 5%
  • Change takes the eviction out of the hot execution path
  • Patch as presented does not use a background worker, instead has the first backend that runs over do the work, others throw away entries

Andres: Why do we need an exclusive lock to throw entries away / GC?

  • If we throw away entries then we don't need it
  • There is lots of places that have 5000 entries
  • Can GC be done in parallel?
    • You are only allowed to add a new entry if you also clean up one
    • If you have it done by only one backend then everyone has to wait on it

Julien: Question about repetition in the text file - could we avoid that?

  • Easier to do with shared memory, could separate tracking of the query text from stats
  • Andres: Would strongly suggest not keeping them in the same hash table
  • Security problem with query text sharing across roles could be solved by hashing the query text (queryid alone is not enough)

Downside of the shared memory stats design is the search across unrelated shared memory stats entries

  • If keeping query texts separate as a DSHash that would allow finding entries
  • Andres: Suggestion to have a separate array that stores for each entry, last time it was GCed, and what the current key is
  • That way its easy to find entries for GC

Julien: Could remember what the last time a query was executed to make it easier to find recent entries

  • Pavlo: There is a patch for it that uses the already stored transaction timestamp

Ryan: As pg_stat_statements continues to grow, there are a lot of columns that are not used, can we somehow opt out of some of them?

  • Maybe separate them, or have some kind of way to define what you need

Lukas: Does this solve the scalability concern re: adding more columns?

  • Andres: Entry lock might still be a problem (i.e. we might just replace a spinlock with an LWLock), but assuming we're careful with that it should be addressed

In Oracle, there are two separate views for this, we could follow the model

  • We already do that today with query text

Andres: How do we deal with stddev merging?

  • Sami: Addressed in the patch
  • Independently of that, not clear if its really used
  • Percentiles / histogram is what people would really want

One main question is: Do we really need the dual mode of query text in memory and on disk?

  • You can't really query it (or have garbage collection run) today without running out of memory - so making it be in memory doesn't actually make things worse

Should we truncate query texts?

  • Seems useful to offer a GUC for that, but don't want to truncate by default

How do we deal with query text space running out, but max entries not reached?

  • Removing old queries would make most sense, so recent queries are present

We could remove the pgss.max

  • Instead we could have the user specify the amount of memory for query texts mainly, and just scale up the pgss.max dynamically
  • Andres: We need both, workloads are different (lots of small queries vs a few large queries)
  • e.g. you set it to 1GB of untruncated query text, but you run out of space
  • In Oracle, its kept as separate storage for AWR
  • Multiple +1s on keeping both max entries and memory limit

Have you done a benchmark for track_planning?

  • Not yet, but would expect it to be solved
  • Possibly will have issues with prepared statements

In summary

  • Keep the pgss.max in addition to query memory, now that its dynamic
  • Separate the query text handling
  • Don't serialize the eviction, make it parallel

Michael: Does it make sense to parallelize the eviction even before we move the query text?

  • Andres: Would have to partition the lock for the hash table, it would likely complicate the work
  • Finding the victims to remove is the bottleneck right now
  • Would probably want some kind of LRU clocksweep because its easy

Evicting based on LFU

  • LRU might be the way to go for eviction

Background worker?

  • Andres: Likely bad idea, its important to have backpressure and not have backends wait on a background worker


PGConf.dev 2025 Unconference

Query text storage will potentially change in 19

  • Sami working on patch for moving query text to the shared memory
  • pg_stat_statements scalability problem: query itself is currently stored in the disk (to avoid truncation of the query text)
    • pg_stat_statements constantly needs to garbage collect query text file
    • Running into exclusive lock during the truncation
    • Garbage collection is expensive
  • Want to remove disk from the performance critical path
    • Could be hybrid approach: both memory and disk
  • Keeping data across crashes/reboots
    • Proposed changes don’t change the current behavior
    • System crash -> lose the stats (the same as pg_stat_statements today and pg_stats)
  • What about truncating long query text?
    • In practical terms, if we were to truncate (examples provided by Lukas):
      • Today we take about 2MB of shared memory by default (5000 entries @ ~400 bytes per entry)
      • For just the query text alone, if we stored it in shared memory:
      • 5000 entries @ 1024 query length = 5MB shared memory
      • 5000 entries @ 10240 query length = 50MB shared memory
      • 100000 entries @ 1024 query length = 100MB shared memory
    • Sami notes he does not want to truncate query text with the proposed changes
      • Audience feedback agrees: From operations perspective, it’s gonna make a life a bit difficult (since we can’t pin point the exact query)
      • Additional audience feedback someone agreeing to need the whole query (to copy the query text, to identify the query)
    • Could have a hybrid approach to have full query text on disk still in some way
    • Could have the limit be on the memory being used, and if someone has very large query texts, only a small number of query texts would be retained
    • When storing the query text in memory, we could also compress it
    • Audience question: Could we store them (query text) in a system table?
      • Postgres is a great database to store things!
      • => As the main approach does not seem realistic, because of all the extra overhead that brings (i.e. we do need the shared memory storage for performance) - but maybe as part of a hybrid approach?
  • (Additional post session discussion in a smaller group)
    • For configuring how many statements to keep, we could either have a per-entry maximum, or a total maximum of memory used
      • Total maximum of memory used seems more intuitive for users (e.g. "allocate 10MB of shared memory to query text storage, in total")
    • Should we have the memory limit apply to both query texts and stats, or should we limit them separately?
      • There is benefit to keeping stats and dropping query texts if needed, e.g. when you have an external tool/script in place that can retain the queryid to query text mapping for query texts that were already removed
      • Agreement that this seems to be how other database systems have handled this (separating query texts and statistics)
    • Should we allow querying for statistics without text?
      • Still useful, because text might be large and not needed (e.g. for a tool that diffs the counters and only infrequently gets the text)
      • Retrieving text may also be CPU intensive if compression is used

We have a lot of fields in pg_stat_statements, do we keep adding more?

  • pg_stat_statements has 55 fields right now, ~400 bytes per stats entry
  • Why do we care:
    • Additional fields each add a small amount of overhead
    • Additional fields make pg_stat_statements hard to use / understand when querying it directly (rest of the discussion focused on this aspect)
  • We recently added 10 fields for JIT, but do people actually use them?
    • => People generally agreed these don’t seem to be used, one comment noting more generally that they recommend turning off JIT completely
  • We could change the current view for easier use
    • e.g. categorize better, different views for different categories
    • or alternatively split out rarely used fields (e.g. JIT) for easier usability
  • Worth improving documentation here, to talk about use cases for the different fields
  • Do people use std dev in pg_stat_statements?
    • Only one person among ~40 people in the room
    • Percentiles would be more useful (P95, P98, P99)
    • Could implement via a sketch data structure (tdigest, etc)
  • Are pg_stat_statements and pg_stat_activity overlapping?
    • => Not really in terms of what they track
    • But slightly different names/meanings, etc. (datid vs dbid, query_id vs queryid, usesysid vs userid), which can be confusing

What are other cases (besides query text) where pg_stat_statements doesn't work well today?

  • Spin lock on statistics
    • Doing tests on a beefy AWS machine, can show visible lock contention
    • Not easy to resolve, Sami did some tests with atomics, but not clear how to make that work
  • Tracking planning time
    • Lots of overhead due to the same spinlock (which is worse for planning use case), turned off by default now
    • Problem with partitions
    • pg_stat_statements exec time doesn’t include planning time
  • Lots of tables and lots of schemas, or short-lived roles (creating lots of entries)
    • => Cause lots of dealloc cycles because of entries churning
    • (did not discuss any solutions to this particular problem)
  • Improved normalization
    • With ARRAY - creates many distinct pg_stat_statements entries
    • Fixed in 18 for IN lists (squash), but there are other cases that are not covered (e.g. VALUES)