User:Gsmith

Greg Smith is a consultant for 2ndQuadrant, a regular hacker of new features starting with PostgreSQL 8.3, and the author of many papers on PostgreSQL and related topics, available at his PostgreSQL Performance Pitstop.

Below is a document I'm working on that I'll merge into the main structure when I'm done knocking the mistakes out of it.

The Shared Buffer Cache, Checkpoints, and the Background Writer

Introduction

This is an introduction to the internals of the shared buffer cache in PostgreSQL 8.2. The database engine ensures data integrity by a process that involves writing out data to a Write Ahead Log (WAL), then writing to the main database. A procedure called a checkpoint forces a synchronization of the data in memory with that on disk. Checkpoints can take a long time, and optimizing the database to reduce their impact is a major goal of performance tuning with PostgreSQL.

This document explains how checkpoints actually happen inside the engine, and how the related buffer management mechanisms work. A related purpose is to explain enough of the database engine's operation to understand what the background writer's tuning parameters actually control, for improved tuning insight.

Many sections here include pointers to the PostgreSQL source code that implements the behavior being described. These serve as an index into that area of the code for anyone who wants to dig into the details, but looking at the source isn't required to follow the discussion here.

Buffer Cache Internals

What's in a buffer?

You're probably familiar with increasing the shared_buffers parameter to add more buffers to the system. The actual content of each buffer is described in the sbufdesc structure. The major parts we'll talk about here are:

  • flags: a bitmask describing the general state of the buffer, with the various flags defined in buf_internals.h. Buffers that need to be written out will be marked BM_DIRTY.
  • refcount: Shows if a buffer is pinned, which means that some process is actively working with it and has locked it against eviction for that purpose. Generally you'll only find a buffer or two per client backend in the pinned state.
  • usage_count: Each time a buffer is used, this count is increased, up to BM_MAX_USAGE_COUNT (currently set to 5). The main LRU clock-sweep decreases this number each time it passes over a buffer.

When someone wants a new buffer to hold a page from the database, they call BufferAlloc to get it. If the page isn't in memory already, another page is evicted and the requested one is read into its place. In either case, the buffer is pinned, which keeps anyone else from evicting it until whoever allocated it releases that pin. The process of pinning a buffer also increases its usage count.
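
To make these fields concrete, here is a heavily simplified C sketch of a buffer header and of what pinning does to it. This is a sketch only: the real sbufdesc in src/include/storage/buf_internals.h has more fields (the buffer tag, a spinlock, and lightweight locks among them), the flag value shown is illustrative, and the real PinBuffer in bufmgr.c also takes the header lock and tracks per-backend pins.

  #include <stdint.h>

  #define BM_DIRTY           (1 << 0)  /* illustrative flag bit: needs writing */
  #define BM_MAX_USAGE_COUNT 5         /* cap on usage_count */

  typedef struct BufferSketch
  {
      uint16_t flags;        /* state bitmask; BM_DIRTY set once changed */
      uint16_t usage_count;  /* raised on use, lowered by the clock-sweep */
      unsigned refcount;     /* pin count; nonzero blocks eviction */
  } BufferSketch;

  static void pin_buffer(BufferSketch *buf)
  {
      buf->refcount++;                      /* pinned: cannot be evicted */
      if (buf->usage_count < BM_MAX_USAGE_COUNT)
          buf->usage_count++;               /* remember the buffer was used */
  }

  static void unpin_buffer(BufferSketch *buf)
  {
      buf->refcount--;                      /* eviction allowed again at zero */
  }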

How is the shared buffer cache organized?

The engine's buffer cache is a simple array of these buffer entries. Most of the operations done on it pick a starting point and just move through the array in a simple loop, incrementing the entry number and rolling back to 0 when the end is reached. This is referred to as a clock-sweep of the buffer cache. For a visual of how this looks, see page 11 of http://www.siue.edu/~sblythe/cs514/lectures/virtualmem.ppt , a set of lecture slides on implementing virtual memory.
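
In code terms, the sweep position is nothing more than an array index that wraps around. A minimal sketch, with NBuffers standing in for the array size set by shared_buffers:

  static int sweep_position;             /* current clock-sweep spot */

  static int next_sweep_slot(int NBuffers)
  {
      int current = sweep_position;

      sweep_position = (sweep_position + 1) % NBuffers;  /* roll back to 0 */
      return current;
  }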

Anytime some part of the engine wants to access a buffer, it increments that buffer's usage_count, which tracks how recently and heavily the buffer has been accessed. Such recently-used (RU) buffers can't be recycled when a new buffer is needed. Once new information is written to a buffer, it's referred to as being dirty. Some typical transitions between these states might be:


  • (not RU, not dirty) -> read -> (RU, not dirty) -> update -> (RU, dirty) -> transaction complete -> LRU sweep -> (not RU, dirty) -> checkpoint -> (not RU, not dirty)
  • (not RU, not dirty) -> insert -> (RU, dirty) -> background writer write -> (RU, not dirty) -> transaction complete -> (RU, not dirty)

For deeper background on how the buffer cache operates, with details on the locking mechanisms, take a look at src/backend/storage/buffer/README.


How are new buffers allocated?

While there is initially a list of free buffers in the engine, it will quickly be exhausted during use. The vacuum process does repopulate that list, in order to keep vacuuming from blowing out the entire buffer cache (it just continuously recycles a small section instead). But other than that, the only time pages end up back on the free list is when something like a DROP invalidates data that was already cached. Under normal operation, the free list will be empty, so requests to allocate a new buffer will have to evict something that's already in memory.

The method used to figure out which buffer gets booted is based on a simple clock-sweep. The point used to evict pages simply steps forward through the buffer cache, and is sometimes called the LRU (Least Recently Used) position. This is kind of confusing, because there’s really no LRU mechanism involved; the combination of the usage_count mechanism and the clock-sweep approach just approximates one.
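
Here is a minimal sketch of that victim selection, reusing the simplified buffer header and sweep helper from the earlier sketches; the real StrategyGetBuffer in freelist.c also consults the free list first, handles the locking, and gives up with an error if every buffer turns out to be pinned:

  /* Sweep until a buffer that is neither pinned nor recently used turns up. */
  static BufferSketch *get_victim(BufferSketch *buffers, int NBuffers)
  {
      for (;;)
      {
          BufferSketch *buf = &buffers[next_sweep_slot(NBuffers)];

          if (buf->refcount != 0)
              continue;              /* pinned: someone is using it, skip */
          if (buf->usage_count > 0)
          {
              buf->usage_count--;    /* recently used: decay it and move on */
              continue;
          }
          return buf;                /* not pinned, not recently used: evict */
      }
  }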

How do shared buffers normally get written out?

When a new shared buffer is needed for an operation, BufferAlloc is called to get one. If there’s anything on the list of free pages, the client gets one of those. But under normal operations where there are no free buffers, it scans the buffer cache starting at the LRU clock-sweep position, looking for the first one that hasn't been recently used and isn’t pinned. If the buffer it gets is dirty, it doesn’t keep looking for a new one: it just writes it out, then gives that buffer to the caller that needed one.


  • BufferAlloc (src/backend/storage/buffer/bufmgr.c)
    • StrategyGetBuffer (src/backend/storage/buffer/freelist.c)
    • FlushBuffer (src/backend/storage/buffer/bufmgr.c)
      • XLogFlush (src/backend/access/transam/xlog.c)
      • smgrwrite (src/backend/storage/smgr/smgr.c)
        • mdwrite (src/backend/storage/smgr/md.c)
          • FileWrite (src/backend/storage/file/fd.c)
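
The reason XLogFlush appears in that tree before smgrwrite is the write-ahead rule: the WAL must be flushed at least up to a dirty page's LSN before the page itself may be written out. A hypothetical sketch of that ordering, with placeholder stand-ins rather than the real functions:

  #include <stdint.h>

  typedef uint64_t XLogPtrSketch;            /* stands in for XLogRecPtr */

  static void xlog_flush(XLogPtrSketch upto) { (void) upto;  /* flush WAL */ }
  static void smgr_write(const void *page)   { (void) page;  /* write page */ }

  static void flush_buffer(const void *page, XLogPtrSketch page_lsn)
  {
      xlog_flush(page_lsn);   /* WAL first: log covers this page's changes */
      smgr_write(page);       /* only then write the data page itself */
  }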

Checkpoints

How are checkpoints triggered inside the engine?

A call to RequestCheckpoint (src/backend/postmaster/bgwriter.c) asks for a checkpoint; several places inside the source code make this call:

  • XLogWrite requests a checkpoint when it has written checkpoint_segments worth of transaction log since the last checkpoint (src/backend/access/transam/xlog.c)
  • BackgroundWriterMain notices more than checkpoint_timeout seconds have passed (src/backend/postmaster/bgwriter.c). Note that since the background writer is also the process that does the checkpoints, it doesn't actually call RequestCheckpoint; it just handles the request internally.
  • pg_start_backup wants the WAL written out before it can do the backup (src/backend/access/transam/xlog.c)
  • CHECKPOINT SQL command (src/backend/tcop/utility.c)
  • createdb wants a checkpoint after the database is created (src/backend/commands/dbcommands.c)
  • dropdb needs a checkpoint on Win32 to get the bgwriter to release its open files before the database is removed (src/backend/commands/dbcommands.c)
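
Of these, the first two are the routine triggers. As a rough sketch, they boil down to a pair of threshold checks; the names here are illustrative (the real tests live in XLogWrite and BackgroundWriterMain), and the values shown are the 8.2 defaults:

  static int    segs_since_checkpoint;   /* WAL segments filled so far */
  static double secs_since_checkpoint;   /* time since the last checkpoint */

  static int checkpoint_segments = 3;    /* default */
  static int checkpoint_timeout  = 300;  /* default, in seconds */

  static int checkpoint_needed(void)
  {
      return segs_since_checkpoint >= checkpoint_segments ||
             secs_since_checkpoint >= checkpoint_timeout;
  }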

How do checkpoints happen inside the PostgreSQL backend?

Here is a high level view of what happens inside the database engine when a checkpoint occurs:

1) Some part of the engine decides it needs a checkpoint. In normal operation, this will most likely be either because checkpoint_segments worth of WAL has been written since the last checkpoint, or because checkpoint_timeout seconds have passed. A checkpoint request is made.

2) If the background writer is busy writing out records, it has to finish that before it notices the checkpoint request.

3) The checkpoint starts. A record is written to the WAL marking the redo point for this checkpoint. Clients can continue to write to the WAL and operate normally while the checkpoint is going on.

4) The shared_buffers cache is scanned using a clock-sweep, and any dirty (changed) buffers in there are written to disk.

5) In order to complete the integrity cycle, the database's data files are fsync'd, and the process waits for that to complete.

6) The fsync is complete, so the server writes to the WAL saying that the checkpoint is done.
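
Condensed into code form, the sequence looks roughly like this; the functions are placeholders numbered to match the steps above, and the real CreateCheckPoint in xlog.c does far more bookkeeping:

  static void wal_write_redo_record(void)       { /* (3) note the redo point */ }
  static void write_all_dirty_buffers(void)     { /* (4) BufferSync's job */ }
  static void fsync_data_files(void)            { /* (5) smgrsync and wait */ }
  static void wal_write_checkpoint_record(void) { /* (6) checkpoint done */ }

  static void create_checkpoint(void)
  {
      wal_write_redo_record();          /* (3) clients keep working */
      write_all_dirty_buffers();        /* (4) scan and write dirty pages */
      fsync_data_files();               /* (5) wait for the disk */
      wal_write_checkpoint_record();    /* (6) the checkpoint is complete */
  }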


The main difficulty here is step (5). While clients aren't technically blocked at this point, it's likely that whatever they are doing will require reading from the database, which is going to be extremely slow while the database disk is going through an fsync. Your only real recourse here is to aggressively tune the background writer, in hopes that it will have written most of the dirty buffers before the checkpoint occurs.

What source code is involved in writing out checkpoint data?

Here is a flow of how writes actually make their way onto the disk after a checkpoint request. The main thing left out here is the calls that synchronize locks on all the resources involved. The numbers correspond to the steps in “How do checkpoints happen” above:


  • RequestCheckpoint(src/backend/postmaster/bgwriter.c) (1)
  • BackgroundWriterMain(src/backend/postmaster/bgwriter.c) (2)
    • CreateCheckPoint(src/backend/access/transam/xlog.c) (3)
      • CheckPointGuts (src/backend/access/transam/xlog.c)
        • FlushBufferPool (src/backend/storage/buffer/bufmgr.c) (4)
          • BufferSync (src/backend/storage/buffer/bufmgr.c) (4)
            • SyncOneBuffer (src/backend/storage/buffer/bufmgr.c) (4)
              • FlushBuffer (src/backend/storage/buffer/bufmgr.c) (4)
                • XLogFlush (src/backend/access/transam/xlog.c) (4)
                • smgrwrite (src/backend/storage/smgr/smgr.c) (4)
                  • mdwrite (src/backend/storage/smgr/md.c) (4)
                    • FileWrite (src/backend/storage/file/fd.c) (4)
          • smgrsync (src/backend/storage/smgr/smgr.c) (5)
            • mdsync (src/backend/storage/smgr/md.c) (5)
              • FileSync (src/backend/storage/file/fd.c) (5)
    • CreateCheckPoint(src/backend/access/transam/xlog.c) (6)

The Background Writer

What does the background writer do?

Section 17.4.5 of http://www.postgresql.org/docs/8.2/interactive/runtime-config-resource.html has an introduction to the background writer and its parameters. It’s difficult to understand exactly how the background writer works until you’ve gotten some background on how the buffer cache is managed.

  • BackgroundWriterMain (src/backend/postmaster/bgwriter.c)
    • BgBufferSync (src/backend/storage/buffer/bufmgr.c)

At any time, the background writer has two points it is paying attention to. The first is the clock-sweep position of the BufferAlloc code, which as explained above is referred to as the LRU point. The background writer's LRU scan looks for dirty buffers that haven't been recently used and aren't pinned, starting at the current clock-sweep spot, until it either writes out bgwriter_lru_maxpages buffers or has scanned bgwriter_lru_percent of the buffer cache, whichever comes first. Theoretically, in an idle engine the LRU clock-sweep position might not move between scans, which means the writer would keep looking at the same set of buffers it examined on the previous run. In practice, the way buffers are allocated forces that position to move anytime you're doing something that's likely to generate dirty pages.

The other starting point is for the all-pages scan, and it's internal to the bgwriter; it's another simple clock-sweep that works its way through the entire buffer cache. This one looks for dirty buffers, but doesn't care whether they've been used recently or not. As such, you can end up writing out a buffer that gets dirtied again shortly afterward, which makes that write a wasted I/O operation. This is one of the reasons using the background writer aggressively is less efficient than doing the same work at checkpoint time. The all scan runs until it either writes out bgwriter_all_maxpages buffers or has scanned bgwriter_all_percent of the buffer cache.
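
A minimal sketch of the two scans, reusing the simplified buffer header from the earlier sketches; the bgwriter_* parameter names match postgresql.conf, everything else is illustrative:

  /* LRU scan: write dirty, unpinned, not-recently-used buffers starting at
     the clock-sweep point, bounded by pages written and fraction scanned. */
  static int lru_scan(BufferSketch *buffers, int NBuffers, int start,
                      int lru_maxpages, double lru_percent)
  {
      int to_scan = (int) (NBuffers * lru_percent / 100.0);
      int written = 0;
      int i;

      for (i = 0; i < to_scan && written < lru_maxpages; i++)
      {
          BufferSketch *buf = &buffers[(start + i) % NBuffers];

          if ((buf->flags & BM_DIRTY) && buf->refcount == 0 &&
              buf->usage_count == 0)
          {
              buf->flags &= ~BM_DIRTY;       /* stands in for FlushBuffer */
              written++;
          }
      }
      return written;
  }

  /* All-pages scan: keeps its own private sweep position, and writes any
     dirty buffer regardless of pins or usage_count. */
  static int all_scan_position;

  static int all_scan(BufferSketch *buffers, int NBuffers,
                      int all_maxpages, double all_percent)
  {
      int to_scan = (int) (NBuffers * all_percent / 100.0);
      int written = 0;
      int i;

      for (i = 0; i < to_scan && written < all_maxpages; i++)
      {
          BufferSketch *buf = &buffers[all_scan_position];

          all_scan_position = (all_scan_position + 1) % NBuffers;
          if (buf->flags & BM_DIRTY)
          {
              buf->flags &= ~BM_DIRTY;       /* may be dirtied again: the
                                                wasted-I/O risk noted above */
              written++;
          }
      }
      return written;
  }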

By default, the LRU writer is tuned to scan 3X as much as the all scan; this is partly because the way the LRU scan skips recently used and pinned pages means it will find fewer pages to write per buffer scanned. In a busy system, your buffer cache is full all the time and therefore the LRU clock-sweep point is always moving, which might lead one to think that's all you need. The main reason to even bother with the all scan is, again, the way the LRU scan ignores recently used buffers. On a heavily loaded system, so many of the buffers can have high usage counts that the only way to get disk writes out of the background writer that help reduce checkpoint overhead is to push out some of the recently used ones. When the usage_count on a buffer drops to 0, someone else is going to grab it and use it again very soon, which means the LRU writer might never actually find a page to write. All of the default tunings are extremely low, however, and useful values for a write-heavy environment will be dramatically larger in every respect.

Note that the background writer is also in charge of handling checkpoints. While a checkpoint is occurring, the writer isn't active. This means that if checkpoints take too long, you can find yourself with another checkpoint request before the current one is complete, so you'll go from checkpoint to checkpoint without the background writer executing even a single time in between.
