Getting a stack trace of a running PostgreSQL backend on Linux/BSD

From PostgreSQL wiki

Revision as of 13:57, 16 August 2012 by Alexk (Talk | contribs)



Linux and BSD

Linux and BSD systems generally use the GNU compiler collection and the GNU Debugger ("gdb"). It's pretty trivial to get a stack trace of a process.

Installing external symbols

(BSD users who installed from ports can skip this)

On many Linux systems, debugging info is separated out from program binaries and stored separately. It's often not installed when you install a package, so if you want to debug the program (say, get a stack trace) you will need to install debug info packages. Unfortunately, the names of these packages vary depending on your distro, as does the procedure for installing them.

Some generic instructions (unrelated to PostgreSQL) are maintained on the GNOME Wiki here.

On Ubuntu

First, follow the instructions on the Ubuntu wiki entry DebuggingProgramCrash.

Once you've finished enabling the use of debug info packages as described, you will need to use the script linked to on that wiki article to get a list of debug packages you need. Installing the debug package for postgresql alone is not sufficient.

After following the instructions on the Ubuntu wiki, download the script to your desktop, open a terminal, and run:

$ sudo apt-get install $(sudo bash Desktop/ -t -p $(pidof -s postgres))

On Fedora

All Fedora versions: see the Fedora Project wiki page StackTraces.

Other distros

In general, you need to install at least the debug symbol packages for the PostgreSQL server and client as well as any common package that may exist, and the debug symbol package for libc. It's a good idea to add debug symbols for the other libraries PostgreSQL uses in case the problem you're having arises in or touches on one of those libraries.

Collecting a stack trace

How to tell if a stack trace is any good

Read this section and keep it in mind as you collect information using the instructions below. Making sure the information you collect is actually useful will save you, and everybody else, time and hassle.

It is vitally important to have debugging symbols available to get a useful stack trace. If you do not have the required symbols installed, backtraces will contain lots of entries like this:

#1  0x00686a3d in ?? ()
#2  0x00d3d406 in ?? ()
#3  0x00bf0ba4 in ?? ()
#4  0x00d3663b in ?? ()
#5  0x00d39782 in ?? ()

... which are completely useless for debugging without access to your system (and almost useless with access). If you see results like the above, you need to install debugging symbol packages, or even re-build PostgreSQL with debugging enabled. Do not bother collecting such backtraces; they are not useful.

Sometimes you'll get backtraces that contain just the function name and the executable it's within, not source code file names and line numbers or parameters. Such output will have lines like this:

#11 0x00d3afbe in PostmasterMain () from /usr/lib/postgresql/8.4/bin/postgres

This isn't ideal, but is a lot better than nothing. Installing debug information packages should give an even more detailed stack trace with line number and argument information, like this:

#9  0xb758d97e in PostmasterMain (argc=5, argv=0xb813a0e8) at postmaster.c:1040

... which is the most useful for tracking down your problem. Note the reference to a source file and line number instead of just an executable name.

Identifying the backend to connect to

You need to know the process ID of the postgresql backend to connect to. If you're interested in a backend that's using lots of CPU it might show up in top. If you have a current connection to the backend you're interested in, use select pg_backend_pid() to get its process ID. Otherwise, the pg_catalog.pg_stat_activity and/or pg_catalog.pg_locks views may be useful in identifying the backend of interest; see the "procpid" column in those views.
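For instance, using the pre-9.2 catalog column names this article assumes (procpid and current_query; later releases renamed them to pid and query), queries like these identify a backend of interest:

```sql
-- PID of the backend serving the current connection
SELECT pg_backend_pid();

-- PIDs, users and queries for all backends
SELECT procpid, usename, current_query
FROM pg_catalog.pg_stat_activity;
```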

Attaching gdb to the backend

Once you know the process ID to connect to, run:

sudo gdb -p pid

where "pid" is the process ID of the backend. GDB will pause the execution of the process you specified and drop you into interactive mode (the (gdb) prompt) after showing the call the backend is currently running, e.g.:

0xb7c73424 in __kernel_vsyscall ()

You'll want to tell gdb to save a log of the session to a file, so at the gdb prompt enter:

(gdb) set pagination off
(gdb) set logging file debuglog.txt
(gdb) set logging on

gdb is now saving all input and output to a file, debuglog.txt, in the directory in which you started gdb.
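If you collect traces often, the same setup can be kept in a small gdb command file (the filename gdbcmds.txt is just an example) and loaded with gdb's -x option:

```
# gdbcmds.txt -- executed automatically when passed to gdb with -x
set pagination off
set logging file debuglog.txt
set logging on
```

Then attaching with sudo gdb -x gdbcmds.txt -p pid performs the logging setup for you.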

At this point execution of the backend is still paused. It can even hold up other backends, so I recommend that you tell it to resume executing normally with the "cont" command:

(gdb) cont

The backend is now running normally, as if gdb wasn't connected to it.

Getting the trace

OK, with gdb connected you're ready to get a useful stack trace.

In addition to the instructions below, you can find some useful tips about using gdb with postgresql backends on the Developer FAQ.

Getting a trace from a running backend

If you want a stack trace from a running backend, you're probably interested in a backend that's taking way too long to execute a query, is using too much CPU, or appears to be in an infinite loop. In all those cases you'll want to repeatedly interrupt its execution, get a stack trace, and let it resume executing. Having a collection of several stack traces helps provide a better idea of where it's spending its time.

You interrupt the backend and get back to the gdb command line with ^C (control-C). Once at the gdb command line, you use the "bt" command to get a backtrace, then the "cont" command to resume normal backend execution.
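As a non-interactive sketch of the same idea (assuming gdb's -batch and -ex options, and the backend's PID in $PID), a small loop can attach, print a trace, detach, and repeat:

```
# Collect 5 backtraces, 5 seconds apart, appending to debuglog.txt.
# -batch runs the -ex commands and detaches without an interactive prompt.
for i in 1 2 3 4 5; do
    sudo gdb -batch -ex 'bt' -p "$PID" >> debuglog.txt 2>&1
    sleep 5
done
```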

Once you've collected a few backtraces, detach then exit gdb at the gdb interactive prompt:

(gdb) detach
Detaching from program: /usr/lib/postgresql/8.3/bin/postgres, process 12912
(gdb) quit

An alternative approach is to use the gcore program to save a series of core dumps of the running program without disrupting its execution. Those core dumps may then be examined at your leisure, giving you time to get more than just a backtrace because you're not holding up the backend's execution while you think and type.

Getting a trace from a reproducibly crashing backend

GDB will automatically interrupt the execution of a program if it detects a crash. So, once you've attached gdb to the backend you expect to crash, just let it continue executing as normal and do whatever is needed to make the backend crash.

gdb will drop you into interactive mode as the backend crashes. At the gdb prompt you can enter the bt command to get a stack trace of the crash, then cont to continue execution. When gdb reports the process has exited, use the quit command.

Alternately, you can collect a core file as explained below, but it's probably more hassle than it's worth if you know which backend to attach gdb to before it crashes.

Getting a trace from a randomly crashing backend

It's a lot harder to get a stack trace from a backend that's crashing when you don't know why it's crashing, what causes a backend to crash, or which backends will crash when. For this, you generally need to enable the generation of core files, which are debuggable dumps of a program's state that are generated by the operating system when the program crashes.

Enabling core dumps

On a Linux system you can check to see if core file generation is enabled for a process by examining /proc/$pid/limits, where $pid is the process ID of interest. "Max core file size" should be non-zero.
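As a sketch, you can inspect your current shell's limit the same way (substitute a backend's PID for $$ to check a running server process):

```shell
# Show the core file size limit for a process (here: the current shell).
# A soft limit of 0 means core dumps are disabled for that process.
grep 'Max core file size' "/proc/$$/limits"
```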

Generally, adding "ulimit -c unlimited" to the top of the PostgreSQL startup script and restarting postgresql is sufficient to enable core dump collection. Make sure you have plenty of free space in your PostgreSQL data directory, because that's where the core dumps will be written and they can be fairly big due to Pg's use of shared memory.
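A sketch of what that edit looks like, assuming a classic /etc/init.d-style startup script (the exact path and script layout vary by distro):

```
# Near the top of /etc/init.d/postgresql (location varies by distro):
# allow unlimited-size core dumps for the server processes it starts
ulimit -c unlimited
```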

On a Linux system it's also worth changing the file name format used for core dumps so that core dumps don't overwrite each other. The /proc/sys/kernel/core_pattern file controls this. I suggest core.%p.sig%s.%ts, which will record the process's PID, the signal that killed it, and the timestamp at which the core was generated. See man 5 core. To apply the setting, run echo 'core.%p.sig%s.%ts' | sudo tee /proc/sys/kernel/core_pattern. (Note that sudo echo ... > /proc/... does not work, because the redirection is performed by your unprivileged shell, which is why tee runs under sudo here.)
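You can confirm the setting by reading the file back:

```shell
# Print the kernel's current core dump filename pattern
cat /proc/sys/kernel/core_pattern
```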

You can test whether core dumps are enabled by starting a `psql' session, finding the backend pid for it using the instructions given above, then killing it with "kill -ABRT pidofbackend" (where pidofbackend is the PID of the postgres backend, NOT the pid of psql). You should see a core file appear in your postgresql data directory.

Debugging the core dump

Once you've enabled core dumps, you need to wait until you see a backend crash. A core dump will be generated by the operating system, and you'll be able to attach gdb to it to collect a stack trace or other information.

You need to tell gdb what executable file generated the core if you want to get useful backtraces and other debugging information. To do this, just specify the postgres executable path then the core file path when invoking gdb, as shown below. If you do not know the location of the postgres executable, you can get it by examining /proc/$pid/exe for a running postgres instance. For example:

$ for f in `pgrep postgres`; do ls -l /proc/$f/exe; done
lrwxrwxrwx 1 postgres postgres 0 2010-04-19 10:30 /proc/10621/exe -> /usr/lib/postgresql/8.4/bin/postgres
lrwxrwxrwx 1 postgres postgres 0 2010-04-19 10:51 /proc/11052/exe -> /usr/lib/postgresql/8.4/bin/postgres
lrwxrwxrwx 1 postgres postgres 0 2010-04-19 10:51 /proc/11053/exe -> /usr/lib/postgresql/8.4/bin/postgres
lrwxrwxrwx 1 postgres postgres 0 2010-04-19 10:51 /proc/11054/exe -> /usr/lib/postgresql/8.4/bin/postgres
lrwxrwxrwx 1 postgres postgres 0 2010-04-19 10:51 /proc/11055/exe -> /usr/lib/postgresql/8.4/bin/postgres

... we can see from the above that the postgres executable on my (Ubuntu) system is /usr/lib/postgresql/8.4/bin/postgres.
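A minimal sketch of the same lookup using readlink (the helper name exe_of is just illustrative):

```shell
# Print the executable path behind a PID, via the /proc filesystem
exe_of() {
    readlink "/proc/$1/exe"
}

# Demonstrated here on the current shell; for postgres you would run
# something like: for p in $(pgrep postgres); do exe_of "$p"; done
exe_of $$
```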

Once you know the executable path and the core file location, just run gdb with those as arguments, i.e. gdb -q /path/to/postgres /path/to/core. Now you can debug it as if it were a normal running postgres, as discussed in the sections above.

Debugging the core dump - example

For example, having just forced a postgres backend to crash with kill -ABRT, I have a core file named core.10780.sig6.1271644870s in /var/lib/postgresql/8.4/main, which is the data directory on my Ubuntu system. I've used /proc to find out that the executable for postgres on my system is /usr/lib/postgresql/8.4/bin/postgres.

It's now easy to run GDB against it and request a backtrace:

$ sudo -u postgres gdb -q -c /var/lib/postgresql/8.4/main/core.10780.sig6.1271644870s /usr/lib/postgresql/8.4/bin/postgres
Core was generated by `postgres: wal writer process                                                  '.
Program terminated with signal 6, Aborted.
#0  0x00a65422 in __kernel_vsyscall ()
(gdb) bt
#0  0x00a65422 in __kernel_vsyscall ()
#1  0x00686a3d in ___newselect_nocancel () from /lib/tls/i686/cmov/
#2  0x00e68d25 in pg_usleep () from /usr/lib/postgresql/8.4/bin/postgres
#3  0x00d3d406 in WalWriterMain () from /usr/lib/postgresql/8.4/bin/postgres
#4  0x00bf0ba4 in AuxiliaryProcessMain () from /usr/lib/postgresql/8.4/bin/postgres
#5  0x00d3663b in ?? () from /usr/lib/postgresql/8.4/bin/postgres
#6  0x00d39782 in ?? () from /usr/lib/postgresql/8.4/bin/postgres
#7  <signal handler called>
#8  0x00a65422 in __kernel_vsyscall ()
#9  0x00686a3d in ___newselect_nocancel () from /lib/tls/i686/cmov/
#10 0x00d37bee in ?? () from /usr/lib/postgresql/8.4/bin/postgres
#11 0x00d3afbe in PostmasterMain () from /usr/lib/postgresql/8.4/bin/postgres
#12 0x00cdc0dc in main () from /usr/lib/postgresql/8.4/bin/postgres

If you don't have proper symbols installed, specify the wrong executable to gdb, or fail to specify an executable at all, you'll see a useless backtrace like the following one:

$ sudo -u postgres gdb -q -c /var/lib/postgresql/8.4/main/core.10780.sig6.1271644870s 
Core was generated by `postgres: wal writer process                                                  '.
Program terminated with signal 6, Aborted.
#0  0x00a65422 in __kernel_vsyscall ()
(gdb) bt
#0  0x00a65422 in __kernel_vsyscall ()
#1  0x00686a3d in ?? ()
#2  0x00d3d406 in ?? ()
#3  0x00bf0ba4 in ?? ()
#4  0x00d3663b in ?? ()
#5  0x00d39782 in ?? ()
#6  <signal handler called>
#7  0x00a65422 in __kernel_vsyscall ()
#8  0x00686a3d in ?? ()
#9  0x00d3afbe in ?? ()
#10 0x00cdc0dc in ?? ()
#11 0x005d7b56 in ?? ()
#12 0x00b8fad1 in ?? ()

If you get something like that, don't bother sending it in. If you didn't just get the executable path wrong, you'll probably need to install debugging symbols for PostgreSQL (or even re-build PostgreSQL with debugging enabled) and try again.

Tracing problems when creating a cluster

If you're running into a crash when trying to create a database cluster using initdb, that may leave behind a core dump that you can analyze with gdb as described above. This should be the case if there's an assertion failure for example.

If you run into problems with the cluster creation "bootstrap" process, that may not happen. Another technique for finding bootstrap-time bugs is to manually feed the bootstrapping commands into bootstrap mode, using a leftover directory from initdb --noclean. This can help if there has been no PANIC that leaves a core dump, but just a FATAL or ERROR, for example. It's easy to attach GDB to such a backend.

Starting Postgres under GDB

Debugging multi-process applications like PostgreSQL has historically been very painful with GDB. Thankfully with recent 7.x releases, this has been improved greatly by "inferiors" (GDB's term for multiple debugged processes).

NB! This is still quite fragile, so don't expect to be able to do this in production.

# Stop server
pg_ctl -D /path/to/data stop -m fast
# Launch postgres via gdb
gdb --args postgres -D /path/to/data

Now, in the GDB shell, use these commands to set up an environment:

# We have scroll bars in the year 2012!
set pagination off
# Attach to both parent and child on fork
set detach-on-fork off
# Stop/resume all processes
set schedule-multiple on
# Usually don't care about these signals
handle SIGUSR1 noprint nostop
handle SIGUSR2 noprint nostop
# Ugly hack so we don't break on process exit
python gdb.events.exited.connect(lambda x: [gdb.execute('inferior 1'), gdb.post_event(lambda: gdb.execute('continue'))])
# Phew! Run it.
run

To get a list of processes, run info inferiors. To switch to another process, run inferior NUM.
