ENOSPC

From PostgreSQL wiki

What happens when PostgreSQL runs out of disk space?

When writing the WAL

If there is a short write, XLogWrite() will retry. If it gets ENOSPC (or any other error) while writing, it will PANIC: the server crashes and then runs crash recovery. It is therefore very important not to run out of space for pg_wal. This is one motivation to put it on a separate filesystem or apply some kind of quota scheme, and to monitor WAL accumulation so that you don't have (say) a replication slot causing unbounded WAL retention.
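The retry-on-short-write, PANIC-on-error behavior can be sketched like this (a simplified, hypothetical illustration, not the actual PostgreSQL code; wal_write_all and wal_panic are made-up names, with wal_panic standing in for ereport(PANIC, ...)):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Stand-in for PostgreSQL's ereport(PANIC, ...): crash the server. */
static void wal_panic(const char *msg)
{
    fprintf(stderr, "PANIC: %s: %s\n", msg, strerror(errno));
    abort();
}

/*
 * Write a full WAL buffer, retrying on short writes.  Any real error,
 * including ENOSPC, is a PANIC: the WAL must be durable, and there is
 * no transaction left to report the error to.
 */
static void wal_write_all(int fd, const char *buf, size_t len)
{
    while (len > 0)
    {
        ssize_t n = write(fd, buf, len);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before writing: just retry */
            wal_panic("could not write to WAL file");
        }
        /* Short write: advance past what was written and retry the rest. */
        buf += n;
        len -= (size_t) n;
    }
}
```

The key point of the sketch is the asymmetry: a short write is retried quietly, but any errno other than EINTR takes the whole server down.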

COW filesystems may be more likely to get ENOSPC while trying to write actual data in XLogWrite(), while overwrite filesystems probably get the ENOSPC while "preallocating" a WAL segment by writing out zeroes, in XLogFileInitInternal(). This means that overwrite filesystems can avoid one variant of this failure mode: if they have enough "recyclable" WAL files, they can't run out of disk space here, which gives them a defence against non-WAL data filling the disk up. But if the problem is a stuck replication slot that requires WAL data to be retained, preventing WAL segment recycling, then we need to make new WAL segments, and then we might PANIC.
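The zero-filling preallocation amounts to something like the following sketch (a hypothetical simplification; zero_fill_segment is a made-up name, and the real XLogFileInitInternal() works on segments of wal_segment_size, 16 MB by default, and may use posix_fallocate() instead on some platforms):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 8192

/*
 * Fill a new WAL segment with zeroes so that an overwrite filesystem
 * allocates all of its blocks up front.  On such filesystems, later
 * WAL writes into this segment overwrite already-allocated blocks and
 * so cannot fail with ENOSPC.  Returns 0 on success, -1 with errno
 * set (typically ENOSPC) on failure.
 */
static int zero_fill_segment(int fd, size_t segment_size)
{
    char zeroes[BLOCK_SIZE];
    size_t written = 0;

    memset(zeroes, 0, sizeof zeroes);
    while (written < segment_size)
    {
        size_t chunk = segment_size - written;
        if (chunk > sizeof zeroes)
            chunk = sizeof zeroes;

        ssize_t n = write(fd, zeroes, chunk);
        if (n < 0)
        {
            if (errno == EINTR)
                continue;
            return -1;      /* e.g. ENOSPC: no new segment, caller decides */
        }
        written += (size_t) n;
    }
    return 0;
}
```

Note that on a COW filesystem this zero-filling reserves nothing useful: later writes allocate fresh blocks anyway, so the ENOSPC risk merely moves to XLogWrite() time.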

When writing relation data

The behaviour varies considerably depending on the filesystem semantics. There are broadly three times at which we can discover ENOSPC, on three different classes of filesystem:

1. In smgrextend()/smgrzeroextend(), which is a good time because the transaction generating too much new data will be aborted "gracefully" when the kernel reports ENOSPC (or any other error). The user will see an ENOSPC error message, but the system will continue to run, allowing the user to drop or truncate database objects to make space. However, smgrextend() really only reserves space on local, overwrite filesystems, such as ext4, ntfs, ufs, and usually xfs.

2. In smgrwrite(), which is a bad time because it happens after dirty data has accumulated that needs to be written back, and the relevant transactions may already have committed. We can't write the data back in the background, and checkpoints will repeatedly raise ERRORs until you make some space. In this state you can't shut down (because that requires a checkpoint), and if you crash, then crash recovery will probably run into the same problem, but now it will be raised as FATAL, so you can't even start your database again until you make some space. This is expected on COW filesystems like btrfs, zfs, apfs, and refs, and maybe xfs if you copied/upgraded your cluster with reflinks and the source links still exist.

3. In close() or fsync(), which is a terrible time because we have to PANIC, and that's not all. Between the smgrextend() call and the fsync() call in the following checkpoint some time later, some systems seem to roll back the recent file extension when they asynchronously discover the remote ENOSPC, which affects lseek(SEEK_END), so our sequential scans will silently fail to scan recently added blocks, etc. This happens on network filesystems like NFS that don't do any kind of space reservation until they flush. (I believe NFSv4.2 has the machinery required to make posix_fallocate() work, but we aren't doing the right things for that, and we couldn't tell whether it was being used even if we did. I have some other threads about that, and prototype solutions. But if it's a remote COW filesystem accessed via NFS, that couldn't help even in theory.)

(Vapourware speculation: For COW filesystems, it's not smgrextend() that needs to reserve space or fail! I think it's a fourth, earlier time: when we dirty buffers in memory. That's a point at which we'd ideally want some well-amortised way to reserve the right to write them out later without ENOSPC/EDQUOT. It is interesting to think about how that might look and work in an imaginary kernel interface, if we could have anything we wanted. But it's not clear how it'd work, how you'd amortise enough, how you'd share 'reservations' between processes, etc. Wild speculation on social media.)

When writing other data

Most other places (temporary data, etc.) will simply raise an ERROR and abort the transaction if they get an error such as ENOSPC, or a short write.