Readahead

From PostgreSQL wiki

in Linux

Summary

Very concise summary
readahead does read.
Concise summary
readahead in Linux is more similar to the readv-based implementation in PostgreSQL (at least the streaming I/O implemented by Thomas Munro for PG 17) than to true asynchronous I/O.
It might be easier to approach it this way to understand how it works.
Important points
readahead is not async
readahead quickly becomes filesystem specific; see the bottom of the call tree below.

On Linux, readahead has evolved over time; the most recent thorough review was done by Neil Brown around April 8, 2022.
He wrote an article to summarize what he discovered: Readahead: the documentation I wanted to read.

This led to updated documentation (apparently merged in 5.18): https://www.kernel.org/doc/html/latest/core-api/mm-api.html#readahead

This is indeed a very nice summary, and it mostly covers the two entry functions for readahead: page_cache_sync_ra() and page_cache_async_ra().

Both have nice, up-to-date documentation covering their underlying functions: https://www.kernel.org/doc/html/latest/core-api/mm-api.html?highlight=page_cache_async_readahead#c.page_cache_sync_readahead


Take a moment to read the documentation (excerpt here):

page_cache_sync_readahead()
should be called when a cache miss happened: it will submit the read. The readahead logic may decide to piggyback more pages onto the read request if access patterns suggest it will improve performance.
page_cache_async_readahead()
should be called when a page is used which is marked as PageReadahead; this is a marker to suggest that the application has used up enough of the readahead window that we should start pulling in more pages.

OK, so far so good.

But the code path taken by posix_fadvise is distinct, and its entry point leads to:

page_cache_ra_unbounded()
This function is for filesystems to call when they want to start readahead beyond a file's stated i_size. This is almost certainly not the function you want to call. Use page_cache_async_readahead() or page_cache_sync_readahead() instead. File is referenced by caller. Mutexes may be held by caller. May sleep, but will not reenter filesystem to reclaim memory.
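From userspace, this path is entered with posix_fadvise() and the POSIX_FADV_WILLNEED advice. A minimal sketch (the helper name is hypothetical, not from the wiki; assumes Linux):

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to prefetch `len` bytes of `fd`
 * starting at `offset` into the page cache.  On Linux, WILLNEED enters
 * generic_fadvise() and, from there, force_page_cache_ra().
 * Returns 0 on success.  Note: the call may block -- it is not async. */
int prefetch_range(int fd, off_t offset, off_t len)
{
    return posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
}
```

A `len` of 0 means "to the end of the file", so `prefetch_range(fd, 0, 0)` hints the whole file.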

Today, Linux uses "folios", and the readahead flag is set via folio_set_readahead(folio); in this function.

The other super important part is the "May sleep" note: yes, IT IS NOT ASYNCHRONOUS. The only situation where it is partly asynchronous is when the storage is really congested, in which case the processing of the provided readahead range is aborted partway through. As mentioned by Neil, the "congestion" signal has not been propagated everywhere and may not work as expected.

Side note: Linux splits the range into chunks of 2 MB to manage memory and reduce locking. This is hardcoded.
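The chunking loop can be sketched as a toy model (assuming 4 KiB pages; ra_chunk_count() is a made-up name — the real loop lives in force_page_cache_ra()):

```c
/* Sketch of the chunking done by force_page_cache_ra(): a request is
 * split into 2 MB units so the kernel never pins too much memory at once. */
#define PAGE_SIZE_BYTES 4096UL
#define RA_CHUNK_PAGES  (2UL * 1024 * 1024 / PAGE_SIZE_BYTES) /* 512 pages */

/* Returns how many chunks a readahead of nr_to_read pages is issued as. */
unsigned long ra_chunk_count(unsigned long nr_to_read)
{
    unsigned long chunks = 0;
    while (nr_to_read) {
        unsigned long this_chunk =
            nr_to_read > RA_CHUNK_PAGES ? RA_CHUNK_PAGES : nr_to_read;
        /* the kernel would issue do_page_cache_ra() for this_chunk here */
        nr_to_read -= this_chunk;
        chunks++;
    }
    return chunks;
}
```

So a 10 MB WILLNEED on 4 KiB pages (2560 pages) is issued as five 2 MB reads, not one.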

An excerpt from the comment in the code of readahead:

   /*
    * Each readahead request is partly synchronous read, and partly async
    * readahead.  This is reflected in the struct file_ra_state which
    * contains ->size being the total number of pages, and ->async_size
    * which is the number of pages in the async section.  The readahead
    * flag will be set on the first folio in this async section to trigger
    * a subsequent readahead.  Once a series of sequential reads has been
    * established, there should be no need for a synchronous component and
    * all readahead request will be fully asynchronous.
    */
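The window described in that comment can be modelled with a toy struct (a simplified sketch of the kernel's struct file_ra_state; the names here are illustrative, not the kernel's):

```c
/* Toy model of a readahead window: ->size total pages, of which the
 * trailing ->async_size pages form the async section. */
struct ra_window {
    unsigned long start;      /* index of the first page in the window */
    unsigned long size;       /* total pages in the window */
    unsigned long async_size; /* trailing pages forming the async section */
};

/* The readahead flag goes on the first folio of the async section; when
 * the reader reaches this page index, the next readahead is triggered. */
unsigned long readahead_marker_index(const struct ra_window *ra)
{
    return ra->start + ra->size - ra->async_size;
}
```

For example, a 64-page window starting at page 100 with a 16-page async section gets its marker at page 148: touching that page kicks off the next batch before the remaining 16 pages are consumed.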

Call tree - The functions really used by linux posix_fadvise

https://elixir.bootlin.com/linux/latest/source/mm/fadvise.c#L31

   /*
    * POSIX_FADV_WILLNEED could set PG_Referenced, and POSIX_FADV_NOREUSE could
    * deactivate the pages and clear PG_Referenced.
    */
   int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)

https://elixir.bootlin.com/linux/latest/source/mm/internal.h#L126

   inline wrapper

https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L306

   /*
    * Chunk the readahead into 2 megabyte units, so that we don't pin too much
    * memory at once.
    */
   void force_page_cache_ra(struct readahead_control *ractl,
           unsigned long nr_to_read)

https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L281

   /*
    * do_page_cache_ra() actually reads a chunk of disk.  It allocates
    * the pages first, then submits them for I/O. This avoids the very bad
    * behaviour which would occur if page allocations are causing VM writeback.
    * We really don't want to intermingle reads and writes like that.
    */
   

https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L205

   /**
    * page_cache_ra_unbounded - Start unchecked readahead.
    * @ractl: Readahead control.
    * @nr_to_read: The number of pages to read.
    * @lookahead_size: Where to start the next readahead.
    *
    * This function is for filesystems to call when they want to start
    * readahead beyond a file's stated i_size.  This is almost certainly
    * not the function you want to call.  Use page_cache_async_readahead()
    * or page_cache_sync_readahead() instead.
    *
    * Context: File is referenced by caller.  Mutexes may be held by caller.
    * May sleep, but will not reenter filesystem to reclaim memory.
    */

There are some interesting comments in it, during preallocation:

   /*
    * Partway through the readahead operation, we will have added
    * locked pages to the page cache, but will not yet have submitted
    * them for I/O.  Adding another page may need to allocate memory,
    * which can trigger memory reclaim.  Telling the VM we're in
    * the middle of a filesystem operation will cause it to not
    * touch file-backed pages, preventing a deadlock.  Most (all?)
    * filesystems already specify __GFP_NOFS in their mapping's
    * gfp_mask, but let's be explicit here.
    */

https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L146

   static void read_pages(struct readahead_control *rac)

Then it is filesystem specific.

EXT4

https://elixir.bootlin.com/linux/latest/source/fs/ext4/inode.c#L3124

   static int ext4_read_folio(struct file *file, struct folio *folio)

If the folio is not found in memory:

https://elixir.bootlin.com/linux/latest/source/fs/ext4/readpage.c#L211

   int ext4_mpage_readpages(struct inode *inode,
           struct readahead_control *rac, struct folio *folio)

After a lot of logic around sequential access, holes, etc.:

https://elixir.bootlin.com/linux/latest/source/block/blk-core.c#L833


   /**
    * submit_bio - submit a bio to the block device layer for I/O
    * @bio: The &struct bio which describes the I/O
    *
    * submit_bio() is used to submit I/O requests to block devices.  It is passed a
    * fully set up &struct bio that describes the I/O that needs to be done.  The
    * bio will be send to the device described by the bi_bdev field.
    *
    * The success/failure status of the request, along with notification of
    * completion, is delivered asynchronously through the ->bi_end_io() callback
    * in @bio.  The bio must NOT be touched by the caller until ->bi_end_io() has
    * been called.
    */

First thoughts

  • having a WILLNEED on a sequential pattern just competes with Linux's own readahead.
  • if PG_readahead is set, Linux will interpret it as a successful past readahead and may keep issuing more readahead from this block when reading (apparently mitigated by some hole detection).
  • if PG_readahead is not set when reading, it has not been checked how the readahead logic will use that.
  • issuing a DONTNEED just after a read is apparently optimized in the Linux code.
  • setting the RANDOM or SEQUENTIAL flag effectively influences the Linux default readahead, and it is costless.
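To illustrate the last two points, here is a hedged userspace sketch (assumes Linux; fadvise_hints_demo() is a made-up name) exercising the cheap hints on a scratch file:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Demo of the cheap fadvise hints: SEQUENTIAL doubles the default
 * readahead window for the fd, RANDOM disables readahead for it, and
 * DONTNEED drops the cached pages right after use.
 * Returns 0 when every hint was accepted, nonzero otherwise. */
int fadvise_hints_demo(void)
{
    char path[] = "/tmp/fadvise_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    char buf[4096];
    memset(buf, 'x', sizeof buf);
    if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf)
        goto fail;

    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL); /* 2x ra window */
    rc |= posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);        /* ra disabled  */
    if (pread(fd, buf, sizeof buf, 0) != (ssize_t)sizeof buf)
        goto fail;
    rc |= posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);      /* drop pages   */

    close(fd);
    unlink(path);
    return rc;
fail:
    close(fd);
    unlink(path);
    return -1;
}
```

All three hints only update per-fd state (or drop already-clean pages), which is why they are so much cheaper than a WILLNEED that actually submits reads.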