Parallel Query Execution

From PostgreSQL wiki

(Difference between revisions)
Jump to: navigation, search
(Challenges: add subheadings)
(move sections)
Line 12: Line 12:
  
 
*  server-side languages can potentially do parallel operations
 
*  server-side languages can potentially do parallel operations
 
==Challenges==
 
'''Finding Opportunities'''  For parallelism to be added to a single-threaded task, the task must be able to be broken into sufficiently-large parts and executed independently.  (If the sub-parts are too small, the overhead of doing parallelism overwhelms the benefits of parallelism.)  Unfortunately, unlike a GUI application, the Postgres backend executes a query by performing many small tasks that must be executed in sequence, e.g. parser, planner, executor.
 
 
This means that databases allow parallelism only in limited situations, mostly for large queries that can become CPU or I/O bound.  For example, it is unlikely that selecting a row based on a primary key would benefit from parallelism.  In contrast, large queries can often benefit from parallelism.
 
 
'''Returning Data''' Another challenge is returning data from the helper process/thread.  For something like SUM(), it is easy, but passing a large volume of data back can be complex.
 
 
'''Avoiding Overhead''' Parallelism has its own costs so there will need to be a way to control when parallel execution is used.
 
 
'''Limiting Excessive Parallelism''' There also needs to be some mechnism that detects parallelism by other sessions so CPU and I/O resources are not exceeded.
 
  
 
==Benefits==
 
==Benefits==
Line 44: Line 33:
 
*  An initial approach might start by modifying individual plan nodes to run in parallel in the executor.  Eventually we'd need to educate the planner and optimizer about how to model parallelizing queries.
 
*  An initial approach might start by modifying individual plan nodes to run in parallel in the executor.  Eventually we'd need to educate the planner and optimizer about how to model parallelizing queries.
  
==Opportunties==
+
==Challenges==
 +
'''Finding Opportunities'''  For parallelism to be added to a single-threaded task, the task must be able to be broken into sufficiently-large parts and executed independently.  (If the sub-parts are too small, the overhead of doing parallelism overwhelms the benefits of parallelism.)  Unfortunately, unlike a GUI application, the Postgres backend executes a query by performing many small tasks that must be executed in sequence, e.g. parser, planner, executor.
 +
 
 +
This means that databases allow parallelism only in limited situations, mostly for large queries that can become CPU or I/O bound.  For example, it is unlikely that selecting a row based on a primary key would benefit from parallelism.  In contrast, large queries can often benefit from parallelism.
 +
 
 +
'''Returning Data''' Another challenge is returning data from the helper process/thread.  For something like SUM(), it is easy, but passing a large volume of data back can be complex.
 +
 
 +
'''Avoiding Overhead''' Parallelism has its own costs so there will need to be a way to control when parallel execution is used.
 +
 
 +
'''Limiting Excessive Parallelism''' There also needs to be some mechnism that detects parallelism by other sessions so CPU and I/O resources are not exceeded.
 +
 
 +
==Specific Opportunties==
 
Parallel opportunities include:
 
Parallel opportunities include:
  

Revision as of 18:43, 24 January 2013

This is currently under development. See the ToDo list.

Contents

Purpose

Postgres currently supports full parellism in client-side code. Applications can open multiple database connections and manage them asyncronously, or via threads.

On the server-side, there is already some parallelism:

  • server-side languages can potentially do parallel operations

Benefits

There are three possible benefits of parallelism:

  • using multiple CPUs
  • using multiple I/O channels (for sequential and random I/O)
  • using multiple CPUs and I/O channels

Approaches

There are several methods to add parallelism:

  • use fork (or a thread on Windows) and only call libc and parallel-specific functions to do parallel computation or I/O. This avoids the problem of trying to make the existing backend code thread-safe. Do we need to wait until we can share a transaction among back-end processes?
  • same as above, but modify some existing backend modules to be fork/thread-safe, with or without shared memory access; this might allow entire executor node trees to be run in parallel
  • create full backends that can execute parts of a query in parallel and return results
  • An initial approach might start by modifying individual plan nodes to run in parallel in the executor. Eventually we'd need to educate the planner and optimizer about how to model parallelizing queries.

Challenges

Finding Opportunities For parallelism to be added to a single-threaded task, the task must be able to be broken into sufficiently-large parts and executed independently. (If the sub-parts are too small, the overhead of doing parallelism overwhelms the benefits of parallelism.) Unfortunately, unlike a GUI application, the Postgres backend executes a query by performing many small tasks that must be executed in sequence, e.g. parser, planner, executor.

This means that databases allow parallelism only in limited situations, mostly for large queries that can become CPU or I/O bound. For example, it is unlikely that selecting a row based on a primary key would benefit from parallelism. In contrast, large queries can often benefit from parallelism.

Returning Data Another challenge is returning data from the helper process/thread. For something like SUM(), it is easy, but passing a large volume of data back can be complex.

Avoiding Overhead Parallelism has its own costs so there will need to be a way to control when parallel execution is used.

Limiting Excessive Parallelism There also needs to be some mechnism that detects parallelism by other sessions so CPU and I/O resources are not exceeded.

Specific Opportunties

Parallel opportunities include:

  • Tablespaces
  • Partitions
  • Multi-table access
  • Joins (e.g. nested loop), CTEs
  • Sequential scans on 1GB segment files
  • Aggregates
  • Data export
  • COPY (to reduce the CPU overhead of parsing)
  • Index builds
  • Constraint checking
  • Expensive functions
Personal tools