Parallel Query Execution

From PostgreSQL wiki

(Difference between revisions)
(Challenges of Parallelism: large queries)
(finish overhaul)
Line 18:

This means that databases allow parallelism only in limited situations, mostly for large queries that can become CPU or I/O bound.  For example, it is unlikely that selecting a row based on a primary key would benefit from parallelism.  In contrast, large queries can often benefit from parallelism.

Removed from the earlier revision:

==Project Goal==
* Implement parallel query
* Implementation will use one master process (the current backend) and multiple slave processes forked from the postmaster in response to the master's signal to the postmaster.

==Issues==
* Shared memory
** new shared memory context which uses the OSSP mm library
** limitation – so far we do not bother with attaching to shared memory in the EXEC_BACKEND case, so slaves can only be forked from the postmaster
* Slave process
** initialization is almost the same as for a standard backend; only the username and database come from the master process
** limitation – additional pg modules loaded in the backend are not reloaded in the slave

==ToDo==
* parallel sort using multiple processes
** in nodeSort, distribute incoming tuples to slaves using a hash function
** implement a producer/consumer structure in shared memory to allow sending data between processes
** implement the final merge phase of slave results

==Process vs Thread==
* Process +
** Existing code does not need to be rewritten to be thread-safe
* Thread +
** No special effort to share data between threads
* Process -
** Speed issues in switching context
* Thread -
** Existing code is not thread-safe

Added in this revision: the Benefits of Parallelism, Parallelism Approaches, and Parallelism Opportunities sections, shown in full in the revision text below.

[[Category:Development]]

Revision as of 04:24, 15 January 2013

This is currently under development. See the ToDo list.


Purpose of Parallelism

Postgres currently supports full parallelism in client-side code. Applications can open multiple database connections and manage them asynchronously, or via threads.

On the server-side, there is already some parallelism:

  • server-side languages can potentially do parallel operations

Challenges of Parallelism

For parallelism to be added to a single-threaded task, the task must be divisible into sufficiently large parts that can be executed independently. (If the sub-parts are too small, the overhead of parallelism overwhelms its benefits.) Unfortunately, unlike a GUI application, the Postgres backend executes a query by performing many small tasks that must run in sequence, e.g. parser, planner, executor.

This means that databases allow parallelism only in limited situations, mostly for large queries that can become CPU or I/O bound. For example, it is unlikely that selecting a row based on a primary key would benefit from parallelism. In contrast, large queries can often benefit from parallelism.
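The trade-off between startup overhead and useful work can be illustrated with a toy cost model. This is only a sketch in Python, chosen for brevity. The function names, the assumption that the master starts workers one after another, and the assumption that work divides evenly are all hypothetical and are not taken from any Postgres code.

```python
def parallel_time(total_work, workers, startup_cost):
    """Estimated elapsed time for a parallel run.

    Assumptions (hypothetical, for illustration only):
    - the master starts workers one at a time, paying startup_cost for each;
    - the work then divides evenly across all workers.
    """
    return workers * startup_cost + total_work / workers


def worth_parallelizing(total_work, workers, startup_cost):
    """True if the modeled parallel run beats doing all the work serially."""
    return parallel_time(total_work, workers, startup_cost) < total_work
```

With times in milliseconds, `worth_parallelizing(0.1, 4, 1.0)` is false, matching the primary-key lookup case: four workers cost 4 ms to start for 0.1 ms of work. `worth_parallelizing(10000.0, 4, 1.0)` is true, matching the large-query case.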

Benefits of Parallelism

There are three possible benefits of parallelism:

  • using multiple CPUs
  • using multiple I/O channels
  • using multiple CPUs and I/O channels

Parallelism Approaches

There are several methods to add parallelism:

  • use fork (or a thread on Windows) and only call libc and parallel-specific functions to do parallel computation or I/O. This avoids the problem of trying to make the existing backend code thread-safe.
  • same as above, but modify some existing backend modules to be fork/thread-safe, with or without shared memory access; this might allow entire executor node trees to be run in parallel
  • create full backends that can execute parts of a query in parallel and return results
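The first approach, forking workers that do only self-contained computation and hand results back, can be sketched roughly as follows. This is an illustration in Python rather than backend C code. `run_in_children` is a hypothetical helper, the use of pipes and pickle to return results is an assumption made for the sketch, and it requires a Unix-like system where `os.fork` is available.

```python
import os
import pickle


def run_in_children(func, chunks):
    """Fork one child per chunk; each child applies func to its chunk
    and sends the pickled result back to the parent through a pipe."""
    children = []
    for chunk in chunks:
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: do the isolated computation, write the result, and
            # exit immediately without running any of the parent's cleanup.
            os.close(read_fd)
            status = 0
            try:
                with os.fdopen(write_fd, "wb") as out:
                    pickle.dump(func(chunk), out)
            except BaseException:
                status = 1
            finally:
                os._exit(status)
        # Parent: keep only the read end and remember the child.
        os.close(write_fd)
        children.append((pid, read_fd))

    results = []
    for pid, read_fd in children:
        # Read each child's result in order, then reap the child.
        with os.fdopen(read_fd, "rb") as src:
            results.append(pickle.load(src))
        os.waitpid(pid, 0)
    return results
```

For example, `run_in_children(sum, [[1, 2, 3], [4, 5], [6]])` returns one partial sum per child. The children share nothing with the parent after the fork, which is what sidesteps the thread-safety problem. The sketch does not handle a failed child gracefully: the parent's `pickle.load` would raise `EOFError`.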

Parallelism Opportunities

Parallel opportunities include:

  • sorting
  • tablespaces
  • partitions
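Of the opportunities listed, sorting is the easiest to illustrate: split the input, sort each part in a separate process, then merge the sorted runs. The sketch below is in Python and is not Postgres code. `parallel_sort` is a hypothetical name, the round-robin split is an arbitrary choice, and the "fork" start method assumes a Unix-like system.

```python
import heapq
import multiprocessing


def parallel_sort(values, workers=4):
    """Sort values by sorting chunks in worker processes and merging the runs."""
    if not values:
        return []
    # Split the input round-robin into one chunk per worker.
    chunks = [values[i::workers] for i in range(workers)]
    # Assumption: a Unix-like system, where the "fork" start method exists.
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=workers) as pool:
        # Each worker process sorts one chunk independently.
        sorted_runs = pool.map(sorted, chunks)
    # Final merge phase: combine the sorted runs into one sorted list.
    return list(heapq.merge(*sorted_runs))
```

The merge step runs in a single process, so the gain comes only from the per-chunk sorts. For small inputs the cost of starting the worker processes is likely to outweigh that gain.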