PGStrom

From PostgreSQL wiki


Overview

PG-Strom is an extension of PostgreSQL that works as a custom-scan provider. It is designed to off-load several CPU-intensive workloads to GPU devices, to utilize their massively parallel execution capability. GPUs have the advantage in number of processor cores (usually several hundred to thousands) and in RAM bandwidth (usually several times wider than a CPU's). They work most efficiently when processing a massive number of numerical operations simultaneously.

PG-Strom stands on two major ideas:

  1. Native GPU code generation on the fly
  2. Asynchronous and pipelined execution model

PG-Strom checks whether (a part of) a given query is executable on GPU devices during the query optimization phase. Once it determines the query can be off-loaded, PG-Strom constructs source code for a GPU native binary on the fly, then kicks off a just-in-time compile process at the head of the execution stage.

Next, PG-Strom loads a bunch of fetched rows into DMA buffers (15MB per buffer by default), then kicks off the DMA transfer and GPU kernel execution asynchronously. The CUDA platform handles these tasks in the background, so PostgreSQL can advance the current process in the meantime. This asynchronous and pipelined execution model also hides the latency usually associated with GPU acceleration.

[Figure: PG-Strom architecture overview]

How to use

Installation

Requirements

  • PostgreSQL 9.5devel
  • CUDA 7.0 or later
  • x86_64 Linux platform supported by CUDA

Manual Installation

1. Install CUDA Toolkit v7.0 (or later)
  • You can get the CUDA Toolkit from NVIDIA and install it. Its default installation path is /usr/local/cuda; please don't change the default path.
  • For runtime library lookup, add a configuration entry that tells the dynamic linker the location of libnvrtc.so:
$ sudo bash
# echo /usr/local/cuda/lib64 > /etc/ld.so.conf.d/cuda-lib64.conf
# ldconfig
2. Install PostgreSQL v9.5devel (or later)
  • You can check out PostgreSQL v9.5devel from GitHub.
  • Run the ./configure script. At this moment, we recommend attaching --enable-debug and --enable-cassert.
  • Run make and make install
$ git clone https://github.com/postgres/postgres.git pgsql
$ cd pgsql
$ ./configure --enable-debug --enable-cassert
$ make
$ sudo make install
3. Install latest PG-Strom
  • You can check out the latest PG-Strom from GitHub.
  • Ensure pg_config is in your command search path.
  • Run make and make install
$ git clone https://github.com/pg-strom/devel pg_strom
$ cd pg_strom
$ which pg_config
/usr/local/pgsql/bin/pg_config
$ make
$ sudo make install
4. Create database
  • Create database using initdb.
  • For text comparison, we recommend adding the --no-locale option, because collation-aware text comparison is not supported on the GPU side.
  • Edit $PGDATA/postgresql.conf for parameter configuration.
    • $libdir/pg_strom has to be added to shared_preload_libraries.
    • shared_buffers has to be expanded according to the sizing guideline.
    • For other configurations, see the sizing guideline.
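Put together, step 4 might look like the session below. The data directory path and the shared_buffers value are illustrative only; size shared_buffers according to the sizing guideline.

```shell
$ initdb --no-locale -D /usr/local/pgsql/data
$ cat >> /usr/local/pgsql/data/postgresql.conf <<'EOF'
shared_preload_libraries = '$libdir/pg_strom'
shared_buffers = 8GB
EOF
```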
5. Start/Restart PostgreSQL
  • Start PostgreSQL using pg_ctl. You should see messages like the following in the log:
LOG:  CUDA Runtime version 7.0.0
LOG:  NVIDIA driver version: 346.46
LOG:  GPU0 GeForce GTX 750 Ti (640 CUDA cores, 1110MHz), L2 2048KB, RAM 2047MB (128bits, 2700KHz), capability 5.0
LOG:  NVRTC - CUDA Runtime Compilation version 7.0
LOG:  database system was shut down at 2015-06-28 17:26:03 JST
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
6. Create PG-Strom extension
  • Run CREATE EXTENSION pg_strom to set up the relevant SQL objects.
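For example, from psql connected to the new database:

```sql
postgres=# CREATE EXTENSION pg_strom;
CREATE EXTENSION
postgres=# \dx pg_strom
```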

Cloud Installation

Under construction

Run SQL on GPU

Once PG-Strom is loaded, no special indication is needed to run SQL on the GPU device. It performs as a custom-scan provider of PostgreSQL, offering alternative scan/join logic that can run on the GPU. If this is feasible and reasonable from an estimated-cost perspective, the planner puts a custom-scan node in place of the built-in query execution logic. Below is an ideal case, because all the inner relations to be joined can be loaded into GPU RAM at once, and pre-aggregation effectively reduces the number of rows to be processed by the CPU.

postgres=# EXPLAIN SELECT cat, avg(x) FROM t0 natural join t1 natural join t2 natural join t3 natural join t4 GROUP BY cat;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=300706.95..300707.28 rows=26 width=12)
   Group Key: t0.cat
   ->  Custom Scan (GpuPreAgg)  (cost=8860.76..252963.69 rows=1077 width=52)
         Bulkload: On (density: 99.00%)
         Reduction: Local + Global
         ->  Custom Scan (GpuJoin)  (cost=7860.76..251210.47 rows=9899296 width=12)
               Bulkload: On (density: 100.00%)
               Depth 1: Logic: GpuHashJoin, HashKeys: (cid), JoinQual: (cid = cid), nrows_ratio: 0.98995197
               Depth 2: Logic: GpuHashJoin, HashKeys: (aid), JoinQual: (aid = aid), nrows_ratio: 1.00000000
               Depth 3: Logic: GpuHashJoin, HashKeys: (bid), JoinQual: (bid = bid), nrows_ratio: 1.00000000
               Depth 4: Logic: GpuHashJoin, HashKeys: (did), JoinQual: (did = did), nrows_ratio: 1.00000000
               ->  Custom Scan (BulkScan) on t0  (cost=0.00..242855.74 rows=9999774 width=28)
               ->  Seq Scan on t3  (cost=0.00..734.00 rows=40000 width=4)
               ->  Seq Scan on t1  (cost=0.00..734.00 rows=40000 width=4)
               ->  Seq Scan on t2  (cost=0.00..734.00 rows=40000 width=4)
               ->  Seq Scan on t4  (cost=0.00..734.00 rows=40000 width=4)
(16 rows)

Below is a set of benchmark results for the microbenchmark workload above, increasing the number of tables to be joined.

[Figure: GpuJoin microbenchmark results]

Target Query:
SELECT cat, AVG(x) FROM t0 NATURAL JOIN t1 [, ...] GROUP BY cat;
  • t0 contains 100M rows; the other relations (t1-t10) contain 100K rows each.
  • All the data was preloaded.
Measurement environment:
  • H/W: NEC Express5800 HR120b-1
  • CPU: Xeon E5-2640 (2.50GHz, 6 cores)
  • RAM: 256GB
  • GPU: NVIDIA GTX980
  • S/W: PostgreSQL 9.5devel + PG-Strom (26-Mar-2015 ver), CUDA 7.0 (x86_64)

Data types

  • Numeric
    • smallint
    • integer
    • bigint
    • real
    • float
    • numeric
  • Date and Time
    • date
    • time
    • timestamp
    • timestamptz
  • Variable length
    • text
    • bpchar()
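A table restricted to the types above can be off-loaded. As a hypothetical example (table and column names are made up for illustration), a scan with a numerical qualifier on such a table becomes a GpuScan candidate:

```sql
-- Hypothetical table using only GPU-supported data types
CREATE TABLE measurements (
    id     bigint,
    cat    integer,
    x      real,
    taken  timestamp
);

-- The numerical qualifier below can be evaluated on the GPU
EXPLAIN SELECT count(*) FROM measurements WHERE sqrt((x - 10.0)^2) < 2.0;
```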

Functions and Operators

Under construction

GUC options

pg_strom.enabled
type: boolean
default: on
It enables / disables the entire PG-Strom functionality. Once it is disabled, PG-Strom's CustomScan nodes never appear in the query execution plan.
pg_strom.perfmon
type: boolean
default: off
It enables / disables the performance monitor feature. Once this parameter gets enabled, EXPLAIN ANALYZE will show extra performance information collected from CUDA runtime.
pg_strom.cuda_visible_devices
type: text
default: NULL
It sets the value assigned to the CUDA_VISIBLE_DEVICES environment variable at system startup time. The CUDA runtime ignores any GPU device whose device number is not listed, if this environment variable is set.
See CUDA Environment Variables for more details.
pg_strom.enable_gpuscan
type: boolean
default: on
It enables / disables the feature support of GpuScan. Once it is disabled, PG-Strom never uses GpuScan during plan construction. Note that it does not affect the behavior of query execution plans already constructed.
pg_strom.enable_gpuhashjoin
type: boolean
default: on
It enables / disables the feature support of GpuHashJoin. Once it is disabled, PG-Strom never uses GpuHashJoin during plan construction. Note that it does not affect the behavior of query execution plans already constructed.
pg_strom.enable_gpunestloop
type: boolean
default: on
It enables / disables the feature support of GpuNestLoop. Once it is disabled, PG-Strom never uses GpuNestLoop during plan construction. Note that it does not affect the behavior of query execution plans already constructed.
pg_strom.enable_gpupreagg
type: boolean
default: on
It enables / disables the feature support of GpuPreAgg. Once it is disabled, PG-Strom never injects a GpuPreAgg node into the PlannedStmt. Note that it does not affect the behavior of query execution plans already constructed.
pg_strom.enable_gpusort
type: boolean
default: on
It enables / disables the feature support of GpuSort. Once it is disabled, PG-Strom never injects a GpuSort node into the PlannedStmt. Note that it does not affect the behavior of query execution plans already constructed.
pg_strom.chunk_size
type: integer
default: 15MB
The default size of individual DMA buffer (usually called chunk).
pg_strom.gpu_setup_cost
type: float
default: to be determined based on the device property
The planner's estimate of the average cost to set up GPU devices once per query execution. It covers the cost to initialize the CUDA context and build the GPU kernel source (if not already cached).
pg_strom.gpu_operator_cost
type: float
default: to be determined based on the device property
The planner's estimate of the cost of processing each operator or function call on GPU devices
pg_strom.gpu_tuple_cost
type: float
default: to be determined based on the device property
The planner's estimate of the cost of processing each tuple (row)
pg_strom.program_cache_size
type: integer
default: 16MB
Size of the CUDA program cache, allocated in static shared memory once at startup time; too large a setting therefore wastes RAM. PG-Strom caches pairs of GPU kernel source and the built CUDA module, to skip run-time module builds the second time and later. Cached entries are kept in this area, so too small a setting degrades performance due to overly frequent run-time compilation.
pg_strom.bulkload_enabled
type: boolean
default: on
It enables / disables the bulkload feature. If both the parent and child nodes of the plan tree are CustomScan nodes managed by PG-Strom, it checks whether bulkload mode is available to exchange a bunch of data between them.
Once PG-Strom decides bulkload is more reasonable, the child node sets up its results in a DMA buffer, then passes this chunk to the parent node. This avoids wasting CPU cycles on unnecessary data copies between normal memory and DMA buffers, so it is the fastest path when available.
pg_strom.bulkload_density
type: float
default: 0.50
It sets a threshold for using bulkload mode to exchange a bunch of data with the underlying node of the plan tree. If the child node may increase / reduce the number of tuples beyond this threshold, bulkload mode is not applied even where possible.
pg_strom.max_async_tasks
type: integer
default: 32
It sets the maximum number of asynchronous tasks, counting both launched and pending ones. When GPU kernel execution or GPU kernel build (the usual case) takes time, PG-Strom advances child-node scanning and loading into DMA buffers. This parameter works as the threshold for this prefetching.
pg_strom.max_workers
type: integer
default: ---
It sets the maximum number of background worker processes, if and when PG-Strom also takes the CPU-parallel execution path.
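These options can be adjusted per session with SET. For example, to compare the plan with and without GPU off-loading (using the t0/t1 tables from the earlier example):

```sql
-- Disable PG-Strom for this session; the built-in plan is chosen
SET pg_strom.enabled = off;
EXPLAIN SELECT cat, avg(x) FROM t0 NATURAL JOIN t1 GROUP BY cat;

-- Re-enable it; GpuJoin / GpuPreAgg may appear if the cost estimate favors them
SET pg_strom.enabled = on;
EXPLAIN SELECT cat, avg(x) FROM t0 NATURAL JOIN t1 GROUP BY cat;
```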

Sizing Guideline

Under construction


Background

Parallel Approach

We have a few ways to enhance the performance of PostgreSQL:

  • Homogeneous scale-up
  • Heterogeneous scale-up
  • Scale-out

PG-Strom takes the heterogeneous scale-up approach: it utilizes the hardware most advantageous for the characteristics of the workload. In other words, it assigns simple but massive numerical calculations, which are inefficient to run on CPU cores, to GPU devices.

[Figure: parallel approaches compared]

Future Development

PostgreSQL core

PG-Strom

Long term development
