Parallel Hash

There is a high-level explanation of the feature on Thomas Munro's blog. This wiki page tracks outstanding issues and potential improvements.

Known estimation problems, to be resolved in PG11 cycle:

  • try_partial_hashjoin_path() passes constant true to initial_cost_hashjoin() for parallel_hash. That is a bug; a patch has been posted.
  • ExecChooseHashTableSize estimates the size of the hash table as ntuples * tuple_size, but at execution time Parallel Hash allocates the memory in 32KB chunks, which adds a bit of overhead. This means we can end up underestimating the number of batches required even with ideal statistics, and therefore having to increase the number of batches at execution time, which is expensive. We should teach ExecChooseHashTableSize about the maximum possible extra overhead, so that we avoid gratuitous underestimation (see the sketch after this list). Some form of chunk-aware estimation was included in the PHJ patchset up until v20 but was dropped in v21 in an effort to reduce the patchset size and avoid touching non-parallel hash code. We should probably fix that as a follow-on patch, because otherwise users might experience easily avoidable nbatch expansion. (Note that the non-parallel path also allocates memory in chunks, but it doesn't count chunk headers or wasted space at execution time, so that is invisible memory and should match the planner's estimate given perfect statistics.)
  • Parallel Hash doesn't consider the size of the bucket array when checking if space_allowed is exceeded. It should! That's probably providing some mitigation for the above problem. Easily fixed.
  • ExecChooseHashTableSize estimates the bucket array size using sizeof(HashJoinTuple), but it should be sizeof(dsa_pointer_atomic) for Parallel Hash. That sounds trivial to fix, except that we have to consider systems with emulated atomics. Their size can be a lot larger (I am reliably informed that HPPA = 20 bytes, the size of 5 pointers), and then some of the arithmetic in there about the ratio of pointers to tuples would be wrong and would need to be reconsidered. This doesn't matter until the previous point is fixed.
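
A minimal sketch of the kind of chunk-aware estimate described above, assuming the chunk layout used by Parallel Hash (the helper name is made up for illustration; the real change would be folded into ExecChooseHashTableSize):

 /*
  * Hypothetical helper: estimate the in-memory size of the inner relation
  * the way Parallel Hash will actually allocate it, i.e. in HASH_CHUNK_SIZE
  * chunks (header included), so that chunk headers and the unusable space
  * at the end of each chunk are accounted for.  Oversized tuples, which get
  * a dedicated chunk each, are ignored here for simplicity.
  */
 #include "postgres.h"
 #include "executor/hashjoin.h"
 
 #include <math.h>
 
 static size_t
 estimate_inner_rel_bytes_chunked(double ntuples, size_t tupsize)
 {
 	size_t		usable_per_chunk = HASH_CHUNK_SIZE - HASH_CHUNK_HEADER_SIZE;
 	double		tuples_per_chunk = floor((double) usable_per_chunk / tupsize);
 	double		nchunks = ceil(ntuples / tuples_per_chunk);
 
 	/* Every chunk occupies the full HASH_CHUNK_SIZE in memory. */
 	return (size_t) (nchunks * HASH_CHUNK_SIZE);
 }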

Known executor problems, to be resolved in PG11 cycle:

  • ExecParallelHashTuplePrealloc doesn't deal with oversized tuples correctly. If a tuple is larger than HASH_CHUNK_THRESHOLD, it should preallocate HASH_CHUNK_HEADER_SIZE + tuple size, but the existing preallocation size should be preserved (because that is how it will play out when the batch is reloaded); see the sketch below.
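
A minimal sketch of the sizing rule described above (the helper name and the current_prealloc parameter are assumptions for illustration; the real logic lives inside ExecParallelHashTuplePrealloc):

 /*
  * Hypothetical helper showing the intended preallocation size for a tuple
  * of 'size' bytes.  Oversized tuples (> HASH_CHUNK_THRESHOLD) are given a
  * dedicated chunk when the batch is reloaded, so reserve a chunk header
  * plus the whole tuple for them, while never shrinking whatever amount
  * has already been preallocated.
  */
 #include "postgres.h"
 #include "executor/hashjoin.h"
 
 static size_t
 phj_prealloc_size(size_t size, size_t current_prealloc)
 {
 	size_t		want;
 
 	if (size > HASH_CHUNK_THRESHOLD)
 		want = HASH_CHUNK_HEADER_SIZE + size;	/* dedicated oversized chunk */
 	else
 		want = HASH_CHUNK_SIZE;					/* a normal chunk's worth */
 
 	return Max(want, current_prealloc);			/* preserve existing prealloc */
 }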

Other problems not directly in Parallel Hash code:

  • It's a bit inconsistent that Parallel Hash allocates chunks of exactly HASH_CHUNK_SIZE, while non-parallel Hash allocates chunks of HASH_CHUNK_SIZE + HASH_CHUNK_HEADER_SIZE. The latter actually causes unmeasured memory waste on some platforms, so let's fix that. (See the mailing list thread.)

Observations that might be worth thinking about:

  • Non-parallel Hash Join never writes outer batch 0 out to disk. Parallel Hash Join does, if nbatch > 1. Perhaps the planner should take that into consideration, i.e. give the non-parallel version a bit of a discount in the cost (see the sketch below).
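
A hedged sketch of the kind of discount that could be applied (the function name and formula are assumptions, not existing costsize.c code; the point is simply that the non-parallel case skips the write and re-read of outer batch 0's pages):

 /*
  * Illustration only: estimate how many outer-relation pages incur batch
  * file I/O (one write plus one read each).  Non-parallel hash join keeps
  * outer batch 0 in memory, so roughly (nbatch - 1) / nbatch of the outer
  * pages are spooled; Parallel Hash spools every outer batch when nbatch > 1.
  */
 #include "postgres.h"
 
 static double
 outer_batch_io_pages(double outer_pages, int nbatch, bool parallel_hash)
 {
 	if (nbatch <= 1)
 		return 0.0;				/* no batching, so no outer batch files */
 
 	if (parallel_hash)
 		return 2.0 * outer_pages;	/* every page written once and read once */
 
 	/* Non-parallel: batch 0's share of the pages never touches disk. */
 	return 2.0 * outer_pages * ((double) (nbatch - 1) / nbatch);
 }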

Possible future features:

  • Parallel Hash could be used to implement right/outer joins, because all participants working on a batch see the same matched bits in tuple headers. They can all write to the matched bits without synchronisation because the transition is always 0->1: it doesn't matter whose update wins, only a single bit in a whole machine word changes, the rest of the header + tuple is immutable during probing, and nobody reads the value until probing is finished. The problem to be solved is this: it would not be deadlock-free to wait for all participants to finish probing, because they may have emitted a tuple and may now be running some other node (including, potentially, another PHJ that is waiting for you). The solution is probably something like: let the unmatched tuple scan be done by only one participant. Other participants would simply move on to a different batch, if possible, or finish the join. That is not great, because in a single-batch join you get only one backend doing the whole unmatched tuple scan, but it may be better than not allowing parallelism at all for right/outer hash joins.
  • There should probably be a way to initialise arrays of atomics that just does memset on architectures with native atomics but falls back to a per-element loop so that emulated atomics still work (see the sketch below).
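
A minimal sketch of such an initialiser for pg_atomic_uint32 (the function name is made up; treating a zero-filled native atomic as validly initialised is an assumption that would need to be checked for each atomics implementation):

 /*
  * Hypothetical bulk initialiser for an array of pg_atomic_uint32.  With
  * native atomics a zeroed element is just a zeroed integer, so memset is
  * enough for value == 0; with emulated atomics (which embed a spinlock)
  * every element must be initialised individually.
  */
 #include "postgres.h"
 #include "port/atomics.h"
 
 static void
 pg_atomic_init_u32_array(pg_atomic_uint32 *array, size_t nelems, uint32 value)
 {
 	size_t		i;
 
 #ifndef PG_HAVE_ATOMIC_U32_SIMULATION
 	if (value == 0)
 	{
 		memset(array, 0, nelems * sizeof(pg_atomic_uint32));
 		return;
 	}
 #endif
 
 	for (i = 0; i < nelems; i++)
 		pg_atomic_init_u32(&array[i], value);
 }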