There is a high-level explanation of the feature on Thomas Munro's blog. This wiki page is to track outstanding issues and potential improvements. See also Hash_Join for hash join ideas that are not related to parallelism. (Perhaps these should be merged?)
Known estimation problems:
- ExecChooseHashTableSize estimates the size of the hash table with ntuples * tuple_size, but at execution time Parallel Hash will allocate the memory in 32KB chunks, creating a bit of extra overhead. This means that we can finish up underestimating the number of batches required even with the most ideal statistics. In other words, we'll finish up having to increase the number of batches at execution time. That's expensive. We should teach ExecChooseHashTableSize about the maximum possible extra overhead, so that we can avoid gratuitous underestimation. Some form of chunk-aware estimation was actually included in the PHJ patchset up until v20 but was dropped in v21 in an effort to bring the patchset size down and avoid touching non-parallel hash code. We should probably fix that as a follow-on patch, because otherwise users might experience easily avoidable nbatch expansion. (Note that the non-parallel path also allocates memory in chunks, but it doesn't count chunk headers or wasted space at execution time so that's invisible memory and should match the planner's estimation given perfect statistics.)
- Parallel Hash doesn't consider the size of the bucket array when checking if space_allowed is exceeded. It should! That's probably providing some mitigation for the above problem. Easily fixed.
- ExecChooseHashTableSize estimates the bucket array size using sizeof(HashJoinTuple), but it should be sizeof(dsa_pointer_atomic) for Parallel Hash. That sounds trivial to fix, except that we have to consider systems that have emulated atomics. Their size could be a lot larger (I am reliably informed that HPPA = 20, which is the size of 5 pointers), and then some of the arithmetic in there about the ratio of pointers to tuples would be wrong and needs to be reconsidered. Doesn't matter until the previous point is fixed.
Known executor problems:
- ExecParallelHashTuplePrealloc doesn't deal with oversized tuples correctly. If larger than HASH_CHUNK_THRESHOLD, it should preallocate HASH_CHUNK_HEADER_SIZE + tuple size, but the existing preallocation size should be preserved (because that's how it'll play out when reloaded).
Other problems not directly in Parallel Hash code:
- It's a bit inconsistent that Parallel Hash allocates chunks of exactly HASH_CHUNK_SIZE, but non-parallel Hash allocates chunks of HASH_CHUNK_SIZE + HASH_CHUNK_HEADER_SIZE. The latter actually causes unmeasured memory waste on some platforms, so let's fix that. Thread Thread
Observations that might be worth thinking about:
- Non-parallel Hash Join doesn't ever write outer batch 0 out to disk. Parallel Hash Join does, if nbatch > 0. Perhaps the planner should take that into consideration, ie give the non-parallel version a bit of a discount in the cost.
- Multi-batch PHJ assumes that workers would be better off working on independent batches when possible (though when it runs out they start helping each other). While that reduces contention on shared memory, it doesn't consider IO concurrency: it might be that your storage would do better at streaming in just 2 partitions at a time (2 sequential read streams), and having 4 workers on each, rather than having 8 workers trying to stream in 8 partitions and causing more random I/O.
Possible future features:
- A NUMA-aware PHJ could use some hash bits to choose a NUMA-node and mark chunks as only holding tuples for a given NUMA-node; each worker could load tuples into chunks for the appropriate node, and then cross-node traffic could be avoided at insertion time by preferring to insert tuples from your NUMA node. Or something.
- Parallel Hash could be used to implement right/outer joins (here is a flawed patch implementing that), because all participants working on a batch see the same matched bits in tuple headers. They can all write to the matched bits without synchronisation because the transition is always 0->1 (it doesn't matter whose update wins and it's only a single bit in a whole machine word that is changing, the entire rest of the header + tuple is immutable during probing) and then nobody ever reads the value until after probing. The problem to be solved is this: it would not be deadlock-free to wait for all participants to finish probing, because they may have emitted a tuple and may now be running some other node (including potentially another PHJ that is waiting for you). The solution is probably something like: let unmatched tuple scans be done only by one participant. Other participants would simply move on to a different batch, if possible, or be finished the join. That is not great because in a single batch join you get only one backend doing the whole unmatched tuple scan, but it may be better than not allowing parallelism at all for right/outer hash joins.
- EXPLAIN ANALYZE doesn't really tell you how much memory was used at the same time, it tells you the largest hash table that was ever loaded. Description of the behaviour and some alternatives
- JIT hash loop?
A list of assorted possible micro-optimisations relating to Parallel Hash, and underlying DSA + DSM:
- New operations pg_atomic_init_u64_array() and pg_atomic_zero_u64_array() could be implemented as memset() on most systems; that'd allow the hash buckets to be initialised with streaming/SIMD/etc instructions on sufficiently smart libc (Thomas has a patch for this)
- parallel memset(): do the above in parallel, using chunks of (say) 2MB; this is known to be beneficial on systems with more memory bandwidth than one core can saturate (probably not worth bothering with)
- if you use dsa_allocate_zero(), and dsa.c knows that it's just mapped in new memory, it can skip the memset(0)! If you know that you can do that because you're on a platform where pg_atomic_init_u64() just writes zero, you can skip the hash bucket initialization completely almost every time (since interesting sized hash joins usually have a new mapping just for the buckets); unfortunately, skipping the memset causes FreeBSD not to promote to super-pages immediately
- use huge/super/large pages; this is reported to be beneficial in several academic papers; this already works for FreeBSD but it'd be nice to change DSA_INITIAL_SEGMENT_SIZE to 2MB so that we don't create the first few MB of tuples in small pages; Linux needs code changes (can't use plain shm_open() in current kernels), Windows needs explicit request
- O(1) DSM handles: the search-all-the-slots code is only really necessary for sysv shm where keys are kernel-global; for other impls we could just encode the slot number (+ generation) into the handle!
- memory prefetch -- several academic researchers describe prefetch-based inmprovements (probably means batch access to tuples, or at least next tuple, and hashing the keys in advance, something like)
- hardware DSA: DSA is really a kind of software MMU/address translation; we could reserve a whole address space so that we know we can safely map DSM segments into the same address in each backend, and trap segfaults to remap as appropriate; in such a build, we'd be able to us raw pointers for dsa_pointer!
- tuples are copied while loading them into the hash table during conversion to minimal tuple; same in non-parallel hash join; that's fixable, but needs some refactoring of slot interfaces; we really just want to copy it into place in the hash table
- recycling recently freed DSM segments: allocating new DSM segments is really expensive, especially on Linux where we posix_fallocate() them... for DSA areas, these are somewhat more fungible (because they're in a small range of sizes)
- allocating whole new DSM segments is done while holding a lock on the DSA area; that's bad on Linux where the operation is slow