Here is a page to track ideas and ongoing work for hash joins. See also the Parallel_Hash page for parallelism-specific ideas.
Some active discussions with patches:
- Hash joins can, in extreme cases, use more memory that they are allowed to, if hash-based partitioning fails to divide the inner relation up enough to fit in the memory budget. Thread discussing strategies for respecting work_mem more strictly.
- Hash joins can decide to use a huge number of partitions in order to fit into work_mem, but the partition book-keeping is unmetered so can be way more than work_mem. Thread discussing that, somewhat intertwined with the above discussion. A related report.
- Our 'extreme skew detector' is not sensitive enough. Instead of waiting until 100% of a split partition goes to one child partition and 0% to another, perhaps we should have a threshold like 95%, otherwise you can easily come up with a distribution that never triggers the extreme skew detector and keeps repartitioning like crazy. Patch (intertwined with the above discussions)
- Hash Join and Hash nodes seem to be very tightly coupled. Perhaps we should just merge them into one?
- Tuples are copied to intermediate palloc'd memory while loading them into the hash table during conversion to minimal tuple; we really just want to copy them directly into place in the hash table having allocated the memory! (The same thing happens for various other code paths that store minimal tuples) Thread
Observations without patches:
- nodeHashjoin.c's switch case HJ_SCAN_BUCKET calls ExecScanHashBucket() which contains ExecQualAndReset(hash clauses), and then after that returns it immediately does ExecQual(joinqual). So we pay the cost of evaluator setup twice, instead of doing both together. The reason we have to do this that EXPLAIN needs to count the two kinds of filtering separately, and we don't have a mechanism to build a single expression which ANDs together two filters and but maintains two separate counters. This might apply to other join types too. (Observation made by Andres Freund in private discussion with Thomas Munro, who added this note here.)
- Memory is allocated in chunks that waste a ton of "invisible" memory on some OSes. Thread.
- Andres: There's too many indirections for hashtable lookups. For a successful hashtable lookup we need the following pointer dereferences: 1) HashJoinState->hj_HashTable (and a bunch of related state), 2) HashJoinTable->unshared 3) HashJoinTable->unshared[bucket] (likely uncached), 4) HashJoinTuple->hashvalue (likely uncached)
- We should work on at least removing the indirection for ->hashvalue, by moving the hashvalue into HashJoinTable->unshared[bucket], rather than that being just a pointer. It might be worthwhile to also store the ->next pointer inline.
- It could also be worthwhile to just embed the HashJoinTableData into the HashJoinState. As far as I can tell there's really no need for that to be a separate pointer indirection.
- Andres: The whole notion of HashJoinTuple is suspect - needing to perform the offsetting math in various places costs cycles itself, but also makes it harder to avoid unnecessary allocations in other places.
- Andres: The bucket scanning API (ExecScanHashBucket()) leads to unnecessary unpredictable branches, in a path that already suffers heavily from stalls.
- In a query where every lookup is a match (very commmon), the branch that decides whether to continue lookups in a previous bucket, or perform a fresh lookup, reaches both paths at roughly 50% - the worst case for a branch predictor. Instead there should be separate function for continuing a bucket search, which should be called by either branching earlier in the caller (will be very predictable), or even better by having a separate hj_JoinState value for continuing a bucket search.
- Currently the same "if" determines whether there is a match in a fresh lookup (common), and whether there's further tuples in a bucket (uncommon). That leads to inefficient branch prediction. Should be split.
- It might be worthwhile to move the check whether to search the skew hashtable into either a) the ExecHashJoinImpl() state-machine, by having a separate state for searching buckets in the skew hashtable b) just moving the check into ExecHashJoinImpl(), which'd at least allow to reuse the result of previous checks.
- There are arbitrary limits on the number of buckets you can have that come from using int in the logic and from respecting a 1GB memory allocation limit. If you run very large hash joins, you finish up hitting the cap of 64 (or is it 128?) million buckets and then the load factor goes beyond 1 due to this. It would be nice to fix that, as memories and data sets increase in size.
- Should we use 64 bit hashes? When the above problem is fixed, hashing many billions of rows on very large memory systems will run out of hash bits (and this may already be a problem even with smaller memory systems using a lot of batches?)
- Our load factor accounting should ideally be based on the number of unique keys, not the number of tuples. If there are a duplicate keys in the inner relation, we finish up with a too-sparse hash table.
- The ExecHashJoinImpl() specialisation trick seems to work pretty well. Should we create more permutations, so that we can compile in/out various optional things without runtime branching, such as skew hash table, 64 bit hashes, inlined hash functions for common data types, ... ?
- In create_hashjoin_plan(), we don't consider the skew optimisation for multi-column joins, because at the time we had only single column stats; now that we have multivariate stats, we could probably use their MCV list when available.
- We should get rid of the use of "long" from all the hash join code. It's a terrible type, and has a different size on Windows.
Some work on the horizon:
- Andres Freund is working on transforming execution plans into a "linear programs" of opcodes (and eventually probably also machine code via LLVM), like SQLite and System R. This means we'll need to figure out how to break our hash join algorithm into steps that can be expressed that way.
- Even before we get that far, there may be micro-optimisation opportunities where we JIT some of the hashing work? Andres: Indeed, the evaluation of the hash keys is fairly expensive, especially with multiple columns. We should add an expression step for mixing hash values, and use that to build an expression that does not just evaluate the key, but immediately hashes it.
Some more speculative ideas/work:
- Perhaps different hash join nodes could share a hash table, for the benefit of partition-wise joins. Thread.
- It's possible to make hash joins go faster by peeking ahead at the next tuple to be probed, and prefetching the right memory cache line. Experimental hack thread with links to academic papers. To do this well might require executor changes to that we can get a batch of tuples at the same time, and process them without escaping the current node.