Hash proposal

From PostgreSQL wiki
Jump to navigationJump to search

Abstract

Hash index is well used index in equality queries. But it perform no better than B-tree in PostgreSQL. The PostgreSQL community has discuss the problem before

 http://archives.postgresql.org/pgsql-hackers/2007-09/msg00051.php

This project tries to deal with this problem by storing the hash value instead of real key. It's one of schemes to improve hash index performance listed in PostgreSQL's TODO list. This work will deal with this problem by storing the hash value instead of the real key. Then I proposed an idea to deal with the dark side of the scheme. I'll benchmark both of them against original implementation and B-Tree to show the improvement.

Detailed Description

  • PART A. Store the hash value instead of the real key

It has been discussed here: http://archives.postgresql.org/pgsql-hackers/2004-06/msg00168.php Neil Conway had posted a patch doing this with an old version of PostgreSQL.

 http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php

But it makes it significantly more expensive to skip over entries that have the same hash code but aren't really equal to the key being sought (since you need to an additional disk access to visit the heap to find out they aren't equal).

  • PART B. Variety of the first scheme

We still store hash value and tuple pointer in bucket. But when we insert a tuple, we can determine whether or not store the real key. If we encounter with the same hash value when inserting, we store both of the hash value and the real key. That is to say, when there's many tuples with the same hash code, the first one can only store the hash code but the other tuples will store both hash code and real value.

  • Part C. Benchmark and maybe look at other ways to improve performance

Because of the application area of hash index, the benchmark should satisfy these conditions:

 - Workload - All of the queries should be equality queries.
 - Data - The key should be almost unique (i.e. most of the equality queries are point queries instead of multipoint queries). 
          Multipoint queries can only be efficient for cluster file.
 

I'll test the first scheme version and the 2nd scheme version against the original version and the B-Tree index to show the improvement.

And we can also try to look at other ways to improve performance when benchmarking and discuss with other developers to make future improvement.

  • Part D. Alternative Plan

If the schemes don't work well, I'll have to start an alternative plan. At least, it can work well with large index field in my opinion. So I'll add a clause in Alter Index so it will be optional when users encounter with the special conditions sometimes. So even make worst of plan, it's still can be used in some situations.