Gsoc08-tss
Project summary
Implement gathering statistics from tsvector columns and a selectivity function for the @@ operator that will use these statistics. You can get the latest version by doing a
git clone git://git.postgresql.org/git/~wulczer/gsoc08-tss.git
or browsing to http://git.postgresql.org/?p=~wulczer/gsoc08-tss.git;a=summary
Detailed info
The basic goal is to write a custom typanalyze function for the tsvector type and a new selectivity function that will replace contsel for @@ selectivity estimation. Most frequently appearing lexemes will be stored in pg_statistic and the selectivity function will make decisions based on that data. Both functions will be contained in a contrib module.
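For the curious, here is a rough sketch of what plugging into ANALYZE looks like. The function and variable names below are placeholders, and the actual contrib code may well end up looking different:

    #include "postgres.h"
    #include "fmgr.h"
    #include "commands/vacuum.h"

    /* placeholder names; the real module's functions may be called differently */
    static void compute_tsvector_stats(VacAttrStatsP stats,
                                       AnalyzeAttrFetchFunc fetchfunc,
                                       int samplerows, double totalrows);

    PG_FUNCTION_INFO_V1(tsvector_typanalyze);

    Datum
    tsvector_typanalyze(PG_FUNCTION_ARGS)
    {
        VacAttrStats *stats = (VacAttrStats *) PG_GETARG_POINTER(0);

        /* tell ANALYZE which routine to call on the sampled rows... */
        stats->compute_stats = compute_tsvector_stats;
        /* ...and how many rows we would like in the sample */
        stats->minrows = 300 * stats->attr->attstattarget;

        PG_RETURN_BOOL(true);
    }

    static void
    compute_tsvector_stats(VacAttrStatsP stats, AnalyzeAttrFetchFunc fetchfunc,
                           int samplerows, double totalrows)
    {
        /* walk the sample, count lexemes, fill the pg_statistic slots here */
    }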
Estimated timeline
- dig deeper into the source, write a typanalyze function for the tsvector type to compute most frequent lexemes and store them in pg_statistic. (4 weeks)
- write a selectivity function for the @@ operator that looks at the collected statistics and returns a selectivity estimate (2 weeks)
- fiddle with the algorithms, explore the possibilities of using previously computed statistics to get better estimates (1-2 weeks)
- write tests to approximate the overhead introduced in the planner (especially in complex tsqueries), refactor the code if necessary (1-2 weeks)
- wrap things up as a proper contrib module, add documentation for end users + safety buffer in case of unexpected delays (2 weeks)
Possible improvements
If my work progresses ahead of schedule, there could be room for employing some smarter algorithms to estimate a query's selectivity. This task appears to be related to some natural language processing problems, and maybe there are some clever solutions that could be looked at.
Additional resources
A somewhat detailed design proposal is available at -hackers (http://archives.postgresql.org/pgsql-hackers/2008-03/msg00702.php)
You can also look at the original proposal submitted to Google (http://code.google.com/soc/2008/postgres/appinfo.html?csaid=6EACD033B4F97D56)
How it all worked out
The first step was to get a rough understanding of what the statistics code does. I had some idea about how statistics are stored and I knew you could easily plug in your own stats gathering function, so I jumped straight into it and tried to store, hmmm, let's say, a one-element array containing the word "foo" with frequency 1.0.
A couple hundred segmentation violations later I started reading the documentation for gdb-mode.
After some serious single-stepping (and lemme tell you, Emacs *is* the best frontend for gdb) I found out that the function that creates the Postgres representation of an SQL array needs to be passed the datatype of its elements. OK, fair enough. But ah! it defaults to the datatype of the column for which the stats are gathered! So I've been feeding it C strings and it's been treating them as tsvectors! Hmm, but wait, the assumption about the datatype is hardwired in the code. If it's tsvectors you ANALYZE, it's tsvectors you'll store. Maybe this is a deficiency... So, mail to -hackers. Check everything twice, press "Send". Refresh, refresh, refresh. Aha! However improbable that sounds, I was right. So let's code up a patch for that.
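The point of that patch was to let the stats-gathering code declare what type its stavalues elements really are. A hedged sketch of what filling a statistics slot could then look like - the helper name, the choice of slot 0 and the slot kind are purely illustrative, and the statyp* field names assume the form the patch eventually took:

    #include "postgres.h"
    #include "catalog/pg_operator.h"
    #include "catalog/pg_statistic.h"
    #include "catalog/pg_type.h"
    #include "commands/vacuum.h"

    /*
     * Hypothetical helper: stash the most common lexemes (already converted
     * to text datums) and their frequencies into statistics slot 0.  The
     * interesting part is the statyp* fields: they declare the element type
     * of the stavalues array as text, even though the column is tsvector.
     */
    static void
    store_lexeme_slot(VacAttrStats *stats, Datum *lexeme_texts,
                      float4 *frequencies, int num_lexemes)
    {
        stats->stakind[0] = STATISTIC_KIND_MCELEM;  /* "most common elements" */
        stats->staop[0] = TextEqualOperator;
        stats->stavalues[0] = lexeme_texts;
        stats->numvalues[0] = num_lexemes;
        stats->stanumbers[0] = frequencies;
        stats->numnumbers[0] = num_lexemes;
        /* element type of stavalues: text, not the column's own type */
        stats->statypid[0] = TEXTOID;
        stats->statyplen[0] = -1;                   /* varlena */
        stats->statypbyval[0] = false;
        stats->statypalign[0] = 'i';
    }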
But wait, they use CVS. I need to be *online* to get a *diff*? Err, rsync the repository? OK, there's something called cvsup, without which CVS is just an unusable piece of cruft that ought to be put away (with all honours) in a cabinet and dusted twice a year so onlookers won't cough to death.
Google for cvsup. Hmm, "if you're lucky, your OS vendor will have cvsup packaged for you". Yeah, right, Slackware 11.1 packaging cvsup. Or a Modula-3 compiler. This definitely won't be easy. I seem to recall, however, that there's a git mirror of Postgres' CVS. And Heikki (my mentor) mentioned he wanted to try developing with git. And, how convenient, Slackware packages git.
Browse to http://git.postgresql.org/. Mail the administrator, git-pull the source. Hooray, it worked, and now I can pick up chicks by typing git-foo in public places. I even got an account so I can git-pull what I've done. Sexy.
Now, for the patch. Wait, they use tabs? Oh, the abomination! Hm, they also have a sample .emacs snippet that should set things up nicely. Let's try it. Well, it *does* look quite nice, variables lining up. Maybe tabs are not that evil. I mean, they *are* evil and we all know it, but you can live with them and they won't eat your brain out. If you add a nifty hook to your c-mode-hook, anyway.
So finally, some coding. Design, design, design - the Postgres mantra (at least that's the impression I got). How nice, you email your thoughts and people tell you you're wrong (and they're right, too). Even better, they tell you how you should've done it. Saves time for everyone. So after a short time it's there. And lo! after a few days - it's committed.
Oh my, what's that? A letter stating that you can go to Canada to PGCon and they'll pay for it! 15 seconds of thought - press "Reply". "Hi, I'd like to go". "Well, it's a long way, and you need to avoid the USA, 'cause they got visas for you second world citizens". "Yeah, but the flight's cheap and I'm staying away from the States". "OK, you're on".
And there I am, in Canada, listening to all these hackers that to date were more like ghost voices coming from the mailing list archives. The schedule is tight and I have my laptop, so it's daytime lecturing and nighttime hacking. Listen, listen, listen, eat, listen, listen, pub, hack, hack, hack, wtf?, hack, hack, write elisp, hack, hack, !> Connection terminated, hack, curse, hack, sleep.
On the second day I brought my laptop with me for the tutorial (the day before I was one of three people who hadn't brought their computers. I felt strange). Suddenly, an email from Robert Treat. "Hi, have you sent me the slides for your lightning talk?". OMG, what slides? I thought it was just five minutes? "Well, not yet, I'll have them ready by tomorrow". Looks like another long night...
And then the lightning talk itself. I actually thought the big timer they were holding in front of the speaker showed *remaining* time, not *elapsed*. A quick glance got me panicking and I finished one minute early. Oh well, better than going overtime.
After coming back to Warsaw (via Amsterdam, so there was quite some sightseeing to be done) it was time to continue with the project. So, since I can now store the datatype I want, let's see if I can extract the most common lexemes from the tsvector sample. Well, it looks like a matter of a simple loop. Code up the patch and send it to -hackers. Well, compile diffutils first and wonder how people can prefer context diffs over unified. Oh well, not my thing to judge.
Wait, looping over a couple hundred tsvectors is going to be slow. Worse, determining the most common lexemes by brute force will be O(N^2) and that's *way* too slow. Hash tables to the rescue. Now, I've always used these by doing foo['sth'] = 5. Doing it in *C* is something way different. But it can't be that hard, can it?
After reading up on the documentation this time, and spending some long hours poking at random memory addresses, I finally get it. Oooh, so the hashed struct *contains* the hash key? Now that's what I call counterintuitive (yes, I know it's documented. It just feels weird).
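For reference, the pattern looks roughly like this with PostgreSQL's dynahash (hash_create/hash_search). The entry struct is made up for the example, and real lexemes aren't NUL-terminated, so the fixed-size string key is a simplification:

    #include "postgres.h"
    #include "utils/hsearch.h"

    /*
     * With dynahash the key has to be the first field of the entry -- the
     * table stores whole entries, key included, which is the part that
     * felt backwards at first.
     */
    typedef struct LexemeEntry
    {
        char        key[NAMEDATALEN];   /* the lexeme itself */
        int         occurrences;        /* how many sampled tsvectors have it */
    } LexemeEntry;

    static HTAB *
    create_lexeme_table(void)
    {
        HASHCTL     ctl;

        MemSet(&ctl, 0, sizeof(ctl));
        ctl.keysize = NAMEDATALEN;
        ctl.entrysize = sizeof(LexemeEntry);

        /* with just HASH_ELEM, keys are hashed as C strings by default */
        return hash_create("lexeme counts", 1024, &ctl, HASH_ELEM);
    }

    static void
    count_lexeme(HTAB *table, const char *lexeme)
    {
        bool        found;
        LexemeEntry *entry;

        entry = (LexemeEntry *) hash_search(table, lexeme, HASH_ENTER, &found);
        if (!found)
            entry->occurrences = 0;     /* new entry: key is copied in, counter isn't */
        entry->occurrences++;
    }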
OK, now everything's nice and hashed. But... *hashing* all those elements won't do any good either. You just don't want to keep all this stuff in memory, you need to do something smarter. Google for "most common elements from stream". What's that, a paper from a database conference? Well, Lossy Counting. Looks like a slick algorithm, easily understandable, easily implementable, easily proven correct. Maybe it's worth giving it a shot.
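Here's a toy, standalone rendition of the idea - nothing like the actual contrib code, which works on lexemes and keeps its counters in a hash table - just to show the bucket-and-prune mechanics:

    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    /* one tracked item: observed count f and maximum possible undercount delta */
    typedef struct
    {
        const char *item;
        int         count;
        int         delta;
    } Counter;

    #define MAX_TRACKED 1024
    static Counter tracked[MAX_TRACKED];
    static int  ntracked = 0;

    static Counter *
    find(const char *item)
    {
        for (int i = 0; i < ntracked; i++)
            if (strcmp(tracked[i].item, item) == 0)
                return &tracked[i];
        return NULL;
    }

    /* Lossy Counting over a stream of n items with error bound epsilon */
    static void
    lossy_count(const char **stream, int n, double epsilon)
    {
        int         w = (int) ceil(1.0 / epsilon);  /* bucket width */
        int         bucket = 1;

        for (int i = 0; i < n; i++)
        {
            Counter    *c = find(stream[i]);

            if (c)
                c->count++;
            else if (ntracked < MAX_TRACKED)
            {
                tracked[ntracked].item = stream[i];
                tracked[ntracked].count = 1;
                tracked[ntracked].delta = bucket - 1;
                ntracked++;
            }

            if ((i + 1) % w == 0)
            {
                /* bucket boundary: drop entries that cannot be frequent */
                int         kept = 0;

                for (int k = 0; k < ntracked; k++)
                    if (tracked[k].count + tracked[k].delta > bucket)
                        tracked[kept++] = tracked[k];
                ntracked = kept;
                bucket++;
            }
        }
    }

    int
    main(void)
    {
        const char *stream[] = {"foo", "bar", "foo", "baz", "foo", "bar",
                                "foo", "qux", "foo", "foo", "bar", "foo"};
        int         n = sizeof(stream) / sizeof(stream[0]);

        lossy_count(stream, n, 0.25);   /* bucket width of 4 */

        for (int i = 0; i < ntracked; i++)
            printf("%s: seen at least %d times\n",
                   tracked[i].item, tracked[i].count);
        return 0;
    }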
Well whaddya know. These algorithms you read about in papers really work. And they get you the right results, fast. After some more coding I use my newly-acquired git proficiency and summon forth another patch. Learn that Emacs has diff-context->unified, too.
And an amazing thing happens, it gets committed during the next commit fest! A brand new typanalyze function for tsvectors. Now to use that info, for the glory of the planner.
That one was actually quite easy. Go through the tsquery bottom-up, apply common sense about multiplying and negating probabilities. If only I didn't have to scan the whole most common lexemes array for each and every lexeme... Binary search to the rescue. Aw, darn, lexemes are not null-terminated C strings and the statistics store text datums (because storing C strings seemed like bad style). So I need to qsort() it one way, and bsearch() through it another. Ugly. And then each time a query is planned I need to sort the most common lexemes, so that recursing down a tsquery I can use binary search. It's slow and it smells of bad design.
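Roughly, the combining step looks like this. The pointer-tree node below is invented for the example (a real tsquery is a flat array of QueryItems), and the lookup function is a stub where the binary search over the most common lexemes would go:

    #include <stdio.h>

    typedef enum { NODE_LEXEME, NODE_NOT, NODE_AND, NODE_OR } NodeKind;

    typedef struct QNode
    {
        NodeKind    kind;
        const char *lexeme;             /* only for NODE_LEXEME */
        struct QNode *left;
        struct QNode *right;            /* unused for NODE_NOT */
    } QNode;

    #define DEFAULT_FREQ 0.001          /* guess for lexemes not in the stats */

    /* stub: this is where the lookup in the most common lexemes array goes */
    static double
    lexeme_frequency(const char *lexeme)
    {
        (void) lexeme;                  /* the stub ignores its argument */
        return DEFAULT_FREQ;
    }

    static double
    tsquery_selectivity(const QNode *node)
    {
        double      s1, s2;

        switch (node->kind)
        {
            case NODE_LEXEME:
                return lexeme_frequency(node->lexeme);
            case NODE_NOT:
                return 1.0 - tsquery_selectivity(node->left);
            case NODE_AND:
                s1 = tsquery_selectivity(node->left);
                s2 = tsquery_selectivity(node->right);
                return s1 * s2;         /* assume independence */
            case NODE_OR:
                s1 = tsquery_selectivity(node->left);
                s2 = tsquery_selectivity(node->right);
                return s1 + s2 - s1 * s2;
        }
        return DEFAULT_FREQ;
    }

    int
    main(void)
    {
        QNode       foo = {NODE_LEXEME, "foo", NULL, NULL};
        QNode       bar = {NODE_LEXEME, "bar", NULL, NULL};
        QNode       notbar = {NODE_NOT, NULL, &bar, NULL};
        QNode       query = {NODE_AND, NULL, &foo, &notbar};

        printf("estimated selectivity of 'foo & !bar': %g\n",
               tsquery_selectivity(&query));
        return 0;
    }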
Final enlightenment. Have ANALYZE sort the lexemes and let the planner see pre-sorted data. While at it, have it sort on lexeme length first, and bytes second, so if the lengths don't match you don't have to detoast the datum in order to compare it. Still not sure if that's a win, but we'll see. After some minor changes I have a commitfest-ready piece of code.
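The comparison rule itself is simple enough to show; the struct is invented for the example, but the ordering - length first, bytes second - is the point:

    #include <string.h>

    /* invented representation: length plus bytes, no NUL terminator */
    typedef struct
    {
        int         len;
        const char *data;
    } Lexeme;

    /*
     * Sort by length first, bytes second.  With the text datums actually
     * stored in the statistics, a length mismatch can be decided from the
     * length alone, without detoasting the datum to compare its bytes.
     */
    static int
    lexeme_cmp(const void *a, const void *b)
    {
        const Lexeme *la = (const Lexeme *) a;
        const Lexeme *lb = (const Lexeme *) b;

        if (la->len != lb->len)
            return (la->len < lb->len) ? -1 : 1;
        return memcmp(la->data, lb->data, la->len);
    }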
So, it's been a hectic three months, and the joy of hacking has been plentiful. The September commit fest is coming up and it's very possible that improved text search selectivity will make its way into 8.4. Exciting, fun, refreshing, educational. That's GSoC for you!
Brief List of Things Learned (order by random()):
- Emacs pwnz (knew that already, but confirmation's always welcome)
- documentation is more important than code
- hsearch(3) works in ways that I'm inclined to call "perverse"
- git's good
- algorithms taken from papers *do* work
- tabs are evil, but can be tamed
- Postgres V1 was in Lisp (sic!)
- you never know where you'll be sleeping next week
And that wraps it up - all in all, great success!