Enhancing FDW functionality for PostgreSQL GSoC2011

Idea

Two FDWs that allows the user to access:

 1.CouchDB code: https://github.com/ZhengYang/couchdb_fdw
 2.ODBC enabled DBMS code: https://github.com/ZhengYang/odbc_fdw

Details

CouchDB FDW

In general, the main work of developing a FDW is to implement the 6 callback functions:

       * PlanForeignScan
       * ExplainForeignScan
       * BeginForeignScan
       * IterateForeignScan
       * ReScanForeignScan
       * EndForeignScan

CouchDB foreign server options:

address: The address or hostname of the CouchDB server.
- Default: 127.0.0.1
port: The port number on which the CouchDB server is listening.
- Default: 5984

CouchDB foreign table options:

database: The name of the CouchDB database to query.
- Default: test
<column name>: The column mapping to remote JSON attributes. If there is no column mapping specified, default is the origninal column name.Please note that '_doc' is reserved for mapping the entire JSON document. The format will be <foreign table col_name> '<JSON attr name>'

CouchDB User mapping options

username: The username to authenticate to the CouchDB server with.
- Default: <none>
password: The password to authenticate to the CouchDB server with.
- Default: <none>

Feature Details

If there is no quals to pushdown, we get a list of '_id's to begin with, and then fetch whatever records still exist as we build the tuples.
We can only pushdown '_id' (with or without '_rev') to CouchDB, which must use the TEXTEQ operator.
The top-level-attributes of a document can be mapped to columns in a foreign table.
In case you want the map a column to the entire JOSN doc, use columnname '_doc' in foreign table options (which means your remote couchdb shouldn't have a top-level-attribute called '_doc', otherwise there will be conficts).

Implementation Details

Some background information on CouchDB:

Only RESTful HTTP API available. There're some third party wrappers available here: http://wiki.apache.org/couchdb/Basics. Unfortunately the C wrapper is out of date and no longer maintained. According to the wiki "This one is horribly out of date. Don't read it. If you need a C API, you've got to write one".So gotta use curl and a JSON wrapper (my choice is YAJL) to do all the API calling.

Since FDW is read only, we only need to curl GET request. Below are some typical examples on how to construct a URL to access a CouchDB server:Assume the server is located at

   http://127.0.0.1:5984

URL A. access meta info of a database (Used in ExpainForeignScan to provide the foreign table info & PlanForeignScan to provide the cost based on the size and num of docs in current db )

   http://127.0.0.1:5984/db_name/ 
   returns useful information like "doc_count":4, "disk_size":49241

URL B. get all _id's in a database (this is used when no qual or qual's not based on _id, i.e. requires a full table scan)

   http://127.0.0.1:5984/db_name/_all_docs
   returns all _id in database named 'db_name'.
    
   http://127.0.0.1:5984/db_name/_all_docs?startkey="doc2"&limit=2  
   order, limit, startkey can be set to retrieve a range of _id's (This is useful especially when db size is too large, we can retrieve part by part to save memory)

URL C.get a doc with _id and _rev

   http://127.0.0.1:5984/db_name/some_id?rev=some_rev
   if rev is absent, the latest rev will be returned.

The general strategy is:

If a single record is meant to be retrieved(conditions contains '_id'), we call URL C to get the record. It's fast and efficient since only one HTTP call is made:
- e.g. SELECT * FROM test where _id = 'QWERTTYYUUU';
If a range of record is meant to be retrieve (all records OR filter conditions other than '_id' has been specified). We use URL B to load batches of records and then using URL C to access them one by one. After the current batch of records finished iteration, a new batch of records is loaded. Repeat until all the record has been accessed.
- e.g. SELECT * FROM test where name = 'tom'; (this means to retrieve all the JSON docs that has a top-level-attribute 'name' whose value equals to 'tom')

ODBC FDW

Using ODBC drivers to connect to a foreign database. The following options can be set in order to connect a foreign database (make sure you have ODBC driver managers and drivers installed in your computer):

ODBC foreign server options:

dsn: The Database Source Name for the foreign database system you're connecting to.
- Default: <none>

ODBC foreign table options:

database: The name of the database to query.
- Default: <none>

schema: The schema of the database to query.
- Default: <none>

table: The name of the table to query.
- Default: <none>

sql_query: Optional: User defined SQL statement for querying the foreign table.
- Default: <none>

sql_count: Optional: User defined SQL statement for counting number of records in the foreign table.
- Default: <none>

<column name>: The column mapping to remote table columns.

If there is no column mapping specified, default is the origninal column name.

ODBC user mapping options:

username: The username to authenticate to the foreign server with.
- Default: <none>

password: The password to authenticate to the foreign server with.
- Default: <none>

Explanation of implementation details:

the option 'dsn' is supposed to point to a valid Data Source Name which contains the connection information.A DNS usually can be found in the configuration files of a driver manager with DBMS specific parameters set there.
Although some of the parameters specified here sometimes can also be specified in the DSN, you still need to specify it here again because some of the drivers doesn't load parameters in configuration file properly (e.g MySQL connector doesn't load 'database' from config file).
The name qualifier is obtained from SQL_QUALIFIER_NAME_SEPARATOR, for most of the database will be 'dot notation'. For mysql, to fully qualify a table name, we use <db_name>.<table_name> (MySQL uses db_name as schema).However for postgresql, we use <db_name>.<schema>.<table_name>.So my implementation is to let user provide db_name, schema, table_name, we use <schema>SQL_QUALIFIER_NAME_SEPARATOR<table_name> as the format to fully qualify a table name. for mysql users, they will need to specify the schema same as db_name which seems to be what mysql system implied. The quote char can be determined by SQL_IDENTIFIER_QUOTE_CHAR. e.g. for mysql is backquote (`), wherever for postgresql is double quote ("). So the constructed query is something like this: SELECT <mapped column names> FROM <schema>< SQL_QUALIFIER_NAME_SEPARATOR > (WHERE ... if quals can be pushed down)
COUNT() function is used to provide information for Planner and Explain function.SELECT COUNT(*) FROM <schema>< SQL_QUALIFIER_NAME_SEPARATOR >

Despite the above effort for standardization, there still can be exceptions that uses customized SQL syntax. Therefore, to provide the ultimate flexibility, two more optional parameters can also be set to let the user provide a customized SQL statements.
1. sql_query: Customized query for fetch results. this query should include (not necessarily limited to) all the columns in the FDW.
  - e.g. SELECT * FROM test.test; (an asterisk is always a convenient way)
  - e.g. SELECT a, b, c, FROM test.test; (make sure set {a, b, c} contains all the columns in foreign table)
  - e.g. SELECT b, c, a FROM test.test; (order of the columns doesn't mater, the FDW will handle this :-)
2. sql_count: Customized query to count number of rows will be returned.the query should return a single row with the first element of the tuple containing the number meant to be the result of the count.
  - e.g. SELECT COUNT(id) FROM test.test; (this is more efficient than the default COUNT(*), as only one column is counted. The result is the same)
  - e.g. SELECT MY_COUNT(*) FROM test.test; (If the default count function is not COUNT(), this option also helps)
  - e.g. SELECT 9999; (Sometimes, count function can be resource consuming as it usually requires a full table scan, this example shows you how to bypass this step)

Schedule

1. Late April 2011 (Week 1): Research & reading mysql_fdw and redis_fdw

2. May 2011 (Week 2~5): Warm up and implement CouchDB FDW. Implement ODBC FDW for iODBC driver on one of the linux distribution.

3. June 2011 (Week 6~9): Implement ODBC FDW for Microsoft driver on Windows. Implement complex quals pushdown for ODBC FDW.

4. July 2011 (Week 10~13): Finish complex quals pushdown feature.

5. August 2011 (Week 14~16): Documentation and Testing.

Enhancing FDW functionality for PostgreSQL GSoC2011

Contents