XReader

xReader - the XLOG reader for PostgreSQL - a Google Summer of Code 2012 project proposal

Project Title - xReader - the XLOG reader for PostgreSQL

Name of proposer and email - Aakash Goel, aakash.bits@gmail.com

Synopsis

Note: For explanation on terms used, please refer to 'Project Details'.

xReader is an tool that will allow users to read filtered data from any xlog stream.

In a master - hot standby database configuration, xReader will intercept the xlog stream, and provide custom-filtered (token based filtering) logical information to each of the listeners attached to it.

The tool will be the foundation of trigger-free audit and replication systems in future.

xReader will be packaged as an extension in the contrib/ directory.

Benefits to the PostgreSQL Community

Functionality provided by xReader will be the foundation for several future applications. Since one we will be able to filter logical data as required by tapping directly into the XLOG stream, tools which feed on this data can be built on top of xReader to develop Data Audit, Data Analysis and Data Replication applications. Each of these applications will be 'listeners' for xReader.

This tool also eliminates the dependency on triggers for data replication and analysis tasks. Triggers have several disadvantages including:

Performance overhead on the primary database - Imagine running 5 different triggers on an update in a complex replication environment - every time an update happens, 5 triggers need to be executed, resulting in a tremendous overhead.
Sequencing Issues - Multiple triggers defined on a particular event add the complexity of handling the order of firing of triggers.
Flexibility - Triggers are too rigid to cater to auditing and replication needs. Handling filtering is much easier if the database objects are not involved.

Reading custom-filtered logical data from the xlog stream bypasses these limitations.

Project Details

a. Need for xReader + Basic design

PostgreSQL logs are records of changes to the database at the physical level. As an example, for an update, these records do not directly tell which column of which table was updated, but only which bytes on which pages were changed.

As such, these records cannot be used to perform queries on the database at the logical level. Since performance is always an important factor with any production database, these queries must be offloaded from the database. If an audit or replication application consistently queries the database, it will degrade performance.

Database logs, on the other hand, can be streamed, duplicated or shipped. So can be the datafiles for the database, one might say. Yes, they can, but those datafiles contain dynamic data at the physical level. Using data files would be useful only if the database is not changing state, which is almost never the case.

PostgreSQL database logs do not contain information about the logical changes to the database. Without access to logical information, we cannot make sense of the data from the logs. Since the physical to logical mapping information is already available in the database itself, querying the database metadata is a solution to this problem.

b. Terms

Master database - The primary database where all data changes are originating. This will act as the source database in a replication configuration.
Standby database - The secondary database where the changes are being applied, to keep the database in synchronization with a master database.
Listener - A generic name for any client that can register with xReader.
Registration - A process in which a listener specifies the preferences for reading xlog contents.
Deregistration - A process in which a listener detaches itself from xReader with an intent to discontinue receiving the filtered xlog content.
Subscriber - Publisher model - A design model where the subscribers provide their preferences to a publisher, which then pushes information to each of its subscribers according to the preferences.

c. Listener Functions

Registration

 register(<Listener Name>,<xReader Name>,<Comma Separated List of tokens to be read>,<Stream Handle>);

Listener Name = Unique name of the Listener (unique per xReader instance)
xReader Name = Unique name of xReader
Comma Separated List of tokens to be read = the list of tokens corresponding to which the Listener will receive the data
Stream Handle = Handle to the stream where xReader will push the resulting data

Examples:

 register('Listener1','xReader_Main','oid,ctid','myfilehandle');

 register('Listener2','xReader_Main','*','stdout');

Deregistration

 deregister(<Listener Name>,<xReader Name>);

Example:

 deregister('Listener1','xReader_Main');

d. Where will xReader fit in?

As shown in the diagram below, xReader will intercept the xlog stream from the master to the standby, query metadata, and cater to the needs of all the registered listeners by proving filtered information to them.

e. What are the components of xReader?

Receiver: Receives the xlog stream, and forks it. One substream goes to Sender and then to the standby, while the other substream goes for parsing to Parser.
Parser: Parses the tokens present in the stream received, and forwards requests about the tokens to the Metadata Client.
Metadata Client: Connects to the standby database as any other client, and performs necessary queries to make sense of the tokens parsed by the xParser.
Subscription List: Maintains a record of all the listeners registered with xReader at a particular point of time, and their subscription tokens.
Dispenser: For each token of information received from Parser, loops through the Subscription List to send the data to all Listeners who requested for that particular token.
Sender: Sends the xlog stream to the standby database untouched.

f. How will xReader work?

Receiver will intercept the xlog stream transparently to the source database.
Sender will send the xlog stream transparently to the standby database. Together, Receiver and Sender ensure that xReader is transparent to both the databases.
To decipher the logical meaning of the tokens read, xReader will query the standby database for information.
Subscribers can add or detach themselves with xReader at any time.
xReader will receive the logs asynchronously from the master, and synchronously send the logs to the standby. Sending logs synchronously to the standby and waiting for application of changes to finish is a must because we need to be sure that the metadata queries always return the correct results.
Only when all the previous changes have been applied to the standby database can the Metadata Client query the standby database.
Updating the standby and querying the standby are both blocking synchronous operations.
Once the required data has been obtained, xReader will loop through the subscription list and send out data.
All the previous steps will be repeated as long as there are Listeners attached to xReader. In absence of any listeners, xReader will act as a transparent pass-through tool.

g. Where will xReader code reside?

The tool will be packaged as an extension module in the contrib directory of the PostgreSQL distribution.

h. Variations and Experiments

xReader model as described above is suitable for replication environments and depends on the availability of a standby database. One alternative that can be explored is to make use of the primary database for metadata queries. This configuration is depicted in diagram below. To make sure we are querying the correct metadata, the logs need to be streamed synchronously from the primary database.

Inch-stones

Provide a list of individual incremental steps to be implemented here.

Create xReader as a simple pass through tool
Create Receiver that Listens to xlog stream from the Master database
Create Sender that sends the xlog stream to the Standby database
Create Metadata Client to query the standby database
Create registration + deregistration model for Listeners
Create a simple listener that can attach to xReader and receive data
Implement types of xlog records one by one
Package code as an extension in the contrib/ directory
Documentation

Project Schedule

Provide a timeline for the inch-stones above.

June 15, 2012
June 20, 2012
July 5, 2012
July 12, 2012
August 10, 2012
August 12, 2012
August 18, 2012

Completeness Criteria

How can we tell if you've succeeded?

Working Proof of Concept Application - Intercept an asynchronous stream of xlog records using xReader and dump their logical interpretation to a disk file.