MergeImplementationDetails

From PostgreSQL wiki
Jump to navigationJump to search

This page details the implementation of the MERGE command as developed during the GSoC 2010.

This was never integrated into PostgreSQL, and requires significant work to be production quality


Implementation Outline

For adding the MERGE command into PostgreSQL system, we need to construct a whole work flow of this new query type. The parts we modified including:


1. Parser: add the syntax definition of MERGE command, transform the user query string into a internal structure.


2. Analyzer: do semantic check on the MERGE command statement, and build a “Query” node for it.


3. Rewriter: rewrite the query of MERGE and its actions. Fire rules if necessary.


4. Planner: build the plan tree.


5. Executor: run the plan


Besides, we need to modify the EXPLAIN command, let the MERGE query to be explainable.


Parser

Data Structure

In Parser, we created two nodes for the MERGE command.


/*The structure for MERGE command statement*/
typedef struct MergeStmt
{
	NodeTag		type;
	RangeVar   	*relation;		/*targe relation for merge */

	/* source relations for the merge. 
	*Currently, we only allwo single-source merge, 
	*so the length of this list should always be 1 
	*/
	List		*source;		
	Node	   	*matchCondition;	/* qualifications of the merge*/

	/*list  of MergeConditionAction structure. 
	*It stores all the matched / not-matched conditions and the corresponding actions
	*The elments of this list are MergeConditionAction nodes
	*/
	List	   	*actions;		

}MergeStmt;


/*	the structure for the actions of MERGE command. 
*	Holds info of the clauses like "WHEN MATCHED/NOT MATCHED AND <qual> THEN UPDATE/DELETE/INSERT"
*/
typedef struct MergeConditionAction
{
	NodeTag		type;
	bool 		match; /*match or not match*/
	Node		*condition;/*the AND condition for this action*/
	Node 		*action; /*the actions: delete , insert or update*/
}MergeConditionAction;


The information of a MERGE command will be parsed and transformed into a "MergeStmt" structure. The “actions” field in "MergeStmt" is a list of "MergeConditionAction" structures.

One "MergeConditionAction" represents one action in the command.


Beside, we add two new key words, MERGE and MATCHED, in kwlist.h

PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD)
 
PG_KEYWORD("merge", MERGE, UNRESERVED_KEYWORD)


Process

In /parser/gram.y, we add the definition for MERGE command.


The target relation name, source table (which could be a subquery with alias) and the matching conditions can all be represented by existing components in gram.y


The merge action clauses (WHEN … clauses) will be transformed into MergeConditionAction nodes and appended in the list of “actions”.


For a “MATCHED” action, the bool field “match” will be set true. An action can specify its own additional qualificaion, which is an a_expr recorded in the field of "condition". The remain part of an action, such as the “SET …” clause of an update action or the “VALUES …” clauses of insert actions, will be transformed in to an " MergeUpdate / MergeInsert / MergeDelete " node and put in the “action” field of “MergeConditionAction”. These nodes are in fact the nodes of “UpdateStmt, InsertStmt and DeleteStmt”. We just create new node tags for them for diversity.


Analyzer

Data Structure

One MERGE command needs to create a serial of side-queries (not subqueries) for its actions. The query node for a merge action is quite similar with that of a common UPDATE/INSERT/DELETE query. In order to hold the merge action queries and to make a difference between them and common queries, we add two fields in the “Query” structure.


typedef struct Query
{
	NodeTag		type;

	CmdType		commandType;	/* select|insert|update|delete|utility */

	List	   *rtable;			/* list of range table entries */
	
	………

	bool		 isMergeAction; /*if this query is a merge action. */	 

	List 		*mergeActQry; /* the list of all the merge actions. 
								* used only for merge query statment*/			
} Query;


For a merge command, we will create a "Query" node for its main query (the query that joins source table with target table). And the merge actions will be transformed into "Query" nodes, whose “isMergeAction” fields are set to true. Then, these "Query" nodes for merge actions are linked to the “mergeActQry” field of their main query.


NOTE:


A special thing here is that all the merge action queries share the same range table list. The range table is created when the main query is transformed. And the transformation of merge actions will NOT create their own range table but just make their “rtable” point to that of the main query directly.


Process

transformMergeStmt()


The “MergeStmt” created by Parser will be sent to the function of “transformStmt()” in /parser/analyze.c . We add a new switch case in this function and pass the node to a new function “transformMergeStmt()”.


In this function, we build the main query of MERGE command in the following way:


1. Create a blank “Query” node, with a command type of “CMD_MERGE”.


2. Open the relation of target table with “RowExclusiveLock”. No adding it into rangle table yet.


3. Make a ”JoinExpr” node with the source table and target table as the left parameter and right parameter respectively. The join type is LEFT JOIN. This join expression will be the only element in the “FROM LIST” of main query.


4. Make a target list with only one “A_Start” reference.


5. Transform this query as an ordinary SELECT query. Thus, we will get a Query like that of command:


SELECT * FROM <source_table> LEFT JOIN <target_table> ON <match_condition>


6. Locate the Range Table Entry for the target table, set it as the real “resultRelation” of the query.


7. For each merge action, process it with “transformMergeActions()”, and append the returned result to the list of “mergeActQry” in main query.


8. Return the main query as the final result.


transformMergeActions()


For transforming a merge action we need to do the following:


1. Do a simple check on the action. Make sure that the INSERT action is only taken on “MATCHED” cases, and UPDATE/DELETE on “NOT MATCHED” cases.


2. Put the target relation reference and the additional qualifications in the corresponding fields of the MergeUpdate/MergeInsert/MergeDelete nodes. Then transform it as an common nodes with the function “transformStmt()”.


3. Return the result. Note that, all the merge action queries share the very same Range Table with the main query but have their own target lists and join trees quals.


NOTE:


1. In “transformMergeActions()” the ParseState is directly inherited from the main query process. Thus, the range table in it is the same one in the main query.


2. The “transformStmt()” will pass the nodes to corresponding functions such as “transformUpdateStmt()”. But we have modified these functions, so that they will not touch the range table.


3. The INSERT action query is a little bit different with a common INSERT query.


In an ordinary INSERT, the VALUES list can only contains constants and DEFAULT. (am I right??) But in the INSERT action of MERGE command, the VALUES list can have expressions with variables(attributes of the targe and source tables).


Besides, in the ordinary INSERT, a VALUES list can never be followed by a WHERE clause. But in MERGE INSERT action, there are matching conditions. Thus, the output qry of this function is an INSERT query in the style of "INSERT...VALUES...", except that we have other range tables and a WHERE clause.


Note that it is also different from the "INSERT ... SELECT..." query, in which the whole SELECT is a subquery. (We don't have subquery here).


Rewriter

The main job of Rewriter for a table-modifying query is to generate extra queries according to the rules defined on the Result Relation of the query. The rule-generated queries will be executed before or after the user-input query.


The process of MERGE command in Rewriter is simple. We need to fire the rules for each merge action and sum up the rule-generated queries of all actions. We did this by adding codes in function “QueryRewrite()” in /rewrite/rewriteHandler.c


Since the merge actions are “Query” nodes themselves, they can be applied to “RewriteQuery()” directly.


We only need to make sure that:


1. All the action queries should be processed by “QueryRewrite()”. But the actions of the same type should not generate rule queries for more than once.


2. If an action is replaced by INSTEAD rules (or NOTHING), it should be removed from the action list along with all the other actions of the same type.


3. If all the merge actions of a merge command is replaced by INSTEAD rules, the MERGE command query itself should also be eliminated from the query queue.


Planner

Data Structure

In Planner, the system builds plan trees for the queries that need to scan and join tables. The nodes in a plan tree define various actions in the process of fetching the result tuple from tables.


For a table-modifying query, namely an UPDATE/INSERT/DELETE query, the Planner will pack the plan tree with in a “ModifyTable” node, which is a new type of plan node in PostgreSQL 9.


As a new table-modifying query, MERGE is also handled in this way. The main query generated by “pg_analyze_and_rewrite()” will be fully processed by the function “Planner()”. And finally be returned as a “ModifyTable” node. The plan of main query will be put in the List field “plans” in “ModifyTalbe” nodes. We prefer to call this plan the “main plan” rather than the “top plan”. It contains all the information for joining the source table and target table. But, technically speaking, it is still beneath the merge action plans, because the tuple produced by the main plan will be further processed by (one of) the merge actions.


MergeAction


We design a new plan type for merge actions specially, which is the “MergeAction” node.


typedef struct MergeAction
{
	Plan plan;
	CmdType	operation;/* INSERT, UPDATE, or DELETE */
	List *flattenedqual; /*the flattened qual expression of action*/
}MergeAction;


mergeActPlan in ModifyTable


In planner the “MergeAction” nodes created for merge actions will be packed in “ModifyTable” nodes. For holding these nodes, we add a List field “mergeActPlan” in “ModifyTable”. Thus, the nodes of merge actions will be linked in the “mergeActPlan” of “ModifyTable” of their main plan.


typedef struct ModifyTable
{
	Plan		plan;
	CmdType		operation;		/* INSERT, UPDATE, or DELETE */
	List	   *plans;			/* plan(s) producing source data */
	
	……
	
	List 	*mergeActPlan;	/*the plans for merge actions, 
				which are also ModifyTable nodes*/
} ModifyTable;


For summing up, we can show this structure as a picture.


The planner result of merge command.jpg


Process

The “Query” node of MERGE command is passed to function “subquery_planner()” through “Planner()”, which will do most of the main job of building the main plan, in the common way.


At the end of this function, the main plan is put into a “ModifyTable” node for returning. At this step, we process all the merge actions by the function “merge_action_planner()”, and link the results to the list of “mergeActPlan”.


merge_action_planner()


This function generates a “ModifyTable” node from the “Query” node of a merge action. It does very limited jobs, including:


0. Make a new “MergeAction” node as the returning result.


1. Preprocess the target list


2. Preprocess the qual in JoinTree of the Query.


3. Generate a “flat” qual expression for the EXPLAIN command.


4. Push up the Vars in target list and qual.


5. Put all the pieces of information into the result “MergeAction”. Pack it with a “ModifyTable” node and return.


Instruction for step 4:


When prepare for the MERGE command, we have made a left join between the source table and target table as the main plan. In this case, the range table contains ONLY THREE range table entries:

  • The first one is that of the source table, which may be a subquery or a plain table
  • The second one is the entry of the targe table, which is a plain table
  • The third one is a join expression with the sourse table and target table as its parameters.


Each merge action of the command has its own query and plan nodes as well. And, the vars in its target list and qual expressions may refers to the attribute in any one of the above 3 range table entries.


However, since the result tuple slots of merge actions are projected from the returned tuple of the join, we need to mapping the vars of source table and target table to their corresponding attributes in the third range table entry.


Thus we need a function which does the opposit of the “flatten_join_alias_vars()” function. It walks through the target list and qual of a “MergeAction” plan, changes the vars' varno and varattno, make it points to the corresponding attribute in the result tuple of main plan.

Executor

Data Structure

As we have a new plan node MergeAction, we need to define a corresponding plan state node for it, which is “MergeActionState”.

typedef struct MergeActionState
{
	PlanState	ps;		/* its first field is NodeTag */
	CmdType		operation;	
} MergeActionState;

Similarly, we add a new list field “mergeActPstates” in “ModifyTableState” for holding these “MergeActionState” structures.

typedef struct ModifyTableState
{
	PlanState	ps;		/* its first field is NodeTag */
	CmdType		operation;
	……
	List 		*mergeActPstates; /*the list of the planstate of meger command actions. 
					NIL if this is not a merge command.
					The elements if it are MergeActionState nodes*/
} ModifyTableState;

Process

The Executor has 3 main phase functins: “ExecutorStart()”, “ExecutorRun()” and “ExecutorEnd()”. New codes are needed in Start and Run phases while the End phase remains unchanged.

Executor Start

In this phase, the main job is to Initialize plans, construct exec state for the query. This work is done by recursively calling the “ExecInitNode()” function.


“ExecInitNode()” will walk through the plan tree and pass the different plan nodes to their corresponding initializing functions.


For a MERGE command plan (as well as the other table-modifying queries), it will firstly go to “ExecInitModifyTable()”, which is the function for “ModifyTable” plan. At the end of this function, we added codes to traverse the “mergeActPlan” list of the input plan and collect the results into the list of “mergeActPstates” of the returning result.


“ExecInitMergeAction()”


In the “ExecInitNode()” we add a branch for MergeAction plan type. This branch flows to the function “ExecInitMergeAction()”. In this function, we will do the following:


1. Create a MergeActionState structure for returning.


2. Init the result tuple slot.


3. initialize tuple type


4. Create expression context for node


5. Initialize child expressions in target list and qual


6. Initialize the projection information


Finally, the initialized plan state structure for a merge command is like that shown in the following figure.


The ExecInitNode result of merge command.jpg


Executor Run

In this phase, the MERGE command is executed by the “ExecModifyTable()” function. The general procedure is simple.


Firstly, the main plan is executed and the tuple of it is returned one by one.


Then, for each of the slot produced by main plan, we try to extract the “ctid” attribute from it. If the ctid is NULL, then we get a NOT MATCHED tuple.


Then the slot and the ctid are passed to “ExecMerge()” function. In this function, the merge actions will be evaluated one by one. If the tuple meet the qualification of a merge action, we will project out the target tuple slot of the action from the slot of main plan. All the information is then passed to “ExecInsert()”, “ExecDelete()”, or “ExecUpdate()” according to the action’s command type. And, the table modification is done.


Once one action is taken, the remaining actions are skipped.


EXPLAIN MERGE command

EXPLAIN command can show the plan details for a query. We have done necessary modifications on the system, in order to make the MERGE command be explainable.

Firstly, in gram.y, we add the "MergeStmt" to be one of the ExplainableStmt members. In this way, the EXPLAIN MERGE ... command will be accepted by psql.

Then, in the funtion "ExplainNode()", the main plan of MERGE command will be exlpained as other "ModifyTable" plans. Before that, the function "ExplainMergeActions()" is invoked to show the merge actions.