Query Parsing

From PostgreSQL wiki

Jump to: navigation, search

Contents

Query Parsing

There are quite a few use cases where it'd be of great help if you could access the query tree Postgres uses internally for processing queries.

For example, when you need to do non-trivial operations on queries, like:

  • Filtering out queries containing specific tables
  • Reformatting & pretty-printing query strings
  • Modifying or removing parts of the query string, e.g. rename a column

The Postgres documentation gives a good overview how query execution is handled and what the intermediate tree representations of a query string are.


Approaches

Using Postgres internals

Postgres already offers tooling to create a query tree from a given query string.

raw_parser takes a query statement as input and will return a parse tree. This parse tree can then be pretty-printed by using nodeToString.

An example can be seen here, this can be easily implemented in a Postgres extension as well as seen here.

   SELECT * FROM foo where bar = 42 ORDER BY id DESC LIMIT 23;
   (
      {SELECT 
      :distinctClause <> 
      :intoClause <> 
      :targetList (
         {RESTARGET 
         :name <> 
         :indirection <> 
         :val 
            {COLUMNREF 
            :fields (
               {A_STAR
               }
            )
            :location 7
            }
         :location 7
         }
      )
      :fromClause (
         {RANGEVAR 
         :schemaname <> 
         :relname foo 
         :inhOpt 2 
         :relpersistence p 
         :alias <> 
         :location 14
         }
      )
      :whereClause 
         {AEXPR  
         :name ("=")
         :lexpr 
            {COLUMNREF 
            :fields ("bar")
            :location 24
            }
         :rexpr 
            {A_CONST 
            :val 42 
            :location 30
            }
         :location 28
         }
      :groupClause <> 
      :havingClause <> 
      :windowClause <> 
      :valuesLists <> 
      :sortClause (
         {SORTBY 
         :node 
            {COLUMNREF 
            :fields ("id")
            :location 42
            }
         :sortby_dir 2 
         :sortby_nulls 0 
         :useOp <> 
         :location -1
         }
      )
      :limitOffset <> 
      :limitCount 
         {A_CONST 
         :val 23 
         :location 56
         }
      :lockingClause <> 
      :withClause <> 
      :op 0 
      :all false 
      :larg <> 
      :rarg <>
      }
   )


This approach isn't without caveats though, since the nodeToString code is used only for printing out planned query trees. These trees are rewritten to contain only contain nodes ultimately needed by the Executor. This means that all nodes not expected in later stages of query execution can't be printed, for example INSERT/UPDATE/DELETE nodes - since later on these are all aggregated in a ModifyStmt node.

pg_query

pg_query is a Ruby gem which can generate query trees in JSON representation. It accomplishes this by using a patched Postgres version which on one hand addresses the shortcomings outlined above, and on the other hand it provides a new function nodeToJSONString which turns a query tree into a JSON representation instead of the native Postgres format.

The Ruby bindings are just glue code around the Postgres code. More information can be found in this blog post

   #<PgQuery:0x000000009673a0
    @query="SELECT * FROM foo where bar = 42 ORDER BY id DESC LIMIT 23;",
    @parsetree=
     [{"SELECT"=>
        {"distinctClause"=>nil,
         "intoClause"=>nil,
         "targetList"=>
          [{"RESTARGET"=>
             {"name"=>nil,
              "indirection"=>nil,
              "val"=>{"COLUMNREF"=>{"fields"=>[{"A_STAR"=>{}}], "location"=>7}},
              "location"=>7}}],
         "fromClause"=>
          [{"RANGEVAR"=>
             {"schemaname"=>nil,
              "relname"=>"foo",
              "inhOpt"=>2,
              "relpersistence"=>"p",
              "alias"=>nil,
              "location"=>14}}],
         "whereClause"=>
          {"AEXPR"=>
            {"name"=>["="],
             "lexpr"=>{"COLUMNREF"=>{"fields"=>["bar"], "location"=>24}},
             "rexpr"=>{"A_CONST"=>{"val"=>42, "location"=>30}},
             "location"=>28}},
         "groupClause"=>nil,
         "havingClause"=>nil,
         "windowClause"=>nil,
         "valuesLists"=>nil,
         "sortClause"=>
          [{"SORTBY"=>
             {"node"=>{"COLUMNREF"=>{"fields"=>["id"], "location"=>42}},
              "sortby_dir"=>2,
              "sortby_nulls"=>0,
              "useOp"=>nil,
              "location"=>-1}}],
         "limitOffset"=>nil,
         "limitCount"=>{"A_CONST"=>{"val"=>23, "location"=>56}},
         "lockingClause"=>nil,
         "withClause"=>nil,
         "op"=>0,
         "all"=>false,
         "larg"=>nil,
         "rarg"=>nil}}],
    @warnings=[]>


pgpool-II

pgpool-II is a middleware for Postgresql which in it's simplest use-case is an connection broker between client and database servers.

To be able to do advanced query routing it needs to understand the queries it gets sent form the query. To be able to do that, the pgpool project forked the parser and implemented their own node outFuncs to be able to recreated a query string from a given query tree.

If you want to create your own as correct as possible Postgres query pretty-printer this might be a good starting point.


sqlparse

sqlparse is a Python module which implements a non-validating SQL parser suitable for parsing, splitting and formatting SQL statements. It provides the backbone for http://sqlformat.org/ - a SQL pretty printer.

Introspection into the queries is quite limited and since it's not specifically targeted at Postgres it's capabilities are quite limited when it comes to drilling down into query trees.

TODO

To make life easier for people using query trees the following things need to be addressed in Postgres core:

  • Implement outFuncs for all node types and not only those seen in a post-planning state.
  • Ship alternate outFuncs suitable for generating e.g. JSON trees instead of the native Postgres tree format.
  • pg_stat_statements: Consider using a placeholder character that can actually be parsed (e.g. "$0") as opposed to "?" (which is also an operator), since that would enable querytree-based tooling to work more easily on normalized queries.
Personal tools