Query Parsing

From PostgreSQL wiki
Jump to navigationJump to search

Query Parsing

There are quite a few use cases where it'd be of great help if you could access the query tree Postgres uses internally for processing queries.

For example, when you need to do non-trivial operations on queries, like:

  • Filtering out queries containing specific tables
  • Reformatting & pretty-printing query strings
  • Modifying or removing parts of the query string, e.g. rename a column

The Postgres documentation gives a good overview how query execution is handled and what the intermediate tree representations of a query string are.


Approaches

Using Postgres internals

Postgres already offers tooling to create a query tree from a given query string.

raw_parser takes a query statement as input and will return a parse tree. This parse tree can then be pretty-printed by using nodeToString.

An example can be seen here, this can be easily implemented in a Postgres extension as well as seen here.

   SELECT * FROM foo where bar = 42 ORDER BY id DESC LIMIT 23;
   (
      {SELECT 
      :distinctClause <> 
      :intoClause <> 
      :targetList (
         {RESTARGET 
         :name <> 
         :indirection <> 
         :val 
            {COLUMNREF 
            :fields (
               {A_STAR
               }
            )
            :location 7
            }
         :location 7
         }
      )
      :fromClause (
         {RANGEVAR 
         :schemaname <> 
         :relname foo 
         :inhOpt 2 
         :relpersistence p 
         :alias <> 
         :location 14
         }
      )
      :whereClause 
         {AEXPR  
         :name ("=")
         :lexpr 
            {COLUMNREF 
            :fields ("bar")
            :location 24
            }
         :rexpr 
            {A_CONST 
            :val 42 
            :location 30
            }
         :location 28
         }
      :groupClause <> 
      :havingClause <> 
      :windowClause <> 
      :valuesLists <> 
      :sortClause (
         {SORTBY 
         :node 
            {COLUMNREF 
            :fields ("id")
            :location 42
            }
         :sortby_dir 2 
         :sortby_nulls 0 
         :useOp <> 
         :location -1
         }
      )
      :limitOffset <> 
      :limitCount 
         {A_CONST 
         :val 23 
         :location 56
         }
      :lockingClause <> 
      :withClause <> 
      :op 0 
      :all false 
      :larg <> 
      :rarg <>
      }
   )


This approach isn't without caveats though, since the nodeToString code is used only for printing out planned query trees. These trees are rewritten to contain only contain nodes ultimately needed by the Executor. This means that all nodes not expected in later stages of query execution can't be printed, for example INSERT/UPDATE/DELETE nodes - since later on these are all aggregated in a ModifyStmt node.

pg_query

pg_query is a Ruby gem which can generate query trees in JSON representation. It accomplishes this by using a patched Postgres version which on one hand addresses the shortcomings outlined above, and on the other hand it provides a new function nodeToJSONString which turns a query tree into a JSON representation instead of the native Postgres format.

The Ruby bindings are just glue code around the Postgres code. More information can be found in this blog post

   #<PgQuery:0x000000009673a0
    @query="SELECT * FROM foo where bar = 42 ORDER BY id DESC LIMIT 23;",
    @parsetree=
     [{"SELECT"=>
        {"distinctClause"=>nil,
         "intoClause"=>nil,
         "targetList"=>
          [{"RESTARGET"=>
             {"name"=>nil,
              "indirection"=>nil,
              "val"=>{"COLUMNREF"=>{"fields"=>[{"A_STAR"=>{}}], "location"=>7}},
              "location"=>7}}],
         "fromClause"=>
          [{"RANGEVAR"=>
             {"schemaname"=>nil,
              "relname"=>"foo",
              "inhOpt"=>2,
              "relpersistence"=>"p",
              "alias"=>nil,
              "location"=>14}}],
         "whereClause"=>
          {"AEXPR"=>
            {"name"=>["="],
             "lexpr"=>{"COLUMNREF"=>{"fields"=>["bar"], "location"=>24}},
             "rexpr"=>{"A_CONST"=>{"val"=>42, "location"=>30}},
             "location"=>28}},
         "groupClause"=>nil,
         "havingClause"=>nil,
         "windowClause"=>nil,
         "valuesLists"=>nil,
         "sortClause"=>
          [{"SORTBY"=>
             {"node"=>{"COLUMNREF"=>{"fields"=>["id"], "location"=>42}},
              "sortby_dir"=>2,
              "sortby_nulls"=>0,
              "useOp"=>nil,
              "location"=>-1}}],
         "limitOffset"=>nil,
         "limitCount"=>{"A_CONST"=>{"val"=>23, "location"=>56}},
         "lockingClause"=>nil,
         "withClause"=>nil,
         "op"=>0,
         "all"=>false,
         "larg"=>nil,
         "rarg"=>nil}}],
    @warnings=[]>


pgpool-II

pgpool-II is a middleware for Postgresql which in it's simplest use-case is an connection broker between client and database servers.

To be able to do advanced query routing it needs to understand the queries it gets sent form the query. To be able to do that, the pgpool project forked the parser and implemented their own node outFuncs to be able to recreated a query string from a given query tree.

If you want to create your own as correct as possible Postgres query pretty-printer this might be a good starting point.


sqlparse

sqlparse is a Python module which implements a non-validating SQL parser suitable for parsing, splitting and formatting SQL statements. It provides the backbone for http://sqlformat.org/ - a SQL pretty printer.

Introspection into the queries is quite limited and since it's not specifically targeted at Postgres it's capabilities are quite limited when it comes to drilling down into query trees.

pglast

As an alternative to sqlparse for Python use, there is also pglast, which, like pg_query, mentioned above, also builds on the underlying machinery of libpg_query to make PostgreSQL's internal query parsing code and the elements of the parse tree available not only for validating and breaking down queries grammatically but also for building them up from parts, patching them, etc., and comes with pretty-printing and some PostgreSQL-specific functionality (rewriting comma-separated argument-list in calls of builtin functions with available stadardsy equivalent phrases, e.g.) for rendering to a range of outputs. E.g.,

   >>> import pglast
   >>> import pglast.stream
   >>> s = pglast.parse_sql("select pg_catalog.overlay('terse', 'verbo', 1, 3)")
   >>> s
   (<RawStmt stmt=<SelectStmt targetList=(<ResTarget val=<FuncCall funcname=(<String sval='pg_catalog'>, <String sval='overlay'>) args=(<A_Const isnull=False val=<String sval='terse'>>, <A_Const isnull=False val=<String sval='verbo'>>, <A_Const isnull=False val=<Integer ival=1>>, <A_Const isnull=False val=<Integer ival=3>>) agg_within_group=False agg_star=False agg_distinct=False func_variadic=False funcformat=<CoercionForm.COERCE_EXPLICIT_CALL: 0>>>,) groupDistinct=False limitOption=<LimitOption.LIMIT_OPTION_DEFAULT: 0> op=<SetOperation.SETOP_NONE: 0> all=False> stmt_location=0 stmt_len=0>,)
   >>> pglast.stream.IndentedStream()(s[0].stmt)
   "SELECT pg_catalog.overlay('terse', 'verbo', 1, 3)"
   >>> pglast.stream.IndentedStream(special_functions=True)(s[0].stmt)
   "SELECT overlay('terse' PLACING 'verbo' FROM 1 FOR 3)"
   >>> s[0].stmt.targetList[0].val.args[0].val.sval = "hasten"
   >>> pglast.stream.IndentedStream(special_functions=True)(s[0].stmt)
   "SELECT overlay('hasten' PLACING 'verbo' FROM 1 FOR 3)"
   >>> import pprint
   >>> pprint.pprint(s[0].stmt())  # yes, a node (in this case, the SelectStmt at s[0].stmt) is called as a no-argument function to get its serialization
   {'@': 'SelectStmt',
   [... etc. ... ]
     'targetList': ({'@': 'ResTarget',
                    'indirection': None,
                    'location': 7,
                    'name': None,
                    'val': {'@': 'FuncCall',
                            'agg_distinct': False,
                            'agg_filter': None,
                            'agg_order': None,
                            'agg_star': False,
                            'agg_within_group': False,
                            'args': ({'@': 'A_Const',
                                      'isnull': False,
                                      'val': {'@': 'String', 'sval': 'hasten'}},
                                     {'@': 'A_Const',
                                      'isnull': False,
                                      'val': {'@': 'String', 'sval': 'verbo'}},
                                     {'@': 'A_Const',
                                      'isnull': False,
                                      'val': {'@': 'Integer', 'ival': 1}},
                                     {'@': 'A_Const',
                                      'isnull': False,
                                      'val': {'@': 'Integer', 'ival': 3}}),
                            'func_variadic': False,
                            'funcformat': {'#': 'CoercionForm',
                                           'name': 'COERCE_EXPLICIT_CALL',
                                           'value': 0},
                            'funcname': ({'@': 'String', 'sval': 'pg_catalog'},
                                         {'@': 'String', 'sval': 'overlay'}),
                            'location': 7,
                            'over': None}},),
    'valuesLists': None,
    'whereClause': None,
    'windowClause': None,
    'withClause': None}

It also includes a pgpp script that exercises much of this functionality from the command line.

Other libpq_query bindings

The README for libpg_query lists bindings in a few other languages as well.

TODO

To make life easier for people using query trees the following things need to be addressed in Postgres core:

  • Implement outFuncs for all node types and not only those seen in a post-planning state.
  • Ship alternate outFuncs suitable for generating e.g. JSON trees instead of the native Postgres tree format.