PostgreSQL vs SQL/XML Standards

From PostgreSQL wiki
Jump to navigationJump to search

This page attempts to document where PostgreSQL’s XML-related features conform to or deviate from those in the standard (part 14 of ISO SQL, better known as SQL/XML).

Preliminary: significant revisions to the standards

Both the SQL/XML standard itself, and some of the XML technologies on which it is built, appeared in one form initially, but were then radically revised, creating a watershed of sorts between the first version and all the versions following.

As if to demonstrate the Fred Brooks maxim “plan to throw one away”, the standards have continued to develop since then, but in much less radical ways, thanks to the coherent foundations laid by throwing the first one away.

SQL/XML:2003 contrasted with SQL/XML since 2006

Eisenberg and Melton, in 1, introduce the significant redesigns that mark off SQL/XML (2006 and later) from the first edition that appeared in 2003. The paper is worth reading in full, but some highlights are:

  • The data model changed. The SQL/XML data model includes not only the various nodes of XML structure, but also what “atomic” (non-XML-node) types can be represented there. The 2003 edition had a data model based on XML Infoset, with Boolean, double, and string as the only atomic types. Since 2006, the data model has been based on that of the XML Query language, with a much richer type system.
  • The XML data type in SQL became parameterized. In place of the single type XML in the 2003 edition, since 2006 the data type has been more specific: XML(SEQUENCE), XML(CONTENT(ANY)), XML(CONTENT(UNTYPED)), XML(DOCUMENT(ANY)), XML(DOCUMENT(UNTYPED)), and (though not mentioned in the Eisenberg and Melton paper) XML(CONTENT(XMLSCHEMA ...)) and XML(DOCUMENT(XMLSCHEMA ...)) where the ... identifies a registered XML Schema. (The standard requires any schema used to be registered in advance, so that validation never involves retrieval of any schema from potentially untrusted sources.)
  • The standard since 2006 defined several new functions/predicates, all based on the XML Query language: XMLQUERY, XMLTABLE, and the predicate XMLEXISTS. It also adds the XMLCAST operation, capable of casting one of the refined XML types to another, but also of casting non-XML SQL data types to their representations in XML and vice versa.

XPath 1.0 contrasted with XQuery, XPath since 2.0

The W3C XPath language, in its 1.0 version, offered a simple syntax for following node paths in an XML document, and a small, incomplete set of operations on a data model based on Infoset (XML nodesets, and Boolean, double, and string as the only atomic types). It was replaced, in its 2.0 version, by a much more capable, rigorously specified language as a part of XQuery 1.0, strongly typed (based on the richer XML Query data model) with type inference, and a redesigned and expanded set of operators and functions on those types.

The watershed between XPath 1.0 and its redesigned successor, XPath 2.0 as used in XQuery 1.0, and the later versions of each, coincides with that between SQL/XML:2003 and the 2006 and later standard. The data model and formal semantics of the redesigned languages are the terms used to specify the redesigned SQL/XML.

SQL/XML support in PostgreSQL

PostgreSQL’s SQL/XML support follows the first, 2003 edition of the standard, with a few extensions resembling, but not conformant with, the current versions.

For brevity, because of the watershed revisions after 2003 and the more steady evolution of the standard since then, “the 2003 standard” will be used below for the first edition, and “the active standard” for the one introduced in 2006 and its subsequent revisions. “The standard” can be used for features that do not differ. Similarly, "XPath 1.0" may be contrasted to the "active" XPath and XQuery language standards.

PostgreSQL XML and SQL/XML (2003)

This is the edition of the standard that PostgreSQL most nearly implements.

The XML data type

The 2003 standard defines a single data type named XML. Its possible values are either null or a document information item as defined in the XML Information Set W3C recommendation of 2001, but relaxing the requirement of exactly one element information item child. PostgreSQL provides this type.

XML OPTION

By relaxing the “exactly one direct element child” requirement, the XML data type is able to hold XML “content” that is not strictly an XML “document”.

The standard defines an XML OPTION session property, with values DOCUMENT or CONTENT, affecting whether the requirement is enforced when an XML value is implicitly parsed or serialized. (The explicit XMLPARSE and XMLSERIALIZE operations have syntax to specify which treatment to apply.)

The XML OPTION can be specified in different ways:

  • With a SET XML OPTION statement, affecting the rest of the session. PostgreSQL implements this statement.
  • Explicitly on parameters and return types of functions. PostgreSQL does not implement this.

In the 2003 standard, neither XML subtype, DOCUMENT or CONTENT, completely contains the other: the grammar for CONTENT does not allow for a document-type definition, so any input that begins with a DTD must be parsed as DOCUMENT. The 2006 revision changed the definition of CONTENT to be a proper superset of DOCUMENT, so it is possible to use XML OPTION CONTENT to parse any valid input without knowing its form in advance (important for pg_restore). PostgreSQL implements the newer definition (starting with PostgreSQL 12, 11.3, 10.8, 9.6.13, 9.5.17, and 9.4.22).

Functions acting on XML

The standard defines functions XMLAGG, XMLCONCAT, XMLCOMMENT, XMLELEMENT, XMLFOREST, XMLPARSE, XMLPI, XMLROOT, XMLSERIALIZE.

PostgreSQL implements these functions.

Predicates

The standard defines an IS [ NOT ] DOCUMENT predicate. PostgreSQL implements it, if somewhat inefficiently (the property is not recorded in PostgreSQL’s representation of the XML type, so this predicate must repeat the effort of parsing to produce its result).

Mappings to XML

The standard defines several mappings from SQL types and data to XML, without specifying any syntax for invoking them:

  • an SQL table to XML and an XML Schema document
  • an SQL schema to XML and an XML Schema document
  • an SQL catalog to XML and an XML Schema document

PostgreSQL provides functions that implement these mappings.

XMLBINARY and XMLNAMESPACES

The standard has syntax to specify whether binary data should be mapped to XML in base64 or hex form, and to specify the XML namespace mappings in scope for an operation.

The standard syntax is lexically scoped, extending the grammar for WITH on a query to include XMLBINARY and XMLNAMESPACES clauses.

PostgreSQL does not implement the lexically-scoped settings, but instead has an xmlbinary setting in session state, and allows various XML functions to accept parameters that specify namespace mappings.

It’s been suggested that the addition of some lexically-scoped option setting syntax to PostgreSQL could be useful for more than just SQL/XML conformance.

PostgreSQL extensions to SQL:2003 SQL/XML

  • PostgreSQL has an XMLEXISTS predicate. This is like the predicate of the same name in the active standard, but the test must be written in XPath 1.0 instead of XQuery, so the context item passed in can only be XML DOCUMENT, never CONTENT or SEQUENCE; also, no other parameters can be passed in.
  • PostgreSQL has three predicates, xml_is_well_formed, xml_is_well_formed_document, xml_is_well_formed_content, that apply to an argument of text type rather than xml. In a sense, they test the parsability of the text.
  • PostgreSQL has functions xpath and xpath_exists to evaluate expressions in XPath 1.0. Like XMLEXISTS, they can only accept a DOCUMENT argument. xpath_exists returns boolean and is nearly equivalent to XMLEXISTS, but can accept namespace mappings as an additional argument. It is needed because PostgreSQL does not have the lexically-scoped XMLNAMESPACES syntax to use with XMLEXISTS. These functions resemble XMLEXISTS and XMLQUERY in the active standard, except for the restriction to XPath 1.0 and other limits discussed further below.
  • PostgreSQL has an XMLTABLE function. It is like the function of the same name in the active standard, but the row and column expressions must be written in XPath 1.0 instead of XQuery, the argument can only be XML DOCUMENT, never CONTENT or SEQUENCE, and there are limits on the types that can be returned, discussed further below.
  • PostgreSQL has an xmlvalidate function, but it simply raises an 0A000 “not implemented” error; it is disabled because of security concerns with possibly retrieving a schema from an untrusted location. It has the same name as the active standard’s XMLVALIDATE function, but returns boolean, functioning more like the active standard’s IS VALID predicate. (The standard addresses the untrusted-schema concern by requiring any schema to have been “registered” with the database in advance).
  • PostgreSQL allows BY REF (or, from PG 12 on, BY VALUE) to appear in various places in XML function syntax, a borrowing from the active standard. PostgreSQL ascribes no meaning to them, parsing them only for compatibility with the later syntax. However, under its actual meaning in the standard, only BY VALUE correctly captures PostgreSQL's behavior. BY REF in SQL/XML means that nodes can be passed and returned among functions and retain their XQuery node identity, which is impossible in PostgreSQL because the XML type’s representation is a string serialization.

PostgreSQL XML and SQL/XML (active)

The XML data type

In the active standard, instead of a single XML type, there is a hierarchy of types, with XML(SEQUENCE) being the most general, and the other, specific subtypes applying when the indicated conditions are true of the value:

SEQUENCE
|
(?sequence of length 1, a document node)
|
CONTENT(ANY)----------------.----------------(?every element
|                           |                 conforms to a
(?every element has         (?no extraneous   schema)
 xdt:untyped and !nilled,   nodes)            |
 every attribute has        |                 |
 xdt:untypedAtomic)         DOCUMENT(ANY)     CONTENT(XMLSCHEMA)
 |                                            |
 CONTENT(UNTYPED)                             (?whole thing is valid
 |                                            according to schema)
 (?no extraneous nodes)                       |
 |                                            DOCUMENT(XMLSCHEMA)
 DOCUMENT(UNTYPED)

In the active standard, unlike the 2003 edition, any value of a DOCUMENT type is also a value of a CONTENT type; this appears explicitly in the rules for “parsing a string as an XML value”.

The single XML type in PostgreSQL can hold values that fit XML(CONTENT(ANY)) or any of the more specific types. It does not record which specific properties apply (for example, the IS DOCUMENT predicate in PostgreSQL has to reparse the stored value). There is no type in PostgreSQL that corresponds to XML(SEQUENCE), a lack that leads to complications retrieving the results of XPath queries, as discussed further below.

The names XML(CONTENT(XMLSCHEMA)) and XML(DOCUMENT(XMLSCHEMA)) in the hierarchy are not complete names for any type; the actual syntax allows declarations like XML(DOCUMENT(XMLSCHEMA ID foo)) where foo is the identifier of some XML Schema registered with the database in advance. Even in the syntax using a URI for the schema, such as XML(DOCUMENT(XMLSCHEMA URI 'http:...')), the URI must be associated with a schema that has been registered.

It would not be hard to use the typmod mechanism in PostgreSQL to encode all of the standard’s XML subtypes and allow a database to register roughly a billion XML schemas or schema fragments for the types to refer to.

Functions acting on XML

The active standard updates the semantics of XMLPARSE: because the XML CONTENT types completely contain the corresponding DOCUMENT types, XMLPARSE with the CONTENT option can be relied on for parsing input whose form is unknown. PostgreSQL provides the updated semantics (starting in PostgreSQL 12, 11.3, 10.8, 9.6.13, 9.5.17, and 9.4.22).

The active standard also adds the following functions.

XMLCAST

XMLCAST can check and apply conversions between XML subtypes, and also can convert between a non-XML type on one side (for example, an SQL date) and an XML type on the other (for example, an XML(SEQUENCE) containing a value of XML Schema atomic type xs:date).

PostgreSQL does not have XMLCAST.

XMLQUERY

The XMLQUERY function evaluates an XML Query expression, passing some value (optionally) to be the context item, and zero or more other values to be named parameters available to the query expression. The values passed in can be of XML types, or ordinary SQL data types, which will be made into instances of the XML Query data model as if by XMLCAST. The function returns whatever the expression produces, either raw as XML(SEQUENCE), or wrapped into a document node as XML(CONTENT).

PostgreSQL does not have the XMLQUERY function. It has a similar function, xpath, which can only evaluate XPath 1.0 expressions, and therefore can only accept a well-formed XML document as the context item. It does not support passing named parameters to the expression.

The return type of PostgreSQL’s xpath is xml[], an improvisation reflecting PostgreSQL’s lack of an XML(SEQUENCE) type. The expression can therefore return a sequence of values, but no value in the sequence can be anything but (in the active standard’s terminology) XML(CONTENT).

If the expression returns a value of an atomic type (which can only be boolean, number, or string in XPath 1.0), the SQL value can be recovered by casting the xpath result first to text and then to the correct type, except in the case of a string:

WITH data(x) AS (SELECT
  XMLELEMENT(name foo,
    XMLATTRIBUTES(42 AS num, true AS bool, 'I am a <string>' AS str)
  )
SELECT (xpath('*/@num', x))[1]::text::double precision from data;
42
...
SELECT (xpath('*/@bool', x))[1]::text::boolean from data;
 t
...
SELECT (xpath('*/@str', x))[1]::text from data;
 I am a &lt;string&gt;

The active standard’s XML(SEQUENCE) type can contain an xs:string value directly as an item, but in the PostgreSQL array of XML, the string result is necessarily wrapped as XML content, which results in character escaping that a simple cast to text does not remove. If PostgreSQL had XMLCAST, an XMLCAST to text would recover the correct string value.

(Discussion: 2 pointed out that, before commit aaf15e5, string values could be recovered correctly by the simple cast to text, because the escaping was not being applied. But the commit was correct; the escaping must be applied, given that the result must be XML CONTENT. The resulting usability issue reflects the lack of an XML(SEQUENCE) type and/or XMLCAST function, and is not the fault of the commit.)

XMLTABLE

The XMLTABLE function produces an SQL table result from its input, by evaluating one XML Query expression against the input to produce a sequence result, then producing one row for each item in the sequence by evaluating XML Query expressions for the columns against that item. The call site specifies the SQL types of the result columns, which get values correctly converted (as by XMLCAST) from the XML Schema typed results of the column expressions.

PostgreSQL has an XMLTABLE function that differs from the standard in several ways:

  • The row and column expressions cannot be XQuery; only XPath 1.0 is supported.
  • Therefore, the context item can only be a well-formed XML document.
  • Only a context item, no named parameters, can be passed in.
  • As XPath 1.0, the expressions can only have XML types or boolean, number, or string.
  • If the row expression returns a non-XML type, no output row is produced.
  • The row expression cannot control the order of the output rows.
  • Column expressions of boolean or number type produce errors, in PostgreSQL versions before 12.

A column expression can have an XML type or string type:

SELECT * FROM XMLTABLE('*' PASSING '<a>a</a>'
  COLUMNS a xml PATH '.', b text PATH '.', c text PATH '"hi"');
    a     | b | c  
----------+---+----
 <a>a</a> | a | hi

In PostgreSQL versions before 12, where column values of boolean or number type produce errors, they can be recovered by casting them in the XPath expression to string type, which the XMLTABLE implementation accepts and casts implicitly to the result column type:

SELECT * FROM XMLTABLE('*' PASSING '<a>a</a>'
 COLUMNS a xml PATH '.', b text PATH '.', c text PATH '"hi"',
  d boolean PATH 'string(. = "a")', e integer PATH 'string(string-length(.))');
    a     | b | c  | d | e 
----------+---+----+---+---
 <a>a</a> | a | hi | t | 1
XMLTEXT

PostgreSQL does not implement this function, which constructs a text node.

XMLVALIDATE

The active standard has an XMLVALIDATE function. The function returns an XML type (the specific XML subtype is static, determined in parse analysis). The function either raises a validation error, or returns an instance of its static type that is effectively a copy of the input in which untyped elements and attributes can have been annotated with more specific XML Schema types, in accordance with the schema.

PostgreSQL has a function called XMLVALIDATE, but it has a boolean return type, making it more like the active standard’s IS VALID predicate. The function in PostgreSQL produces an “unimplemented” error, because PostgreSQL does not have any implementation of the mechanism assumed in the standard for registering schemas in advance, which is necessary to address the security implications of a function that could refer to schemas from arbitrary locations.

XML Query regular expressions

The active standard defines LIKE_REGEX, OCCURRENCES_REGEX, POSITION_REGEX, SUBSTRING_REGEX, and TRANSLATE_REGEX, all of which apply the regular expression syntax used in XML Query. PostgreSQL does not implement these.

To keep things interesting for implementors, SQL:2006 alters a few XQuery regular expression semantics.

XQuery regular expressions are already based on XML Schema regular expressions but with additional features the XQuery spec layers on, including anchors, reluctant quantifiers, capturing and non-capturing groups, and back references. All of those features will naturally be provided in any off-the-shelf XQuery library.

Not to be outdone, SQL:2006 layered on further changes (it's section 9.16 in 9075-2:2006 committee draft, rule 9). Per the XQuery spec, the ^ and $ anchors recognize only the LF character as a newline, the . in non-dotall mode matches anything other than an LF or CR, \s recognizes only those two characters plus space and tab, and \S is the exact complement of \s.

SQL:2006 imposes a well-intended but challenging-to-implement revision so that ^, $, . and \s all agree with Unicode Technical Standard 18 rule 1.6 as to just what constitutes a line boundary.

What that requires \S to mean was not sorted out until 9075-2:2011, which clarifies that it is no longer an exact complement of \s; rather, it matches "any single character that is not matched by a single character that matches" \s (emphasis mine).

Those will be a chore to implement, if PostgreSQL ever does. Unless an XQuery library can be found that offers some kind of switch to apply those semantics, they would have to be done with a preprocessing pass to rewrite the regular expression, taking the dotall and multiline flags into account. A fully-general rewriting may not be possible, given that XQuery regexps do not have lookahead/lookbehind constructs.

Obstacles to improving conformance

The chief obstacle is PostgreSQL’s reliance on the libxml2 library. The library does not implement XQuery, or even the XPath 2.0 subset of it, but only XPath 1.0. This has deeper implications than simply having to write expressions in one language instead of another. The PostgreSQL documentation downplays the difference (in the description of XMLEXISTS), saying “PostgreSQL currently only supports XPath, which is a subset of XQuery.” But even that would be accurate only if the XPath supported were 2.0, which it is not. There seems to be little sign that libxml2 will ever implement it.

Hardships for developers

Related to version of XPath

Because XQuery and its subset, XPath 2.0, began with the radical redesign after XPath 1.0, this leaves PostgreSQL with a number of traps for unwary developers who know XQuery and XPath 2. To give just a few highlights, XPath 1.0:

  • Does not have the sequence type, which is the fundamental XQuery/XPath2+ expression type. Where a sequence can contain XML nodes, atomic values (all of those available in XML Schema), or mixtures of both, in an order controlled by the query, an XPath 1.0 expression can only evaluate to a nodeset (zero or more XML nodes), one boolean, one double, or one string.
  • Does not, therefore, allow a query to control the order of items in its result: XPath 1.0 defines a nodeset as "an unordered collection of nodes without duplicates". (In practice, the libxml implementation used in PostgreSQL appears to reliably return nodesets to PostgreSQL that iterate out in document order—that is, with the nodes in the same relative order they had in the input document—and to return them in that order regardless of anything the XPath expression tries to do. But that's implementation defined behavior, and doesn't seem to be promised anywhere in libxml's docs.)
  • Has no conditional operator. A query as simple as if ( hat ) then hat/@size else "no hat" has no XPath 1.0 equivalent.
  • Has no ordering comparison operator for strings (both "cat" < "dog", and "cat" > "dog" are false, because each is a numeric comparison of two NaNs, though = and != compare the strings).
  • Blurs the distinction between value comparisons and general comparisons (sale/@hatsize = 7 and sale/@customer = "alice" both are existentially quantified comparisons, true if there is any sale with the given value for the attribute, but sale/@taxable = false() is a value comparison to the effective boolean value of a whole nodeset, and true only if no sale has a @taxable attribute at all).
  • Has a data model whose root node can contain only one (“the”) element node.

The last restriction is an incompatibility between XPath 1.0 and all editions of the SQL/XML standard. Even SQL/XML:2003, which did not yet divide the XML type into a hierarchy, still allowed its single XML type to have values of DOCUMENT (with a single top level element) or CONTENT (without that restriction) type.

Because no edition of SQL/XML ever specified functions based on XPath 1.0, the standard itself steered clear of that incompatibility. The functions that involve queries are only in the active standard, and have always been based on XQuery/XPath 2.0, where the “more permissive” data model supports both DOCUMENT and CONTENT. The 2003 edition simply never included those functions.

The query-related functions appearing in PostgreSQL are extensions to the 2003 standard, patterned after the functions added in the active standard, but replacing XQuery with XPath 1.0, which is why they cannot support the standard’s data model.

That mismatch led to an awkward hack with the aim of allowing the xpath function to handle DOCUMENT or CONTENT, followed later by a 48-message, eight-party -hackers thread seeking alternatives to the awkward hack, which was ultimately reverted.

The search for alternatives went so far as using a libxml2 extension mechanism to add a custom function to the XPath function library to present a CONTENT-typed context item as a node set, and took three weeks and 38 messages for the essence of the problem to be properly attributed to the use of the wrong, pre-redesign version of XPath. (This is recounted here only to back up the claim that the "traps" under discussion are subtle enough to sap serious time and attention from a sharp group of developers!)

This section highlights only a few incompatibilities. The W3C maintains more complete lists of the compatibility breaks between the XPath 1.0 and 2.0 languages and between their corresponding function libraries.

Related to lack of XMLCAST

The inability to retrieve a properly unescaped string value from an xpath query result is a subject of repeated reports from users.

Considering the first report, the standard way to write the example would be:

SELECT XMLCAST(XMLQUERY('/z/text()' PASSING ...) AS text);

which would have the intended effect. Any effort to implement XMLCAST over libxml2 and XPath 1.0 would be necessarily approximate, as its semantics are specified in terms of XQuery expressions and the types and data model of XPath 2.0.

Discussion

This section has given examples of the surprises, stumbling blocks, and time sinks presented to users and developers in consequence of XML support depending on a library that implements a language, data model, and type system that are not the ones specified in the active SQL/XML standard.

The existing implementation continues to receive occasional efforts toward improvement and better conformance to the standard, but such efforts are bound to amount to a layering of approximate hacks so long as the fundamental mismatch with the standard’s data model and language remains unaddressed.

Possible ways forward

Recognizing that libxml2 is likely to remain stuck at XPath 1.0, two related threads in 2010 identified XQilla and Zorba as open-source (both Apache-licensed), C++ implementations of the active XPath/XQuery standards. Both implement a common C API, XQC.

Since 2016, another possibility is Saxon/C, derived from the Java Saxon/J, arguably the most mature, reference-quality product in the space (its main developer having served as editor on several of the relevant standards). Saxon’s license is MPL 2.0 for the free versions.

Saxon/C did not, when last checked, offer a binding to the XQC API, but the developer was open to adding one.

Saxon/C is not, however, a wholesale C port of Saxon. It is the Java version, ahead-of-time compiled and bundled with a specialized Java runtime, Excelsior Jet. Jet is a commercial product for which Saxonica holds a license to redistribute the runtime royalty-free, but that is a far cry from open source.

A successful integration into PostgreSQL of Saxon/C, with its specialized Java runtime in tow, might differ little from simply using PL/Java to run the original Saxon/J. For this, there is a proof-of-concept example included with PL/Java 1.5.1, still work in progress but with useful subsets of XMLQUERY and XMLTABLE behavior implemented. (In fact, because Saxon implements XQuery 3.1, these examples make the latest language features available, such as first-class functions, tumbling and sliding windows, among others from 3.0, and maps, arrays, and JSON support from 3.1.)

An argument to allow multiple, pluggable implementations

Unlike XPath 1.0, a language that looked simple because it was, XPath 2.0 and XQuery look simple thanks to being deceptively sophisticated languages with strong typing, type inference, and elaborate optimizers under the hood. Their successful implementation is a more demanding project than for XPath 1.0, which may explain why they have not appeared in libxml2. It also weighs against the idea of an independent implementation within the PostgreSQL project (as once suggested), provided there are usable, quality implementations to be found.

The languages’ sophistication leaves room for different optimizer designs to strongly affect performance, creating the possibility that different implementations may have drastically better or worse performance on different workloads. An ability to swap one implementation for another could allow a site to try from a choice of several implementations for one that performs best on the expected workload. A site would also have flexibility to decide which implementations’ license terms are acceptable.

Proposal 1: XQC API as integration point

Because XQilla and Zorba both implement the interface XQC, PostgreSQL could perhaps work toward an implementation of the missing SQL/XML functionality in terms of XQC, and provide a configuration option allowing a chosen XQC implementation to be used. If Saxon/C adds XQC API support, it could become a third implementation choice, at least at a site comfortable with the non-open-source Jet runtime.

It may also be possible to write a thin wrapper implementing XQC with Saxon/J running in PL/Java, providing another pluggable choice. (This is not the structure of the current PoC example in PL/Java, which is, for now, ordinary Java functions called directly.)

A possible way to get there

One path to such an XQC-based ultimate implementation would be to begin by completing the current Saxon/J-based example in PL/Java. The example currently uses the Saxon-specific API, s9api. That API is sufficiently low-level to allow easy implementation of some of SQL/XML’s low-level requirements, such as the mapping from SQL types of parameters to the specific XML Schema types in the XQuery static context.

Once completed, that s9api-based example could be used as the model for an implementation still using Saxon/J, but via the non-Saxon-specific Java XQJ API (which Saxon/J supports). This step is unlikely to be a purely mechanical rewrite, as the higher-level XQJ API may call for different approaches to implement some of the low-level SQL/XML requirements that could be done in s9api directly. The result of this stage would be the conversion of the working PL/Java/Saxon example into a still-working PL/Java/Saxon example using the more generic API.

Next, that code could be used as a model, exploiting the design similarities between the Java XQJ and C XQC APIs, for the ultimately-desired C implementation in PostgreSQL, calling into the XQC API, while also developing a wrapper XQC implementation that calls the Java XQJ one. The result again would be a working example transformed to a still-working one. At this stage’s completion, it should also be possible to experiment with XQilla and Zorba as alternate XQC providers. XQilla and Zorba both appear to support, at least partially, XQuery 3.0.

Proposal 2: desugaring to calls on normal extension functions

Proposal 1 entails developing something like a specialized version of the extension architecture, allowing the installation of pluggable modules implementing the XQC API.

This Proposal 2 relies instead on the existing extension architecture. It may offer an even shorter path to implementation, and lower barriers to entry for new pluggable implementations.

In the current proof-of-concept work toward a fully-conformant implementation based on Saxon in PL/Java, one necessary ingredient is the development of syntax desugarings from the SQL-ish standard constructs, which require parser support, into normal PostgreSQL function calls that can be implemented by extension functions. So a useful product of this work is the demonstration that such desugarings exist.

Aside: the desugared form of XMLTABLE currently used in the PL/Java PoC lacks one expected property of a standard XMLTABLE, namely that default value expressions should be evaluated only when needed, making it safe to use volatile function calls, etc. This desugared form is chosen for ease of writing by hand, as is necessary without parser support and automated rewriting. In the SQL standard, XMLTABLE is, in fact, defined by means of a (more longwinded) desugared rewriting, and that desugaring would, of course, have the expected property.

A future proposal for the PostgreSQL core could be to add support to the parser for the fully-conformant SQL:2006+ XML functions, and "support" them by means of internal rewriting to documented, desugared forms that call ordinary functions. Implementations based on Zorba, XQilla, PL/Java+Saxon, or others, would take the form of extensions (using the existing extension mechanism available since PG 9.1) supplying the functions used in the desugarings.

Ideally, it would be possible to have more than one such extension installed at a time, with a GUC to select which set of functions would be targeted by the internal rewriting.

The GUC could, for example, name the schema into which the chosen extension's functions were installed. Alternatively, a GUC might name the extension directly, and exploit a recently-proposed extension OID mapping mechanism to find the target functions.

Where Proposal 1 offered a path to implementation several steps long (complete the current PL/Java+Saxon PoC based on Saxon's native s9api, then reimplement over Saxon's implementation of the XQJ API, then develop a C integration point based on XQC and port the implementation a final time), the path needed for Proposal 2 looks considerably shorter:

  1. Work would be needed in core to add the conformant constructs to the grammar and make the desugared rewritings happen;
  2. The PL/Java PoC, as soon as completed, could directly serve as one implementing extension, with no need to be reimplemented in an XQC-based version;
  3. Other implementations based on Zorba, XQilla, ?, could likewise be built in the most straightforward way, and simply packaged as extensions defining the needed functions.