XML Support

From PostgreSQL wiki

Jump to: navigation, search
This project has been mostly completed. See the XML Todo for current issues.

You should start with the PostgreSQL XML Data Type and PostgreSQL XML Functions documentation in the manual when getting started with PostgreSQL's XML support. This page discusses the planning of that work, which is now largely complete, rendering the following obsolete.

Contents

Obsolete content - XML support planning

XML support in an SQL database system can mean many different things, but most often it means one of these three:

  • XML data type and support functions
  • XML export format
  • Mapping XML documents to SQL databases

Most of the interfaces for the first two items are codified by ISO/IEC 9075-14 (“XML-Related Specifications” - SQL/XML). As of 2006-06-12, we have ISO/IEC 9075-14:2006 available. (This is probably not really "SQL:2006", as it is the only part of SQL to be released at that time.) These interfaces are already implemented by IBM DB/2, Microsoft SQL Server, OpenLink Virtuoso, Oracle, and perhaps others, so there is some justification to follow that path. More information on this standard, albeit outdated, can be found on http://www.sqlx.org/.

Note that the 2006 standard is a fairly large enhancement of the 2003 version (the first release of SQL/XML), so to preserve our sanity it might be wise to aim for the 2003 version first.

XML Data Type and Support Functions

An XML data type is probably the most interesting piece of "XML support". Here are some details of how it should work and a report on the status.

XML Data Type

The XML data type in action would unsurprisingly look like this:

CREATE TABLE test (
    ...,
    data xml,
    ...
);

As with any data type, it features input checking and support functions to make it useful.

Strictly speaking, the xml type does not accept string literals as input. The way to convert text strings to XML data and back are:

XMLPARSE ( DOCUMENT|CONTENT value ) --> xml
XMLSERIALIZE ( DOCUMENT|CONTENT xmlvalue AS varchar ) --> varchar

These would serve as PostgreSQL input/output functions, with some syntax wrappers around it to support the standard syntax.

An xml datum can contain one of two sorts of values: a document or a fragment. A document contains a header, a root element, and all the other things that make a well-formed XML document. A fragment just looks like XML syntax but isn't (necessarily) a complete document. That's why some syntax pieces make reference to DOCUMENT and CONTENT (same as fragment, inconsistently named).

XML Support Functions

SQL specifies the following function-like expressions for the xml data type:

  • XMLCOMMENT
  • XMLCONCAT
  • XMLELEMENT
  • XMLFOREST
  • XMLPI
  • XMLROOT
  • XMLAGG
  • IS DOCUMENT, IS NOT DOCUMENT

In truth, these are glorified string-concatenation functions, but they do all the right escaping and nesting and add some convenience features for processing SQL data into XML data. Here are some examples:

XMLROOT (
  XMLELEMENT (
    NAME gazonk,
    XMLATTRIBUTES (
      ’val’ AS name,
      1 + 1 AS num
    ),
    XMLELEMENT (
      NAME qux,
      ’foo’
    )
  ),
  VERSION ’1.0’,
  STANDALONE YES
)

This produces the following XML document:

<?xml version=’1.0’ standalone=’yes’ ?>
<gazonk name=’val’ num=’2’>
  <qux>foo</qux>
</gazonk>

This shows how the SQL expressions nest like the XML elements.

Here is an example of an XML "forest":

SELECT xmlforest (
  "FirstName" as "FName", "LastName" as "LName",
  ’string’ as "str", "Title", "Region" )
FROM "Demo"."demo"."Employees";

With some data in the employees table, this might result in:

<FName>Nancy</FName>
<LName>Davolio</LName>
<str>string</str>
<Title>Sales Representative</Title>
<Region>WA</Region>
...
<FName>Anne</FName>
<LName>Dodsworth</LName>
<str>string</str>
<Title>Sales Representative</Title>

Each of these employee records is one SQL datum.

Since the result of XMLFOREST is an xml datum, you can nest this together with XMLELEMENT and the other expressions to further decorate the result.

Here is an example with the aggregate function XMLAGG:

SELECT xmlelement (’Emp’,
   xmlattributes (’Sales Representative’ as "Title"),
   xmlagg (xmlelement (’Name’, "FirstName", ’ ’,
                        "LastName")))
   FROM "Demo"."demo"."Employees"
   WHERE "Title" = ’Sales Representative’;

This might result in:

<Emp Title="Sales Representative">
  <Name>Nancy Davolio</Name>
  <Name>Janet Leverling</Name>
  <Name>Margaret Peacock</Name>
  <Name>Michael Suyama</Name>
  <Name>Robert King</Name>
  <Name>Anne Dodsworth</Name>
</Emp>

Here, the result is one SQL datum.

SQL/XML 2006

SQL/XML 2006 adds a bunch more:

  • There is a more specific XML data type allowing not just document vs. fragment, but document validation:
    • XML
    • XML(DOCUMENT)
    • XML(CONTENT)
    • XML(SEQUENCE)
    • XML(DOCUMENT(ANY))
    • XML(DOCUMENT(UNTYPED))
    • XML(DOCUMENT(XMLSCHEMA ’http://...’))
  • Some more support functions/expressions are added:
    • XMLDOCUMENT
    • XMLTEXT
    • XMLVALIDATE
    • IS CONTENT
    • IS VALID
  • More XML-related options are added all over unsuspecting syntax elements. That will not be pretty.
  • XQuery support is added.

Status

A patch to support much of the SQL/XML 2003 functions by Pavel Stehule has been available for some time. This patch does not include an XML data type; it operates on XML data stored in text fields.

A 2006 Google Summer of Code project by Nikolay Samokhvalov, mentored by Peter Eisentraut, works on complete SQL/XML 2003 support. This work integrates and improves Pavel Stehule's patch. The Summer of Code is over and a fairly complete implementation has been produced. We are looking to put this up for integration into 8.3 as early as feasible.

Implementation Issues

The implementation as proposed uses the well-known libxml, which is licensed under the LGPL, so it does not cause a problem for PostgreSQL. Some implementation issues that have been encountered at one point or another (some might be resolved by now) are:

  • The encoding handling is weird. XML documents carry their own encoding information. If, say, you have a document that identifies itself as UTF-8-encoded and ship that over a PostgreSQL client/server connection that converts the client-side UTF-8 to something else on the server, the XML parser in the server will complain. And there are similar cases in other combinations. The resolution of this is unknown.
  • The memory management in libxml needs to be carefully worked out. libxml provides hooks to replace malloc() and free(), which is nice, but it seems to call the free hook whenever it wants, which is not nice if the PostgreSQL memory manager has already deallocated the memory in some other way. This is a problem especially during error recovery. This can surely be resolved, but it needs to be verified.
  • To what degree should this feature be mandatory or optional? Some build-time option is not a problem, but the special syntax support cannot be ifdef'ed out easily. We need some robust recovery for those cases.

Future Projects

Once the xml type is available, the functionality in contrib/xml/ should be adapted to these new features.

A lot of people submitted Summer of Code applications for the implementation of CTree indexes. There isn't a lot of information available on that, but if you look on Google Scholar, you will find out that it's some new index type for XML data (no kidding). It was probably premature to propose working on that before there actually is an XML type, but once that is done, CTree support might be an interesting project.

SQL/XML 2006 calls for full integration of XQuery. Sleepycat has an XQuery library that we could use. (Last time I looked, it was cumbersome to build, and it's not widely deployed. But the Sleepycat folks were certainly interested in PostgreSQL's interest. That was before they were bought up, though.) It would, however, seem to make more sense to reuse PostgreSQL's existing query planning and execution facilities and consider XQuery a parallel alternative to SQL. This is how some other vendors handle this. It would be a big dare-I-say-paradigm change for PostgreSQL, though.

XML Export Format

Everyone can create their own XML export format for relational data, and many people have proposed their own inventions on the PostgreSQL mailing lists in the past. In order to settle on one format, the SQL standard contains a specification for it.

In particular, there are specifications for:

  • Mapping an SQL schema to an XML Schema
  • Mapping database contents to an XML document (which adheres to the XML Schema document above)

An XML export format might be useful for:

Postprocessing using XSLT 
Convert SQL data to XHTML or DocBook easily and automatically.
pg_dump 
As a universal backup format?
psql 
To replace or augment the existing HTML support?

Escaping

Where the SQL-specified XML export format beats the ad-hoc defined ones is the fully worked-out escaping mechanism, which looks like this:

SQL <identifier> fully-escaped XML name partially-escaped XML name
employee EMPLOYEE EMPLOYEE
"employee" employee employee
"hire date" hire_x0020_date hire_x0020_date
"comp_plan" comp_x005F_plan comp_x005F_plan
"dept:id" dept_x003A_id dept:id
xmlcol _xFFFF_xmlcol xmlcol

The two different escaping forms are used depending on the situation.

Schema Mapping

As an example, here is how the following table definition would look as XML Schema.

CREATE TABLE test (a int PRIMARY KEY, b varchar(200));
<xsd:complexType name="RowType.catalog.schema.test">
  <xsd:sequence>
    <xsd:element name="a" type="INTEGER"></xsd:element>
    <xsd:element name="b" type="VARCHAR_200_200"
        minOccurs="0"></xsd:element>
  </xsd:sequence>
</xsd:complexType>
<xsd:complexType name="TableType.catalog.schema.test">
  <xsd:sequence>
    <xsd:element name="row"
        type="RowType.catalog.schema.test"
        minOccurs="0"
        maxOccurs="unbounded" />
  </xsd:sequence>
</xsd:complexType>

Data Mapping

Here is how a generic database with one table would be exported:

<catalogname>
  <schemaname>
    <tablename>
      <row>
        <colname1>value</colname1>
        <colname2 xsi:nil=’true’/>
        ...
      </row>
      ...
    </tablename>
    ...
  </schemaname>
  ...
</catalogname>

The element names would actually be replaced by the actual names of the objects in question (hence the need for escaping). The only element name that is a literal in the above example is row, because rows don't have names.

There are options for null handling. The above example handles it with a special attribute. The other choice is to omit the element.

Status

Peter Eisentraut has a prototype available for the XML export and import. The export is pretty much just dilligence work that extracts the catalog information and prints it out. The import is actually a table function in this prototype. In general, it would be good to have the XML data type first before pursuing this further, so the interfaces don't have to be changed later on.

One question is where to put it.

pg_dump? 
This would fit the "export" theme, but it might be restricting the flexibility of the feature.
psql? 
Many previous proposals placed the work into psql, but that doesn't seem right.
user-defined functions? 
This sounds right and would integrate nicely with the XML data type support. But we might need a "streaming" variant for exporting large tables.

Mapping XML Documents to SQL Databases

Mapping XML documents to SQL databases would involve resolving the hierarchical structure of the XML document and creating the appropriate interlinked SQL tables with the respective data. SQL does not specify this, but database theorists have surely figured this out.

But actually, this is not a PostgreSQL-specific problem, so it is not really our job to solve it. The Perl module DBIx-XML-DataLoader, for example, implements such functionality.

Links

Personal tools