IndexingXMLData

From PostgreSQL wiki
Jump to navigationJump to search

Idea

Current index approaches in PostgreSQL doesn't conforms XML specification. This GSoC project creates new index (based on GiST) with code name FastX which will follow different way of handling data in XML. The index stucture might be used for Xquery or Xpath 2.0 support without LibXML, because LibXML doesn't have support for that query technics. But this is sound of future, todays implementation use LibXML for XPath, so could be good start with select only subpart of XML document which will be then parsed and query by LibXML. Benefit of this is less CPU and Memory requirements during executing XPath query.

Details

As my research shows that lots of XML indexing technique has been invented and comparsion of them is itself quite difficult project. Some comparsion is written in [1]. In 2007 Nikolay Samokhvalov (PostgreSQL hacker) suggested XLABEL index technique based on ORDPATH method (used in SQL Server) which use schredding XML document to relation model [2].

My approach is different. I don't use schredding technique but hybrid method based on Patricia Trie enrich with XML attributes. Basic Idea is same as IndexFabric defined in [3] and [4].

Description of algorithm

Patricia Trie

Patricia trie.jpg

Trie is a tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes. A Patricia trie is a simple form of compressed trie which merges single child nodes with their parents. More efficient for long keys (non-common postfix in one node).

Base points of IndexFabric

<?xml version="1.0" ?>
<book>
<title>survey</title>
<chapter>
<section>index</section>
<section>
<paragraph>by</paragraph>
</section>
</chapter>
<chapter>
<section>indeed</section>
<section>data</section>
</chapter>
<appendix />
</book>


IndexFabric.png


Description:

  • Extension of dataguide for text search
  • Keeps all label paths starting from the root
  • Encode each label path with data value as a string
  • Use efficient index for strings to store it (Patricia trie)
  • Perform queries on keywords for elements as string search
  • Does not keep information on non-terminal nodes
  • Use designator table with codes of elements

Evaluation on Fabric Index

  • Identifiers
    • One per document
  • Descendant/Ancestor Search
    • As string search; do not keep order of elements
  • Keyword Search
    • By Patricia trie leaves if expanded; value index otherwise
  • Update
    • Insertion is incremental
    • Deletion is complex
  • Index size (index stored with document)
    • Entry number : Linear for tree
    • Entry size : number of elements for a path

FastX changes Use 2 separate tables, which one is named xnames and store all element names with offset from begin of XML document. Second table is xattribs with full name of all attributes for all elements.

Contact

tomaspospisil personal blog

Schedule

  • until 25th May
    • made research of existing index techniques, used for Master Thesis
  • 26th June - 16th June
    • Create first model with LibXML2 traversal methods (XMLReader, SAX)
  • 17th June - 22th June
    • Master Thesis defend
  • 23th June - 7th July
    • Index implemented for elements and ready for midterm evaluation.
    • Benchmark on representative sort of datasets.
  • 8th July - 16th August
    • Detail benchmarks on most wide range of datasets.
    • Final refactoring, testing and commit.