IndexingXMLData

Idea

Current index approaches in PostgreSQL doesn't conforms XML specification. This GSoC project creates new index (based on GiST) with code name FastX which will follow different way of handling data in XML. The index stucture might be used for Xquery or Xpath 2.0 support without LibXML, because LibXML doesn't have support for that query technics. But this is sound of future, todays implementation use LibXML for XPath, so could be good start with select only subpart of XML document which will be then parsed and query by LibXML. Benefit of this is less CPU and Memory requirements during executing XPath query.

Details

As my research shows that lots of XML indexing technique has been invented and comparsion of them is itself quite difficult project. Some comparsion is written in [1]. In 2007 Nikolay Samokhvalov (PostgreSQL hacker) suggested XLABEL index technique based on ORDPATH method (used in SQL Server) which use schredding XML document to relation model [2].

My approach is different. I don't use schredding technique but hybrid method based on Patricia Trie enrich with XML attributes. Basic Idea is same as IndexFabric defined in [3] and [4].

Description of algorithm

Patricia Trie

Trie is a tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes. A Patricia trie is a simple form of compressed trie which merges single child nodes with their parents. More efficient for long keys (non-common postfix in one node).

Base points of IndexFabric

<?xml version="1.0" ?>
<book>
<title>survey</title>
<chapter>
<section>index</section>
<section>
<paragraph>by</paragraph>
</section>
</chapter>
<chapter>
<section>indeed</section>
<section>data</section>
</chapter>
<appendix />
</book>

Description:

Extension of dataguide for text search
Keeps all label paths starting from the root
Encode each label path with data value as a string
Use efficient index for strings to store it (Patricia trie)
Perform queries on keywords for elements as string search
Does not keep information on non-terminal nodes
Use designator table with codes of elements

Evaluation on Fabric Index

Identifiers
- One per document
Descendant/Ancestor Search
- As string search; do not keep order of elements
Keyword Search
- By Patricia trie leaves if expanded; value index otherwise
Update
- Insertion is incremental
- Deletion is complex
Index size (index stored with document)
- Entry number : Linear for tree
- Entry size : number of elements for a path

FastX changes Use 2 separate tables, which one is named xnames and store all element names with offset from begin of XML document. Second table is xattribs with full name of all attributes for all elements.

Contact

tomaspospisil personal blog

Schedule

until 25th May
- made research of existing index techniques, used for Master Thesis

26th June - 16th June
- Create first model with LibXML2 traversal methods (XMLReader, SAX)

17th June - 22th June
- Master Thesis defend

23th June - 7th July
- Index implemented for elements and ready for midterm evaluation.
- Benchmark on representative sort of datasets.

8th July - 16th August
- Detail benchmarks on most wide range of datasets.
- Final refactoring, testing and commit.

IndexingXMLData

Contents

Idea

Details

Description of algorithm

Patricia Trie

Base points of IndexFabric

Contact

Schedule

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Tools