VTD-XML: The Future of XML Processing

   VTD+XML Spec Version 1 and 2

1. Introduction

In September 2003, XimpleWare described the concept of a human-readable binary XML format in a position paper submitted to the W3C's binary XML workshop. The name of the format is VTD+XML. Because VTD-XML's internal representation of the XML infoset is inherently persistent, VTD+XML is a binary format that packages VTD records, LC (location cache) entries, and the XML text into a single file. We also provide the option to store the VTD index in a file separate from the XML document. This document is the second draft of the VTD+XML spec. We welcome any suggestions and input.

2. Changes from Previous Draft Versions

This version moves the XML payload to the top. The motivation: when people use text editors to read the VTD+XML file, they don't have to scroll way down to read XML text.

This version also includes a spec for storing the XML document and the VTD index separately.

In addition, character encoding support is enlarged to a total of 256 types (a single byte).

3. How does it work?

In the latest versions of VTD-XML, there will be methods (writeIndex, readIndex) in VTDGen and VTDNav that allow developers to save and load VTD+XML in either separate or integrated form.

4. Benefits

XML indexing has been a heavily researched topic, but good solutions are still few and far between. Most existing XML indexes have a number of technical issues:

  • Not human readable

  • Not general purpose: Developers have to choose what to index. For queries that are not indexed, XML parsing is still needed.

  • Difficult to understand

  • Slow update

VTD+XML breaks new ground and offers a number of significant benefits:

  • Simple and general purpose: VTD+XML is an easy-to-understand and natural way to persist the pre-parsed form of XML. 30 seconds is all it takes for most people to understand it!

  • No loss to view-source principle:  XML is kept intact.

  • Ultra high performance index generation: See our benchmark report.

  • Incremental update -- Make efficient changes to XML content

  • Content extraction -- Extract XML content

  • Small footprint -- The index takes an additional 30%~50% of the XML document's size

  • Hardware acceleration -- Even higher performance is possible with an ASIC implementation.

  • Binary XML -- An upgrade path that retains the benefits of XML

5. The Spec For Integrated VTD+XML

First 4 bytes

  • First byte: version number (the first version is 1, corresponding to the VTD-XML 2.0 release)

  • Second byte: Encoding type

  • Third byte:

    • b0: 1 -> last LC uses ints; 0-> last LC uses longs

    • b1: 1 -> ns true;    0 -> ns false

    • b2: 1 -> big endian; 0 -> little endian (consistent for vtd buffer and LC buffer)

    • b3: 0 -> standard VTD; 1 -> extended VTD

    • b4~b7 -> 0

  • Fourth byte: Maximum document depth

Second 4 bytes

  • First and Second bytes: # of LCs  (in big endian)

  • Third and Fourth bytes: Root index (in big endian)

3rd to 6th 4-byte words

Reserved
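
The fixed header described above (two 4-byte words followed by four reserved words, 24 bytes in total) can be decoded with a few lines of code. The following is a minimal sketch, not the VTD-XML implementation: the names `VtdXmlHeader` and `parse_header` are hypothetical, and it assumes that b0 denotes the most significant bit of the third byte, which the text above does not state explicitly.

```python
import struct
from dataclasses import dataclass

HEADER_SIZE = 24  # two 4-byte words plus four reserved 4-byte words


@dataclass
class VtdXmlHeader:  # hypothetical container, not part of VTD-XML
    version: int
    encoding_type: int
    last_lc_uses_ints: bool
    ns_aware: bool
    big_endian: bool
    extended_vtd: bool
    max_depth: int
    lc_count: int
    root_index: int


def parse_header(buf: bytes) -> VtdXmlHeader:
    """Decode the first 24 bytes of a VTD+XML file."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("header needs at least 24 bytes")
    # Four single bytes, then two big-endian 16-bit fields.
    version, encoding, flags, depth, lc_count, root = struct.unpack_from(
        ">BBBBHH", buf, 0
    )
    # ASSUMPTION: b0 is the most significant bit, so b4~b7 are the low nibble.
    if flags & 0x0F:
        raise ValueError("bits b4~b7 must be zero")
    return VtdXmlHeader(
        version=version,
        encoding_type=encoding,
        last_lc_uses_ints=bool(flags & 0x80),  # b0
        ns_aware=bool(flags & 0x40),           # b1
        big_endian=bool(flags & 0x20),         # b2
        extended_vtd=bool(flags & 0x10),       # b3
        max_depth=depth,
        lc_count=lc_count,
        root_index=root,
    )
```

Under that bit-order assumption, a third byte of 0xE0 would decode as "last LC uses ints, namespace aware, big endian, standard VTD". If b0 instead denotes the least significant bit, the four masks need to be mirrored.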

XML

  • 8 bytes: # of bytes

  • Append the XML, padded with zeros so that the overall XML payload length is an integer multiple of 8 bytes
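
The padding rule for the XML payload is simple arithmetic, sketched below. The function names are hypothetical, and two details are assumptions because the text does not spell them out: the 8-byte length field is taken to hold the unpadded XML length, and it is taken to be big endian like the other counts in this spec.

```python
import struct


def pack_xml_section(xml: bytes) -> bytes:
    """Length prefix, XML bytes, then zero padding to a multiple of 8."""
    pad = -len(xml) % 8  # 0..7 zero bytes
    # ASSUMPTION: the prefix stores the unpadded length, big endian.
    return struct.pack(">Q", len(xml)) + xml + b"\x00" * pad


def unpack_xml_section(buf: bytes, offset: int = 0):
    """Return the XML bytes and the offset of the section that follows."""
    (length,) = struct.unpack_from(">Q", buf, offset)
    start = offset + 8
    xml = buf[start:start + length]
    next_offset = start + length + (-length % 8)
    return xml, next_offset
```

Because every section that follows consists of 8-byte counts and 8-byte-aligned entries, this padding keeps the VTD records and LC entries aligned on 64-bit boundaries within the file.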

VTD Records

  • 8 bytes: # of VTD records (big endian)

  • Append VTD records

LC level 1

  • 8 bytes: # of LC entries (big endian)

  • Append LC entries (if uses ints, and entry count odd, append a 32 bit zero)

LC level 2

  • 8 bytes: # of LC entries (big endian)

  • Append LC entries (if uses ints, and entry count odd, append a 32 bit zero)

LC level 3

  • 8 bytes: # of LC entries (big endian)

  • Append LC entries (if uses ints, and entry count odd, append a 32 bit zero)

... (more LCs)
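
The VTD record section and each LC level share the same shape: an 8-byte big-endian count followed by the entries. The sketch below reads one such section; `read_counted_section` is a hypothetical helper, not a VTD-XML API. It assumes that entries are either 8 bytes (longs, as for standard VTD records) or 4 bytes (ints), and that the entry byte order follows the b2 flag of the header.

```python
import struct


def read_counted_section(buf: bytes, offset: int, entry_bytes: int,
                         big_endian: bool = True):
    """Read one counted section; return (entries, offset of next section)."""
    if entry_bytes not in (4, 8):
        raise ValueError("entries are assumed to be 4 or 8 bytes wide")
    # The entry count itself is always stored big endian.
    (count,) = struct.unpack_from(">Q", buf, offset)
    order = ">" if big_endian else "<"
    code = "Q" if entry_bytes == 8 else "I"
    entries = list(struct.unpack_from(f"{order}{count}{code}", buf, offset + 8))
    size = count * entry_bytes
    # An odd number of ints is followed by one 32-bit zero, keeping
    # the next section aligned on an 8-byte boundary.
    next_offset = offset + 8 + size + (-size % 8)
    return entries, next_offset
```

A reader would call this once for the VTD records and then once per LC level, passing `entry_bytes=4` only for the last level when the header says it uses ints.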

 

6. The Spec For Separate VTD+XML

First 4 bytes

  • First byte: version number (first one is version 2)

  • Second byte: Encoding type

  • Third byte:

    • b0: 1 -> last LC uses ints; 0-> last LC uses longs

    • b1: 1 -> ns true;    0 -> ns false

    • b2: 1 -> big endian; 0 -> little endian (consistent for vtd buffer and LC buffer)

    • b3: 0 -> standard VTD; 1 -> extended VTD

    • b4~b7 -> 0

  • Fourth byte: Maximum document depth

Second 4 bytes

  • First and Second bytes: # of LCs  (in big endian)

  • Third and Fourth bytes: Root index (in big endian)

3rd to 6th 4-byte words

Reserved

XML

  • 8 bytes: # of bytes

  • 16 bytes reserved (undefined)

  • Note that, compared with integrated VTD+XML, the XML document bytes are absent; users are responsible for managing the XML document themselves (an inherent shortcoming of separating the VTD index from the XML document)
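
The only layout difference between the two formats is what follows the 8-byte XML length, so the start of the VTD record section can be computed for either one. The sketch below is illustrative only: `vtd_section_offset` is a hypothetical helper, and telling the formats apart by the version byte (1 for integrated, 2 for separate) is an assumption based on the version numbers given in this spec.

```python
import struct

HEADER_SIZE = 24  # two 4-byte words plus four reserved 4-byte words


def vtd_section_offset(buf: bytes) -> int:
    """Return the byte offset at which the VTD record section begins."""
    version = buf[0]
    # ASSUMPTION: the XML length is an unpadded, big-endian byte count.
    (xml_len,) = struct.unpack_from(">Q", buf, HEADER_SIZE)
    if version == 2:
        # Separate index: 8-byte length, then 16 reserved bytes, no XML.
        return HEADER_SIZE + 8 + 16
    # Integrated file: 8-byte length, XML bytes, zero padding to 8 bytes.
    return HEADER_SIZE + 8 + xml_len + (-xml_len % 8)
```

In the separate case the recorded XML length can still be compared against the size of the XML document the user supplies, as a basic check that index and document belong together.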

VTD Records

  • 8 bytes: # of VTD records (big endian)

  • Append VTD records

LC level 1

  • 8 bytes: # of LC entries (big endian)

  • Append LC entries (if uses ints, and entry count odd, append a 32 bit zero)

LC level 2

  • 8 bytes: # of LC entries (big endian)

  • Append LC entries (if uses ints, and entry count odd, append a 32 bit zero)

LC level 3

  • 8 bytes: # of LC entries (big endian)

  • Append LC entries (if uses ints, and entry count odd, append a 32 bit zero)

... (more LCs)

 
