1. Introduction
In September 2003, XimpleWare described the
concept of a human-readable binary XML format in its
position paper submitted to W3C's binary XML workshop. The name of
the format is VTD+XML. Because VTD-XML's internal representation of the XML
infoset is inherently persistent, VTD+XML is simply the binary
format that packages VTD records, LC (location cache) entries, and the XML text into a single file.
We also provide the option to store the VTD index as a separate file from the XML
document. This document is the second draft of VTD+XML. We
welcome any suggestions and input.
2. Changes from Previous Draft Versions
This version moves the XML payload to the top of the file.
The motivation: people who open a VTD+XML file in a text editor
no longer have to scroll down to reach the XML text.
This version also includes a
spec for storing the XML document and the VTD index separately.
In addition, character encoding support is
enlarged to a total of 256 types (a single byte).
3. How does it work?
In the latest versions of VTD-XML, there will be
methods (writeIndex, readIndex) in VTDGen and VTDNav that allow developers to save and load VTD+XML
(either separate or integrated).
4. Benefits
XML indexing has been a heavily
researched topic, but good solutions are still few and far between.
Most existing XML indexes have a number of technical issues.
VTD+XML breaks new ground and possesses a number of significant
benefits:
- Simple and general purpose: VTD+XML is an easy-to-understand and natural way to persist the pre-parsed form of XML. It takes most people about 30 seconds to understand it!
- No loss to the view-source principle: XML is kept intact.
- Ultra high performance index generation: See our benchmark report.
- Incremental update -- Make efficient changes to XML content.
- Content extraction -- Extract XML content.
- Small footprint -- The index takes 30%~50% additional space.
- Hardware acceleration -- Even higher performance is possible with an ASIC implementation.
- Binary XML -- An upgrade path that retains the benefits of XML.
5. The Spec For Integrated VTD+XML
First 4 bytes
- First byte: version number (first one is version 1, corresponding to the VTD-XML 2.0 release)
- Second byte: encoding type
- Third byte:
  - b0: 1 -> last LC uses ints; 0 -> last LC uses longs
  - b1: 1 -> ns true; 0 -> ns false
  - b2: 1 -> big endian; 0 -> little endian (consistent for VTD buffer and LC buffer)
  - b3: standard/extended VTD; 0 -> standard; 1 -> extended
  - b4~b7 -> 0
- Fourth byte: maximum document depth
Second 4 bytes
3rd to 6th 4-byte words
Reserved
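As a rough illustration of the layout above, the first 4 bytes could be decoded as follows. This is a minimal Python sketch, not part of VTD-XML: the names `VtdHeader` and `parse_first_word` are hypothetical. The text above does not say whether b0 is the most or the least significant bit of the third byte, so the bit order is left as a parameter (`msb_first`, whose default is only an assumption and should be checked against an actual index file).

```python
from dataclasses import dataclass


@dataclass
class VtdHeader:
    """Fields carried by the first 4 bytes of a VTD+XML file."""
    version: int             # first byte
    encoding: int            # second byte: encoding type code (0-255)
    last_lc_uses_ints: bool  # third byte, b0
    ns_aware: bool           # third byte, b1
    big_endian: bool         # third byte, b2
    extended_vtd: bool       # third byte, b3
    max_depth: int           # fourth byte: maximum document depth


def _bit(flags: int, n: int, msb_first: bool) -> bool:
    # Assumption: bit numbering is not defined by the spec text, so both
    # conventions are supported (b0 = 0x80 if msb_first, else b0 = 0x01).
    shift = 7 - n if msb_first else n
    return bool((flags >> shift) & 1)


def parse_first_word(data: bytes, msb_first: bool = True) -> VtdHeader:
    """Decode version, encoding type, flag bits and maximum depth."""
    if len(data) < 4:
        raise ValueError("need at least 4 bytes")
    version, encoding, flags, max_depth = data[0], data[1], data[2], data[3]
    # b4~b7 are required to be 0.
    if any(_bit(flags, n, msb_first) for n in range(4, 8)):
        raise ValueError("bits b4~b7 of the third byte must be 0")
    return VtdHeader(
        version=version,
        encoding=encoding,
        last_lc_uses_ints=_bit(flags, 0, msb_first),
        ns_aware=_bit(flags, 1, msb_first),
        big_endian=_bit(flags, 2, msb_first),
        extended_vtd=_bit(flags, 3, msb_first),
        max_depth=max_depth,
    )
```

A reader would call `parse_first_word` on the start of the file and use the resulting `big_endian` and `last_lc_uses_ints` flags when decoding the VTD and LC buffers that follow.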
XML
VTD Records
LC level 1
- 8 bytes: # of LC entries (big endian)
- Append LC entries (if uses ints, and entry count odd, append a 32-bit zero)
LC level 2
- 8 bytes: # of LC entries (big endian)
- Append LC entries (if uses ints, and entry count odd, append a 32-bit zero)
LC level 3
- 8 bytes: # of LC entries (big endian)
- Append LC entries (if uses ints, and entry count odd, append a 32-bit zero)
... (more LCs)
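The LC layout (an 8-byte big-endian entry count, the entries, then a 32-bit zero when int entries leave an odd count) can be sketched as below. This is an illustration under stated assumptions, not the VTD-XML implementation: `pack_lc_level` and `unpack_lc_level` are hypothetical names, entries are treated as signed 32-bit or 64-bit values, and the entries themselves follow the byte order selected by the b2 flag while the count stays big endian.

```python
import struct


def pack_lc_level(entries, uses_ints, big_endian=True):
    """Serialize one LC level: count, entries, optional 32-bit zero pad."""
    order = ">" if big_endian else "<"
    code = "i" if uses_ints else "q"  # assumed signed 32-bit / 64-bit entries
    out = struct.pack(">q", len(entries))  # the count is always big endian
    out += struct.pack(f"{order}{len(entries)}{code}", *entries)
    if uses_ints and len(entries) % 2 == 1:
        out += b"\x00\x00\x00\x00"  # pad so the level ends on an 8-byte boundary
    return out


def unpack_lc_level(buf, offset, uses_ints, big_endian=True):
    """Read one LC level at offset; return (entries, offset of next section)."""
    (count,) = struct.unpack_from(">q", buf, offset)
    offset += 8
    order = ">" if big_endian else "<"
    code = "i" if uses_ints else "q"
    fmt = f"{order}{count}{code}"
    entries = list(struct.unpack_from(fmt, buf, offset))
    offset += struct.calcsize(fmt)
    if uses_ints and count % 2 == 1:
        offset += 4  # skip the 32-bit zero pad
    return entries, offset
```

With the padding rule applied, every LC level occupies a multiple of 8 bytes, so the next level always starts at the offset returned by `unpack_lc_level`.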
6. The Spec For Separate VTD+XML
First 4 bytes
- First byte: version number (first one is version 2)
- Second byte: encoding type
- Third byte:
  - b0: 1 -> last LC uses ints; 0 -> last LC uses longs
  - b1: 1 -> ns true; 0 -> ns false
  - b2: 1 -> big endian; 0 -> little endian (consistent for VTD buffer and LC buffer)
  - b3: standard/extended VTD; 0 -> standard; 1 -> extended
  - b4~b7 -> 0
- Fourth byte: maximum document depth
Second 4 bytes
3rd to 6th 4-byte words
Reserved
XML
- 8 bytes: # of bytes
- 16 bytes: reserved (undefined)
- Note that, compared with integrated VTD+XML, the XML document bytes are missing; users are responsible for managing that part manually (an inherent shortcoming of separating the VTD index from the XML document).
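Because the separate index stores only the XML byte count, one cheap sanity check a user could run before pairing an index with a document is to compare that count against the document's actual size. The Python below is a hedged sketch: `pack_xml_section` and `xml_matches_index` are hypothetical helpers, the byte count is assumed to be big endian (the text above does not say), and the reserved bytes are written as zeros and ignored on read.

```python
import struct

XML_SECTION_SIZE = 8 + 16  # byte count plus reserved bytes


def pack_xml_section(xml_length: int) -> bytes:
    """Build the XML section of a separate index for a document of this size."""
    # Assumption: big-endian count; reserved bytes are undefined, zeros used here.
    return struct.pack(">q", xml_length) + bytes(16)


def xml_matches_index(xml_section: bytes, xml_bytes: bytes) -> bool:
    """Return True if the recorded byte count equals the document's size."""
    if len(xml_section) < XML_SECTION_SIZE:
        raise ValueError("XML section must be at least 24 bytes")
    (recorded,) = struct.unpack_from(">q", xml_section, 0)
    return recorded == len(xml_bytes)
```

A matching length is only a necessary condition; it does not prove that the document is the one the index was generated from.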
VTD Records
LC level 1
- 8 bytes: # of LC entries (big endian)
- Append LC entries (if uses ints, and entry count odd, append a 32-bit zero)
LC level 2
- 8 bytes: # of LC entries (big endian)
- Append LC entries (if uses ints, and entry count odd, append a 32-bit zero)
LC level 3
- 8 bytes: # of LC entries (big endian)
- Append LC entries (if uses ints, and entry count odd, append a 32-bit zero)
... (more LCs)