0. Abstract
Most XML processing algorithms begin by extracting token content from the
source document into many discrete string objects. We propose a
"non-extractive" tokenization approach that maintains the source document
intact in memory. Using a binary encoding specification called Virtual
Token Descriptor (VTD), the processing model represents tokens exclusively
using starting offset and length. To create a hierarchical view of the
data encapsulated in XML, the parser further indexes elements of the same
depth using directory-like structures we call location caches. Through a
demonstration of navigating the document hierarchy using VTD and location
caches, we show that it is indeed possible to create a cursor-based API
that retains most of DOM's random-access capabilities at a fraction
of its memory usage. Furthermore, by analyzing key design constraints of
custom hardware, we reason that the memory-conserving characteristics of
the processing model simultaneously make possible "XML on a chip" and
"binary-enhanced XML." The benchmark results show that the reference
implementation of our processing model significantly outperforms Xerces
DOM in both memory usage and processing speed.
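
To make the idea of non-extractive tokenization concrete, the following is a
minimal sketch, not the VTD specification or the reference implementation.
It assumes a simplified input (no comments, CDATA, or '>' inside attribute
values) and records only element names. The names Token, tokenize_start_tags,
and index_by_depth are hypothetical, and the per-depth index is only a loose
analogy to the location caches described above.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    offset: int  # starting byte offset into the source buffer
    length: int  # token length in bytes
    depth: int   # nesting depth of the element (root is 0)


def tokenize_start_tags(doc: bytes) -> List[Token]:
    """Record (offset, length, depth) for each element name.

    No token text is copied; the source buffer stays intact.
    Assumes simplified XML: no comments, CDATA, or '>' in attributes.
    """
    tokens: List[Token] = []
    depth = -1
    i = 0
    while True:
        i = doc.find(b"<", i)
        if i < 0:
            break
        close = doc.find(b">", i)
        if close < 0:
            break
        marker = doc[i + 1:i + 2]
        if marker == b"/":
            depth -= 1  # end tag closes the current element
        elif marker not in (b"?", b"!"):
            start = end = i + 1
            # The name ends at whitespace, '/', or the closing '>'.
            while end < close and doc[end:end + 1] not in (b" ", b"\t", b"\n", b"\r", b"/"):
                end += 1
            depth += 1
            tokens.append(Token(start, end - start, depth))
            if doc[close - 1:close] == b"/":
                depth -= 1  # self-closing element
        i = close + 1
    return tokens


def index_by_depth(tokens: List[Token]) -> Dict[int, List[int]]:
    """Group token indices by depth (illustrative stand-in for a location cache)."""
    index: Dict[int, List[int]] = {}
    for pos, tok in enumerate(tokens):
        index.setdefault(tok.depth, []).append(pos)
    return index


doc = b"<a><b>hi</b><c/></a>"
tokens = tokenize_start_tags(doc)
# Token text is materialized only on demand, by slicing the original buffer.
names = [doc[t.offset:t.offset + t.length] for t in tokens]  # [b'a', b'b', b'c']
by_depth = index_by_depth(tokens)  # {0: [0], 1: [1, 2]}
```

In this sketch each token costs three small integers regardless of how long
its text is, which is the property the abstract attributes to the
offset-and-length representation.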