To illustrate this concept, let us consider the logical path access
operator described above. A straightforward physical implementation
may parse the complete file and return only the portions that
correspond to the path. However, this physical operator can be very
inefficient if the file is large and the path is selective, since most of
the parsed content is then discarded. As an alternative, we plan to develop a
physical path access operator based on the concept of fragmented
parsing. The basic idea is to record metadata describing the schema
of the file contents and the byte extents associated with the different
parts of the schema. The schema metadata can be obtained by
examining the header of the file, doing a limited parsing of the file
contents, or summarizing and compressing the information resulting
from a full parse of the file. The physical operator can match the
path against the schema, and then invoke the parser to selectively
parse the subset of the byte stream that is relevant for the path. Returning
to our previous example, the operator will only parse the
parts of the file corresponding to publication titles and abstracts.
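A minimal sketch of fragmented parsing follows. It is an illustration under stated assumptions, not the Damasc implementation: the toy file layout (field values stored as concatenated JSON fragments), the path names such as "publication/title", and the helper names build_file and path_access are all hypothetical. The extent metadata here is produced while the file is written, standing in for any of the three ways of obtaining it mentioned above.

```python
import io
import json


def build_file(records):
    """Serialize records field by field and record byte extents per path.

    Assumed toy layout: each field value is stored as a JSON fragment.
    Returns (data, extents), where extents maps a path such as
    "publication/title" to a list of (offset, length) pairs.
    """
    buf = io.BytesIO()
    extents = {}
    for record in records:
        for field, value in record.items():
            encoded = json.dumps(value).encode("utf-8")
            path = "publication/" + field
            extents.setdefault(path, []).append((buf.tell(), len(encoded)))
            buf.write(encoded)
    return buf.getvalue(), extents


def path_access(stream, extents, path):
    """Parse only the byte extents that the metadata associates with path."""
    values = []
    for offset, length in extents.get(path, []):
        stream.seek(offset)
        values.append(json.loads(stream.read(length).decode("utf-8")))
    return values


# Hypothetical publication records; the "body" field is never parsed below.
records = [
    {"title": "T1", "abstract": "A1", "body": "B1"},
    {"title": "T2", "abstract": "A2", "body": "B2"},
]
data, extents = build_file(records)
titles = path_access(io.BytesIO(data), extents, "publication/title")
abstracts = path_access(io.BytesIO(data), extents, "publication/abstract")
```

The operator touches only the bytes listed for the requested path, so the cost of a selective path depends on the size of the matching extents rather than on the size of the file.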
Another relevant idea is the use of indexing. Indexing is already
used in file systems that support keyword queries over files,
but in a more limited context. In Damasc, indexing is applied to
the structure of file data and not just to keywords. Moreover, index
accesses may be combined with parsing to produce
the final results. For instance, an index may identify the offsets of
publication records whose title contains certain keywords, and then
these records may be parsed in order to retrieve the corresponding
abstracts. This is again an application of the general idea of fragmented
parsing, but with a different implementation.
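The combination of index access and parsing described above can be sketched as follows. This is a hedged illustration only: the newline-delimited JSON record format, the in-memory dictionary index, and the function names build_title_index and abstracts_for_keyword are assumptions made for the example and do not come from Damasc.

```python
import io
import json


def build_title_index(stream):
    """Map each lower-cased title word to the byte offsets of matching records.

    Assumes one JSON record per line, each with a "title" field.
    """
    index = {}
    stream.seek(0)
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break
        record = json.loads(line)
        for word in set(record["title"].lower().split()):
            index.setdefault(word, []).append(offset)
    return index


def abstracts_for_keyword(stream, index, keyword):
    """Use the index to locate records, then parse only those records."""
    abstracts = []
    for offset in index.get(keyword.lower(), []):
        stream.seek(offset)
        record = json.loads(stream.readline())
        abstracts.append(record["abstract"])
    return abstracts


# Hypothetical sample file with two publication records.
sample = io.BytesIO(
    b'{"title": "Fragmented Parsing", "abstract": "On selective parsing."}\n'
    b'{"title": "Index Structures", "abstract": "On indexing files."}\n'
)
index = build_title_index(sample)
result = abstracts_for_keyword(sample, index, "parsing")
```

Building the index requires one full scan, but each later keyword query parses only the records whose offsets the index returns.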