In the past few years people have convinced themselves that they have discovered an overlooked form of data. This new form of data is semi-structured. Bosh! There is no new form of data. What folks have discovered is really the effect of economics on data typing, but if you characterize the problem as one of economics, it isn’t nearly as exciting. It is, however, much more accurate and valuable. Seeing the reality of semi-structured data clearly can actually lead to improving data processing (in the literal meaning of the term). As long as we look at this through the fogged vision of a “new type of data,” however, we will continue to misunderstand the problem and develop misguided solutions to address it. It’s time to change this.
For data to be operated on reliably, either in an application or in a tool (such as a database), it must be typed, or for high reliability, it must be strongly typed. This is necessary because at some point the underlying hardware has to choose the right circuit to process the bits. It has long since been demonstrated that any data can be typed, and in fact strongly typed. Still, most data today is not typed (i.e., not structured). This is simply because it is not worth spending the resources to apply typing to the data (i.e., the value of the data simply does not justify the investment).
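As a rough illustration of why the bits have to be typed before anything can process them, the following sketch reads the same four bytes under different type assumptions. It uses Python's standard `struct` module; the names `raw` and `interpret` are hypothetical and chosen only for this example.

```python
import struct

# Four raw bytes with no type attached. They happen to be the
# little-endian IEEE 754 single-precision encoding of 1.0.
raw = struct.pack("<f", 1.0)


def interpret(raw_bytes: bytes) -> dict:
    """Read the same four bytes under three different type assumptions."""
    return {
        "float32": struct.unpack("<f", raw_bytes)[0],
        "uint32": struct.unpack("<I", raw_bytes)[0],
        "hex": raw_bytes.hex(),
    }


views = interpret(raw)

# "Add one" only means something once a type has been chosen: floating-point
# addition and integer addition are different operations applied to
# different readings of the same bits.
float_sum = views["float32"] + 1.0
int_sum = views["uint32"] + 1

print(views)
print(float_sum, int_sum)
```

Until a type is assigned, the bytes support no reliable operation at all, and assigning that type is where the cost comes in.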
Typing data incurs a number of costs:
• The design-time cost, that is, the cost of actually determining and applying the types (schema design) to the data, is higher.
• The cost in runtime resources increases. This includes, but is not limited to, CPU costs (e.g., complex joins can be slow).
• Application development is more expensive when we use tools that require the precision necessary for handling strongly typed data. This often includes the cost of understanding (or almost decoding) the type description.
• Last, but by no means least, the cost to the user to actually query the data is higher, as the queries themselves are much more complex (to the point that the majority of users can’t actually formulate structured queries with languages such as XQuery or SQL).
A cost/benefit analysis is the real way to differenti-