When humans consume information, a great deal of heterogeneity is comfortably tolerated. In
fact, the nuance and richness of natural language can provide valuable depth. However, machine
analysis algorithms expect homogeneous data and cannot understand nuance. In consequence, data
must be carefully structured as a first step in (or prior to) data analysis. Consider, for example, a patient
who has multiple medical procedures at a hospital. We could create one record per medical procedure
or laboratory test, one record for the entire hospital stay, or one record for all lifetime hospital
interactions of this patient. With anything other than the first design, the number of medical
procedures and lab tests per record would be different for each patient. The three design choices listed
have successively less structure and, conversely, successively greater variety. Greater structure is likely
to be required by many (traditional) data analysis systems. However, the least structured design is likely to be more effective for many purposes: questions about disease progression over time, for example, would require an expensive join operation with the first two designs but can be answered directly with the third, as the sketch below illustrates. At the same time, computer systems work most efficiently when they can store multiple items that are all identical in size and structure. Efficient representation, access, and analysis of semi-structured data therefore require further work.
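
As a concrete, if simplified, illustration of this trade-off, the following sketch (in Python, with invented patient identifiers, field names, and values) contrasts the first and third designs. Under the per-procedure design, a disease-progression question must first regroup records by patient, a stand-in for the join step referred to above; under the per-patient design, each history is already co-located in a single record of variable size and shape. The second design (one record per stay) is omitted for brevity.

# A minimal sketch of two of the record designs described above.
# All identifiers, field names, and values are invented for illustration.
from collections import defaultdict

# Design 1: one record per medical procedure or lab test (most structure,
# least variety). Every record has the same fixed fields.
procedure_records = [
    {"patient_id": "p1", "date": "2021-03-02", "code": "HbA1c", "value": 7.9},
    {"patient_id": "p1", "date": "2022-06-15", "code": "HbA1c", "value": 8.4},
    {"patient_id": "p2", "date": "2022-01-10", "code": "HbA1c", "value": 5.6},
]

# Design 3: one record per patient covering all lifetime hospital interactions
# (least structure, greatest variety). Each record nests a variable number of
# stays and procedures.
patient_records = {
    "p1": {"stays": [
        {"procedures": [{"date": "2021-03-02", "code": "HbA1c", "value": 7.9}]},
        {"procedures": [{"date": "2022-06-15", "code": "HbA1c", "value": 8.4}]},
    ]},
    "p2": {"stays": [
        {"procedures": [{"date": "2022-01-10", "code": "HbA1c", "value": 5.6}]},
    ]},
}

def progression_from_flat(records, code):
    """Per-patient progression under Design 1: the per-procedure records
    must first be regrouped by patient_id (the join-like step)."""
    by_patient = defaultdict(list)
    for r in records:
        if r["code"] == code:
            by_patient[r["patient_id"]].append((r["date"], r["value"]))
    return {pid: sorted(vals) for pid, vals in by_patient.items()}

def progression_from_nested(records, code):
    """The same question under Design 3: each patient's history is already
    co-located in one record, so no regrouping or join is needed."""
    out = {}
    for pid, rec in records.items():
        vals = [(p["date"], p["value"])
                for stay in rec["stays"]
                for p in stay["procedures"]
                if p["code"] == code]
        out[pid] = sorted(vals)
    return out

print(progression_from_flat(procedure_records, "HbA1c"))
print(progression_from_nested(patient_records, "HbA1c"))

The nested design answers this particular question with a single pass over each record, but every record now differs in size and shape, which is exactly the property that makes efficient storage, access, and analysis of such semi-structured data harder.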