The data stored in the Data Warehouse is first extracted from the source systems that
generate it. These source systems may store the data in different formats, such as CSV or
Excel files, XML, or simple flat files. Moreover, the data of interest is often stored together
with other, unrelated data sets, which makes the data source too large to query and analyze
directly. It is therefore important to extract the data of interest from this large amount of
aggregated data. For instance, the data source provided by the U.S. Department of Education
contains 23 different Excel and Word files for the campus crimes reported every year. CSV and
Excel are not widely encouraged as standards for data exchange over the web because they do not
represent semi-structured data efficiently and require pre-processing or post-processing.
Therefore, Extensible Markup Language (XML) is used; it encodes documents in a form that
can be read by humans and interpreted by machines or software systems [4].
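
As an illustration of extracting only the data of interest from an XML source, the following is a minimal Python sketch using the standard library module xml.etree.ElementTree. The sample document, its element and attribute names (campuses, campus, crime, year, type, count), and the helper extract_counts are hypothetical and invented for this example; they do not reflect the actual layout of the Department of Education files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The structure, names, and numbers are
# made up for illustration and are not taken from any real data source.
SAMPLE_XML = """
<campuses>
  <campus name="Example University A">
    <crime year="2010" type="burglary" count="12"/>
    <crime year="2010" type="robbery" count="3"/>
  </campus>
  <campus name="Example University B">
    <crime year="2010" type="burglary" count="7"/>
  </campus>
</campuses>
"""


def extract_counts(xml_text, crime_type):
    """Return (campus name, year, count) tuples for one crime type."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    # Walk each campus and keep only the crime records of interest.
    for campus in root.findall("campus"):
        for crime in campus.findall("crime"):
            if crime.get("type") == crime_type:
                rows.append(
                    (campus.get("name"), int(crime.get("year")), int(crime.get("count")))
                )
    return rows


if __name__ == "__main__":
    for row in extract_counts(SAMPLE_XML, "burglary"):
        print(row)
```

Because each record carries its own attribute names, the extraction step can select fields by name rather than by column position, which is one reason a self-describing format is convenient at this stage.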