Existing Solutions
Past solutions to data-quality problems were driven
in part by the economics of the institutions
having the problem. Traditionally, the demand for
data-quality solutions was driven by very large
organizations, such as Global 2000 corporations.
They had the resources to deploy complex software
systems for gathering data, and they were the
first to notice and suffer from the inevitable data-quality
problems resulting from this complexity.
Accordingly, the approaches developed by the
information technology researchers pioneering
the area of data quality (Lee et al. 2006) tended to
emphasize statistical data assessment, business
process engineering, and comprehensive organizational
data-assurance policies. Given their size, the
early data-quality customers had the resources to
adopt these labor-intensive and therefore expensive
solutions. The traditional data-quality solutions
tended to rely heavily on manual operations
in two different respects.
First, data was often hand-cleansed by contracting
with external staffing agencies. Business analysts
would first identify what type of data-quality
work needed to be performed and on which data.
Large data sets would be broken up into reasonable
sizes and put into spreadsheets. This data would be
distributed to individuals along with instructions
for cleansing. After the manual work on a spreadsheet
was finished, it could often be cross-checked
by another worker and any discrepancies
investigated. Once the cleansing was complete, the data
was reassembled into the appropriate format for
loading back into the source IT system. Such manual
effort has clear drawbacks, including the time
required to cycle through the entire process, the
possibility of manual error, and the need to export
and then import the final results. Exporting is usually
fairly easy; importing is almost always the
bigger issue. Import logic typically needs to identify
not only which specific fields and records to
update, but also how to deal with deleted or
merged data, and how to accomplish this all without
introducing new errors. Another issue is that
additional work is required to build and thoroughly
test the import tools. Finally, the entire manual
process has little room for increased return on
investment. The customer has to pay for the manual
work each time data is cleansed, meaning that
the economic benefits of automation are not realized.
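To make the import step concrete, the sketch below shows one way such logic might reconcile a cleansed spreadsheet extract against the source records. It is a minimal sketch: the record layout, the "id" key, and the "merged_into" convention are assumptions made for illustration, not a description of any particular system.

# Hypothetical import-planning step: decide which source records to update,
# delete, or merge based on the cleansed data handed back by the workers.
def plan_import(source, cleansed):
    """Return planned updates, deletions, and merges (all assumed conventions)."""
    updates, deletes, merges = [], [], []
    cleansed_by_id = {rec["id"]: rec for rec in cleansed}

    for rec_id, original in source.items():
        cleaned = cleansed_by_id.get(rec_id)
        if cleaned is None:
            # The record was removed during cleansing: schedule a delete.
            deletes.append(rec_id)
        elif cleaned.get("merged_into"):
            # The record was folded into another record: schedule a merge.
            merges.append((rec_id, cleaned["merged_into"]))
        else:
            # Write back only the fields that actually changed, to limit the
            # chance of introducing new errors on untouched data.
            changes = {k: v for k, v in cleaned.items()
                       if k != "id" and original.get(k) != v}
            if changes:
                updates.append((rec_id, changes))
    return updates, deletes, merges

if __name__ == "__main__":
    source = {
        1: {"id": 1, "name": "Acme Corp", "city": "Bosston"},
        2: {"id": 2, "name": "Acme Corporation", "city": "Boston"},
    }
    cleansed = [
        {"id": 1, "name": "Acme Corp", "city": "Boston", "merged_into": None},
        {"id": 2, "name": "Acme Corporation", "city": "Boston", "merged_into": 1},
    ]
    print(plan_import(source, cleansed))

Even this simplified version suggests why the import tools themselves must be built and tested with care.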
Second, earlier data-quality vendors provided
technological solutions to data-quality problems,
and these required significant manual setup. The
reasons for the manual setup included business
analysis to understand the data-quality needs of the
organization, identification of the final data-quality
workflow, and then the actual programming
and configuration to put the data-quality solution
in place. In other words, these companies were
building custom data-quality solutions using data-quality
vendor application programming interfaces
(APIs). Once the solution was put in place,
automation reduced the manual effort, so a longer
horizon for return on investment was acceptable.
These solutions worked fine for large companies
that could afford both the problem (the initial enterprise
system that aggregates the data) and the solutions.
Today, sophisticated business applications
are being used by even the smallest organizations;
hence, the manual effort associated with data quality
must be in alignment with the resources of
these smaller organizations. The data-quality solutions
must leverage automation and, at the same
time, provide the business user with intuitive
access to the processing results and the ability to
override the results.
Data-quality research has also seen significant
progress, with the first issue of the new ACM Journal
of Data and Information Quality published in
2009. Frameworks for researching data quality
have been introduced (Madnick 2009, Wang 1995)
as well as specific mathematical models for
addressing the record linkage problem (Fellegi and
Sunter 1969). Recent research in record linkage
includes the development and deployment of
more intelligent linkage algorithms (Moustakides
and Verykios 2009, Winkler 2006). Data linkage is
a core issue in many data-cleansing operations and
is the process of identifying whether two separate
records refer to the same entity. Linkage can be
used both for identifying duplicate records within a
database and for identifying similar records
across disparate data sets.
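The following sketch illustrates the Fellegi-Sunter idea behind much of this work: each field comparison contributes an agreement or disagreement weight, and the summed weight is compared against thresholds to classify a pair as a match, a possible match, or a nonmatch. The field names, the m and u probabilities, and the thresholds used here are illustrative assumptions, not values from the cited papers.

# Record-linkage scorer in the spirit of Fellegi and Sunter (1969):
# agreeing fields add log2(m/u), disagreeing fields add log2((1-m)/(1-u)),
# and the total weight is compared with thresholds. All probabilities and
# thresholds below are made-up values for demonstration.
import math
from difflib import SequenceMatcher

# Assumed per-field probabilities: m = P(fields agree | records match),
# u = P(fields agree | records do not match).
FIELD_PARAMS = {
    "name": (0.95, 0.05),
    "city": (0.90, 0.20),
    "zip":  (0.85, 0.10),
}

def fields_agree(a, b, threshold=0.9):
    """Treat two values as agreeing if their string similarity is high."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio() >= threshold

def match_weight(rec_a, rec_b):
    """Sum the log-likelihood-ratio weights over the compared fields."""
    weight = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if fields_agree(rec_a.get(field, ""), rec_b.get(field, "")):
            weight += math.log2(m / u)
        else:
            weight += math.log2((1 - m) / (1 - u))
    return weight

def classify(weight, upper=4.0, lower=0.0):
    """Map a weight to match / possible match / nonmatch (assumed thresholds)."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "nonmatch"
    return "possible match"

if __name__ == "__main__":
    a = {"name": "Acme Corp", "city": "Boston", "zip": "02110"}
    b = {"name": "ACME Corporation", "city": "Boston", "zip": "02110"}
    w = match_weight(a, b)
    print(round(w, 2), classify(w))

Production linkage systems typically add blocking so that not every pair of records must be compared, and they estimate the agreement probabilities from the data itself, but the scoring step follows this general shape.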