Exceptional Data Quality Using Intelligent Matching and Retrieval

Clint Bidlack and Michael P. Wellman

Recent advances in enterprise web-based software have created a need for sophisticated yet user-friendly data-quality solutions. A new category of data-quality solutions that fills this need using intelligent matching and retrieval algorithms is discussed. Solutions are focused on customer and sales data and include real-time inexact search, batch processing, and data migration. Users are empowered to maintain higher-quality data, resulting in more efficient sales and marketing operations. Sales managers spend more time with customers and less time managing data.

Several business and technology drivers are disrupting the
world of enterprise software, and that in turn is driving the
need for more effective data-quality solutions. These drivers
include business acceptance of the software-as-a-service
(SaaS) model, wide adoption of the web as a platform, collapse
of enterprise application silos, aggregation of data from disparate
internal and external sources, the agile mindset, and the
economic conditions driving agility.
SaaS is a software deployment model in which the provider
licenses applications for use as a service on demand, most often
accessed through a browser. In 2007 SaaS clearly gained momentum,
with the sector generating three billion dollars in revenue; by
2013, SaaS could account for 50 percent of all application software
revenues (Ernst and Dunham 2006). Customer benefits from
SaaS deployments include much quicker and easier implementations,
relatively painless upgrades, global access through the
browser, lower total cost of ownership, and software vendors
sharing more of the risk (Friar et al. 2007). Related to SaaS is
another significant shift, the move from proprietary platforms
to the web as a platform. Again the user benefits because of less
vendor lock-in, well-documented and broadly
accepted open standards, and long-term commitment by the software
industry as a whole. These all contribute to higher-quality solutions
across the industry as well as accelerated innovation and
more choice and flexible solutions available to the user.
Over the past couple of decades, enterprise software
solutions tended to result in data silos: for
example, a deployed accounting system that is
unable to share data with a customer-relationship
management system. The potential business value,
or benefit, from removing data silos has driven
companies to pursue such efforts to completion.
Collapse of the data silo is related to the larger phenomenon
of overall data aggregation on the web.
A growing pool of online structured data, better
tools, and, again, a large economic driver are all
pushing organizations to aggregate and use data
from a multitude of sources.
Finally, one of the most significant shifts in the
software industry is the explicit transition toward
agile software development. Agile development
includes iterative software development methodologies
in which both requirements and solutions
evolve during the development of the software.
Agile methodologies are in contrast to waterfall
methodologies, which imply that requirements are
well known before development starts (Larman
and Basili 2003). More effective agile processes are
important not only for software development, but
for organizations of all types and sizes. The need
to keep organizations aligned with opportunities
and external competitive forces is driving this
shift. Being agile allows an organization to adapt
more rapidly to external forces, which in turn
increases the chances of survival.
For companies, the above trends are resulting in
more effective use of enterprise software as well as
more efficient business operations. Effectiveness is
driving adoption across the business landscape,
across industries, and from very small companies
up to the Global 2000. Efficiency is driving application
acceptance and usage within the company.
This combination of effectiveness and efficiency,
driving adoption and usage, is fueling enormous
growth of structured business data.
Large volumes of structured business data
require significant effort to maintain the quality of
the data. For instance, with customer-relationship
management (CRM) systems (deployed under SaaS
or traditional software installations), ActivePrime
and its partners have found that data quality has
become the number one issue that limits return on
investment. As the volume of data grows, the pain
experienced from poor quality data grows more
acute. Data quality has been an ongoing issue in
the IT industry for the past 30 years, and it remains
a growing concern, fueling the
growth of the data-quality industry to one billion
dollars in 2008. It is also estimated that companies
are losing 6 percent of sales because of poor management
of customer data (Experian QAS 2006).
Several competing definitions of data quality
exist. The pragmatic definition is considered here;
specifically, if data effectively and efficiently supports
an organization's analysis, planning, and
operations, then that data is considered of high
quality. In addition, data cleansing is defined as
manual or automated processes that are expected
to increase the quality of data.
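To make the definition concrete, the following is a minimal sketch, in Python, of what a single automated cleansing step might look like. The field names and normalization rules here are illustrative assumptions rather than a description of any particular product.

import re

def cleanse_contact(record):
    """Apply simple automated cleansing rules to one contact record.

    The field names ("name", "phone", "state") and the rules themselves
    are illustrative assumptions, not a prescribed standard.
    """
    cleaned = dict(record)

    # Collapse stray whitespace and normalize capitalization of the name.
    if cleaned.get("name"):
        cleaned["name"] = " ".join(cleaned["name"].split()).title()

    # Keep only digits in the phone number, then format 10-digit numbers.
    if cleaned.get("phone"):
        digits = re.sub(r"\D", "", cleaned["phone"])
        if len(digits) == 10:
            cleaned["phone"] = f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

    # Standardize two-letter state abbreviations to upper case.
    if cleaned.get("state"):
        cleaned["state"] = cleaned["state"].strip().upper()

    return cleaned

# A record with inconsistent formatting becomes easier to analyze and
# report on after cleansing.
print(cleanse_contact({"name": "  acme  corp ", "phone": "734.555.0199", "state": "mi"}))

Even rules this simple make the data more consistent for analysis and reporting, which is exactly the pragmatic standard of quality adopted above.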
Existing Solutions
Past solutions to data-quality problems were driven
in part by the economics of the institutions
having the problem. Traditionally, the demand for
data-quality solutions was driven by very large
organizations, such as Global 2000 corporations.
They had the resources to deploy complex software
systems for gathering data, and they were the
first to notice and suffer from the inevitable data-quality
problems resulting from this complexity.
Accordingly, the approaches developed by the
information technology researchers pioneering
the area of data quality (Lee et al. 2006) tended to
emphasize statistical data assessment, business
process engineering, and comprehensive organizational
data-assurance policies. Given their size, the
early data-quality customers had the resources to
adopt these labor-intensive and therefore expensive
solutions. The traditional data-quality solutions
tended to rely heavily on manual operations,
in two different respects.
First, data was often hand-cleansed by contracting
with external staffing agencies. Business analysts
would first identify what type of data-quality
work needed to be performed and on which data.
Large data sets would be broken up into reasonable
sizes and put into spreadsheets. This data would be
distributed to individuals along with instructions
for cleansing. After the manual work on a spreadsheet
was finished, the work could often be
cross-checked by another worker and any discrepancies
investigated. Once the data was finished, it
was reassembled into the appropriate format for
loading back into the source IT system. Such manual
effort has clear drawbacks, including the time
required to cycle through the entire process, the
possibility for manual error, and the need to export
and then import the final results. Exporting is usually
fairly easy; importing is almost always the
bigger issue. Import logic typically needs to identify
not only which specific fields and records to
update, but also how to deal with deleted or
merged data, and how to accomplish this all without
introducing new errors. Another issue is that
additional work is required to build and thoroughly
test the import tools. Finally, the entire manual
process has little room for increased return on
investment. The customer has to pay for the manual
work each time data is cleansed, meaning that
the economic benefits of automation are not realized.
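Much of the hidden effort in the manual process sits in that import step. The following sketch, which assumes a hypothetical spreadsheet layout with id, action, and merge_into columns, illustrates the kind of logic an import tool must encode to apply updates, deletions, and merges without disturbing fields the cleansers never touched.

import csv

def apply_cleansed_rows(source_records, cleansed_csv_path):
    """Fold hand-cleansed spreadsheet rows back into source records by id.

    source_records maps record id -> dict of field values. The column
    names ("id", "action", "merge_into") are hypothetical; a real import
    tool must match whatever export format was actually used.
    """
    with open(cleansed_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            rec_id = row["id"]
            action = (row.get("action") or "update").lower()

            if action == "delete":
                source_records.pop(rec_id, None)
            elif action == "merge":
                # Fold the duplicate into its surviving record, then drop it.
                survivor = source_records.get(row["merge_into"])
                duplicate = source_records.pop(rec_id, {})
                if survivor is not None:
                    for field, value in duplicate.items():
                        survivor.setdefault(field, value)
            else:
                # Update only the fields present in the spreadsheet so that
                # untouched fields are not accidentally overwritten.
                record = source_records.setdefault(rec_id, {})
                for field, value in row.items():
                    if field not in ("id", "action", "merge_into") and value:
                        record[field] = value
    return source_records

Building and thoroughly testing even this modest logic is a project in itself, which is one more reason the purely manual approach scales poorly.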
Second, earlier data-quality vendors provided
technological solutions to data-quality problems,
and these required significant manual setup. The
reasons for the manual setup included business
analysis to understand data-quality needs of the
organization, identification of the final data-quality
work flow, and then the actual programming
and configuration to put the data-quality solution
in place. In other words, these companies were
building custom data-quality solutions using data-quality
vendor application programming interfaces
(APIs). Once the solution was put in place,
automation reduced the manual effort, so a longer
horizon for return on investment was acceptable.
These solutions worked fine for large companies
that could afford both the problem (the initial enterprise
system that aggregates the data) and the solutions.
Today, sophisticated business applications
are being used by even the smallest organizations;
hence, the manual effort associated with data quality
must be in alignment with the resources of
these smaller organizations. The data-quality solutions
must leverage automation and, at the same
time, provide the business user with intuitive
access to the processing results and the ability to
override the results.
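One plausible way to combine automation with business-user override, sketched below using assumed names rather than any specific vendor's interface, is to stage every automated suggestion in a review queue so that only the suggestions a user accepts are written back to the system of record.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ProposedFix:
    """One automated data-quality suggestion awaiting business-user review."""
    record_id: str
    field_name: str
    current_value: str
    suggested_value: str
    accepted: Optional[bool] = None   # None means not yet reviewed

class ReviewQueue:
    """Stage automated suggestions; only accepted ones are ever applied."""

    def __init__(self) -> None:
        self.items: List[ProposedFix] = []

    def propose(self, fix: ProposedFix) -> None:
        self.items.append(fix)

    def decide(self, index: int, accept: bool) -> None:
        # The business user confirms or overrides each suggestion.
        self.items[index].accepted = accept

    def approved_changes(self) -> List[ProposedFix]:
        return [f for f in self.items if f.accepted]

# Example: the automation proposes a fix, the user rejects it, and
# nothing is written back for that record.
queue = ReviewQueue()
queue.propose(ProposedFix("42", "company", "IBM Corp", "International Business Machines"))
queue.decide(0, accept=False)
print(queue.approved_changes())   # []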
Data-quality research has also seen significant
progress with the first issue of the new ACM Journal
of Data and Information Quality published in
2009. Frameworks for researching data quality
have been introduced (Madnick 2009, Wang 1995)
as well as specific mathematical models for
addressing the record linkage problem (Fellegi and
Sunter 1969). Recent research in record linkage
includes the development and deployment of
more intelligent linkage algorithms (Moustakides
and Verykios 2009, Winkler 2006). Data linkage is
a core issue in many data-cleansing operations and
is the process of identifying whether two separate
records refer to the same entity. Linkage can be
used both for identifying duplicate records within a
database and for identifying similar records
across disparate data sets.
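As a simple illustration of a linkage decision (not a description of the algorithms discussed elsewhere in this article, nor of the Fellegi-Sunter model), the following Python sketch scores two contact records with a weighted string similarity over assumed name and company fields and links them when the score clears a threshold.

from difflib import SequenceMatcher

def field_similarity(a, b):
    """Similarity in [0, 1] between two field values (0.0 if either is missing)."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def likely_same_entity(rec1, rec2, threshold=0.8):
    """Guess whether two contact records refer to the same entity.

    The fields, weights, and threshold are illustrative assumptions,
    not a published or product-specific matching method.
    """
    score = (0.6 * field_similarity(rec1.get("name"), rec2.get("name"))
             + 0.4 * field_similarity(rec1.get("company"), rec2.get("company")))
    return score >= threshold, score

# Minor spelling and formatting differences still link the two records.
a = {"name": "Jon Smith", "company": "Acme Corp."}
b = {"name": "John Smith", "company": "ACME Corporation"}
print(likely_same_entity(a, b))   # (True, ~0.85)

Production linkage systems typically add field-specific normalization, phonetic or token-based comparators, and carefully tuned thresholds, but the core decision takes this form.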
Solutions
ActivePrime’s initial products and services focus on
increasing the quality