Data Preparation Tasks
Before processing data from other systems, you sometimes have to retrieve it first or validate its
content to determine your level of confidence in the data’s quality. SSIS provides a set of tasks that
can retrieve data files from the files and folders available in the file system, or reach out over
FTP and web service protocols. The following sections explore these tasks in SSIS.
Data Profiler
Data profiling is the process of examining data and collecting metadata about its quality,
including the frequency of statistical patterns, interdependencies, uniqueness, and redundancy. This
type of analytical activity is important for the overall quality and health of an operational data
store (ODS) or data warehouse. In fact, you’ve most likely been doing this activity whether or not
you actually have a defined tool to perform it. Now, rather than use a set of complicated queries
or rely on a third-party product, you have a Data Profiling Task as part of the SSIS development
environment.
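To make the kind of metadata described above concrete, the following is a minimal Python sketch of column profiling done by hand. It is not how the Data Profiling Task works internally; the function name profile_column and the sample rows are hypothetical and exist only for illustration.

```python
from collections import Counter


def profile_column(rows, column):
    """Return simple profile statistics for one column of a list of dict rows."""
    values = [row.get(column) for row in rows]
    total = len(values)
    null_count = sum(1 for v in values if v is None)
    non_null = [v for v in values if v is not None]
    frequencies = Counter(non_null)
    return {
        "row_count": total,
        # Share of rows where the column is NULL (None in Python).
        "null_ratio": null_count / total if total else 0.0,
        # Number of different non-NULL values seen.
        "distinct_count": len(frequencies),
        # True when no non-NULL value repeats (a candidate key).
        "is_unique": len(frequencies) == len(non_null),
        # The single most frequent non-NULL value and its count.
        "most_common": frequencies.most_common(1),
    }


# Made-up sample rows, for illustration only.
sample_rows = [
    {"CustomerKey": 1, "MiddleName": None},
    {"CustomerKey": 2, "MiddleName": "A"},
    {"CustomerKey": 3, "MiddleName": None},
    {"CustomerKey": 4, "MiddleName": "A"},
]

key_profile = profile_column(sample_rows, "CustomerKey")
middle_profile = profile_column(sample_rows, "MiddleName")
```

Here key_profile reports a unique column, while middle_profile reports that half of the sample rows are NULL. The statistics themselves say nothing about whether either result is acceptable.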
The Data Profiling Task is located in the SSIS Toolbox, but you probably shouldn’t attempt to use
the results to make an automated workflow decision in the SSIS package Control Flow. Rather, it
is more of an ad hoc tool for placement in a design-time package that will be run manually outside
of a scheduled process. In fact, the task doesn’t have built-in conditional workflow logic, but
technically you can use XPath queries on the results. The profiler can only report on statistics in the
data; you still need to make judgments about these statistics. For example, a column may contain
an overwhelming number of NULL values, but the profiler doesn’t know whether this reflects a valid
business scenario.
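The XPath approach mentioned above can be sketched outside SSIS in Python. The embedded XML is a simplified stand-in for a profile output file: its element names, attributes, namespace, and counts are assumptions for illustration, so check them against a real output file before relying on them. The function null_ratios and the REVIEW_THRESHOLD value are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a profile output file.
# Element names, namespace, and counts are assumptions, not real results.
SAMPLE_OUTPUT = """\
<DataProfile xmlns="http://schemas.microsoft.com/sqlserver/2008/DataDebugger/">
  <DataProfileOutput>
    <Profiles>
      <ColumnNullProfile>
        <Table Table="DimCustomer" RowCount="10" />
        <Column Name="MiddleName" />
        <NullCount>6</NullCount>
      </ColumnNullProfile>
    </Profiles>
  </DataProfileOutput>
</DataProfile>
"""

# Prefix-to-URI map used by the XPath expressions below (assumed namespace).
NS = {"dp": "http://schemas.microsoft.com/sqlserver/2008/DataDebugger/"}


def null_ratios(xml_text):
    """Map each profiled column name to its NULL ratio."""
    root = ET.fromstring(xml_text)
    ratios = {}
    # XPath: every null profile anywhere under the root element.
    for profile in root.findall(".//dp:ColumnNullProfile", NS):
        table = profile.find("dp:Table", NS)
        column = profile.find("dp:Column", NS)
        null_count = int(profile.findtext("dp:NullCount", default="0", namespaces=NS))
        row_count = int(table.get("RowCount"))
        ratios[column.get("Name")] = null_count / row_count if row_count else 0.0
    return ratios


# Hypothetical cutoff; choosing it is a business decision, not a profiler output.
REVIEW_THRESHOLD = 0.5

ratios = null_ratios(SAMPLE_OUTPUT)
needs_review = [name for name, ratio in ratios.items() if ratio > REVIEW_THRESHOLD]
```

With this sample, MiddleName lands in needs_review because its ratio exceeds the chosen cutoff. That only marks the column for a person to look at; whether a mostly NULL middle name is a valid business scenario is still a judgment the profiler cannot make.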
You can view the structured output file produced by the Data Profiling Task in a special Data
Profile Viewer that provides drill-downs back to the detail level. To access this viewer, select SQL
Server ➪ Integration Services from the Start menu. Once the tool is loaded, use the Open button
to browse to the output file that the Data Profiling Task generated. Figure 3-7 shows an
example of an analysis of the DimCustomer table in the AdventureWorksDW database. You can see
here that the majority of the rows in the MiddleName column are null.
Figure 3-7