Working with the home healthcare data posed specific
problems. First, the data provided by the CDC comes in
large samples containing tens of thousands of patient profiles.
To narrow the sample, this study focuses on three
conditions only: COPD, hip replacement, and heart failure.
These three groups represent high-volume medical and
surgical conditions in home healthcare. For each condition
we obtained a subsample large enough to work with. Second was the
confidential nature of the information. While the data
were relatively easy to obtain (via FTP from the National
Center for Health Statistics web site), they contain no
patient-level identifying information beyond the survey. One
direct consequence is that patients' status cannot be
followed up for research purposes (for example, a
patient case might be classified as having a "successful
outcome" even though the patient had to return to a hospital
or nursing home two days later). Third, the presence of missing
data: as in many other medical and nursing domains, the
data sets have missing values for certain attributes. There
are many reasons for this; for example, the complete medical
records were not available when the survey was being
performed, or the agency did not provide a specific type of
service. Fourth, there were variable-specific issues. For
example, the variable "length of stay" reports the number
of days on service, which varies widely and thus has a large
standard deviation. This is a common issue with utilization
data [22]. One question tested as part of the research
was whether it made sense to normalize the data
using a log transformation. After testing, it was decided not
to normalize it.
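
The kind of check involved in the log-transformation question can be sketched as follows. This is a minimal illustration only, not the procedure used in this study: the length-of-stay values are made up, the helper names `skewness` and `log_transform` are hypothetical, and skewness is assumed here as one reasonable way to compare the raw and transformed distributions.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def log_transform(values):
    """Natural log of (1 + x), so that a stay of zero days stays defined."""
    return [math.log1p(v) for v in values]


# Hypothetical lengths of stay in days (illustrative values, not CDC data).
length_of_stay = [3, 5, 7, 8, 10, 12, 14, 15, 21, 30, 45, 60, 120, 365]

raw_skew = skewness(length_of_stay)
log_skew = skewness(log_transform(length_of_stay))

print(f"raw: sd={statistics.pstdev(length_of_stay):.1f}, skewness={raw_skew:.2f}")
print(f"log: skewness={log_skew:.2f}")
```

On right-skewed data such as this, the transformed values are noticeably less skewed. Whether that justifies transforming the variable depends on the downstream analysis, which is why the question has to be tested rather than assumed.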