Data in the Real World Is Dirty:
Lots of potentially incorrect data, e.g., from faulty instruments, human or
computer error, or transmission error
Two major kinds of dirtiness in data:
Incomplete (missing): lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
e.g., Occupation = “ ” (missing data)
Incorrect (noisy): containing noise, errors, or outliers
e.g., Salary = “−10” (an error)
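
The two kinds of dirtiness above can be checked for with a few lines of code. The following is a minimal sketch, not a definitive cleaning routine: the records, the field names (`name`, `occupation`, `salary`), and the rule that a negative salary is invalid are all assumptions made for illustration.

```python
# Hypothetical records; the field names mirror the slide's examples.
records = [
    {"name": "Ana", "occupation": "engineer", "salary": 52000},
    {"name": "Ben", "occupation": " ", "salary": 48000},     # blank value
    {"name": "Cy", "occupation": "teacher", "salary": -10},  # impossible value
    {"name": "Di", "salary": 61000},                         # attribute absent
]


def is_missing(record, field):
    """Return True if the field is absent, None, or blank after stripping."""
    value = record.get(field)
    return value is None or (isinstance(value, str) and value.strip() == "")


def is_invalid_salary(record):
    """Return True if salary is numeric but negative (an assumed rule)."""
    value = record.get("salary")
    return isinstance(value, (int, float)) and value < 0


# Incomplete: occupation is missing or blank.
incomplete = [r["name"] for r in records if is_missing(r, "occupation")]
# Incorrect: salary violates the assumed non-negativity rule.
incorrect = [r["name"] for r in records if is_invalid_salary(r)]

print("incomplete:", incomplete)  # incomplete: ['Ben', 'Di']
print("incorrect:", incorrect)    # incorrect: ['Cy']
```

Note that the missing-value check is generic, while the incorrect-value check needs domain knowledge: nothing in the data itself says a salary cannot be negative, so that rule has to be supplied.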