Introduction
1 The CRISP-DM methodology
1.1 Hierarchical breakdown
The CRISP-DM methodology is described in terms of a hierarchical process model, consisting of sets of tasks described at four
levels of abstraction (from general to specific): phase, generic task, specialized task, and process instance (see figure 1).
At the top level, the data mining process is organized into a number of phases; each phase consists of several second-level
generic tasks. This second level is called generic because it is intended to be general enough to cover all possible data
mining situations. The generic tasks are intended to be as complete and stable as possible. Complete means covering both
the whole process of data mining and all possible data mining applications. Stable means that the model should remain valid
for as yet unforeseen developments, such as new modeling techniques.
The third level, the specialized task level, is the place to describe how actions in the generic tasks should be carried out in
certain specific situations. For example, at the second level there might be a generic task called clean data. The third level
describes how this task varies from one situation to another, for example when cleaning numeric values versus cleaning
categorical values, or when the problem type is clustering versus predictive modeling.
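The first three levels described above can be pictured as nested records. The following is only a minimal sketch, not part of the methodology itself: the class names, the `specialize` helper, the phase name, and the wording of the specialized descriptions are all assumptions chosen for illustration. Only the generic task clean data and its two situations come from the text. The process instance level is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpecializedTask:
    """Third level: how a generic task is carried out in one specific situation."""
    situation: str
    description: str


@dataclass
class GenericTask:
    """Second level: a task meant to apply to any data mining situation."""
    name: str
    specializations: list[SpecializedTask] = field(default_factory=list)

    def specialize(self, situation: str) -> Optional[SpecializedTask]:
        """Return the specialized task for a situation, or None if none is recorded."""
        for task in self.specializations:
            if task.situation == situation:
                return task
        return None


@dataclass
class Phase:
    """Top level: a phase groups several generic tasks."""
    name: str
    generic_tasks: list[GenericTask] = field(default_factory=list)


# Illustrative content; the descriptions are hypothetical wording.
clean_data = GenericTask(
    name="clean data",
    specializations=[
        SpecializedTask("numeric values", "hypothetical: handle outliers and missing numbers"),
        SpecializedTask("categorical values", "hypothetical: merge inconsistent category labels"),
    ],
)

# The phase name is an assumption made for this sketch.
phase = Phase(name="data preparation", generic_tasks=[clean_data])

numeric = clean_data.specialize("numeric values")
missing = clean_data.specialize("text values")  # no such specialization recorded
```

Keeping the generic task separate from its specializations mirrors the intent stated above: the generic level stays stable, while new situations are added as further specialized entries.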
The description of phases and tasks as discrete steps performed in a specific order represents an idealized sequence of
events. In practice, many of the tasks can be performed in a different order, and it will often be necessary to repeatedly
backtrack to previous tasks and repeat certain actions. Our process model does not attempt to capture all of these possible
routes through the data mining process because this would require an overly complex process model.
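The gap between the idealized sequence and what happens in practice can be illustrated with a small sketch. Everything here is hypothetical: the task lists, the recorded trace, and the `backtracks` helper are assumptions made for illustration, not part of the process model.

```python
from collections import Counter

# Idealized order of tasks (names are illustrative).
idealized = ["select data", "clean data", "build model", "assess model"]

# A hypothetical record of the order in which tasks were actually performed.
actual = [
    "select data",
    "clean data",
    "build model",
    "clean data",   # return to an earlier task
    "build model",  # repeat a later task after the fix
    "assess model",
]


def backtracks(trace, order):
    """Return (from_task, to_task) pairs where the trace steps back in the idealized order."""
    position = {task: i for i, task in enumerate(order)}
    return [
        (current, following)
        for current, following in zip(trace, trace[1:])
        if position[following] < position[current]
    ]


steps_back = backtracks(actual, idealized)
repeated = [task for task, count in Counter(actual).items() if count > 1]
```

In this trace there is one step back, from build model to clean data, and both of those tasks are performed twice. Enumerating every such route in advance is exactly what the process model avoids.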