RELATED WORK
Hodge & Austin surveyed outlier detection methods, focusing especially on those
developed within the Computer Science community.
Supervised outlier detection methods are suitable for data whose characteristics do not change over time;
they require training data labelled with both normal and abnormal objects.
There may be multiple normal and/or abnormal classes, and the resulting classification problem is often highly imbalanced. Semi-supervised methods learn a model of the normal class only; data points that do not resemble normal data are considered outliers.
Unsupervised methods process data with no prior knowledge.
Unsupervised outlier detection algorithms fall into four categories:
(1) In clustering-based methods, such as DBSCAN (a density-based algorithm for discovering clusters in large spatial databases), outliers are by-products of the clustering process: they are the points left outside every resulting cluster.
(2) Density-based methods use the Local Outlier Factor (LOF) to find outliers. If an object is isolated with respect to its surrounding neighbourhood, its outlier degree is high, and vice versa.
(3) Distribution-based methods define outliers, for instance, as those points p such that at most 0.02% of points lie within 0.13σ of p.
(4) Distance-based outliers are objects that do not have “enough” neighbours. The problem of finding them can be solved by answering a nearest-neighbour or range query centred at each object O.
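As a minimal sketch of the distance-based approach, the following function answers a range query centred at each object and flags objects with too few neighbours. The function name and the radius/threshold parameters are illustrative assumptions, not values from the cited work:

```python
import numpy as np

def distance_based_outliers(X, r, min_neighbors):
    # Answer a range query centred at each object: count how many
    # other points lie within radius r.  Both r and min_neighbors
    # are illustrative parameters, not values from the literature.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    counts = (d <= r).sum(axis=1) - 1  # exclude the point itself
    # An object without "enough" neighbours is a distance-based outlier.
    return counts < min_neighbors

# Three clustered points and one isolated point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(distance_based_outliers(X, r=1.0, min_neighbors=2))
# Only the isolated point [5.0, 5.0] is flagged.
```

The brute-force pairwise distance matrix is quadratic in the number of objects; in practice the range queries would be answered with a spatial index.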
Several mathematical methods can also be applied to outlier detection. Principal component analysis (PCA) computes orthonormal vectors that provide a basis for the input data.
The principal components are then sorted in order of decreasing “significance”, i.e. the variance they explain. The size of the data can be reduced by eliminating the weaker, low-variance components; points that are poorly represented by the retained components can then be flagged as outliers.
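The PCA step can be sketched as follows: points with a large reconstruction error from the retained high-variance components receive high outlier scores. The function name and the scoring choice are our own assumptions:

```python
import numpy as np

def pca_outlier_scores(X, n_components):
    # Centre the data and obtain orthonormal basis vectors via SVD;
    # the rows of Vt are the principal components, already sorted
    # by decreasing "significance" (variance explained).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T  # keep the strong components, drop the weak
    # Reconstruct each point from the reduced basis; a large
    # reconstruction error marks a candidate outlier.
    recon = Xc @ V @ V.T
    return np.linalg.norm(Xc - recon, axis=1)

# Ten points on the line y = x, plus one point far off the line.
X = np.array([[float(i), float(i)] for i in range(10)] + [[9.0, 0.0]])
scores = pca_outlier_scores(X, n_components=1)
# The off-line point receives the largest score.
```

Here the first component captures the dominant linear trend, so the point that deviates from that trend is poorly reconstructed and stands out.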
The convex hull method finds outliers by peeling off the outer layers of convex hulls. Data points on shallow layers are likely to be outliers.
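Convex-hull peeling can be sketched in two dimensions with Andrew's monotone-chain hull (the helper names are our own): each point is labelled with the depth of the layer on which it is peeled off, and points on shallow (low-depth) layers are the outlier candidates.

```python
def convex_hull(points):
    # Andrew's monotone-chain algorithm: returns the indices of the
    # hull vertices for a list of (x, y) tuples.
    idx = sorted(range(len(points)), key=lambda i: points[i])
    def cross(o, a, b):
        return ((points[a][0] - points[o][0]) * (points[b][1] - points[o][1])
                - (points[a][1] - points[o][1]) * (points[b][0] - points[o][0]))
    def half(seq):
        h = []
        for i in seq:
            while len(h) >= 2 and cross(h[-2], h[-1], i) <= 0:
                h.pop()
            h.append(i)
        return h[:-1]
    return half(idx) + half(reversed(idx))

def hull_depths(points):
    # Peel off successive convex-hull layers; depth 0 is the outermost
    # layer.  Points on shallow layers are likely outliers.
    remaining = list(range(len(points)))
    depth, d = {}, 0
    while remaining:
        if len(remaining) <= 2:
            for i in remaining:
                depth[i] = d
            break
        layer = set(convex_hull([points[i] for i in remaining]))
        for j in layer:
            depth[remaining[j]] = d
        remaining = [p for k, p in enumerate(remaining) if k not in layer]
        d += 1
    return [depth[i] for i in range(len(points))]

# Four corners of a square (depth 0) enclosing one interior point (depth 1).
pts = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)]
print(hull_depths(pts))  # → [0, 0, 0, 0, 1]
```

Each peeling pass removes one hull layer, so the depth label directly encodes how far a point sits from the boundary of the data.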