Given an image,
SIFT features are detected at different scales by using a scale space representation implemented as an image pyramid.
The pyramid levels are obtained by Gaussian smoothing and sub-sampling of the image resolution while interest points are selected as local extrema (min/max) in the scalespace.
These points (usually called keypoints) are extracted by applying a computable approximation of the Laplacian of Gaussian (LoG) following the same approach of the Hessian detector. In particular,
the SIFT algorithm approximates LoG by iteratively computing the difference between two nearby scales in the scale-space.
This idea is referred to as the Difference of Gaussians (DoG) approach. Once these keypoints are detected,
SIFT descriptors are computed at their locations in both image plane and scale-space. Each descriptor consists in a histogram of 128 elements,
obtained from a 16x16 pixels area around the corresponding keypoint.
The contribution of each pixel is obtained by calculating image gradient magnitude and direction in scale-space and the histogram is computed as the local statistics of gradient