4. SLAM from features
Once the 3D data have been modeled, we detect 2D features and place them in 3D using the depth information provided by the cameras. In the case of the SR4000, features are extracted from the amplitude of the infrared image, which is similar to a grayscale image but lies in the infrared range of the spectrum. For the Kinect, we directly use the 2D image it provides. In this work we use SIFT features [32]; nevertheless, the method can be applied with any other 2D feature detector.
The SIFT method is widely used in computer vision systems to detect and describe features in an image. It performs a local pixel appearance analysis at different scales and obtains a descriptor for each feature that can be used for different tasks, such as object recognition. SIFT features are designed to be invariant to image scale and rotation. There are currently a number of other feature detectors and descriptors, such as SURF [6], but determining the most efficient one is beyond the scope of this work. For a good comparison of different features, see [22,39].
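As an illustration, SIFT features can be extracted from the amplitude (or grayscale) image with OpenCV. The following is a minimal sketch rather than the exact pipeline used in our system; the file name is only a placeholder.

```python
import cv2

# Load the amplitude (SR4000) or grayscale (Kinect) image; the path is a placeholder.
img = cv2.imread("frame_amplitude.png", cv2.IMREAD_GRAYSCALE)

# Detect SIFT keypoints and compute their 128-dimensional descriptors.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

print(len(keypoints), "features detected")
```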
Once the features have been detected, they are attached either to the GNG structure or directly to the corresponding 3D point. In the case of the GNG, the 2D coordinates of a SIFT feature are first projected to 3D using the Kinect library; the feature is then attached to the closest node of the GNG structure, using the Euclidean distance between the 3D SIFT coordinates and the 3D node coordinates. The SR4000 camera has the additional advantage of providing a confidence value, which can be used to discard features that cannot be trusted. This represents an improvement over stereo systems and the Kinect camera, since accuracy is enhanced by removing erroneous points.
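A minimal sketch of this step is given below, assuming a standard pinhole back-projection (the intrinsics fx, fy, cx, cy are illustrative, not the actual calibration) and a simple nearest-node search over the GNG node positions.

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth z to 3D camera coordinates."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def attach_to_gng(points_3d, descriptors, gng_nodes):
    """Attach each 3D feature to the closest GNG node (Euclidean distance).

    points_3d:   (K, 3) 3D coordinates of the SIFT features
    descriptors: (K, 128) SIFT descriptors
    gng_nodes:   (M, 3) 3D coordinates of the GNG nodes
    Returns a list of (node_index, descriptor) pairs.
    """
    attached = []
    for p, d in zip(points_3d, descriptors):
        node = int(np.argmin(np.linalg.norm(gng_nodes - p, axis=1)))
        attached.append((node, d))
    return attached
```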
To calculate the robot egomotion, we need a method that finds the 3D transformation between two consecutive poses. We present two different methods for obtaining this egomotion, both based on the matching information provided by the features: feature descriptors are used to determine the matches between two consecutive poses. We now briefly describe the two methods.
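For instance, matches between two consecutive frames can be obtained by comparing SIFT descriptors with a brute-force matcher. The ratio test in the sketch below is a common filtering heuristic and an assumption on our part, not necessarily the exact criterion used in the original system.

```python
import cv2

def match_descriptors(desc_prev, desc_curr, ratio=0.8):
    """Match SIFT descriptors from two consecutive frames with a brute-force matcher.

    The ratio-test threshold is illustrative, not a parameter of the original system.
    """
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(desc_prev, desc_curr, k=2)
    # Keep only distinctive matches (Lowe's ratio test).
    return [m for m, n in knn if m.distance < ratio * n.distance]
```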
The first is based on the RANSAC algorithm [16]. It is an iterative method that estimates the parameters of a mathematical model from a set of observed data that contains outliers. In our case, we look for the 3D Euclidean transformation (our model) that best explains the data (the matches between 3D features). At each iteration of the algorithm, a subset of data elements (matches) is randomly selected. These elements are treated as inliers and used to compute a model (a 3D Euclidean transformation). All other data are then tested against the computed model and included as inliers if their error is below a threshold. If the estimated model is reasonably good (i.e., its error is low enough and it has enough matches), it is kept as a candidate solution. This process is repeated a number of times, after which the best solution found is returned.
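A compact sketch of such a RANSAC loop over 3D-3D matches is shown below. The rigid transformation for each sample is computed with the standard SVD-based (Kabsch) solution; the number of iterations, inlier threshold, and minimum inlier count are illustrative parameters, not those of the original implementation.

```python
import numpy as np

def rigid_transform(P, Q):
    """Least-squares rigid transform (R, t) such that R @ P[i] + t ~ Q[i] (Kabsch/SVD)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[2, :] *= -1
        R = Vt.T @ U.T
    return R, cq - R @ cp

def ransac_rigid(P, Q, iters=500, thresh=0.05, min_inliers=10, seed=0):
    """RANSAC estimate of the 3D Euclidean transform from matched point sets P -> Q."""
    rng = np.random.default_rng(seed)
    best, best_count = None, 0
    for _ in range(iters):
        idx = rng.choice(len(P), size=3, replace=False)   # minimal sample for a rigid transform
        R, t = rigid_transform(P[idx], Q[idx])
        err = np.linalg.norm(P @ R.T + t - Q, axis=1)     # residual per match
        inliers = err < thresh
        if inliers.sum() >= max(min_inliers, best_count + 1):
            best_count = inliers.sum()
            best = rigid_transform(P[inliers], Q[inliers])  # refit on all inliers
    return best
```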
The second method is based on the ICP algorithm [7,37,43]. ICP is used to register two 3D point sets, but it cannot find a good alignment in the presence of outliers; a survey of ICP-based methods can be found in [38]. ICP does not give good results for large movements between frames, because these produce many outliers. Using features like SIFT, together with the additional information carried by their descriptors, which are robust to brightness and point-of-view changes, is sufficient for this task. Hence, we use the descriptors to find matches, instead of the Euclidean distance used in the original ICP. We also select features close to the camera, because greater distances result in greater 3D errors: only features with a Z distance below a threshold are considered for matching between two consecutive sets. The expected movement between two consecutive frames is taken into account so that the selected sets of features overlap and contain enough features for matching. If the movement is bounded by, for example, 1 meter, we select features from 1 to 2 meters in the first frame, and from 0 to 1 meter in the second. The depth-band selection is sketched after this paragraph.
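The sketch below follows the 1-meter example in the text; the band limits and variable names are only illustrative.

```python
import numpy as np

def select_depth_band(points_3d, descriptors, z_min, z_max):
    """Keep only features whose Z (depth) coordinate lies inside [z_min, z_max]."""
    mask = (points_3d[:, 2] >= z_min) & (points_3d[:, 2] <= z_max)
    return points_3d[mask], descriptors[mask]

# Example following the text: motion bounded by about 1 m, so features taken
# from 1-2 m in the first frame should reappear at 0-1 m in the second.
# P_prev, D_prev, P_curr, D_curr are the 3D points and descriptors of the two frames.
# P1, D1 = select_depth_band(P_prev, D_prev, 1.0, 2.0)
# P2, D2 = select_depth_band(P_curr, D_curr, 0.0, 1.0)
```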